The data science process

Like any area of scientific inquiry, data science involves obtaining and analyzing data to test hypotheses. The process typically consists of the following steps:

  1. Framing the problem. A good understanding of the background discipline is essential to formulate the problem to be solved and define the hypothesis to be tested. This will also give you clues on what kind of data you need to collect.
  2. Data collection. Where does the data reside and how is it to be accessed? You may need to obtain data from multiple sources and in different formats, or to download it from sites that require a specific query syntax or the use of APIs.
  3. Data cleaning. Most likely, the data collected will need to be transformed into a format suitable for preliminary analysis. This step is often referred to as data wrangling or data munging.
  4. Exploratory analysis. This is a critical step to extract preliminary features and trends. It allows you to detect mistakes, determine appropriate models, and explore relationships between variables. Exploratory analyses can be graphical (plots) or non-graphical (summary statistics). Useful analyses include calculating measures of central tendency (such as the mean and median), spread, skewness, and kurtosis; plotting time series to find patterns; drawing scatterplots to look for correlation; clustering; and dimension reduction. The hypothesis may need to be reformulated at this step.
  5. Modeling and analysis. Machine learning (ML) algorithms differ in the kind and quantity of data they require, the kind of answers they can output, and many other aspects. A good understanding of how they work is essential if you decide to apply ML. In many cases, simpler statistical methods suffice for testing the hypothesis.
  6. Interpretation and communication of results. The data scientist needs to understand the field where the data was generated and the mathematical operations performed during the analysis in order to interpret the results and draw meaningful conclusions. There are many ways to communicate the results depending on the audience, from building a new application or process, to publishing an article in a research journal.
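
As a rough illustration of steps 3 and 4, the sketch below cleans a small list of raw values and then computes the non-graphical summary statistics mentioned above. It is a minimal example using only the Python standard library, not a prescribed workflow: the data and the names `raw_readings`, `clean`, and `summarize` are made up for illustration, and the skewness and kurtosis use the population (biased) formulas.

```python
import statistics

# Hypothetical raw readings; None and "n/a" stand in for missing values.
raw_readings = ["2.0", "4.0", None, "4.0", "n/a", "4.0", "5.0", "5.0", "7.0", "9.0"]


def clean(values):
    """Data cleaning: keep only entries that can be parsed as numbers."""
    cleaned = []
    for value in values:
        try:
            cleaned.append(float(value))
        except (TypeError, ValueError):
            continue  # skip missing or malformed entries
    return cleaned


def summarize(values):
    """Exploratory analysis: non-graphical summary statistics."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Standardized third and fourth central moments (population versions).
    skewness = sum((x - mean) ** 3 for x in values) / (n * std ** 3)
    kurtosis = sum((x - mean) ** 4 for x in values) / (n * std ** 4)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "skewness": skewness,
        "excess_kurtosis": kurtosis - 3.0,  # 0 for a normal distribution
    }


readings = clean(raw_readings)
summary = summarize(readings)
print(f"kept {len(readings)} of {len(raw_readings)} readings")
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In practice, libraries such as pandas or SciPy provide these statistics directly. Writing them out here only serves to show what each quantity measures. Note that silently dropping malformed entries is itself a modeling decision that should be checked against the problem framed in step 1.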

See Essential skills for data science for an overview of the skills needed in each step of the data science process, and courses you can take at McGill to develop the expertise.
