What Is Data Science?

The goal of Data Science is to derive high-level information and knowledge from data. It shares this objective with Statistics, Data Processing and other disciplines, but two aspects set Data Science apart. The first is the type of data: Data Science’s techniques are usually deployed on large volumes of structured, semi-structured and unstructured data, what has been termed “Big Data”. The second is that Data Science takes an interdisciplinary approach, applying methods from Mathematics, Statistics and Computer Science to a practical area of study, such as Neurology, Genetics, Literary Studies or Marketing. Such an approach requires individuals and teams with expertise in these diverse fields. The sections below explore each of these aspects in turn.

Big Data

The quantity of digitized data is growing at an exponential rate. A 2014 study by the International Data Corporation (IDC) estimated that the amount of data is more than doubling every two years. From 2013 to 2020, it will have grown by a factor of 10.

Illustration of how the quantity of data will grow between 2013 and 2020, according to IDC, 2014.

Most data is transient in nature. For example, the readout from an implanted medical device will usually be within the normal range and is not informative a day later. But again according to IDC, as of 2013, about 22% of data would have some further use if it were analyzed. In our example, it would be very valuable to find commonalities between readouts from people a day before they suffer a heart attack, and thus create an early warning system. Yet IDC found that only about 5% of the potentially useful data actually is analyzed. Why such a small proportion? The growth in volumes has been accompanied by a rapid change in the nature of data. Traditional data processing uses data that was structured so that a machine could efficiently process and store it. Let us use a payroll application as a simple example:

  • The data is highly structured. The payroll record is typically a set of pre-defined fields in a specific format shared by all such records. To avoid redundancy, information that changes from one pay period to the next (like hours worked) is stored in one table, while more stable data (employee name and address) is stored in another. The data is thus designed to facilitate machine readability and automated processing.
  • The method to calculate the output (the pay cheque) is known in advance. A person could easily make the same calculation and arrive at the same outcome. Automation simply makes processing faster.

In contrast, the growth we are witnessing is mostly in semi- or unstructured data. Consider the 19,000 genes in a human genome [1], the 6,000 Twitter tweets posted every second [2] or the 400 hours of video footage uploaded to YouTube every minute [3]. None of these forms of content were designed to be watched, read or processed by a computer. Nor is it a simple matter to specify in advance a set of rules to arrive at a particular output. Indeed, whereas a payroll system’s purpose is clear, stable and understood before the data is collected, this new, less structured data might only be used long after it is collected, in myriad ways we cannot currently envision. Data Science takes an incremental approach to hypothesizing an underlying structure in the data, making inferences, and then testing if the inferences hold.

Digitized data was once formatted specifically for machine processing and storage. Today, many forms of digitized data, like e-books, blog posts, tweets and online videos, were never intended to be processed by a computer.

The Five V's of Big Data

In 2001, analyst Doug Laney used “Three V’s” to describe what we now call Big Data. Over time, authors have expanded that list by adding a favourite V of their own. Here are five frequently mentioned V’s:

  • Volume: Data Science methods are typically applied to datasets too large to be stored in the memory of a single computer. Using larger volumes of data helps reduce two common types of inferential error: it allows the model to rely on fewer assumptions, and it increases the likelihood that the sample accurately represents the population from which it was drawn. [4] A short simulation of this second point follows the list.
  • Variety: New data is coming from a growing number of sources and in many forms, including some we would not have imagined at the beginning of the century: tweets, cellphone metadata, and readouts from embedded devices being just three examples.
  • Velocity: Videos, cellphone calls, and readings from medical devices are examples of data that arrive in real-time and/or streams. Analyzing the data must therefore take the time dimension into account.
  • Veracity: Not all data can be trusted. As the quantity of data grows and comes from more diverse sources, the possibility of error increases as well. According to IBM, “one in three business leaders don’t trust the information they use to make decisions.” [5]
  • Value: If 22% of data is potentially useful, then at least 78% is not. The sheer quantity of haystacks makes finding the needle one of the great challenges of Data Science.
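
As a small illustration of the Volume point, the sketch below (not drawn from any of the sources cited here; all numbers are made up) draws ever-larger samples from a simulated population and prints how far each sample’s mean falls from the true population mean. The gap tends to shrink as the sample grows.

    # Toy demonstration of the "Volume" point: larger samples track the
    # population more closely. All numbers here are made up for illustration.
    import random
    import statistics

    random.seed(42)
    population = [random.gauss(50, 15) for _ in range(100_000)]  # hypothetical population
    true_mean = statistics.mean(population)

    for n in (10, 100, 1_000, 10_000):
        sample = random.sample(population, n)
        error = abs(statistics.mean(sample) - true_mean)
        print(f"sample size {n:>6}: gap from population mean = {error:.3f}")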

Data Protection and Privacy

Many of us post photos and videos on social media. But a great deal more data is collected about us, often without our explicit consent or even our knowledge: our movements as tracked by cellphones and GPS devices, our purchases, and even our genes. How comfortable are any of us with the idea that marketers or governments can use such data for their own ends? This concern is central to Data Science. IDC (2014) estimated that only about half of the data that requires protection actually is protected. How can we ensure that appropriate safeguards are developed for the other half? This question becomes all the more pressing when we consider that much of Big Data streams in real time, and so the protections must be applied in real time as well.

Essential Skills for a Data Scientist

The above sketch of Data Science shows that a practitioner requires a special mix of expertise and skills. The required expertise includes a background in the following disciplines [6]:

Mathematics and Statistics: Deriving inferences from data is at the very heart of Statistics. A good grounding in probability, Bayesian statistics and hypothesis testing is essential. Modelling data also requires a solid understanding of the difference between correlation and causation.
Artificial Intelligence (AI), including Machine Learning, Natural Language Processing and Computer Vision: Because traditional computational approaches rely on pre-specified rules, they are not always suited to the challenges that Data Science faces. Newer approaches create agents that learn from data rather than applying fixed rules.
Distributed Processing: Since Big Data is typically too large to be stored or processed on one machine, several strategies allow a processing task to be split over several machines. MapReduce is one such strategy: the programmer defines a function map() for each computer to execute, and a function reduce() to consolidate the results; a minimal single-machine sketch of this pattern follows the list of skills below. Data Science job listings often ask for experience with Hadoop, a framework that implements MapReduce, as well as alternative paradigms better suited to algorithms that iterate through the data several times.
Domain knowledge of the specific area to which the problem pertains: Solving problems with real-world applications requires knowledge of the application area. For example, genome sequencing requires a solid background in Biology and Genetics, while Natural Language Processing may require a solid grounding in Linguistics and Literary Studies. A glance at the other pages on this site, particularly People and Research, will demonstrate the rich variety of research domains that are benefitting from Data Science.
Problem Formulation: Like any science, Data Science involves proposing a hypothesis that explains a phenomenon, and then testing that hypothesis on data. These two steps require finding commonalities across what appear to be very different issues.
Data Presentation and Visualization: Once the data scientists have obtained their new insights, it is important that these insights be conveyed to an audience that might not be familiar with more than one of the above fields. The ability to communicate findings in a succinct and readily-comprehensible way is therefore a crucial skill.
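
As mentioned under Distributed Processing above, here is a minimal, single-machine sketch of the MapReduce pattern, using word counting as the task. It is an illustration only: in a framework such as Hadoop the map and reduce steps would run in parallel across many machines, and the function names (map_words, reduce_counts) and toy input are hypothetical.

    # Toy word count in the MapReduce style, run on one machine.
    from collections import defaultdict

    def map_words(document):
        # map(): emit a (key, value) pair for every word in one document
        return [(word.lower(), 1) for word in document.split()]

    def reduce_counts(pairs):
        # reduce(): consolidate all the values emitted for the same key
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    documents = ["Big Data is big", "Data Science uses Big Data"]  # toy input
    mapped = [pair for doc in documents for pair in map_words(doc)]
    print(reduce_counts(mapped))  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}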

Such diversity requires individuals versed in more than a single area of expertise. It also means that Data Science will typically be carried out by a multi-disciplinary team, not a lone specialist.

The Data Science Process

Data Science follows the same steps as any other scientific enquiry: a researcher formulates a hypothesis, collects data, and then analyses the data to see if it supports the hypothesis. Whether or not the data supports the hypothesis, the hypothesis can be tweaked and the process started again. Not only can the entire cycle repeat, but each step is rarely completed in a single pass: at each step it may be necessary to turn back to previous steps and revisit some actions or decisions. For example, once the data has been collected and features of interest chosen (Steps 2 and 3), the initial hypothesis may need to be refined. The discussion below outlines the essence of each step, emphasizing the aspects of this scientific methodology that are most particular to Data Science:

The five steps in a typical Data Science project. Dashed lines indicate optional paths. Diagram based on EMC Education Services 2015, University of San Diego 2010 and Gualtieri 2013.

  1. Explore the data: The scientific team looks at the data on hand and tries to formulate the research question. Visualization tools can help form an initial picture of the domain of interest. Once the question and hypothesis are decided upon, it becomes possible to identify what data would be needed and whether the available data is sufficient. The team and stakeholders should also agree on how they will know whether the project is a success, as there is bound to be some ambiguity in the final results.
  2. Prepare the data: The team collects the complete data set, both from the resources on hand and new sources if needed. This step will inevitably involve some clean-up or “munging” to make the data usable, as real-world data is often less tidy than we initially imagine. This single step can take as much as half of the total effort for the entire project [7] and so should not be underestimated, though it often is.
  3. Choose the model(s) and training approach: With the data in hand, the team can explore which features of the data seem to be most useful. The team will also choose the models that are best suited to the question or data. Are we categorizing data (e.g. likely or unlikely to vote) or predicting a value (e.g. an optimal insurance premium)? Does the existing data include the correct answer or not? For example, if we are trying to detect a case of fraud, do our datasets tell us which data belong to fraudulent cases and which do not, or are we instead trying to identify potentially suspicious activity? Machine learning algorithms differ by the kind and quantity of data they require, what kind of answers they can output, and many other aspects.
  4. Build the model: In this phase, an algorithm is deployed and the model is actually created and trained using the data. Usually, a portion of the data will be reserved for testing, allowing the research team to determine whether the model’s performance will generalize well to new data; a brief sketch of this step follows the list. The team must also ensure that the model’s errors can be tolerated. If the purpose of the model is to classify tumours as cancerous or benign, does the model find most cancerous tumours but also flag many benign tumours? If so, is the latter within an acceptable range?
  5. Communicate the results: The team can now say whether or not the initial hypothesis is supported. In the process of answering the initial research question, the team has no doubt made some unexpected discoveries that should be disseminated as well. As the audience will likely include people who know very little about Data Science, it is critical that the results be put in non-technical language. Data visualization tools can be invaluable here.
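
To make Steps 3 and 4 concrete, the sketch below shows one possible way to reserve a test portion of the data, train a simple classifier, and inspect the kinds of errors it makes. The specific choices, the scikit-learn library, its bundled breast-cancer dataset and a logistic regression model, are illustrative assumptions, not a method prescribed by the sources above.

    # Minimal train/test illustration; the dataset and model are illustrative choices.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    X, y = load_breast_cancer(return_X_y=True)  # tumour measurements with benign/malignant labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = LogisticRegression(max_iter=10_000)  # one of many possible model choices
    model.fit(X_train, y_train)                  # "build the model" on the training portion

    # Evaluate on data the model has never seen, and ask whether the error types
    # are tolerable: how many cancerous tumours were missed, and how many benign
    # tumours were incorrectly flagged?
    print(confusion_matrix(y_test, model.predict(X_test)))

The confusion matrix printed at the end is the Step 4 question in numerical form: it tallies how often cancerous tumours were missed and how often benign tumours were flagged.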

Notes

[1] http://hmg.oxfordjournals.org/content/early/2014/07/01/hmg.ddu309.full?s...

[2] http://www.internetlivestats.com/twitter-statistics

[3] http://www.faqtube.tv/vidcon-2015-update-youtube-statistics/#more-919

[4] Dhar, 2013.

[5] IBM (n.d.)

[6] This list synthesizes similar lists in Dhar, 2013, EMC Education Services, 2015 and O'Neil & Schutt, 2013.

[7] EMC Education Services, 2015: 36.

Sources

V. Dhar (2013). Data Science and prediction. Communications of the ACM 56(12): 64-73. 

EMC Education Services (2015). Data Science and Big Data analytics. Toronto: John Wiley & Sons, Inc.

J. Gantz & D. Reinsel (2012). The digital universe in 2020: Big Data, bigger digital shadows, and biggest growth in the Far East.

M. Gualtieri (2013). Evaluating Big Data predictive analytics solutions.

IBM (n.d.). The four V's of Big Data.

C. O'Neil & R. Schutt (2013). Doing Data Science. Sebastopol, CA: O'Reilly Media, Inc.

V. Turner, J. Gantz, D. Reinsel & S. Minton (2014). The digital universe of opportunities: Rich data and the increasing value of the Internet of things.

University of San Diego (2010). Data management.