Additional Resources

Three Articles

Vasant Dhar's "Data Science and Prediction" is an excellent introduction.

McKinsey & Company published a study in 2011 that is still frequently cited. The most repeated statistic from this report is its prediction that, by 2018, the demand for workers with deep analytical skills would exceed supply by 140,000 to 190,000 people in the United States alone.

Forbes provides this brief but interesting History of Data Science.

Resources on Social Media

Here are a few researchers whose work may interest you.

Sebastian Thrun founded Google[x] and led the Google driverless car project. He is the CEO of Udacity, a MOOC (Massive Open Online Course) website, and a research professor at Stanford University. You can follow him on Twitter (@SebastianThrun).

Andrew Ng is a professor at Stanford and the Chief Scientist of the web-services company Baidu. He also co-founded Coursera, where you can find his courses on Machine Learning. Ng has done extensive work on deep learning. He founded and led the “Google Brain” project, which built a neural network with a billion connections that famously learned to recognize cats. His website includes many videos and papers. You can also follow him on Twitter (@AndrewYNg).

Lillian Pierson is the president of Data-Mania and the author of Data Science for Dummies. She is very active on Twitter (@BigDataGal), and she has a blog and a YouTube channel.

Cathy O'Neil is a data scientist and the author of the recent book Weapons of Math Destruction (see our news article). Her blog Mathbabe talks about data science and its misuses.

Open Source Tools

Programming Languages

  • Python is a general-purpose scripting language that is easy to learn and very widely used.
  • R is a language geared toward statistical computing.

Machine Learning Toolkits

  • NumPy and SciPy are packages that facilitate scientific computing in Python.
  • scikit-learn implements many machine learning algorithms in Python and handles tasks such as cross-validation and feature selection.
  • matplotlib is a Python library that provides functionality to visualize data, including a variety of plots and charts.
  • NLTK is a set of tools for Natural Language Processing in Python.
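
To give a feel for how these toolkits fit together, here is a minimal sketch that combines the feature selection and cross-validation mentioned above for scikit-learn. The choice of dataset (the built-in iris data), classifier (logistic regression), number of features kept (2), and number of folds (5) are all assumptions made for illustration, not recommendations from the resources listed here.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Load a small built-in dataset: 150 iris flowers, 4 numeric features each.
# X and y are returned as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Chain feature selection (keep the 2 highest-scoring features, an
# arbitrary choice for this sketch) with a simple classifier.
model = make_pipeline(
    SelectKBest(score_func=f_classif, k=2),
    LogisticRegression(max_iter=1000),
)

# 5-fold cross-validation: train on four fifths of the data,
# score on the remaining fifth, and repeat for each fold.
scores = cross_val_score(model, X, y, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

Putting the feature selector inside the pipeline means it is refit on each training fold, so the held-out fold does not influence which features are kept.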