UM Data Science Infrastructure & Consulting Resources


Introduction to Data Science: Courses and Course Materials

Introduction to Data Science: Podcasts

Videos about becoming a data scientist

Interviews with data scientists

Web sites for networking with other data scientists

Data science conferences and symposiums

Data science infographics

What is data science?

How to choose the best training path for becoming a data scientist

★ Careers in data science (as well as continuing education opportunities)

Machine Learning: Tutorials and resources for hackers using Python and GitHub

Machine Learning: Video Tutorials and Courses

Machine Learning: Podcasts

Machine Learning: Infographics on Pinterest

Data sets to practice machine learning with

Machine Learning: Packages

  • mlpy Machine Learning Python
  • Machine Learning Toolkit MILK
  • MDP a collection of supervised and unsupervised learning algorithms
  • pyBrain modular Machine Learning Library for Python
  • Caffe framework for convolutional neural network algorithms
  • Nolearn framework wrapping scikit neural networks
  • OverFeat Convolutional Network-based image features extractor and classifier
  • Hebel GPU-Accelerated Deep Learning Library in Python
  • neurolab simple and powerful Neural Network Library for Python. Contains basic neural networks, training algorithms, and a flexible framework for creating and exploring other networks
  • Pylearn2 and Theano deep learning libraries

Machine Learning: Deep Learning

Wikipedia Definition

Machine Learning: Alternatives to and Limitations of Machine Learning

  • Some data scientists believe that Probabilistic Computing will someday overtake and replace machine learning, because it lets data analysts with limited data science expertise quickly arrive at solutions that are easier to interpret. It is based on a Naive Bayes approach.
  • SIDE NOTE: I recently heard a fascinating story related to this approach. There was a head-to-head competition at MIT among three experts: one in Probabilistic Computing (Dr. Mansinghka, an advisor to Google DeepMind), one in Machine Learning, and one in Statistics. They were given two problem sets: one with a known solution and one with no known solution. The Probabilistic Computing expert delivered solutions to both after a few hours, and these were deemed significantly better than the ones the ML expert and the statistician produced after about a day and a half.
  • Article on the importance of taking a human-centered approach to machine learning
  • Facets, a tool that helps address the problem of not knowing your data well enough to implement machine learning “responsibly”
  • When NOT to use Deep Learning
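
For readers unfamiliar with the Naive Bayes approach mentioned in the first item above, here is a minimal sketch of a Naive Bayes text classifier in plain Python. It only illustrates the basic idea (class priors multiplied by per-word likelihoods, with add-one smoothing) and is not how any particular Probabilistic Computing system is implemented. The function names `train` and `predict` and the toy "spam"/"ham" data are assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count labels and per-label word frequencies.

    samples: list of (tokens, label) pairs, where tokens is a list of words.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-posterior score."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior for the class.
        score = math.log(count / total)
        # Add-one (Laplace) smoothing so unseen words do not zero out a class.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, for illustration only.
toy_samples = [
    (["cheap", "pills"], "spam"),
    (["cheap", "offer"], "spam"),
    (["meeting", "notes"], "ham"),
    (["project", "meeting"], "ham"),
]
toy_model = train(toy_samples)
```

With this toy model, `predict(toy_model, ["cheap", "offer"])` picks the "spam" label because both words appear only in the spam examples. The "naive" part is the assumption that words are independent given the label, which is what lets the score be a simple sum of per-word log-likelihoods.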

Machine Learning: Skill Sets Needed

Experimental design / Working with machine learning algorithms / Feature engineering / Prediction vs. explanation / Network analysis / Collaborative filtering / Coding machine learning algorithms on single machines and on clusters of machines / Amazon AWS / Working on problems with terabytes of data / Machine learning pipelines for petabyte-scale data / Algorithmic design / Parallel computing (with MapReduce)
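
As a rough illustration of the MapReduce style of parallel computing listed above, the following sketch runs the classic word-count example on a single machine in plain Python. It imitates the three phases (map, shuffle, reduce) that frameworks such as Hadoop distribute across a cluster; the function names here are hypothetical and do not come from any framework's API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big models", "data pipelines"]))
```

Because each call to `map_phase` depends on only one document and each call to `reduce_phase` on only one key, both phases can in principle be spread over many machines, which is the property that cluster frameworks exploit.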

Machine Learning: Potential Tools Needed

Python / Python libraries for linear algebra, plotting, machine learning: numpy, matplotlib, scikit-learn / GitHub for submitting project code / MapReduce / Hadoop / MrJob / Spark / Spark Core / data frames / Spark Shell / Spark Streaming / Spark SQL / MLlib

The Future of Big Data / ZDNET article

Potential Tools for Working with Big Data

Cloud / Distributed Storage / Ethereum Blockchain / Apache Spark / Docker / CouchDB / Apache Cassandra / OpenStack Swift / Apache Solr / BVLC Caffe / Nvidia Digits / Keras / IBM Watson / GATK

* Elizabeth Austic, Data Science Resources, GitHub Repository