Following up on the talk I just gave at PyCon 2014 in Montreal, I’ve explained parts of my presentation and provided a few additional clarifications. You can catch the talk at Pyvideo.org, my github repo PyCon2014 holds the sample code, and the slides are on SpeakerDeck.
Machine Learning (ML) Overview
Arthur Samuel defined machine learning as, “Field of study that gives computers the ability to learn without being explicitly programmed”. Its about applying algorithm(s) in a program to solve the problem you are faced with and address the type of data that you have. You create a model that will help conduct pattern matching and/or predict results. Then evaluate the model and iterate on it as needed to create the right type of solution for the problem.
Examples of ML in the real world include handwritten analysis which uses neural nets to read millions of mail regularly to sort and classify all the different variations in written addresses. Weather prediction, fraud detection, search, facial recognition, and so forth are all examples of machine learning in the wild.
There are several types of ML algorithms to choose from and apply to a problem and some are listed below. They are broken into categories to give an approach on how to think about applying them. When choosing an algorithm, its important to think about the goal/problem, the type of data available and the time and effort that you have to work on the solution.
A couple starting points to consider are whether the data is unsupervised or supervised. Supervised is whether you have actual data that represent the results you are targeting in order to train the model. Spam filters are built on actual data that have been labeled as spam while unsupervised data doesn’t have a clear picture of the result. For unsupervised learning, there will be questions about the data and you can run algorithms on it to see if patterns emerge that help tell a story. Unsupervised is a challenging type of approach and typically there isn’t necessarily a “right” answer for the solution.
In addition, if the data is continuous (e.g. height, weight) or categorical/discrete (e.g. male/female, Canadian/American) that helps determine the type of algorithm to apply. Basically its about whether the data has a set amount of units that can be defined or if the variations in the data are nearly infinite. These are some ways to evaluate what you have to help identify an approach to solve the problem.
Note, the algorithms categorization has been simplified a bit to help provide context, but some of the algorithms do cross the above boundaries (i.e. linear regression).
Once you have the data and an algorithmic approach, you can work on building a model. A model can be something as simple as an equation for a line (y=mx+b) or as complex as a neural net with many layers and nodes.
Linear Regression is a machine learning algorithm and a simple one to start with where you find the best fit line to represent observed data. In the talk, I showed two different examples of having observed data that exhibited some type of linear trend. There was a lot of noise (data was scattered around the graph), but there was enough of a trend to demo linear regression.
When building a model with linear regression, you want to find the most optimal slope (m) and intercept (b) based on the actual data. See algebra is actually applicable in the real world. This is a simple enough algorithm to calculate the model yourself, but its better to leverage tools like scikit-learn’s library to help you more efficiently calculate the best fit line. What you are calculating is a line that minimizes the distance between all the observed data points.
After generating a model, you should evaluate the performance and iterate to improve the model as needed if it is not performing as expected. For more info, I also explained linear regression in a previous post.
When we have a good model, you can take in new data and output predictions. Those predictions can feed into some type of data product or generate results for a report or visualization.
In my presentation, I used actual head size and brain weight data to build a model that predicts brain weight based on head size. Since the data was fairly small, this decreases the predictive power and increases the potential for error in the model. I went with this data since it was a demo, and I wanted to keep it simple. When graphed, the observed data was spread out which also indicated error and a lot of variance in the data. So it predicts weight with a good amount of variance in the model.
With the linear model I built, I was able to apply it so that I could feed it a head size (x) and it would calculate the predicted brain weight (y). Other models are more complex regarding the underlying math and application. Still you will something similar with other models in regards to making them and then feeding in new features/variables to generate some type of result.
To see the full code solution, checkout the github repository as noted above. The script is written a little differently from the slides because I created functions for each of the major steps. Also, there is an iPython notebook that shows some of the drafts I worked through to build out the code for the presentation
The python stack is becoming pretty popular for scientific computing because of the well supported toolsets. Below is a list of key tools to start learning if you want to work with ML. There are many other python libraries out there for more nuanced needs in the space as well as other stack packages to explore (R, Java, Julia). If you are trying to figure out where to start, here are my recommendation:
- Scikit-Learn = machine learning algorithms
- Pandas = dataframe tool
- NumPy = matrix manipulation tool
- SciPy = stats models
- Matplotlib = visualization
In order to work with ML algorithms and problems, its important to build out your skill set regarding the following:
- Statistics (probability, inferential, descriptive)
- Linear Algebra (vectors & matrices)
- Data Analysis (intuition)
- SQL, Python, R, Java, Scala (programming)
- Databases & APIs (get data)
And of course, the next question is where do I go from here? Below is a beginning list of resources to get you started. I highly recommend Andrew Ng’s class and a couple of links are to sites with more recommendations on what to checkout next:
- Andrew Ng’s Machine Learning on Coursera
- Khan Academy (linear algebra and stats)
- Open Source Data Science Masters
- StackOverflow, Data Tau, Kaggle
- Machine Learning: A Love Story
- Collective Intelligence – Toby Segaran
- Pattern Recognition & Machine Learning – Christopher Bishop
- Think Stats – Allen Downey
- Tom Mitchell
One point to note from this list and I stressed this in the talk, seek out mentors. They are out there and willing to help. You have to put it out there what you want to learn and then be aware when someone offers to help. Also follow-up. Don’t stalk the person but reach out to see if they will make a plan to meet you. They may only have an hour or they may give you more time than you expect. Just ask and if you don’t get a good response or have a hard time understanding what they share, don’t stop there. Keep seeking out mentors. They are an invaluable resource to get you much farther faster.
Last Point to Note
ML is not the solution for everything and many times can be overkill. You have to look at the problem you are working on to determine what makes the most sense in regards to your solution and how much data you have available. Plus, I highly recommend looking for the simple solution first before reaching for something more complex and time-consuming. Sometimes regex is the right answer and there is nothing wrong with that. As mentioned to figure out an approach, its good to understand the problem, the data, the amount of data you have and timing to turn the solution around.
Good luck in your ML pursuit.
These are the main references I used in putting together my talk and post.
- “Analyzing the Analyzers” – Harlan Harris, Sean Murphy, Marck Vaisman
- “Doing Data Science” – Rachel Schutt & Cathy O’Neil
- “Collective Intelligence” – Toby Segaran
- “Some Useful Machine Learning Libraries” (blog)
- University GPA Linear Regression Example
- Scikit-Learn (esp. linear regression)
- Mozy Blog