PyCon 2015: Neural Nets for Newbies

The ideas and methods in neural nets (NNs) have been around for a long time, but in the last decade plus, we are finally starting to reap significant benefits, and this is just the beginning. This post provides an overview of my recent PyCon talk in Montreal, which is a neural net primer of sorts. The video is below, my slides are on SpeakerDeck, and I have a repo on GitHub named Neural Nets for Newbies.

There is too much to cover to fully explain neural nets here; thus, the post and the talk provide a framework to start to understand them. If you want to learn more, there are plenty of resources to dive into, some of which are listed in my deck.

What are they?
Machine learning is a set of algorithms for classification and prediction, and artificial neural nets are part of the machine learning space. At its core, a neural net is an algorithm, meaning an equation that helps abstract and find patterns in data. Technically, it’s a combination of equations.

The structure is modeled after our brains. Before you get all excited about robots that can think like us (side note: that idea has been around since BC), the reality is that we still don’t fully understand how the human brain functions. Neural nets only loosely mimic brain functionality. For the enthusiasts out there, yes, there are many researchers focused on creating a closer biological representation that acts like our brain. The bottom line is we aren’t there yet.

The algorithm, structure and many of the ideas around neural net functionality have been around for a while; several of them date back to the 1950s. Neural nets have been applied for commercial solutions as far back as 1959 (reducing phone line echoes), but we really haven’t seen significant value until recently. Key reasons are that our computational power (computer processing speed and memory capabilities) and access to useful data (amount of stored data) have significantly improved in the last decade alone.

Why should I care?
Because NNs have achieved technical advancement in areas like:

  • Natural Language Processing (Search & Sentiment)
  • Speech Recognition (Siri)
  • Computer Vision & Facial Recognition (Automatic Image Tagging)
  • Robotics (Automated Car)
  • Recommender Systems (Amazon)
  • Ad Placement

Some of you may roll your eyes at these advancements and complain about how Siri is limited in interactions. Need I remind you that we weren’t talking to our machines at the beginning of this century (or at least it wasn’t common). Hell, we didn’t have iPods at the beginning of this century if you remember what they are. I too fall into the sci-fi trap where I’ve seen it or read about it and so when we actually experience real advancements, it seems so boring and behind the times. Yeah, get over that.

All the areas I mentioned above still have plenty of room for growth, and there are definitely other areas I haven’t listed, especially in scientific fields. One of the reasons neural nets have had such impressive impact is their way of handling more data, especially data that has layers of complexity. This doesn’t mean that neural nets should be used for all problems. They are overkill in many situations. I cannot stress enough that every problem is not a nail for the NN hammer.

If you have a good problem and want to apply NNs, it’s important to understand how they work.

Ok, so how do they work?
Out of the gate, if you want to get serious about applying NNs, then you will need to embrace math no matter how much you don’t like it. Below I’ve given you some fundamentals around the math and the structure to get you started.

Basic Structure
Our brains are made up of neurons and synapses, and based on our interactions, certain neurons will fire and send signals to other neurons for data processing/interpretation. There is much more complex stuff going on than just that in our brains, but at a high-level that expresses the structure the neural net models.

NNs at a minimum have three layers: input, hidden, output.

  • Input = data
    • Data that is broken up into consumable information
    • Data can be pre-processed or raw
    • Bias and noise are applied sometimes
  • Hidden = processing units (aka, does math)
    • Made up of neurons
    • A neuron determines if it will be active (math under equation section)
    • Typically there are multiple neurons in a hidden layer (can be thousands or even billions depending on the data used and objective)
  • Output = results
    • One node per class; a net can have just one output node or many
    • A net to classify dogs or cats in a picture has two output nodes, one for each class
    • A net to classify handwritten digits between 0-9 has ten output nodes

You can have more than one hidden layer in a neural net, and when you start adding hidden layers, they trade off as inputs and outputs based on where they are in the structure.

Basic Equation
Each neuron represents an equation, and it takes in a set of inputs, multiplies weights, combines the data and then applies an activation function to determine if the neuron is active. A neuron is known as a processing unit because it computes the data to determine its response.

  • Inputs = input layer data in numerical format
  • Weights = coefficients (also known as theta)
    • Specialize each neuron to handle the problem (dataset) you are working with
    • Can initialize randomly
    • One way to initialize is to create a distribution of the existing data set and randomly sample that distribution
    • Often weights are represented between -1 to 1
  • Bias = can be included as an input or used as a threshold to compare data after the activation function is applied
  • Activation Function = data transformation to determine if the neuron will send a signal
    • Also known as the transfer function
    • There are many different equations that can be used, and it depends on the problem and data you are working with
    • Example equations: sigmoid/logistic, step/binary threshold, linear, rectified linear (combines binary threshold & linear), …
  • Output(s) = each node results in a binary, percentage or number range
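
To make the pieces above concrete, here is a minimal sketch of a single neuron in plain Python. The input values, weights and bias are made up for illustration (real weights are learned during training), and the function names are my own rather than anything from a particular package.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def step(z):
    """Binary threshold activation: the neuron is either on (1) or off (0)."""
    return 1.0 if z >= 0.0 else 0.0

def linear(z):
    """Linear activation: passes the weighted sum through unchanged."""
    return z

def relu(z):
    """Rectified linear activation: zero below the threshold, linear above it."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias, activation):
    """Multiply inputs by weights, add them up with the bias, then activate."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Illustrative values only.
inputs = [0.5, 0.1, 0.9]
weights = [0.4, -0.6, 0.2]
bias = 0.1

for activation in (sigmoid, step, linear, relu):
    print(activation.__name__, neuron_output(inputs, weights, bias, activation))
```

Swapping the activation function changes the kind of output the neuron produces (a value between 0 and 1, a binary on/off, or an unbounded number), which is why the choice depends on the problem.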


Each neuron is unique from other neurons in a hidden layer based on the weights applied. They can also be unique in the inputs and outputs. There are many hyperparameters that you can tweak for one single neuron let alone the whole structure to improve its performance. What makes neural nets powerful is the combination of linear with nonlinear functions in the equation.

When applying a neural net, an effort is needed to optimize the model, so it produces the results you are targeting.

The breakthroughs in neural nets are largely in the area of supervised learning. Supervised learning means you have a dataset labeled with the results you expect. The data is used to train the model so you can make sure it functions as needed. Cross validation is a technique typically used in supervised learning where you split the dataset into a training set to build the model and a test set for validation. Note, there are areas in neural net research that explore unlabeled data, but that is too much to cover in this post.
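
As a rough sketch of the split described above, the snippet below shuffles a small labeled dataset and holds out a fraction for testing. The toy data, the 20% test fraction and the function name are assumptions for illustration. Note this is a single holdout split; k-fold cross validation repeats the idea over several different splits.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the labeled examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled examples become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled dataset: (features, label) pairs.
labeled = [([i, i + 1], i % 2) for i in range(10)]

train_set, test_set = train_test_split(labeled)
print(len(train_set), len(test_set))
```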

In order to optimize, you start out with a structure and probably randomized weights on each neuron in the hidden layer(s). You’ll run your labeled data through the structure and come out with results at the end. Then you compare those results to the real labels using a loss function to help define the error value. The loss function will transform the comparison, so it becomes a type of compass when going back to optimize the weights on each neuron.

The optimization method (aka back propagation or backprop) is a way of taking the derivative of the loss function and applying it to the weights throughout the model. This method can change all weights on every neuron and because of the way the method works, it does not change the weights equally. You want shifts that vary across weights because each neuron is unique.

  • Error = difference between NN results and the real labels
  • Loss Function = calculates the error  (also referred to as cost function)
    • There are many different equations that are used, and it depends on the problem and data you are working with
    • Example equations: mean squared error, negative log likelihood, cross entropy, hinge, …
  • Regularization = penalty term added to the loss function to prevent overfitting
  • Optimization Method = learning method to tune weights
    • There are many different equations that are used, and it depends on the problem and data you are working with
    • Example equations: stochastic gradient descent, Adagrad (J Duchi), Adadelta (M Zeiler), RMSprop (T. Tieleman), …
  • Learning Rate = size of how much to change the weights each time and sometimes part of optimization algorithms
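
Two of the loss functions named above can be sketched in a few lines of plain Python. The predictions and labels are made-up numbers, and the small `eps` clamp is my own addition to avoid taking the log of zero.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and labels."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(predictions, targets, eps=1e-12):
    """Average negative log likelihood of the true 0/1 labels."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(targets)

targets = [1, 0, 1]
good_predictions = [0.9, 0.1, 0.8]  # close to the labels
bad_predictions = [0.2, 0.7, 0.4]   # far from the labels

print(mean_squared_error(good_predictions, targets),
      mean_squared_error(bad_predictions, targets))
print(binary_cross_entropy(good_predictions, targets),
      binary_cross_entropy(bad_predictions, targets))
```

Both functions return a smaller number for the predictions that sit closer to the labels, which is what lets the loss act as a compass during optimization.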

Backprop in essence wiggles (to quote Karpathy) the weights a little each time you run the data through the model during training. You keep running the data through and adjusting the weights until the error stops changing. Hopefully it’s as low as you need it to be for the problem. And if it’s not, you may want to investigate other model structure modifications.
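
Here is a minimal sketch of that wiggling on the smallest possible case: one sigmoid neuron learning logical OR with stochastic gradient descent. With a single neuron there is no hidden layer to propagate back through, so this only shows the weight update that backprop generalizes. The dataset, learning rate, epoch count and function names are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias):
    """Forward pass: weighted sum plus bias, then sigmoid."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_loss(data, weights, bias):
    """Average cross entropy between predictions and the real labels."""
    total = 0.0
    for x, y in data:
        p = predict(x, weights, bias)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)

def train(data, learning_rate=0.5, epochs=1000, seed=0):
    """Start from random weights and nudge them after every example."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in data[0][0]]
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            # For sigmoid + cross entropy, d(loss)/d(weighted sum) = p - y.
            error = predict(x, weights, bias) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

# Labeled data for logical OR: (inputs, label).
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

initial = train(data, epochs=0)  # same random start, no updates
trained = train(data)
print(mean_loss(data, *initial), mean_loss(data, *trained))
```

Each weight moves by a different amount because its update is scaled by its own input, which matches the point above that the shifts are not equal across weights.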

Note reducing the error rate is a common model objective but not always the objective. For the sake of simplicity, that’s our focus right now.

Validation / Testing
Once you’ve stopped training your model, you can run the test data set through it to see how it performs. If the error rate is horrible, then you may have overfit, or there could be a number of other issues to consider. Error rate and other standard validation approaches can be used to check how your model is performing.
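
A quick sketch of the error rate check: count how often the predicted label disagrees with the real one. The label lists below are invented to show the pattern to watch for, where the error is low on training data but much higher on test data.

```python
def error_rate(predicted_labels, true_labels):
    """Fraction of examples where the prediction does not match the label."""
    wrong = sum(1 for p, t in zip(predicted_labels, true_labels) if p != t)
    return wrong / len(true_labels)

# Hypothetical predictions versus real labels.
train_error = error_rate([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 1, 1, 0, 0, 1, 0])
test_error = error_rate([1, 0, 1, 1], [0, 0, 0, 1])

# A large gap between the two is a common sign of overfitting.
print(train_error, test_error)
```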

Structure Types
I’ve given you a basic structure on how the neural net connects, but it’s important to understand there are variations in that structure that are better for different types of problems. Example types include:

  • Feed Forward (FFN) =  basic structure and passes data forward through the structure in the order of connections
    • There are no loops
    • Data moves in one direction
    • Key Applications: financial prediction, image compression, medical diagnosis and protein structure prediction
  • Recurrent (RNN) = depending on the timing the neuron fires, data can be looped back earlier in the net structure as inputs
    • Data can become input to the same neuron, other neurons in that layer or neurons in a hidden layer prior to that layer
    • Operates on linear progression of time
    • Good for supervised learning in discrete time settings
    • Key Applications: sentiment analysis, speech recognition, NLP
  • Convolutional (CNN) = uses a mixture of hidden layers types (e.g. pooling, convolutional, etc.)
    • Best structure for scaling
    • Inspired by biological processes and variant of multilayer perceptrons
    • Key Applications: computer vision, image & video recognition
  • Other types to check out:
    • Recursive (RNN) = related to Recurrent but based on structure vs time
    • Restricted Boltzmann Machine (RBM) = 1st neural net to demonstrate learning of latent / hidden variables
    • Autoencoder (Auto) = RBM variant
    • Denoising Autoencoder (DAE)
    • Deep Belief Networks (DBN)

Neural nets can get complex in the structure and combined equations. It can be tricky and time-consuming to develop a useful model and confusing on where to start. Due to extensive research, there are already pre-baked templates for certain types of problems that you can adapt and avoid starting from scratch.

There are a couple other points to note about neural nets to point you in the right direction when developing and deploying.

Systems Engineering
In order to run a neural net to solve problems like mentioned above, it’s important to understand certain system engineering concepts.

The main one to spend time on is graphics processing units (GPUs). These chips are playing a key role in improving latency (speed) to develop NNs. You want every advantage you can get with reducing the time it takes to make a neural net.

GPUs are highly optimized for computation compared to CPUs, which is why they are popular in gaming and research. Granted, there are advances going on in CPUs that some argue are making them function more like GPUs. At the heart of this, just spend some time learning about GPUs and try running an NN on one.

I listed a few other topics in my talk that you should research further to go above and beyond single server computation of a neural net.

  • Distributed Computing
  • High-Performance Computing

Note if you go down the distributed path you are starting to get into sharing the data across nodes or splitting the model, which can be extremely tricky.  Try sticking to a single server for as long as possible because you can’t beat that latency and with where technology is, you should be able to do a lot with one computer especially when starting out. Only go down the distributed path when the data and problem are complex enough it can’t be contained on one server.

Python Packages
There are many Python packages you can use to get started with building neural nets and some that will automate most of the process for you to get you off the ground faster. Below is a list of ones I’ve come across so far.

  • Theano
  • Machine Learning Packages
    • Graphlab
    • PyLearn2
    • Lasagne
    • Kayak
    • Blocks
    • OpenDeep
    • PyBrain
    • Keras
    • Sklearn
  • Packages based in C with Python Bindings
    • Caffe
    • CXXNet
    • FANN2
  • GUI with Python API
    • MetaMind

I highly recommend that you spend time exploring Theano because it’s well documented, will give you the best exposure and control of the math and structure and it’s regularly applied to solve real world problems. Many of the machine learning packages are built off of it. The machine learning packages vary in terms of how easy they are to use, and some have easy integration with GPUs.

MNIST Code Example
For the example in the talk, I used the MNIST (Modified National Institute of Standards and Technology) dataset, which is the “hello world” of neural nets. It’s handwritten digit analysis of grayscale pictures (28 x 28 pixels).

  • Structure can be as simple as 784 inputs, 1000 hidden units, 10 outputs with at least 794K connections
  • Based on Yann LeCun’s work at AT&T with LeNet in the 1990s
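
The connection count above comes from multiplying the sizes of neighboring layers, assuming every node in one layer connects to every node in the next. A quick check of the arithmetic:

```python
# Layer sizes from the structure above: 28 x 28 = 784 input pixels,
# 1000 hidden units, and 10 output nodes (one per digit).
layer_sizes = [784, 1000, 10]

# In a fully connected net, each pair of neighboring layers contributes
# (nodes in the first) x (nodes in the second) connections.
connections = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(connections)  # 794000
```

That is 784,000 connections from input to hidden plus 10,000 from hidden to output, before counting any bias terms.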

For reference, I’ve pulled MNIST examples for some of the Python packages into the GitHub repository mentioned above.

What’s next for NN?
Neural nets will continue to play a significant role in advancements in all the areas I’ve mentioned, especially with natural language processing and computer vision. The real key value for neural nets is in automatic feature engineering, and we will continue to see neural nets applied to help identify features, especially as richer datasets for certain problems are captured.

Additionally, combining neural net structures with each other, as well as combining other machine learning models with NNs, will help drive these advancements. Some great research came out last fall around combining CNNs with RNNs to apply sentence-long descriptions to images.

A number of experts have pointed to unlabeled data as the area with the most potential long-term value: finding patterns in data that we have no knowledge of, or in data we’ve labeled with our own set of biases. These types of patterns will drive advancements that may very well be akin to what we read in sci-fi, as well as stuff we really haven’t thought of yet.

Reality is NNs are algorithms with the most potential to really create greater intelligence in our machines. Having technology that can reason and come up with new ideas is very possible when NNs are factored in.

Last thoughts…
If you want to get serious about researching neural nets, spend time studying linear algebra (matrix math), calculus (derivatives), existing neural net research and systems engineering (esp. GPUs and distributed systems). The slides I posted have a number of references and there are many other resources online. There are many great talks coming out post conferences that can help you tap into the latest progress. Most importantly, code and practice applying neural nets. Best way to learn is by doing.


Targeting Email with Random Forest at

Last fall, a couple of my colleagues (Kristiane Skiolmen, Scott Lau) and I presented Change’s machine learning email optimization approach as a lecture in Stanford’s Human Computer Interaction Seminar for CS grad students.

The video gives an overview of how Change uses email to drive petition engagement, from the business and social perspective to the specific technical optimization we made. It starts with an overview of Change and examples of petitions that have literally improved and saved lives.

As of the date of the video, here are some stats we presented:

  • 77M total users globally
  • 1.2M users visiting the site daily
  • 450M signatures total
  • 10K declared victories in over 120 countries

Our most successful source of engaging users to sign petitions is email. It’s not an ideal channel, and we know that and want to change that. Still, since it drives the most response at this time, we did take steps to optimize that channel with machine learning. I’m sharing this video and a little about the project so you can see a real world application of machine learning. Below are a couple of summary points from the video.

We have an email team that specializes in helping put petitions in front of users who would connect with them. We have groups that define certain petitions to showcase every week through email and the email team was using a cause (topic) filtering model to determine what petitions to send users. It was a manual process of tagging petitions to causes and comparing them to our user base that had been grouped by causes based on petitions they signed.

There are a lot of limitations with this approach, from scaling with data size to adapting culturally and internationally. Also, the challenge with the manual approach is that some causes had much smaller audiences and lower rates of responses; thus, certain petitions were doomed to fall short of signatures because their cause had a smaller audience.

Our data team built a model to help improve email targeting. Basically, we identified over 500 features (e.g. # petitions signed in the past, etc.) that were predictive of signatures and we tried out a couple classification algorithms to come up with a predictive model to use. The accuracy scores were pretty close on the models we investigated. So we went with a random forest algorithm because we didn’t need to binarize our data, our data is unbalanced (which random forest handles well) and it was the most transparent in feature detection if we wanted to dig into the results.

Here is how it works. Each time the email team gets a set of petitions to showcase, they send emails to a sample set of users. Based on the signature response to one petition, a random forest model is developed, and then all users are run through the model to predict their signature response to that one petition. A random forest model is built per petition the email team showcases that week, and we run signature predictions on all users for each of the showcased petitions. Each random forest model produces a probability of signature response per user, and then our program sorts the probabilities and identifies the petition with the highest probability of a signature for each user (filtering out ones the user has already received in email). The email team gets back a list of users per petition to send their showcased petitions to for that week.
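
The final selection step can be sketched in plain Python. This is not the production code from the talk; it assumes the per-petition models have already produced a probability for each user, and the user names, petition names, probabilities and function name are all hypothetical.

```python
def best_petition_per_user(probabilities, already_sent):
    """Pick the highest-probability petition for each user.

    probabilities: {user: {petition: predicted probability of signing}}
    already_sent:  {user: set of petitions the user has already been emailed}
    Returns {petition: [users to email]}.
    """
    assignments = {}
    for user, scores in probabilities.items():
        # Filter out petitions this user has already received.
        candidates = {
            petition: score
            for petition, score in scores.items()
            if petition not in already_sent.get(user, set())
        }
        if candidates:
            best = max(candidates, key=candidates.get)
            assignments.setdefault(best, []).append(user)
    return assignments

# Hypothetical model outputs for two showcased petitions.
probabilities = {
    "ana": {"parks": 0.12, "schools": 0.30},
    "ben": {"parks": 0.25, "schools": 0.40},
    "cy": {"parks": 0.05, "schools": 0.02},
}
already_sent = {"ben": {"schools"}}

assignments = best_petition_per_user(probabilities, already_sent)
print(assignments)
```

Here ben would have scored highest on the schools petition, but since he already received it, he is assigned to the parks petition instead.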

In the video, I go into more detail around how a random forest works as well as the way it was implemented. Also, Scott provides an overview of how we used Amazon Web Services to implement this data product.

Note there are other ways to approach this problem, but for what we needed, this solution has increased our sign-to-send rate by 30%, which is substantial. On one petition, for example, we would have had a 4% signature response out of a pool of 2M people to email, but our new approach with machine learning enabled us to target 5M users with a 16% signature response rate.
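
To put the single-petition example in terms of signatures, here is the arithmetic, assuming each response rate applies to the whole pool that was emailed:

```python
# Figures from the example above, using integer math to avoid rounding noise.
old_pool, old_rate_percent = 2_000_000, 4
new_pool, new_rate_percent = 5_000_000, 16

old_signatures = old_pool * old_rate_percent // 100
new_signatures = new_pool * new_rate_percent // 100

print(old_signatures)                    # 80000
print(new_signatures)                    # 800000
print(new_signatures // old_signatures)  # 10
```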

As mentioned, I don’t see email as the best communication source, and even though we can and will improve on our current solution, we are working to incorporate more effective means of engagement.

PyCon 2014 – How to get started with Machine Learning

Following up on the talk I just gave at PyCon 2014 in Montreal, I’ve explained parts of my presentation and provided a few additional clarifications. The talk video is online, my GitHub repo PyCon2014 holds the sample code, and the slides are on SpeakerDeck.

Machine Learning (ML) Overview

Arthur Samuel defined machine learning as, “Field of study that gives computers the ability to learn without being explicitly programmed”. It’s about applying algorithm(s) in a program to solve the problem you are faced with and address the type of data that you have. You create a model that will help conduct pattern matching and/or predict results. Then you evaluate the model and iterate on it as needed to create the right type of solution for the problem.

Examples of ML in the real world include handwriting analysis, which uses neural nets to read millions of pieces of mail regularly to sort and classify all the different variations in written addresses. Weather prediction, fraud detection, search, facial recognition, and so forth are all examples of machine learning in the wild.


There are several types of ML algorithms to choose from and apply to a problem and some are listed below. They are broken into categories to give an approach on how to think about applying them. When choosing an algorithm, it’s important to think about the goal/problem, the type of data available and the time and effort that you have to work on the solution.


A couple of starting points to consider are whether the problem is supervised or unsupervised. Supervised means you have actual data that represent the results you are targeting, which you use to train the model. Spam filters, for example, are built on actual data that have been labeled as spam. Unsupervised data doesn’t have a clear picture of the result. For unsupervised learning, there will be questions about the data and you can run algorithms on it to see if patterns emerge that help tell a story. Unsupervised is a challenging type of approach and typically there isn’t necessarily a “right” answer for the solution.

In addition, whether the data is continuous (e.g. height, weight) or categorical/discrete (e.g. male/female, Canadian/American) helps determine the type of algorithm to apply. Basically, it’s about whether the data has a set amount of units that can be defined or if the variations in the data are nearly infinite. These are some ways to evaluate what you have to help identify an approach to solve the problem.

Note, the algorithm categorization has been simplified a bit to help provide context, but some of the algorithms do cross the above boundaries (e.g. linear regression).


Once you have the data and an algorithmic approach, you can work on building a model. A model can be something as simple as an equation for a line (y=mx+b) or as complex as a neural net with many layers and nodes.

Linear Regression is a machine learning algorithm and a simple one to start with where you find the best fit line to represent observed data. In the talk, I showed two different examples of having observed data that exhibited some type of linear trend. There was a lot of noise (data was scattered around the graph), but there was enough of a trend to demo linear regression.

When building a model with linear regression, you want to find the most optimal slope (m) and intercept (b) based on the actual data. See, algebra is actually applicable in the real world. This is a simple enough algorithm to calculate the model yourself, but it’s better to leverage tools like scikit-learn’s library to help you more efficiently calculate the best fit line. What you are calculating is the line that minimizes the total squared distance between the line and the observed data points.
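
For a sense of what calculating it yourself looks like, here is a minimal least squares sketch in plain Python. The data points are made up and are not the dataset from the talk; in practice, scikit-learn’s LinearRegression handles this for you.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns slope m and intercept b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The best fit line always passes through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b

# Made-up noisy observations with a roughly linear trend.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

m, b = fit_line(xs, ys)
print(m, b)

# Predict y for a new x using y = mx + b.
print(m * 5.0 + b)
```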

After generating a model, you should evaluate the performance and iterate to improve the model as needed if it is not performing as expected. For more info, I also explained linear regression in a previous post.
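
One common way to evaluate a regression model is R squared, which compares the model’s squared error to the error you would get by always predicting the mean. The sketch below is one option among several, and the numbers are invented for illustration.

```python
def r_squared(ys, predictions):
    """1.0 means a perfect fit; 0.0 means no better than predicting the mean."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # model error
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                  # baseline error
    return 1 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0]

print(r_squared(observed, [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared(observed, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r_squared(observed, [1.2, 1.9, 3.1]))  # somewhere in between
```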


When we have a good model, you can take in new data and output predictions. Those predictions can feed into some type of data product or generate results for a report or visualization.

In my presentation, I used actual head size and brain weight data to build a model that predicts brain weight based on head size. Since the dataset was fairly small, the model has less predictive power and more potential for error. I went with this data since it was a demo, and I wanted to keep it simple. When graphed, the observed data was spread out, which also indicated error and a lot of variance in the data. So it predicts weight with a good amount of variance in the model.

With the linear model I built, I was able to apply it so that I could feed it a head size (x) and it would calculate the predicted brain weight (y). Other models are more complex regarding the underlying math and application. Still, you will do something similar with other models in regards to making them and then feeding in new features/variables to generate some type of result.

To see the full code solution, check out the GitHub repository as noted above. The script is written a little differently from the slides because I created functions for each of the major steps. Also, there is an IPython notebook that shows some of the drafts I worked through to build out the code for the presentation.


The Python stack is becoming pretty popular for scientific computing because of the well-supported toolsets. Below is a list of key tools to start learning if you want to work with ML. There are many other Python libraries out there for more nuanced needs in the space as well as other stack packages to explore (R, Java, Julia). If you are trying to figure out where to start, here are my recommendations:

  • Scikit-Learn = machine learning algorithms
  • Pandas = dataframe tool
  • NumPy = matrix manipulation tool
  • SciPy = stats models
  • Matplotlib = visualization


In order to work with ML algorithms and problems, it’s important to build out your skill set regarding the following:

  • Algorithms
  • Statistics (probability, inferential, descriptive)
  • Linear Algebra (vectors & matrices)
  • Data Analysis (intuition)
  • SQL, Python, R, Java, Scala (programming)
  • Databases  & APIs (get data)


And of course, the next question is where do I go from here? Below is a beginning list of resources to get you started. I highly recommend Andrew Ng’s class, and a couple of links are to sites with more recommendations on what to check out next:

  • Andrew Ng’s Machine Learning on Coursera
  • Khan Academy (linear algebra and stats)
  • Metacademy
  • Open Source Data Science Masters
  • StackOverflow, Data Tau, Kaggle
  • Machine Learning: A Love Story
  • Collective Intelligence – Toby Segaran
  • Pattern Recognition & Machine Learning – Christopher Bishop
  • Think Stats – Allen Downey
  • Tom Mitchell
  • Mentors

One point to note from this list and I stressed this in the talk, seek out mentors. They are out there and willing to help. You have to put it out there what you want to learn and then be aware when someone offers to help. Also follow-up. Don’t stalk the person but reach out to see if they will make a plan to meet you. They may only have an hour or they may give you more time than you expect. Just ask and if you don’t get a good response or have a hard time understanding what they share, don’t stop there. Keep seeking out mentors. They are an invaluable resource to get you much farther faster.

Last Point to Note

ML is not the solution for everything and many times can be overkill. You have to look at the problem you are working on to determine what makes the most sense in regards to your solution and how much data you have available. Plus, I highly recommend looking for the simple solution first before reaching for something more complex and time-consuming. Sometimes regex is the right answer and there is nothing wrong with that. As mentioned, to figure out an approach, it’s good to understand the problem, the data, the amount of data you have and the timing to turn the solution around.

Good luck in your ML pursuit.


These are the main references I used in putting together my talk and post.

  • Zipfian
  • “Analyzing the Analyzers” – Harlan Harris, Sean Murphy, Marck Vaisman
  • “Doing Data Science”  – Rachel Schutt & Cathy O’Neil
  • “Collective Intelligence” – Toby Segaran
  • “Some Useful Machine Learning Libraries” (blog)
  • University GPA Linear Regression Example
  • Scikit-Learn (esp. linear regression)
  • Mozy Blog
  • StackOverflow
  • Wiki