
PyCon 2015: Neural Nets for Newbies

The ideas and methods in neural nets (NNs) have been around for a long time, but in the last decade plus, we are finally starting to reap significant benefits, and this is just the beginning. This post provides an overview of my recent PyCon talk in Montreal which is a neural net primer of sorts. The video is below, my slides are on SpeakerDeck, and I have a repo on Github named Neural Nets for Newbies.

There is too much material to fully explain neural nets here; instead, this post and the talk provide a framework for starting to understand them. If you want to learn more, there are plenty of resources, some listed in my deck, to dive into.

What are they?
Machine learning is a set of algorithms for classification and prediction, and artificial neural nets are part of the machine learning space. At their core, neural nets are an algorithm, an equation (technically, a combination of equations) that helps abstract and find patterns in data.

The structure is modeled after our brains. Before you get all excited about robots that can think like us (side note: that idea has been around since before the common era), the reality is that we still don’t fully understand how the human brain functions. Neural nets only loosely mimic brain functionality. For the enthusiasts out there, yes, there are many researchers focused on creating a closer biological representation that acts like our brain. The bottom line is we aren’t there yet.

The algorithm, structure and many of the ideas around neural net functionality have been around for a while; several of them date back to the 1950s. Neural nets have been applied to commercial solutions as far back as 1959 (reducing phone line echoes), but we really haven’t seen significant value until recently. Key reasons are that our computational power (computer processing speed and memory capabilities) and access to useful data (amount of stored data) have significantly improved in the last decade alone.

Why should I care?
Because NNs have achieved technical advancement in areas like:

  • Natural Language Processing (Search & Sentiment)
  • Speech Recognition (Siri)
  • Computer Vision & Facial Recognition (Automatic Image Tagging)
  • Robotics (Automated Car)
  • Recommender Systems (Amazon)
  • Ad Placement

Some of you may roll your eyes at these advancements and complain about how Siri is limited in interactions. Need I remind you that we weren’t talking to our machines at the beginning of this century (or at least it wasn’t common). Hell, we didn’t have iPods at the beginning of this century if you remember what they are. I too fall into the sci-fi trap where I’ve seen it or read about it and so when we actually experience real advancements, it seems so boring and behind the times. Yeah, get over that.

All the areas I mentioned above still have plenty of room for growth, and there are definitely other areas I haven’t listed, especially in scientific fields. One of the reasons neural nets have had such an impressive impact is their ability to handle more data, especially data that has layers of complexity. This doesn’t mean that neural nets should be used for all problems; they are overkill in many situations. I cannot stress enough that not every problem is a nail for the NN hammer.

If you have a good problem and want to apply NNs, it’s important to understand how they work.

Ok, so how do they work?
Out of the gate: if you want to get serious about applying NNs, you will need to embrace math, no matter how much you don’t like it. Below I’ve given you some fundamentals around the math and the structure to get you started.

Basic Structure
Our brains are made up of neurons and synapses, and based on our interactions, certain neurons will fire and send signals to other neurons for data processing/interpretation. There is much more complex stuff going on than just that in our brains, but at a high-level that expresses the structure the neural net models.

NNs at a minimum have three layers: input, hidden, output.

  • Input = data
    • Data that is broken up into consumable information
    • Data can be pre-processed or raw
    • Bias and noise are applied sometimes
  • Hidden = processing units (aka, does math)
    • Made up of neurons
    • A neuron determines if it will be active (math under equation section)
    • Typically there are multiple neurons in a hidden layer (can be thousands or even billions depending on the data used and objective)
  • Output = results
    • One node per classification; a net can have just one or many
    • A net to classify dogs vs. cats in a picture has two output nodes, one for each classification
    • A net to classify handwritten digits between 0-9 has ten output nodes

You can have more than one hidden layer in a neural net, and when you start adding hidden layers, they trade off as inputs and outputs based on where they are in the structure.
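The layered structure above can be sketched in plain Python. This is a toy example with made-up layer sizes and randomly initialized weights, not a trained model:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer: each neuron computes a weighted sum of inputs plus bias, then an activation."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

random.seed(0)
n_inputs, n_hidden, n_outputs = 4, 5, 2  # e.g. 2 output nodes for cat vs. dog

# Randomly initialized weights between -1 and 1: one weight per input for each neuron
hidden_w = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_b = [0.0] * n_hidden
output_w = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_outputs)]
output_b = [0.0] * n_outputs

inputs = [0.5, 0.1, 0.9, 0.3]                # input layer: data in numerical format
hidden = layer(inputs, hidden_w, hidden_b)   # hidden layer: processing units
outputs = layer(hidden, output_w, output_b)  # output layer: one node per classification
print(outputs)
```

Adding a second hidden layer would just mean feeding `hidden` into another `layer(...)` call before the output layer, which is the input/output trade-off described above.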

Basic Equation
Each neuron represents an equation: it takes in a set of inputs, multiplies them by weights, combines the results and then applies an activation function to determine if the neuron is active. A neuron is known as a processing unit because it computes the data to determine its response.

  • Inputs = input layer data in numerical format
  • Weights = coefficients (also known as theta)
    • Specialize each neuron to handle the problem (dataset) you are working with
    • Can initialize randomly
    • One way to initialize is to create a distribution of the existing data set and randomly sample that distribution
    • Often weights are represented as values between -1 and 1
  • Bias = can be included as an input or used as a threshold to compare data after the activation function is applied
  • Activation Function = data transformation to determine if the neuron will send a signal
    • Also known as the energy function
    • There are many different equations that can be used, and it depends on the problem and data you are working with
    • Example equations: sigmoid/logistic, step/binary threshold, linear, rectified linear (combines binary threshold & linear), …
  • Output(s) = each node results in a binary, percentage or number range
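A few of the activation functions named above, plus the weighted-sum step of a single neuron, can be sketched directly (the input and weight values are made up for illustration):

```python
import math

def step(x, threshold=0.0):
    """Binary threshold: the neuron is fully on or fully off."""
    return 1.0 if x >= threshold else 0.0

def sigmoid(x):
    """Sigmoid/logistic: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    return x

def relu(x):
    """Rectified linear: combines the binary threshold and linear ideas."""
    return max(0.0, x)

# A single neuron: combine weighted inputs, then let an activation decide the response
inputs  = [0.2, 0.8, -0.5]
weights = [0.4, -0.1, 0.6]
bias    = 0.1
z = sum(w * x for w, x in zip(weights, inputs)) + bias
print(step(z), round(sigmoid(z), 3), relu(z))
```

Swapping one activation function for another changes how the same weighted sum `z` is interpreted, which is part of why the choice depends on the problem and data.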


Each neuron is unique from other neurons in a hidden layer based on the weights applied. They can also be unique in the inputs and outputs. There are many hyperparameters that you can tweak for one single neuron let alone the whole structure to improve its performance. What makes neural nets powerful is the combination of linear with nonlinear functions in the equation.

When applying a neural net, an effort is needed to optimize the model, so it produces the results you are targeting.

The breakthroughs in neural nets are largely in the area of supervised learning. Supervised learning means you have a dataset labeled with the results you expect. The data is used to train the model so you can make sure it functions as needed. Cross validation is a technique typically used in supervised learning where you split the dataset into a training set to build the model and a test set for validation. Note, there are areas of neural net research that explore unlabeled data, but that is too much to cover in this post.

In order to optimize, you start out with a structure and probably randomized weights on each neuron in the hidden layer(s). You’ll run your labeled data through the structure and come out with results at the end. Then you compare those results to the real labels using a loss function to help define the error value. The loss function transforms the comparison so it becomes a type of compass for going back and optimizing the weights on each neuron.

The optimization method (aka back propagation or backprop) is a way of taking the derivative of the loss function and applying it to the weights throughout the model. This method can change all weights on every neuron and because of the way the method works, it does not change the weights equally. You want shifts that vary across weights because each neuron is unique.

  • Error = difference between NN results to the real labels
  • Loss Function = calculates the error  (also referred to as cost function)
    • There are many different equations that are used, and it depends on the problem and data you are working with
    • Example equations: mean squared error, negative log likelihood, cross entropy, hinge, …
  • Regularization = noise applied in the loss function to prevent overfitting
  • Optimization Method = learning method to tune weights
    • There are many different equations that are used, and it depends on the problem and data you are working with
    • Example equations: stochastic gradient descent, Adagrad (J Duchi), Adadelta (M Zeiler), RMSprop (T. Tieleman), …
  • Learning Rate = how much to change the weights each time; sometimes built into the optimization algorithm

Backprop in essence wiggles (to quote Karpathy) the weights a little each time you run the data through the model during training. You keep running the data through and adjusting the weights until the error stops changing. Hopefully it’s as low as you need it to be for the problem. And if it’s not, you may want to investigate other model structure modifications.
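A toy illustration of that training loop: a single neuron whose weights get wiggled downhill on a mean squared error loss. Real backprop uses analytic derivatives of the loss; this sketch approximates them numerically just to show the error shrinking as the weights shift, and the dataset is made up:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

def mse(weights, data):
    """Loss function: mean squared error between predictions and real labels."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

# Tiny labeled dataset: the output should follow the first input
data = [([0.0, 1.0], 0.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0), ([0.0, 0.0], 0.0)]
weights = [0.1, -0.2]   # random-ish starting weights
learning_rate = 0.5
eps = 1e-4

start_error = mse(weights, data)
for _ in range(500):                       # each pass "wiggles" the weights a little
    grads = []
    for i in range(len(weights)):          # numerical derivative of the loss per weight
        bumped = list(weights)
        bumped[i] += eps
        grads.append((mse(bumped, data) - mse(weights, data)) / eps)
    # each weight shifts by a different amount, because each gradient differs
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]
end_error = mse(weights, data)
print(start_error, "->", end_error)
```

You keep running the data through until the error stops changing; here it settles at a floor because this neuron has no bias term, which is the kind of structural modification you would investigate next.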

Note reducing the error rate is a common model objective but not always the objective. For the sake of simplicity, that’s our focus right now.

Validation / Testing
Once you’ve stopped training your model, you can run the test data set through it to see how it performs. If the error rate is horrible, then you may have overfit, or there could be a number of other issues to consider. Error rate and other standard validation approaches can be used to check how your model is performing.
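A minimal sketch of running held-out test data through a model to get an error rate. The "model" here is a hypothetical stand-in threshold rule rather than a trained net (it happens to match the labeling rule exactly, so its test error comes out to zero, which a real model won't manage):

```python
import random

random.seed(42)
# Hypothetical labeled dataset: (features, label) pairs
dataset = [([x], 1 if x > 0.5 else 0) for x in [random.random() for _ in range(100)]]

# Cross-validation style split: hold out data the model never trains on
random.shuffle(dataset)
split = int(0.8 * len(dataset))
train, test = dataset[:split], dataset[split:]

def model(features):
    """Stand-in for a trained net: a simple threshold classifier."""
    return 1 if features[0] > 0.5 else 0

errors = sum(1 for x, y in test if model(x) != y)
error_rate = errors / len(test)
print(f"test error rate: {error_rate:.2%}")
```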

Structure Types
I’ve given you a basic structure of how the neural net connects, but it’s important to understand there are variations on that structure that are better for different types of problems. Example types include:

  • Feed Forward (FFN) =  basic structure and passes data forward through the structure in the order of connections
    • There are no loops
    • Data moves in one direction
    • Key Applications: financial prediction, image compression, medical diagnosis and protein structure prediction
  • Recurrent (RNN) = depending on the timing the neuron fires, data can be looped back earlier in the net structure as inputs
    • Data can become input to the same neuron, other neurons in that layer or neurons in a hidden layer prior to that layer
    • Operates on linear progression of time
    • Good for supervised learning in discrete time settings
    • Key Applications: sentiment analysis, speech recognition, NLP
  • Convolutional (CNN) = uses a mixture of hidden layer types (e.g. pooling, convolutional, etc.)
    • Best structure for scaling
    • Inspired by biological processes and variant of multilayer perceptrons
    • Key Applications: computer vision, image & video recognition
  • Other types to checkout:
    • Recursive (RNN) = related to Recurrent but based on structure vs time
    • Restricted Boltzmann Machine (RBM) = 1st neural net to demonstrate learning of latent / hidden variables
    • Autoencoder (Auto) = RBM variant
    • Denoising Autoencoder (DAE)
    • Deep Belief Networks (DBN)

Neural nets can get complex in the structure and combined equations. It can be tricky and time-consuming to develop a useful model and confusing on where to start. Due to extensive research, there are already pre-baked templates for certain types of problems that you can adapt and avoid starting from scratch.

There are a couple other points to note about neural nets to point you in the right direction when developing and deploying.

Systems Engineering
In order to run a neural net to solve problems like those mentioned above, it’s important to understand certain systems engineering concepts.

The main one to spend time on is graphical processing units (GPUs). These chips are playing a key role in improving latency (speed) to develop NNs. You want every advantage you can get with reducing the time it takes to make a neural net.

GPUs are highly optimized for computation compared to CPUs, which is why they are popular in gaming and research. Granted, there are advances going on in CPUs that some argue are making them function more like GPUs. At the heart of this, just spend some time learning about GPUs and try running an NN on one.

I listed a few other topics in my talk that you should research further to go above and beyond single server computation of a neural net.

  • Distributed Computing
  • High-Performance Computing

Note, if you go down the distributed path you are starting to get into sharing the data across nodes or splitting the model, which can be extremely tricky. Try sticking to a single server for as long as possible: you can’t beat that latency, and with where technology is, you should be able to do a lot with one computer, especially when starting out. Only go down the distributed path when the data and problem are complex enough that they can’t be contained on one server.

Python Packages
There are many Python packages you can use to get started with building neural nets and some that will automate most of the process for you to get you off the ground faster. Below is a list of ones I’ve come across so far.

  • Theano
  • Machine Learning Packages
    • Graphlab
    • PyLearn2
    • Lasagne
    • Kayak
    • Blocks
    • OpenDeep
    • PyBrain
    • Keras
    • Sklearn
  • Packages based in C with Python Bindings
    • Caffe
    • CXXNet
    • FANN2
  • GUI with Python API
    • MetaMind

I highly recommend that you spend time exploring Theano because it’s well documented, gives you the best exposure to and control of the math and structure, and is regularly applied to solve real-world problems. Many of the machine learning packages are built off of it. The machine learning packages vary in terms of how easy they are to use, and some have easy integration with GPUs.

MNIST Code Example
For the example in the talk, I used the MNIST (Mixed National Institute of Standards and Technology) dataset, which is the “hello world” of neural nets. It’s handwritten digit analysis of grayscale pictures (28 x 28 pixels).

  • Structure can be as simple as 784 inputs, 1000 hidden units, 10 outputs with at least 794K connections
  • Based on Yann LeCun’s work at AT&T with LeNet in the 1990s

For reference, I’ve pulled MNIST examples for some of the Python packages into the Github repository mentioned above.
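The "at least 794K connections" figure above is just the count of weights in a fully connected 784 → 1000 → 10 structure, which you can verify directly:

```python
# Weighted connections in a fully connected 784 -> 1000 -> 10 net
n_inputs, n_hidden, n_outputs = 28 * 28, 1000, 10

input_to_hidden = n_inputs * n_hidden    # every pixel feeds every hidden neuron
hidden_to_output = n_hidden * n_outputs  # every hidden neuron feeds every output node
total = input_to_hidden + hidden_to_output
print(total)
```

The "at least" covers bias terms and any extra hidden layers, which only push the count higher.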

What’s next for NN?
Neural nets will continue to play a significant role in advancements in all the areas I’ve mentioned, especially natural language processing and computer vision. The real key value of neural nets is in automatic feature engineering, and we will continue to see them applied to help identify features, especially as richer datasets for certain problems are captured.

Additionally, combining neural net structures, as well as combining other machine learning models with NNs, will help drive these advancements. Some great research came out last fall around combining CNNs with RNNs to apply sentence-long descriptions to images.

Where a number of experts see the long-term value is the potential impact on unlabeled data: finding patterns in data that we have no knowledge of, or data we’ve labeled with our own set of biases. These types of patterns will drive advancements that may very well be akin to what we read in sci-fi, as well as things we really haven’t thought of yet.

The reality is that NNs are the algorithms with the most potential to create greater intelligence in our machines. Having technology that can reason and come up with new ideas is very possible when NNs are factored in.

Last thoughts…
If you want to get serious about researching neural nets, spend time studying linear algebra (matrix math), calculus (derivatives), existing neural net research and systems engineering (esp. GPUs and distributed systems). The slides I posted have a number of references and there are many other resources online. There are many great talks coming out post conferences that can help you tap into the latest progress. Most importantly, code and practice applying neural nets. Best way to learn is by doing.


MapReduce, MRJob & AWS EMR Pointers

Over the last couple weeks, I’ve been playing around with MapReduce, MRJob and AWS to answer some questions about event data. Granted this is definitely more data engineering focused than data science, but using these tools can be very beneficial if you are analyzing a ton of data (esp. event log data).

This is more of an overview with a few lessons learned on how to setup a MapReduce job using MRJob and AWS EMR. This post focuses more on process and less about the script logic.

First, What is MapReduce (MR)?
MR is an approach, used especially on large amounts of data, for applying some type of filtering and organization to the data and then condensing it into a result. It was born from similar concepts used in functional programming.

Map = procedure to filter and sort
Reduce = procedure to condense and summarize

Word count is typically used as the “Hello World” of MapReduce. So think about taking a book like Hitchhiker’s Guide to count the occurrences of all the words. The map step would create key/value pairs (i.e. dictionary or hash format) for every single word in the book. So a word is the key and a value like the number 1 would be applied (e.g. “hitchhiker”: 1, “galaxy”: 1, “guide”: 1, “hitchhiker”: 1). There would be duplicate keys outputted.

The reduce step condenses all duplicate keys like “hitchhiker” into a single, unique key for each word, where all the related values are put into a list (e.g. “hitchhiker”: [1,1,1,1,1,1,….1]). The list will contain a 1 for every occurrence of “hitchhiker” in the book. Then reduce can perform a summarization task by literally adding up the numbers or taking the length of the list. The reduce step also outputs a key/value pair for each unique word in the book with the summed value (e.g. “hitchhiker”: 42).
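The word-count steps above can be sketched in plain Python as a toy stand-in for what Hadoop distributes across machines (the example lines are made up):

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) key/value pair for every word, duplicates included."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(pairs):
    """Reduce step: group the values for each unique key into a list, then sum them."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)          # e.g. "hitchhiker": [1, 1, ...]
    return {key: sum(values) for key, values in grouped.items()}

text = ["the hitchhiker consulted the guide",
        "the guide helped the hitchhiker cross the galaxy"]

pairs = [pair for line in text for pair in mapper(line)]
counts = reducer(pairs)
print(counts["the"], counts["hitchhiker"], counts["guide"])
```

In a real Hadoop job the framework handles the grouping (shuffle/sort) between the map and reduce steps; here it is folded into `reducer` for brevity.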

This is a very simplistic example of MR and there are many complex variations based on the problem being solved. For example, MapReduce is cited as a solution to use on something like Twitter follower recommendations. There are a number of online resources that provide more examples and just looking at other examples can help with defining the MR logic. A couple resources I found in my research covered complex patterns and an overview of different types of patterns.

Other Tools/Techniques to Understand:
In order to follow along, these are brief overviews of key approaches and tools. I recommend reading further on each of them.

  • Hadoop = An Apache framework that helps distribute, store and process data across many machines (cluster)
  • HDFS = Hadoop Distributed File System is the storage solution that is part of the Hadoop framework. It is specifically geared for distributed systems.
  • S3 = Simple Storage Service is just an AWS storage system. You cannot open files or run programs in S3
  • EC2 = Elastic Compute Cloud instances are virtual Amazon computers for rent. You can create, run and terminate servers as needed, which led to the term elastic. It’s like having additional computers you can configure and run applications on as needed, without having to maintain the hardware or operating system
  • EMR = Elastic MapReduce is an Amazon web service that pre-configures EC2 instances with Hadoop. Basically, it’s a service to easily spin up a cluster of Hadoop-formatted machines without having to acquire and set up the hardware and software yourself
  • MRJob = a package/library developed by Yelp so you can write Python MapReduce scripts that will run on Hadoop frameworks
  • Screen = GNU software to multiplex terminal sessions/consoles. This means you can run a program encapsulated in its own environment. For example, if you kick off an EMR job from your computer at work inside a screen session, you can detach the session, shut your computer down, go home and then log back into the screen session to see that the job has been uninterrupted and is still processing

Hadoop should typically not be your immediate choice when analyzing data. I’ve heard this multiple times from different experts in the field. You really need to think about the type of problems you are solving/questions you are answering, the type, format and size of data and time and money to apply to the problem.

Many times in data science you can take adequate samples of data to answer questions and solve problems without needing Hadoop for processing. So when you take on a challenge and people are immediately saying Hadoop, take time to talk through if that is appropriate. Make your life easier by going for simpler solutions first before bringing out the big guns.

Overview of My Experience / Lessons Learned:

Abstract Challenge:
For the problem I worked on, I was trying to answer a number of questions around activity using event logs. Event data is prolific and can be a good case for using MR.

The event logs were JSON formatted, 1 event per line, in several files that were gzipped and stored on S3. The good thing about MRJob is that it has built-in protocols to handle unzipping and processing files. The only adjustment I made to my script, to handle the fact that each line was already in JSON format, was to add the lines below:

  • from mrjob.protocol import JSONValueProtocol
  • INPUT_PROTOCOL = JSONValueProtocol
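To illustrate the shape of such a job, here is the mapper/reducer logic written as plain functions rather than a full MRJob class (in the real script, these would be `mapper`/`reducer` methods on a class inheriting from `MRJob`, with `INPUT_PROTOCOL = JSONValueProtocol` so each line arrives already parsed, as simulated here). The event fields "type" and "page" are hypothetical, not from my actual logs:

```python
import json
from collections import defaultdict

def mapper(event):
    """With JSONValueProtocol, the mapper receives each log line as a parsed dict."""
    if event.get("type") == "click":        # conditional filter in the map step
        yield (event["page"], 1)            # the groupby-style data point becomes the key

def reducer(pairs):
    """Condense the yielded pairs into per-key counts."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

# One JSON event per line, as in the gzipped S3 log files
lines = ['{"type": "click", "page": "/home"}',
         '{"type": "view",  "page": "/home"}',
         '{"type": "click", "page": "/about"}',
         '{"type": "click", "page": "/home"}']

pairs = [p for line in lines for p in mapper(json.loads(line))]
print(reducer(pairs))
```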


Local & Small

  • Pulled 1 JSON event to analyze what data was available and focused on the data values needed to answer the question
  • Simulated/created a few variations on the JSON event example to cover key use cases for testing the MRJob script
  • Developed 1 MR step (mapper/reducer) to just pull simple data point counts on the dummy JSON values
  • Expanded MR to address the more complex question. This led to a multi-step MR job (2 rounds of map and reduce) which eventually condensed to 1 map and 2 reduce steps
    • Where the questions would call for a SQL join or groupby to get the answer, used those data points as keys
    • Used a conditional in the map step to filter and streamline the yielded results that needed to be sorted and condensed
  • During initial code development, ran and debugged MRJob code locally on dummy JSON event data, which was also stored locally. Used the following bash command to run MRJob:
    • python [MRJob script] [data file] # this outputs results to stdout (e.g. terminal)
    • Note you can run specific steps to focus debugging, like just the map step, by appending something like --mapper to the command above

Local & AWS Remote

  • Once the results from the dummy values looked good, spun up an EC2 instance
    • FYI, if you don’t have an account with AWS, sign up and get your access keys (id & secret)
  • Pulled 1 zip file from S3 onto the instance because it was too big for my personal computer
    • Used s3cmd (command line tool) to get access to data
    • s3cmd get [filename] # downloads file on EC2 instance
  • Unzipped file and pulled about 100 events into a sample file. Then exported the sample file back to S3
    • gunzip [filename]
    • head -n 100 [filename] > sample.txt
    • s3cmd put [filename] # store file back on s3
  • Ran exploratory data analysis using Pandas on the sample to verify data structure and results
  • Referenced the sample data pulled down from S3 and ran locally to debug initially
    • python [MRJob script] [data file]
  • Once the script worked and the numbers made sense, ran the file through EMR
    • Set up the MRJob config file (see example further below)
    • Set up the pem file and stored it on the computer running the MRJob script
    • Used the following command:
    • python [MRJob script] -o "[s3 bucket for output]" -r emr "[s3 bucket for input]"
    • S3 file paths typically start with s3://, and make sure quotes are around the path
  • There were issues to debug on the sample, but once fixed, I set up a screen session and ran the code on the full data set

Configuration Tips:
To make it easier to run MRJob, use a configuration file. MRJob provides documentation on how to set this up. One way is to put the file in your home directory (~/) and name it .mrjob.conf so MRJob automatically finds it. There are a number of things that can be pre-configured, which saves on how long the command line script is when running an EMR job.

  • runners:
    • emr:
      • aws_access_key_id: [your key]
      • aws_secret_access_key: [your key]
      • cmdenv:
        • TZ: America/
      • ssh_tunnel_to_job_tracker: true
      • aws_region: us-west-2
      • ec2_instance_type: m1.xlarge
      • ec2_key_pair: [pem file name]
      • ec2_key_pair_file: ~/.ssh/[pem file name].pem
      • num_ec2_core_instances: 0
      • enable_emr_debugging: true
      • ami_version: latest

Note, above is an example and there are many variations and additional parameters you can add and change based on the job you are running. Also, some parameters have to be entered in at the command line and cannot be added to the configuration file.

When you first run a small sample job on EMR, do it with 1 instance and on something with less horsepower. In this case, you would set the EC2 instance type to something in the m1 range with 0 core instances (just the master). These parameters control the instance type and number:

  • ec2_instance_type: m1.small
  • num_ec2_core_instances: 0

If your script calls for a package that is not standard on EMR, you will need to bootstrap/load the EMR instances with the package prior to running the job. In my case, I needed dateutil and in order to load it, I first needed to load pip. So I added the following commands to my config file:

  • bootstrap:
    • - sudo apt-get install -y python-pip || sudo yum install -y python-pip
    • - sudo pip install python-dateutil

Also, when setting up AWS, you will need to create an EC2 pem file (better known as key pairs) for encryption. This is different from the access keys. AWS provides a step-by-step process to set up the pem file. Be sure to remember the name you give it and to move the downloaded file into a folder you reference in your configuration; most people typically put it in the .ssh folder in their home directory. Also, for Mac and Linux, be sure to change the file permissions by running chmod 400 on the pem file in the terminal. You can confirm the permissions changed by running ls -la in the same folder as the pem file. The following commands reference the pem file in the .mrjob.conf file:

  • ec2_key_pair: [pem file name]
  • ec2_key_pair_file: ~/.ssh/[pem file name].pem

A couple additional setup tips:

  • When running on more than the sample data, be sure to add --no-output to the command line or configuration file. This makes sure that the job does not output all stdout values on your local computer when the full data set is being processed. You really don’t want that
  • Set S3 and EMR to the same region so Amazon will not charge for bandwidth used between them
  • Stop instances when you are not using them to save money

If you want to see what’s going on with the EMR instances, you can login while they are running and poke around. Login to the AWS console and go to EMR. Click on the cluster that is running your job. There will be a Master public DNS you will want to use. Just use the following command in your terminal:

  • ssh hadoop@[EMR Master public DNS] OR
  • ssh -i ~/.ssh/[pem file name].pem hadoop@[EMR Master public DNS]

This will connect your terminal directly into the EMR instances. You can poke around and see what’s going on while they are running. Unless you specify to keep EMR running after the job is done, then the instances will terminate at the end of the job and boot you out.

Troubleshooting Pointers:
Data Inconsistency – Be careful to analyze the data and confirm what you have access to. This is a common challenge. In my case, there were missing values out of different events I worked with which required changing the code a few times to do a check for values as well as get information out of different data points that were more consistent. Bottom line, don’t trust the data.

DateTime & UTC – I’ve wrestled with this devil many times in the past and it still tripped me up on this project. Make sure if your conditionals are working with time, and they typically will be with event logs, to deliberately translate and compare datetime in UTC format.

Traceback Error – If your EMR terminates with errors then you can go into the AWS dashboard and the EMR section. Choose the cluster that you ran, then expand the Steps area. Click on View jobs under the job that failed. Then click on View tasks under the job that failed. Select View attempts next to a task that failed and choose stderr link on a task that failed. That will open an error log that can help provide more context around what went wrong.

Final Thoughts:
There are so many variations on the process outlined above not to mention many different tools you can use. This post was to give some pointers on my approach at a pseudo high-level for those trying to figure out the end to end process. You really have to research and figure out what works best for your situation.

Where I would go next with this area is to play around with something like Spark to handle streaming data and to explore implementing machine learning algorithms on massive data. Although neural nets are more of a personal interest for me right now.

PyCon 2014 – How to get started with Machine Learning

Following up on the talk I just gave at PyCon 2014 in Montreal, I’ve explained parts of my presentation and provided a few additional clarifications. You can catch the talk online, my github repo PyCon2014 holds the sample code, and the slides are on SpeakerDeck.

Machine Learning (ML) Overview

Arthur Samuel defined machine learning as the “field of study that gives computers the ability to learn without being explicitly programmed”. It’s about applying algorithm(s) in a program to solve the problem you are faced with and address the type of data that you have. You create a model that will help conduct pattern matching and/or predict results. Then you evaluate the model and iterate on it as needed to create the right type of solution for the problem.

Examples of ML in the real world include handwriting analysis, which uses neural nets to read millions of pieces of mail regularly, sorting and classifying all the different variations in written addresses. Weather prediction, fraud detection, search, facial recognition, and so forth are all examples of machine learning in the wild.


There are several types of ML algorithms to choose from and apply to a problem, and some are listed below. They are broken into categories to suggest how to think about applying them. When choosing an algorithm, it’s important to think about the goal/problem, the type of data available, and the time and effort you have to work on the solution.


A couple starting points to consider are whether the learning is unsupervised or supervised. Supervised means you have actual data that represents the results you are targeting, which is used to train the model; spam filters, for example, are built on actual data that has been labeled as spam. With unsupervised data, there isn’t a clear picture of the result: there will be questions about the data, and you can run algorithms on it to see if patterns emerge that help tell a story. Unsupervised learning is a challenging approach, and typically there isn’t necessarily a “right” answer for the solution.

In addition, whether the data is continuous (e.g. height, weight) or categorical/discrete (e.g. male/female, Canadian/American) helps determine the type of algorithm to apply. Basically, it’s about whether the data has a set number of values that can be defined or whether the variations in the data are nearly infinite. These are some ways to evaluate what you have to help identify an approach to solve the problem.

Note, the algorithm categorization has been simplified a bit to help provide context, but some of the algorithms do cross the above boundaries (e.g. linear regression).


Once you have the data and an algorithmic approach, you can work on building a model. A model can be something as simple as an equation for a line (y=mx+b) or as complex as a neural net with many layers and nodes.

Linear regression is a machine learning algorithm and a simple one to start with: you find the best fit line to represent observed data. In the talk, I showed two different examples of observed data that exhibited some type of linear trend. There was a lot of noise (data scattered around the graph), but there was enough of a trend to demo linear regression.

When building a model with linear regression, you want to find the most optimal slope (m) and intercept (b) based on the actual data. See, algebra is actually applicable in the real world. This is a simple enough algorithm to calculate the model yourself, but it’s better to leverage a tool like scikit-learn to calculate the best fit line more efficiently. What you are calculating is the line that minimizes the squared distances to all the observed data points.

After generating a model, you should evaluate its performance and iterate to improve it if it is not performing as expected. For more info, I also explained linear regression in a previous post.


When you have a good model, you can take in new data and output predictions. Those predictions can feed into some type of data product or generate results for a report or visualization.

In my presentation, I used actual head size and brain weight data to build a model that predicts brain weight based on head size. Since the dataset was fairly small, the predictive power is decreased and the potential for error in the model is increased. I went with this data since it was a demo and I wanted to keep it simple. When graphed, the observed data was spread out, which also indicated error and a lot of variance in the data, so the model predicts weight with a good amount of variance.

With the linear model I built, I could feed in a head size (x) and it would calculate the predicted brain weight (y). Other models are more complex in their underlying math and application, but you will still do something similar with them: build the model, then feed in new features/variables to generate some type of result.
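As a sketch of that fit-then-predict flow with scikit-learn (the head size and brain weight numbers below are made up for illustration; the real dataset is in the talk repo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up head size (cm^3) and brain weight (g) observations.
head_size = np.array([[3500], [3700], [3900], [4100], [4300]])  # feature (x)
brain_weight = np.array([1180, 1250, 1300, 1370, 1420])         # target (y)

model = LinearRegression()
model.fit(head_size, brain_weight)   # finds the optimal slope (m) and intercept (b)

m, b = model.coef_[0], model.intercept_
predicted = model.predict([[4000]])  # feed in a new head size, get a predicted brain weight
```

The same fit/predict pattern carries over to the more complex models later in the post.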

To see the full code solution, check out the GitHub repository noted above. The script is written a little differently from the slides because I created functions for each of the major steps. Also, there is an IPython notebook that shows some of the drafts I worked through to build out the code for the presentation.


The Python stack is becoming pretty popular for scientific computing because of its well supported toolsets. Below is a list of key tools to start learning if you want to work with ML. There are many other Python libraries out there for more nuanced needs in the space, as well as other stacks to explore (R, Java, Julia). If you are trying to figure out where to start, here are my recommendations:

  • Scikit-Learn = machine learning algorithms
  • Pandas = dataframe tool
  • NumPy = matrix manipulation tool
  • SciPy = stats models
  • Matplotlib = visualization


In order to work with ML algorithms and problems, it’s important to build out your skill set regarding the following:

  • Algorithms
  • Statistics (probability, inferential, descriptive)
  • Linear Algebra (vectors & matrices)
  • Data Analysis (intuition)
  • SQL, Python, R, Java, Scala (programming)
  • Databases  & APIs (get data)


And of course, the next question is where do I go from here? Below is a beginning list of resources to get you started. I highly recommend Andrew Ng’s class and a couple of links are to sites with more recommendations on what to checkout next:

  • Andrew Ng’s Machine Learning on Coursera
  • Khan Academy (linear algebra and stats)
  • Metacademy
  • Open Source Data Science Masters
  • StackOverflow, Data Tau, Kaggle
  • Machine Learning: A Love Story
  • Collective Intelligence – Toby Segaran
  • Pattern Recognition & Machine Learning – Christopher Bishop
  • Think Stats – Allen Downey
  • Tom Mitchell
  • Mentors

One point to note from this list, and I stressed this in the talk: seek out mentors. They are out there and willing to help. You have to put out there what you want to learn and then be aware when someone offers to help. Also, follow up. Don’t stalk the person, but reach out to see if they will make a plan to meet you. They may only have an hour, or they may give you more time than you expect. Just ask, and if you don’t get a good response or have a hard time understanding what they share, don’t stop there. Keep seeking out mentors. They are an invaluable resource to get you much farther faster.

Last Point to Note

ML is not the solution for everything and many times can be overkill. You have to look at the problem you are working on to determine what makes the most sense for your solution and how much data you have available. Plus, I highly recommend looking for the simple solution first before reaching for something more complex and time-consuming. Sometimes regex is the right answer, and there is nothing wrong with that. As mentioned, to figure out an approach it’s good to understand the problem, the data, the amount of data you have, and the time you have to turn the solution around.

Good luck in your ML pursuit.


These are the main references I used in putting together my talk and post.

  • Zipfian
  • “Analyzing the Analyzers” – Harlan Harris, Sean Murphy, Marck Vaisman
  • “Doing Data Science”  – Rachel Schutt & Cathy O’Neil
  • “Collective Intelligence” – Toby Segaran
  • “Some Useful Machine Learning Libraries” (blog)
  • University GPA Linear Regression Example
  • Scikit-Learn (esp. linear regression)
  • Mozy Blog
  • StackOverflow
  • Wikipedia

Jeeves is Talking!

The coolest moment this week was when I figured out that I just needed to add one line of code to my program to get my computer to talk in my Jeeves project (thus the video above).

This past week was all about fixing stuff, fine-tuning, iterating, and putting together a presentation and peripheral stuff for demo day, which will be Thursday. Of course there is always more I could and would do, but I’ve been continually shifting priorities based on time and the end goal.


In case you haven’t seen the previous posts, I’m building an email classification tool. It’s a binary classifier focused on determining if an email I receive is about a meeting that needs a location picked/defined/identified (whatever word makes it clear). If the email classifies as true, then a text is sent to my phone. I do have a working model in place, and the video above shows my program run through the full process with the computer telling me the results (a little lagniappe I added in addition to the text output).

Classification Accuracy:

In last week’s post, I mentioned how the classification model (logistic regression) used in my product pipeline was not performing well even though I was getting a score of ~85% accuracy. All the classification models I tested had given ~80-90% accuracy scores, and I had said how the models were flawed because of the representation of the data. My data has ~15% true cases in it, so, as mentioned, if a model classified all emails as false then it would be right ~85% of the time.

What I need to clarify is that the ROC curve I was using also provides a type of accuracy metric, but the equation accounts for skewed class distribution (think of it as adjusting the 15/85 split to 50/50). So if the area under my ROC curve is greater than 50%, it’s showing that the classifier is getting some true cases correctly classified, and my ROC curve had been around 70-80% on most of the models the week before.
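To make that concrete, here is a small sketch with a synthetic ~15/85 split and a do-nothing classifier: plain accuracy looks great, while the area under the ROC curve sits at the chance level of 0.5 (synthetic data, not my email set):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.RandomState(0)
y_true = (rng.rand(1000) < 0.15).astype(int)  # ~15% true cases, like the email data

# A "classifier" that always predicts false still scores high on accuracy...
always_false = np.zeros(1000)
acc = accuracy_score(y_true, always_false)    # roughly 0.85

# ...but its ROC AUC is 0.5, no better than chance.
auc = roc_auc_score(y_true, always_false)
```

This is why the AUC numbers were the signal worth trusting on my skewed dataset.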

So I did some investigation into how the logistic regression model in my product pipeline was performing and found that I had hooked it up incorrectly. When I take in a new email message, I have to put it into a list format before splitting it up into features. The way I was passing the message, each word in one email message was being treated as a single email. I figured this out when I printed out the feature set shape and the length of the original message. I just needed brackets around the email message to make the program see it as a list object. Sometimes it’s just that small of a fix. Now my classification model works great, and it sends me texts on new emails that should be labeled as true.
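The bug boils down to what scikit-learn’s vectorizers treat as a “document”: they iterate over whatever you pass in, so each element becomes its own document. A minimal illustration (hypothetical emails, not my real data):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["where should we meet for lunch", "status report attached"])

message = "want to meet tomorrow"

# Bug: passing the words individually makes the vectorizer treat
# each element of the iterable as a separate email.
buggy = vectorizer.transform(message.split())  # shape (4, n_features): four "emails"!

# Fix: wrap the message in a list so it counts as one document.
fixed = vectorizer.transform([message])        # shape (1, n_features)
```

Printing the `.shape` of the transformed feature set is exactly how the mismatch shows up.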


This week, I’ve improved my model by expanding how I build features: using tf-idf, lemmatizing, n-grams, normalizing, and a few other ways to clean and consolidate the features (e.g. words).

Tf-idf is a way to give high weights to words based on how frequently they show up in a document, but to decrease the weight if the word shows up frequently throughout all the documents (the corpus) used in the analysis. So it helps reduce the value of my name as a predictor, since my name shows up throughout the corpus and should not be weighted strongly.

Lemmatization helps group different inflected forms of a word together to be analyzed as a single item (e.g. walk and walking). Using n-grams helps create groupings of words into bi-grams, tri-grams, etc. This means that in addition to having single word features, I’m also accounting for groups of words that could be good predictors for a true case. For example, ‘where should we meet’ is a combination of words that can be a very strong predictor for the true case, possibly stronger than the single word meet. In some ways, n-grams allow for context.

There are some other techniques I used to build out my features but those mentioned above give a sense of the approach. After those changes, my ROC curve now shows ~80-90% on most classification models that I’m comparing.

There are more things I want to do with my feature development, but they are lower priority right now given such good performance results and with career day so close.

Code Stuff:

I spent a good chunk of time cleaning up and streamlining my code. I was trying to set it up to easily run the model comparison whenever I made feature changes. I also needed to make sure I consistently split up the data used for cross validation in my model comparisons. Cross validation is a way to use part of the data to build the model while saving a set of data to test and validate the performance. So I got my code in a good enough state where it’s easy to re-run, expand, and ensure that there is some validity to the scores it’s producing. Plus, it helps to make my code cleaner so I can understand it when I go back to add things in.
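Consistent splitting is the kind of thing scikit-learn’s cross_val_score handles for you: it applies the same folds to whatever model you pass through it. A sketch with stand-in data (module paths per current scikit-learn; this is not the project’s exact code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in feature matrix with a similar ~15/85 class split.
X, y = make_classification(n_samples=200, weights=[0.85], random_state=0)

# cv=5 means the same five train/test folds for every model compared,
# and scoring="roc_auc" keeps the skew-aware metric from above.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
```

Running this per model and comparing the score arrays is one way to make the comparison repeatable after each feature change.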

And if you want to check out the code for my project, you can find it in my Code_Name_Jeeves GitHub repository.

Next Steps:

Depending on time, I definitely have other feature ideas, such as adding in a binary analysis of whether a date is referenced in the email message. I’d also like to run another grid search on the data pipeline to help with fine tuning parameters. More importantly, adding more data to my training set would be a great value add, and using a different dataset to test my product would help with validating performance. Of course, if there were more time than a couple of days, it would be great to build this out so my computer gives me location recommendations, but that one will have to be another time.

Last Note / Official End Result:

If you noticed a buzzing sound at the end of the video, that is my phone receiving the text message that it should get (see below).


Zipfian Project Week 1 & Closing the Loop

One week down for our final projects and one week left to go. This has definitely been one of the most fun and rewarding weeks so far because I’ve been connecting the dots in more ways than one.

Everyone talks about getting your minimum viable product (MVP) done when we do final projects like this. Thankfully I had enough experience with Hackbright to know what that meant as well as how to approach it. I got my full product pipeline built between last Saturday and yesterday. I’ve tested that it works (well, pseudo works), and it feels really good to see the emails get pulled in and analyzed, and if the condition is true, to get a text on my phone. Now I really need to fix my actual classifier because it is just not that accurate.

I named my project Code Name Jeeves in case you are looking for it on GitHub. I was inspired by a lot of things in wanting to make my computer smarter, but the Iron Man movies with the computer JARVIS gave me one of those moments where I distinctly remember thinking, “why can’t I just talk to my computer yet” (and not in a Siri way). Basically, why is my computer not doing more things for me than it does right now? Thus the name, and really I didn’t want to spend a ton of time thinking of a name.

So this past week was pretty productive:

  • Evaluated and applied email packages to pull gmail data
  • Setup Postgres data storage
  • Labeled emails that would classify as true
  • Applied vectorizer to generate a feature set (e.g. bag of words)
  • Tested several classifier models
  • Reworked the code to close the loop on my full product

Data / Emails

I spent the first two days experimenting with a couple of packages to pull gmail (most built off of IMAP), and I ended up picking the package by Charlie Guo. It’s simple enough to understand and apply quickly, but has enough functionality to let me do some data manipulation when pulling the emails.

I made a quick decision during that time to go with Postgres for primary data storage. I’m a fan of building with “going live” in mind just in case, and since I know Heroku, I knew Postgres was a top option for persisting data. The reality is that it’s a bit supercharged for the size of data I’m storing right now, but the setup is there and available as I need it. Plus, it was good practice writing SQL to set up tables, store, and access data.

I also spent some time going through my emails and manually labeling them. The email package I used made it so I could label emails in Gmail that should classify as true and then store that labeling in my database. Basically, I added a column in my database that holds True in the cell of an email row if the email had the label I gave it in Gmail.

I had to manually build out my training dataset because I need it to help build my classification model. This manual labeling is a bit of a hindrance because of course it takes time to get done, and we have limited time for this effort. I could crowdsource this, but I don’t want to give my emails out. I could use the Enron dataset and crowdsource getting those emails labeled, but that feels a bit overkill for these two weeks, and I really don’t want to use Enron’s data. I knew this would be an issue, and I’m working to adjust for it where I can (esp. in my features and classifier model).


After getting the data, I spent a day analyzing what I had and building out a feature set. For natural language processing, a simple feature set can be just counting up word occurrences in the training dataset. This can be expanded further but I opted to keep it simple to start so I could continue to get my pipeline built out.

So to help further explain features, think of them as variables that help predict the unknown. If this were a linear model like y = mx + b, the features are the x variables that help define what y will look like, and the classifier defines the coefficients for the model, which in this case would be m and b.


I spent the last couple of days of the week exploring as many classifiers as possible in the scikit-learn package. Ones I tried:

  • Logistic Regression
  • Naive Bayes (Gaussian, Multinomial, Bernoulli)
  • SVC
  • Random Forest
  • Ada Boost
  • Gradient Boost
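Running those untuned models looks roughly like this (stand-in data with a similar class imbalance; for brevity this scores on the training data, which a real comparison should not do):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Stand-in data, ~80% false like the email training set.
X, y = make_classification(n_samples=300, weights=[0.8], random_state=1)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gaussian_nb": GaussianNB(),
    "svc": SVC(),
    "random_forest": RandomForestClassifier(random_state=1),
    "ada_boost": AdaBoostClassifier(),
    "gradient_boost": GradientBoostingClassifier(),
}

# Fit each standard model and collect an accuracy score per name.
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
```

Because scikit-learn gives every classifier the same fit/score interface, swapping models in and out of the comparison is cheap.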

Initially I just ran the standard models without any tuning. To run them, I pass in my training set of X features and y labels (my manual labeling of whether each email should classify as true or false).

Gaussian Naive Bayes, Random Forest, and Gradient Boost all did pretty well, with accuracy scores from 80-90% and area under the curve (AUC) on a ROC plot of 70-80%. The reality was that they were doing great at classifying my emails as false (not meeting my condition), because in my training dataset about 80% of the data was false. So if a model classified everything as false, it was 80% correct.

One thing my instructors helped me appreciate is that when I’m classifying the emails, I would prefer to get an email that should be false but is classified as true (a false positive in the confusion matrix) versus missing an email that was classified as false but should be true (a false negative). This is similar to the trade-off targeted for spam: classifying the wrong email as spam is worse than getting a little bit of spam in your inbox.

I also worked on applying a grid search for practice, which is an approach to testing a variety of parameters to tune the models and improve accuracy scores. Through tuning, I was able to improve Logistic Regression, Multinomial Naive Bayes, and SVC into the 90% accuracy range.
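A grid search in scikit-learn looks roughly like this; the parameter grid here (logistic regression’s regularization strength C) is an illustrative choice, not the one from the project:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, weights=[0.8], random_state=0)

# GridSearchCV fits the model once per parameter combination,
# cross-validates each, and keeps the best scorer.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```

After the search, `grid.best_estimator_` is the tuned model ready to use in the pipeline.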

As mentioned, certain models like Logistic Regression handle large feature sets (especially for NLP) better than others. Since my actual training dataset is small, Naive Bayes is a good fit for the limited information. I tried the other classifiers to experiment and learn; I hear some of them are rarely used in the real world because they don’t improve scores enough to justify the time and effort they take to use.

Closing the Loop

I spent Friday and Saturday connecting the dots on my project, basically building the code that runs my project from start to text-message finish.

I had a hard time quickly finding something that would stream my gmail through my app, so I decided to just run automatic checks for new emails. When I get new emails, I open the instance of my customized vectorizer (the word counter built with the training set) and apply it to each email to get a feature set.

Then I open my stored classifier instance (also built and tuned using the training set) and pass the new email’s feature set into the classifier. The classifier returns a boolean response, and if the response is true, I craft a message and send a text saying that the specific email needs a meeting location defined.
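That vectorize-then-classify step can be sketched like this, with a toy training set and a hypothetical `needs_location` helper standing in for the project’s actual code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in training set: label 1 = needs a meeting location.
training_emails = [
    "where should we meet for lunch",
    "can we meet somewhere downtown",
    "quarterly report attached",
    "your invoice is ready",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
classifier = LogisticRegression()
classifier.fit(vectorizer.fit_transform(training_emails), labels)

def needs_location(email_text):
    """Return True when the classifier flags the email, i.e. when a text should go out."""
    features = vectorizer.transform([email_text])  # note the list wrapper from the bug fix
    return bool(classifier.predict(features)[0])
```

In the real pipeline the fitted vectorizer and classifier are loaded from storage rather than trained inline, but the transform-predict handoff is the same.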

So now that the pipeline is built and sort of working (I had it send me text messages when an email classifies as false), I need to go back through and improve my feature set and my classifier model, since nothing is classifying as true. There are lots of things I can do, and it should be another good week of learning.

Begin with the End

So we are officially kicking off personal projects for the next two weeks. It’s been a bit of a process for all the students, figuring out what each of us wanted to do and finalizing it, but we got there.

Last Week
It was a review week. We went back over some concepts that will be valuable for projects, as well as for interviews, like classifiers and distributions. We also worked to determine and finalize our projects, and then we spent two full days on final assessments. We had two different data science case studies that required working through the full process, from getting the data to providing some type of recommendation/report. We also worked in teams on sample interview questions to review additional class content. On the last day there was a little bit of mutiny, and we really didn’t get much done on the assessment. Most people were starting to think projects at that point.


I am fascinated by AI. Making my computer smarter, or at least having it make some decisions it doesn’t already handle, is something I’ve been interested in since before this class. I spent the last several weeks thinking through that idea and how to translate it into a two-week project. Thankfully, mentors and instructors helped me scope it into something that seems achievable. What I focused in on is classifying whether an email is a meeting request that needs a location defined.

So if a friend or colleague wants to meet up but no place has been specified, I want my classifier to classify that email as true. This sets the stage for a bigger challenge I’d like to solve, which is to get the computer to figure out some meeting location options and provide them. Still, just doing the email classification seemed a pretty attainable goal in the project timeframe.

The reality is there is a ton more I want to do, and I iterated over a long wish list of ideas, breaking them into smaller concepts and getting lots of feedback. I had help steering me toward tasks that would be realistic in the timeframe we have to work, and when I started defining explicitly what the process would look like from beginning to end, I landed on that one classification step.

Sounds easy and simple, but it will be a challenge because I am still learning and getting comfortable with so many components of this. Plus, it is not as easy as it sounds, especially since I will be working with a sparse dataset to start. So some of my first steps are to get the data (focused on personal email for now) and clean it, as well as to manually identify emails that would classify as true.

Then I will have to work on different approaches for applying natural language processing (NLP) to define features for my model. I am going to start with a standard bag of words to create features, but will try to explore feature engineering where I explicitly define specific word and symbol groupings (a pseudo mix of NLP and regex). For anyone asking, a feature is a specific attribute (like a word or word pairing) that can help identify/classify the email. So I will work with tools and my own personal inspection to find common words and groupings in my emails that would help classify them as true if they meet the condition.

Once I have the features defined, I will work on building the classification model by applying techniques like cross validation and grid search. Logistic regression is a popular algorithm for classification because it’s fast to build and tends to be the best option for extremely large feature sets like those in NLP. So I plan to start there, but I want to explore some of the other models for comparison since this is a great opportunity to practice.

The End State

So my goal is to get my computer to classify new emails and send a text that says, “x email needs a meeting place defined”. With that said, I spent the weekend looking back over an example project I did with Twilio, and I’ve adapted the code into a working function that takes a message as input and sends a text. Thus, the end state is set up, and now I just have to build the rest that leads to generating that text message. No problem.

It definitely helps to know where I’m going.


There were some trickier bits with the setup because Anaconda (a package that loads all the data science software) and virtualenv don’t play well together. I was able to work through the conflicts and will try to put up a post about how to make them work together in the near future. If you need to know sooner rather than later, check out this link for one main step to help resolve the conflicts.


Deep Learning Surface

Deep Learning is a tool in the Machine Learning (ML) toolbelt, which is itself a tool in the AI and Data Science toolbelts. Think of it as an algorithm subset of a larger picture of algorithms, and its area of expertise is solving some of the more complex problems out there, like natural language processing (NLP), computer vision, and automatic speech recognition (ASR). Think of talking to the customer service computer voice on the phone instead of pushing a button.

Why am I writing about this? Because it was the topic of my tech talk at Zipfian this week. I chose Deep Learning because I have an interest in making technology smarter, and I was clued into this area of ML as being more advanced in getting computers to act in a more intelligent and human way.

My research was only able to skim the surface because it is an involved topic that would take some time to study above and beyond what Zipfian is covering. Below is a summary of some key points I covered in my talk and additional insights. Also, the presentation slides are at this link.

Deep Learning in a nutshell:

  • Learning algorithms that model high level abstraction
  • Neural networks with many layers are the main structures
  • Term coined in 2006 when Geoff Hinton demonstrated the impact of deep neural nets


To his credit, Hinton and others have been working on research in this field since the ’80s despite a lack of interest and minimal funding. It’s been a hard road that has finally started to pay off. AI and neural networks in general have actually been explored since the ’50s, but the biggest problems in the space have been computer speed and power. It really wasn’t until the last decade that significant progress and impact have been seen. For example, Google has a project called Brain that can search for specific subjects in videos (like cat images).

I mention Hinton because he’s seen as a central driver of Deep Learning, and many look to him to see what’s next. He also organized the Neural Computation and Adaptive Perception (NCAP) group in 2004, which is invite-only with some of the top researchers and talent in the field. The goal was to help move Deep Learning research forward faster. Many of those NCAP members have been hired in the last few years by some of the top companies out there diving deep into the research. For example:

  • Hinton and Andrew Ng at Google
  • Yann LeCun at Facebook
  • Terrence Sejnowski at the US BRAIN Initiative

It’s a field that technically has been around for a while but is really taking off with what technology is now capable of.


Regarding the structure, neural networks are complex and were originally modeled after the brain. They are highly connected nodes (processing elements) that process inputs based on statistical, adaptive weights. Basically, you pass in some chaotic set of inputs (with lots of noise) and the neural net puts it together as an output. It’s like assembling a puzzle with all the pieces you feed it.

Below is a diagram of a neural net from a presentation Hinton posted.

The overall goal of neural networks is feature engineering. It’s about defining the key attributes/characteristics of the pieces that make up the puzzle you are constructing, and determining how to weight and use them to drive the overall result. For example, a feature could be a flat edge, and you would weight nodes (apply rules) to place those pieces as the boundary of the puzzle. The nodes would have some idea of how to pick up pieces and put them down to create the puzzle.

In order to define weights for the nodes, the neural net model is pre-trained on how to put the puzzle together, and the pre-training is driven by an objective function. Objective functions are a mathematical optimization technique to help select the best element from a set of available alternatives. The function changes depending on the goals of the network. For example, you will have a different set of objectives for automatic speech recognition for an audience in the US versus Australia, so your objectives will take those differences into account to adjust node weights through each training example and improve the output.

A couple of other concepts regarding neural nets and Deep Learning are feedforward structures and backpropagation (backward propagation of errors). A feedforward structure passes input forward through the layers of nodes, where the inputs are treated independently, often with unsupervised learning. Nodes can’t see what each other is holding in regards to pieces and can only use their pre-trained weights to put the pieces in the place they think is best for the output. Restricted Boltzmann Machines and denoising autoencoders are examples of feedforward structures.

Backpropagation applies to multi-layered/stacked structures and is supervised: it tweaks all the weights in the neural network based on the outputs and the defined labels for the data. Backprop can look at the output of nodes at different points in the process of constructing the final picture (seeing how the pieces are starting to fit together). If the picture seems to have errors or pieces not coming together, it can adjust weights in nodes throughout the network to improve results. Gradient descent is an optimization technique regularly used with backprop to make those weight updates. Example networks trained with backprop include Deep Belief Networks and Convolutional Neural Networks (regularly used in image and video processing).
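To make feedforward and backprop concrete, here is a toy network in plain NumPy on the classic XOR problem: the forward pass pushes inputs through the layers, and the backward pass pushes the output error back to adjust every weight. It’s an illustration only, not a practical implementation (no biases, no learning rate tuning):

```python
import numpy as np

rng = np.random.RandomState(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 nodes with randomly initialized weights.
W1 = rng.randn(2, 4)
W2 = rng.randn(4, 1)

for _ in range(5000):
    # Feedforward: inputs flow through the layers to an output.
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)

    # Backpropagation: push the output error back through the layers
    # and nudge every weight in the direction that reduces the error.
    error = y - output
    d_output = error * output * (1 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)
    W2 += hidden.T @ d_output
    W1 += X.T @ d_hidden
```

Real deep learning frameworks do this same forward/backward dance, just over far deeper stacks with smarter optimization.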

Last Thoughts

So I for one would love to see a Data from Next Gen or a Samantha from Her, but neural nets are a far-off step from creating that level of “smart tech”. Plus, as mentioned above, they are one tool in the bigger picture of AI. They are a very cool tool and definitely beat out other algorithms in regards to the complexity of problems they can solve. They are fantastic at classification, prediction, pattern recognition, and optimization, but they are weak in areas like logical inference, integrating abstract knowledge (‘sibling’ or ‘identical to’), and making sense of stories.

On the whole Deep Learning is a fascinating space for the problems it can handle and is continuing to solve. It will be interesting to see what problems it solves next (esp. with such big names putting research dollars behind it). Below are references that I used to put together this overview and there is plenty more material on the web for additional information.


Below are references I used while researching the topic. It’s not an exhaustive list, but it is a good start.

Side Note on Zipfian

On the whole it was another hectic week. In a very short note, we covered graph theory, NetworkX, the k-means algorithm, and clustering overall. There was a lot more detail to all of that, but considering my coverage above, I’m leaving the insight at that for this week.