I’m publishing new content starting in 2017 at nyghtowl.com.
The ideas and methods in neural nets (NNs) have been around for a long time, but in the last decade plus, we are finally starting to reap significant benefits, and this is just the beginning. This post provides an overview of my recent PyCon talk in Montreal, which is a neural net primer of sorts. The video is below, my slides are on SpeakerDeck, and I have a repo on GitHub named Neural Nets for Newbies.
There is too much to cover to fully explain neural nets; thus, the post and the talk provide a framework to start to understand them. If you want to learn more, there are plenty of resources, some listed in my deck, to dive into.
What are they?
Machine learning is a set of algorithms for classification and prediction, and artificial neural nets are part of the machine learning space. At its core, a neural net is an algorithm, which means an equation that helps abstract and find patterns in data. Technically it’s a combination of equations.
The structure is modeled after our brains. Before you get all excited about robots that can think like us (side note: that idea has been around since BC), the reality is that we still don’t fully understand how the human brain functions. Neural nets are only loosely mimicking brain functionality. For the enthusiasts out there, yes, there are many researchers focused on creating a closer biological representation that acts like our brain. The bottom line is we aren’t there yet.
The algorithm, structure and many of the ideas around neural net functionality have been around for a while; several of them date back to the 1950s. Neural nets have been applied for commercial solutions as far back as 1959 (reducing phone line echoes), but we really haven’t seen significant value until recently. Key reasons are that our computational power (computer processing speed and memory capabilities) and access to useful data (amount of stored data) have significantly improved in the last decade alone.
Why should I care?
Because NNs have achieved technical advancement in areas like:
- Natural Language Processing (Search & Sentiment)
- Speech Recognition (Siri)
- Computer Vision & Facial Recognition (Automatic Image Tagging)
- Robotics (Automated Car)
- Recommender Systems (Amazon)
- Ad Placement
Some of you may roll your eyes at these advancements and complain about how Siri is limited in interactions. Need I remind you that we weren’t talking to our machines at the beginning of this century (or at least it wasn’t common). Hell, we didn’t have iPods at the beginning of this century if you remember what they are. I too fall into the sci-fi trap where I’ve seen it or read about it and so when we actually experience real advancements, it seems so boring and behind the times. Yeah, get over that.
All the areas I mentioned above still have plenty of room for growth and there are definitely other areas I haven’t listed, especially in scientific fields. One of the reasons neural nets have had such impressive impact is their way of handling more data, especially data that has layers of complexity. This doesn’t mean that neural nets should be used for all problems. They are overkill in many situations. I cannot stress enough that not every problem is a nail for the NN hammer.
If you have a good problem and want to apply NNs it’s important to understand how they work.
Ok, so how do they work?
Out of the gate, if you want to get serious about applying NNs then you will need to embrace math no matter how much you don’t like it. Below I’ve given you some fundamentals around the math and the structure to get you started.
Our brains are made up of neurons and synapses, and based on our interactions, certain neurons will fire and send signals to other neurons for data processing/interpretation. There is much more complex stuff going on than just that in our brains, but at a high-level that expresses the structure the neural net models.
NNs at a minimum have three layers: input, hidden, output.
- Input = data
  - Data that is broken up into consumable information
  - Data can be pre-processed or raw
  - Bias and noise are applied sometimes
- Hidden = processing units (aka, does math)
  - Made up of neurons
  - A neuron determines if it will be active (math under equation section)
  - Typically there are multiple neurons in a hidden layer (can be thousands or even billions depending on the data used and objective)
- Output = results
  - One node per classification; a net can have just one output node or many
  - A net to classify dogs or cats in a picture has two output nodes, one for each type of classification
  - A net to classify handwritten digits between 0-9 has ten output nodes
You can have more than one hidden layer in a neural net, and when you start adding hidden layers, they trade off as inputs and outputs based on where they are in the structure.
Each neuron represents an equation, and it takes in a set of inputs, multiplies weights, combines the data and then applies an activation function to determine if the neuron is active. A neuron is known as a processing unit because it computes the data to determine its response.
- Inputs = input layer data in numerical format
- Weights = coefficients (also known as theta)
  - Specialize each neuron to handle the problem (dataset) you are working with
  - Can initialize randomly
  - One way to initialize is to create a distribution of the existing data set and randomly sample that distribution
  - Often weights are represented between -1 and 1
- Bias = can be included as an input or used as a threshold to compare data after the activation function is applied
- Activation Function = data transformation to determine if the neuron will send a signal
  - Also known as the transfer function
  - There are many different equations that can be used, and it depends on the problem and data you are working with
  - Example equations: sigmoid/logistic, step/binary threshold, linear, rectified linear (combines binary threshold & linear), …
- Output(s) = each node results in a binary, percentage or number range
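To make the pieces above concrete, here is a minimal sketch of a single neuron in plain Python. The input values, weights and bias are made up for illustration, and the function names are my own rather than from any package.

```python
import math


def sigmoid(z):
    """Sigmoid/logistic: squashes any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def step(z):
    """Step/binary threshold: the neuron fires (1) or does not fire (0)."""
    return 1.0 if z >= 0 else 0.0


def linear(z):
    """Linear: passes the combined input through unchanged."""
    return z


def relu(z):
    """Rectified linear: zero below the threshold, linear above it."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """Multiply inputs by weights, combine them with the bias, then activate."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical values, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

for fn in (sigmoid, step, linear, relu):
    print(fn.__name__, neuron(inputs, weights, bias, fn))
```

The same weighted sum feeds every activation function; only the final transformation changes, which is why swapping the activation is one of the easier knobs to experiment with.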
Each neuron is unique from other neurons in a hidden layer based on the weights applied. They can also be unique in the inputs and outputs. There are many hyperparameters that you can tweak for one single neuron let alone the whole structure to improve its performance. What makes neural nets powerful is the combination of linear with nonlinear functions in the equation.
When applying a neural net, an effort is needed to optimize the model, so it produces the results you are targeting.
The breakthroughs in neural nets are largely in the area of supervised learning. Supervised learning means you have a dataset labeled with the results you expect. The data is used to train the model so you can make sure it functions as needed. Cross validation is a technique typically used in supervised learning where you split the dataset into a training set to build the model and a test set for validation. Note, there are areas in neural net research that explore unlabeled data, but that is too much to cover in this post.
In order to optimize, you start out with a structure and probably randomized weights on each neuron in the hidden layer(s). You’ll run your labeled data through the structure and come out with results at the end. Then you compare those results to real labels using a loss function to help define the error value. The loss function will transform the comparison, so it becomes a type of compass when going back to optimize the weights on each neuron.
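As a rough sketch of the splitting idea, the snippet below shuffles a labeled dataset and holds out a portion for testing. It shows only a single holdout split, the simplest version of the technique, and the function name, the 20% test fraction and the toy data are all assumptions for illustration.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle labeled examples and split them into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled data: (features, label) pairs.
data = [([i, i % 3], i % 2) for i in range(10)]
train, test = train_test_split(data)
print(len(train), "training examples,", len(test), "test examples")
```

Shuffling before splitting matters when the data is stored in some meaningful order, since otherwise the test set could end up with only one kind of label.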
The optimization method (aka back propagation or backprop) is a way of taking the derivative of the loss function and applying it to the weights throughout the model. This method can change all weights on every neuron and because of the way the method works, it does not change the weights equally. You want shifts that vary across weights because each neuron is unique.
- Error = difference between the NN results and the real labels
- Loss Function = calculates the error (also referred to as cost function)
  - There are many different equations that are used, and it depends on the problem and data you are working with
  - Example equations: mean squared error, negative log likelihood, cross entropy, hinge, …
- Regularization = penalty (or noise) applied in the loss function to prevent overfitting
- Optimization Method = learning method to tune weights
  - There are many different equations that are used, and it depends on the problem and data you are working with
  - Example equations: stochastic gradient descent, Adagrad (J. Duchi), Adadelta (M. Zeiler), RMSprop (T. Tieleman), …
- Learning Rate = size of the change applied to the weights on each update; it is sometimes built into the optimization algorithm
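Two of the loss functions named above can be sketched in a few lines. The predictions and labels below are invented for illustration, and the cross entropy shown is the binary (two-class) form.

```python
import math


def mean_squared_error(predictions, labels):
    """Average of the squared differences between predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


def binary_cross_entropy(predictions, labels, eps=1e-12):
    """Average negative log likelihood of the labels under the predictions."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Hypothetical labels and two sets of predicted probabilities.
labels = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]
bad = [0.4, 0.6, 0.3, 0.2]

print("MSE:", mean_squared_error(good, labels), mean_squared_error(bad, labels))
print("Cross entropy:", binary_cross_entropy(good, labels), binary_cross_entropy(bad, labels))
```

Both functions return a smaller number for the predictions that sit closer to the labels, which is what lets the loss act as a compass during optimization.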
Backprop in essence wiggles (to quote Karpathy) the weights a little each time you run the data through the model during training. You keep running the data through and adjusting the weights until the error stops changing. Hopefully it’s as low as you need it to be for the problem. And if it’s not, you may want to investigate other model structure modifications.
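Here is a minimal sketch of that loop for the smallest possible case: one sigmoid neuron trained with stochastic gradient descent and a cross entropy loss. A real net repeats the same derivative step across every layer. The toy dataset (a logical OR), the learning rate, the stopping tolerance and the function names are all assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(data, learning_rate=0.5, tolerance=1e-6, max_epochs=10000, seed=0):
    """Adjust the weights a little per example until the error stops changing."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in data[0][0]]  # random start in [-1, 1]
    bias = 0.0
    previous_error = float("inf")
    error = previous_error
    for _ in range(max_epochs):
        error = 0.0
        for x, y in data:
            p = predict(weights, bias, x)
            error += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            # For a sigmoid neuron with cross entropy loss, the derivative of
            # the loss with respect to weight i works out to (p - y) * x_i.
            gradient = p - y
            weights = [w - learning_rate * gradient * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * gradient
        error /= len(data)
        if abs(previous_error - error) < tolerance:
            break
        previous_error = error
    return weights, bias, error


# Hypothetical labeled data: the logical OR of two inputs.
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias, error = train(or_data)
print("final error:", error)
for x, y in or_data:
    print(x, "label", y, "prediction", round(predict(weights, bias, x), 3))
```

Each weight moves by a different amount because its update is scaled by its own input, which is the unequal shifting described above.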
Note reducing the error rate is a common model objective but not always the objective. For the sake of simplicity, that’s our focus right now.
Validation / Testing
Once you’ve stopped training your model, you can run the test data set through it to see how it performs. If the error rate is horrible, then you may have overfit, or there could be a number of other issues to consider. Error rate and other standard validation approaches can be used to check how your model is performing.
I’ve given you a basic structure for how the neural net connects, but it’s important to understand there are variations in that structure that are better for different types of problems. Example types include:
- Feed Forward (FFN) = basic structure and passes data forward through the structure in the order of connections
  - There are no loops
  - Data moves in one direction
  - Key Applications: financial prediction, image compression, medical diagnosis and protein structure prediction
- Recurrent (RNN) = depending on the timing the neuron fires, data can be looped back earlier in the net structure as inputs
  - Data can become input to the same neuron, other neurons in that layer or neurons in a hidden layer prior to that layer
  - Operates on linear progression of time
  - Good for supervised learning in discrete time settings
  - Key Applications: sentiment analysis, speech recognition, NLP
- Convolutional (CNN) = uses a mixture of hidden layer types (e.g. pooling, convolutional, etc.)
  - Best structure for scaling
  - Inspired by biological processes and variant of multilayer perceptrons
  - Key Applications: computer vision, image & video recognition
- Other types to check out:
  - Recursive (RNN) = related to Recurrent but based on structure vs time
  - Restricted Boltzmann Machine (RBM) = 1st neural net to demonstrate learning of latent / hidden variables
  - Autoencoder (Auto) = RBM variant
  - Denoising Autoencoder (DAE)
  - Deep Belief Networks (DBN)
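To show what "looped back" means for a recurrent net, here is a minimal sketch of one recurrent neuron whose previous output becomes part of its next input. The weights are arbitrary and the function names are hypothetical. A real RNN would learn these values and use many neurons.

```python
import math


def recurrent_step(x, h_prev, w_in, w_rec, bias):
    """Combine the new input with the neuron's own previous output."""
    return math.tanh(w_in * x + w_rec * h_prev + bias)


def run_sequence(sequence, w_in=0.8, w_rec=0.5, bias=0.0):
    """Feed a sequence through the neuron one time step at a time."""
    h = 0.0
    states = []
    for x in sequence:
        h = recurrent_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states


# The same inputs in a different order leave the neuron in a different state.
print(run_sequence([1.0, 0.0, 0.0]))
print(run_sequence([0.0, 0.0, 1.0]))
```

Setting `w_rec` to zero removes the loop, at which point each output depends only on the current input, as it would in a feed forward net.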
Neural nets can get complex in the structure and combined equations. It can be tricky and time-consuming to develop a useful model and confusing on where to start. Due to extensive research, there are already pre-baked templates for certain types of problems that you can adapt and avoid starting from scratch.
There are a couple other points to note about neural nets to point you in the right direction when developing and deploying.
In order to run a neural net to solve problems like those mentioned above, it’s important to understand certain systems engineering concepts.
The main one to spend time on is graphics processing units (GPUs). These chips are playing a key role in reducing the time (latency) it takes to develop NNs. You want every advantage you can get in reducing the time it takes to build a neural net.
GPUs are highly optimized for computation compared to CPUs, which is why they are popular in gaming and research. Granted, there are advances going on in CPUs that some argue are making them function more like GPUs. At the heart of this, just spend some time learning about GPUs and try running an NN on one.
I listed a few other topics in my talk that you should research further to go above and beyond single server computation of a neural net.
- Distributed Computing
- High-Performance Computing
Note if you go down the distributed path you are starting to get into sharing the data across nodes or splitting the model, which can be extremely tricky. Try sticking to a single server for as long as possible because you can’t beat that latency and with where technology is, you should be able to do a lot with one computer especially when starting out. Only go down the distributed path when the data and problem are complex enough it can’t be contained on one server.
There are many Python packages you can use to get started with building neural nets and some that will automate most of the process for you to get you off the ground faster. Below is a list of ones I’ve come across so far.
- Machine Learning Packages
- Packages based in C with Python Bindings
- GUI with Python API
I highly recommend that you spend time exploring Theano because it’s well documented, will give you the best exposure and control of the math and structure and it’s regularly applied to solve real world problems. Many of the machine learning packages are built off of it. The machine learning packages vary in terms of how easy they are to use, and some have easy integration with GPUs.
MNIST Code Example
For the example in the talk, I used the MNIST (Modified National Institute of Standards and Technology) dataset, which is the “hello world” of neural nets. It’s handwritten digit analysis of grayscale pictures (28 x 28 pixels).
- Structure can be as simple as 784 inputs, 1000 hidden units, 10 outputs with at least 794K connections
- Based on Yann LeCun’s work at AT&T with LeNet in the 1990s
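The 794K figure can be checked with a little arithmetic, assuming every node in one layer connects to every node in the next (a fully connected structure, which is my assumption here):

```python
# Layer sizes from the structure above: 28 x 28 = 784 input pixels,
# 1000 hidden units and 10 output nodes (one per digit).
layer_sizes = [28 * 28, 1000, 10]

# Fully connected: every node connects to every node in the next layer.
connections = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
print(connections)  # 794000

# If each hidden and output node also carries a bias, add one value per node.
biases = sum(layer_sizes[1:])
print(connections + biases)  # 795010
```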
For reference, I’ve pulled MNIST examples for some of the Python packages into a GitHub repository as mentioned above, and you can also find it here: github.com/nyghtowl/Neural_Net_Newbies.
What’s next for NN?
Neural nets will continue to play a significant role in advancements in all the areas I’ve mentioned, especially natural language processing and computer vision. The real key value of neural nets is in automatic feature engineering, and we will continue to see neural nets applied to help identify features, especially as richer datasets for certain problems are captured.
Additionally, combining neural net structures, as well as combining other machine learning models with NNs, will help drive these advancements. Some great research came out last fall around combining CNNs with RNNs to apply sentence-long descriptions to images.
Where a number of experts see the long-term value is the potential impact with unlabeled data: finding patterns in data that we have no knowledge of, or in data we’ve labeled with our own set of biases. These types of patterns will drive advancements that may very well be akin to what we read in sci-fi as well as stuff we really haven’t thought of yet.
Reality is NNs are algorithms with the most potential to really create greater intelligence in our machines. Having technology that can reason and come up with new ideas is very possible when NNs are factored in.
If you want to get serious about researching neural nets, spend time studying linear algebra (matrix math), calculus (derivatives), existing neural net research and systems engineering (esp. GPUs and distributed systems). The slides I posted have a number of references and there are many other resources online. There are many great talks coming out post conferences that can help you tap into the latest progress. Most importantly, code and practice applying neural nets. Best way to learn is by doing.
Last fall, a couple of my colleagues (Kristiane Skiolmen, Scott Lau) and I presented Change’s machine learning email optimization approach as a lecture in Stanford’s Human Computer Interaction Seminar for CS grad students.
The video gives an overview of how Change.org uses email to drive petition engagement from the business and social perspective to the specific technical optimization we made. It starts with an overview of Change and examples of petitions that have literally improved and saved lives.
As of the date of the video, here are some stats we presented:
- 77M total users globally
- 1.2M users visiting the site daily
- 450M signatures total
- 10K declared victories in over 120 countries
Our most successful source of engaging users to sign petitions is email. It’s not an ideal channel and we know that and want to change that. Still since it drives the most response at this time we did take steps to optimize that channel with machine learning. I’m sharing this video and a little about the project so you can see a real world application of machine learning. Below are a couple summary points from the video.
We have an email team that specializes in helping put petitions in front of users who would connect with them. We have groups that define certain petitions to showcase every week through email and the email team was using a cause (topic) filtering model to determine what petitions to send users. It was a manual process of tagging petitions to causes and comparing them to our user base that had been grouped by causes based on petitions they signed.
There are a lot of limitations with this approach from scaling for data size as well as adapting culturally and internationally. Also, the challenge with the manual approach is that some causes had much smaller audiences and lower rates of responses; thus, certain petitions were doomed to fall short of signatures because their cause had a smaller audience.
Our data team built a model to help improve email targeting. Basically, we identified over 500 features (e.g. # petitions signed in the past, etc.) that were predictive of signatures and we tried out a couple classification algorithms to come up with a predictive model to use. The accuracy scores were pretty close on the models we investigated. So we went with a random forest algorithm because we didn’t need to binarize our data, our data is unbalanced (which random forest handles well) and it was the most transparent in feature detection if we wanted to dig into the results.
How it works is each time the email team gets a set of petitions to showcase, they send emails to a sample set of users. Based on the signature response to one petition, a random forest model is developed and then all users are run through the model to predict her/his signature response to that one petition. A random forest model is built per petition the email team showcases that week and we run signature predictions on all users for each of the showcased petitions. Each random forest model produces a probability of signature response per user and then our program sorts the probabilities and identifies the petition with the highest success rate for each user (filtering out ones the user has already received in email). The email team gets back a list of users per petition to send their showcased petitions to for that week.
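The final sort-and-select step described above can be sketched in plain Python. This is not Change.org’s actual code: the function name, the data layout and the probabilities are hypothetical, and the per-petition probabilities are assumed to have already been produced by the random forest models.

```python
def assign_petitions(probabilities, already_sent):
    """Pick the highest-probability petition for each user.

    probabilities: {petition_id: {user_id: predicted signature probability}}
    already_sent:  {user_id: set of petition_ids the user already received}
    Returns {petition_id: [user_ids to email]}.
    """
    best = {}  # user_id -> (petition_id, probability)
    for petition, user_probs in probabilities.items():
        for user, prob in user_probs.items():
            if petition in already_sent.get(user, set()):
                continue  # filter out petitions the user has already received
            if user not in best or prob > best[user][1]:
                best[user] = (petition, prob)

    mailing_lists = {petition: [] for petition in probabilities}
    for user, (petition, _) in sorted(best.items()):
        mailing_lists[petition].append(user)
    return mailing_lists


# Hypothetical model output for two showcased petitions and three users.
probabilities = {
    "petition_a": {"u1": 0.10, "u2": 0.40, "u3": 0.05},
    "petition_b": {"u1": 0.30, "u2": 0.20, "u3": 0.02},
}
already_sent = {"u2": {"petition_a"}}

print(assign_petitions(probabilities, already_sent))
```

In this toy run, user `u2` would have been the best match for `petition_a` but falls back to `petition_b` because of the already-received filter.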
In the video, I go into more detail around how a random forest works as well as the way it was implemented. Also, Scott provides an overview of how we used Amazon Web Services to implement this data product.
Note there are other ways to approach this problem, but for what we needed, this solution has increased our sign to send rate by 30% which is substantial. On one petition, for example, we would have had 4% signature response out of a pool of 2M people to email, but our new approach with machine learning enabled us to target 5M users with a 16% signature response rate.
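To put the single-petition example in perspective, here is the arithmetic on the implied number of signatures. It assumes "signature response" means the share of emailed users who sign, which is my reading of the figures above.

```python
# Figures quoted above, with rates kept as whole percentages.
old_pool, old_rate_pct = 2_000_000, 4
new_pool, new_rate_pct = 5_000_000, 16

old_signatures = old_pool * old_rate_pct // 100
new_signatures = new_pool * new_rate_pct // 100

print(old_signatures)  # 80000
print(new_signatures)  # 800000
print(new_signatures // old_signatures)  # 10
```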
As mentioned, I don’t see email as the best communication source and even though we can and will improve on our current solution, we are working to incorporate more effective means of engagement.
For those out there working with Dato (GraphLab) and trying to set up an ODBC connection to pull all the data straight into an SFrame, here are some tips I’ve learned from troubleshooting.
What is ODBC?
Open Database Connectivity which is a middleware API to help standardize and simplify access to database management systems.
There are a number of links on odbc setup but it was a little tricky to get it to work with Graphlab, Linux and OSX and Graphlab’s documentation is a little sparse in that area right now.
This is one of the links I found that was helpful for setting up on a Linux machine. The following are the steps I used
- wget http://yum.postgresql.org/[version #]/redhat/rhel-[version #]/pgdg-centos-[OS type & #].noarch.rpm
  - Use the package version link from http://yum.postgresql.org/ in the wget command above to pull the rpm file that you need. Note, you are setting up the postgres yum repository on your computer so you can yum install the postgres odbc packages afterwards
- rpm -ivh ./pgdg-[OS type & #].noarch.rpm
- yum install postgresql[version #]-odbc.[version #]
- yum install postgresql[version #]-odbc-debuginfo.[version #]
- yum install unixODBC
In the yum install portion, you can combine and separate with spaces each package on one line. You may need to sudo install depending on the role you are logged into the system as and the available permissions. Best practice is to avoid using sudo.
Now that you have the packages installed, update the odbcinst.ini file, which should be in the /etc/ directory. Sample file contents include:
Description = ODBC for PostgreSQL
Driver = /usr/pgsql-[version #]/lib/psqlodbc.so
Setup = /usr/lib64/libodbcpsqlS.so
Driver64 = /usr/pgsql-[version #]/lib/psqlodbcw.so
Setup64 = /usr/lib64/libodbcpsqlS.so.2.0.0
Database = [database name]
Server = [address for server which if redshift it will look like: ?……redshift.amazonaws.com]
Port = [port for your setup something like 5432 or 5439]
FileUsage = 1
Settings above can vary. Definitely read up on options and how it relates to your connection setup.
Setting up on OSX was a little trickier because the documentation wasn’t as clear. I ended up using the Homebrew package manager and the following steps worked.
- brew update
- brew install unixodbc
- brew install psqlodbc
Next, set up odbc.ini, which should be under the /usr/local/Cellar/unixodbc/[version #]/etc/ directory. Sample file contents include:
Description = ODBC for PostgreSQL
Driver = PostgreSQL
Database = [database name]
Server = [address for server which if redshift it will look like: ?……redshift.amazonaws.com]
Port = [port for your setup]
Protocol = [protocol for your setup]
Debug = 1
Then set up odbcinst.ini, which should also be under the /usr/local/Cellar/unixodbc/[version #]/etc/ directory. Sample file contents include:
Description = PostgreSQL ODBC driver
Driver = /usr/local/Cellar/psqlodbc/[version #]/lib/psqlodbcw.so
Setup = /usr/local/Cellar/unixodbc/[version #]/lib/libodbc.2.dylib
Debug = 0
CommLog = 1
UsageCount = 1
The sticky part of getting the GraphLab ODBC connection to work was that I needed environment variables to point to the odbc config files. Thankfully I got this idea from this Stack Overflow post. So in the .bash_profile (which should be in your home directory; use ~/ to get there) add the following:
export ODBCINI=/usr/local/Cellar/unixodbc/[version #]/etc/odbc.ini
export ODBCSYSINI=/usr/local/Cellar/unixodbc/[version #]/etc/
As with Linux, the setup will vary based on your configuration needs. If at first you don’t succeed, keep researching how to adjust.
At this point you can go into a Python or IPython kernel and try:
- import graphlab
- graphlab.connect_odbc("Driver=PostgreSQL;Server=[server address like above];Database=[database name];UID=[username];PWD=[password]")
For some reason, even though the parameters in the connection string are defined in the odbcinst.ini config files, GraphLab complains that the string is missing data without them. Specifically, you need to include Driver, Server, Database, UID and PWD. It’s good security practice to pass in your password at least as a variable that comes from a config file and/or the environment.
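One way to keep the password out of your code is to assemble the connection string from environment variables. This is only a sketch: the helper function and the environment variable names (`ODBC_SERVER`, `ODBC_DATABASE`, `ODBC_USER`, `ODBC_PASSWORD`) are hypothetical, so swap in whatever your setup uses.

```python
import os


def build_connection_string(server, database, user, password, driver="PostgreSQL"):
    """Join the five required fields into one ODBC connection string."""
    fields = [
        ("Driver", driver),
        ("Server", server),
        ("Database", database),
        ("UID", user),
        ("PWD", password),
    ]
    return ";".join("{}={}".format(key, value) for key, value in fields)


# Hypothetical environment variable names; the defaults are placeholders.
conn_str = build_connection_string(
    server=os.environ.get("ODBC_SERVER", "[server address]"),
    database=os.environ.get("ODBC_DATABASE", "[database name]"),
    user=os.environ.get("ODBC_USER", "[username]"),
    password=os.environ.get("ODBC_PASSWORD", ""),
)

# With GraphLab installed, the string is then passed to the call shown above:
# import graphlab
# graphlab.connect_odbc(conn_str)
```

Avoid printing or logging `conn_str`, since it contains the password in plain text.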
Once the odbc connection worked, it made the data product run so much more effectively. I’m able to pull the data directly into the package that will build the model and stripped out an extra step that previously existed to query the data into a middle storage before loading it to the package that would train the model. There are other tools out there like that coming into wider use to cut to the chase regarding data processing and machine learning. Spark is one such tool that I’m especially interested in and will try to write about in the future.
For the last three months I have been working at Change.org as a data scientist and engineer. It’s been a great experience so far and I’m blown away that this is where I landed after starting this journey a year plus ago.
I’ve coached others who are moving into engineering on how to believe in themselves and that they are smarter than they think. I totally get that you want to fake the confidence till you get there. Don’t be cocky; just be resolved to figure stuff out.
Still I felt overwhelmed by the impostor syndrome. The fear of the company figuring out I’m a fraud and firing me during that first month was powerful in my mind. It didn’t matter how much I rationally knew better. Thankfully I have a good community of people who have gone through similar experiences with starting jobs that I was able to fall back on for support.
The feeling has subsided with time as I expected, but it does keep me on my toes to be vigilant in my growth in the space and make sure I’m having positive impact on the company.
Change.org has been an amazing experience for my first data science and engineering job. I couldn’t believe it when they put me through almost a month of training and rotation for on-boarding. It helped me get to know the team and get more comfortable working with the group pretty quickly. I’ve heard of some companies doing this for their employees and it shows how much the company is invested in you.
The people are extremely friendly, welcoming and willing to help when I have questions. It’s not a negatively competitive or condescending environment that makes me feel like I have to hide weaknesses. It has allowed me to ask questions no matter how stupid I think they are and to grow so much faster as well as deliver so much faster.
They have also made plenty of time for me to grow even though I just started. In addition to the near month rotation, they sent me to the GraphLab conference and gave me time off to take a short Spark class I got into at Stanford. And next month they are giving me time to go to StrangeLoop. In the consulting world, there is no way I would have been able to take time away from work to grow myself being so new to the company. Granted I know the more I learn the better I become as an employee. However, not all companies are able to or willing to make the time for this type of growth.
Also as you can gather from above, the company does not take over my life. The hours are 10 to 6 and people typically stick to that with a few working occasionally outside those hours. We do fun stuff together during and after work but it’s not mandatory and makes room for you to have a life.
We do happy hours every other week and sometimes play board games, especially on Fridays. Earlier in the summer, we would gather around the TV in a big open meeting room and “work” while watching some of the World Cup games. And almost every Friday close to the end of day, we break for what feels a little like an open mic session. Anyone can present on a topic they think will be valuable for the team to learn about. It helps us see what other groups are working on or learn about new tools and methods we may want to use. I’ve presented a couple of times already on GraphLab and provided an overview of data science by leveraging my PyCon presentation.
Basically it’s been a great place to work.
The first week on the job, one of the senior engineers had me ship code. Basically you push up code that will change the live site in some way. This can be a big deal especially for a site that is so comprehensive and beyond just a start-up. So it was pretty cool to do that and not break the site in the process.
During the rotation, I did collaborate on a few bugs, but I was given an assignment to do as time permitted to answer a question using MrJob and Hadoop; thus, the previous post. I knew this from general experience and from talking to my friends who were working. Still I will note that nothing compares to hands on experience. Working on the MrJob project taught me so much about Hadoop, AWS, how to access data at work and just gave me a better understanding of the big data hype.
Lately I’ve been working through implementing a Multi-armed Bayesian Bandit solution. Again, it is teaching me so much as I figure out how to implement it for the specific company. We’ve built out a testing environment and coded the solution in Python initially, but the live code we are implementing it in is Java and is built with Gradle.
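For anyone unfamiliar with the idea, here is a minimal sketch of one common Bayesian bandit approach, Thompson sampling with a Beta distribution per option. This is not the solution we built at work. The class name, the simulated success rates and the number of rounds are all made up for illustration.

```python
import random


class BernoulliBandit:
    """Thompson sampling over arms that either succeed or fail."""

    def __init__(self, n_arms, seed=None):
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms
        self.rng = random.Random(seed)

    def choose_arm(self):
        # Draw a plausible success rate for each arm from its Beta posterior
        # (starting from a uniform Beta(1, 1) prior) and play the best draw.
        samples = [
            self.rng.betavariate(s + 1, f + 1)
            for s, f in zip(self.successes, self.failures)
        ]
        return samples.index(max(samples))

    def update(self, arm, reward):
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


# Hypothetical true success rates that the bandit has to discover.
true_rates = [0.05, 0.10, 0.20]
bandit = BernoulliBandit(len(true_rates), seed=1)
world = random.Random(2)
pulls = [0] * len(true_rates)

for _ in range(5000):
    arm = bandit.choose_arm()
    reward = world.random() < true_rates[arm]
    bandit.update(arm, reward)
    pulls[arm] += 1

print("pulls per arm:", pulls)
```

Early on the draws are spread out, so every arm gets tried. As evidence builds up, the best arm wins most of the draws and receives most of the traffic.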
I asked for the opportunity and was given the time to take a crack at converting the solution into Java before working with one of the engineers who is more versed in the code base. Java is a bitch but it has been a thrill figuring it out for the last few weeks. I understand much more the concepts around functional programming and interactive kernels and so forth. And I did manage to figure out and convert and test the algorithm in Java which did make me feel fantastic about that accomplishment.
I definitely have days where I’m so excited about the work I’m tackling and feel so lucky to be able to do this for a living.
Before I even started, recruiters were contacting me for other jobs. Literally I changed my LinkedIn profile the week before I started at Change.org and at least 3 recruiters contacted me that week. Very flattering but also funny considering I hadn’t even worked yet. I know people who are still looking who are better versed in math and/or programming than I am and having a company officially hire me added this level of credibility at least for recruiters to want to talk with me. I am stressing this to point out there are many qualified people and I think they are worth a look whether they have a full-time position in this space on their resume or not. Frankly, I think drive and determination are more important characteristics to look for.
Also friends were putting me in touch with people getting into the industry to give them advice on how to go about it successfully. Again flattered, but considering I was scared to death of losing the job the first month, I did not feel qualified to give anyone advice.
Additionally, my path was not easy and this journey is far from over. I still have a ton to learn and I have many days at work where I feel like I know nothing. Again thankfully many in my community have shared those experiences with me, and I know this is typical. Hell one of the senior guys at work was saying he has those days still all the time. That’s the best and worst part about this. Everything keeps changing so it can keep you constantly humble but also challenge you a ton to learn.
For those getting into the space (I heard from a number of you this summer), I highly encourage jumping in. What I have been sharing is that a bootcamp may or may not be the right experience for you. There were people I know who did not get a lot out of Hackbright or Zipfian as much as I know people who did. The approach, people, experience or whatever just didn’t work for some.
I can’t tell you to quit your job and take the risk because I don’t know what is best for you. And I don’t know the hiring stats beyond, it is definitely not 100% hiring rate into the field after those programs. I think anyone who survives those programs should get hired because of the rigor and determination required to sustain through them. Still companies are selective and will do the best for themselves by hiring for talent and fit. You probably won’t like all the people in your class, and may even hate a few. The job search process for you could be almost a year after you are done, maybe more. You could find this is not a field you want to get into. All of these things I’ve seen happen because the bootcamp process is not a gold ticket or a promise of success for everyone.
You make that success for yourself. If you decide to do these programs or whatever approach you take to get into the industry, make sure to own it. It is your responsibility and no one else’s to make you successful. You can have expectations for education you pay for to a point. Still the one who is going to be most concerned for what’s best for you is you and that is not going to change no matter where you go or how much you pay to get into something. You really have to make sure that you show up to whatever you try to do, be willing to participate, thankful for the help you receive in whatever form it comes, constantly look inward on what you can do to improve, try not to compare to others and fight to overcome any internal ego issues that may battle against you.
Get clear on why you are getting into data science and/or engineering. Different reasons can determine the best path for you too. I’m here because the challenge and constant learning makes me feel alive and I love it as much as it frustrates me. I seek opportunities that make sense for what I want to get out of the space and I’m constantly re-centering on what I need and want to learn as I get clearer on what the space is about. You don’t have to have those reasons to get into it. Just be honest with yourself on why you want to be in data science and engineering and what you expect from it. And be open and ready for the fact that whatever you expect will not be what you get. It may be very close or very far off.
I want to see more people working in engineering and data science because it’s definitely needed and there are a lot of people who feel like I do. We are willing to help make the path more accessible. Still it is really on you to figure out what is best for you and then fight for it.
Blog Next Steps
I have learned so much in the last three months and there are many times I’ve thought, I should write a post about it. Reality is this site has become a lower priority while getting up to speed on work and getting a lot clearer on where I want to focus my studies. I’ve also tried to get some level of sanity back into my non-work life. I will try to make time again for posts but it’s TBD on frequency. Thanks to all those who have been reading so far and sending me great feedback. Seriously, much appreciated.
Over the last couple weeks, I’ve been playing around with MapReduce, MRJob and AWS to answer some questions about event data. Granted this is definitely more data engineering focused than data science, but using these tools can be very beneficial if you are analyzing a ton of data (esp. event log data).
This is more of an overview with a few lessons learned on how to set up a MapReduce job using MRJob and AWS EMR. This post focuses more on process and less on the script logic.
First, What is MapReduce (MR)?
MR is an algorithm used especially for large amounts of data to easily apply some type of filter and organization of the data and then condense it into a result. It was born from similar concepts used in functional programming.
Map = procedure to filter and sort
Reduce = procedure to condense and summarize
Word count is typically used as the “Hello World” of MapReduce. So think about taking a book like Hitchhiker’s Guide to count the occurrences of all the words. The map step would create key/value pairs (i.e. dictionary or hash format) for every single word in the book. So a word is the key and a value like the number 1 would be applied (e.g. “hitchhiker”: 1, “galaxy”: 1, “guide”: 1, “hitchhiker”: 1). There would be duplicate keys outputted.
The reduce step condenses all duplicate keys like “hitchhiker” to a single, unique key for each word where all the related values are put into a list (e.g. “hitchhiker”: [1,1,1,1,1,1,….1]). The list will contain a 1 for every occurrence of “hitchhiker” in the book. Then reduce can perform a summarization task by literally adding up the numbers or taking the length of the list. The reduce step also outputs a key/value pair for each unique word in the book with the summed value (e.g. “hitchhiker”: 42).
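To make the word count walkthrough concrete, here is a minimal sketch in plain Python (no Hadoop or MRJob involved). The function names are my own for illustration, and the grouping function stands in for the shuffle that a real framework does between map and reduce.

```python
from collections import defaultdict


def map_words(text):
    """Map step: emit a (word, 1) pair for every word, duplicates included."""
    for word in text.lower().split():
        yield word, 1


def group_by_key(pairs):
    """Shuffle step: collect all values emitted for the same key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: condense each list of 1s into a single total per word."""
    return {word: sum(ones) for word, ones in grouped.items()}


text = "hitchhiker guide galaxy hitchhiker"
pairs = list(map_words(text))  # duplicate keys are expected here
counts = reduce_counts(group_by_key(pairs))
print(counts)
```

On a cluster the map and reduce calls would be spread across many machines, but the shape of the data at each step is the same.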
This is a very simplistic example of MR and there are many complex variations based on the problem being solved. For example, MapReduce is cited as a solution to use on something like Twitter follower recommendations. There are a number of online resources that provide more examples and just looking at other examples can help with defining the MR logic. A couple resources I found in my research covered complex patterns and an overview of different types of patterns.
Other Tools/Techniques to Understand:
In order to follow along, these are brief overviews of key approaches and tools. I recommend reading further on each of them.
- Hadoop = An Apache framework that helps distribute, store and process data across many machines (cluster)
- HDFS = Hadoop Distributed File System is the storage solution that is part of the Hadoop framework. It is specifically geared for distributed systems.
- S3 = Simple Storage Service is just an AWS storage system. You cannot open files or run programs in S3
- EC2 = Elastic Compute Cloud instances are virtual Amazon computers for rent. You can create, run and terminate servers as needed which led to the term elastic. It’s like having an additional computer or computers you can configure how you need and run applications on but you don’t have to maintain the hardware or operating system
- EMR = Elastic MapReduce is an Amazon web service that pre-configures EC2 instances with Hadoop. Basically it’s a service to easily spin up a cluster of Hadoop formatted machines without having to acquire and set up the hardware and software yourself
- MRJob = Yelp developed package/library so you can write python MapReduce scripts that will run on Hadoop frameworks
- Screen = GNU software to multiplex terminal sessions / consoles. This means you can run a program encapsulated in its own environment. Or another way to say it is if you kicked off an EMR job from your computer at work inside a screen session, you can detach the session, close your computer down and go home and then log back into the screen session to see that the job has been uninterrupted and is still processing
Hadoop should typically not be your immediate choice when analyzing data. I’ve heard this multiple times from different experts in the field. You really need to think about the type of problems you are solving/questions you are answering, the type, format and size of data and time and money to apply to the problem.
Many times in data science you can take adequate samples of data to answer questions and solve problems without needing Hadoop for processing. So when you take on a challenge and people are immediately saying Hadoop, take time to talk through if that is appropriate. Make your life easier by going for simpler solutions first before bringing out the big guns.
Overview of My Experience / Lessons Learned:
For the problem I worked on, I was trying to answer a number of questions around activity using event logs. Event data is prolific and can be a good case for using MR.
The event logs were JSON formatted, 1 event per line in several files that were gzipped and stored on S3. The good thing about MRJob is that it has built in protocols to handle unzipping and processing files. There was only one adjustment I made to my script to make it easier to handle the fact that each line was already in a JSON format which was to add the lines below:
- from mrjob.protocol import JSONValueProtocol
- INPUT_PROTOCOL = JSONValueProtocol
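As a rough illustration of what that input protocol setting buys you, here is a plain Python sketch (it does not import MRJob) where each line is decoded from JSON before the mapper sees it, with the key left as None. The helper name and the "event_type" field are hypothetical, not taken from my actual script.

```python
import json


def read_json_values(lines):
    """Decode one JSON event per line and yield (None, event) pairs.

    This imitates the idea of a JSON value input protocol: the mapper gets
    an already parsed value and the key is unused.
    """
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield None, json.loads(line)


def mapper(_, event):
    # "event_type" is a hypothetical field name used for illustration
    yield event["event_type"], 1


raw_lines = [
    '{"event_type": "click", "user": "a"}',
    '{"event_type": "view", "user": "b"}',
]
mapped = [
    pair
    for key, event in read_json_values(raw_lines)
    for pair in mapper(key, event)
]
print(mapped)
```

Without the protocol, the mapper would receive each line as a raw string and would have to call json.loads itself.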
Local & Small
- Pulled 1 JSON event to analyze what data was available and focused on the data values needed to answer the question
- Simulated/created a few variations on the JSON event example to cover key use cases for testing the MRJob script
- Developed 1 MR step (mapper/reducer) to just pull simple data point counts on the dummy JSON values
- Expanded MR to address the more complex question. This led to a multi-step MR job (2 rounds of map and reduce) which eventually condensed to 1 map and 2 reduce steps
- If the questions would call for a SQL join or groupby to get the answer then used those data points as keys
- Used a conditional in the map step to filter and streamlined the yielded results that needed to be sorted and condensed
- During initial code development, ran and debugged MRJob code locally on dummy JSON event data which was also stored locally. Used the following bash command to run MRJob:
- python [MRJob script] [data file] # this output results to stdout (e.g. terminal)
- Note you can just run specific steps to focus on debugging like just a map step by appending something like --mapper to the command above
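The steps above (a conditional filter in the map step, group-by style keys, and 1 map followed by 2 reduce steps) can be sketched in plain Python. This is not my actual script. The event fields, the "purchase" event type and the question being answered (how many users had each number of purchases) are made up for illustration, and the shuffle helper stands in for what Hadoop does between steps.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key, as the framework would between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def mapper(event):
    # Conditional filter: only yield the event type we care about,
    # keyed by the field we would GROUP BY in SQL (hypothetical names)
    if event.get("type") == "purchase":
        yield event["user_id"], 1


def reducer_count_per_user(user_id, ones):
    # First reduce: total per user, re-keyed by that total
    yield sum(ones), user_id


def reducer_users_per_count(count, user_ids):
    # Second reduce: how many users share each total
    yield count, len(user_ids)


events = [
    {"type": "purchase", "user_id": "u1"},
    {"type": "purchase", "user_id": "u1"},
    {"type": "view", "user_id": "u1"},
    {"type": "purchase", "user_id": "u2"},
    {"type": "purchase", "user_id": "u3"},
]

mapped = [pair for event in events for pair in mapper(event)]
per_user = [
    pair
    for key, values in shuffle(mapped)
    for pair in reducer_count_per_user(key, values)
]
result = dict(
    pair
    for key, values in shuffle(per_user)
    for pair in reducer_users_per_count(key, values)
)
print(result)
```

Filtering in the mapper keeps the amount of data that has to be sorted and shipped to the reducers as small as possible.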
Local & AWS Remote
- Once the results from the dummy values looked good, spun up an EC2 instance
- FYI, if you don’t have an account with AWS, sign up and get your access keys (id & secret)
- Pulled 1 zip file from S3 onto the instance because it was too big for the personal computer
- Used s3cmd (command line tool) to get access to data
- s3cmd get [filename] # downloads file on EC2 instance
- Unzipped file and pulled about 100 events into a sample file. Then exported the sample file back to S3
- gunzip [filename]
- head -n 100 [filename] > sample.txt
- s3cmd put [filename] # store file back on s3
- Ran exploratory data analysis using Pandas on the sample to verify data structure and results
- Ran the script locally against the sample data pulled down from S3 for initial debugging
- python [MRJob script] [data file]
- Once the script worked and the numbers made sense, then ran the file through EMR
- Set up the MRJob config file (see the example further below)
- Set up the pem file and stored it on the computer running the MRJob script
- Used the following command:
- python [MRJob script] -o "[s3 bucket for output]" -r emr "[s3 bucket for input]"
- S3 file path typically starts with s3:// and make sure quotes are around the path
- There were issues to debug on the sample but once fixed, I set up a screen session and ran the code on the full data set
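If you would rather pull the sample in Python than with gunzip and head, the same idea looks roughly like the sketch below. It reads the first 100 lines straight out of the gzipped file without unzipping the whole thing. The file names and the generated dummy events are placeholders so the example can run on its own.

```python
import gzip
import itertools
import json
import os
import tempfile


def write_sample(gz_path, sample_path, n=100):
    """Copy the first n lines of a gzipped text file into a plain sample file."""
    count = 0
    with gzip.open(gz_path, "rt", encoding="utf-8") as src, \
            open(sample_path, "w", encoding="utf-8") as dst:
        for line in itertools.islice(src, n):
            dst.write(line)
            count += 1
    return count


# Build a throwaway gzipped event log so the sketch is self-contained
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "events.json.gz")
sample_path = os.path.join(workdir, "sample.txt")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    for i in range(250):
        f.write(json.dumps({"event_id": i}) + "\n")

written = write_sample(gz_path, sample_path, n=100)
print(written)
```

Keep in mind the first 100 lines are not a random sample, so they are good for checking structure but not for drawing conclusions.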
To make it easier to run MRJob, use a configuration file. MRJob provides documentation on how to set this up. One way is to put the file in your home directory (~/) and name it .mrjob.conf so MRJob finds it automatically. There are a number of things that can be preconfigured, which shortens the command you have to type when running an EMR job.
- TZ: America/
- aws_access_key_id: [your key]
- aws_secret_access_key: [your key]
- ssh_tunnel_to_job_tracker: true
- aws_region: us-west-2
- ec2_instance_type: m1.xlarge
- ec2_key_pair: [pem file name]
- ec2_key_pair_file: ~/.ssh/[pem file name].pem
- num_ec2_core_instances: 0
- enable_emr_debugging: true
- ami_version: latest
Note, above is an example and there are many variations and additional parameters you can add and change based on the job you are running. Also, some parameters have to be entered in at the command line and cannot be added to the configuration file.
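One thing the flat list above does not show is nesting. As far as I know, MRJob expects EMR options to sit under a runners section and then an emr section, and it can read the file as YAML or JSON. Below is a sketch that builds that nested layout in Python and prints it as JSON. The option names are the ones from this post and may differ in newer MRJob versions, and the bracketed values are placeholders you would fill in yourself.

```python
import json

# Options copied from the example in this post; bracketed values are placeholders
emr_options = {
    "aws_region": "us-west-2",
    "ec2_instance_type": "m1.small",
    "num_ec2_core_instances": 0,
    "ec2_key_pair": "[pem file name]",
    "ec2_key_pair_file": "~/.ssh/[pem file name].pem",
    "ssh_tunnel_to_job_tracker": True,
    "enable_emr_debugging": True,
}

# Assumed layout: everything for EMR lives under runners -> emr
config = {"runners": {"emr": emr_options}}

config_text = json.dumps(config, indent=2)
print(config_text)
```

Check the MRJob documentation for your version before relying on this layout, and keep your access keys out of anything you commit to a repo.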
When you first run a small sample job on EMR, do it with 1 instance and on something with less horsepower. In this case, you would set up the EC2 instance in the m1 range with 0 core instances, so the job runs on the master instance alone. These settings relate to the instance type and number:
- ec2_instance_type: m1.small
- num_ec2_core_instances: 0
If your script calls for a package that is not standard on EMR, you will need to bootstrap/load the EMR instances with the package prior to running the job. In my case, I needed dateutil and in order to load it, I first needed to load pip. So I added the following commands to my config file:
- sudo apt-get install -y python-pip || sudo yum install -y python-pip
- sudo pip install python-dateutil
Also when setting up AWS, you will need to create an EC2 pem file, better known as a key pair, for encrypted access to your instances. This is different from the access keys. AWS provides a step by step process to set up the pem file. Be sure to remember the name you give it and to move the file that is downloaded into a folder you reference in your configuration. Most people typically put it in the .ssh folder in their home directory. Also for Mac and Linux, be sure to change the file permission by running this command in the terminal on the pem file: chmod 400 [pem file name].pem. You can confirm the permissions changed if you are in the same folder as the pem file by running: ls -la. The following settings reference the pem file in the .mrjob.conf file:
- ec2_key_pair: [pem file name]
- ec2_key_pair_file: ~/.ssh/[pem file name].pem
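If you want to set or double check the pem file permissions from Python instead of the terminal, a small sketch for Mac and Linux is below. It uses a temporary file as a stand-in for the real pem file, and the function name is my own.

```python
import os
import stat
import tempfile


def restrict_to_owner_read(path):
    """Same effect as `chmod 400 path`: owner can read, nobody can write or execute."""
    os.chmod(path, stat.S_IRUSR)
    # Return only the permission bits so they are easy to compare
    return stat.S_IMODE(os.stat(path).st_mode)


# Stand-in for the downloaded pem file
fd, pem_path = tempfile.mkstemp(suffix=".pem")
os.close(fd)

mode = restrict_to_owner_read(pem_path)
print(oct(mode))

os.remove(pem_path)  # clean up the stand-in file
```

SSH will typically refuse to use a key file that other users can read, which is why this step matters.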
A couple additional setup tips:
- When running on more than the sample data, be sure to add to the command line or configuration file: --no-output. This makes sure that the job does not output all stdout values on your local computer when the full data set is being processed. You really don’t want that
- Set S3 and EMR to the same region so Amazon will not charge for bandwidth used between them
- Stop instances when you are not using them to save money
If you want to see what’s going on with the EMR instances, you can login while they are running and poke around. Login to the AWS console and go to EMR. Click on the cluster that is running your job. There will be a Master public DNS you will want to use. Just use the following command in your terminal:
- ssh hadoop@[EMR Master public DNS] OR
- ssh -i [pem file path] hadoop@[EMR Master public DNS]
This will connect your terminal directly into the EMR instances. You can poke around and see what’s going on while they are running. Unless you specify to keep EMR running after the job is done, the instances will terminate at the end of the job and boot you out.
Data Inconsistency – Be careful to analyze the data and confirm what you have access to. This is a common challenge. In my case, there were missing values out of different events I worked with which required changing the code a few times to do a check for values as well as get information out of different data points that were more consistent. Bottom line, don’t trust the data.
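A sketch of the kind of defensive check I mean is below. The field names ("user_id" and a nested "user" with an "id") are hypothetical. The point is to look for a value in more than one place and to skip the event cleanly instead of crashing the job when nothing is there.

```python
def extract_user_id(event):
    """Return a user id from whichever field is present, or None if neither is."""
    # Hypothetical field names: a top-level "user_id" or a nested "user" -> "id"
    user_id = event.get("user_id")
    if user_id is None:
        # "user" may be missing or explicitly null, so fall back to an empty dict
        user_id = (event.get("user") or {}).get("id")
    return user_id


events = [
    {"user_id": 7},
    {"user": {"id": 8}},
    {"user": None},
    {},
]
ids = [extract_user_id(event) for event in events]
usable = [user_id for user_id in ids if user_id is not None]
print(ids, usable)
```

It is also worth counting how many events you end up skipping so you know how much the missing data affects the result.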
DateTime & UTC – I’ve wrestled with this devil many times in the past and it still tripped me up on this project. Make sure if your conditionals are working with time, and they typically will be with event logs, to deliberately translate and compare datetime in UTC format.
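Here is a small sketch of why the UTC conversion matters, using only the standard library (in my script I used dateutil for parsing). The timestamp and cutoff are made up. An event logged at 6:30pm Pacific on August 31 is already September 1 in UTC, so comparing dates without converting would put it on the wrong side of the cutoff.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Refuse to guess: a naive timestamp cannot be compared safely
        raise ValueError("timestamp has no UTC offset: " + timestamp)
    return parsed.astimezone(timezone.utc)


# Hypothetical cutoff and event time for illustration
cutoff = datetime(2014, 9, 1, tzinfo=timezone.utc)
event_time = to_utc("2014-08-31T18:30:00-07:00")

is_after_cutoff = event_time >= cutoff
print(event_time.isoformat(), is_after_cutoff)
```

Raising an error on timestamps without an offset is a design choice. It is noisier than assuming a timezone, but it surfaces bad data early.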
Traceback Error – If your EMR terminates with errors then you can go into the AWS dashboard and the EMR section. Choose the cluster that you ran, then expand the Steps area. Click on View jobs under the job that failed. Then click on View tasks under the job that failed. Select View attempts next to a task that failed and choose stderr link on a task that failed. That will open an error log that can help provide more context around what went wrong.
There are so many variations on the process outlined above not to mention many different tools you can use. This post was to give some pointers on my approach at a pseudo high-level for those trying to figure out the end to end process. You really have to research and figure out what works best for your situation.
Where I would go next with this area is to play around with something like Spark to handle streaming data and to explore implementing machine learning algorithms on massive data. Although neural nets are more of a personal interest for me right now.
Yeah I know this is not a revelation and some of you out there enjoy the process. I say you are probably a little sadistic or masochistic or both if you do.
It’s a mental and emotional roller-coaster and as mentioned in a previous post, it’s like taking finals non-stop for as long as you are going through the interview process. It drained my energy and kept me from feeling creative or feeling motivated to just code for fun. It’s been so long since I posted code consistently on Github which makes me a little sad.
While interviewing, I was typically too busy managing the interview pipeline and studying. By pipeline, I mean finding companies, setting up interviews, actually interviewing and then following up and doing this for each company. Several companies have multiple rounds from 1-6+ depending on the company. When I wasn’t dealing with the process of interviewing, I was studying, and as I mentioned in my last post, the range of topics for data science interviews is wide.
Many people talk about how the interview process is broken and there are some attempts to fix it (at least in the SF tech community from what I’ve heard & read). But let’s face it, putting people in an awkward, stressful fishbowl environment and having them perform tests that typically don’t relate to their job to see if they are a good candidate is biasing the results for people who interview well.
Interviewing well does not mean that person is the best for the job. One of the smartest and most talented people I know does not interview well and any company that passes up on that person is literally missing a gold mine of talent.
I get it that companies can only accept so much risk and they are trying to find ways to vet productive and successful people they add to their team. And as mentioned, there are efforts to improve the interview process. Since I was in the thick of interviewing for the last couple of months, I wanted to note a few experiences in what worked and didn’t work in how companies interviewed.
- Treating the interviewee like a person and valuing their time vs. making them feel like another number. Some great experiences were going to lunch with people I would work with to see how we gelled and even just as simple as having them make time for me to ask questions or check if I needed a break during long interviews.
- Being prepared. I had a couple of interviewers tell me they looked up my Github profile before they talked to me which impressed me. On the flip side of this, I had interviews where the person came in telling me they didn’t really know who I was and why I was there. I know they are busy but this ties into my first bullet above.
- Explaining the process and setting interview expectations. Some companies would send an email or do a call where they would explain the process which helped me prepare. There were a few times where interviews turned into surprise tech interviews when they had been explained as just “get to know you” interviews.
- Creating as realistic an environment as possible for tech interviews and keeping it contained. One of the best interviews in relation to this was when the person had me solve a code challenge on his computer. I was told to use any resource I normally would (e.g. Stack Overflow) to solve the problem. The challenge was also focused on a specific problem that related to the type of work and the environment was all set up so I could just focus on the problem. Granted he watched me while I worked and this could be too difficult for some but being able to code almost like I normally would made it one of the more positive tech interviews.
- Giving constructive feedback. One of my favorite parts from the bullet point above was that the guy gave me solid constructive feedback at the end. This happened in a few other interviews but not always. Getting that type of feedback really helped me see how well that person can communicate and evaluate work. It spoke volumes about the company and experience I would have there.
Look, I’m not breaking new ground with the information above, just giving some thoughts on my experience. I will say that even though interviewing really does suck, it also was very valuable.
When I started interviewing, I had to just tell myself to buckle down and drive after this. I was so tired after Zipfian and PyCon that it was the last thing I wanted to do, but I knew the timing required me to just regroup and push forward. I felt like I didn’t know anything when I started even though I did, and I made a point of making blocks of time so I could study as well as figure out how to solve questions I got from the previous interview. All the interviews and studying did help me become stronger on the topics.
Plus, data science is still such a new and not fully defined field and I still didn’t know how to explain what I wanted to do. The interviews gave me a chance to get clear on how companies define data science and what they really needed as well as what I want to work on in the field.
Many companies were interested in me for more metrics reporting and BI which makes sense with my business background. Still I am very interested in building things such as implementing a recommender system or applying NLP or MapReduce which is really more about developing data products. Hell, as it becomes more practical, I also want to get into applying neural nets. (All in good time.) Ultimately, the interviews allowed me to get clear quickly with companies on what we were both looking for and whether the role was a match.
A couple other things that helped the process:
- Github Profile: I got into a space last year where I was posting daily on Github and that really paid off with some companies who saw that I could code. FYI, I’ve posted a lot of messy code there and I try to focus on getting code up more than worry about making it perfect. It will never be perfect.
- Blog: Explaining my thought process here and progress over the last year helped a few times with people understanding me a little before we met.
- Network: The people I’ve met this past year have been ridiculously helpful, both for me to vet what the companies really are like and for the companies to check me out beforehand.
- Practice: Even though I don’t love interviewing, I will totally agree that going through several made me stronger and I wouldn’t change that.
So interviews suck but they do have redeeming qualities. When you are going through it, take breaks when you need to, talk to your friends (especially those who have gone through the process) and just keep moving. It can be hard but it is survivable. And good luck!