PyCon 2014 – How to get started with Machine Learning

Following up on the talk I just gave at PyCon 2014 in Montreal, this post explains parts of my presentation and adds a few clarifications. You can catch the talk on Pyvideo.org, my GitHub repo PyCon2014 holds the sample code, and the slides are on SpeakerDeck.

Machine Learning (ML) Overview

Arthur Samuel defined machine learning as the “field of study that gives computers the ability to learn without being explicitly programmed.” It’s about applying algorithms in a program to solve the problem you are faced with and to address the type of data that you have. You create a model that helps conduct pattern matching and/or predict results, then evaluate the model and iterate on it as needed to arrive at the right type of solution for the problem.

Examples of ML in the real world include handwriting analysis, where neural nets read millions of pieces of mail to sort and classify all the different variations in written addresses. Weather prediction, fraud detection, search, facial recognition, and so forth are all examples of machine learning in the wild.

Algorithms

There are several types of ML algorithms to choose from and apply to a problem, and some are listed below. They are broken into categories to suggest how to think about applying them. When choosing an algorithm, it’s important to consider the goal/problem, the type of data available, and the time and effort you have to work on the solution.

(Slide: ML algorithms broken out by category.)

A couple of starting points to consider are whether the data is supervised or unsupervised. Supervised means you have actual data representing the results you are targeting, which you can use to train the model; spam filters, for example, are built on emails that have been labeled as spam. With unsupervised data, there isn’t a clear picture of the result. For unsupervised learning, you have questions about the data, and you run algorithms on it to see if patterns emerge that help tell a story. Unsupervised learning is a challenging approach, and typically there isn’t a single “right” answer for the solution.

In addition, whether the data is continuous (e.g. height, weight) or categorical/discrete (e.g. male/female, Canadian/American) helps determine the type of algorithm to apply. Basically, it’s about whether the data has a set number of values that can be defined or whether the variations in the data are nearly infinite. These are some ways to evaluate what you have and identify an approach to solve the problem.

Note, the algorithm categorization has been simplified a bit to provide context, but some algorithms do cross the above boundaries (e.g. linear regression).

Models

Once you have the data and an algorithmic approach, you can work on building a model. A model can be something as simple as an equation for a line (y=mx+b) or as complex as a neural net with many layers and nodes.

Linear regression is a simple machine learning algorithm to start with: you find the best-fit line to represent observed data. In the talk, I showed two examples of observed data that exhibited some type of linear trend. There was a lot of noise (data scattered around the graph), but there was enough of a trend to demo linear regression.

When building a model with linear regression, you want to find the optimal slope (m) and intercept (b) based on the actual data. See, algebra is actually applicable in the real world. This is a simple enough algorithm that you could calculate the model yourself, but it’s better to leverage a library like scikit-learn to calculate the best-fit line more efficiently. What you are calculating is the line that minimizes the distance between all the observed data points.
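
As a minimal sketch of what that looks like with scikit-learn (the head-size/brain-weight numbers here are invented for illustration, not the dataset from the talk):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X must be 2-D (n_samples, n_features); y is 1-D.
X = np.array([[3500], [3800], [4100], [4400]])  # head size (cm^3)
y = np.array([1250, 1320, 1410, 1480])          # brain weight (g)

model = LinearRegression()
model.fit(X, y)

print(model.coef_[0])    # slope (m)
print(model.intercept_)  # intercept (b)
```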

After generating a model, you should evaluate its performance and iterate to improve it if it is not performing as expected. For more info, I explained linear regression in a previous post.

Prediction

Once you have a good model, you can feed in new data and output predictions. Those predictions can feed into a data product or generate results for a report or visualization.

In my presentation, I used actual head size and brain weight data to build a model that predicts brain weight based on head size. The dataset was fairly small, which decreases the predictive power and increases the potential for error in the model. I went with this data because it was a demo and I wanted to keep it simple. When graphed, the observed data was spread out, which also indicated a lot of variance, so the model predicts weight with a good amount of variance built in.

With the linear model I built, I could feed in a head size (x) and it would calculate the predicted brain weight (y). Other models are more complex in their underlying math and application, but you will do something similar with them: build the model, then feed in new features/variables to generate some type of result.
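
Continuing the sketch from the Models section above, prediction is a one-liner; the new head size is again an invented value:

```python
# Feed a new head size (x) into the fitted model to get a
# predicted brain weight (y).
new_head_size = np.array([[4000]])   # hypothetical value, in cm^3
print(model.predict(new_head_size))  # roughly m * 4000 + b
```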

To see the full code solution, check out the GitHub repository noted above. The script is written a little differently from the slides because I created functions for each of the major steps. There is also an IPython notebook that shows some of the drafts I worked through to build out the code for the presentation.

Tools

The Python stack is becoming quite popular for scientific computing because of its well-supported toolsets. Below is a list of key tools to start learning if you want to work with ML. There are many other Python libraries for more nuanced needs in the space, as well as other stacks to explore (R, Java, Julia). If you are trying to figure out where to start, here are my recommendations:

  • Scikit-Learn = machine learning algorithms
  • Pandas = dataframe tool
  • NumPy = matrix manipulation tool
  • SciPy = scientific computing (incl. stats)
  • Matplotlib = visualization

Skills

In order to work with ML algorithms and problems, it’s important to build out your skill set in the following areas:

  • Algorithms
  • Statistics (probability, inferential, descriptive)
  • Linear Algebra (vectors & matrices)
  • Data Analysis (intuition)
  • SQL, Python, R, Java, Scala (programming)
  • Databases & APIs (get data)

Resources

And of course, the next question is: where do I go from here? Below is a beginning list of resources to get you started. I highly recommend Andrew Ng’s class, and a couple of the links lead to sites with more recommendations on what to check out next:

  • Andrew Ng’s Machine Learning on Coursera
  • Khan Academy (linear algebra and stats)
  • Metacademy
  • Open Source Data Science Masters
  • StackOverflow, Data Tau, Kaggle
  • Machine Learning: A Love Story
  • Collective Intelligence – Toby Segaran
  • Pattern Recognition & Machine Learning – Christopher Bishop
  • Think Stats – Allen Downey
  • Tom Mitchell
  • Mentors

One point to note from this list, and I stressed this in the talk: seek out mentors. They are out there and willing to help. You have to put out there what you want to learn and then be aware when someone offers to help. Also, follow up. Don’t stalk the person, but reach out to see if they will make a plan to meet you. They may only have an hour, or they may give you more time than you expect. Just ask, and if you don’t get a good response or have a hard time understanding what they share, don’t stop there. Keep seeking out mentors. They are an invaluable resource to get you much farther, faster.

Last Point to Note

ML is not the solution for everything, and many times it can be overkill. You have to look at the problem you are working on to determine what makes the most sense for your solution and how much data you have available. Plus, I highly recommend looking for the simple solution first before reaching for something more complex and time-consuming. Sometimes regex is the right answer, and there is nothing wrong with that. As mentioned, to figure out an approach, it’s good to understand the problem, the data, the amount of data you have, and the time you have to turn the solution around.

Good luck in your ML pursuit.

References

These are the main references I used in putting together my talk and post.

  • Zipfian
  • Framed.io
  • “Analyzing the Analyzers” – Harlan Harris, Sean Murphy, Marck Vaisman
  • “Doing Data Science”  – Rachel Schutt & Cathy O’Neil
  • “Collective Intelligence” – Toby Segaran
  • “Some Useful Machine Learning Libraries” (blog)
  • University GPA Linear Regression Example
  • Scikit-Learn (esp. linear regression)
  • Mozy Blog
  • StackOverflow
  • Wikipedia

Zipfian Hiring / Demo Day – How it Works

This past week was the big hiring/demo day for Zipfian Academy, where we presented our projects to a number of companies seeking data scientists. Zipfian’s hiring day is a lot like Hackbright’s career day, so the nice thing about going through it was that I knew what to expect.

Companies:

Sixteen companies attended hiring day, which was a great turnout, and I was thankful there weren’t another nine because talking to sixteen was exhausting enough. Some companies were hiring their first data scientist, who would set the strategic direction as well as put it into action, while other companies had existing teams they wanted to grow. So there was a mix of start-ups and large organizations.

Schedule:

In the morning, each company gave a one-minute introduction on who they were and what they were looking for. Then all the students presented for three minutes each on our individual projects. We used slides and gave a very brief overview of our project goal, approach, results, and next steps. We broke for lunch, and then we did speed interviewing like we did at Hackbright. Each company had a table, and the students rotated around. We were given seven minutes to talk, and the conversations varied based on the company and what they were looking for.

Approach:

Preparation was really focused on reviewing company bios as well as putting together project presentations to explain what we did. Zipfian drilled us on those presentations. They had us draft them the week before and run through them a few times with lots of great feedback. It was tough to go through when there was so much else going on, but it was very valuable for getting us in shape and clear on our project story. We also put together bios for the companies and received bios about the companies attending.

During the day, I went with my previous experience of asking questions about the company, the roles they were hiring for, the culture, and the tools they use. A few companies asked technical questions and/or specific questions about my project. I also got questions about my background and what role I was looking for.

Afterwards, I changed my approach from last time, when I sent emails immediately. Partly because I needed a breather: I took the day after off from school to handle a number of errands that had piled up. Also because it was nice to let the experience percolate.

Zipfian & HB Comparison:

The speed interviewing was pretty much the same, except the students rotated and the project demos were consolidated into a single presentation. I really appreciated not having to repeat my project spiel more than once, and it kept the individual conversations more authentic (vs. a rehearsed speech). That also gave more time to get to know the companies.

On the whole, I felt more comfortable the second time around because I actually knew many of the companies and had a broader picture of the community. There were a few people there who knew and/or worked with some of my fellow Hackbright alums and friends. One of the coolest six-degrees moments was that one of the representatives had been at the hardware hackathon where my team built the Arduino car. It gave us something else to talk about and removed some of the awkward getting-to-know-you process. So even though I didn’t know the people directly, they didn’t all feel like complete strangers to me.

To clarify, I wouldn’t have felt that way at all if I hadn’t gone through Hackbright and all the stuff in the last year. It really set the stage for me to have a better perspective on the hiring day experience.

It is still a very long day, and everyone (students and company reps) said the same. Still, after it was all said and done, a few of the students who were still around went out for ice cream, which was definitely a good way to wrap up the day.

Last Note:

One week left for Zipfian and next week is interview prep.

Jeeves is Talking!

The coolest moment this week was when I figured out that I just needed to add one line of code to my program to get my computer to talk in my Jeeves project (thus the video above).

This past week was all about fixing stuff, fine-tuning, iterating, and putting together a presentation and peripheral materials for demo day, which will be Thursday. Of course there is always more I could and would do, but I’ve been continually shifting priorities based on time and the end goal.

Refresher:

In case you haven’t seen the previous posts, I’m building an email classification tool. It’s a binary classifier focused on determining whether an email I receive is about a meeting that needs a location picked/defined/identified (whatever word makes it clear). If the email classifies as true, then a text is sent to my phone. I have a working model in place, and the video above shows my program running through the full process, with the computer telling me the results (a little lagniappe I added in addition to the text output).

Classification Accuracy:

In last week’s post, I mentioned how the classification model (logistic regression) used in my product pipeline was not performing well even though I was getting a score of around ~85% accuracy. All the classification models I tested had given ~80-90% accuracy scores, and I had explained that the models were flawed because of the distribution of the data. My data has ~15% true cases in it, so if a model classified all emails as false, it would still be right ~85% of the time.

What I need to clarify is that the ROC curve I was using also provides a type of accuracy metric, but one that accounts for skewed class distribution (think of it as adjusting the 15/85 split to 50/50). So if the area under my ROC curve is greater than 50%, the classifier is getting some true cases correctly classified; my ROC curves had been around 70-80% on most of the models the week before.
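
For anyone wanting to compute this, here is a minimal sketch with scikit-learn’s roc_auc_score; the labels and scores below are made up for illustration:

```python
from sklearn.metrics import roc_auc_score

# y_true: actual labels; y_scores: predicted probabilities for the
# positive class, e.g. model.predict_proba(X_test)[:, 1]
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_scores = [0.1, 0.3, 0.2, 0.8, 0.4, 0.6, 0.2, 0.1]

# 0.5 = chance, 1.0 = perfect; unlike raw accuracy, this is honest
# even when only ~15% of the cases are true.
print(roc_auc_score(y_true, y_scores))
```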

So I investigated how the logistic regression model in my product pipeline was performing and found that I had hooked it up incorrectly. When I take in a new email message, I have to put it into a list before splitting it up into features. The way I was passing the message, each word in one email was being treated as its own email. I figured this out when I printed out the feature set shape and the length of the original message. I just needed brackets around the email message so the program would see it as a list object. Sometimes it’s just that small of a fix. Now my classification model works great, and it sends me texts on new emails that should be labeled as true.
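
To illustrate that fix, here is a minimal sketch of the bug with a scikit-learn vectorizer; the training sentences and message are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["where should we meet", "lunch tomorrow?"])

message = "should we meet for lunch"

# WRONG: passing the pieces of the message directly makes the
# vectorizer treat each element as its own document.
# features = vectorizer.transform(message.split())  # shape: (5, n_features)

# RIGHT: wrap the single email in a list so it is one document.
features = vectorizer.transform([message])          # shape: (1, n_features)
print(features.shape)
```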

Features:

This week, I improved my model by expanding how I build features: using tf-idf, lemmatizing, n-grams, normalizing, and a few other ways to clean and consolidate the features (e.g. words).

Tf-idf gives a high weight to a word based on how frequently it shows up in a document, but decreases the weight if the word shows up frequently throughout all the documents (the corpus) used in the analysis. It helps reduce the value of my name as a predictor, since my name shows up throughout the corpus and should not be weighted strongly.

Lemmatization groups different inflected forms of a word together to be analyzed as a single item (e.g. walk and walking). Using n-grams creates groupings of words into bi-grams, tri-grams, etc. This means that in addition to single-word features, I’m also accounting for groups of words that could be good predictors for a true case. For example, ‘where should we meet’ is a combination of words that can be a very strong predictor for the true case, possibly stronger than the single word ‘meet’. N-grams, in some ways, allow for context.
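
Here is a hedged sketch of those feature-building ideas with scikit-learn’s TfidfVectorizer; the documents are made up, and lemmatization would need an extra step (e.g. a custom tokenizer wrapping NLTK’s WordNetLemmatizer), which I’m omitting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "where should we meet for coffee",
    "meeting notes attached",
    "newsletter: weekly update",
]

# ngram_range=(1, 3) keeps unigrams and adds bi-grams and tri-grams,
# so phrases like "where should we" become features too.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=True)
X = vectorizer.fit_transform(docs)
print(X.shape)  # (3 documents, n unigram/bigram/trigram features)
```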

There are other techniques I used to build out my features, but those mentioned above give a sense of the approach. After those changes, my ROC curve now shows ~80-90% on most of the classification models I’m comparing.

There are more things I want to do with my feature development, but they are lower priority right now given the good performance results and with career day so close.

Code Stuff:

I spent a good chunk of time cleaning up and streamlining my code. I set it up to easily re-run the model comparison whenever I make feature changes. I also needed to make sure I consistently split up the data used for cross validation in my model comparisons. Cross validation is a way to use part of the data to build the model while saving a set of data to test and validate performance. So I got my code into a good enough state that it’s easy to re-run and expand, and there is some validity to the scores it produces. Plus, cleaner code helps me understand it when I go back to add things in.
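
Here is a minimal sketch of that setup with scikit-learn (module paths per recent versions); the generated data is a stand-in for my email feature matrix and labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data; in the real pipeline X is the email feature matrix
# and y the true/false labels.
X, y = make_classification(n_samples=200, random_state=0)

# A fixed random_state keeps the splits consistent between runs,
# which was the point of the cleanup described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
print(cross_val_score(model, X_train, y_train, cv=5).mean())
```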

And if you want to check out the code for my project, you can find it in my Code_Name_Jeeves GitHub repository.

Next Steps:

Depending on time, I have other feature ideas, such as adding a binary feature for whether a date is referenced in the email message. I’d also like to run another grid search on the data pipeline to help fine-tune parameters. More importantly, adding more data to my training set would be a great value add, and testing my product on a different dataset would help validate performance. Of course, if there were more time than a couple of days, it would be great to build this out so my computer gives me location recommendations, but that will have to wait for another time.

Last Note / Official End Result:

If you noticed a buzzing sound at the end of the video, that is my phone receiving the text message that it should get (see below).

(Photo: the text message received on my phone.)

Zipfian Project Week 1 & Closing the Loop

One week down for our final projects and one week left to go. This has definitely been the most fun and rewarding week so far because I’ve been connecting the dots in more ways than one.

Everyone talks about getting your minimum viable product (MVP) done when doing final projects like this. Thankfully, I had enough experience from Hackbright to know what that meant and how to approach it. I got my full product pipeline built between last Saturday and yesterday. I’ve tested that it works (well, pseudo-works), and it feels really good to see the emails get pulled in and analyzed, and, if the condition is true, to get a text on my phone. Now I really need to fix my actual classifier, because it is just not that accurate.

I named my project Code Name Jeeves, in case you are looking for it on GitHub. I was inspired by a lot of things in wanting to make my computer smarter, but the Iron Man movies, with the computer JARVIS, gave me one of those moments where I distinctly remember thinking, “why can’t I just talk to my computer yet?” (and not in a Siri way). Basically, why is my computer not doing more things for me than it does right now? Thus the name; really, I didn’t want to spend a ton of time thinking of a name.

So this past week was pretty productive:

  • Evaluated and applied email packages to pull gmail data
  • Set up Postgres data storage
  • Labeled emails that would classify as true
  • Applied vectorizer to generate a feature set (e.g. bag of words)
  • Tested several classifier models
  • Reworked the code to close the loop on my full product

Data / Emails

I spent the first two days experimenting with a couple of packages for pulling gmail (most built on IMAP), and I ended up picking the package by Charlie Guo. It’s simple enough to understand and apply quickly, but it has enough functionality to let me do some data manipulation while pulling the emails.

I made a quick decision during that time to go with Postgres for primary data storage. I’m a fan of building with “going live” in mind just in case, and since I know Heroku, I knew Postgres was a top option for persisting data. The reality is that it’s a bit supercharged for the size of data I’m storing right now, but the setup is there and available as I need it. Plus, it was good practice writing SQL to set up tables, store data, and access it.
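
As a sketch of that setup, here is roughly what creating and filling a table looks like with psycopg2; the database, table, and column names are made up for illustration, not my actual schema:

```python
import psycopg2

# Assumes a local Postgres server with a database named "jeeves".
conn = psycopg2.connect(dbname="jeeves", user="postgres")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS emails (
        id      serial PRIMARY KEY,
        subject text,
        body    text,
        label   boolean  -- True if the gmail label marked it a match
    );
""")

cur.execute(
    "INSERT INTO emails (subject, body, label) VALUES (%s, %s, %s)",
    ("Lunch?", "Where should we meet?", True))

conn.commit()
cur.close()
conn.close()
```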

I also spent some time going through my emails and manually labeling them. The email package I used made it possible to label emails in gmail that should classify as true and then store that labeling in my database. Basically, I added a column in my database that holds True for an email row if the email had the label I gave it in gmail.

I had to manually build out my training dataset because I need it to build my classification model. This manual labeling is a bit of a hindrance because it takes time, and we have limited time for this effort. I could crowdsource it, but I don’t want to give my emails out. I could use the Enron dataset and crowdsource labeling those emails, but that feels like overkill for these two weeks, and I really don’t want to use Enron’s data. I knew this would be an issue, and I’m working to adjust for it where I can (esp. in my features and classifier model).

Features

After getting the data, I spent a day analyzing what I had and building out a feature set. For natural language processing, a simple feature set can be just counting up word occurrences in the training dataset (a bag of words). This can be expanded further, but I opted to keep it simple to start so I could continue to get my pipeline built out.

To help explain features, think of them as variables that help predict the unknown. If this were a linear model like y = mx + b, the features are the x variables that help define what y will look like, and the classifier defines the coefficients for the model, which in this case would be m and b.
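
To make that concrete, here is a minimal bag-of-words sketch with scikit-learn; the emails are made up, and the method name assumes a recent scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = ["where should we meet",
          "meet me at noon",
          "weekly newsletter"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Each word is a feature (an x variable); each cell counts how
# often that word appears in an email.
print(vectorizer.get_feature_names_out())
print(X.toarray())
```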

Classifier

I spent the last couple of days of the week exploring as many classifiers as possible in the scikit-learn package. The ones I tried:

  • Logistic Regression
  • Naive Bayes (Gaussian, Multinomial, Bernoulli)
  • SVC
  • Random Forest
  • Ada Boost
  • Gradient Boost

Initially, I just ran the standard models without any tuning. To run them, I pass in my training set of X features and y labels (my manual labeling of whether each email should classify as true or false).

Gaussian Naive Bayes, Random Forest, and Gradient Boost all did pretty well, with accuracy scores of 80-90% and area under the curve on a ROC plot of 70-80%. The reality was that they were doing great at classifying my emails as false (not meeting my condition) because about 80% of my training data was false. So a model that classified everything as false would be 80% correct.

One thing my instructors helped me appreciate is that when I’m classifying the emails, I would prefer an email that should be false being classified as true (a false positive in the confusion matrix) over missing an email that should be true being classified as false (a false negative). This is similar to the trade-off targeted for spam: classifying the wrong email as spam is worse than getting a little bit of spam in your inbox.

I also worked on applying a grid search, for practice; it’s an approach for testing a variety of parameters to tune the models and improve accuracy scores. Through tuning, I was able to improve Logistic Regression, Multinomial Naive Bayes, and SVC into the 90% accuracy range.
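
For anyone curious, a grid search with scikit-learn looks roughly like this; the parameter grid below is illustrative, not the one I actually ran:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in data in place of the email feature matrix and labels.
X, y = make_classification(n_samples=200, random_state=0)

# Try several regularization strengths, scored by 5-fold CV.
grid = GridSearchCV(LogisticRegression(),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```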

As mentioned, certain models like Logistic Regression handle large feature sets (especially for NLP) better than others. Since my actual training dataset is small, Naive Bayes is a good solution to accommodate the limited information. I tried the other classifiers to experiment and learn; I hear they are rarely used in the real world because they don’t improve scores enough to justify the time and effort they take to use.

Closing the Loop

I spent Friday and Saturday connecting the dots on my project, basically building the code that runs it from start to text finish.

I had a hard time quickly finding something that would stream my gmail into my app, so I decided to just run automatic checks for new emails. When I get a new one, I open the instance of my customized vectorizer (a word counter built with the training set) and apply it to the email to get a feature set.

Then I open my stored classifier instance (also built and tuned using the training set) and pass the new email’s feature set into the classifier. The classifier returns a boolean response, and if the response is true, I craft a message and send a text saying that specific email needs a meeting location defined.
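
Pieced together, the loop looks roughly like the sketch below. This is an illustration, not my actual code: the training data is a stand-in, the pickled bytes play the role of my stored instances, and the print stands in for the real text message.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in training data; the real model is trained on labeled emails.
train_emails = ["where should we meet", "weekly newsletter",
                "meet for lunch?", "your receipt"]
train_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer().fit(train_emails)
classifier = LogisticRegression().fit(
    vectorizer.transform(train_emails), train_labels)

# Persist the fitted instances, as described above, so the pipeline
# can reopen them when a new email arrives.
stored_vec = pickle.dumps(vectorizer)
stored_clf = pickle.dumps(classifier)

def check_new_email(message):
    vec = pickle.loads(stored_vec)
    clf = pickle.loads(stored_clf)
    features = vec.transform([message])  # note the list wrapper
    if clf.predict(features)[0]:
        print("Text: this email needs a meeting location -> %s" % message)

check_new_email("hey, where should we meet tomorrow?")
```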

So now that the pipeline is built and sort of working (I had it send me text messages when an email classified as false), I need to go back through and improve my feature set and my classifier model, since nothing is classifying as true. There are lots of things I can do, and it should be another good week of learning.

Begin with the End

So we are officially kicking off personal projects for the next two weeks. It’s been a bit of a process for all the students figuring out what each of us wanted to do and finalizing it, but we got there.

Last Week
It was a review week. We went back over some concepts that will be valuable for projects, as well as for interviews, like classifiers and distributions. We also worked to determine and finalize our projects, and then we spent two full days on final assessments. We had two different data science case studies that required working through the full process, from getting the data to providing some type of recommendation/report. We also worked in teams on sample interview questions to review additional class content. On the last day there was a little bit of mutiny, and we really didn’t get much done on the assessment; most people were starting to think about projects at that point.

Projects

I am fascinated by AI. Making my computer smarter, or at least having it make some decisions it doesn’t already handle, is something I’ve been interested in since before this class. I spent the last several weeks thinking through that idea and how to translate it into a two-week project. Thankfully, mentors and instructors helped me scope it into something that seems achievable. What I focused in on is classifying whether an email is a meeting request that needs a location defined.

So if a friend or colleague wants to meet up but no place has been specified, I want my classifier to classify that email as true. This sets the stage for a bigger challenge I’d like to solve: getting the computer to figure out some meeting location options and suggest them. Still, just doing the email classification seemed a pretty attainable goal in the project timeframe.

The reality is there is a ton more I want to do, and I iterated over a long wish list of ideas, breaking them into smaller concepts and getting lots of feedback. I had help steering me towards tasks that would be realistic in the timeframe we have, and when I started defining explicitly what the process would look like from beginning to end, I landed on that one classification step.

It sounds easy and simple, but it will be a challenge because I am still learning and getting comfortable with so many components of this. Plus, it is not as easy as it sounds, especially since I will be working with a sparse dataset to start. So some of my first steps are to get the data (focused on personal email for now), clean it, and manually identify emails that would classify as true.

Then I will work on different approaches for applying natural language processing (NLP) to define features for my model. I am going to start with a standard bag of words to create features, but I will try to explore feature engineering where I explicitly define specific word and symbol groupings (a pseudo mix of NLP and regex). For anyone asking, a feature is a specific attribute (like a word or word pairing) that can help identify/classify the email. So I will work with tools and my own inspection to find common words and groupings in my emails that help classify them as true when they meet the condition.
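
As a sketch of that pseudo NLP/regex mix, a hand-built binary feature might look like the following; the phrases in the pattern are my own illustrative guesses, not the ones I will actually use:

```python
import re

# Fires when an email contains a meeting-place phrase.
MEETING_PHRASES = re.compile(
    r"(where should we meet|grab (coffee|lunch)|meet up)",
    re.IGNORECASE)

def has_meeting_phrase(email_text):
    # Returns 1/0 so it can sit alongside bag-of-words counts.
    return 1 if MEETING_PHRASES.search(email_text) else 0

print(has_meeting_phrase("Want to grab lunch Thursday?"))  # 1
```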

Once I have the features defined, I will work on building the classification model, applying techniques like cross validation and grid search. Logistic regression is a popular algorithm for classification because it’s fast to build and tends to be the best option for extremely large feature sets like those in NLP. So I plan to start there, but I want to explore some other models for comparison, since this is a great opportunity to practice.

The End State

So my goal is to get my computer to classify new emails and send a text that says, “x email needs a meeting place defined.” With that in mind, I spent the weekend looking back over an example project I did with Twilio, and I’ve adapted the code into a working function that takes a message as input and sends a text. Thus, the end state is set up, and now I just have to build the rest of what leads to generating that text message. No problem.
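
For reference, here is a hedged sketch of what such a function can look like with the twilio-python library of that era; the credentials and phone numbers are placeholders, and this is not my project’s actual code:

```python
from twilio.rest import TwilioRestClient

def send_text(message):
    # Placeholders: use your own account SID, auth token, and numbers.
    client = TwilioRestClient("ACCOUNT_SID", "AUTH_TOKEN")
    client.messages.create(body=message,
                           to="+15551234567",     # my phone
                           from_="+15557654321")  # Twilio number

send_text("x email needs a meeting place defined")
```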

It definitely helps to know where I’m going.

Note

There were some trickier bits with the setup because Anaconda (a package that loads all the data science software) and virtualenv don’t play well together. I was able to work through the conflicts and will try to put up a post about how to make them work together in the near future. If you need to know sooner rather than later, check out this link for one main step to help resolve the conflicts.


Quick Update on Zipfian and Week 7

Another short one on what was covered this week at Zipfian. Also, for anyone interested, below is a photo of the classroom to give context on where we are working from.

It’s a good space. Far in the front is the work and lecture space, and the area closest to the camera is more where we take breaks and eat.

This week was about diving into complex machine learning algorithms. Topics we covered:

  • Supervised Learning
  • SVM
  • K Nearest Neighbors
  • Decision Trees & Random Forest
  • Neural Networks
  • Time Series

I’m not going to explain the above in detail because time is a bit limited this weekend, but I recommend checking them out. An additional note: I’ve spoken to some data scientists who say they typically use less complex algorithms for their work because something like a Random Forest is difficult to implement in production.

We also submitted preliminary project proposals and worked on narrowing down our ideas. Final proposals are due next week, and then it’s a free-for-all getting started (if we haven’t already). I definitely have an idea of the direction I’m going and am working to flesh it out. I’ll provide more details on the project in a future blog post; I will say that it aligns with my interest in making computers smarter.

Next week the plan is for us to review content and run through case studies to practice our skills.

Half-way Mark – Numb Brain

At the pace we were going, the class showed signs of weariness in week 5 and officially hit the wall this past week. The instructors eased up on the exercises this week and gave us a pseudo free day to recharge yesterday.

The week was about assessing where we were, covering MapReduce, Big Data, and Flask, and thinking about projects. It may not sound like they eased up on us, but they did.

Assessment

They gave us a practice exercise the Friday of week 5 to do individually. The data was a company’s click rates on advertisements, broken down by user and location, and our goal was to recommend locations to target for future advertisements. The data was in a pretty messy state across multiple tables, so it wasn’t surprising that it took us several hours just to clean and load it, which was frustrating but very real-world. The rest of the time we analyzed the data and applied models to come up with recommendations.

On Monday, we were given an hour-long exam that involved solving small sample coding problems covering several topics we’ve gone over so far. After both assessments, we met with the instructors to go over how things were going and determine where we should focus our studies for the remainder of the course. The assessments were tough to go through, but they did help give us an understanding of how we were progressing.

MapReduce, Hadoop & EMR

MapReduce definitely seems simple at first blush and yet can be devilishly difficult. The technique is really for handling large amounts of information, which makes it a valuable tool for Big Data. You apply some type of change and/or combine data across large datasets and then reduce (consolidate) the data into results. I’ve been trying to think of a simple example to explain this concept, and finding one that is simple and quick is challenging. But what the heck, here goes…

Consider a dataset with 1M rows and only two columns of id numbers. The same id can occur multiple times in either column, and each row represents a connection, like followers on Twitter. You would use a map function to group each row of ids and pass them one at a time to the reduce function. The reduce function condenses multiple occurrences of the same id on the left side of the pair and makes it a key. Then the function condenses all the values that appeared on the right side of that key id into a list of values associated with the key.

Example Map List:

  • A B
  • A C
  • A Z
  • A W

Reduce Result:

  • A: [B, C, Z, W]

This is a really simplified example; not only can you make the functions more complex in processing results, you can also run the data through multiple MapReduce steps in a stream to further transform it. MapReduce would be used in a case like generating Twitter’s recommendations of people you should follow. It’s a lot of data to go through for a result that is recalculated regularly and needs to be produced quickly.

MapReduce is an optimized model for Big Data, and Hadoop is the framework of choice for running it because of its ability to process large datasets. In class, we used mrjob, a Python library, to write MapReduce programs, and we also worked with Hive, a data warehouse that sits on top of Hadoop and enables querying and analysis with SQL. There are other tools, like Pig, that we could have practiced with, but what we covered hit the core concepts.
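
To make the follower-grouping example concrete, here is a hedged sketch of that job written with mrjob; the class and method bodies are my own illustration, not the course code:

```python
from mrjob.job import MRJob

class GroupFollowers(MRJob):
    def mapper(self, _, line):
        # Each input line looks like "A B": A is connected to B.
        key_id, value_id = line.split()
        yield key_id, value_id

    def reducer(self, key_id, value_ids):
        # Condense all right-hand ids under their left-hand key,
        # e.g. "A B", "A C", "A Z", "A W" -> A: [B, C, Z, W]
        yield key_id, list(value_ids)

if __name__ == "__main__":
    GroupFollowers.run()
```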

Additionally, we learned how to set up an Amazon EC2 instance, a virtual computer you can use to run programs. It’s great if you want to train models that can take several hours or longer to run (especially several at once), since it frees up your local computer for shorter-term activities. For MapReduce specifically, we learned how to use Amazon EMR (Elastic MapReduce), which lets you spin up a remote Hadoop cluster to run these types of jobs. You can even distribute the computing to share the workload across multiple virtual computers on Amazon, though it can cost money depending on your processing needs.

Flask

We spent a day in class learning Flask (a Python web framework, for anyone who hasn’t read this blog before). Some students plan to build data products, and typically those are distributed online.

To clarify what a data product is, Google is an example: there is an interface, and based on the user’s interaction, stuff is done on the backend to provide data in some format. Zivi, one of the women in my Hackbright class, created Flattest Route, which is a great example of a data product. Users input where they are and where they are going, and the site generates the flattest route between those points.

So we went over Flask because it’s a simpler framework to pick up if you are putting something online. It can still be tough after only a day if you don’t have experience with frameworks, but the instructors plan to give support as needed through our projects.
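
As a taste of why Flask is the lighter-weight choice, here is a minimal sketch of a data-product-style route; the endpoint, query parameter, and coefficients are made up for illustration:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/predict")
def predict():
    # A real data product would call a trained model here; the
    # slope/intercept below are illustrative placeholders.
    head_size = float(request.args.get("head_size", 4000))
    return "Predicted brain weight: %.0f g" % (0.26 * head_size + 325)

if __name__ == "__main__":
    app.run(debug=True)
```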

Wrap-up

During our pseudo free day, we researched our projects because we will be submitting top project ideas on Monday morning. The rest of next week we will flesh out our project ideas while also learning more complex machine learning algorithms (e.g. Random Forest).

Even though we were all worn out and trying to take it a little easy, it was still an interesting and busy week content-wise.