At the pace we were going, the class showed signs of weariness in week 5 and officially hit the wall this past week. The instructors eased up on us in exercises this week, and gave us a pseudo free day to recharge yesterday.
The week was about assessing where we were, covering MapReduce, Big Data and Flask, and thinking about projects. It may not sound like they eased up on us but they did.
They gave us a practice exercise the Friday of week 5 to do individually. The data was a company’s click rates on an advertisement, broken down by user and location, and our goal was to recommend locations to target for future advertisements. The data was in a pretty messy state across multiple tables, so not surprisingly it took us several hours just to clean and load it, which was pretty frustrating but very real world. The rest of the time we analyzed the data and applied models to come up with recommendations.
On Monday, we were given an hour-long exam made up of small sample coding problems covering several topics we’ve gone over so far. After both assessments, we met with the instructors to go over how things were going and determine where we should focus our studies for the remainder of the course. The assessments were tough to go through but they did help give us an understanding of how we were progressing.
MapReduce, Hadoop & EMR
MapReduce definitely seems so simple at first blush and yet can be devilishly difficult. This technique is really for handling large amounts of information, which makes it a valuable tool for Big Data. You apply some type of change to and/or combine data across large datasets (map) and then reduce (consolidate) the data down to the results. I’ve been trying to think of a simple example to explain this concept, and coming up with one that is both simple and quick is challenging. But what the heck, here goes…
Consider a dataset with 1M rows and only two columns, both holding id numbers. The same id can appear multiple times in either column, and each row represents a connection, like a follower relationship on Twitter. You would use a map function to turn each row into a key-value pair, with the id on the left as the key and the id on the right as the value. The pairs are then grouped by key and each group is passed to the reduce function. The reduce function condenses all the values that appeared on the right side of a given key id into a single list of values associated with that key id.
Example Map List:
- A B
- A C
- A Z
- A W
Example Reduce Result:
- A: [B, C, Z, W]
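The example above can be sketched in plain Python. This is only an illustration of the idea, not how Hadoop runs it: the function names (`map_row`, `shuffle`, `reduce_ids`) are made up for this post, and the grouping step that a real framework handles for you is written out by hand.

```python
from collections import defaultdict


def map_row(row):
    """Map step: turn one 'left right' row into a (key, value) pair."""
    left_id, right_id = row.split()
    return left_id, right_id


def shuffle(pairs):
    """Group values by key. A real framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_ids(key, values):
    """Reduce step: condense all the values for one key into a single record."""
    return key, list(values)


rows = ["A B", "A C", "A Z", "A W"]
pairs = [map_row(row) for row in rows]
result = dict(reduce_ids(key, values) for key, values in shuffle(pairs).items())
print(result)
```

With 1M rows the shape stays the same. The difference is that the map and reduce calls get spread across many machines instead of running in one loop.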
This is a really simplified example and not only can you make the functions more complex in processing results, you can run the data through multiple MapReduce functions in a stream to further adjust the data. MapReduce would be used in a case like generating Twitter’s recommended people you should follow. It’s a lot of data to go through for a result that is calculated regularly and needs to be produced quickly.
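To give a feel for running data through more than one MapReduce step, here is a rough sketch in plain Python. The helper `run_mapreduce` and the sample rows are hypothetical and only stand in for what a framework would do. The first pass builds the list of connections per id, and the second pass takes that output and groups ids by how many connections they have.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy single-machine runner: map every record, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return [reducer(key, values) for key, values in grouped.items()]


# Pass 1: collect the ids each key id is connected to.
def edge_mapper(row):
    left_id, right_id = row.split()
    yield left_id, right_id


def list_reducer(key, values):
    return key, sorted(values)


# Pass 2: group ids by their number of connections.
def size_mapper(record):
    key, connections = record
    yield len(connections), key


def bucket_reducer(count, ids):
    return count, sorted(ids)


rows = ["A B", "A C", "B C", "A Z", "C A"]
first_pass = run_mapreduce(rows, edge_mapper, list_reducer)
second_pass = run_mapreduce(first_pass, size_mapper, bucket_reducer)
print(first_pass)
print(second_pass)
```

The output of one step is just the input of the next, which is why steps can be chained as many times as the problem needs.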
MapReduce is an optimized model for Big Data and Hadoop is the framework of choice to run the model on because of its ability to handle processing large datasets. In class, we used mrjob, a Python library, to write MapReduce programs, and we also worked with Hive, which is a data warehouse that sits on top of Hadoop and enables querying and running analysis with SQL. There are many other tools like Pig that we could have practiced using but what we covered still hit the core concepts.
Additionally, we learned how to set up an Amazon EC2 instance, which is a virtual computer that you can use to run programs. It’s great if you want to train models that can take several hours or longer to run (especially if you want to do several at once). It will free up your local computer for shorter term activities. More specifically regarding MapReduce, we learned how to use Amazon EMR (Elastic MapReduce), which allows you to spin up a remote Hadoop cluster to run these types of jobs. You can even do distributed computing to share the workload across multiple virtual computers on Amazon, but it can cost money depending on your processing needs.
We spent a day in class learning Flask (Python web framework for anyone who hasn’t read this blog before). There are students who plan to build data products and typically those are distributed online.
To clarify what a data product is, Google is an example. There is an interface, and based on the user’s interaction, work is done on the backend to return data to you in some type of format. Zivi, one of the women in my Hackbright class, created Flattest Route, which is a great example of a data product. Users input where they are and where they are going, and the site generates the flattest route between those points.
So we went over Flask because it’s a simpler framework to pick up if you are putting something online. It can still be a bit tough to pick up in only a day if you don’t have experience with frameworks, but the instructors plan to give support as needed through our projects.
During our pseudo free day, we researched our projects because we will be submitting top project ideas on Monday morning. The rest of next week we will work on fleshing out our project ideas while also learning more complex machine learning algorithms (e.g. Random Forest).
Even though we were all worn out and tried to take it a little easy, it was still an interesting and busy week regarding content.