Category Archives: Machine Learning

PyCon 2014 – How to get started with Machine Learning

Following up on the talk I gave at PyCon 2014 in Montreal, this post explains parts of my presentation and provides a few additional clarifications. You can catch the talk at Pyvideo.org, my GitHub repo PyCon2014 holds the sample code, and the slides are on SpeakerDeck.

Machine Learning (ML) Overview

Arthur Samuel defined machine learning as the "field of study that gives computers the ability to learn without being explicitly programmed." It's about applying one or more algorithms in a program to solve the problem you are faced with and to suit the type of data you have. You create a model that helps with pattern matching and/or predicting results, then evaluate the model and iterate on it as needed until it is the right solution for the problem.

Examples of ML in the real world include handwriting analysis, which uses neural nets to read millions of pieces of mail and sort and classify all the different variations in written addresses. Weather prediction, fraud detection, search, facial recognition, and so forth are all further examples of machine learning in the wild.

Algorithms

There are several types of ML algorithms to choose from and apply to a problem, and some are listed below. They are broken into categories to suggest how to think about applying them. When choosing an algorithm, it's important to think about the goal/problem, the type of data available, and the time and effort you have to work on the solution.

[Figure: ML algorithms grouped by category]

A couple of starting points to consider are whether the learning is supervised or unsupervised. Supervised means you have actual data labeled with the results you are targeting, which can be used to train the model; spam filters, for example, are built on emails that have been labeled as spam. With unsupervised data there is no clear picture of the result: you have questions about the data and run algorithms on it to see if patterns emerge that help tell a story. Unsupervised learning is a challenging approach, and typically there isn't a single "right" answer for the solution.

In addition, whether the data is continuous (e.g. height, weight) or categorical/discrete (e.g. male/female, Canadian/American) helps determine the type of algorithm to apply. Basically it's about whether the data takes a fixed set of defined values or whether the variations in the data are nearly infinite. These are some ways to evaluate what you have and identify an approach to solve the problem.

Note that the algorithm categorization has been simplified a bit to provide context, and some algorithms do cross the above boundaries (e.g. linear regression).

Models

Once you have the data and an algorithmic approach, you can work on building a model. A model can be something as simple as an equation for a line (y=mx+b) or as complex as a neural net with many layers and nodes.

Linear regression is a simple machine learning algorithm to start with: you find the best-fit line that represents the observed data. In the talk, I showed two different examples of observed data that exhibited some type of linear trend. There was a lot of noise (the data was scattered around the graph), but there was enough of a trend to demo linear regression.

When building a model with linear regression, you want to find the optimal slope (m) and intercept (b) based on the actual data. (See, algebra is actually applicable in the real world.) The algorithm is simple enough to calculate by hand, but it's better to leverage a library like scikit-learn to calculate the best-fit line more efficiently. What you are calculating is the line that minimizes the distance between it and all the observed data points.
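As a rough sketch of what that looks like with scikit-learn (the data values here are made up for illustration, not the exact numbers from the repo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative observed data with a rough linear trend plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).reshape(-1, 1)  # feature matrix must be 2D
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

model = LinearRegression()
model.fit(x, y)

print("slope (m):", model.coef_[0])        # best-fit slope
print("intercept (b):", model.intercept_)  # best-fit intercept
```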

After generating a model, you should evaluate the performance and iterate to improve the model as needed if it is not performing as expected. For more info, I also explained linear regression in a previous post.

Prediction

When you have a good model, you can take in new data and output predictions. Those predictions can feed into some type of data product or generate results for a report or visualization.

In my presentation, I used actual head-size and brain-weight data to build a model that predicts brain weight based on head size. Since the dataset was fairly small, the model's predictive power is limited and its potential for error is higher; I went with this data because it was a demo and I wanted to keep it simple. When graphed, the observed data was spread out, which also indicates a lot of variance, so the model predicts weight with a fair amount of error.

With the linear model I built, I could feed in a head size (x) and it would calculate the predicted brain weight (y). Other models are more complex in their underlying math and application, but you will do something similar with them: build the model and then feed in new features/variables to generate some type of result.
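The prediction step itself is a single call once the model is fit; a tiny sketch (the head sizes and weights below are placeholders, not the actual dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder head sizes (cm^3) and brain weights (g), not the real data
head_size = np.array([[3738], [4261], [3777], [4177], [3585]])
brain_weight = np.array([1297, 1335, 1282, 1590, 1300])

model = LinearRegression().fit(head_size, brain_weight)

# Feed in a new head size (x) to get a predicted brain weight (y)
print(model.predict([[4100]]))
```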

To see the full code solution, check out the GitHub repository noted above. The script is written a little differently from the slides because I created functions for each of the major steps. There is also an IPython notebook that shows some of the drafts I worked through to build out the code for the presentation.

Tools

The Python stack is becoming quite popular for scientific computing because of its well-supported toolsets. Below is a list of key tools to start learning if you want to work with ML. There are many other Python libraries out there for more nuanced needs in the space, as well as other stacks to explore (R, Java, Julia). If you are trying to figure out where to start, here are my recommendations:

  • Scikit-Learn = machine learning algorithms
  • Pandas = dataframe tool
  • NumPy = matrix manipulation tool
  • SciPy = scientific computing (incl. stats)
  • Matplotlib = visualization

Skills

In order to work with ML algorithms and problems, it's important to build out your skill set in the following areas:

  • Algorithms
  • Statistics (probability, inferential, descriptive)
  • Linear Algebra (vectors & matrices)
  • Data Analysis (intuition)
  • SQL, Python, R, Java, Scala (programming)
  • Databases  & APIs (get data)

Resources

And of course, the next question is: where do I go from here? Below is a starter list of resources. I highly recommend Andrew Ng's class, and a couple of the links point to sites with more recommendations on what to check out next:

  • Andrew Ng’s Machine Learning on Coursera
  • Khan Academy (linear algebra and stats)
  • Metacademy
  • Open Source Data Science Masters
  • StackOverflow, Data Tau, Kaggle
  • Machine Learning: A Love Story
  • Collective Intelligence – Toby Segaran
  • Pattern Recognition & Machine Learning – Christopher Bishop
  • Think Stats – Allen Downey
  • Tom Mitchell
  • Mentors

One point from this list that I stressed in the talk: seek out mentors. They are out there and willing to help. You have to make it known what you want to learn and then notice when someone offers to help. Also, follow up. Don't stalk the person, but reach out to see if they will make a plan to meet you. They may only have an hour, or they may give you more time than you expect. Just ask, and if you don't get a good response or have a hard time understanding what they share, don't stop there; keep seeking out mentors. They are an invaluable resource to get you much farther, faster.

Last Point to Note

ML is not the solution for everything and can often be overkill. You have to look at the problem you are working on to determine what makes the most sense for your solution and how much data you have available. I highly recommend looking for the simple solution first before reaching for something more complex and time-consuming; sometimes regex is the right answer, and there is nothing wrong with that. As mentioned, to figure out an approach it's good to understand the problem, the data, the amount of data you have, and the time you have to turn the solution around.

Good luck in your ML pursuit.

References

These are the main references I used in putting together my talk and post.

  • Zipfian
  • Framed.io
  • “Analyzing the Analyzers” – Harlan Harris, Sean Murphy, Marck Vaisman
  • “Doing Data Science”  – Rachel Schutt & Cathy O’Neil
  • “Collective Intelligence” – Toby Segaran
  • “Some Useful Machine Learning Libraries” (blog)
  • University GPA Linear Regression Example
  • Scikit-Learn (esp. linear regression)
  • Mozy Blog
  • StackOverflow
  • Wiki 

Jeeves is Talking!

The coolest moment this week was when I figured out that I just needed to add one line of code to my program to get my computer to talk in my Jeeves project (hence the video above).
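If you're curious how little code that can be: on a Mac, one way is simply to shell out to the built-in `say` command. This is just an illustrative sketch, not necessarily the exact line in my repo:

```python
import subprocess

# One possible "one-liner": have the Mac's built-in `say` command speak the result out loud
subprocess.call(["say", "This email needs a meeting location."])
```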

This past week was all about fixing things, fine-tuning, iterating, and putting together a presentation and peripheral material for demo day, which will be Thursday. Of course there is always more I could do, but I've been continually shifting priorities based on time and the end goal.

Refresher:

In case you haven't seen the previous posts, I'm building an email classification tool. It's a binary classifier focused on determining whether an email I receive is about a meeting that needs a location picked/defined/identified (whatever word makes it clear). If the email classifies as true, then a text is sent to my phone. I have a working model in place, and the video above shows my program running through the full process and the computer telling me the results (a little lagniappe I added in addition to the text output).

Classification Accuracy:

In last week's post, I mentioned how the classification model (logistic regression) used in my product pipeline was not performing well even though it scored around ~85% accuracy. All the classification models I tested had given ~80-90% accuracy scores, and I had said the models were flawed because of the class distribution in the data: only ~15% of my data are true cases, so a model that classified every email as false would still be right ~85% of the time.

What I need to clarify is that the ROC curve I was using also provides a type of accuracy metric, but one that accounts for skewed class distribution (think of it as adjusting the 15/85 split to 50/50). So if the area under my ROC curve is greater than 50%, the classifier is correctly classifying at least some true cases, and my ROC curves had been around 70-80% on most of the models the week before.

So I investigated how the logistic regression model in my product pipeline was performing and found that I had hooked it up incorrectly. When I take in a new email message, I need to put it into a list before splitting it up into features. The way I was passing the message, each word in a single email was being treated as its own email. I figured this out when I printed the feature-set shape and the length of the original message. I just needed brackets around the email message so the program would see it as a list object; sometimes it's just that small a fix. Now my classification model works great and sends me texts on new emails that should be labeled as true.
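In code terms, the fix had roughly this shape (toy vectorizer and email text here, not the exact code from the repo):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assume a vectorizer fitted on the training emails (toy corpus here for illustration)
vectorizer = CountVectorizer()
vectorizer.fit(["let's meet for lunch", "where should we meet", "project status update"])

new_email_text = "Want to grab coffee next week? Where should we meet?"

# The fix: wrap the single message in a list so it is treated as one document
features = vectorizer.transform([new_email_text])
print(features.shape)  # (1, n_features) -- one row for one email, not one per word
```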

Features:

This week, I’ve improved my model by expanding how I build features such as using tf-idf, lemmatizing, n-grams, normalizing and few other ways to clean and consolidate the features (e.g. words).

Tf-idf is a way to give a word a high weight based on how frequently it shows up in a document but to decrease that weight if the word shows up frequently throughout all the documents (the corpus) used in the analysis. For example, it reduces the value of my name as a predictor, since my name shows up throughout the corpus and should not be weighted strongly.

Lemmatization groups different inflected forms of a word together so they can be analyzed as a single item (e.g. walk and walking). Using n-grams creates groupings of words into bi-grams, tri-grams, etc., which means that in addition to single-word features, I'm also accounting for groups of words that could be good predictors for a true case. For example, 'where should we meet' is a combination of words that can be a very strong predictor of the true case, possibly stronger than the single word 'meet'. N-grams, in some ways, allow for context.
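A rough sketch of how those pieces can fit together with scikit-learn and NLTK (this is the general shape, not my exact pipeline; the toy emails are made up):

```python
import nltk  # requires nltk.download('punkt') and nltk.download('wordnet') the first time
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(text):
    # Lemmatize each token so "walk" and "walking" collapse into one feature
    return [lemmatizer.lemmatize(token) for token in nltk.word_tokenize(text.lower())]

# Tf-idf weighting plus uni-grams, bi-grams and tri-grams
vectorizer = TfidfVectorizer(tokenizer=lemma_tokenizer, ngram_range=(1, 3))

emails = ["Where should we meet for lunch?",
          "Meeting notes are attached.",
          "Want to meet up next week?"]
X = vectorizer.fit_transform(emails)
print(X.shape)  # one row per email, one column per (weighted) n-gram
```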

There are some other techniques I used to build out my features but those mentioned above give a sense of the approach. After those changes, my ROC curve now shows ~80-90% on most classification models that I’m comparing.

There are more things I want to do with my feature development, but they are lower priority right now with such good performance results and other things taking priority with career day so close.

Code Stuff:

I spent a good chunk of time cleaning up and streamlining my code, trying to set it up so I can easily re-run the model comparison whenever I make feature changes. I also needed to make sure I consistently split up the data used for cross validation in my model comparisons. Cross validation is a way to use part of the data to build the model while holding a set of data aside to test and validate performance. So I got my code into a good enough state that it's easy to re-run and expand, and that there is some validity to the scores it produces. Plus, cleaner code helps me understand it when I go back to add things in.
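The split itself is a one-liner with scikit-learn; a minimal sketch with made-up stand-ins for my feature matrix and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

# Toy stand-ins for the email feature matrix and true/false labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])

# Hold out ~30% of the data to validate performance on emails the model never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)
```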

And if you want to check out the code for my project, you can find it in my Code_Name_Jeeves GitHub repository.

Next Steps:

Depending on time, I have other feature ideas, such as adding a simple binary feature for whether a date is referenced in the email message. I'd also like to run another grid search on the data pipeline to help fine-tune parameters. More importantly, adding more data to my training set would be a great value add, and testing my product with a different dataset would help validate performance. Of course, if there were more time than a couple of days it would be great to build this out so my computer gives me location recommendations, but that will have to wait for another time.

Last Note / Official End Result:

If you noticed a buzzing sound at the end of the video, that is my phone receiving the text message it should get (see below).

[Photo: the text message received on my phone]

Zipfian Project Week 1 & Closing the Loop

One week down for our final projects and one week left to go. This has definitely been the most fun and rewarding week so far because I've been connecting the dots in more ways than one.

Everyone talks about getting your minimum viable product (MVP) done for final projects like this. Thankfully I had enough experience with Hackbright to know what that meant and how to approach it. I got my full product pipeline built between last Saturday and yesterday. I've tested that it works (well, pseudo-works), and it feels really good to see the emails get pulled in and analyzed, and to get a text on my phone if the condition is true. Now I really need to fix my actual classifier because it is just not that accurate.

I named my project Code Name Jeeves, in case you are looking for it on GitHub. A lot of things inspired me to want to make my computer smarter, but the Iron Man movies with the Jarvis computer assistant gave me one of those moments where I distinctly remember thinking, "why can't I just talk to my computer yet?" (and not in a Siri way). Basically, why is my computer not doing more things for me than it does right now? Thus the name; really, I didn't want to spend a ton of time thinking of a name.

So this past week was pretty productive:

  • Evaluated and applied email packages to pull gmail data
  • Set up Postgres data storage
  • Labeled emails that would classify as true
  • Applied vectorizer to generate a feature set (e.g. bag of words)
  • Tested several classifier models
  • Reworked the code to close the loop on my full product

Data / Emails

I spent the first two days experimenting with a couple of packages to pull Gmail (most are built on IMAP), and I ended up picking the package by Charlie Guo. It's simple enough to understand and apply quickly, but has enough functionality to let me do some data manipulation when pulling the emails.

I made a quick decision during that time to go with Postgres for primary data storage. I'm a fan of building with "going live" in mind just in case, and since I know Heroku, I knew Postgres was a top option for persisting data. The reality is that it's a bit supercharged for the amount of data I'm storing right now, but the setup is there and available as I need it. Plus, it was good practice writing SQL to set up tables and to store and access data.

I also spent some time going through my emails and manually labeling them. The email package I used to pull the data made it possible to label emails in Gmail that should classify as true and then store that labeling in my database. Basically, I added a column to my database that puts True in the cell of an email's row if the email had the label I gave it in Gmail.

I had to manually build out my training dataset because I need it to build my classification model. The manual labeling is a bit of a hindrance because it takes time, and we have limited time for this effort. I could crowdsource it, but I don't want to give my emails out. I could use the Enron dataset and crowdsource labeling those emails, but that feels like overkill for these two weeks and I really don't want to use Enron's data. I knew this would be an issue, and I'm working to adjust for it where I can (especially in my features and classifier model).

Features

After getting the data, I spent a day analyzing what I had and building out a feature set. For natural language processing, a simple feature set can be just counting up word occurrences in the training dataset. This can be expanded further, but I opted to keep it simple to start so I could keep building out my pipeline.
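In scikit-learn terms, that word-count feature set is a plain bag of words; a small sketch with toy emails standing in for my real data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training emails standing in for the real Gmail data
emails = ["where should we meet for lunch",
          "meeting agenda attached",
          "want to meet up friday, any place in mind?"]

vectorizer = CountVectorizer()        # bag of words: one column per word, counts as values
X = vectorizer.fit_transform(emails)  # sparse matrix with one row per email

print(X.shape)
print(sorted(vectorizer.vocabulary_))  # the words that became features
```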

To further explain features, think of them as variables that help predict the unknown. If this were a linear model like y = mx + b, the features are the x variables that help define what y will look like, and the classifier defines the coefficients of the model, which in this case would be m and b.

Classifier

I spent the last couple of days of the week exploring as many classifiers as possible in the scikit-learn package. Ones I tried:

  • Logistic Regression
  • Naive Bayes (Gaussian, Multinomial, Bernoulli)
  • SVC
  • Random Forest
  • Ada Boost
  • Gradient Boost

Initially I just ran the standard models without any tuning. To run them, I pass in my training set of X features and y labels (my manual labeling of whether each email should classify as true or false).
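The comparison loop is conceptually just this (random toy data below stands in for the real email features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Random stand-ins for the email feature matrix (X) and manual true/false labels (y)
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boost": GradientBoostingClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))  # mean accuracy on the held-out data
```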

Gaussian Naive Bayes, Random Forest and Gradient Boost all did pretty well, with accuracy scores from 80-90% and area under the ROC curve of 70-80%. The reality was that they were doing great at classifying my emails as false (not meeting my condition) because about 80% of my training data was false, so a model that classified everything as false would still be 80% correct.

One thing my instructors helped me appreciate is that when classifying the emails, I would rather get an email that should be false classified as true (a false positive in the confusion matrix) than miss an email that is classified as false but should be true (a false negative). This is similar to how spam filters are tuned: classifying the wrong email as spam is worse than getting a little bit of spam in your inbox.

I also worked on applying a grid search for practice, which is an approach for testing a variety of parameters to tune the models and improve accuracy scores. Through tuning, I was able to improve Logistic Regression, Multinomial Naive Bayes and SVC into the 90% accuracy range.
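Grid search with scikit-learn looks roughly like this (the parameter grid and data are just illustrative, not my actual tuning setup):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real email features and labels
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)

# Example grid: try a few regularization strengths for logistic regression
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="roc_auc")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```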

As mentioned, certain models like Logistic Regression handle large feature sets (especially for NLP) better than others. Since my actual training dataset is small, Naive Bayes is a good solution to accommodate the limited information. I tried the other classifiers to experiment and learn; I hear they are rarely used in the real world because they don't improve scores enough to justify the time and effort they take to use.

Closing the Loop

I spent Friday and Saturday connecting the dots on my project, basically building the code that runs it from start to text-message finish.

I had a hard time quickly finding something that would stream my Gmail through my app, so I decided to just run automatic checks for new emails. When I get them, I open the instance of my customized vectorizer (the word counter built with the training set) and apply it to the email to get a feature set.

Then I open my stored classifier instance (also built and tuned using the training set) and pass the new email's feature set into the classifier. The classifier returns a boolean response, and if the response is true, I craft a message and send a text saying that the specific email needs a meeting location defined.
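Stripped of detail, that check looks something like this (file names and the `send_text` helper are placeholders, not the exact code in the repo):

```python
import pickle

def send_text(message):
    # Placeholder for the Twilio text helper described in an earlier post
    print("TEXT:", message)

def check_new_email(message_text):
    # Load the vectorizer and classifier instances that were built/tuned on the training set
    with open("vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    with open("classifier.pkl", "rb") as f:
        classifier = pickle.load(f)

    # Wrap the single message in a list so it is treated as one document
    features = vectorizer.transform([message_text])

    # Boolean-ish response: does this email need a meeting location?
    if classifier.predict(features)[0]:
        send_text("This email needs a meeting location defined: " + message_text[:50])
```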

So now that the pipeline is built and sort of working (I had it send me text messages when an email classifies as false), I need to go back through and improve my feature set and my classifier model since nothing is classifying as true. There are lots of things I can do and it should be another good week of learning.

Begin with the End

So we are officially kicking off personal projects for the next two weeks. It's been a bit of a process for all the students to figure out what each of us wanted to do and finalize it, but we got there.

Last Week
It was a review week. We went back over concepts that will be valuable for projects as well as for interviews, like classifiers and distributions. We also worked to determine and finalize our projects, and then we spent two full days on final assessments. We had two different data science case studies that required working through the full process, from getting the data to providing some type of recommendation or report. We also worked in teams on sample interview questions to review additional class content. On the last day there was a little bit of mutiny and we really didn't get much done on the assessment; most people were starting to think about projects at that point.

Projects

I am fascinated by AI. Making my computer smarter, or at least having it make some decisions it doesn't already handle, is something I was interested in before this class. I spent the last several weeks thinking through that idea and how to translate it into a two-week project. Thankfully, mentors and instructors helped me scope it into something that seems achievable. What I focused in on is classifying whether an email is a meeting request that needs a location defined.

So if a friend or colleague wants to meet up but no place has been specified, I want my classifier to classify that email as true. This sets the stage for a bigger challenge I'd like to solve, which is to get the computer to figure out some meeting location options and provide them. Still, just doing the email classification seemed an attainable goal in the project timeframe.

The reality is there is a ton more I want to do. I iterated over a long wish list of ideas, breaking them into smaller concepts and getting lots of feedback. I had help steering me toward tasks for my computer that would be realistic in the timeframe we have, and when I started explicitly defining what the process would look like from beginning to end, I landed on that one classification step.

It sounds easy and simple, but it will be a challenge because I am still learning and getting comfortable with so many components of this. Plus, it is not as easy as it sounds, especially since I will be working with a sparse dataset to start. Some of my first steps are to get the data (focused on my personal email for now) and clean it, as well as to manually identify emails that would classify as true.

Then I will work on different approaches for applying natural language processing (NLP) to define features for my model. I am going to start with a standard bag of words to create features, but will also explore feature engineering where I explicitly define specific word and symbol groupings (a pseudo mix of NLP and regex). For anyone asking, a feature is a specific attribute (like a word or word pairing) that can help identify/classify the email. So I will use tools and my own inspection to find common words and groupings in my emails that would help classify them as true when they meet the condition.

Once I have the features defined, I will work on building the classification model, applying techniques like cross validation and grid search. Logistic regression is a popular algorithm for classification because it's fast to build and tends to be a strong option for the extremely large feature sets common in NLP, so I plan to start there, but I want to explore some of the other models for comparison since this is a great opportunity to practice.

The End State

So my goal is to get my computer to classify new emails and send a text that says, "x email needs a meeting place defined." With that said, I spent the weekend looking back over an example project I did with Twilio, and I've adapted the code into a working function that takes a message as input and sends a text. Thus, the end state is set up, and now I just have to build the rest of the pipeline that leads to generating that text message. No problem.
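The function is essentially the standard Twilio example; a sketch of its shape (the credentials and phone numbers below are placeholders, and older versions of the library used TwilioRestClient instead of Client):

```python
from twilio.rest import Client  # older versions of the twilio library used TwilioRestClient

ACCOUNT_SID = "your_account_sid"   # placeholder
AUTH_TOKEN = "your_auth_token"     # placeholder

def send_text(message):
    # Send `message` as an SMS from a Twilio number to my phone (numbers are placeholders)
    client = Client(ACCOUNT_SID, AUTH_TOKEN)
    client.messages.create(to="+15551234567", from_="+15557654321", body=message)

send_text("x email needs a meeting place defined")
```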

It definitely helps to know where I’m going.

Note

There were some trickier bits with the setup because Anaconda (a distribution that bundles the data science software) and virtualenv don't play well together. I was able to work through the conflicts and will try to put up a post about how to make them work together in the near future. If you need to know sooner rather than later, check out this link for one main step to help resolve the conflicts.


Deep Learning Surface

Deep learning is a tool in the machine learning (ML) toolbelt, which in turn is a tool in the AI and data science toolbelts. Think of it as a subset of a larger picture of algorithms, and its area of expertise is solving some of the more complex problems out there, like natural language processing (NLP), computer vision and automatic speech recognition (ASR), for example when you talk to the customer service computer voice on the phone instead of pushing buttons.

Why am I writing about this? Because it was the topic of my tech talk at Zipfian this week. I chose deep learning because I have an interest in making technology smarter, and I was clued into this area of ML as one of the more advanced ways of getting computers to act in a more intelligent and human way.

My research was only able to skim the surface because it is an involved topic that would take some time to study above and beyond what Zipfian is covering. Below is a summary of some key points I covered in my talk and additional insights. Also, the presentation slides are at this link.

Deep Learning in a nutshell:

  • Learning algorithms that model high level abstraction
  • Neural networks with many layers are the main structures
  • Term coined in 2006 when Geoff Hinton proved neural net impact

Experts

To their credit, Hinton and others have been working on research in this field since the '80s despite a lack of interest and minimal funding. It's been a hard road for them that has finally started to pay off. AI and neural networks in general have been explored since the '50s, but the biggest problem in the space has been computer speed and power, and it really wasn't until the last decade that significant progress and impact have been seen. For example, Google has a project called Brain that can identify specific subjects in videos (like cats).

I mention Hinton because he's seen as a central driver of deep learning and many look to him to see what's next. He also organized the Neural Computation and Adaptive Perception (NCAP) group in 2004, an invite-only group with some of the top researchers and talent in the field, with the goal of moving deep learning research forward faster. Many of those NCAP members have been hired in the last few years by some of the top companies diving deep into the research. For example:

  • Hinton and Andrew Ng at Google
  • Yann LeCun at Facebook
  • Terrence Sejnowski at the US BRAIN Initiative

It's a field that technically has been around for a while but is really taking off with what technology is capable of now.

Structure

Regarding structure, neural networks are complex and were originally modeled after the brain. They are highly connected nodes (processing elements) that process inputs based on statistical, adaptive weights. Basically, you pass in some chaotic set of inputs (with lots of noise) and the neural net puts them together as an output. It's assembling a puzzle out of all the pieces you feed it.

Below is a diagram of a neural net from a presentation Hinton posted.

The overall goal of neural networks is feature engineering: defining the key attributes/characteristics of the pieces that make up the puzzle you are constructing, and determining how to weight and use them to drive the overall result. For example, a feature could be a flat edge, and you would weight nodes (apply rules) to place those pieces as the boundary of the puzzle. The nodes would have some idea of how to pick up pieces and put them down to create the puzzle.

To define weights for the nodes, the neural net model is pre-trained on how to put the puzzle together, and the pre-training is driven by an objective function. An objective function is a mathematical optimization tool for selecting the best element from a set of available alternatives, and it changes depending on the goals of the network. For example, you will have a different set of objectives for automatic speech recognition for an audience in the US versus Australia, and those objectives help adjust node weights with each training example to improve the output.

A couple of other concepts regarding neural nets and deep learning are feedforward and backpropagation (backward propagation of errors). A feedforward structure passes input through a single layer of nodes, treats the inputs independently, and learns in an unsupervised way. Nodes can't see which pieces the other nodes are holding and can only use their pre-trained weights to put the pieces in the places they think are best for the output. Restricted Boltzmann machines and denoising autoencoders are examples of feedforward structures.

Backpropagation applies to multi-layered/stacked structures and is supervised: it tweaks all the weights in the neural network based on the outputs and the defined labels for the data. Backprop can look at the output of the nodes at different points in the process of constructing the final picture (seeing how the pieces are starting to fit together). If the picture seems to have errors or pieces not coming together, it adjusts weights in nodes throughout the network to improve the results. Gradient descent is the optimization technique regularly used with backprop to make those weight updates. Example backprop neural networks include deep belief networks and convolutional neural networks (the latter regularly used in image and video processing).

Last Thoughts

So I, for one, would love to see a Data from Star Trek: The Next Generation or a Samantha from Her, but neural nets are a far-off step from creating that level of "smart tech." Plus, as mentioned above, they are one tool in the bigger picture of AI. They are a very cool tool and definitely beat out other algorithms in the complexity of problems they can solve. They are fantastic at classification, prediction, pattern recognition and optimization, but they are weak in areas like making logical inferences, integrating abstract knowledge (e.g. 'sibling' or 'identical to') and making sense of stories.

On the whole, deep learning is a fascinating space for the problems it can handle and continues to solve. It will be interesting to see what problems it tackles next (especially with such big names putting research dollars behind it). Below are the references I used to put together this overview, and there is plenty more material on the web for additional information.

References

Below are the references I used while researching the topic. It's not an exhaustive list, but it is a good start.

Side Note on Zipfian

On the whole it was another hectic week. In a very short note, we covered graph theory, NetworkX, the k-means algorithm, and clustering overall. There was a lot more detail to all of that, but considering my coverage above, I'm leaving the insight at that for this week.

Machine Learning Starts with Linear Regression

We wrapped up our statistics deep dive on Monday with an exercise around the Multi-Armed Bandit (MAB) and focused the rest of the week on regression.

Main Topics Covered in Class:

  • Multi-Armed Bandit
  • Linear Regression
  • Gradient Descent
  • Cross Validation
  • Final Project Overview
  • Kaggle Competition

We also added in the scikit-learn package, which is primarily used for machine learning algorithms.

MAB / More Stats

MAB was interesting to learn about because its goal is to address some of the shortfalls of A/B testing. For example, A/B testing only compares two options at once, and there is potential for bias when showing an old version against a new version. MAB allows testing multiple options at the same time while generating and updating performance scores. There are a couple of different algorithm variations in MAB, but the basic idea is to show the best-performing option most of the time (e.g. 90%) and to show a lower-performing option some random fraction of the time so the other options have a chance to improve their performance scores. How often you randomly show an option affects how long it takes the performance scores to change. The MAB algorithms typically beat A/B testing at picking the best option with the lowest error. This article gives some insight into MAB, but beware that the code in the article is a little wonky.
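The simplest version of that idea is the epsilon-greedy strategy, sketched below (the conversion rates are made up for illustration, not taken from the article or class):

```python
import random

# Hypothetical conversion rates for three page variants (unknown to the algorithm)
true_rates = [0.05, 0.11, 0.08]
counts = [0, 0, 0]   # times each variant has been shown
wins = [0, 0, 0]     # times each variant converted
epsilon = 0.10       # explore a random variant 10% of the time, exploit the best 90% of the time

for _ in range(10000):
    if random.random() < epsilon:
        choice = random.randrange(len(true_rates))  # show a random option
    else:
        # Show the best performer so far (untried options get an optimistic score of 1.0)
        scores = [wins[i] / counts[i] if counts[i] else 1.0 for i in range(len(true_rates))]
        choice = scores.index(max(scores))
    counts[choice] += 1
    if random.random() < true_rates[choice]:  # simulate whether this impression converted
        wins[choice] += 1

print(counts)  # the best-performing variant should end up shown the most
```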

The main take-away from the week is that stats talks a lot about what came before, modeling what the conditions were so you can understand things like the best performers based on the past, whereas machine learning is all about predicting what is to come. When we closed out Monday in class, the instructors said, "we are done with stats and now we are starting... well, stats (that made me laugh), but this time with a machine learning perspective."

Machine Learning

Apparently linear regression (y = mx + b) is one of the simplest (and most widely known) algorithms used in machine learning and thus a good place to start. It is about fitting a line to known data to create a model that predicts your dependent variable (typically called y, which could represent something like the price of a house) and figuring out how to minimize the residuals (roughly, the errors) and/or reduce the cost function (the sum of squared errors) to improve the line's fit. There are a couple of different approaches to generating the model that account for cases such as too many variables with not enough actual data, and/or how to handle extreme outliers.
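As a toy illustration of what "minimizing the cost function" looks like, here is a tiny gradient descent sketch for fitting that line (made-up numbers, not a class exercise):

```python
import numpy as np

# Made-up data with a roughly linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

m, b = 0.0, 0.0        # start with a flat line
learning_rate = 0.01

for _ in range(5000):
    errors = (m * x + b) - y                   # residuals of the current line
    # Gradient of the mean squared error cost with respect to m and b
    m -= learning_rate * 2 * np.mean(errors * x)
    b -= learning_rate * 2 * np.mean(errors)

print(m, b)  # should land close to a slope of ~2 and an intercept near 0
```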

Part of creating the prediction is determining which features/variables to use, and there are ways to assess multicollinearity (finding redundant features so you can simplify the model) and heteroscedasticity (when there are sub-populations in features like age and income). And yeah, good luck with saying that word. We also discussed an alternative approach for fitting the regression when there is a large number of features; it's a faster way of finding the optimal model with so many variables. Andrew Ng provides some of the best materials explaining this concept, and I reference him further below.

Additionally, we learned about cross-validation and defining test and training sets to work with. Usually you want to set aside 20-30% of the data for testing and build the model with your training data. There are different approaches to testing, such as K-fold and leave-one-out, and the Wikipedia article on cross-validation provides a good description.

Final Projects

Midway through the week, we talked about final projects and how to approach coming up with an idea and planning it. The instructors grouped potential projects into data analysis versus data product and stressed that we should focus on answering a question before thinking about techniques. We will only have two weeks to do the project, and we have to come up with a proposal and get it approved before officially starting; mainly this is to get us to plan ahead so we optimize our time.

I'm starting to think about an interest I've had for a while, which is AI. I want to do something that gets my computer to predict and solve a problem for me before I know I have the problem. I've heard Android is already doing something along these lines, and I know there are a lot of commercial solutions that can do much more than I can accomplish in a couple of weeks. Still, it's a challenge I'm interested in tackling, to learn more about the space and because I want to find ways to make computers smarter. So I'm definitely working through what this will look like.

Simulated Data Science Competition

A last note about the week's activities is that we competed in a simulated Kaggle competition. I've linked to the Kaggle site above; it primarily provides a contest space for data science challenges where companies post projects and awards for the best solution. We took an old contest and ran through an exercise of solving the problem. It was great to jump into the deep end and start thinking about how to apply all that we had learned, as well as how to work in a team to solve this type of problem. It was a stressful but fantastic exercise that reminded me of hackathons, and the plan is to have us do this weekly.

Last Thoughts & Key Tip:

I definitely feel like I'm drinking from a firehose. I was a little freaked out about it last week, but I'm getting more comfortable with the deluge of information. Our days include a couple of lectures that cover relevant topics, but most of the time is spent on exercises where we try to learn concepts while also applying them. We have readings that correspond to every class, and the instructors usually don't spend a lot of time teaching the concepts; you are expected to do a lot of research and study in and out of school. The classroom is very focused on application.

In addition, there are a ton of terms and symbols used to explain all these concepts that sometimes mean the same thing or slightly different things, and our instructors are not shy about using all of them and presenting content in a very abstract, advanced form (as well as giving more concrete examples when asked). And when we are not learning concepts and applying them, we are doing additional side projects to learn the techniques needed to be a well-rounded data scientist, ready for working in the industry. I'm sharing this to help set expectations that this class is true to the classification of a bootcamp. They don't make it impossible, but they do make you work for it; you just have to decide how hard you want to work.

And for the tip, definitely check out Andrew Ng’s Machine Learning videos on Coursera. He does a fantastic job explaining many concepts we cover.

Side Note:

A fellow HB alum and amazing coder, Aimee, has been kind enough to mention me on her blog a couple of times and I wanted to return the favor. She writes some great stuff about coding and data and I definitely recommend checking out her site Aimee Codes.