We wrapped up our statistics deep dive on Monday with an exercise on Multi-Armed Bandits (MAB) and focused the rest of the week on regression.
Main Topics Covered in Class:
- Multi-Armed Bandit
- Linear Regression
- Gradient Descent
- Cross Validation
- Final Project Overview
- Kaggle Competition
We also started using the scikit-learn package, which is primarily used for machine learning algorithms.
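To give a flavor of what working with scikit-learn looks like, here is a minimal sketch of loading one of its bundled datasets. The choice of the diabetes dataset is my own illustration, not one we necessarily used in class.

```python
# Minimal sketch: loading a bundled scikit-learn dataset.
# The diabetes dataset is my own example choice.
from sklearn.datasets import load_diabetes

data = load_diabetes()
X, y = data.data, data.target  # features and target as NumPy arrays

print(X.shape)  # 442 samples, 10 features
```

Most scikit-learn algorithms expect exactly this shape of input: a 2-D feature array `X` and a 1-D target array `y`.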
MAB / More Stats
MAB was interesting to learn about because it addresses some of the shortfalls of A/B testing. For example, A/B testing only compares two options at once, and there is potential for bias when showing an old version against a new one. MAB allows testing multiple options at the same time while generating and updating performance scores. There are a few different algorithm variations in MAB, but the basic idea is to show the best-performing option most of the time (e.g., 90%) while adding some randomization so that lower-performing options still get shown and have the opportunity to improve their scores (e.g., popularity). How often you randomly show an option affects how long it takes the performance scores to change. MAB algorithms typically beat A/B testing at picking the best option with the lowest error. This article gives some insight into MAB, but beware that the code in the article is a little wonky.
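The "show the best option most of the time, explore the rest occasionally" idea is the epsilon-greedy variation of MAB. Here is a minimal sketch of it; the three click-through rates and the epsilon value are made-up numbers for illustration, not anything from class.

```python
import random

def choose_arm(estimates, epsilon=0.1):
    """Epsilon-greedy: explore a random arm with probability epsilon,
    otherwise exploit the arm with the best estimated reward."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))  # explore
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploit

def update(estimates, counts, arm, reward):
    """Incrementally update the running average reward for the chosen arm."""
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

# Example: three versions of a page with hidden click-through rates.
true_rates = [0.05, 0.12, 0.08]
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]

random.seed(0)
for _ in range(10000):
    arm = choose_arm(estimates, epsilon=0.1)
    reward = 1 if random.random() < true_rates[arm] else 0
    update(estimates, counts, arm, reward)

print(counts)  # the best arm (index 1) ends up shown most often
```

The epsilon knob is exactly the trade-off described above: a larger epsilon means more random exploration, so performance scores adapt faster but the best option is shown less often.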
The main take-away from this week is that stats is largely about modeling what came before, what the conditions were, so you can understand things like best performers based on the past, whereas machine learning is all about predicting what is to come. When we closed out Monday's class, they said, “we are done with stats and now we are starting .. well, stats (that made me laugh), but this time with a machine learning perspective”.
Apparently linear regression (y = mx + b) is one of the simplest (and most widely known) algorithms used in machine learning and thus a good place to start. So yeah, it is about fitting a line to known data to create a model that predicts your dependent variable (typically called y, which could represent something like the price of a house), and figuring out how to minimize the residuals (~ errors) and reduce the cost function (the sum of squared errors) to improve the line's fit. There are a few different approaches to generating the model that account for cases such as too many variables and not enough actual data, or how to handle extreme outliers.
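The fit-a-line-and-minimize-the-cost idea can be sketched in a few lines of NumPy. The data points here are made up for illustration (roughly y = 2x plus noise).

```python
import numpy as np

# Made-up data, roughly y = 2x with a bit of noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# np.polyfit finds the slope m and intercept b that minimize
# the sum of squared residuals (the cost function).
m, b = np.polyfit(x, y, 1)

residuals = y - (m * x + b)          # errors between data and the fitted line
cost = np.sum(residuals ** 2)        # sum of squared errors

print(round(m, 2), round(b, 2))      # m ≈ 1.94, b ≈ 0.3
```

Any other line through these points would produce a larger `cost`; that is all "best fit" means here.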
Part of creating the prediction is determining which features/variables to use, and there are ways to assess multicollinearity (finding redundant features so you can simplify the model) and heteroscedasticity (when the spread of the errors differs across sub-populations of a feature, like age or income groups). And yeah, good luck with saying that word. We also discussed gradient descent, an alternative way of fitting a linear regression, especially when there is a large number of features; it's a faster way of finding the optimal model with that many variables. Andrew Ng provides some of the best materials explaining this concept, and I reference them further below.
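Gradient descent finds the same best-fit line iteratively instead of solving for it directly. Here is a minimal sketch on the same kind of made-up data; the learning rate and iteration count are illustrative choices of mine, not values from class.

```python
import numpy as np

# Same made-up data, roughly y = 2x with a bit of noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

m, b = 0.0, 0.0   # start with a flat line
alpha = 0.01      # learning rate (step size)

for _ in range(5000):
    error = (m * x + b) - y
    # Partial derivatives of the sum-of-squared-errors cost
    # with respect to m and b.
    grad_m = 2 * np.sum(error * x)
    grad_b = 2 * np.sum(error)
    # Step downhill on the cost surface.
    m -= alpha * grad_m
    b -= alpha * grad_b

print(round(m, 2), round(b, 2))  # converges toward the least-squares fit
```

Each iteration nudges m and b in the direction that reduces the cost, so after enough steps it lands on the same answer the closed-form fit gives, without ever inverting a big matrix, which is why it scales better to many features.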
Additionally, we learned about cross-validation and defining test and training sets to work with. Usually you set aside 20–30% of the data for testing and build the model with your training data. There are different testing approaches, such as K-fold and leave-one-out. Wikipedia provides a good description of cross-validation.
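In scikit-learn, the hold-out split and K-fold cross-validation look roughly like the sketch below. The synthetic data, the 30% test size, and the choice of 5 folds are my own illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 samples, 3 features, a known linear relationship
# plus a little noise.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

# Hold out 30% of the data as a final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 5-fold cross-validation on the training set: fit on 4 folds,
# score on the held-out fold, rotate through all 5.
scores = cross_val_score(LinearRegression(), X_train, y_train,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())  # mean R^2 across the folds
```

Leave-one-out is just the extreme case of this where K equals the number of training samples.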
Midway through the week, we talked about final projects and how to approach coming up with an idea and planning it. They grouped potential projects into data analysis vs. data product and stressed that we should focus on answering a question first before thinking about techniques. We will only have 2 weeks to do the project, and we have to submit a proposal and get it approved before officially starting. Mainly this is to get us to plan ahead so we optimize our time.
I’m starting to think about an interest I’ve had for a while, which is AI. I want to do something that gets my computer to predict and solve a problem for me before I know I have the problem. I’ve heard Android is already doing something along these lines, and I know there are a lot of commercial solutions that can do much more than I can accomplish in a couple of weeks. Still, it’s a challenge I’m interested in tackling, both to learn more about the space and because I want to find ways to make computers smarter. So I’m definitely still working through what this will look like.
Simulated Data Science Competition
One last note about the week’s activities: we competed in a simulated Kaggle competition. I’ve got a link above to the Kaggle site, but in short, they provide a contest space for data science challenges, and many companies post projects with awards for the best solution. We took an old contest and ran through an exercise of solving the problem. It was great to jump into the deep end and start thinking about how to apply all that we had learned, as well as learn how to work in a team on this type of problem. It was a stressful but fantastic exercise that reminded me of hackathons, and the plan is to have us do this weekly.
Last Thoughts & Key Tip:
I definitely feel like I’m drinking from a firehose. I was a little freaked out about it last week, but I’m getting more comfortable with the deluge of information. Our days include a couple of lectures covering relevant topics, but most of the time is spent on exercises where we try to learn concepts while also applying them. We have readings that correspond to every class, and the instructors usually don’t spend a lot of time teaching the concepts; you are expected to do a lot of research and study in and out of school. The classroom is very focused on application.
In addition, there are a ton of terms and symbols used to explain all these concepts, sometimes meaning the same thing and sometimes slightly different things, and our instructors are not shy about using all of them and presenting content in a very abstract form at an advanced level (though they give more concrete examples when asked). And when we are not learning concepts and applying them, we are doing additional side projects to learn the techniques needed to be a well-rounded data scientist, ready for working in the industry. I’m sharing this to help set expectations that this class is true to the classification of a bootcamp. They don’t make it impossible, but they do make you work for it. You just have to decide how hard you want to work for it.
And for the tip, definitely check out Andrew Ng’s Machine Learning videos on Coursera. He does a fantastic job explaining many concepts we cover.
A fellow HB alum and amazing coder, Aimee, has been kind enough to mention me on her blog a couple of times and I wanted to return the favor. She writes some great stuff about coding and data and I definitely recommend checking out her site Aimee Codes.