The first week of Zipfian is already done, and it reminds me of how fast Hackbright felt while it was happening. The focus for the week was exposing us to the core tools we will use, as well as the main activities and processes involved in working with data.
The main tools this week were Python, IPython, Git and Bash, and we went through three different exercises that involved gathering, cleaning, exploring and sometimes reporting on data. A large part of our exercises throughout the program will be done in Python, and we spent 4 of the 5 days using it. This is a bit of a shift for the school, which split its time more evenly with R in the last session; the change reflects the growing popularity of Python for data science. There’s a great article I read recently on the subject at R-bloggers. We will still use R, but the emphasis is now more on Python.
We also used Git and GitHub throughout the week for revision control, and this will be daily for the whole program. Zipfian keeps its content in private repositories, but where possible I will try to share some of the projects on my GitHub. A number of the resources we are using are public, and several of them are referenced in a great open-source GitHub repository by clarecorthell that provides a free path into data science (the Open-Source Data Science Master Curriculum).
Another tool we started using this week, which I’m really getting addicted to, is IPython. It has nothing to do with Apple; it is a very helpful kernel for practicing Python on the fly, and its notebook (a browser GUI) is user friendly when you want to test functions and bits of code in isolation. Some resources to help you get started with IPython, beyond its official site, are a tips site and an advanced tips post.
As mentioned in the last post, the first day was spent practicing Git and how we will use it throughout the course, as well as running through a few practice Python exercises. One problem had us code functions computing probabilities with both the frequentist approach to statistical inference and the Bayesian approach. Spoiler alert for those who haven’t seen the term frequentist before: it’s basically the fraction of the number of times something happened over the total number of times it could have happened (e.g. 4/5 days spent using Python).
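That frequentist fraction is simple enough to sketch in a few lines of Python; the `frequentist_probability` name and the 4-out-of-5 example are mine, not the actual exercise code:

```python
# Minimal sketch of the frequentist estimate from the exercise:
# probability = (times the event occurred) / (total trials).

def frequentist_probability(occurrences, trials):
    """Estimate P(event) as the observed fraction of occurrences."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    return occurrences / trials

# e.g. 4 of our 5 days were spent using Python
print(frequentist_probability(4, 5))  # → 0.8
```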
The second day we wrote bash scripts all day, working with a massive data file that we learned how to parse, clean, and split into smaller files, and then strip out specific bits of info to create URLs that we pulled data from. It was a great exercise in exploring what you can do with bash alone, as well as a first experience in pulling and exploring data.
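We did the day’s work in bash, but the same parse-then-build-URLs idea translates directly to Python. Here’s a hedged sketch where the tab-separated layout and the URL template are invented for illustration, not the actual exercise data:

```python
# Hypothetical sketch: clean lines from a large tab-separated file,
# pull out an ID field, and turn each ID into a URL to fetch later.

def extract_urls(lines, id_column=0, template="http://example.com/items/{}"):
    """Build one URL per non-empty record from the field at id_column."""
    urls = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # basic cleaning: drop blank lines
        fields = line.split("\t")
        urls.append(template.format(fields[id_column]))
    return urls

records = ["a123\tsome data", "", "b456\tmore data"]
print(extract_urls(records))
# → ['http://example.com/items/a123', 'http://example.com/items/b456']
```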
Wednesday and part of Thursday we took most of what we did in bash and repeated it with Python. We spent the rest of Thursday and Friday building a recommender. It was the Netflix exercise, where you have a data set of users’ movie reviews and you want to recommend new movies to a user based on her/his past preferences. Funnily enough, I had spent week 5 and part of week 6 at Hackbright building the actual web framework for the Netflix exercise, where we were given the Pearson equation to apply for the recommender (which had similar results). Here we were building the recommender itself and leaving out the framework.
We used the Euclidean distance formula on existing product ratings to create a product-to-product similarity matrix based on all users’ ratings. We learned how to use NumPy to create and manipulate matrices, and then we normalized the data to obtain weighted ratings on products based on a given user’s specific tastes. Finally, we output the top-rated recommendations for that user. During the exercise we also used Matplotlib to visualize the data and check whether it looked directionally accurate against what we expected.
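The steps above can be sketched roughly as follows. The tiny ratings matrix is made up, and converting Euclidean distance into a similarity via 1/(1 + distance) is one common choice, not necessarily the exact formula we used in class:

```python
import numpy as np

# rows = users, columns = movies; 0.0 means "not yet rated"
ratings = np.array([
    [5.0, 0.0, 0.0, 1.0],
    [4.0, 2.0, 4.0, 1.0],
    [1.0, 1.0, 2.0, 5.0],
])

def item_similarity(r):
    """Item-to-item similarity from Euclidean distance between rating columns."""
    n_items = r.shape[1]
    sim = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            dist = np.linalg.norm(r[:, i] - r[:, j])
            sim[i, j] = 1.0 / (1.0 + dist)  # identical items -> similarity 1.0
    return sim

def recommend(r, sim, user, top_n=2):
    """Score unrated items by similarity-weighted ratings, normalized."""
    scores = sim.dot(r[user])                      # weighted sum of the user's ratings
    weight = sim.dot((r[user] > 0).astype(float))  # total similarity that contributed
    predicted = np.where(weight > 0, scores / weight, 0.0)
    predicted[r[user] > 0] = 0.0                   # don't re-recommend rated movies
    return list(np.argsort(predicted)[::-1][:top_n])

sim = item_similarity(ratings)
print(recommend(ratings, sim, user=0))  # indices of the top unrated movies
```

In real code the double loop would give way to NumPy broadcasting or `scipy.spatial.distance.cdist`, but the explicit version keeps the distance-to-similarity step visible.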
I was reminded this week that one of the hardest parts of bootcamps is having the stamina to get through them. Sitting and learning for 9 to 12 hours straight (breaking for lunch, of course) for at least 5 days in a row can wear you out on its own. And doing that while talking and working with another person almost the whole time can be just as exhausting (especially for introverts, which many who do this tend to be). It’s like a marathon, and it’s usually your head that gets in the way of sustaining the effort. It’s also like a marathon in that you have to pace yourself. I felt it at the end of the day Thursday, when my head was just full and didn’t want to brain anymore, and all I was good for at that point was sleeping. That, of course, came after pushing myself to keep reading and coding late on Monday, Tuesday and Wednesday, and I’m not the only one: most of the class typically stays late each day.
Coming Up & Tips
In the next couple of weeks we will do a stats deep dive as well as machine learning. We will explore data analysis and machine learning packages like pandas, NumPy, SciPy and scikit-learn, as well as visualization tools like D3 and Matplotlib.
On the whole I really did enjoy the week, and it helped me appreciate how much Python I have learned since last year. I will say that if you want to do this program, definitely practice Python with online tutorials like Learn Python the Hard Way and Codecademy, and practice coding your own projects. And definitely start studying linear algebra and stats.