Category Archives: Education

2014 Summer == Full Time Data Science Work

For the last three months I have been working at Change.org as a data scientist and engineer. It's been a great experience so far, and I'm blown away that this is where I landed after starting this journey over a year ago.

Impostor Syndrome
I've coached others moving into engineering on how to believe in themselves and recognize that they are smarter than they think. I totally get that you want to fake the confidence till you get there. Don't be cocky; just be resolved to figure stuff out.

Still, I felt overwhelmed by impostor syndrome. The fear that the company would figure out I was a fraud and fire me during that first month was powerful in my mind. It didn't matter how much I rationally knew better. Thankfully I have a good community of people who have gone through similar experiences starting jobs, and I was able to fall back on them for support.

The feeling has subsided with time, as I expected, but it does keep me on my toes: I stay vigilant about my growth in the space and make sure I'm having a positive impact on the company.

The Company
Change.org has been an amazing experience for my first data science and engineering job. I couldn’t believe it when they put me through almost a month of training and rotation for on-boarding. It helped me get to know the team and get more comfortable working with the group pretty quickly. I’ve heard of some companies doing this for their employees and it shows how much the company is invested in you.

The people are extremely friendly, welcoming and willing to help when I have questions. It's not a negatively competitive or condescending environment where I feel I have to hide weaknesses. That has allowed me to ask questions no matter how stupid I think they are, and to grow and deliver so much faster.

They have also made plenty of time for me to grow even though I just started. In addition to the nearly month-long rotation, they sent me to the GraphLab conference and gave me time off to take a short Spark class I got into at Stanford. And next month they are giving me time to go to StrangeLoop. In the consulting world, there is no way I would have been able to take time away from work for my own growth while being so new to the company. Granted, the more I learn, the better an employee I become. However, not all companies are able or willing to make time for this type of growth.

Also, as you can gather from the above, the company does not take over my life. The hours are 10 to 6, and people typically stick to that, with a few occasionally working outside those hours. We do fun stuff together during and after work, but it's not mandatory, which leaves room for you to have a life.

We do happy hours every other week and sometimes play board games, especially on Fridays. Earlier in the summer, we would gather around the TV in a big open meeting room and "work" while watching some of the World Cup games. And almost every Friday near the end of the day, we break for what feels a little like an open mic session. Anyone can present on a topic they think will be valuable for the team to learn about. It helps us see what other groups are working on or learn about new tools and methods we may want to use. I've presented a couple of times already: once on GraphLab, and once giving an overview of data science by leveraging my PyCon presentation.

Basically, it's been a great place to work.

The Work
The first week on the job, one of the senior engineers had me ship code, i.e., push up code that changes the live site in some way. That can be a big deal, especially for a site that is so comprehensive and well beyond a start-up. So it was pretty cool to do that and not break the site in the process.

During the rotation, I collaborated on a few bugs, but I was also given an assignment to answer a question using MrJob and Hadoop as time permitted; thus the previous post. I knew from general experience and from talking to friends already working in the field that nothing compares to hands-on experience. Working on the MrJob project taught me so much about Hadoop, AWS, and how to access data at work, and it gave me a better understanding of the big data hype.
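For anyone curious what MrJob looks like, here is a minimal sketch of a job; this is a generic word count, not the actual work assignment from the rotation:

```python
# word_count.py -- minimal MrJob example (generic word count,
# not the actual assignment from the rotation).
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in each input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the 1s emitted for each word across all mappers.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
```

You can run it locally with `python word_count.py input.txt`, or point it at a Hadoop cluster on AWS with MrJob's `-r emr` runner.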

Lately I've been working on implementing a Bayesian multi-armed bandit solution, which again is teaching me a great deal as I figure out how to implement it for this specific company. We've built out a testing environment and coded the solution in Python initially, but the live code we are integrating with is in Java and built with Gradle.
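To give a flavor of the core idea, here is a minimal sketch of Thompson sampling over Bernoulli arms; it's generic, not Change.org's actual implementation:

```python
import random

class BayesianBandit:
    """Thompson sampling for a Bernoulli multi-armed bandit (sketch)."""

    def __init__(self, n_arms):
        # Beta(1, 1) priors: track (successes + 1, failures + 1) per arm.
        self.wins = [1] * n_arms
        self.losses = [1] * n_arms

    def choose_arm(self):
        # Draw one sample from each arm's posterior and play the best draw.
        samples = [random.betavariate(w, l)
                   for w, l in zip(self.wins, self.losses)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        # Fold the observed outcome (1 or 0) back into the posterior.
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1
```

The appeal of this approach is that exploration falls out of the posterior sampling itself: arms with little data get sampled optimistically, while clearly losing arms fade away.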

I asked for the opportunity and was given the time to take a crack at converting the solution into Java before working with one of the engineers who is more versed in the code base. Java is a bitch, but it has been a thrill figuring it out over the last few weeks. I understand much more about concepts like functional programming and interactive kernels. And I did manage to convert and test the algorithm in Java, which felt fantastic.

I definitely have days where I'm so excited about the work I'm tackling and feel so lucky to be able to do this for a living.

Strange Stuff
Before I even started, recruiters were contacting me about other jobs. I changed my LinkedIn profile the week before I started at Change.org, and at least three recruiters contacted me that week. Very flattering, but also funny considering I hadn't even started working yet. I know people still looking who are better versed in math and/or programming than I am; having a company officially hire me added a level of credibility, at least enough for recruiters to want to talk with me. I stress this to point out that there are many qualified people, and I think they are worth a look whether or not they have a full-time position in this space on their resume. Frankly, I think drive and determination are more important characteristics to look for.

Also, friends were putting me in touch with people getting into the industry so I could advise them on how to go about it successfully. Again, flattered, but considering I was scared to death of losing the job that first month, I did not feel qualified to give anyone advice.

Additionally, my path was not easy, and this journey is far from over. I still have a ton to learn, and I have many days at work where I feel like I know nothing. Again, thankfully, many in my community have shared those experiences with me, and I know this is typical. Hell, one of the senior guys at work says he still has those days all the time. That's the best and worst part of this field: everything keeps changing, which keeps you constantly humble but also challenges you to keep learning.

Sunscreen Advice
For those getting into the space (I heard from a number of you this summer), I highly encourage jumping in. What I have been sharing is that a bootcamp may or may not be the right experience for you. I know people who did not get a lot out of Hackbright or Zipfian, just as I know people who did. The approach, the people, the experience, or whatever just didn't work for some.

I can't tell you to quit your job and take the risk, because I don't know what is best for you. And I don't know the hiring stats beyond this: it is definitely not a 100% hiring rate into the field after those programs. I think anyone who survives those programs should get hired because of the rigor and determination required to make it through them. Still, companies are selective and will do what's best for themselves by hiring for talent and fit. You probably won't like all the people in your class, and may even hate a few. The job search could take almost a year after you are done, maybe more. You could find this is not a field you want to get into. I've seen all of these things happen, because the bootcamp process is not a golden ticket or a promise of success for everyone.

You make that success for yourself. If you decide to do one of these programs, or whatever approach you take to get into the industry, make sure to own it. It is your responsibility and no one else's to make yourself successful. You can have expectations for education you pay for, up to a point. Still, the one who will always be most concerned with what's best for you is you, and that won't change no matter where you go or how much you pay. You really have to show up to whatever you try, be willing to participate, be thankful for the help you receive in whatever form it comes, constantly look inward at what you can improve, try not to compare yourself to others, and fight to overcome any internal ego issues that may work against you.

Get clear on why you are getting into data science and/or engineering; different reasons point to different paths. I'm here because the challenge and constant learning make me feel alive, and I love it as much as it frustrates me. I seek opportunities that make sense for what I want to get out of the space, and I'm constantly re-centering on what I need and want to learn as I get clearer on what the space is about. You don't have to have my reasons to get into it. Just be honest with yourself about why you want to be in data science and engineering and what you expect from it. And be open and ready for the fact that whatever you expect will not be exactly what you get; it may be very close or very far off.

I want to see more people working in engineering and data science because it’s definitely needed and there are a lot of people who feel like I do.  We are willing to help make the path more accessible. Still it is really on you to figure out what is best for you and then fight for it.

Blog Next Steps
I have learned so much in the last three months, and there were many times I thought, "I should write a post about that." The reality is this site has become a lower priority while I get up to speed at work and get a lot clearer on where I want to focus my studies. I've also tried to get some level of sanity back into my non-work life. I will try to make time again for posts, but the frequency is TBD. Thanks to all those who have been reading so far and sending me great feedback. Seriously, much appreciated.

Delayed Zipfian Week 12 Wrap-up

A little delayed for good reason but here we go. Zipfian’s 12th week focused on interview practice, a conference and the graduation celebration.

Interview Practice

Most of the 12th week was spent going back through previous materials and drilling/simulating interviews. Data science has a different interview flavor from software engineering: companies vary in what they are looking for and what they will ask, and the content covers a broader range of topics.

Companies may be looking for someone to:

  • build out internal dashboards that provide business metrics
  • design and build a data analytics backend
  • implement and maintain tools for others to get access to metrics
  • build customer facing product features that are data driven
  • establish company data strategy
  • different combinations of above
  • all of the above
  • something else not listed

If you know where you want to focus, that can help scope your studies and your job applications to some degree. Still there is a lot of material to go through and ideally you want to review:

  • data science pipeline
  • general experiment design
  • machine learning algorithms
  • probability
  • statistical models & tests
  • data analysis (esp. regarding model performance)
  • programming (esp. white boarding)
  • product metrics / growth hacking

My plan of attack is to pick a topic to study for a couple of hours and then switch to a different topic. There is a little more to this approach: I typically tackle areas where I know I'm weaker, or that I anticipate will come up in an upcoming interview. Zipfian also gave us a number of study resources and sample questions to work through that have helped in the preparation.

Conference(s)

The class attended the Big Data Innovation Summit in Santa Clara during the last week. Content covered machine learning and big data trends.

As you may have read in my last post, I missed half of the week and its events because I went to PyCon. PyCon was amazing, and even though it was hard to prep for it while also going through the bootcamp, I am so glad I did it. The conference also had plenty of content around data science, and you can find videos of the event at Pyvideo.org.

Graduation

I heard there was lots of dancing, drinking and partying at the graduation celebration, and that several alumni came back to join the event. It was an all-nighter, which reminds me of another bootcamp's graduation party.

I was definitely sad to miss the festivities, but I’m so happy for all of us that we graduated!

And that is not all folks…

It feels weird and awesome to be done (sorta). It's funny how, close to the end, everyone started saying "you are almost done and then you can relax." I would laugh because I already knew what was coming, thanks to Hackbright.

There is no relaxing immediately after. Okay, maybe for a couple of days, but then you are right back in the thick of it if you are interviewing. Most bootcamps don't set this expectation correctly just yet. I still heard from recent Hackbright graduates that they were surprised there was so much work left to do. I've also heard it from a few people I've gotten to know attending other bootcamps in the area.

The interview process kicks in, and it's lots of studying, interviewing, studying, interviewing, studying, interviewing, sleep a little, and then study. The bootcamps have a week or two after hiring/career day to help support those going through the process. Granted, not everyone goes into this loop, but many do.

Thus, I am currently in the study/interview cycle myself, exploring the options out there and looking forward to getting through this part of the process to when I can finally take a minute to relax.

Last Thoughts

One of my cohort-mates, Ike, has been keeping a blog about his experience at Zipfian that I definitely recommend if you are interested in knowing more: Yet Another Data Blog

And I will say this for those wondering about the value of the bootcamps: I have been getting good job leads from Zipfian and PyCon, and also through my Hackbright connections. The Hackbright alum network has brought me opportunities that I didn't have last year, and I'm very grateful for it.

Zipfian is basically at the point on the path where Hackbright was when I attended last year: not many people know about them yet, and their alumni network is small right now. Still, it is growing, and it will be fun to see how strong it becomes in another year.

Zipfian Hiring / Demo Day – How it Works

This past week was the big hiring/demo day for Zipfian Academy where we presented our projects to a number of companies seeking data scientists. Zipfian’s hiring day is a lot like Hackbright’s career day. So the nice thing about going through it was that I knew what to expect.

Companies:

Sixteen companies attended hiring day, which was a great turnout, and I was thankful there weren't another nine because talking to sixteen was exhausting enough. Some companies were hiring their first data scientist, who would set the strategic direction as well as put it into action, while other companies had existing teams they wanted to grow. So there was a mix, from start-ups to large organizations.

Schedule:

In the morning, each company gave a one-minute introduction on who they were and what they were looking for. Then each student gave a three-minute presentation on their individual project. We used slides and did a very brief overview of our project goal, approach, results and next steps. We broke for lunch and then did speed interviewing like we did at Hackbright: each company had a table, and the students rotated around. We were given seven minutes to talk, and the conversations varied based on the company and what they were looking for.

Approach:

Preparation focused on reviewing company bios as well as putting together project presentations to explain what we did. Zipfian drilled us on those presentations: they had us draft them the week before and run through them a few times with lots of great feedback. It was tough to go through with so much else going on, but it was very valuable for getting us in shape and clear on our project story. We also put together bios of ourselves for the companies, and received bios about the companies attending.

During the day, I went with my previous experience of asking questions about the company, roles they were hiring for, culture and tools they use. A few companies asked technical questions and/or specific questions about my project. I also got questions about my background and what role I was looking for.

Afterwards, I changed my approach from last time, when I sent emails immediately. Partly because I needed a breather: I took the next day off from school to handle a number of errands that had piled up. And partly because it was nice to let the experience percolate.

Zipfian & HB Comparison:

The speed interviewing was pretty much the same, except the students rotated and the project demos were consolidated into a single presentation. I really appreciated not having to repeat my project spiel more than once, and it kept the individual conversations more authentic (vs. a rehearsed speech). It also left more time to get to know the companies.

On the whole, I felt more comfortable the second time around because I actually knew many of the companies and had a broader picture of the community. A few people there knew and/or had worked with some of my fellow Hackbright alums and friends. One of the coolest six-degrees moments was that one of the representatives had been at the hardware hackathon where my team built the Arduino car. It gave us something else to talk about and removed some of the awkward getting-to-know-you process. So even though I didn't know the people directly, they didn't all feel like complete strangers.

To clarify, I wouldn’t have felt that way at all if I hadn’t gone through Hackbright and all the stuff in the last year. It really set the stage for me to have a better perspective on the hiring day experience.

It is still a very long day, and everyone (students and company reps alike) said the same. Still, after it was all said and done, a few of the students who were still around went out for ice cream, which was definitely a good way to wrap the day.

Last Note:

One week left for Zipfian and next week is interview prep.

Jeeves is Talking!

The coolest moment this week was when I figured out that I just needed to add one line of code to my program to get my computer to talk in my Jeeves project (thus the video above).

This past week was all about fixing stuff, fine-tuning, iterating and putting together a presentation and peripheral material for demo day, which is Thursday. Of course there is always more I could and would do, but I've been continually shifting priorities based on time and the end goal.

Refresher:

In case you haven't seen the previous posts, I'm building an email classification tool. It's a binary classifier focused on determining whether an email I receive is about a meeting that needs a location picked/defined/identified (whatever word makes it clear). If the email classifies as true, then a text is sent to my phone. I do have a working model in place, and the video above shows my program running through the full process, with the computer telling me the results (a little lagniappe I added in addition to the text output).

Classification Accuracy:

In last week's post, I mentioned how the classification model (logistic regression) used in my product pipeline was not performing well even though I was getting an accuracy score of ~85%. All the classification models I tested gave ~80-90% accuracy scores, and I noted that the models were flawed because of the class balance in the data: only ~15% of my data is true cases, so a model that classified every email as false would still be right ~85% of the time.

What I need to clarify is that the ROC curve I was using also provides a type of accuracy metric, but one that accounts for skewed class distribution (think of it as adjusting the 15/85 split to 50/50). So if the area under my ROC curve is greater than 50%, the classifier is correctly identifying at least some true cases; my ROC numbers had been around 70-80% on most of the models the week before.
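In scikit-learn terms, the difference between the two metrics looks something like this (with made-up numbers):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels with an imbalanced split, like my ~15/85 data.
y_true   = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
y_pred   = [0] * len(y_true)    # lazily predict False for everything
y_scores = [0.5] * len(y_true)  # scores carrying no ranking information

print(accuracy_score(y_true, y_pred))    # 0.8 -- deceptively good
print(roc_auc_score(y_true, y_scores))   # 0.5 -- no better than chance
```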

So I investigated how the logistic regression model in my product pipeline was performing and found that I had hooked it up incorrectly. When I take in a new email message, I have to put it into a list before splitting it up into features. The way I was passing the message, each word of one email was being treated as its own email. I figured this out when I printed the feature set shape alongside the length of the original message. I just needed brackets around the email message so the program would see it as a list object. Sometimes it's just that small of a fix. Now my classification model works great, and it texts me about new emails that should be labeled true.
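The fix itself, illustrated with a toy vectorizer (the actual training data differs):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["where should we meet", "status report attached"])

new_email = "where should we meet for lunch"

# Buggy: passing the bare message made the vectorizer iterate over it,
# so each piece was treated as its own document (its own "email").
# features = vectorizer.transform(new_email)

# Fixed: the brackets make it one document in a list of length one.
features = vectorizer.transform([new_email])
print(features.shape)  # (1, n_features) -- a single email, as intended
```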

Features:

This week, I've improved my model by expanding how I build features: tf-idf, lemmatizing, n-grams, normalizing, and a few other ways to clean and consolidate the features (e.g. words).

Tf-idf gives high weights to words that show up frequently in a document but decreases the weight of words that show up frequently throughout all the documents (the corpus) used in the analysis. This helps reduce the value of my name as a predictor: since my name shows up throughout the corpus, it should not be weighted strongly.
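A quick illustration with scikit-learn's TfidfVectorizer (toy documents, current scikit-learn API):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "myname" stands in for my name, appearing in every document.
docs = ["myname where should we meet",
        "myname your report is attached",
        "myname meeting moved to friday"]

# A word in every document gets down-weighted by the idf term,
# while rarer, more distinctive words keep higher weights.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))
```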

Lemmatization groups different inflected forms of a word together to be analyzed as a single item (e.g. walk and walking). Using n-grams creates groupings of words into bi-grams, tri-grams, etc., so in addition to single-word features I'm also accounting for groups of words that could be good predictors for a true case. For example, "where should we meet" is a combination of words that can be a very strong predictor of the true case, possibly stronger than the single word "meet". In some ways, n-grams allow for context.
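Here is a sketch of lemmatizing plus n-grams together; NLTK's WordNet lemmatizer is one option (it requires downloading the wordnet corpus first):

```python
from nltk.stem import WordNetLemmatizer   # needs: nltk.download('wordnet')
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()

def lemmatize(doc):
    # Collapse inflected forms ("meetings" -> "meeting") before counting.
    return " ".join(lemmatizer.lemmatize(w) for w in doc.lower().split())

docs = ["Where should we meet?", "All our meetings keep moving."]

# ngram_range=(1, 3) keeps single words plus bi-grams and tri-grams,
# so a phrase like "where should we" becomes a feature of its own.
vectorizer = CountVectorizer(preprocessor=lemmatize, ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```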

There are some other techniques I used to build out my features, but those above give a sense of the approach. After those changes, my ROC numbers are now ~80-90% on most of the classification models I'm comparing.

There are more things I want to do with my feature development, but they are lower priority right now with such good performance results and other things taking priority with career day so close.

Code Stuff:

I spent a good chunk of time cleaning up and streamlining my code, setting it up to easily re-run the model comparison whenever I make feature changes. I also needed to make sure I consistently split the data used for cross validation in my model comparisons. Cross validation is a way to use part of the data to build the model while saving a set of data to test and validate performance. So my code is now in a good enough state that it's easy to re-run and extend, and there is some validity to the scores it produces. Plus, cleaner code helps me understand it when I come back to add things.
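The splitting pattern is roughly this (with stand-in data; in the real pipeline, X and y would be the email features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Stand-in data for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(200, 20)
y = rng.randint(0, 2, 200)

# Hold out a test set once so every model comparison sees the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# 5-fold cross-validation on the training portion only.
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print(scores.mean(), scores.std())
```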

And if you want to check out the code for my project, you can find it in my Code_Name_Jeeves Github repository.

Next Steps:

Depending on time, I definitely have other feature ideas, such as adding a binary feature for whether a date is referenced in the email message. I'd also like to run another grid search on the data pipeline to help fine-tune parameters. More importantly, adding more data to my training set would be a great value add, and testing my product on a different dataset would help validate performance. Of course, with more than a couple of days it would be great to build this out so my computer gives me location recommendations, but that will have to wait for another time.

Last Note / Official End Result:

If you noticed a buzzing sound at the end of the video, that is my phone receiving the text message that it should get (see below).

[Photo: the text message received on my phone]

Zipfian Project Week 1 & Closing the Loop

One week down for our final projects and one week left to go. This has definitely been the most fun and rewarding week so far because I've been connecting the dots in more ways than one.

Everyone talks about getting your minimum viable product (MVP) done when doing final projects like this. Thankfully I had enough experience from Hackbright to know what that meant and how to approach it. I got my full product pipeline built between last Saturday and yesterday. I've tested that it works (well, pseudo-works), and it feels really good to see the emails get pulled in and analyzed, and to get a text on my phone when the condition is true. Now I really need to fix my actual classifier, because it is just not that accurate.

I named my project Code Name Jeeves, in case you are looking for it on Github. A lot of things inspired me to want to make my computer smarter, but the Iron Man movies, with the JARVIS computer, gave me one of those moments where I distinctly remember thinking, "why can't I just talk to my computer yet" (and not in a Siri way). Basically, why is my computer not doing more for me than it does right now? Thus the name; really, I didn't want to spend a ton of time thinking of one.

So this past week was pretty productive:

  • Evaluated and applied email packages to pull gmail data
  • Setup Postgres data storage
  • Labeled emails that would classify as true
  • Applied vectorizer to generate a feature set (e.g. bag of words)
  • Tested several classifier models
  • Reworked the code to close the loop on my full product

Data / Emails

I spent the first two days experimenting with a couple of packages for pulling gmail (most built on IMAP), and I ended up picking the package by Charlie Guo. It's simple enough to understand and apply quickly, but has enough functionality to let me do some data manipulation while pulling the emails.

I made a quick decision during that time to go with Postgres for primary data storage. I'm a fan of building with "going live" in mind just in case, and since I know Heroku, I knew Postgres was a top option for persisting data. The reality is that it's a bit supercharged for the amount of data I'm storing right now, but the setup is there and available as I need it. Plus, it was good practice writing SQL to set up tables and to store and access data.
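The setup was along these lines; this is a hypothetical schema via psycopg2, not my exact tables:

```python
import psycopg2

# Hypothetical schema -- the real tables hold more email metadata.
conn = psycopg2.connect(dbname="jeeves", user="postgres")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS emails (
        id      SERIAL PRIMARY KEY,
        subject TEXT,
        body    TEXT,
        label   BOOLEAN DEFAULT FALSE  -- True = needs a meeting location
    );
""")
cur.execute(
    "INSERT INTO emails (subject, body, label) VALUES (%s, %s, %s);",
    ("Lunch?", "Where should we meet?", True),
)
conn.commit()
cur.close()
conn.close()
```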

I also spent some time going through my emails and manually labeling them. The email package I used made it possible to label emails in gmail that should classify as true and then store that label in my database: I added a column that holds True in an email's row if the email had the gmail label I gave it.

I had to build my training dataset manually because I need it to build my classification model. This manual labeling is a bit of a hindrance because it takes time, and we have limited time for this effort. I could crowdsource it, but I don't want to give my emails out. I could use the Enron dataset and crowdsource labeling those emails, but that feels like overkill for these two weeks, and I really don't want to use Enron's data. I knew this would be an issue, and I'm working to adjust for it where I can (especially in my features and classifier model).

Features

After getting the data, I spent a day analyzing what I had and building out a feature set. For natural language processing, a simple feature set can be just counting up word occurrences in the training dataset (a bag of words). This can be expanded much further, but I opted to keep it simple to start so I could continue building out my pipeline.

So to help explain features further, think of them as the variables that help predict the unknown. If this were a linear model like y = mx + b, the features are the x variables that help define what y will look like, and the classifier learns the model's coefficients, which in this case would be m and b.

Classifier

I spent the last couple of days of the week exploring as many classifiers as possible in the scikit-learn package. Ones I tried:

  • Logistic Regression
  • Naive Bayes (Gaussian, Multinomial, Bernoulli)
  • SVC
  • Random Forest
  • Ada Boost
  • Gradient Boost

Initially I just ran the standard models without any tuning. To run them, I pass in my training set of X features and y labels (the labels being my manual marking of whether each email should classify as true or false). A comparison loop along these lines is sketched below.
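Roughly, assuming X_train, X_test, y_train, y_test come from a train/test split like the one sketched earlier on this page (and are dense arrays, since GaussianNB doesn't accept sparse matrices):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)

models = [LogisticRegression(), GaussianNB(), MultinomialNB(),
          BernoulliNB(), SVC(), RandomForestClassifier(),
          AdaBoostClassifier(), GradientBoostingClassifier()]

# Fit each default model on the same split and compare raw accuracy.
for model in models:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```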

Gaussian Naive Bayes, Random Forest and Gradient Boost all did pretty well, with accuracy scores of 80-90% and area under the ROC curve (AUC) of 70-80%. The reality was that they were mostly doing great at classifying my emails as false (not meeting my condition), because about 80% of my training data was false. So a model that always classified as false would be 80% correct.

One thing my instructors helped me appreciate is that when classifying these emails, I would rather see an email classified as true that should be false (a false positive) than miss an email classified as false that should be true (a false negative). This is similar to spam filtering in that the error costs are asymmetric: there, classifying a legitimate email as spam is worse than letting a little spam into your inbox.
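scikit-learn's confusion matrix makes that trade-off easy to inspect (continuing with a fitted model from the comparison sketch above):

```python
from sklearn.metrics import confusion_matrix

# Layout for binary labels [False, True]:
# [[true negatives,  false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, model.predict(X_test)))
```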

I also practiced applying a grid search, which is an approach for testing a variety of parameters to tune the models and improve accuracy scores. Through tuning, I was able to improve Logistic Regression, Multinomial Naive Bayes and SVC into the 90% accuracy range.
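A grid search in scikit-learn looks roughly like this; the parameter values here are illustrative, not the grid I actually ran:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Illustrative grid: every combination gets cross-validated.
param_grid = {"C": [0.01, 0.1, 1, 10, 100],
              "penalty": ["l1", "l2"],
              "solver": ["liblinear"]}  # liblinear handles both penalties

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```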

As mentioned, certain models like Logistic Regression handle large feature sets (especially for NLP) better than others. Since my actual training dataset is small, Naive Bayes is a good fit for the limited information. I tried the other classifiers to experiment and learn; I hear they are rarely used in the real world because they don't improve scores enough to justify their complexity and the time and effort they take to use.

Closing the Loop

I spent Friday and Saturday connecting the dots on my project, building the code that runs it from start to text-message finish.

I had a hard time quickly finding something that would stream my gmail into my app, so I decided to just run automatic checks for new emails. When I get a new one, I open the instance of my customized vectorizer (a word counter built with the training set) and apply it to the email to get a feature set.

Then I open my stored classifier instance (also built and tuned on the training set) and pass the new email's feature set into the classifier. The classifier returns a boolean response; if the response is true, I craft a message and send a text saying that the specific email needs a meeting location defined. A sketch of that loop follows.
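A minimal sketch (the file names and the send_text helper are illustrative; pickling is one common way to persist the fitted objects):

```python
import pickle

def check_new_email(email_body):
    # Load the vectorizer and classifier fit/tuned on the training set.
    # (Pickled file names here are illustrative.)
    with open("vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    with open("classifier.pkl", "rb") as f:
        classifier = pickle.load(f)

    features = vectorizer.transform([email_body])  # note the list wrapper
    if classifier.predict(features)[0]:
        # send_text is the Twilio helper from an earlier post.
        send_text("This email needs a meeting location defined.")
```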

So now that the pipeline is built and sort of working (I had it send me text messages when an email classified as false), I need to go back through and improve my feature set and my classifier model, since nothing is classifying as true. There are lots of things I can do, and it should be another good week of learning.

Begin with the End

So we are officially kicking off personal projects for the next two weeks. It's been a bit of a process for all the students to figure out what each of us wanted to do and finalize it, but we got there.

Last Week
It was a review week. We went back over concepts that will be valuable for projects, as well as for interviews, like classifiers and distributions. We also worked to determine and finalize our projects, and then we spent two full days on final assessments. We had two different data science case studies that required working through the full process, from getting the data to providing some type of recommendation/report. We also worked in teams on sample interview questions to review additional class content. On the last day there was a little bit of mutiny, and we really didn't get much done on the assessment; most people were starting to think about projects at that point.

Projects

I am fascinated by AI. Making my computer smarter, or at least having it make some decisions it doesn't already handle, is something I was interested in before this class. I spent the last several weeks thinking through that idea and how to translate it into a two-week project. Thankfully, mentors and instructors helped me scope it into something achievable. What I focused in on is classifying whether an email is a meeting request that needs a location defined.

So if a friend or colleague wants to meet up but no place has been specified, I want my classifier to classify that email as true. This sets the stage for a bigger challenge I'd like to solve: getting the computer to figure out some meeting location options and suggest them. Still, just doing the email classification seemed a pretty attainable goal in the project timeframe.

The reality is there is a ton more I want to do, and I iterated over a long wish list of ideas, breaking them into smaller concepts and getting lots of feedback. I had help steering me toward tasks for my computer that would be realistic in the timeframe we have, and when I started explicitly defining what the process would look like from beginning to end, I landed on that one classification step.

It sounds easy and simple, but it will be a challenge because I am still learning and getting comfortable with so many components of this. Plus, it is not as easy as it sounds, especially since I will be working with a sparse dataset to start. So some of my first steps are to get the data (focused on personal email for now) and clean it, as well as to manually identify emails that would classify as true.

Then I will work on different approaches for applying natural language processing (NLP) to define features for my model. I am going to start with a standard bag of words to create features, but I will also explore feature engineering where I explicitly define specific word and symbol groupings (a pseudo mix of NLP and regex; see the sketch below). For anyone asking, a feature is a specific attribute (like a word or word pairing) that can help identify/classify the email. So I will use tools and my own inspection to find common words and groupings in my emails that would help classify them as true when they meet the condition.
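As a sketch of what that mix could look like (the patterns here are hypothetical examples, not my final feature set):

```python
import re

# Hypothetical hand-picked patterns that hint a location is needed.
MEETING_PATTERNS = [
    r"where (should|do) we meet",
    r"grab (coffee|lunch|drinks)",
    r"what time works",
]

def regex_features(email_body):
    # One binary feature per pattern: 1 if the email matches it.
    text = email_body.lower()
    return [1 if re.search(p, text) else 0 for p in MEETING_PATTERNS]

print(regex_features("Hey, where should we meet on Friday?"))  # [1, 0, 0]
```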

Once I have the features defined, I will work on building the classification model, applying techniques like cross validation and grid search. Logistic regression is a popular algorithm for classification because it's fast to build and tends to be the best option for extremely large feature sets like those in NLP. So I plan to start there, but I want to explore some other models for comparison, since this is a great opportunity to practice.

The End State

So my goal is to get my computer to classify new emails and send a text that says, "x email needs a meeting place defined". With that said, I spent the weekend looking back over an example project I did with Twilio, and I've adapted the code into a working function that takes a message as input and sends a text (along the lines of the sketch below). Thus, the end state is set up, and now I just have to build the rest of the pipeline that leads to generating that text message. No problem.
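The function is essentially a thin wrapper around Twilio's REST client. This sketch uses the current Python helper library (the API has changed since 2014), with placeholder credentials and numbers:

```python
from twilio.rest import Client

def send_text(message):
    # Placeholder credentials and numbers -- fill in your own.
    client = Client("ACCOUNT_SID", "AUTH_TOKEN")
    client.messages.create(
        to="+15550001111",       # my phone
        from_="+15552223333",    # my Twilio number
        body=message,
    )

send_text("test email needs a meeting place defined")
```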

It definitely helps to know where I’m going.

Note

There were some trickier bits with the setup because Anaconda (a distribution that bundles the data science software) and Virtualenv don't play well together. I was able to work through the conflicts and will try to put up a post about making them work together in the near future. If you need to know sooner rather than later, check out this link for one main step that helps resolve the conflicts.

 

Quick Update on Zipfian and Week 7

Another short one on what was covered this week at Zipfian. Also, for anyone interested, below is a photo of the classroom to give context on where we are working.

[Photo: the Zipfian classroom]

It's a good space. The far end at the front is the work and lecture space, and the area closest to the camera is more where we take breaks and eat.

This week was about diving into complex machine learning algorithms. Topics we covered:

  • Supervised Learning
  • SVM
  • K Nearest Neighbors
  • Decision Trees & Random Forest
  • Neural Networks
  • Time Series

I'm not going to explain the above in detail because time is a bit limited this weekend, but I recommend checking them out. One additional note: I've spoken to some data scientists who say they typically use less complex algorithms for their work, because something like a Random Forest can be difficult to implement in production.

We also submitted preliminary project proposals and worked on narrowing down our ideas. Final proposals are due next week, and then it's a free-for-all getting started (if we haven't already). I definitely have an idea of the direction I'm going and am working to flesh it out. I'll provide more details on the project in a future blogpost, but I will say that it aligns with my interest in making computers smarter.

Next week the plan is for us to review content and run through case studies to practice our skills.