One week down for our final projects and one week left to go. This has definitely been one of the most fun and rewarding weeks so far because I’ve been connecting the dots in more ways than one.
Everyone talks about getting your minimum viable product (MVP) done when we do final projects like this. Thankfully, I had enough experience from Hackbright to know what that meant as well as how to approach it. I got my full product pipeline built between last Saturday and yesterday. I’ve tested that it works (well, sort of works), and it feels really good to see the emails get pulled in and analyzed and then, if the condition is true, to get a text on my phone. Now I really need to fix my actual classifier because it is just not that accurate.
I named my project Code Name Jeeves in case you are looking for it on GitHub. I was inspired by a lot of things for wanting to make my computer smarter, but the Iron Man movie with the computer Jeeves was one of those moments where I distinctly remember thinking, “why can’t I just talk to my computer yet?” (and not in a Siri way). Basically, why is my computer not doing more things for me than it does right now? Thus the name, and really, I didn’t want to spend a ton of time thinking of one.
So this past week was pretty productive:
- Evaluated and applied email packages to pull Gmail data
- Set up Postgres data storage
- Labeled emails that would classify as true
- Applied vectorizer to generate a feature set (e.g. bag of words)
- Tested several classifier models
- Reworked the code to close the loop on my full product
Data / Emails
I spent the first two days experimenting with a couple of packages to pull Gmail (most built on top of IMAP), and I ended up picking the package by Charlie Guo. It’s simple enough to understand and apply quickly but has enough functionality to let me do some data manipulation when pulling the emails.
I made a quick decision during that time to go with Postgres for primary data storage. I’m a fan of building with “going live” in mind just in case, and since I know Heroku, I knew Postgres was a top option for persisting data. The reality is that it’s a bit supercharged for the size of data that I’m storing right now, but the setup is there and available to me as I need it. Plus, it was good to get practice writing SQL to set up tables and to store and access data.
I also spent some time going through my emails and manually labeling them. The email package I used to pull the data let me apply a label in Gmail to the emails that should classify as true and then store that labeling in my database. Basically, I added a column in my database that holds True for an email’s row if the email had the label I gave it in Gmail.
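To make the labeling column concrete, here is a minimal sketch of the idea. It uses Python's built-in sqlite3 as a stand-in for Postgres so that it runs anywhere, and the table name, column names, sample messages and the `needs-location` label are all hypothetical, not taken from the actual project.

```python
import sqlite3

# In-memory SQLite stands in for Postgres so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE emails (
        id INTEGER PRIMARY KEY,
        subject TEXT NOT NULL,
        body TEXT NOT NULL,
        target BOOLEAN NOT NULL DEFAULT 0  -- True if the Gmail label was present
    )
    """
)

TARGET_LABEL = "needs-location"  # hypothetical Gmail label name

# Hypothetical pulled messages: (subject, body, labels found on the message).
pulled = [
    ("Lunch next week?", "Want to grab lunch on Tuesday?", ["needs-location"]),
    ("Your receipt", "Thanks for your order.", []),
]

for subject, body, labels in pulled:
    # Parameterized query; the boolean is stored as 1 or 0.
    conn.execute(
        "INSERT INTO emails (subject, body, target) VALUES (?, ?, ?)",
        (subject, body, TARGET_LABEL in labels),
    )
conn.commit()

# Pull back only the rows that were labeled as true.
labeled_true = conn.execute(
    "SELECT subject FROM emails WHERE target = 1"
).fetchall()
```

The same pattern should carry over to Postgres with a driver such as psycopg2, though the placeholder syntax and connection setup differ.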
I had to manually build out my training dataset because I need it to build my classification model. This manual labeling is a bit of a hindrance because of course it takes time to get it done, and we have limited time for this effort. I could crowdsource this, but I don’t want to give my emails out. I could use the Enron dataset and crowdsource getting those emails labeled, but that feels like overkill for these two weeks and I really don’t want to use Enron’s data. I knew this would be an issue and I’m working to adjust for it where I can (especially in my features and classifier model).
After getting the data, I spent a day analyzing what I had and building out a feature set. For natural language processing, a simple feature set can be just counting up word occurrences in the training dataset. This can be expanded further but I opted to keep it simple to start so I could continue to get my pipeline built out.
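As a rough illustration of what counting word occurrences looks like, here is a small bag-of-words sketch in plain Python. The sample emails and the function names are made up for the example; scikit-learn's CountVectorizer does the same job with many more options.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(docs):
    """Map every distinct word in the training docs to a column index."""
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: i for i, word in enumerate(words)}


def vectorize(doc, vocab):
    """Turn one document into a row of word counts, ignoring unseen words."""
    row = [0] * len(vocab)
    for word, n in Counter(tokenize(doc)).items():
        if word in vocab:
            row[vocab[word]] = n
    return row


# Hypothetical training emails.
training_emails = [
    "Where should we meet",
    "Meet me at noon",
    "Your receipt is attached",
]
vocab = build_vocabulary(training_emails)
X = [vectorize(email, vocab) for email in training_emails]
```

Each row of `X` is one email and each column is one word from the training vocabulary, which is why the vocabulary built on the training set has to be reused when new emails come in.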
To help further explain features, think of them as variables that help predict the unknown. If this were a linear model like y = mx + b, the features would be the x variables that help define what y will look like, and training the classifier determines the coefficients for the model, which in this case would be m and b.
I spent the last couple of days of the week exploring as many classifiers as possible in the scikit-learn package. Ones I tried:
- Logistic Regression
- Naive Bayes (Gaussian, Multinomial, Bernoulli)
- Random Forest
- Ada Boost
- Gradient Boost
Initially I just ran the standard models without any tuning. To run them, I pass in my training set of X features and y labels (which is my manual labeling of whether that email should classify as true or false).
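For anyone who has not used scikit-learn, passing in X and y looks roughly like the sketch below. The emails and labels are invented for illustration, and Multinomial Naive Bayes is just one of the models listed above, left on its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training emails and manual labels (True = meets the condition).
emails = [
    "Where should we meet for coffee?",
    "Can we meet on Friday, and where?",
    "Your order has shipped",
    "Monthly newsletter and account statement",
]
y = [True, True, False, False]

# Build the bag-of-words feature matrix from the training emails.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Fit an untuned model on the features and labels.
model = MultinomialNB()
model.fit(X, y)

# New emails must go through the same fitted vectorizer before predicting.
new_X = vectorizer.transform(["Where should we meet tomorrow?"])
predictions = model.predict(new_X)
```

Swapping in another classifier is mostly a matter of changing the `MultinomialNB()` line, since scikit-learn models share the same `fit` and `predict` interface.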
Gaussian Naive Bayes, Random Forest and Gradient Boost all appeared to do pretty well, with accuracy scores of 80-90% and an area under the ROC curve (AUC) of 70-80%. The reality was that they were doing great at classifying my emails as false (not meeting my condition) because about 80% of my training dataset was false. So a model that classified everything as false would still be 80% correct.
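The arithmetic behind that is easy to check. This sketch assumes a made-up dataset of 100 emails with the same 80/20 split and a "classifier" that always answers false.

```python
# Hypothetical labels: 20 emails meet the condition, 80 do not.
labels = [True] * 20 + [False] * 80

# A do-nothing classifier that predicts False for every email.
always_false = [False] * len(labels)

# Accuracy: share of predictions that match the labels.
correct = sum(pred == truth for pred, truth in zip(always_false, labels))
accuracy = correct / len(labels)

# Recall: share of the true emails that were actually caught.
caught = sum(pred and truth for pred, truth in zip(always_false, labels))
recall = caught / sum(labels)
```

Here `accuracy` comes out to 0.8 while `recall` is 0.0, which is why accuracy alone is misleading on an imbalanced dataset like this one.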
One thing my instructors helped me appreciate is that when I’m classifying the emails, I would prefer to flag an email that should be false but is classified as true (a false positive in the confusion matrix) rather than miss an email that should be true but is classified as false (a false negative). Spam filters make the same kind of trade-off, just in the other direction: classifying a legitimate email as spam is worse than letting a little bit of spam into your inbox.
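One common way to act on that preference is to lower the probability threshold at which an email counts as true. The sketch below uses invented labels and predicted probabilities to show how the confusion-matrix counts shift; the function name and the thresholds are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred and truth:
            counts["tp"] += 1
        elif pred and not truth:
            counts["fp"] += 1
        elif truth:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Hypothetical true labels and model probabilities of "true" for five emails.
y_true = [True, True, False, False, False]
probabilities = [0.9, 0.4, 0.6, 0.2, 0.1]

# Default cutoff of 0.5 misses the second email (a false negative).
default_preds = [p >= 0.5 for p in probabilities]
default_counts = confusion_counts(y_true, default_preds)

# A lower cutoff of 0.3 catches it while still flagging one false positive.
lenient_preds = [p >= 0.3 for p in probabilities]
lenient_counts = confusion_counts(y_true, lenient_preds)
```

With real data, a lower threshold usually buys fewer false negatives at the cost of more false positives, so the cutoff is worth tuning against the labeled set rather than picking by eye.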
I also worked on applying a grid search for practice, which is an approach to testing a variety of parameter combinations to tune the models and improve accuracy scores. Through tuning, I was able to improve Logistic Regression, Multinomial Naive Bayes and SVC (scikit-learn’s support vector classifier) into the 90% accuracy range.
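A grid search in scikit-learn can be sketched as below. The emails, labels and parameter values are invented for the example and are not the grid actually used; the point is only the shape of the setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical labeled emails (1 = meets the condition, 0 = does not).
emails = [
    "Where should we meet for coffee?",
    "Can we meet on Friday, and where?",
    "Let's meet up next week",
    "Want to meet for lunch on Tuesday?",
    "Your order has shipped",
    "Monthly newsletter and account statement",
    "Password reset instructions",
    "Your receipt is attached",
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Chain the vectorizer and classifier so both are tuned together.
pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Every combination of these values is tried with cross-validation.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(emails, y)

best_params = search.best_params_
best_score = search.best_score_
```

Putting the vectorizer inside the pipeline means each cross-validation fold builds its vocabulary from its own training split, which keeps the held-out emails from leaking into the features.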
Certain models like Logistic Regression handle large feature sets (especially for NLP) better than others. Since my actual training dataset is small, Naive Bayes is a good solution to accommodate the limited information. I tried the other classifiers to experiment and learn. I hear the other models are rarely used in the real world because they don’t improve scores enough, and are too complex, to justify the time and effort it takes to use them.
Closing the Loop
I spent Friday and Saturday connecting the dots on my project, basically building the code that would run it from start to finish (the finish being a text).
I had a hard time quickly finding something that would stream my Gmail through my app, so I decided to just run automatic checks for new emails. When I get them, I open the instance of my customized vectorizer (the word counter built with the training set) and apply it to each email to get a feature set.
Then I open my stored classifier instance (also built and tuned using the training set) and I pass the new email feature set into the classifier. The classifier returns a boolean response and if the response is true then I craft a message and send a text that says that specific email needs a meeting location defined.
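The loop described above can be sketched as follows. Everything here is a stand-in: the function name, the email dictionaries, the toy vectorizer and classifier classes, and the `send_text` callable are hypothetical, and the real project would plug in the stored scikit-learn objects and an actual SMS service.

```python
def process_new_emails(new_emails, vectorizer, classifier, send_text):
    """Vectorize each new email, classify it, and text when it is flagged."""
    flagged = []
    for email in new_emails:
        # Reuse the vectorizer built on the training set to get features.
        features = vectorizer.transform([email["body"]])
        # The classifier returns one boolean per email passed in.
        if classifier.predict(features)[0]:
            subject = email["subject"]
            send_text(f"Email '{subject}' needs a meeting location defined.")
            flagged.append(subject)
    return flagged


class KeywordVectorizer:
    """Toy stand-in: a single feature counting the word 'meet'."""

    def transform(self, docs):
        return [[doc.lower().count("meet")] for doc in docs]


class ThresholdClassifier:
    """Toy stand-in: true whenever the single feature is above zero."""

    def predict(self, rows):
        return [row[0] > 0 for row in rows]


# Hypothetical new emails found by an automatic check.
new_emails = [
    {"subject": "Coffee?", "body": "Can we meet this week?"},
    {"subject": "Receipt", "body": "Thanks for your order."},
]

sent = []  # collects messages instead of sending real texts
flagged = process_new_emails(
    new_emails, KeywordVectorizer(), ThresholdClassifier(), sent.append
)
```

Passing the sender in as a function makes it easy to test the whole loop without texting a phone every time.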
So now that the pipeline is built and sort of working (I had it send me text messages when an email classifies as false), I need to go back through and improve my feature set and my classifier model since nothing is classifying as true. There are lots of things I can do and it should be another good week of learning.