Coolest moment this week was when I figured out that I just needed to add one line of code to my program to get my computer to talk in my Jeeves project (thus video above).
This past week was all about fixing stuff, fine tuning, iterating and putting together a presentation and peripheral stuff for demo day which will be Thurs. Of course there is always more I could and would do, but I’ve been continually shifting priorities based on time and end goal.
In case you haven’t seen the previous posts, I’m building an email classification tool. It’s a binary classifier that is focused on determining if an email I receive is a meeting that needs a location picked/defined/identified (whatever word makes it clear). If the email classifies as true then a text is sent to my phone. I do have a working model in place and the video above shows my program run through the full process and the computer telling me the results (which was a little lagniappe I added in addition to the text output).
In last week’s post, I mentioned how the classification model (logistic regression) used in my product pipeline was not performing well even though I was getting a score of around ~85% accuracy. All the classification models I tested had given ~80-90% accuracy scores, and I had said how the models were flawed because of the representation of the data. My data has ~15% true cases in it so as mentioned if it classified all emails as false then it would be right ~85% of the time.
What I need to clarify is that the ROC curve I was using also provides a type of accuracy metric, but the equation accounts for skewed class distribution (think of it as adjusting the 15/85 split to 50/50). So if my ROC curve is great than 50% (area under the curve) then its showing that the classifier is getting some true cases correctly classified, and my ROC curve had been around 70-80% on most of the models the week before.
So I did some investigation into how the logistic regression model I used in my product pipeline was performing and found that I hooked it up incorrectly. When I take in a new email message, I had to put it into a list format before splitting it up into features. The way I was passing the message, each word in one email message was being treated as a single email. I figured this out when I printed out the feature set shape and the length of the original message. So I just needed brackets around the email message to make the program see it as a list object. Sometimes its just that small of a fix. Now my classification model works great and it sends me texts on new emails that should be labeled as true.
This week, I’ve improved my model by expanding how I build features such as using tf-idf, lemmatizing, n-grams, normalizing and few other ways to clean and consolidate the features (e.g. words).
Tf-idf is a way to give high weights to words based on how frequently they show up in a document but to decrease the weight if the word shows up frequently throughout all the documents (corpus) that are used in the analysis. So it helps reduce the value of my name as a predictor since my name shows up throughout the corpus and it should not be weight strongly.
Lemmatization helps group different inflected forms of words together to be analyzed as a single item (e.g. walk and walking). Using n-grams helps create groupings of words into bi-grams, tri-grams, etc. This means that in addition to having single word features, I’m also accounting for groups of words that could be good predictors for a true case. For example ‘where should we meet’ is a combination of words that can be a very strong predictor for the true case and possibly stronger than the single word meet. N-grams in some ways allows for context.
There are some other techniques I used to build out my features but those mentioned above give a sense of the approach. After those changes, my ROC curve now shows ~80-90% on most classification models that I’m comparing.
There are more things I want to do with my feature development, but they are lower priority right now with such good performance results and other things taking priority with career day so close.
I spent a good chunk of time cleaning up and streamlining my code. I was trying to set it up to easily run the model comparison whenever I made feature changes. I also needed to make sure I consistently split up data used for cross validation in my model comparisons. Cross validation is a way to use part of the data to build the model and save a set of data to test and validate the performance. So I got my code in a good enough state where its easy to re-run, expand and ensure that there is some validity to the scores its producing. Plus, it helps to make my code cleaner so I can understand it when I go back to it to add things in.
And if you want to checkout the code for my project, you can find it at my Code_Name_Jeeves Github repository.
Depending on time, I definitely have other feature ideas such as adding in just a binary analysis of whether a date is referenced or not in the email message. I’d also like to run another grid search on the data pipeline to help with fine tuning parameters. More importantly, adding in more data to my training set would be a great value add as well as just using a different data set to test my product can help with validating performance. Of course, if there was more time then a couple of days it would be great to build this out so my computer gives me location recommendations, but that one will have to be another time.
Last Note / Official End Result:
If you noticed a buzzing sound at the end of the video, that is my phone receiving the text message that is should get (see below).