So we are officially kicking-off personal projects for the next 2 weeks. It’s been a bit of a process for all the students figuring out what each of us wanted to do and finalizing it but we got there.
It was a review week. We went back over some concepts that will be valuable for projects as well as when we go through interviews like classifiers and distributions. We also worked to determine and finalize our projects and then we spent two full days on final assessments. We had 2 different data science case studies that required working through the full process from getting the data to providing some type of recommendation/report. We also worked in teams on sample interview questions to review over additional class content. The last day there was a little bit of mutiny and we really didn’t get much done on the assessment. Most people were starting to think projects at that point.
I am fascinated by AI. Making my computer smarter or at least make some decisions it doesn’t already handle is something I’ve been interested in before this class. I spent the last several weeks thinking through that idea and how to translate it into a 2 week project. Thankfully, mentors and instructors helped me scope that into something that seems achievable. What I focused in on is to classify an email on if it’s a meeting request that needs a location defined.
So if a friend or colleague wants to meet up but no place has been specified, I want my classifier to classify that email as true. This is setting the stage for a bigger challenge I’d like to solve which is to get the computer to figure out some meeting location options and provide them. Still just doing the email classification seemed a pretty attainable goal in the project timeframe.
The reality is there is a ton more I want to do, and I iterated over a long wish list of ideas, breaking them into smaller concepts and getting lots feedback. I had help steering me towards tasks to get my computer to do that would be realistic in the timeframe we have to work, and when I started defining explicitly what the process would look like from beginning to end, I landed on that one classification step.
Sounds easy and simple but it will be a challenge because I am still learning and getting comfortable with so many components of this. Plus, it is not as easy as it sounds especially when I will be working with a sparse dataset to start. So some of my first steps are to get the data (focused on personal for now) and clean it as well as to manually identify emails that would classify as true.
Then I will have to work on different approaches for applying natural language processing (NLP) to define features for my model. I am going to start with standard bag of words to create features, but will try to explore feature engineering where I explicitly define specific word and symbol groupings (pseudo mix of NLP and Regex). For anyone asking, a feature is a specific attribute (like a word or word pairings) that can help identify / classify the email. So I will work with tools and my own personal inspection to find common words and groupings in my emails that would help classify them as true if they meet the condition.
Once I have the features defined, I will work on building the classification model by applying techniques like cross validation and grid search. Logistic regression is a popular algorithm for classification because its fast to build and tends to be the best option for extremely large feature sets like NLP. So I plan to start there, but I want to explore some of the other models for comparison since this is a great opportunity to practice.
The End State
So my goal is to get my computer to classify new emails and send a text that says, “x email needs a meeting place defined”. With that said, I spent the weekend looking back over an example project I did with Twilio, and I’ve adapted the code to make a working function that will take a message as input and send a text. Thus, the end state is setup and now I just have to build the rest of it that will lead to generating that text message. No problem.
It definitely helps to know where I’m going.
There was some trickier bits with the setup because Anaconda (package to load all the data science software) and Virtualenv don’t play well together. I was able to work through the conflicts and will try to put up a post about how to make them work together sometime in the near future. If you need to know sooner than later, check out this link as one main step to help resolve the conflicts.