Experiments in CL, NLP: Building Backchat, Part 1

Okay, so I may have something wrong with me. As soon as anything important (in my view) comes up, I have to build an app for it. Well, sometimes. Still, the impulse is strong, and so at 2:30 AM or thereabouts I registered a domain name and got to work.

The aim of the project is this: to build a tool that does real-time analysis of tweets about any event, measuring the sentiment of those tweets towards the event's various subjects.

I'm fairly good at putting simple apps together quickly. I had all but one component of this app done by the first Leaders' Debate here in the UK, which let me collect my data set for future development: around 185,000 tweets from 35,000 users. I've thrown in a handy diagram which details the data collection portion of the app as it stands, but here's the quick overview:

  • Streamer – Uses the Twitter streaming API to receive new tweets, throwing each one onto the appropriate AMQP exchange (there's a publishing sketch just after this list)
  • Parser – Receives a tweet and loads it into the database. Doesn't actually do any parsing as such yet, but could be extended to do so (extracting URIs and hashtags is what I have in mind)
  • Classifier – Receives a tweet and does clever stuff on it to determine all subjects and associated sentiments, passing the results back to AMQP
  • ClassificationLoader – Receives the results from the Classifier and loads them into the database
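
Just to make the AMQP plumbing concrete, here's a rough sketch of what the Streamer's publishing side looks like. It's Python with the pika client purely for illustration, not the actual component, and the exchange name and payload shape are placeholders:

```python
import json
import pika  # AMQP client; any AMQP library would do the same job

# Connect to the local broker (RabbitMQ listens on 5672 by default).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A fanout exchange delivers each tweet to every bound queue, so the
# Parser and the Classifier each get their own copy of the same tweet.
channel.exchange_declare(exchange="tweets", exchange_type="fanout")

def publish_tweet(tweet):
    """Throw one decoded tweet from the streaming API onto the exchange."""
    channel.basic_publish(
        exchange="tweets",
        routing_key="",  # ignored by fanout exchanges
        body=json.dumps(tweet),
    )
```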

Now, for starters, this app isn't done yet, so all of this is strictly subject to change. For instance, I'd like to have the DB loader pass the tweet on to the classifier instead of the streamer, since that would let the classifier store its results against a DB object, and a few things like that. However, this distributed component structure means I can run multiple copies of every component in parallel to cope with demand, across any number of computers: EC2 included, of course, but also my compute cluster at home, where network speed/latency isn't a huge issue. Right now I don't need that, but it's nice to have and doesn't involve a lot more work. It also lets me be language-agnostic between components, which leads me to…
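
The consuming side is what makes that parallelism cheap: every copy of a component reads from the same named queue, and the broker round-robins messages between them. A rough sketch of a Parser-style worker, again in Python/pika with placeholder names:

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="tweets", exchange_type="fanout")

# Every copy of the Parser consumes from this one durable queue, so
# running more copies just means more consumers sharing the work.
channel.queue_declare(queue="parser", durable=True)
channel.queue_bind(exchange="tweets", queue="parser")

def handle(ch, method, properties, body):
    tweet = json.loads(body)
    # ... load the tweet into the database here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="parser", on_message_callback=handle)
channel.start_consuming()
```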

CL/NLP. Short for computational linguistics/natural language processing, this is a seriously badass area of computer science. It's still a developing field and a lot of great work is being done in it. As a result, the documentation barely exists, there are no tutorials, no how-to manuals, and what help there is assumes innate knowledge of the field. And I knew nothing (well, I know a fair bit now) about linguistics, computational linguistics or NLP, so getting started was hard work. I ran into knowtheory, a chap in the DataMapper IRC channel of all places, who happened to be a linguist interested in CL and who has helped out substantially with my methods here.

I've gone through about five distinct versions and methods for my classifier. The first three were written in Python using the NLTK toolkit, which is great for some things but hard to use, especially when it comes to getting results out. NLTK was giving me very good accuracy, but at the cost of speed: several seconds just to determine the subjects of a tweet, let alone do sentiment analysis or work out grammatical polarity and all that. Getting perfect results at the cost of speed was one way to go, and for all I know it might still be the way to go, but I decided to try a different plan of attack for my fifth attempt. I started fresh in Ruby, using the raingrams gem for n-gram analysis and the classifier gem to perform latent semantic indexing (LSI) on the tweets.
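
The real LSI work happens inside the Ruby classifier gem, but the idea is easy to sketch. Here's roughly the same thing in Python using gensim, with tiny made-up training snippets; this shows the shape of the technique, not my actual code:

```python
from gensim import corpora, models, similarities

# Tiny labelled snippets per party, made up purely for illustration.
training = [
    ("gordon brown and labour on the economy", "Labour"),
    ("david cameron and the conservatives on tax", "Tory"),
    ("nick clegg and the liberal democrats on reform", "LibDem"),
]
texts = [snippet.split() for snippet, _ in training]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Project the documents into a low-rank latent semantic space.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

def classify(tweet):
    # Fold the tweet into the same space; return the closest category.
    bow = dictionary.doc2bow(tweet.lower().split())
    sims = index[lsi[bow]]
    best = max(range(len(sims)), key=lambda i: sims[i])
    return training[best][1]

print(classify("interesting line from clegg on reform tonight"))  # should pick LibDem
```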

I boiled this down to a really, really simple proof of concept. (It's worth noting that I spent _days_ on the NLTK approach; those of you who know me will know that 'days' is very, very rarely how long I'd spend getting one component of an app to a barely-working stage.) I figured I could train two trigram models (over sets of three consecutive words), one on positive sentiment and one on negative, then use the overall probability of a given tweet's trigrams appearing in each model as a measure of distance: a positive tweet should score a higher probability under the positively trained model and a lower one under the negatively trained model. The neat thing is, this technique sort of worked (there's a sketch of the idea just after the results below). I trained LSI to pick up on party names and the like, and added common words to an unknown category so that any positive categorization would be quite certain. None of this takes grammatical polarity or anything like that into account, but still. Then, using the classifications, I can work out over my initial dataset what the end result was; and here it is:

# Frequencies
Total: 183518 tweets
Labour: 30871, Tory: 35216, LibDem: 25124
# Average Sentiment
#  calculated by sum of (positive_prob - negative_prob)
#  divided by number of tweets for the party
Labour: -0.000217104050691102
Tory: -0.000247080522382047
LibDem: 0.000394512163310021
# Total time for data loading, training and computation
# I could speed this up with rb-gsl but didn't have it installed
real    13m5.759s
user    12m35.800s
sys     0m12.170s
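
For the curious, here's roughly what the trigram scoring and the averaging above look like. This is a plain-Python sketch rather than my actual raingrams-based Ruby; the add-one smoothing and the tiny training examples are stand-ins:

```python
import math
from collections import Counter, defaultdict

def trigrams(text):
    words = text.lower().split()
    return list(zip(words, words[1:], words[2:]))

class TrigramModel:
    """Counts trigrams seen in training text and scores new text by
    summed log probability, with add-one smoothing (an assumption;
    raingrams may smooth differently)."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def train(self, text):
        for t in trigrams(text):
            self.counts[t] += 1
            self.total += 1

    def log_prob(self, text):
        vocab = len(self.counts) + 1
        return sum(
            math.log((self.counts[t] + 1) / (self.total + vocab))
            for t in trigrams(text)
        )

# One model per sentiment class, trained on labelled examples (made up here).
positive = TrigramModel()
negative = TrigramModel()
positive.train("what a great answer really strong performance tonight")
negative.train("what a terrible evasive answer really weak stuff tonight")

def sentiment(tweet):
    # Positive tweets should score higher under the positive model.
    return positive.log_prob(tweet) - negative.log_prob(tweet)

def average_sentiment(classified):
    """classified: iterable of (party, tweet) pairs. Returns the per-party
    average of (positive score - negative score), as in the figures above."""
    totals, counts = defaultdict(float), defaultdict(int)
    for party, tweet in classified:
        totals[party] += sentiment(tweet)
        counts[party] += 1
    return {party: totals[party] / counts[party] for party in totals}
```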

So according to my algorithm, the Liberal Democrats did very well, while Labour, and especially the Tories, didn't do so well. Which, if you read the papers, actually fits pretty well. Algorithmically speaking, though, the individual results on some tweets can be pretty far out, so there's lots of room for improvement. I think my final approach has to consider part-of-speech tagging and chunking (see the sketch below), but I need to work out a way to do that fast enough to integrate it into a real-time app.
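
For reference, this is the sort of thing I mean. NLTK makes the tagging and chunking easy; it's just not fast. The noun-phrase grammar and the example tweet here are crude placeholders:

```python
import nltk
# One-off model downloads are needed first, e.g.:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tweet = "Clegg gave a strong answer on immigration"

# Part-of-speech tagging: each word gets a Penn Treebank tag.
tagged = nltk.pos_tag(nltk.word_tokenize(tweet))
# e.g. [('Clegg', 'NNP'), ('gave', 'VBD'), ('a', 'DT'), ...]

# Chunking: group tagged words into phrases using a regex grammar.
# This one (optional determiner, adjectives, then nouns) is just an example.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
for subtree in chunker.parse(tagged).subtrees(lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```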

All in all, working on Backchat has been hugely rewarding so far and I've learned a lot. I'm looking further into CL/NLP, and at neural-network classifiers for potentially improved results, all of which is great fun to learn about and implement. And hopefully before next Thursday I'll have a brand new app ready to go for the second Leaders' Debate!