Building Backchat, Part 2

Or: How I learned to give up on projects.

Okay, so, Backchat was hugely interesting as a project. Eventually, I produced a set of graphs using the classifier that showed sentiment over time. These graphs aren’t too accurate but are fairly good at showing how things were going. However, after this I pretty much dropped the project. This was mainly due to exams cropping up and stealing my time away, but also because of how difficult it was to approach a sensible level of accuracy.

In my 'final' design I ended up using a bigram classifier. I added parsing of the tweets to pull out words, URLs and user mentions, and then used this to generate my training sets, which improved things a lot. This gave me several thousand tweets for each training set, which worked okay. However, even with this classifier, which was doing a lot better than most others, my results weren't very reliable on a tweet-by-tweet basis. Still, it wasn't too shoddy, and the graphs on the right are, I think, fairly reliable in terms of general sentiment.
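
To give a rough idea of what the parsing and bigram extraction might look like, here's a minimal Python sketch (the actual Backchat code differs; the regexes and sample tweets below are purely illustrative):

import re
from collections import Counter

# Rough tweet parser: pull out @mentions and URLs first, then split
# whatever is left into lowercase words.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")

def parse_tweet(text):
    mentions = MENTION_RE.findall(text)
    urls = URL_RE.findall(text)
    stripped = URL_RE.sub(" ", MENTION_RE.sub(" ", text))
    words = re.findall(r"[a-z']+", stripped.lower())
    return words, mentions, urls

def bigrams(words):
    # Adjacent word pairs, used as features for a bigram classifier.
    return list(zip(words, words[1:]))

# Building bigram counts for a (tiny, made-up) training set:
training_tweets = ["Clegg did really well tonight http://example.com",
                   "@someone Brown looked tired and rattled"]
counts = Counter()
for tweet in training_tweets:
    words, mentions, urls = parse_tweet(tweet)
    counts.update(bigrams(words))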

The AMQP-linked network of processors worked extremely well and resulted in good throughput: I used two parsers, two classifiers and one classification loader in the end. I was unable to achieve realtime performance due to network constraints; sadly, my ISP at home had decided that I'd used too much bandwidth and throttled me down to 128 kilobits per second. That said, thanks to the streaming API I didn't (as far as I know, except for a few hundred lost to rate limiting) drop any tweets; I just received them out of order and then reconstructed the correct order using each tweet's timestamp. The machine I was using for this also ran pretty much flat out on disk I/O and CPU usage, but it was able to keep up; it's a fairly old box, only a Pentium 4 with a couple of gigs of RAM.
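
For what it's worth, reconstructing the order needs nothing clever. A sketch along these lines (Python, assuming the usual Twitter created_at timestamp format) would do it once the tweets are in hand:

from datetime import datetime

# Tweets arrive out of order over a throttled link; sort them back into
# order by created_at (format assumed, e.g. "Thu Apr 15 20:00:01 +0000 2010").
CREATED_AT = "%a %b %d %H:%M:%S %z %Y"

def in_tweet_order(tweets):
    return sorted(tweets, key=lambda t: datetime.strptime(t["created_at"], CREATED_AT))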

In any case this was an interesting project, and I'll be open-sourcing the data and source in the coming weeks if anyone wants to have a poke at it. While the debates are now done and gone, I'm sure people can come up with some great uses for sentiment analysis outside of UK politics.

Experiments in CL, NLP: Building Backchat, Part 1

Okay, so I may have something wrong with me. As soon as anything important (in my view) comes up, I have to build an app for it. Well, sometimes. Still, the impulse is strong, and so at 2:30 AM or thereabouts I registered a domain name and got to work.

The aim of the project is this: to build a tool that does real-time analysis of tweets for any event, in terms of the sentiment of those tweets towards the various subjects of that event.

I am fairly good at doing simple apps quickly. I had all but one component of this app done by the first Leaders' Debate here in the UK (allowing me to collect my data set for future development: around 185,000 tweets from 35,000 users). I've thrown in a handy diagram which details the data collection portion of the app as it stands, but here's the quick overview:

  • Streamer – Uses the Twitter streaming API to receive new tweets and publishes each one onto the appropriate AMQP exchanges
  • Parser – Receives a tweet and loads it into the database. Doesn't actually do any parsing as such yet, but could be extended to do so (extracting URIs and hashtags is what I'm thinking of)
  • Classifier – Receives a tweet and does clever stuff on it to determine all subjects and associated sentiments, passing the results back to AMQP
  • ClassificationLoader – Receives the results from the Classifier and loads them into the database

Now, for starters, this app isn't done yet, so this is all strictly subject to change. For instance, I'd like to have the DB loader pass the tweet on to the classifier instead of the streamer doing it, since that'll let the classifier store its results with reference to a DB object, and a few things like that. However, this distributed component structure means that I can run multiple copies of every component in parallel to cope with demand, across any number of computers. EC2 included, of course, but I can also use my compute cluster at home, where network speed/latency isn't a huge issue. Right now I don't need that, but it's nice to have and doesn't involve a lot more work. It also lets me be language-agnostic between components, which leads me to…
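
As a very rough illustration of how components can hang together over AMQP, here's a minimal Python sketch (pika and RabbitMQ are assumptions here, the exchange and queue names are made up, and the real components may look nothing like this):

import json
import pika  # assumes a local RabbitMQ broker and pika >= 1.0

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Streamer side: fan each raw tweet out to whichever components are listening.
channel.exchange_declare(exchange="tweets", exchange_type="fanout")

def publish_tweet(tweet_dict):
    channel.basic_publish(exchange="tweets", routing_key="",
                          body=json.dumps(tweet_dict))

# Parser side: bind a queue to the exchange and handle tweets as they arrive.
channel.queue_declare(queue="parser")
channel.queue_bind(exchange="tweets", queue="parser")

def handle_tweet(ch, method, properties, body):
    tweet = json.loads(body)
    # ...load into the database, extract URIs/hashtags, republish results...

channel.basic_consume(queue="parser", on_message_callback=handle_tweet, auto_ack=True)
# channel.start_consuming()  # blocks; each component runs its own consume loop

Because every component only ever talks to the broker, running extra parsers or classifiers is just a matter of starting more processes consuming from the same queues.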

CL/NLP. Short for computational linguistics/natural language processing, this is a seriously badass area of computer science. It's still a developing field and a lot of great work is being done in it. As a result, the documentation barely exists, there are no tutorials, no how-to manuals, and what help there is assumes innate knowledge of the field. And I knew nothing (well, I know a fair bit now) about linguistics or computational linguistics or NLP, so getting started was hard work. I ran into knowtheory, a chap in the DataMapper IRC channel of all places, who happened to be a linguist interested in CL and who has helped out substantially with my methods here.

I've gone through about five distinct versions and methods for my classifier. The first three were written in Python using NLTK, which is great for some things but hard to use, especially when it comes to getting results out of it. On top of that, NLTK was giving me very good results but at the cost of speed: several seconds to determine the subjects of a tweet, let alone do sentiment analysis or work out grammatical polarity and all that. Now, getting perfect results at the cost of speed was one way to go, and for all I know it might still be the way to go, but I decided to try a different plan of attack for my fifth attempt. I started fresh in Ruby using the raingrams gem for n-gram analysis and the classifier gem to perform latent semantic indexing on the tweets.

I boiled this down to a really, really simple proof of concept. (It's worth noting that I spent _days_ on the NLTK approach; those of you who know me will know that days are very, very rarely what it takes for me to get one component of an app to a barely-working stage.) I figured I could train two trigram models (using sets of three words) for positive and negative sentiment respectively, then use the total probabilistic chance of a given tweet's words (split into trigrams) appearing in either model as a measure of distance. Positive tweets should have a higher probability under the positively trained model and a lower probability under the negatively trained one. Neat thing is, this technique sort of worked. I trained LSI to pick up on party names etc., and added common words to an 'unknown' category so that any positive categorization would be quite certain. This doesn't take into account grammatical polarity or anything like that, but still. Then, using the classifications, I could work out what the end result was over my initial dataset; and here it is:

# Frequencies
Total: 183518 tweets
Labour: 30871, Tory: 35216, LibDem: 25124
# Average Sentiment
#  calculated by sum of (positive_prob - negative_prob)
#  divided by number of tweets for the party
Labour: -0.000217104050691102
Tory: -0.000247080522382047
LibDem: 0.000394512163310021
# Total time for data loading, training and computation
# I could speed this up with rb-gsl but didn't have it installed
real    13m5.759s
user    12m35.800s
sys     0m12.170s
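
The two-model idea above boils down to something like this minimal Python sketch (the real thing used the Ruby raingrams and classifier gems; the add-one smoothing and the sample tweets here are stand-ins):

import math
import re
from collections import Counter

def trigrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:], words[2:]))

class TrigramModel:
    def __init__(self, training_texts):
        self.counts = Counter()
        for text in training_texts:
            self.counts.update(trigrams(text))
        self.total = sum(self.counts.values())

    def log_prob(self, text):
        # Add-one smoothed log-probability of the tweet's trigrams under this model.
        vocab = len(self.counts) + 1
        return sum(math.log((self.counts[t] + 1) / (self.total + vocab))
                   for t in trigrams(text))

# Hypothetical training data; the real training sets were far larger.
positive = TrigramModel(["brilliant performance from clegg tonight",
                         "really impressed with the libdems this evening"])
negative = TrigramModel(["that was a terrible answer from brown",
                         "cameron looked completely out of his depth"])

def sentiment(tweet):
    # Positive tweets should score higher under the positive model, so this
    # difference plays the same role as positive_prob - negative_prob above.
    return positive.log_prob(tweet) - negative.log_prob(tweet)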

So according to my algorithm, the Liberal Democrats did very well while Labour and especially the Tories didn't do so well. Which, if you read the papers, actually fits pretty well. However, algorithmically speaking, the individual results on some tweets can be pretty far out, so there's lots of room for improvement. My final approach, I think, has to consider part-of-speech tagging and chunking, but I need to work out a way to do that fast enough to integrate it into a realtime app.
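
For reference, the tagging and chunking themselves aren't much code; in NLTK it's roughly this (a sketch only, with the chunk grammar and sentence made up, and it needs NLTK's tokenizer and tagger data installed). The hard part is doing it fast enough to keep up with the stream:

import nltk

# Needs the punkt tokenizer and a POS tagger model installed via nltk.download().
sentence = "Nick Clegg gave a very strong answer on immigration"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# e.g. [('Nick', 'NNP'), ('Clegg', 'NNP'), ('gave', 'VBD'), ...]

# A toy chunk grammar: noun phrases as an optional determiner, any number
# of adjectives, then one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
print(chunker.parse(tagged))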

All in all, working on Backchat has so far been hugely rewarding and I've learned a lot. I'm looking further into CL/NLP and into neural network classifiers for potentially improved results, all of which is great fun to learn about and implement. And hopefully before next Thursday I'll have a brand new app ready to go for the second Leaders' Debate!

Dominion – Some thoughts

CCP have really hit the nail on the head with the proposed nullsec changes. A few weeks ago I decided to leave my home in EVE, Vanguard Frontiers, of which I had been a member for over two years. Lack of time was the main driver for this, but a contributing factor was the lack of dynamics in 0.0. I'm very into my fleet warfare. Typically not capital scale, but large fleets. I tend not to FC directly, just sit in command chat and keep things moving in the right direction, though I have been known to step up and lead when needed. And it's gotten boring.

How do I mean boring? Surely PvP with hundreds of ships can't not be fun? Actually, it really does get quite boring. There's the waiting to find some targets if it's a roam, the waiting for targets to jump in if it's a gatecamp, the hours of staging and waiting for allies if it's a major op. And there's the huge number of restrictions in any nullsec alliance's engagement policy: you don't fight outside cynojammed systems with large fleets if you don't have to, you never engage unless you're sure to win, and so on. It all leads to a very stale few hours of near-combat, and often opponents will dance around for hours without ever meeting, even when only two or three systems apart. And don't get me started on the politics.

What CCP are trying to do is increase the amount of emergent gameplay by increasing population density and decreasing the complexity of combat in 0.0 (in terms of cyno jammers), thus leading to better combat: breaking up the hours upon hours of near-combat (which can be very draining) typically endured by pilots in deep nullsec, and encouraging smaller operational groups. This means smaller fleets and more of them, making combat in nullsec actually potentially interesting. And, certainly, more approachable by smaller alliances and larger corporations.

Hopefully this will make nullsec a little more dynamic and thus a little more fun to play. Only time will tell, of course. That and a lot of playtesting on sisi. Hop to it!