Facebook and why your organization should be ignoring it

There’s a huge amount of talk out there about how best to use Facebook as an organization: how you can generate massive amounts of publicity and interest, capture new users and visitors, and maximize engagement. All those silken terms that sales and marketing people love to liberally spray over their presentations. Well, this is not a blog post about how you can do that. I don’t have much of an issue with people using Facebook as a PR and marketing tool; after all, that is what it was designed to be. Originally it was for marketing yourself, and like all popular free websites, it rapidly became a vehicle for marketing to its users.

No, this is a post about why you should ignore Facebook. Turn a blind eye and let it pass. It will, in time, fade away, like Yahoo, MSN and others before it. It may have a huge number of users, but then so did MySpace. People will move on, and Facebook is already worrying about its growth figures. But that’s not why you should be ignoring it.

Experiments in CL, NLP: Building Backchat, Part 1

Okay, so I may have something wrong with me. As soon as anything important (in my view) comes up, I have to build an app for it. Well, sometimes. Still, the impulse is strong, and so at 2:30 AM or thereabouts I registered a domain name and got to work.

The aim of the project is this: to build a tool that does real-time analysis of tweets for any event, in terms of the sentiment of those tweets towards the various subjects of that event.

I am fairly good at doing simple apps quickly. I had all but one component of this app done by the first Leaders’ Debate here in the UK (allowing me to collect my data set for future development: around 185,000 tweets from 35,000 users). I’ve thrown in a handy diagram which details the data collection portion of the app as it stands. But here’s the quick overview:

  • Streamer – Uses the Twitter streaming API to receive new tweets and publishes each one onto the appropriate AMQP exchanges
  • Parser – Receives a tweet and loads it into the database. It doesn’t actually do any parsing as such yet, but could be extended to do so (extracting URIs and hashtags are the things I have in mind)
  • Classifier – Receives a tweet, determines its subjects and the sentiment associated with each, and passes the results back to AMQP
  • ClassificationLoader – Receives the results from the Classifier and loads them into the database
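To make the fan-out above concrete, here is a minimal Python sketch of how the Streamer might turn one decoded tweet into AMQP messages. The exchange names and payload fields are my own assumptions for illustration, not Backchat’s actual ones:

```python
import json

def build_messages(tweet):
    """Turn one decoded tweet from the streaming API into (exchange, payload) pairs.

    Hypothetical exchange names: "tweets.parse" feeds the Parser,
    "tweets.classify" feeds the Classifier.
    """
    payload = json.dumps({
        "id": tweet["id"],
        "user": tweet["user"]["screen_name"],
        "text": tweet["text"],
    })
    # Every incoming tweet goes to both the Parser and the Classifier.
    return [("tweets.parse", payload), ("tweets.classify", payload)]
```

With a real broker you would then hand each pair to an AMQP client (e.g. pika’s `channel.basic_publish(exchange=..., routing_key="", body=payload)`); keeping the routing decision in a pure function like this keeps each component easy to test and to run in parallel.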

Now, for starters this app isn’t done yet, so this is all strictly subject to change. For instance, I’d like to have the DB loader pass the tweet on to the classifier, rather than the streamer doing so, since that would let the classifier store its results against a database record, and a few things like that. However, this distributed component structure means that I can run multiple copies of every component in parallel to cope with demand, across any number of computers: EC2 included, of course, but I can also use my compute cluster at home where network speed and latency aren’t a huge issue. Right now I don’t need that, but it’s nice to have and doesn’t involve a lot more work. It also lets me be language-agnostic between components, which leads me to…

CL/NLP. Short for computational linguistics/natural language processing, this is a seriously badass area of computer science. It’s still a developing field and a lot of great work is being done in it. As a result, the documentation barely exists, there are no tutorials, no how-to manuals, and what help there is assumes innate knowledge of the field. And I knew nothing (well, I know a fair bit now) about linguistics, computational linguistics or NLP, so getting started was hard work. I ran into knowtheory, a chap in the DataMapper IRC channel of all places, who happened to be a linguist interested in CL and who has helped out substantially with my methods here.

I’ve gone through about five distinct versions and methods for my classifier. The first three were written in Python using the NLTK toolkit, which is great for some things but hard to use, especially when it comes to getting results out. NLTK was giving me very good results, but at the cost of speed: several seconds just to determine the subjects of a tweet, let alone do sentiment analysis or work out grammatical polarity. Now, getting perfect results at the cost of speed was one way to go, and for all I know it might still be the way to go, but I decided to try a different plan of attack for my fifth attempt. I started fresh in Ruby using the raingrams gem for n-gram analysis, and the classifier gem to perform latent semantic indexing on the tweets.

I boiled this down to a really, really simple proof of concept. (It’s worth noting that I spent _days_ on the NLTK approach; those of you who know me will know that days are very, very rarely needed to get one component of an app to a barely-working stage.) I figured I could train two trigram models (using sets of three words) for positive and negative sentiment respectively, then use the overall probability of a given tweet’s words (split into trigrams) appearing in either model as a measure of distance. Positive tweets should have a higher probability under the positively trained model, and a lower probability under the negatively trained one. The neat thing is, this technique sort of worked. I trained LSI to pick up on party names and the like, and added common words into an unknown category so that any positive categorization would be quite certain. This doesn’t take into account grammatical polarity or anything like that, but still. Then, using the classifications, I can work out the end result over my initial dataset; here it is:
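A toy sketch of that trigram idea, in Python rather than the Ruby/raingrams code actually used: train one trigram frequency model per sentiment, then score a tweet by which model assigns its trigrams higher probability. The training data and the smoothing floor are made up for illustration.

```python
from collections import Counter

def trigrams(text):
    """Split text into overlapping three-word tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def train(tweets):
    """Build a trigram frequency model from a list of tweets."""
    model = Counter()
    for tweet in tweets:
        model.update(trigrams(tweet))
    return model

def score(model, text, floor=1e-6):
    """Product of per-trigram relative frequencies under the model.

    Unseen trigrams get a small floor probability instead of zeroing the
    product; summing log-probabilities would be the numerically safer choice.
    """
    total = sum(model.values()) or 1
    prob = 1.0
    for gram in trigrams(text):
        p = model[gram] / total
        prob *= p if p > 0 else floor
    return prob

def classify(pos_model, neg_model, text):
    """Label a tweet by whichever sentiment model finds it more probable."""
    return "positive" if score(pos_model, text) > score(neg_model, text) else "negative"
```

With a real corpus you would train each model on thousands of labelled tweets; the `positive_prob - negative_prob` gap between the two scores then gives the per-tweet sentiment measure.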

# Frequencies
Total: 183518 tweets
Labour: 30871, Tory: 35216, LibDem: 25124
# Average Sentiment
#  calculated by sum of (positive_prob - negative_prob)
#  divided by number of tweets for the party
Labour: -0.000217104050691102
Tory: -0.000247080522382047
LibDem: 0.000394512163310021
# Total time for data loading, training and computation
# I could speed this up with rb-gsl but didn't have it installed
real    13m5.759s
user    12m35.800s
sys     0m12.170s
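The per-party averages above follow the formula in the output comments: sum (positive_prob - negative_prob) over a party’s tweets, divided by that party’s tweet count. A small sketch of that aggregation, with the tuple layout being my own assumption:

```python
from collections import defaultdict

def average_sentiment(classified_tweets):
    """Mean of (positive_prob - negative_prob) per party.

    classified_tweets: iterable of (party, positive_prob, negative_prob) tuples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for party, pos, neg in classified_tweets:
        sums[party] += pos - neg
        counts[party] += 1
    return {party: sums[party] / counts[party] for party in sums}
```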

So according to my algorithm, the Liberal Democrats did very well while Labour and especially the Tories didn’t do so well. Which, if you read the papers, actually fits pretty well. However, algorithmically speaking, the individual results on some tweets can be pretty far out, so there’s lots of room for improvement. I think my final approach has to consider part-of-speech tagging and chunking, but I need to work out a way to do that fast enough to integrate it into a realtime app.

All in all, working on Backchat has so far been hugely rewarding and I’ve learned a lot. I’m looking further into CL/NLP, including neural-network classifiers for potentially improved results, all of which is great fun to learn about and implement. And hopefully before next Thursday I’ll have a brand new app ready to go for the second Leaders’ Debate!

EVE Fanfest(feed) 2009

Well, that fateful time of year comes along again: thousands of EVE Online players meet for Fanfest in Reykjavik, Iceland. And I can never make it. This year, my studies conspired against me; except they didn’t. As I discovered only hours beforehand, all that was actually keeping me in England was a lecture on basic packet switching, with no work due. Doh.

Anyway. We got a lot of fluff this year. Aside from further elaboration on things already announced, there were actually no major announcements made at Fanfest. We did get some interesting info about New Eden, CCP’s EVE-Online-Online website. And there was some evidence (gasp!) that CCP were listening to third-party developer suggestions at the API roundtable.

There was almost enough minor stuff announced to make it worthwhile. We did get a release date for Dominion – 1st December 2009. But no New Eden at launch. And knowing CCP, we’ll probably not get API changes till a bit after that. What’s really awesome, though, is that we will be getting new APIs. I’m just hoping they’re useful APIs…

Anyway, while I was sitting at home being mostly bored, I decided I’d had enough of pressing F5 on the Twitter search page, and put together a website (ff.mmmetrics.co.uk – it’s down now) to grab EVE Fanfest feeds from Twitter and Flickr. This became popular enough within a few hours that we had to rip it off the server and give it its own Amazon EC2 virtual server, as it was in danger of crashing ISKsense and EVE Metrics. Doh. A wild success, in any case, for a simple but handy website. What the website did make us realise is how little headroom we have on our current server. We kinda knew that already, but it made the point quite well.

EVE Metrics 2.1 has launched mostly well, but we’re still having issues with the API processing code. Makurid has been working hard to pin down the cause of the problems and eliminate them while I’ve been fixing up servers and moving sites around, and we’re getting a bit closer to having a complete fix. We’re not there yet, but with any luck we will be soon.