Time-series data in Redis

December 30, 2010

For Insanity, I’ve been working on some of the support services built into the website – the ones that pull information from our ancillary services and tools into a clear, useful format for staff decision making and monitoring. Latest on the agenda has been listener figures from our Icecast streaming servers. While this isn’t a perfect measure of our performance, since we broadcast on traditional media too, it is certainly one of our most important benchmarks for show quality and popularity, not to mention listener habits and trends.

We’ve historically relied on an RRDtool database updated by Munin and an Icecast plugin. While this served us well in the single-server days, we recently added a relay server to help listeners with network problems connecting to our JANET-hosted box. Now we have to handle summing two statistics sources and compensating for the added relay connections. At this point I weighed up writing a Munin plugin versus rolling my own solution, and decided to try whipping something up using Redis.

Redis is mind-blowingly fast, and exceptionally flexible to boot. We’re already planning to use it for caching, so it makes sense to use it for statistics storage too. So, the goal here was straightforward: record listener counts from both servers over time, and pull ranges back out quickly for graphing and summary statistics.

Simple goals. I did some digging around and there are a lot of different approaches to storing time-series data in Redis. The scheme I used in the end stores data in sorted sets, with the timestamp as the score and the timestamp also embedded in the member to allow for duplicate values. The sorted sets are partitioned by day; at our regular update interval that works out to roughly 8,000 points per day.
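Concretely, each sample ends up as a ZADD into that day’s set. Here’s a minimal sketch of the idea using the redis-rb client – the key naming and the record_sample helper are illustrative rather than our exact code:

```ruby
require 'redis'   # redis-rb client

# Store one listener-count sample in a per-day sorted set. The Unix
# timestamp is the score, and it's also embedded in the member string so
# that two samples with the same value don't collapse into one entry.
def record_sample(redis, metric, value, time = Time.now.utc)
  key = "stats:#{metric}:#{time.strftime('%Y-%m-%d')}"
  redis.zadd(key, time.to_i, "#{time.to_i}:#{value}")
end

record_sample(Redis.new, 'icecast.listeners', 142)
```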

Updates take a bit longer because Redis has to sort on insert, but that’s actually scalable – O(log(N)) – and the losses there are regained tenfold on retrieval. The day partitioning keeps the datasets small, meaning that N in O(log(N)+M) stays low – M depends on the query. I haven’t benchmarked any of this yet, simply because I’ve yet to notice any extra hit on page loads – it’s snappy in the extreme. We’ll have to wait and see how well it scales up, of course.

We do get a bit of overhead because we have to split the timestamp and value apart before we can use either, but that’s pretty trivial overall. We’re also putting the statistics into a GSL vector using rb-gsl-ng, which means subsequent stats operations on the dataset are fast; we can build a page with 80-odd sparklines and statistics summaries from 80 different queries without adding more than 50ms to the page load time, which is completely acceptable. Overall, this is working very well indeed. I’d love to see Redis add more innate support for time-series data with duplicate values, but for the time being this workaround is doing okay.
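For the curious, the retrieval side looks something like the sketch below – again illustrative rather than verbatim, assuming the storage layout from the earlier example and rb-gsl-ng’s stock GSL::Vector API:

```ruby
require 'date'
require 'redis'
require 'gsl'   # rb-gsl-ng

# Fetch every sample for a metric within one day's partition, splitting the
# "timestamp:value" members back apart into numbers.
def fetch_day(redis, metric, date, from_ts, to_ts)
  key = "stats:#{metric}:#{date.strftime('%Y-%m-%d')}"
  redis.zrangebyscore(key, from_ts, to_ts).map do |member|
    ts, value = member.split(':', 2)
    [ts.to_i, value.to_f]
  end
end

today   = Date.today
samples = fetch_day(Redis.new, 'icecast.listeners', today,
                    Time.utc(today.year, today.month, today.day).to_i,
                    Time.now.to_i)

# Wrap the values in a GSL vector so summary statistics are cheap.
values = GSL::Vector.alloc(samples.map { |_, v| v })
puts "mean=#{values.mean} max=#{values.max}"
```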

As an addendum to this post, redistat is another tool worth mentioning – partly for the similar application, but also for its alternative method of storing key/value data in Redis, albeit one geared more towards counters than time-series statistics data. Worth having a look at if you’re interested, in any case.

6 Comments
John Leach
December 30, 2010 @ 10:59

I recently evaluated Redis for a similar task to this too, along with MongoDB and MySQL 5.1.

Whilst Redis is very fast for the inserts, it’s not so good at summarising the data in the ways we needed (such as grouping entries for a given metric by the hour and summing them). To do this with Redis meant extracting the data and doing the operations ourselves in our code.
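Roughly this sort of thing in Ruby, for example (illustrative only, assuming members stored as “timestamp:value” strings as in the post above):

```ruby
# Client-side roll-up: group "timestamp:value" members by hour and sum the
# values, since Redis itself has nothing like a GROUP BY.
def hourly_sums(members)
  members.each_with_object(Hash.new(0)) do |member, sums|
    ts, value = member.split(':', 2)
    hour = Time.at(ts.to_i).utc.strftime('%Y-%m-%d %H:00')
    sums[hour] += value.to_f
  end
end
```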

Mongo was able to do some of these kinds of operations internally but required turning to mapreduce for others (which was slow).

MySQL/InnoDB was the winner here for us. Very simple schema with timestamp, metric name, and value columns (the primary key is the combined timestamp and metric name). All the types of operations we need to do are MySQL’s bread and butter – easy and very fast.

I was inserting my test data at 3500/s with no real MySQL tuning (and bin logs enabled). It was faster at grouping and summing too, as we didn’t have to do it ourselves (in Ruby). With 3.5 million records I was able to summarise hourly data for a given metric, suitable for a 60-day graph, in about 80ms.

By my calculations, we should be able to store 98 million of these stats in MySQL with about 10 gig of RAM (which is enough for hourly data for 4 metrics from 17,000 virtual servers for 60 days).

All this and it’s transactional too, so rolling these data up to day, week, or month granularity after 60 days (and deleting the hourly data atomically) is easy peasy.

Mark Cotner
December 30, 2010 @ 17:07

I have to agree with John Leach, as much as I hate to. If your insert requirements allow for a relational DB, then using one is a better option. However, I will say that getting data into PostgreSQL can be faster if you do it correctly, and you’ll have something other than loop joins available for the data analysis part.

No one realizes the limitations of MySQL going in. It’s later that it rears its ugly head. Keeps companies like Percona very busy. :)

Just in case you’re wondering where this biased opinion comes from . . . I’m a MySQL DBA.

Nice time series post, by the way. I think redis definitely excels at this if your needs justify the additional write speed. I’ve been working with telemetry data for 10 years now and won an award from MySQL (2nd runner-up for app of the year) some time ago. Time series is something I love working with and optimizing.

There’s no rule that says you can’t use redis for insert velocity, write a simple program to summarize said data (you usually end up doing this anyway in relational DBs) and put it into a relational store like PostgreSQL. If insert velocity warrants it, this could make for a nice RRD-like hybrid solution.
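Something along these lines, say – just a sketch using the redis-rb and pg gems, with the key and table names made up for illustration:

```ruby
require 'date'
require 'redis'
require 'pg'

redis = Redis.new
pg    = PG.connect(dbname: 'stats')   # database and table are illustrative

# Roll yesterday's raw samples up into hourly averages, push them into a
# relational table, then drop the raw day partition from Redis.
key     = "stats:icecast.listeners:#{(Date.today - 1).strftime('%Y-%m-%d')}"
buckets = Hash.new { |h, k| h[k] = [] }

redis.zrange(key, 0, -1).each do |member|
  ts, value = member.split(':', 2)
  buckets[Time.at(ts.to_i).utc.strftime('%Y-%m-%d %H:00')] << value.to_f
end

buckets.each do |hour, values|
  pg.exec_params(
    'INSERT INTO hourly_listeners (hour, avg_listeners) VALUES ($1, $2)',
    [hour, values.inject(:+) / values.size]
  )
end

redis.del(key)
```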

‘njoy,
Mark

Mark Cotner
December 30, 2010 @ 17:16

I just noticed you also play EVE. I’m awksedgreep (along with 9 other alts) and usually hang around in Sinq. I’d love to chat about time series if you want. EVE mail or hit me up anytime for chat.

‘njoy,
Mark

December 30, 2010 @ 18:36

I don’t disagree at all with the idea that using MySQL or PostgreSQL (we use PostgreSQL as a general database, so that’s an option) is certainly as flexible, if not more so, and gets you some major benefits in terms of the tools available for querying that data – the overhead of processing the data in Ruby is certainly nontrivial for larger datasets. I may end up going for a PostgreSQL store in the end, but giving Redis a shot and seeing how it performed with the task was, if nothing else, a good learning experience.

My previous work with EVE Metrics was basically recording millions of points of data on many, many metrics (roughly 100 gigs of data if you include the indexes), which was quite fun to optimize and get working reliably fast without a powerful server (which, as ever, is the limiting factor). With 8 gigs of RAM also hosting the app servers and a few other websites, plus a legacy MySQL server for things that won’t play with PgSQL on that box, it gets crowded fast – obviously, throw 16/32/64+ gigs of RAM at the problem and performance will always improve.

For this particular setup we’re using a Linode 512 for the current hosting, though we may have to migrate to something with a little more oomph before too long – 512MB of RAM shared with an email server and web/app servers doesn’t leave much for DB caching!

I don’t actually play EVE any more, though most of my previous work was focused on the game – CCP’s broken the game too comprehensively lately and made some really stupid decisions about the direction to take with what was once a great virtual world. A real shame – I may come back when they’re done with Incarna, but fleet combat really made it for me and 0.0’s pretty dead these days; fights never happen because you nearly always know the outcome before you engage, so it’s just shooting POS structures and the like now.

What strikes me as potentially an interesting project would be in a similar vein to redistat but with pluggable storage components and focused around time-series data. If nothing else, it would make comparative benchmarking of storage and retrieval of such data much simpler, and potentially lead to a nice framework to use for time series data storage. I may give that a stab this evening and throw something up on GitHub.

Mark Cotner
December 30, 2010 @ 18:59

Well, your use of time series data in presorted sets is ideal. I’d look for redistat to follow your lead eventually. I’ve wanted to take on a project using redis like this for some time. Turns out I blogged about using redis for this some time ago . . . and forgot. :)

http://awksedgreep.blogspot.com/2010/06/why-redis-for-time-series-data.html

If your data requirements aren’t huge and peak insert speed is over 5k/sec, then redis is hard to beat.

‘njoy,
Mark

Mark Smith
January 5, 2011 @ 00:35

Your needs probably aren’t such that this is interesting to you, but my company (StumbleUpon) just released an open source time-series database called OpenTSDB (http://opentsdb.net/). It’s built on the HBase platform (Java), which allows it to do some amazing scaling.

Also, hai Ix! Long time no see. I’m sorry you’re not on IRC anymore and that the jerks won. :(