Time-series data in Redis

For Insanity, I’ve been working on some of the support services built into the website, which pull information from our ancillary services and tools into a clear, useful format for decision making and monitoring. Latest on the agenda has been listener figures from our Icecast streaming servers. While this isn’t a perfect benchmark of our performance, since we broadcast on traditional media too, it is certainly one of our most important measures of show quality and popularity, not to mention listener habits and trends.

We’ve historically relied on an RRDtool database updated by Munin and an Icecast plugin. While this served us well in the single-server days, we recently added a relay server to help listeners with network problems connecting to our JANET-hosted box. Now we have to handle summing two statistics sources and compensating for the added relay connections. At this point I weighed up writing a Munin plugin versus rolling my own, and decided to try whipping up a solution using Redis.

Redis is mind-blowingly fast, and exceptionally flexible to boot. We’re already planning to use it for caching, so it makes sense to use it for statistics storage too. The goals here were:

  • Fast inserts
  • Very fast retrieval of arbitrary time ranges

Simple goals. I did some digging around, and there are a lot of different approaches to storing time-series data in Redis. The scheme I settled on uses sorted sets, with each data point scored by its timestamp; the timestamp is also embedded in the member itself, because sorted set members must be unique and duplicate values would otherwise collapse into one entry. The sorted sets are partitioned by day; at our regular update interval we’re looking at ~8,000 points per day.
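To make that concrete, here’s a minimal sketch of an insert in Ruby with the redis gem. The key scheme and helper name are my own illustration, not lifted from our code:

    require 'redis'

    redis = Redis.new

    # Record one listener-count sample. The member string embeds the
    # timestamp so two identical counts taken at different times stay
    # distinct (sorted set members must be unique); the score is the
    # timestamp, which is what range queries run against.
    def record_sample(redis, time, count)
      day_key = "listeners:#{time.strftime('%Y-%m-%d')}" # one set per day
      redis.zadd(day_key, time.to_i, "#{time.to_i}:#{count}")
    end

    record_sample(redis, Time.now, 142)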

Updates take a bit longer because Redis has to sort on insert, but that’s O(log(N)) per insert, which scales fine, and the losses there are regained tenfold when doing retrieval: range queries with ZRANGEBYSCORE run in O(log(N)+M), where M is the number of points returned. The daily partitioning keeps each dataset small, so N stays low; M depends only on the query. I have yet to benchmark any of this, because I have yet to notice the extra performance hit on the pages; it’s snappy in the extreme. We’ll have to wait and see how well it scales up, of course.
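Retrieval is then a ZRANGEBYSCORE against each day partition the requested range touches. A rough sketch, continuing the illustrative key scheme from above:

    require 'redis'
    require 'date' # for Time#to_date

    # Return the raw "timestamp:count" members between two times,
    # walking each day's sorted set and range-querying it by score.
    def fetch_range(redis, from_time, to_time)
      (from_time.to_date..to_time.to_date).flat_map do |day|
        redis.zrangebyscore("listeners:#{day.strftime('%Y-%m-%d')}",
                            from_time.to_i, to_time.to_i)
      end
    end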

We do get a bit of overhead because we have to split the timestamp and value apart before we can use either, but that’s pretty trivial overall. We also load the statistics into a GSL vector using rb-gsl-ng, which makes subsequent stats operations on the dataset fast: we can generate a page with 80-odd sparklines and statistics summaries, built from 80 different queries, without adding more than 50ms to the page load time, which is completely acceptable. Overall, this is working very well indeed. I’d love to see Redis add more innate support for time-series data with duplicate values, but for the time being this workaround is doing okay.
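For the curious, the split-and-load step looks roughly like this, reusing the fetch_range sketch from above; the member format and the GSL calls are my assumptions rather than a drop-in from our codebase:

    require 'gsl' # rb-gsl-ng

    # Split each "timestamp:count" member apart, then load the counts
    # into a GSL vector so summary statistics come cheap.
    members = fetch_range(Redis.new, Time.now - 86_400, Time.now)
    counts  = members.map { |m| m.split(':', 2).last.to_i }

    unless counts.empty?
      vector = GSL::Vector.alloc(counts)
      puts "mean #{vector.mean}, peak #{vector.max}"
    end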

As an addendum to this post, redistat is another tool worth mentioning, partly for the similar application but also for its alternative method of storing key/value data in Redis, albeit one geared more towards counters than time-series statistics. Worth a look if you’re interested, in any case.
