Learning (a tale of memory)

We never really stop learning. Learning is perhaps the most important process to occur in our brains; ignore the past, and you are screwed.

This post is a tale of how we’ve been running into memory usage problems recently, how we’ve been diagnosing and solving them, and the design decisions that led to them in the first place.

In the past week or so we’ve been puzzling over some unusually high memory usage on our application servers (we use the Thin server for EVE Metrics and Passenger for everything else). Specifically, our Thins were getting fat: chewing up memory, not responding to new requests, and churning on locks.

On a hunch and some history, we pulled out RMagick, the Ruby interface to ImageMagick, which has been known to cause memory problems by way of some leaks. We switched the code that still needed ImageMagick over to the mini_magick library, and where we could (notably the sparkline graphs on the statistics panel) we moved from server-side to client-side rendering. Those sparklines now use the excellent jQuery Sparklines plugin. They look better, too!
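
For the images we still render server-side, the swap was fairly mechanical. Here’s a minimal sketch, assuming mini_magick’s Image API (the filenames are purely illustrative); the key difference is that mini_magick shells out to ImageMagick’s command-line tools for each operation instead of keeping the library loaded inside the Ruby process, so a leak can’t accumulate in our VM:

```ruby
require "mini_magick"

# Illustrative filenames; the point is that each operation shells out to
# ImageMagick's binaries rather than running inside the Ruby process.
image = MiniMagick::Image.open("character_portrait.png")
image.resize "64x64"
image.write "character_portrait_64.png"
```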

That did solve the leak, but we still had problems with sudden spikes in memory, and we pinned down the cause using the Oink diagnostics plugin. Someone (and if you’re reading this, let me know who you are; I’d like to talk, not to bite your head off, but to ask why the CSVs didn’t leap out at you quite as they should have…) decided to make lots of history requests to the API, for all 59 regions at a time and for 25 items at a time. Admittedly only for one day, but there’s a kicker here.
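
For anyone unfamiliar with Oink: it logs each request’s memory footprint and then lets you scan those logs for the worst offenders. A minimal sketch of that workflow, assuming the Rack-middleware style of setup from Oink’s README (older plugin versions wire in differently, via the Hodel 3000 logger and an include in ApplicationController):

```ruby
# config/application.rb (inside the Rails configuration block):
# log the memory footprint of every request to log/oink.log.
config.middleware.use Oink::Middleware

# Then, from a shell, list the requests that crossed a memory threshold (in MB):
#   oink --threshold=75 log/oink.log
```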

Twenty-five items across 59 regions is not a lot of historical data. But those of you with EVE Metrics API experience should be shouting, “James, the history API also includes the item API data!” Which it does! And that entails loading the market orders. Here’s the problem: request 25 items for 59 regions and you are potentially asking for a _lot_ of market orders. Remember that we have 1.5 million orders across the whole market, and while the really active trade is spread over a few thousand items, there are a few items with vast numbers of orders. The order count for a single request can quite easily rise into the tens or hundreds of thousands.
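
To put some rough numbers on that (the per-item figure below is purely illustrative; the real distribution is wildly skewed towards a handful of very busy items):

```ruby
# Back-of-envelope arithmetic for a single worst-case history request.
items_per_request   = 25
regions_per_request = 59
orders_per_item_avg = 100   # hypothetical average per item per region

total_orders = items_per_request * regions_per_request * orders_per_item_avg
puts total_orders   # => 147500 order rows instantiated as Ruby objects for one request
```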

We have to load all of those orders into memory to work with them, allocate the raw data into GNU Scientific Library vectors, process that data (fast, thanks to C), and then render out the XML response.

Because all of those orders have to be in RAM at once, Ruby’s VM allocates the process more and more memory, and that memory comes out of the OS’s spare resources, the same resources the OS uses to cache the filesystem in memory. That cache is the primary mechanism by which PostgreSQL, our database engine, keeps data in RAM; subtract from it and you hurt DB performance and increase disk IO. Everything slows down. Eventually Ruby allocates enough that the OS runs out of cache entirely and has to start swapping to disk; we’ve not yet hit that point, since I’ve been monitoring all of this closely with Munin, but it would eventually happen. Ruby (along with most other VM-based languages) has a pretty crap GC, and even once the request is over, that memory is not returned to the OS by the VM. This means things can escalate pretty dramatically if left unchecked.
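
Here is a minimal sketch of the kind of per-item crunching involved, assuming the rb-gsl bindings (the prices are placeholders; the real pipeline and statistics differ):

```ruby
require "gsl"   # rb-gsl bindings to the GNU Scientific Library

# In reality this array holds tens of thousands of order prices,
# every one of them already a Ruby object sitting in RAM.
prices = [125.0, 130.5, 119.99, 141.25]

vector = GSL::Vector.alloc(prices)   # copies the data again, into a GSL-managed vector
mean   = vector.mean                 # the number-crunching itself happens in C, so it's fast
stddev = vector.sd                   # but none of this shrinks the Ruby heap afterwards

puts "mean: #{mean}, standard deviation: #{stddev}"
```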

So, what can we do to fix this? Well, in time I want to remove the item API data from the history API; that will improve things greatly in our own application. I’ve also implemented a limit on the number of items and regions that can be checked at once, to stop people from generating huge memory spikes.
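
A hypothetical sketch of what such a limit looks like; the controller name, parameter keys and thresholds below are illustrative, not the actual EVE Metrics code:

```ruby
class HistoryController < ApplicationController
  # Illustrative caps; tune to whatever the servers can comfortably handle.
  MAX_ITEMS   = 10
  MAX_REGIONS = 10

  before_filter :enforce_request_limits

  private

  def enforce_request_limits
    items   = params[:type_ids].to_s.split(",")
    regions = params[:region_ids].to_s.split(",")
    if items.size > MAX_ITEMS || regions.size > MAX_REGIONS
      render :text => "Too many items or regions in one request", :status => 413
    end
  end
end
```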

There’s a nasty side-effect here: while these Thins are churning away, nomming up memory like there’s no tomorrow, they’re locked up handling a single long-running request. Get enough of those requests at once and the website becomes unavailable, something I personally find intolerable. So I’ve increased the number of app servers we run and split incoming requests between two pools: one with a few servers dedicated to handling website requests, the other with a few dedicated to handling API requests, plus a few servers shared between the pools. This will help stability in the long run and ensure that requests can be fulfilled in a timely manner at any moment. Varnish is also getting some tweaks to serve up stale content when the backends are busy and a cached object exists.
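
The Varnish tweak in question is its grace feature. A minimal sketch of the idea in Varnish 2.x-style VCL (the timings are illustrative, not our production values):

```vcl
sub vcl_recv {
  # be willing to serve an object past its TTL while the backends are struggling
  set req.grace = 1h;
}

sub vcl_fetch {
  # keep objects around after they expire so there is something stale to serve
  set beresp.grace = 1h;
}
```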