Welcome to Pandora

We’ve successfully moved all sites, email, DNS, and everything else from our old server, Highpoint, to our brand new machine, Pandora. The move entailed rather more downtime than we’d anticipated, mostly due to a lack of preparation on my part, a glorious DNS cock-up, and the added complexity of Highpoint’s backhaul failing three times while we were moving the data across.

In total it was a fairly mammoth operation by our standards; we transferred in excess of 100 gigabytes of data between the servers over the course of 12 hours, shifted over 20 websites and 3 major webapps, and got everything up and running again in under a day once we’d moved it all to the new box. The downtime has been annoying and I’ve certainly learned some lessons for next time, but here’s the flipside…

We’re now running on a much, much roomier machine. The environment isn’t perfectly set up yet, and we’ll no doubt spend the next week tuning everything, tweaking things till they’re just right and fixing bugs, as well as rewriting chunks of our applications to make use of the extended caching capabilities of the new environment. We’re already using this to great effect in the EVE Metrics APIs, but we can make better use of caching throughout our apps.
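For the curious, the kind of caching we’re leaning on is nothing exotic. Here’s a minimal sketch of the pattern in Python, caching an expensive lookup for a few minutes; the function name and the five-minute TTL are purely illustrative, not the actual EVE Metrics internals.

```python
# A minimal sketch of TTL caching around an expensive lookup. The names and
# the five-minute TTL are illustrative, not EVE Metrics internals.
import time
from functools import wraps

def ttl_cache(seconds):
    """Cache a function's results in memory for `seconds` seconds."""
    def decorator(fn):
        store = {}  # args -> (expires_at, value)

        @wraps(fn)
        def wrapper(*args):
            now = time.time()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]                  # still fresh, skip the work
            value = fn(*args)                  # expensive work (e.g. a DB query)
            store[args] = (now + seconds, value)
            return value
        return wrapper
    return decorator

@ttl_cache(seconds=300)
def price_history(type_id):
    # Stand-in for an expensive database aggregation.
    return {"type_id": type_id, "daily_average": 4.2}

price_history(34)   # computed
price_history(34)   # served from the cache for the next five minutes
```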

Once we’ve gotten settled in, we should be performing much better and more reliably than previously. We’ve already seen huge performance gains on our database (we can process more than twice as many uploads per second, for example) and we hope to have things even faster soon.

Of course, to achieve this I’ve been running on a more or less empty tank as far as sleep is concerned and fitting the work in around my life at university, which has been interesting. Still, we’re now at the point where things are stable and everything basically works, so I’m going to grab a few hours of sleep before lectures tomorrow, and a long, long lie-in on Saturday. Enjoy!

EVE Scalability Explained

OK, I’ve seen a bunch of posts on the EVE blogosphere about this recently, and it’s always been a tricky topic to understand. This post aims to demystify EVE’s architecture and explain in simple terms what EVE’s current issues with scaling for fleet fights are, and the approaches for fixing them. First, a disclaimer: I do not work for CCP and I don’t get behind-the-scenes information. This post is compiled from several years of working on EVE third-party development and from talking to people who work at CCP, people who have worked at CCP, and the community at large. To the best of my knowledge it is mostly correct, but I make no promises. If you’re looking for an exact technical description, look elsewhere.

So, let’s start with the basics. This is the (somewhat simplified) hardware layout for Tranquility.

To sum up in words: There are proxy servers that receive your data and route you to the appropriate sol server, which is running on a sol node or reinforced sol node. These servers communicate with a single, shared database server, which is also used for web services like the API and the MyEVE website (and, soon, Spacebook).

There’s an important distinction to be made here, one that is vital to understanding EVE’s architecture: nodes and servers are not the same thing. A node is the actual physical hardware (at the time of writing, an IBM blade server) and may run one or more sol servers. Each sol server, as the name implies (Sol is the name of our sun), is responsible for one solar system in EVE. It is a software server process, handling everything that goes on in that system: combat, mining, the market, and so on.
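If it’s easier to picture as data, here’s a toy sketch of that relationship. This isn’t CCP’s code, just a way of visualising “one node, possibly several sol servers, one solar system each”.

```python
# Toy model of the node / sol server distinction (illustrative only).
from dataclasses import dataclass, field
from typing import List

@dataclass
class SolServer:
    solar_system: str                    # each sol server owns exactly one system

@dataclass
class Node:
    hostname: str                        # one physical blade
    sol_servers: List[SolServer] = field(default_factory=list)

# An ordinary node may host several quiet systems...
quiet_node = Node("blade-07", [SolServer("Hek"), SolServer("Rens"), SolServer("Gultratren")])

# ...while a reinforced node is dedicated to a single busy one.
reinforced_node = Node("blade-42", [SolServer("Jita")])
```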

EVE’s scalability issues stem from this design, so let’s look at what those issues are. Can EVE handle 56,000 players? Yep, easily. Tranquility will be able to handle many more than that without issue, and because of this design the capacity can be expanded simply by adding sol nodes for sol servers to run on, spreading the load efficiently. Will you be able to fit 3,000 people onto a gate? Nope. Why? Because EVE was designed so that the capacity of the whole cluster scaled well, not that of individual systems. That was a design decision made back in the early days of EVE, and it has served the game well, with the exception of fleet combat and Jita. So how do we handle the edge cases?

Well, where does lag come from? Proxy servers have an easy job, and they are not a bottleneck in the vast majority of circumstances. The main issue they cause is disconnects: when a proxy server fails, a good chunk of EVE’s inhabitants disappear until they reconnect. The lag is in combat and in high-concurrency systems like Jita, where loads of people trade, talk in local, and fly around suicide-ganking each other. This lag stems from intensive work that has to be done: mathematical steps like calculating transversal velocities between objects, things that (algorithmically speaking) have complexity of O(n^2) or worse. If you didn’t understand that, it just means the work grows much faster than the number of ships; double the ships on grid and you roughly quadruple the work.
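To make that concrete, here’s a small illustrative sketch: with n ships on grid there are n(n-1)/2 pairs to consider, so ten times the ships means roughly a hundred times the work. The maths below is deliberately simplified; only the growth matters.

```python
# Why per-pair work hurts: n ships means n*(n-1)/2 pairs, so the cost grows
# roughly with n squared. The "relative speed" here is a deliberately
# simplified stand-in for the real transversal-velocity calculation.
import math
from itertools import combinations

def relative_speed(a, b):
    # a and b are (x_velocity, y_velocity) tuples.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def pairwise_work(velocities):
    pairs = 0
    for a, b in combinations(velocities, 2):
        relative_speed(a, b)
        pairs += 1
    return pairs

print(pairwise_work([(1.0, 0.0)] * 100))    # 4,950 pairs
print(pairwise_work([(1.0, 0.0)] * 1000))   # 499,500 pairs: 10x the ships, ~100x the work
```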

Obviously there are optimisations that can be made and better algorithms to use, and CCP uses them, but the fact remains: this is a lot of work for a computer. Loads. Absolutely shedloads. And that’s all this challenge gets: one computer, at most. In bad cases it won’t even get that, because most sol nodes run multiple sol servers, which is why lag sometimes seems to cross between systems. It really can, and does. Reinforced nodes just have more firepower and a guarantee of exclusivity, but they’re still only one computer. And as Google has taught the industry, lots of small computers are cheaper, easier to fix, and faster than one big box.

True scalability will come to EVE when a sol server can be distributed seamlessly (without rebooting or dropping clients) and near-instantly across multiple sol nodes. That would let fleet fights take all the resources they need, and it would let CCP run cheaper hardware, making scaling the cluster cheaper and easier. And the cluster stays scalable, assuming some hardware is kept spare for sol servers to grow onto in the event of a fight.

So what needs to be done to achieve this, and why haven’t we got it yet? Well, it’s a heck of a lot of work. It’s a huge technical challenge even before you bring internet spaceships into it. Then there are the hardware prerequisites: you need insanely fast, low-latency networking (InfiniBand, Fibre Channel, etc.) and the extra nodes. It’s a huge investment for CCP, but one they’ll have to make eventually unless they find another way of solving the problem; and any other solution (grid sharding, for instance) is likely to break immersion and cohesion in the game, and so is unlikely.

I hope that helps explain some of the thinking behind EVE’s architecture and why you lost that titan last night. And why it’s likely you’ll lose a few more before it’s fixed.

Minor second disclaimer: It’s 2:30 AM and I’m tired as hell, so this may contain errors. Feel free to point any out in the comments.

Moondoggie & Market Browsing

OK. EVE Metrics is my big market-browsing project. It’s very complex and it involves a lot of data, but it all basically comes down to this: people browse the market in EVE Online with an uploader program running alongside on their computer; whenever market data is viewed, EVE writes it to a cache file, and the uploader decodes that file and fires the data at our server. We collect all these reports and build a single picture of the market in EVE.
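In very rough terms, the uploader’s loop looks something like the sketch below. The cache directory, the decode step and the endpoint are all placeholders rather than the real uploader’s internals; it’s just the shape of the flow.

```python
# Rough sketch of the uploader flow: watch for new cache files, decode them,
# send the result to the server. The directory, decoder and URL are
# placeholders, not the real uploader's internals.
import json
import time
from pathlib import Path
from urllib import request

CACHE_DIR = Path("~/EVE/cache").expanduser()          # placeholder location
UPLOAD_URL = "https://example.invalid/upload"         # placeholder endpoint

def decode_market_cache(raw):
    """Stand-in for decoding EVE's binary cache format into market rows."""
    return []   # the real uploader does the heavy lifting here

def watch_and_upload(poll_seconds=5):
    seen = set()
    while True:
        for path in CACHE_DIR.glob("*.cache"):
            if path in seen:
                continue
            seen.add(path)
            rows = decode_market_cache(path.read_bytes())
            body = json.dumps(rows).encode()
            req = request.Request(UPLOAD_URL, data=body,
                                  headers={"Content-Type": "application/json"})
            request.urlopen(req)                      # fire the report at the server
        time.sleep(poll_seconds)                      # wait for the client to write more
```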

There’s the top-down view for you. We’ve never really been short of data: we get good market coverage in most regions and we’re fairly up to date in the grand scheme of things. But compare the actual market in EVE to EVE Metrics and we’re still a long way off having a truly accurate picture. EVE moves quickly; in some markets, orders shuffle around, change price, and get bought out from minute to minute.

With Dominion we got a new in-game browser. Not only does this mean you can now use the full EVE Metrics website in game, it also (through some JavaScript client hook additions) lets us provide a fantastic new tool to help us get an even better picture of the market in EVE.

If you fire up the IGB and head over to the upload suggestions page, you’ll be given a list of 10 items, along with a few options for automatic checking. Choose one of those options and the page will ask EVE Metrics for a list of items to check, then automatically go and view each of them for you. It’s slow, but it works: in the space of a few hours with one user, we can get data for an entire region across all the items on the market. This is utterly fantastic, and we’re really looking forward to the larger volume of data it brings to the site.
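How might those 10 suggestions get picked? One plausible approach, and this is purely my illustration rather than a description of what the site actually does, is to offer whichever items have the stalest data for the region:

```python
# Hypothetical way of choosing upload suggestions: rank items by how stale
# their data is for a region and offer the oldest ten. An assumption on my
# part, not EVE Metrics' actual query.
from datetime import datetime

def suggest_items(last_upload_times, count=10):
    """last_upload_times maps type_id -> datetime of last upload (or None)."""
    never = datetime.min
    ranked = sorted(last_upload_times,
                    key=lambda type_id: last_upload_times[type_id] or never)
    return ranked[:count]

print(suggest_items({34: datetime(2009, 12, 1), 35: None, 36: datetime(2009, 12, 10)}))
# -> [35, 34, 36]: never-seen items first, then oldest data
```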

So, if you’ve got a spare moment, need to go AFK for an hour, want to help out while you’re mining, or are just tired of clicking the next item in the list, install the uploader and visit the page in game to get started. Every upload counts and helps us build the biggest, best picture of EVE’s market we can produce. Uploads to EVE Metrics are also syndicated to other websites and tools, of course. Your uploads and your time help the hundreds of people who use the site, and the tens of thousands more who rely on our pricing, history and order APIs for their applications.

Oh, and if you’re a developer, we now have a server status API with all the information you could possibly want on TQ, Sisi and the API servers. It can be found here (docs here). Enjoy!
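For the developers: polling it is the usual fetch-and-parse affair. The URL in this little sketch is a placeholder, so check the linked docs for the real endpoint and response fields.

```python
# Polling a status API like the one described; the URL is a placeholder,
# so substitute the real endpoint from the documentation.
import json
from urllib import request

STATUS_URL = "https://example.invalid/api/status.json"   # placeholder

def server_status():
    with request.urlopen(STATUS_URL) as resp:
        return json.load(resp)

# print(server_status())   # e.g. online flags and player counts for TQ and Sisi
```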