EVE Scalability Explained

January 12, 2010

OK, I’ve seen a bunch of posts on the EVE blogosphere about this recently, and it’s always been a tricky topic to understand. This post aims to demystify EVE’s architecture and explain in simple terms what EVE’s current issues with scaling for fleet fights are, along with approaches for fixing them. So first, a disclaimer: I do not work for CCP, and I don’t get behind-the-scenes information. This is a post compiled from several years of working on EVE third-party development and talking to people who work at CCP, people who have worked at CCP, and the community at large. To the best of my knowledge this is mostly correct, but I make no promises. If you’re looking for an exact technical description, look elsewhere.

So, let’s start with the basics. This is the (somewhat simplified) hardware layout for Tranquility.

To sum up in words: There are proxy servers that receive your data and route you to the appropriate sol server, which is running on a sol node or reinforced sol node. These servers communicate with a single, shared database server, which is also used for web services like the API and the MyEVE website (and, soon, Spacebook).

There’s an important distinction to be made here, and one that is vital to understanding EVE’s architecture: nodes and servers are not the same thing. Nodes are the actual physical hardware (at time of writing, IBM Blade servers), each of which may run one or more sol servers. Each sol server is, as the name implies (Sol is the name for our sun), responsible for one solar system in EVE. It is a software server process, handling everything that goes on in a system: combat, mining, market, and so on.
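To make the node/server distinction concrete, here is a minimal Python sketch of the relationship. The class names, the `locate` helper, the node names, and the placement of particular systems on particular nodes are all made up for illustration; this is not CCP’s code, and the rule that a reinforced node hosts exactly one sol server is my reading of the "guarantee of exclusivity".

```python
from dataclasses import dataclass, field


@dataclass
class SolServer:
    """One software process, responsible for exactly one solar system."""
    system_name: str


@dataclass
class SolNode:
    """One physical machine that can host one or more sol servers."""
    name: str
    reinforced: bool = False
    servers: list[SolServer] = field(default_factory=list)

    def host(self, server: SolServer) -> None:
        # Assumption: a reinforced node runs a single sol server exclusively.
        if self.reinforced and self.servers:
            raise ValueError(f"{self.name} is reinforced and already in use")
        self.servers.append(server)


def locate(nodes: list[SolNode], system_name: str) -> SolNode:
    """Find the node hosting a system (roughly the lookup a proxy needs)."""
    for node in nodes:
        if any(s.system_name == system_name for s in node.servers):
            return node
    raise KeyError(system_name)


# Hypothetical layout: three quiet systems share one node, and one busy
# system gets a reinforced node to itself.
shared = SolNode("node-01")
for name in ("Amarr", "Rens", "Dodixie"):
    shared.host(SolServer(name))

dedicated = SolNode("node-02", reinforced=True)
dedicated.host(SolServer("Jita"))

cluster = [shared, dedicated]
print(locate(cluster, "Jita").name)
print([s.system_name for s in locate(cluster, "Rens").servers])
```

The point of the sketch is that adding capacity to the cluster just means adding more `SolNode` objects and spreading servers across them, while any one `SolServer` still lives on exactly one node.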

EVE’s scalability issues stem from this design, but let’s look at what those issues are. Can EVE handle 56,000 players? Yep, easily. Tranquility will be able to handle many more than that without issue, and because of this design the capacity can be easily expanded by increasing the number of sol nodes for sol servers to run on, spreading the load efficiently and easily. Will you be able to fit 3000 people onto a gate? Nope. Why? Well, because EVE was designed so that the capacity of the whole cluster expanded well, not individual systems. This was a design decision made back in the early days of EVE and it has served EVE well, with the exception of fleet combat and Jita. So how to handle the edge cases?

Well, where does lag come from? Proxy servers have an easy job and they are not a bottleneck in the vast majority of circumstances. The main issues they cause are disconnects; when a proxy server fails, a good chunk of EVE’s inhabitants disappear till they reconnect. The lag is in combat and in high-concurrency systems like Jita, where loads of people trade, talk in local, and fly around suicide ganking each other. This lag stems from intensive processes that have to be done: mathematical steps like calculating transversal velocities between objects, things that have complexity values (algorithmically speaking) of O(n^2) or worse. If you didn’t understand that, it just means that the more ships you have, the harder things get, and the work grows much faster than the ship count: at O(n^2), doubling the number of ships roughly quadruples the work.
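To put rough numbers on that growth, the sketch below counts how many ship-to-ship pairs exist for a given number of ships, and shows a naive all-pairs calculation. It uses plain 2D distance as a stand-in for the real per-pair maths (transversal velocity and so on), and the function names are mine; it illustrates the O(n^2) shape of the problem, not how EVE actually computes anything.

```python
from itertools import combinations
from math import hypot


def pair_count(n: int) -> int:
    """Number of distinct pairs among n ships: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def pairwise_distances(positions):
    """Naive all-pairs pass: one calculation for every pair of ships."""
    return [
        hypot(ax - bx, ay - by)
        for (ax, ay), (bx, by) in combinations(positions, 2)
    ]


for ships in (10, 100, 1000, 3000):
    print(f"{ships:>5} ships -> {pair_count(ships):>9,} pairs")
# Prints 45, 4,950, 499,500 and 4,498,500 pairs respectively.
```

Going from 100 ships to 1,000 is ten times the ships but roughly a hundred times the pairs, which is why a fight that is fine at a few hundred pilots can fall over at a few thousand.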

Obviously, there are optimisations that can be done and better algorithms that can be used, and CCP uses them, but the fact remains: this is a lot of work for a computer. Loads. Absolutely shedloads. And that’s all this challenge gets: one computer, at most. In bad cases it won’t even get that, because most sol nodes run multiple sol servers. That is why lag sometimes seems to cross between systems; it really can, and does. Reinforced nodes just have more firepower and a guarantee of exclusivity, but they’re still only one computer. And as Google has taught the industry, lots of small computers are cheaper, easier to fix, and collectively faster than a single big box.
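Here is a toy model of why lag can cross between systems that share a node. It assumes a node has a fixed amount of work it can do per tick and that, when the sol servers on it ask for more than that, everything on the node slows down by the same factor. The function, the capacity figure, and the demand numbers are invented for illustration; I don’t know how the real scheduling works.

```python
def tick_slowdown(node_capacity: float, demands: dict[str, float]) -> float:
    """Factor by which every system on the node slows down this tick.

    1.0 means the node keeps up; 2.0 means each tick takes twice as long
    for every sol server hosted on the node.
    """
    total = sum(demands.values())
    return max(1.0, total / node_capacity)


# A quiet night: three systems comfortably share one node.
print(tick_slowdown(100.0, {"system A": 20.0, "system B": 15.0, "system C": 5.0}))

# A fleet fight breaks out in system A; B and C suffer too,
# even though nothing changed for the players in them.
print(tick_slowdown(100.0, {"system A": 240.0, "system B": 15.0, "system C": 5.0}))
```

In this model a reinforced node only changes two things: `node_capacity` is larger and the fight’s system is the only entry in `demands`. The ceiling is still one machine.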

True scalability will come to EVE when a sol server can be distributed seamlessly (without rebooting or dropping clients) and near-instantly across multiple sol nodes. That will mean that fleet fights can take all the resources they need, and that CCP gets to run cheaper hardware, making scaling the hardware cheaper and easier. And you maintain the scalability of the cluster, assuming you keep some spare sol nodes for sol servers to grow onto in the event of a fight.

What needs to be done to achieve this? Why haven’t we got this yet? Well, it’s a heck of a lot of work. It’s a huge technical challenge, leaving internet spaceships out of it. Then there are the hardware prerequisites: you need insanely fast low-latency networking (InfiniBand, Fibre Channel, etc.) and the extra nodes. It’s a huge investment for CCP, but one they’ll have to make eventually unless they find another way of solving the problem. Any other solution (grid sharding, etc.) is likely to break immersion and cohesion in the game, and so is unlikely.

I hope that helps explain some of the thinking behind EVE’s architecture and why you lost that titan last night. And why it’s likely you’ll lose a few more before it’s fixed.

Minor second disclaimer: It’s 2:30 AM and I’m tired as hell, so this may contain errors. Feel free to point any out in the comments.

8 Comments

  • January 12, 2010 @ 12:45

    Linked to your post from mine: http://www.ninveah.com/2010/01/single-server-blues.html

    Thanks for the clearing up of the details. :)

    January 12, 2010 @ 13:25

    Great overall post…I certainly understand more now reading your post. While some details are certainly different I am wondering the following:

    In the late part of 2009 (IIRC it might have been right around Dominion) the database was upgraded to 64-bit. We (where I work) have had MAJOR issues with the 64-bit version of SQL from MS…could this be exacerbating the problem you outline above? I’d love to read your thoughts.

    January 12, 2010 @ 14:29

    I know that stability-wise the database server has been having problems more regularly since the upgrade, and that CCP have been working with Microsoft heavily to try and improve things. But yes, I’ve heard a lot about issues with 64-bit MSSQL; CCP really are now 64-bit and pretty much stuck with it. Hopefully the server will mature and improve; it’s not like MySQL/PgSQL where you can poke inside it and fix bugs as you go, though, so who knows how that is going..

    Entirely unrelated, but CCP are now feeding 64-bit IDs out into the API which might break a fair few applications; developers have been used to the 32-bit IDs for say wallet journals that are recycled fairly regularly. We’ve had to update a few database tables in EVE Metrics to handle this; it’s good in the long run, but they could have at least announced it…

    January 12, 2010 @ 17:34

    Thanks for your comment James…I did not realize about the 64bit ID’s either…sigh …. I wonder why they tell us nothing sometimes.

    January 15, 2010 @ 15:40

    Great post and thanks for the info (I love when we see the machines behind the service).

    Makar Kravchenko
    May 21, 2010 @ 09:48

    Thanks for linking this to the EVE forum. This is a great insight into how the EVE cluster operates. I am only a Visual Basic programmer, with limited knowledge of C/C++. I have fooled with Python and other interpreted languages, and have recognized bottlenecks in the interpreter. I did not however realize that Stackless avoids the interpreter lock, which clears up why Stackless IO would be as efficient as it has been thus far. Hopefully CCP moves forward with things like HPC and CUDA for EVE Online, and can make fleet battles playable to marginally larger degrees. The best thing about EVE is the scale at which things occur. To diminish this scale would be disheartening to say the least.

    This PyCon talk, in which CCP showcased their use of Stackless Python, was very informative as well:

    http://us.pycon.org/2009/conference/schedule/event/91/

    Aineko
    May 31, 2010 @ 09:01

    Interesting. The article ties in with some stuff I wrote about eve’s scalability problem a while back. http://www.eveonline.com/iNgameboard.asp?a=topic&threadID=1234590