almost 3 years ago - EVE Online Team - Direct link

With millions of market transactions and hundreds of thousands of explosions every day, keeping New Eden running smoothly – and still setting records after nearly 20 years – is a monumental task. Like our players, we're constantly on the hunt to push boundaries. In the last two years, we've made significant changes to the hardware behind Tranquility, resulting in almost daily records for the shortest EVE downtime (among other things). But to see where we are, it's important to look at where we've been.

Back in 2005, only 2 years after launch, EVE was already setting records as a standout single-shard universe. New Eden was generating some 1,250 database calls per second which added up to over 60 million per day! As more players came online, the servers struggled to keep up. Game features were taking over 20 seconds to load, making everyday tasks in New Eden such as warping across the galaxy, completing market transactions at the new major market hub Jita 4-4, and upgrading your ship far more challenging.
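
As a quick sanity check on those figures, here is a minimal sketch of the arithmetic. It uses only the two numbers quoted above. Reading 1,250 calls per second as a peak rather than a round-the-clock rate is an assumption made here to reconcile the two figures, not something the blog states.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def calls_per_day(calls_per_second: float) -> float:
    """Daily total if the given rate were sustained around the clock."""
    return calls_per_second * SECONDS_PER_DAY


def average_rate(daily_total: float) -> float:
    """Average calls per second implied by a daily total."""
    return daily_total / SECONDS_PER_DAY


sustained = calls_per_day(1_250)        # what a constant 1,250/s would add up to
implied_avg = average_rate(60_000_000)  # average rate behind 60 million per day

print(f"{sustained:,.0f} calls/day at a constant 1,250/s")
print(f"{implied_avg:,.0f} calls/s on average for 60 million/day")
```

A constant 1,250 calls per second would come to 108 million per day, so "over 60 million" is consistent with a load that peaks near 1,250 and averages somewhat lower.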

The solution to ending this suffering came in the form of a Texas Memory Systems magic box that stored data on RAM chips. It didn't really make sense, cost a boat load of money, and promised the world. We were sold!

Within days of putting the new database servers and storage into production, we were breaking PCU records and EVE pilots were enjoying a much smoother gaming experience. Forums lit up with much happier Capsuleers, and suddenly the server could handle massive fleet battles of... 100v100 with ease. Oh, how times have changed!

Fast forward to 2009 and New Eden was still growing. Not only were there more players, but wormholes were discovered for the first time. The ecosystem was demanding more storage and the accumulation of space junk was wearing on our current storage.

Being so pleased with our last tech adventure, we looked again to Texas Memory Systems for the solution. This time we went a slightly different route and picked their RamSan 500 with a massive two terabyte configuration! Now the entire EVE Database could be stored on an SSD, which was almost unheard of at the time!

Texas Memory Systems were just as excited as us and we managed to convert their president, Woody Hutsell, into a proud member of the Gallente Federation.

Then 2011 rolled around and, while the storage was OK, it became clear that our Database Servers themselves were becoming our main bottleneck and throttling the expanding universe of EVE. Denizens of New Eden were met with the dreaded three-letter word: lag. Solving this was a multi-pronged attack which saw the introduction of Time Dilation and further optimizations. But while exploring the Database Servers themselves we thought – why think short term? Let's future-proof all the things!

What happened next was, at the time, one of the most comprehensive upgrades Tranquility and gaming had ever seen: HyperThreading, aka the jaw-dropping CPU upgrade that took TQ by storm.

This monumental update quadrupled the RAM (128GB DDR2 to 512GB DDR3), overhauled and upgraded the network and storage systems (which tripled transfer speeds), and increased storage from 2TB to 11.5TB! You can read more about it in this news item we released.

The war on Lag is never-ending, and in 2015 we decided to improve things even further with the launch of the TQ Tech III upgrade, along with a physical move between datacenters. For the first time, our database servers were in the form of a flex node, Lenovo x880s to be exact. This meant upping the RAM to 768 GB per node and using super-fast 3.2 GHz E7-8893 v3 CPUs.

The storage array got a boost as well, with new SVC primary controllers and v5000s moving in to help consolidate all the storage. We added 9x 800GB SSDs and 83x 1.2TB 10k RPM SAS disks. This new hardware allowed us to slowly migrate to the new data center with minimal downtime and lag, ensuring it wouldn’t disrupt players.

In 2016, not so long after the Tech III upgrade, we found that the 4-core CPUs were just not giving us what we needed. While the CPUs were extremely fast, they just didn't provide SQL with enough worker threads to satisfy the incoming requests. We needed more cores for such a high level of database activity. We had a solution, but it was both complex and fraught with danger! Just our style...

We had four monster x880 nodes, two for the primary game cluster and two for another database cluster that hosts the other databases. This secondary cluster was effectively idling and what we needed to do became clear: stack two of the flex nodes and join them together. Imagine it like the Decepticon Constructicons merging to form the great Devastator!

This brought the database server back to 32 total cores (4x CPUs with 4/8 hyper-threaded cores each) and a whopping 1.5 TB of RAM! Bigger fights, more market transactions, and more action across New Eden.
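
The totals above can be reproduced with a small sketch. The `Node` class and `stack` helper are hypothetical names for illustration. The figure of two CPUs per node is inferred by halving the "4x CPUs" total and is not stated in the post, and "4/8 hyper-threaded cores" is read here as 4 physical cores with 2 hardware threads each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    cpus: int
    cores_per_cpu: int
    threads_per_core: int
    ram_gb: int

    @property
    def hardware_threads(self) -> int:
        """Logical cores the operating system would see."""
        return self.cpus * self.cores_per_cpu * self.threads_per_core


def stack(a: Node, b: Node) -> Node:
    """Join two nodes with the same CPU layout into one logical server."""
    if (a.cores_per_cpu, a.threads_per_core) != (b.cores_per_cpu, b.threads_per_core):
        raise ValueError("nodes must share the same CPU layout")
    return Node(a.cpus + b.cpus, a.cores_per_cpu, a.threads_per_core,
                a.ram_gb + b.ram_gb)


# Assumed per-node layout: 2 CPUs, 4 cores each, 2 threads per core, 768 GB RAM.
x880 = Node(cpus=2, cores_per_cpu=4, threads_per_core=2, ram_gb=768)
merged = stack(x880, x880)

print(merged.cpus, "CPUs,", merged.hardware_threads, "hardware threads,",
      merged.ram_gb / 1024, "TB RAM")
```

Under those assumptions the merged server comes out at 4 CPUs, 32 hardware threads, and 1.5 TB of RAM, matching the figures quoted above.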

The Operations team gave a presentation at Fanfest in 2016: A Day in the life of Operations. They showed actual footage of this merge happening that you can see here:

Things ran well for the next four years, with one exception. We had a recurring problem related to SQL and NUMA Nodes which caused server crashes, inconsistency, and difficulty in troubleshooting. Because of the way SQL Server is designed, and because we now had four NUMA Nodes, we would end up in situations where one NUMA Node (or one CPU) was maxed out while the others were idling. This would lead to slow startups or no startups at all. We eventually traced the issue to a small subset of the 15,000 open database connections and how EVE was using them. There are two main ways to improve things: more hardware or development time. Or sometimes, as in this case, it’s providing hardware that helps devs fix issues in the software!
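
To illustrate how a small subset of connections can saturate one NUMA node while the rest idle, here is a toy model. It is not CCP's actual diagnosis: the round-robin pinning policy, the 40 "heavy" connections, and the load units are all invented for illustration. Only the four nodes and the 15,000 connections come from the text above.

```python
from collections import Counter

NODES = 4             # four NUMA nodes, as described above
CONNECTIONS = 15_000  # open connections quoted above


def assign_round_robin(n_connections: int, n_nodes: int) -> list[int]:
    """Pin connection i to node i % n_nodes (assumed policy for this toy model)."""
    return [i % n_nodes for i in range(n_connections)]


def load_per_node(assignment: list[int], loads: list[int]) -> Counter:
    """Sum the work each node receives from the connections pinned to it."""
    totals: Counter = Counter()
    for node, load in zip(assignment, loads):
        totals[node] += load
    return totals


# Hypothetical workload: every connection costs 1 unit, except 40 heavy
# connections costing 500 units each that all happen to land on node 0.
loads = [1] * CONNECTIONS
for i in range(0, 160, NODES):  # indices 0, 4, 8, ..., 156 all map to node 0
    loads[i] = 500

totals = load_per_node(assign_round_robin(CONNECTIONS, NODES), loads)
for node in range(NODES):
    print(f"node {node}: {totals[node]:>6} units")
```

In this made-up scenario, 40 connections out of 15,000 are enough to give node 0 more than six times the work of any other node, even though every node holds the same number of connections.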

So mid 2019 rolled around and this NUMA business was starting to have an actual impact outside of database failovers – leading to more of what we call ‘cold starts’, which is when we bring the cluster online with an empty database memory cache. We don’t like to see these because they make our server vulnerable, and EVE is one mean S.O.B on such a defenseless database server.

2019 is also when we discovered the mighty AMD EPYC 7742, new hardware that would help us solve these recurring issues once and for all. We fortunately met up with AMD at a gaming conference, and to our surprise, they were excited to work with us! Well, it’s not all that surprising: internet spaceships are serious business after all.

Before long we were racking the most powerful database server known to CCP! A 2U server with 2 TB of memory and a single CPU rocking 128 cores running at 2.25GHz and capable of boosting to 3.4 GHz! Now that’s a lot of horsepower under the hood.

We ran this server in production for several months to get a feel for it (who doesn't test in production?) and it was an absolute monster! While it ended up as two NUMA nodes (Windows currently cannot support more than 64 hardware threads per NUMA Node) it didn't matter as this ate everything we threw at it and asked for more! All our NUMA issues were handled with ease. We just had one problem: the cost.
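
The split into two NUMA nodes follows directly from the 64-hardware-thread ceiling mentioned above. A minimal sketch of that arithmetic, assuming the threads are divided evenly (the even split and the function names are assumptions for illustration):

```python
import math

MAX_THREADS_PER_NODE = 64  # ceiling quoted in the text above


def nodes_required(hardware_threads: int, limit: int = MAX_THREADS_PER_NODE) -> int:
    """Smallest number of nodes so that no node exceeds the limit."""
    return math.ceil(hardware_threads / limit)


def split_threads(hardware_threads: int, limit: int = MAX_THREADS_PER_NODE) -> list[int]:
    """Divide the threads as evenly as possible across the required nodes."""
    n = nodes_required(hardware_threads, limit)
    base, extra = divmod(hardware_threads, n)
    return [base + (1 if i < extra else 0) for i in range(n)]


print(nodes_required(128))  # nodes needed for the 128-thread box
print(split_threads(128))   # threads per node
```

For the 128-thread server this gives two nodes of 64 threads each, matching what the team observed.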

We tried out other hardware, but nothing came close to our AMD server. To make everything work, we reviewed the data and narrowed down our criteria: two sockets and less than 128 cores, but more than 32 to allow for growth and overhead. Meeting in the middle seemed best so 64 cores would be our target. Now, how much memory can we fit in the box?!

Jump Jump Jump – it’s mid 2020 and allegedly someone was buying up all the large memory modules to power their cloud services. We decided that we either had to wait or settle on smaller chips. So, we decided to settle. The boxes have 32 memory slots and if you fill them all with 128GB chips you should see something like this.

[Image: dellR730 new]

But, is 4 TB of memory even settling!? Nestled snugly amongst all that memory are twin Intel Gold 6346 V3 CPUs running at 3.6GHz, giving us a total of 64 cores across two NUMA Nodes. The two Practically-Tesla-Plaid-Priced 2U Servers are wondrous machines with double the CPU cores, more than double the RAM, double the disk access speed, and they are current-gen hardware in comparison to our previous 5-year-old tech.
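
The capacity figures can be checked with a few lines of arithmetic. This sketch uses the slot count, module size, and core counts quoted above, plus the 32 cores and 1.5 TB of the previous server. The even split of cores and memory across the two NUMA nodes is an assumption.

```python
SLOTS = 32      # memory slots per box
DIMM_GB = 128   # size of each memory module
CORES = 64      # total cores quoted above
NUMA_NODES = 2

PREVIOUS_CORES = 32     # the merged x880 server
PREVIOUS_RAM_TB = 1.5

total_ram_tb = SLOTS * DIMM_GB / 1024         # GB to TB
cores_per_node = CORES // NUMA_NODES          # assumes an even split
ram_per_node_tb = total_ram_tb / NUMA_NODES   # assumes an even split

core_ratio = CORES / PREVIOUS_CORES
ram_ratio = total_ram_tb / PREVIOUS_RAM_TB

print(f"{total_ram_tb} TB RAM, {cores_per_node} cores and "
      f"{ram_per_node_tb} TB per NUMA node")
print(f"{core_ratio:.1f}x the cores, {ram_ratio:.2f}x the RAM")
```

That works out to 4 TB in total, twice the cores, and roughly 2.67 times the RAM of the previous generation, in line with "double the CPU cores, more than double the RAM".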

We decided that they needed some way to connect to the new SAN too - so the HBA speed should get bumped as well! Wait, hang on... New SAN? Who said anything about a new SAN?!

To ensure we can give the best possible experience to both players and devs, we went all in! We replaced racks' worth of slow SSDs, hundreds of 10k SAS disks, and what seemed like a never-ending number of controllers. In their place went a super slick, insanely speedy box of 9.6 TB NVMe modules! (Two actually, we need high availability after all). This SAN upgrade was not only about raw storage but also changed how we connect to it. We jumped from 16 Gb/s fiber switches to 32 Gb/s! This cut our maintenance job duration in half. Backups are now done in just over 30 minutes.
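
Why doubling the link speed halves a job can be sketched as follows. This is an idealized model that assumes the job is limited purely by link bandwidth and ignores protocol overhead. The 1 TB data size is a made-up example, not a figure from the post.

```python
import math


def transfer_minutes(data_tb: float, link_gbps: float) -> float:
    """Ideal time to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 8 * 1000**4          # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9)    # Gb/s to bits per second
    return seconds / 60


DATA_TB = 1.0  # hypothetical job size, for illustration only

old = transfer_minutes(DATA_TB, 16)
new = transfer_minutes(DATA_TB, 32)

print(f"16 Gb/s: {old:.2f} min, 32 Gb/s: {new:.2f} min, ratio {new / old:.2f}")
```

The ratio is 0.5 whatever data size is chosen, which matches the halved maintenance job duration when the link is the bottleneck.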

And funnily enough, it turns out that IBM bought Texas Memory Systems some time ago. So, while we didn't go out looking for another RamSan, the universe had aligned to make it so. FlashSystem 7200 is the name and making databases faster is the game!

Here are a few images of the previous generation of SAN gear we removed.

[Images: blog servers, old san2, old san]

It has now been several months since we put the new database servers into production, and we've also broken multiple shutdown/startup records. The new hardware is performing wonderfully, thanks in no small part to the amazingly smart EVE Devs that fixed the NUMA issues just weeks before these machines finally landed in our Data Center!

Thanks for coming along with us on this historic journey through nearly 20 years of EVE’s database hardware evolution. We hope you enjoyed stepping back in time with us as much as we have. We can’t wait to see what database changes we have in store next! MySQL? NoSQL? YourSQL? Whatever it is, it'll be here helping to ensure that internet spaceships stay serious business!

PS: No hamsters were harmed before, during, or after this blog post. Previous generation workers are always retired with the honor and respect they deserve, to that big (non-cloud-based-server) farm in the sky.

Whether or not you understand any of this, head on over to the official thread on the EVE Online forums to join the player discussion.

almost 3 years ago - CCP_Dopamine - Direct link

Greetings,

We have released a development blog about the history of EVE servers. It is packed with technical information about the journey the hardware supporting EVE has taken.

almost 3 years ago - CCP_Swift - Direct link

I believe some of the nodes are up for grabs at the Fanfest Charity Silent Auction this year!

almost 3 years ago - Mike_Azariah - Direct link

Before taking a picture of that they would have to spend HOURS cleaning off all the fingerprints from the Button.

m

almost 3 years ago - CCP_DeNormalized - Direct link

We had another AMD dual-socket 8-core CPU box in to play with, but it ended up having weird HBA issues so we had to rule it out.

There was also a single-socket 28-core Intel box we tested - at the time EVE’s code base overran it mercilessly on cold starts. We could barely get the cluster started :)

almost 3 years ago - CCP_DeNormalized - Direct link

Peak is around 15,000 while our average is around 8k.

almost 3 years ago - CCP_DeNormalized - Direct link

Next blog will focus on the software side of the database config with all that fun stuff.

almost 3 years ago - CCP_DeNormalized - Direct link

Our next blog will focus on the software side, but you are right, it is Windows 2019 and it’s Standard.

I don’t think the DC version gives us anything unless it’s being used as a hypervisor host.

almost 3 years ago - CCP_DeNormalized - Direct link

We’re indeed heavily invested in stored procs - not to mention a few enterprise edition features of MSSQL - in particular table partitions for swapping out entire blocks of data for deletes.

Dev Teams are and have been using other database tech for their features for some years now however, so we’re far from ONLY using MS SQL.

There’s at least CosmosDB/Postgres and a few other cloud-based managed DB services being used.

Horses for courses and such
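
The partition swapping mentioned in the post above can be pictured with a toy in-memory model. This is not T-SQL and not CCP's schema: the `PartitionedTable` class, the month keys, and the sample rows are hypothetical. In SQL Server the corresponding operation is `ALTER TABLE ... SWITCH PARTITION`, which moves a whole partition as a metadata change instead of deleting rows one by one.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy table whose rows are grouped by a partition key (here, a month)."""

    def __init__(self) -> None:
        self.partitions: dict[str, list[dict]] = defaultdict(list)

    def insert(self, month: str, row: dict) -> None:
        self.partitions[month].append(row)

    def switch_out(self, month: str) -> list[dict]:
        """Detach a whole partition in one step and return its rows."""
        return self.partitions.pop(month, [])

    def row_count(self) -> int:
        return sum(len(rows) for rows in self.partitions.values())


table = PartitionedTable()
for i in range(3):
    table.insert("2020-01", {"id": i})
for i in range(3, 5):
    table.insert("2020-02", {"id": i})

old_rows = table.switch_out("2020-01")  # one operation, no per-row deletes
print(len(old_rows), "rows swapped out,", table.row_count(), "rows remain")
```

The point of the design is that removing a block of old data costs one operation on the partition rather than a delete per row, which is what makes it attractive for large purges.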

almost 3 years ago - CCP_DeNormalized - Direct link

I’ve not done any real in-depth checking into which is more stable.

They both seem fairly reliable though and I’d guess instability comes more from us changing things and bugs being fixed/introduced than actual underlying platform issues.

It’s difficult to baseline and compare things over time when the code base changes daily

almost 3 years ago - CCP_Explorer - Direct link

No.

(Post must be 5 characters, so here are more characters…)

almost 3 years ago - CCP_Explorer - Direct link

Probably never.

almost 3 years ago - CCP_Explorer - Direct link

All production hardware is repurposed for use as test servers for many years.

almost 3 years ago - CCP_Explorer - Direct link

Did we ever get EVE started on that 28C Intel box?

almost 3 years ago - CCP_Explorer - Direct link

That would be a massive undertaking; migrating all tables, views, functions, and stored procedures from T-SQL to PL/pgSQL, along with all monitoring, metrics, and institutional knowledge; and would therefore have to entail massive benefits.

almost 3 years ago - CCP_Explorer - Direct link

AMD’s approach of a high number of cores doesn’t fit well with Microsoft’s licensing of SQL Server per core. I would love for Microsoft to measure volume, throughput, usage for licensing purposes somehow differently than just counting the cores.

almost 3 years ago - CCP_DeNormalized - Direct link

Hmm, looking back at my notes I don’t think we actually managed a full start!

We knew we could (we thought we could) if we disabled ESI to allow for a softer startup but I don’t think we went forward with that test.

sept 28, 2020 - * TQ failover testing on Intel28 - no luck - still fails after manual soft numa
sept 23, 2020 - * TQ Failover to Intel 28 - failure

almost 3 years ago - CCP_Explorer - Direct link

I’m not even authorized for datacenter access.

over 2 years ago - CCP_Explorer - Direct link

False. Wormholes are generated and managed by the Wormhole Manager, which is in the game logic code (Python). (The state is persisted in the DB.)