[2023] [Devblog] Tranquility Tech IV!

almost 2 years ago - CCP_Swift - Direct link

EVE Evolved extends to hardware, too. Check out the latest updates from the hardware team in the newest devblog: Tranquility Tech IV | EVE Online

If you have any comments or questions about the content, we’d love to hear them!

almost 2 years ago - Brisc_Rubal - Direct link

I understood a few words in this.

You IT nerds are rubbing off on me.

almost 2 years ago - CCP_Explorer - Direct link

No and perhaps yes. The deprecated hardware of TQ Tech III is headed for our other datacenter, where test servers are hosted. Ultimately then some of the hardware that will be deprecated there (which is then TQ Tech II or older) might get auctioned off.

almost 2 years ago - CCP_Explorer - Direct link

We have verifiable metrics that the DB upgrade has worked well. We are still gathering data on the performance of the new solarsystem simulation hardware, since it has only been a month and a half in the cluster; but so far, so good.

Re. the issues you describe: Chat is not hosted in the London datacenter at all; rather it is in AWS Ireland and the EVE Client connects directly to it. Same goes for Search, and then Contracts rely on Search to look up types and characters. So these hardware upgrades would not have affected that one way or another. We are aware of some gnarly and difficult-to-repro issues with Search after we updated to the latest version of AWS’ OpenSearch but they are being looked at.

almost 2 years ago - CCP_Explorer - Direct link

We looked at many options and these choices were the best ones. You will note from the database blog https://www.eveonline.com/news/view/a-history-of-eve-database-server-hardware that I linked in this devblog that we looked at AMD choices in great detail and tested them.

almost 2 years ago - CCP_Explorer - Direct link

There are specific nodes that handle the mapping requests that are sent via https://community.eveonline.com/support/fleet-fight/ but those are (currently) running on the same type of hardware as the TQ Tech IV “rank-and-file” machines as those machines perform better than the dedicated hardware in TQ Tech III.

The benefit of sending in Fleet Fight Notifications is that the solarsystem will be isolated on its own node (like Jita is all the time) instead of being mixed with about 20-40 other solarsystems. Then in 1-3 years, when we start upgrading again, the fleet fight nodes will be moved to the best hardware each time.
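The isolation described above can be pictured with a small sketch. This is only an illustration under assumptions, not how the cluster actually maps systems: the `Node` class, the `assign` function, and the round-robin placement of the remaining systems are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Node:
    name: str
    systems: list = field(default_factory=list)


def assign(systems, reinforced, dedicated, shared):
    """Give each reinforced system a node to itself; spread the rest round-robin.

    Hypothetical placement policy: `shared` must be non-empty, and a reinforced
    system falls back to a shared node if no dedicated node is left.
    """
    free = list(dedicated)
    pool = cycle(shared)
    for system in systems:
        if system in reinforced and free:
            free.pop(0).systems.append(system)
        else:
            next(pool).systems.append(system)
    return dedicated + shared
```

With two dedicated and two shared nodes, a notified system ends up alone on its node while the other systems share the rest.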

almost 2 years ago - CCP_Explorer - Direct link

Difficult to say without knowing more details. I just tested logging in and opening a non-Jita market and it opened in under a second and everything was very responsive as I clicked around. This could be client-side because it needs to process your orders to highlight them but could be something else. Does the client halt, does it spin up to 100% CPU, does it log while this is happening?

almost 2 years ago - CCP_Explorer - Direct link

The game cluster runs an intra-cluster heartbeat for all the nodes and then there are other tools that monitor the machines themselves, the operating system, and the SQL Server.
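As a rough picture of what an intra-cluster heartbeat check can look like, here is a minimal sketch. It is an assumption for illustration only: the `HeartbeatMonitor` class, its method names, and the timeout value are hypothetical and say nothing about the real monitoring tools.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that have gone quiet."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # hypothetical threshold, in seconds
        self.clock = clock          # injectable clock so the logic can be tested
        self.last_seen = {}

    def beat(self, node_id):
        """Record that a node has just reported in."""
        self.last_seen[node_id] = self.clock()

    def silent_nodes(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)
```

A node that stops calling `beat` shows up in `silent_nodes()` once the timeout has passed.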

I’m not sure I understand the question; but all the machines run Windows Server as their operating system.

almost 2 years ago - CCP_DeNormalized - Direct link

We have a few test servers running windows/sql 2022 - all running fine at the moment, but I do recall hearing some issues with a windows patch causing reboot loops

almost 2 years ago - CCP_DeNormalized - Direct link

My current negotiations with Ops have us at a new DB VM with 512GB of ram to start with. We’ll see how it goes from there

They don’t generally like it when I start a convo haha

We do have several virtual db’s in production though and always looking to migrate away from bare metal where it makes sense (one of the new vm’s is a remote AlwaysOn read-replica)

almost 2 years ago - CCP_Explorer - Direct link

There is only one VM in Tranquility proper, and that is a proxy for internal use only. Everything player-facing is running directly on top of hardware. We have tested using VMs for some of the smaller player-facing services and it worked fine at run-time, but we had issues starting the cluster since the nodes running on the VMs would lag behind the nodes running on hardware.

There are many VMs and pods running in the Tranquility Ecosystem, inside AWS. None of them existed at the time of Tranquility Tech III but have been added since then, outside the simulation/game cluster.

almost 2 years ago - CCP_Explorer - Direct link

I wrote this with input from @CCP_DeNormalized and others.

almost 2 years ago - CCP_Explorer - Direct link

Every time we have upgraded hardware or optimized software, then players have brought more pilots for the next fleet fight.

almost 2 years ago - CCP_Explorer - Direct link

No.

Inserting more characters here since evidently a reply must be 5 characters or more.

almost 2 years ago - CCP_Explorer - Direct link

Probably never. This is a safeguard that all the nodes involved have finished their work in moving you; this is the timeout when the other nodes can move ahead if there is no confirmation.
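The safeguard reads like a wait-for-confirmation with a deadline. A minimal sketch of that general pattern follows, assuming confirmations arrive on a queue; the function name, the queue, and the node names are hypothetical and not taken from the game code.

```python
import queue
import time


def wait_for_confirmations(expected, confirmations, timeout_s):
    """Return (confirmed, missing); stop when every node confirms or time runs out."""
    pending = set(expected)
    deadline = time.monotonic() + timeout_s
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timer expired: the caller may move ahead without confirmation
        try:
            node = confirmations.get(timeout=remaining)
        except queue.Empty:
            break  # nothing arrived before the deadline
        pending.discard(node)
    return set(expected) - pending, pending
```

If every node confirms early, the wait ends early; the timer only matters when a confirmation never arrives.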

almost 2 years ago - CCP_DeNormalized - Direct link

We’re planning a SQL only blog for a later date where we can share more info on the various configs we have going on.

The param sniffing stuff, have not gotten deep into that yet, but that’s basically where the db team is at now - sql 2022 testing and excitement - it’s got a ton of great features that’ll be useful not only to us dba’s but our data engineering team as well.

almost 2 years ago - CCP_DeNormalized - Direct link

Typically we use SOL as an interchangeable term for solar system node - a windows OS based server that can run one or more in-game solar systems/services.

For TQ, all sols are physical servers, most test servers run sols as virtual machines

almost 2 years ago - CCP_DeNormalized - Direct link

I mix up terms sometimes, so it depends on who you are talking to and the context

SOL Server would typically be either the physical or virtual server that our code runs on. A sol node would be one instance of the application code running on the actual SOL Server

almost 2 years ago - CCP_Explorer - Direct link

Tranquility proper is in a datacenter just outside London.

almost 2 years ago - CCP_Explorer - Direct link

Maybe, with future iterations on software.

almost 2 years ago - CCP_Explorer - Direct link

The session timer has probably always been there, in the background. I think it was only exposed in the UI in 2007.

almost 2 years ago - CCP_Explorer - Direct link

A node is a PROXY or SOL process in the game cluster that is assigned specific tasks. PROXY nodes are the nodes that the loadbalancer connects to on inbound connections from the Clients and they mostly handle session management and routing, but SOL nodes handle solarsystem simulation and run other services (such as market, skill training, industry).

almost 2 years ago - CCP_Explorer - Direct link

A SOL node is a process on a machine; we run 8 to 14 of them on each machine. But we often use no. 1 as well to refer to machines that only host SOL nodes.
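To keep the terms apart, the process/machine distinction can be written down as a small data model. This is a sketch for illustration only; the class and field names are hypothetical, and only the PROXY/SOL roles come from the posts above.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    PROXY = "proxy"  # session management and routing
    SOL = "sol"      # solarsystem simulation and other services


@dataclass
class NodeProcess:
    node_id: int
    kind: NodeKind
    services: list = field(default_factory=list)


@dataclass
class Machine:
    hostname: str
    nodes: list = field(default_factory=list)

    def is_sol_machine(self):
        """True when the machine hosts SOL processes and nothing else."""
        return bool(self.nodes) and all(n.kind is NodeKind.SOL for n in self.nodes)
```

In this model a "SOL node" is a `NodeProcess`, while a machine that only hosts SOL processes is a `Machine` for which `is_sol_machine()` is true.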

almost 2 years ago - ISD_Bahamut - Direct link

Some big improvements!

almost 2 years ago - CCP_Explorer - Direct link

More on this @Baldeda_Maxi; there was yet another deployment today to Search where the team is making changes and adding logging to try and track down these issues.

almost 2 years ago - CCP_Explorer - Direct link

Spot on analysis @MalcomReynolds_Serenity – BTW, we did indeed look at the Platinum processors but the price was not right.

almost 2 years ago - CCP_Explorer - Direct link

Yeah, what’s up with that @CCP_Swift?

almost 2 years ago - CCP_Explorer - Direct link

@CCP_DeNormalized and I love to hate NUMA.

almost 2 years ago - CCP_Explorer - Direct link

Uuhhh, I don’t know. @CCP_DeNormalized, can you ask around and reply?

almost 2 years ago - CCP_Explorer - Direct link

7 years actually, that’s the extended warranty.

But the part that I covered very quickly in the devblog is that in between TQ Tech III and TQ Tech IV there was TQ Tech IIIS (evidently S stands for “semis” meaning half in Roman numerals). So we are likely to remove the Gold 5122 and Gold 5222 powered machines in a few years’ time (possibly in 3-4 years), replace them with whatever is most high-end at that time (which will not be FLEX blades since IBM/Lenovo are stopping production), and move the heaviest workload there - JITA, The Forge market, fleet fights, etc.

Well, there is more to this. The Xeon Platinum processors are very expensive and some of the Gold ones are power-hungry and/or don’t fit our setup. Some liberal colour-coding:

It is also possible that we will move to AMD machines next time since IBM/Lenovo will not be making FLEX blades anymore (but purchasing FLEX blades was a requirement for this upgrade due to existing infrastructure) and the AMD EPYC Zen 4 processors are an interesting option.

Note the lack of AMD CPUs and the lack of Intel 4th gen:

almost 2 years ago - CCP_DeNormalized - Direct link

Love and hate is right

A few years ago, just months before super devs fixed our numa issues for good nojinx, we were evaluating new DB boxes and had 2 single socket servers in the line up for testing.

An Intel 24/48 core box and the EPYC 64/128 core. Due to whatever was going on within the code base at the time, we could not get the cluster to start with the single numa node. Blocking chains would tie up all the worker threads, max the CPU and threadpool waits would starve the simulation out - sols unable to heart beat. We so hated it…

The EPYC on the other hand made short work of it - was it the fact that it had 128 cores vs. the 48 on the Intel box? Or was it due to windows being unable to address all cores in a single numa node and instead split them into 2 numa nodes?

Numa to the rescue! We love you NUMA! While 1 numa node’s CPUs were maxed out the other had room to get stuff done.

In the end, we picked our current set of boxes, dual socket 16 core cpus - 2 physical numa nodes, a nice bump in cores, but not so many that the sql license cost made our eyes jump out of our heads.

In the time it took to get the DB hardware ordered/shipped/racked devs had the numa issues sorted and now I rarely even think about it!

almost 2 years ago - CCP_DeNormalized - Direct link

I’m told VXRail was looked into some years ago. At the time the idea of having to buy entire compute+storage nodes to expand didn’t make sense for our needs.

It would be very cool to see TQDB as a virtual sql cluster on one of these though - 4tb of mem and 8 tb of storage - spread over multiple compute nodes.

almost 2 years ago - CCP_Explorer - Direct link

We are investigating what happened there, but currently suspect some sort of an infinite loop or O(n^m) processing where m is 2+. No amount of server hardware will help.
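A quick worked example of why faster hardware does little against O(n^m) work. The operation budgets below are made-up numbers for illustration, and `max_n_for_budget` is a hypothetical helper; the only fact used is that a k-times larger budget raises the reachable n by roughly the m-th root of k.

```python
def max_n_for_budget(ops_budget, m):
    """Largest n with n**m <= ops_budget, found by simple integer search."""
    n = 0
    while (n + 1) ** m <= ops_budget:
        n += 1
    return n


# Illustrative budgets only: quadrupling the budget merely doubles n when m = 2.
quadratic_before = max_n_for_budget(1_000_000, 2)
quadratic_after = max_n_for_budget(4_000_000, 2)

# With m = 3 it takes eight times the budget to double n.
cubic_before = max_n_for_budget(1_000_000, 3)
cubic_after = max_n_for_budget(8_000_000, 3)
```

So even a machine that is several times faster only moves the breaking point a little, and an infinite loop is not helped at all.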

almost 2 years ago - CCP_Explorer - Direct link

The answer here is yes and no.

On a universe level, there are a number of things to track in parallel and we do. For example, all markets are separate, so is skill training, industry, various matters with corporations and alliances, planetary interaction, project discovery, etc.

But inside a single solarsystem, in a particular scene, everything must be tracked serially, since the simulation has to keep track of the order in which events happened and progress the overall simulation in that scene accordingly. With 5000 undocked players in a system, where each sub-cap launches 5 drones, each carrier/super launches fighters, and a structure or two or three starts launching things like bombs and missiles, all of these things can affect one another, and all these players need to know where everything is at all times and what they are doing.
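The serial part can be sketched as a single ordered event queue per scene. This is a generic illustration of in-order processing, not the actual simulation code; the `Scene` class, its methods, and the event descriptions are hypothetical.

```python
import heapq
from itertools import count


class Scene:
    """Processes events strictly in (time, arrival) order, one at a time."""

    def __init__(self):
        self._queue = []
        self._seq = count()  # tie-breaker so same-time events keep arrival order
        self.log = []

    def schedule(self, at, description):
        """Queue an event for time `at`."""
        heapq.heappush(self._queue, (at, next(self._seq), description))

    def run(self):
        """Drain the queue serially; each event is applied before the next is looked at."""
        while self._queue:
            at, _, description = heapq.heappop(self._queue)
            self.log.append((at, description))
```

Separate scenes (or separate services such as markets) could each own their own queue and run in parallel, but within one queue the order is fixed.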

almost 2 years ago - CCP_Swift - Direct link

Nothing to see here
