about 4 years ago - /u/DrStephenCW - Direct link

By "macro-service" I meant "a really big micro-service." The game servers scale horizontally, just as the micro-services do, in terms of players and concurrency. The one thing they can't do is scale an individual map across multiple cores or servers. Certain maps are slow because they are doing really complicated things, and that's why we announced that it is not just a single thing that causes lag: every map has some unique gameplay element that might be causing lag and that needs to be analyzed and optimized. Sometimes a map ships and is fine in terms of lag, but it uses a skill or effect that later gets a feature added to it, which slows that effect down. That's fine in most cases, but if the older map used that effect or skill a lot, the map slows down. I love our designers. They manage millions of lines of script code across an enormous game. Sometimes there are unexpected interactions in all of those millions of lines of script, and it takes a bit of time to track them down and fix them.

about 4 years ago - /u/DrStephenCW - Direct link

Originally posted by notFREEfood

Looking at those service names, it sounds suspiciously like they're running the game servers on windows...

There's also one line that greatly concerns me:

We got rid of Guild Wars 1 virtualization by assigning processes to IP addresses and letting the OS schedule. We know a lot about how the OS schedules things. VMs just added another layer of complexity to scheduling

What we encountered in early 2012 was that the recommended way to deploy those VMs was to add up all our CPU and memory usage and then, as they say, over-commit, assuming that the underlying hypervisor will always do the right thing. And maybe in 2020 they do. But in early 2012, when we had a mix of game servers that hammer the CPU and other servers that use very little CPU but hold important state, we found that the low-CPU servers would go unscheduled for so long that the game would experience timeouts waiting on them. AWS has two types of server instances: ones that are over-committed (T-series) and most of the others, which are not. This gives us the control to allocate our resources very specifically.

Back in 2012 we had to get rid of the over-commit for the entire system to run reliably, at which point, what was the point of having a hypervisor? I had to rewrite how every single network connection was made, but once I did that we could actually over-commit, because the OS would reliably wake up those processes that didn't use a lot of CPU.
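
For readers wondering what "assigning processes to IP addresses" can look like in practice, here is a minimal sketch. It is not the actual server code, which isn't shown in this thread. It assumes one host with several local IP addresses, where each process binds both its listener and its outgoing connections to its own address instead of living in its own VM. The function names `listen_on` and `connect_from` are made up for illustration.

```python
import socket


def listen_on(ip: str, port: int = 0) -> socket.socket:
    """Open a TCP listener bound to one specific local address.

    Binding to an explicit address (rather than 0.0.0.0) lets several
    processes on the same host use the same port number, each on its own IP.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((ip, port))  # port 0 asks the OS to pick a free port
    srv.listen()
    return srv


def connect_from(local_ip: str, remote: tuple[str, int]) -> socket.socket:
    """Open an outgoing TCP connection that originates from local_ip.

    Without source_address the OS chooses the source address itself, so a
    process that should be identified by its IP has to bind every outgoing
    connection explicitly.
    """
    return socket.create_connection(
        remote, timeout=5, source_address=(local_ip, 0)
    )
```

A process assigned, say, the loopback address `127.0.0.1` for a local test would call `listen_on("127.0.0.1", 6112)` for inbound traffic and `connect_from("127.0.0.1", peer)` for outbound traffic; the port and address here are placeholders. The OS scheduler then treats each of these as an ordinary process, which matches the point above about letting the OS wake up the low-CPU processes.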