What we encountered in early 2012 was that the recommended way to deploy those VMs was to add up all our CPU and memory usage and then, as they say, over-commit, assuming that the underlying hypervisor would always do the right thing. And maybe in 2020 hypervisors do. But in early 2012 we had a mix of game servers that hammered the CPU and other servers that used very little CPU but held important state, and we found that the low-CPU servers could go unscheduled for long enough that the game would experience timeouts waiting on them. AWS has two types of server instances: ones that are over-committed (T-series) and most of the others, which are not. This gives us the control to allocate our resources very specifically.
Back in 2012 we had to get rid of the over-commit for the entire system to run reliably, at which point, what was the point of having a hypervisor? I had to rewrite how every single network connection was made but o...