about 19 hours ago - Natalia_GGG
Incident Report for Today's Deploy
Today at 9am NZT we took down the realm for the deployment of the new account system. This migration was expected to take around 4 hours.
The first thing that went wrong was that the migration took longer to run than it had on our test hardware. This extended the downtime by an hour beyond what we had budgeted for.
After the realm was brought back up around 2PM NZT, we found that many players were getting disconnected frequently. This was due to crashes in one of the backend master servers, which caused online account session information to be lost.
We spent around 15 minutes investigating the causes of these crashes but were unable to come up with an immediate solution, so we decided to roll back the patch.
Unfortunately, a rollback that would normally be very quick took a very long time in this case because of the extensive database migrations that had occurred during deployment. The databases are very large and restoring the backup took quite some time. The realm was brought back and the game restored at 3PM NZT.
The restore of the website databases took even longer and resulted in extended website downtime as well (the website was not available until 4:30PM NZT).
After investigation we have discovered that the crashes were caused by a very simple flaw. The constant that represents the length of an account name in the account session was accidentally still set to its old value, from before we added the discriminator. If a player logged in with an account name longer than 27 characters, an exception was thrown when trying to copy the account name into the account session.
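As an illustration only (this is not the actual server code, which this report does not show), a stale length constant of this kind can be sketched as follows. Only the 27-character limit comes from the description above; the discriminator length, the names `AccountSession`, `set_account_name` and `try_login`, and the sample account name are all hypothetical.

```python
# Hypothetical values: only the 27-character limit comes from the report.
OLD_MAX_ACCOUNT_NAME_LENGTH = 27  # stale value the session code still used
DISCRIMINATOR_LENGTH = 5          # assumption: e.g. "#" plus four digits
NEW_MAX_ACCOUNT_NAME_LENGTH = OLD_MAX_ACCOUNT_NAME_LENGTH + DISCRIMINATOR_LENGTH


class AccountSession:
    """Session record with a fixed-capacity account name field."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.account_name = ""

    def set_account_name(self, name: str) -> None:
        # Bounded copy: refuse names that do not fit instead of truncating.
        if len(name) > self.capacity:
            raise ValueError(
                f"account name is {len(name)} characters, "
                f"capacity is {self.capacity}"
            )
        self.account_name = name


def try_login(session: AccountSession, name: str) -> bool:
    """Return True if the name fits in the session, False if the copy raised."""
    try:
        session.set_account_name(name)
    except ValueError:
        return False
    return True


# A 24-character base name plus an assumed 5-character discriminator.
long_name = "A" * 24 + "#1234"
stale_ok = try_login(AccountSession(OLD_MAX_ACCOUNT_NAME_LENGTH), long_name)
fixed_ok = try_login(AccountSession(NEW_MAX_ACCOUNT_NAME_LENGTH), long_name)
```

Deriving the new limit from the old one plus the discriminator length, rather than keeping two independent literals, is one way to make it harder for the two values to drift apart again.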
On its own this should not have crashed the master, but the exception was thrown in an area of the code base that was designed to be exception free, which resulted in the entire process crashing.
The bug itself is already fixed, and we have also changed the code to be more resistant to exceptions occurring.
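One general way to make a server more resistant to exceptions is to contain each failure at the request boundary, so a single bad login is rejected instead of taking the whole process down. The sketch below is an assumption about the general technique, not a description of the change that was actually made; `handle_login`, `run_unguarded` and `run_guarded` are hypothetical names.

```python
from typing import Iterable, List, Tuple


def handle_login(name: str, capacity: int) -> str:
    """Hypothetical handler that raises when the name does not fit."""
    if len(name) > capacity:
        raise ValueError("account name does not fit in the session")
    return name


def run_unguarded(names: Iterable[str], capacity: int) -> List[str]:
    """No exception boundary: the first failure propagates to the caller."""
    return [handle_login(name, capacity) for name in names]


def run_guarded(
    names: Iterable[str], capacity: int
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Contain failures per login so the other logins still succeed."""
    accepted: List[str] = []
    rejected: List[Tuple[str, str]] = []
    for name in names:
        try:
            accepted.append(handle_login(name, capacity))
        except Exception as exc:
            # Record the failure and keep serving the remaining logins.
            rejected.append((name, str(exc)))
    return accepted, rejected


logins = ["short", "x" * 30, "also_short"]
accepted, rejected = run_guarded(logins, 27)
```

With the guarded loop, the 30-character name is rejected and recorded while the two short names are still accepted. The unguarded version raises on the same input and returns nothing.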
However, we have decided to delay the redeploy of the patch until Monday NZT. It is clear that we need to do another round of QA on this deployment to make sure that we have found all corner cases before we can be confident in deploying it again.
This is not the level of service you should expect from Grinding Gear Games and we are very sorry for the extended downtime.