Connectivity Post-Mortem:
tl;dr: Our team worked throughout the Easter weekend and around the clock to resolve the server issues players were experiencing. We completely understand how frustrating this experience will have been, especially given the huge number of players eagerly anticipating the launch. We had enough server scaling capacity, but our externally hosted database was hitting issues that only appeared under extreme loads.
We’re committed to full transparency with you today, just as we have been over the past year.
So we won’t give you the expected “server demand was too much for us” line.
We were in fact debugging a complex issue: why certain metric calls were bringing down our externally hosted database. We did not face this issue during the demo launch earlier this year.
Our database holds everyone’s gear, legendaries, profile and progression.
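Purely as an illustration of what that means in practice, you can picture each player’s record as a single bundle of data along these lines (a hypothetical sketch in Python; the field names are made up and are not our real schema):

```python
from dataclasses import dataclass, field

# Hypothetical shape of a player record; every field name here is illustrative only.
@dataclass
class PlayerRecord:
    player_id: str
    gear: list = field(default_factory=list)          # equipped and stashed items
    legendaries: list = field(default_factory=list)   # legendary items earned
    profile: dict = field(default_factory=dict)       # account and character details
    progression: dict = field(default_factory=dict)   # story and world tier progress

# Example record for a made-up player id.
record = PlayerRecord(player_id="player-12345", legendaries=["Illustrative Legendary"])
```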
Tech-heavy insight:
We determined that many server calls were not being served from RAM but from an alternative data management method ("swap disk"), which is far too slow for this volume of data. Once this data queued back too far, the service failed. Understanding why the data was not staying in RAM was our key challenge, and we worked with staff across multiple partners to troubleshoot it.
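To make the RAM-versus-swap point a little more concrete, here is a minimal Python sketch (assuming the psutil package; this is illustrative only and not our actual tooling) of the kind of check that shows when a host has started spilling data out to swap:

```python
import psutil

def memory_report() -> None:
    """Print how much data sits in fast RAM versus slow swap on this host."""
    ram = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM used:  {ram.used / 2**30:.1f} GiB of {ram.total / 2**30:.1f} GiB ({ram.percent}%)")
    print(f"Swap used: {swap.used / 2**30:.1f} GiB of {swap.total / 2**30:.1f} GiB ({swap.percent}%)")
    # Any significant swap usage on a database host is a red flag:
    # swap lives on disk and is orders of magnitude slower than RAM,
    # so requests pile up behind it under heavy load.
    if swap.percent > 5:
        print("Warning: the working set appears to be spilling into swap.")

if __name__ == "__main__":
    memory_report()
```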
We spent over two days and nights applying numerous changes and improvements: we both doubled the number of database servers and vertically scaled each of them by approximately 50% (“scale-up and scale-out”). We re-balanced user profiles and inventories onto the new servers. Following the scale-up and scale-out, we also increased disk IOPS on all servers by approximately 40%. We also increased the headroom on the database, multiplied the number of shards (not the Anomalous kind) and did everything we could to force the data into RAM.
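For anyone wondering what “multiplying the number of shards” looks like in practice, here is a simplified, purely illustrative Python sketch (not our actual database code) of how a player’s data gets routed to a shard, and why adding shards spreads the load across more servers at the cost of re-balancing existing profiles:

```python
import hashlib

def shard_for(player_id: str, shard_count: int) -> int:
    """Map a player id to one of `shard_count` database shards."""
    # A stable hash of the player id, reduced modulo the shard count,
    # decides which server holds that player's gear and progression.
    digest = hashlib.sha256(player_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

# Doubling the shard count roughly halves the load on each shard,
# but profiles that now map to a different shard must be moved.
print(shard_for("player-12345", shard_count=4))  # old layout
print(shard_for("player-12345", shard_count=8))  # new layout
```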
Each of these steps helped us improve the resilience of the database when under extreme loads, but none of them were the "fix" we were looking for.
At the time of writing we are still waiting for a final Root Cause Analysis (RCA) from our partners, but ultimately what really helped resolve the overloading was how our database cache cleaning was configured: it was being run every 60 seconds. At that frequency the cache cleaning operation demanded too many resources, which in turn led to the RAM issues described above and a snowball effect that resulted in the connectivity issues you saw.
We reconfigured the database cache cleanup operations to run more often while using fewer resources per pass, which had the desired result: everything is now generally running at a very comfortable capacity.
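As a toy illustration of why the cleanup schedule mattered (this is a Python sketch of the general idea, not our database’s actual configuration or code, and the numbers are made up): one big cleanup pass every 60 seconds creates a large resource spike, while smaller, more frequent passes spread the same work out evenly:

```python
import time
from collections import deque

cache: deque = deque()  # stand-in for the database's cached entries

def cleanup(max_evictions: int) -> int:
    """Evict up to `max_evictions` stale entries; return how many were removed."""
    removed = 0
    while cache and removed < max_evictions:
        cache.popleft()
        removed += 1
    return removed

def run_cleanup_loop(interval_s: float, max_evictions: int) -> None:
    """Periodically evict stale cache entries.

    A long interval with an effectively unbounded batch (the old setup) means
    each pass has a huge backlog to clear and hogs CPU and RAM while it runs.
    A short interval with a small batch (the new setup) keeps each pass cheap.
    """
    while True:
        cleanup(max_evictions)
        time.sleep(interval_s)

# Old behaviour, roughly: one heavyweight pass per minute.
# run_cleanup_loop(interval_s=60.0, max_evictions=10**9)
# New behaviour, roughly: frequent lightweight passes.
# run_cleanup_loop(interval_s=5.0, max_evictions=1_000)
```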
All of this has enabled the servers to recover and to sustain significantly higher concurrent user loads.
(JUMP BACK TO INDEX)