almost 4 years ago - /u/HiRezAjax

It's been an exciting, yet incredibly rough and disappointing week for us. The 8.1 launch has generated a huge amount of hype for SMITE - leading us to break our top record multiple times for Steam players (even beating our Avatar and Cthulhu launches). However, we also had widespread technical issues that prevented a lot of people from playing the game.

We, as a dev team, are all feeling terrible about this. Internally, no one is ignoring it, no one is downplaying it. We have had our top engineers and leads (including Stew, the CEO) in Discord nonstop throughout the week monitoring the issues and making adjustments to our systems. We took immediate and continuous action.

There are a lot of complex technical issues, and we are going to try to explain them as best we can, as non-technical people writing for a non-technical audience.

Key Points (the TLDR)

  • Our issues are not resulting from server capacity. No amount of buying more servers would have prevented our issues this week. Any time we have been able to fix issues by adding capacity, we swiftly have. Day one of 8.1 (Tuesday) was when we saw our capacity issues, which were quickly resolved. We had many successes throughout the week in this regard. Each morning's hotfix improved our scalability more.
  • However, the issue causing the most problems, which became most clearly present on Friday, can be described more like a code bug. Except, instead of it breaking a god’s animation, or an item’s function, it prevents people from accessing certain parts of the game, like queues. We are taking concrete steps to fix it with some promising results, but it is not fully resolved yet.
  • PlayStation has had issues unique to its platform that resulted in the game crashing, entirely unrelated to server state or connectivity. This crash bug was identified as an unintended consequence of a recently added feature that attempted to improve performance. This was hotfixed Wednesday evening, and crashes appear to be way down since then.

Vocabulary

  • When SMITE has “server issues” it's rarely as simple as that. Players use that term to describe general connectivity issues, but we are actually seeing on our end different aspects of the game code failing. Here are some of the unique issues that can occur that players all tend to see as “server issues.”
  • “Player Service” - when this goes down you can't make parties, you get stuck in a party with yourself, or you queue with a party but don't actually get into a game with each other.
  • “Chat Service” - when this goes down you can't use lobby chat or whispers.
  • “Match Manager” - when this goes down you can't queue for games or get into games from lobbies, and you can be disconnected from matches you were already in.
  • “Backlog” - meaning the game code can't keep up with all of the requested commands players are pushing through.
  • “Limited” - We activate this manually to slow down the incoming player requests and help our services catch up, and empty the backlog. When limited mode is activated, you see the “SMITE is in high demand” message on login.
  • “Emergency Restart” - If limited doesn't work to clear the backlog, we put SMITE into an emergency restart which kicks everyone from the game and clears the backlog entirely, then resumes logins in limited mode and ramps up over time. Generally, all of the actual downtimes (can’t even log in) players have seen this week have been from manually implementing emergency restarts, or from morning hotfixes being launched.
  • “Safe Mode” - this restricts Ranked Queues and prevents gain/loss of MMR from active Ranked matches when enabled. Also, Deserter Penalties aren’t applied during Safe Mode.
  • “CCU” - concurrent users - refers to the total players inside the game at any given moment.
  • “Performance” - this refers to how well the game runs; it can refer to graphical optimizations or online connectivity improvements.
  • Any time frames used here will be in US Eastern time.
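To make the “Backlog”, “Limited”, and “Emergency Restart” terms above more concrete, here is a toy sketch in Python. It is not SMITE's actual code: the class name, the numbers, and the idea of counting work in "ticks" are all assumptions made up for illustration. It only shows the general shape of the idea: if requests arrive faster than a service can handle them, the backlog grows; limiting how many requests are admitted lets the service catch up; a restart empties the backlog and comes back in limited mode.

```python
from collections import deque


class ServiceSim:
    """Toy model (hypothetical): a service handles a fixed number of requests per tick."""

    def __init__(self, capacity_per_tick, limited_admit_per_tick):
        self.capacity_per_tick = capacity_per_tick            # requests handled each tick
        self.limited_admit_per_tick = limited_admit_per_tick  # requests admitted while limited
        self.backlog = deque()                                # requests waiting to be handled
        self.limited = False

    def tick(self, incoming):
        """Admit incoming requests, handle what capacity allows, return backlog size."""
        # In limited mode only part of the incoming requests are admitted;
        # the rest are turned away (the "high demand" message at login).
        admitted = min(incoming, self.limited_admit_per_tick) if self.limited else incoming
        self.backlog.extend(range(admitted))
        for _ in range(min(self.capacity_per_tick, len(self.backlog))):
            self.backlog.popleft()
        return len(self.backlog)

    def emergency_restart(self):
        """Drop everything that is waiting and resume in limited mode."""
        self.backlog.clear()
        self.limited = True


sim = ServiceSim(capacity_per_tick=100, limited_admit_per_tick=50)
growing = [sim.tick(300), sim.tick(300)]    # arrivals exceed capacity, backlog grows
sim.limited = True
draining = [sim.tick(300), sim.tick(300)]   # admissions are below capacity, backlog shrinks
sim.emergency_restart()
after_restart = len(sim.backlog)
```

In this sketch the backlog only shrinks while fewer requests are admitted than the service can handle per tick, which is the reason limited mode exists at all.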

Order of Events

Tuesday

  • We went live with 8.1 on Tuesday morning. We noticed the PlayStation issue pretty quickly after launch and focused on this hotfix as our top priority.
  • Later in the evening we started seeing backlog issues - this is a capacity issue that we do anticipate on big days. This was even bigger than expected, though.
  • We attempted to recover from this backlog by going into safe mode, but our services started crashing regardless, which forced an emergency restart. After the restart the backlog cleared and we saw no further issues.
  • We identified clear ways to scale things better and planned to implement them early the next morning.

Wednesday

  • We scheduled a brief intended downtime in the morning to ship our hotfix, including a series of gameplay bug fixes and performance improvements.
  • These server improvements could mostly be described as moving specific service code to their own dedicated servers. This is less of a capacity issue and more of an allocation and code issue. This seems to have greatly improved our scalability.
  • We submitted our fixed version to Sony, which they reviewed and approved, and we launched the PS crash fix hotfix later in the evening.
  • We saw a minor backlog, likely corresponding with a huge amount of PS downloads from their update, but we were able to recover from a brief limited mode.
  • Scaling issues are fixed entirely, and likely this tech will heavily benefit future update launch days.
  • PlayStation crashing is also fixed entirely.

Thursday

  • Early Morning - Another short intended downtime with a server hotfix similar to Wednesday.
  • Wednesday’s relocation of services had good results, so we moved more services to their own dedicated servers.
  • Things generally looked good: we had high CCU (very close to Tuesday), no backlogs, and no services crashing.
  • PSN had an outage on their side that did result in a slightly higher than normal amount of PlayStation disconnects; this was not unique to SMITE, but affected all PSN games.
  • Scaling still looks good.

Friday

  • Early morning - Another short intended downtime with a hotfix similar to Thursday.
  • More resources were relocated and given dedicated space to prep for a big weekend.
  • Around 6:30 p.m. Eastern Time - Match Manager starts crashing repeatedly. This is something we have seen before, but very rarely. Many hours/days/weeks have gone into diagnosing and attempting to fix this issue before 8.1.
  • We enter into limited mode and emergency restart as a precaution.
  • Match Manager keeps going down even when there's no backlog or high CCU.
  • We go through the emergency protocol but continue to see issues late into the night. With only Match Manager going down, people can still play games if they get into them, but it becomes a huge pain to queue, and ranked queues stay in safe mode.
  • This now has become our primary issue, and it's generally unrelated to scaling for player increases or server capacity.
  • Scaling is still good.
  • All focus now on finding new solutions for the Match Manager crash.

Saturday

  • Engineers prepare another intended downtime plus hotfix.
  • Implemented another set of changes to address the Match Manager crashing.
  • Midday - We broke another CCU record with no issues. There was no specific action yet on Match Manager; the crash scenario just hadn't been encountered yet.
  • Later that night Match Manager does indeed crash.
  • We go into intended downtime to implement our best option for a short term fix - reverting our most popular queues back to normal queues instead of timed queues.
  • We know players enjoy timed queues, but they cause a huge amount of stress on our Match Manager all at once when popular queues pop. We have decreased the queue times and staggered modes so they don't pop at the same time to mitigate this, but the issue is persisting.
  • We did have more Match Manager crashes after the timed queue change, leading to another emergency restart. Crashes subsided after the restart.
  • Additional logging was put in and action plans prepped for Sunday if we see more issues.
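As a rough illustration of why timed queues stress the Match Manager "all at once", and why staggering them helps, here is a small Python sketch. Everything in it is hypothetical: the function name, the three-minute interval, and the player counts are invented for the example and are not SMITE's real values. It simply adds up how many players hit the matchmaker in the same second when queues pop together versus when they are offset.

```python
from collections import Counter


def peak_load(queues, horizon_s):
    """Return the largest number of players processed in any single second.

    queues: list of (interval_s, offset_s, players) tuples, all hypothetical values.
    A queue "pops" at offset_s, offset_s + interval_s, and so on, up to horizon_s.
    """
    load = Counter()
    for interval_s, offset_s, players in queues:
        for t in range(offset_s, horizon_s, interval_s):
            load[t] += players
    return max(load.values(), default=0)


# Three popular modes all popping on the same three-minute mark.
aligned = [(180, 0, 1000), (180, 0, 800), (180, 0, 600)]
# The same modes, offset by one minute each so they never pop together.
staggered = [(180, 0, 1000), (180, 60, 800), (180, 120, 600)]

aligned_peak = peak_load(aligned, horizon_s=3600)
staggered_peak = peak_load(staggered, horizon_s=3600)
```

Staggering lowers the worst single spike but does not remove spikes entirely, which is consistent with the note above that the issue persisted after this mitigation. Non-timed queues spread the same work out continuously instead.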

Sunday

  • Engineering team is still monitoring closely and collecting lots of data to aid in future fixes.
  • Preparing to post this report.

Going Forward

Scalability (Servers and high player counts)

We have been able to make huge strides in improving our ability to scale on each of these major launches.

The start of quarantines and the Avatar launch each showed us different issues and allowed our teams to keep improving SMITE's ability to handle larger numbers of players. We can do simulated high load testing, but nothing quite compares to the real thing. Having actual high player environments lets us get the best possible data to continue to fix and expand those environments. We have already seen more improvements from what we learned around the 8.1 launch.

Crash Issues (Match Manager going down)

Fixing these issues involves a lot of deep and specific code changes to SMITE, along with close monitoring and testing over time.

On the service crash issues, we have been hard at work on this already. The Match Manager crash is not new to 8.1; it has happened before. It can even happen on quiet nights. We have been attempting to reproduce the issue and apply fixes for a while now, but we have not succeeded yet. It has clearly become more frequent recently, whereas previously it was rather rare.

Bug fixing can be a beast of a task in software the size of SMITE. We have fixed many before, and we are making good progress on tracking this one down too. With it occurring more we can follow it more closely and learn information we previously couldn’t. We also have had many more iterations testing fixes than we did previously.

On General Health and Management of the Game

SMITE is growing, the player base is growing, and so is the dev team. We have grown our engineering team more than ever in 2020 and already have more exciting new hires planned to continue to address performance.

We have spent more and more of our resources over the years on the game's engineering and performance, and we plan to continue to do that.
