almost 4 years ago - /u/HiRezIsiah - Direct link

Hey everyone,

Season 8 continues to be an incredibly exciting time for us, and we’re still in the midst of diagnosing, investigating, and implementing fixes for the various server and stability issues that have affected players over the past few weeks. We have made some improvements and taken strides toward long-term solutions, and we wanted to take some time to update you all on what we’ve done while providing more insight into the more recent issues.

If you haven't had a chance to read our initial Developer Update on Season 8 Issues and Actions from Ajax, you can check it out here. It provides some great context and a nice vocabulary section that will help with understanding this new post.

Order of Events

8.2 Goddess of the Salt Sea preparation and launch (Week 1)

  • After the instability of 8.1 and ongoing concerns around Match Manager crashes, we organized a group of people dedicated specifically to rewriting the code of our Match Manager.
  • As the team made progress, we planned for most of the changes to go live with the 8.3 update, but a small handful were deemed safe enough to be included in 8.2. Those few changes performed well enough at low player numbers during our PTS testing phase, but ran into issues at larger scale.
  • As soon as we received reports of issues on live after the 8.2 launch, we rolled back these new changes and resumed use of the 8.1 version of our core services, where we stayed for the rest of the week. (These core services are the backend systems that handle the player experience across many areas: account management, inventory, matchmaking, chat, etc.)

8.2 Goddess of the Salt Sea Week 2

  • We rolled out new updates to our core services that had more changes and improvements. After a little bit of expected downtime, we found that things were going pretty well. We also used this week to experiment a bit more with backend configuration changes that should cope better with the backlog problems that arise at higher CCU (concurrent user) values. Those changes were promising and seemed to perform well despite the issues that came up at the start of week 3 (see next section).
  • Also during week 2, we had an issue at one of our data centers that was outside of our control, requiring us to migrate all of the instance servers hosted there (where matches are played) to another physical location. A problem during the migration process caused a lot of instances to be dropped (resulting in players being kicked from games and match lobbies), but it was resolved relatively quickly.

Monday before the 8.2 Goddess of the Salt Sea Bonus Update (Week 3)

  • Monday evening before the Bonus Update launch, we encountered another data center issue, again outside of our control. This issue affected players and instances in a similar way to the issue in week 2 and we went through the process of migrating servers again.

8.2 Goddess of the Salt Sea Bonus Update Launch

I want to preface this next section with a little more info on the process for ‘building’ an update for release, as that context will help when explaining the next set of issues and actions. Here’s a quick rundown on the types of game builds SMITE uses and how they work:

Our build process generates a few different outputs that are bundled and shipped as the final game: the executable binaries (the runnable programs and libraries), the “cooked” data (game assets bundled so they load more quickly while the game is running), and the configuration data, which can include things like balance changes (a damage number changing from 100 to 75, for example). On a game as big and old as SMITE, there is a lot to do to generate these outputs and a build can take hours to complete, so we have a few different build types that let us specify which elements we want to regenerate in order to save time.

Full builds - These builds compile the code, generate the configuration data, and "cook" the game assets and they take a long time to run.

Native builds - These builds compile the code and re-use assets and configuration from previously generated builds.

Assembly builds - These builds just re-generate the configuration data while re-using the executable binaries and cooked assets from an earlier build.
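
To make the three build types a bit more concrete, here is a minimal Python sketch of which outputs each one regenerates and which it carries over from an earlier build. This is only an illustration of the descriptions above, not SMITE's actual build tooling; the names `Output`, `BuildType`, `regenerates`, and `reused` are made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Output(Enum):
    # The three outputs described in the post (enum names are hypothetical).
    BINARIES = "executable binaries"
    COOKED_ASSETS = "cooked assets"
    CONFIG = "configuration data"


@dataclass(frozen=True)
class BuildType:
    name: str
    regenerates: frozenset  # outputs this build type produces fresh

    def reused(self) -> frozenset:
        # Anything not regenerated is taken from a previously generated build.
        return frozenset(Output) - self.regenerates


FULL = BuildType("full", frozenset(Output))
NATIVE = BuildType("native", frozenset({Output.BINARIES}))
ASSEMBLY = BuildType("assembly", frozenset({Output.CONFIG}))

for build in (FULL, NATIVE, ASSEMBLY):
    fresh = sorted(o.value for o in build.regenerates)
    carried_over = sorted(o.value for o in build.reused())
    print(f"{build.name}: regenerates {fresh}, reuses {carried_over}")
```

The point to notice is that an assembly build reuses the executable binaries, so anything wrong with the binaries it picks up comes along for the ride.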

We typically use full builds that are run overnight to balance safety and time constraints. Now that you have some understanding of build types, here’s a breakdown of the 8.2 bonus update issues and actions.

  • We released the bonus update to find that we reintroduced a crash that had been fixed in an earlier build.
  • For this release we did run a full build, but when it came time to deploy it we found that the output for the Linux (server) version of the build had not been produced. To save time we ran an assembly build, which reused executable binaries from an even older build that still had the crash in it.
  • We did not find that crash when reviewing the mid balance changes during testing, but it was there when we went live. When we realized what happened, we ran a full build and rolled out a hotfix as fast as we could.
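
The failure described above (a missing Linux output, followed by a build that reused older binaries) is the kind of thing a simple pre-deploy check can flag. The Python sketch below is a hypothetical illustration under stated assumptions, not the team's actual process: `Artifact`, `deploy_problems`, the platform names, and the revision numbers are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    platform: str         # hypothetical platform label, e.g. "windows" or "linux"
    binary_revision: int  # hypothetical code revision the binaries were compiled from


def deploy_problems(artifacts, required_platforms, expected_revision):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    by_platform = {a.platform: a for a in artifacts}
    for platform in sorted(required_platforms):
        artifact = by_platform.get(platform)
        if artifact is None:
            # The build produced no output at all for this platform.
            problems.append(f"{platform}: build output missing")
        elif artifact.binary_revision != expected_revision:
            # Binaries were reused from a different (for example, older) build.
            problems.append(
                f"{platform}: binaries from revision {artifact.binary_revision}, "
                f"expected {expected_revision}"
            )
    return problems


required = {"windows", "linux"}

# Case 1: the full build did not produce the Linux (server) output.
print(deploy_problems([Artifact("windows", 120)], required, 120))

# Case 2: an assembly build reused Linux binaries from an older revision.
print(deploy_problems([Artifact("windows", 120), Artifact("linux", 95)], required, 120))
```

A check like this only compares labels, so it does not replace testing the server build itself, but it would surface a mismatch between the binaries being shipped and the ones that were expected.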

Going forward

  • When it comes to longer-term improvements, we’re changing our internal processes so that fixes and updates don't have to be made on our biggest release days (when we push out a new update for players to download). Historically we've released client, game server, and backend service updates all on the same day. We will now be looking to regularly make backend service updates at isolated times and during non-peak hours.
  • Our Ops, Tech, and Community teams will be working together on clearer messaging plans for when we push updates out to our core services.
  • We're also working on building the ability to use more cloud servers to mitigate downtime in scenarios when our data centers are having issues that are out of our control. We’ve actually used cloud servers for some regions in the past, but players on those servers experienced a lot of hitching/rubber banding in matches. We're further investigating those issues so we can feel confident in adding that option back into our toolbox. Migrating bare metal servers or spinning up new ones in new data centers takes a lot more time than spinning up cloud servers across multiple sites. When/if we implement these cloud servers, they would be used as an emergency option and not as the primary hosts for matches. Our Ops team is also exploring options with other server providers.
  • Lastly, in an effort to streamline reporting for server issues, we've created a new public form for players to use whenever they encounter issues. Going forward, if you experience any server-related issues while playing, please visit smitegame.com/reportserverissues and fill out the form with as much information as possible. Your responses will go straight to our Ops team for investigation. We hope that with your reports we’ll be able to identify downward trends in performance more quickly, allowing us to diagnose and resolve issues faster.

With this plan, we aim to get SMITE’s server stability to the best place it's ever been, and the team is dedicated to bringing you the best SMITE experience possible. To be 100% transparent, we still expect some level of instability around these changes as we work towards a permanent solution, but we're trying to be smarter about how and when we roll out updates. We’ll still be testing code updates and fixes extensively before they go live, but there’s nothing quite like a live environment. There may be some growing pains but, with each issue, we get more information that brings us closer to a resolution. If and when issues do arise, we’ll investigate and work to resolve them as quickly as possible.

Thanks so much for sticking with us as we continue to improve. We’ll do our best to update you all as we make progress.
