Hi all, I want to give you a behind-the-scenes look at the unusual number of issues after yesterday's game update. I talked with one of our Senior Software Engineers to help me outline what happened, so here you go!
In Monday's studio update we talked about "taking on projects that will keep the game healthy in the long run, like refactoring large swaths of old code". Development on a game that's been live for a decade can be a minefield: fixes like that risk breaking content that was inadvertently depending on a bug in order to "work." That's what happened here when we fixed one of those pieces of old code.
We did catch the problem and fix it, but Murphy's Law was now paying attention. You see, shipping a huge live game like GW2 requires a complex system for managing the flow of code and content going out the door. Behind the scenes there are multiple "copies" of a lot of stuff we work with: copies that are getting tinkered with, copies that are getting tested, and copies that are right on the precipice of being released. Which means that even when a fix is done and dusted it still needs to move forward on various conveyor belts, so to speak, to make sure it gets to you. In this case, the fix missed one of those conveyor belts along the way.
What did this mean for you? In this case, the piece of code that broke was related to content that depends on real time, what we call "time spans". When it went live, it quickly affected a lot of content that appeared to be unrelated at first glance--metas and world bosses, guild missions, raid buffs, Karmic Converters and portal devices, and more--but they all relied on that bit of code in one way or another.
So why didn't we just ship that missing fix right away? It was right there! Well, tracking down the problem and getting a fix together was a bit of a scavenger hunt. When you all flag an issue in the live game, the first thing we do is find or recreate the problem based on the information we have so we can look it over firsthand. Yesterday's issues began with a player reporting that the Verdant Brink night boss meta was not triggering, so we started looking to see what was wrong with Verdant Brink. Two more reports about metas elsewhere rolled in shortly afterward, so we switched our search to try to find what was wrong with meta events.
Meanwhile, other players started reporting that certain gizmos were broken, so we began looking into this apparently unrelated issue as well. Then we learned that Pact Supply Agents had all vanished. At this point we started looking for a connection, because it was clear that something bigger was going on here. We knew there was probably a connection, but there were a number of possibilities that we were investigating--including time spans. Having a list of possible connections allowed us to start tagging in people from specific teams to help--including a software engineer who was able to trace the various issues back to that piece of code, confirm the connection, find the lost fix, stick it back on the conveyor belt, and hand it off to our Release Management team to get a hotfix ready.
So how did we not catch such major issues before we shipped the update? When there is a bug and something breaks, it doesn't proactively send up a "Help, I'm broken!" flare--we find it when we test that specific content or item. We of course can't test the entire game before every update since there are millions of things to test (every item, skill, enemy, event, NPC, and so on), so we have to use our time wisely while casting the widest possible net for issues. Our QA teams test a broad range of categories, including the new things in the build, things that have broken in the past, and things related to recent major changes to fundamental systems to make sure they are still working properly. In this case, none of the affected systems had broken in this way before, nor were they in any of the "to check" categories. That said, when we ship major bugs like this, it's an opportunity to look at our processes for holes and find ways to improve!
This is obviously a simplified explanation, but I hope this peek at the workflow and the sometimes-very-surprising vagaries of game development provides some insight. In the meantime, our teams are continuing to work on issues as they arise, so thank you again for letting us know when something isn't working properly!