about 3 years ago - Destiny Dev Team - Direct link
Destiny 2 has a recurring PvP event called Iron Banner (IB). During IB, players battle against one another for victory, while simultaneously attempting to complete bounties that offer valuable rewards. An example of a IB bounty might be “achieve kills with specific weapon types within IB matches”. Bounties are a major source of rewards in Destiny 2, so it's important that they work reliably.
Back in December 2019, players reported that the game occasionally failed to credit their kills when trying to complete these bounties. The problem would hit players seemingly at random. Most of the time, the bounties worked fine, but occasionally players would finish a match without any bounty progression. Here's a quick rundown of notable symptoms gathered from our support forum and various social media postings:
  • Players either earned credit for all of their kills in a match, or they didn't earn any credit. Whatever caused this bug affected the player for the entire duration of the match.
  • Players didn't earn valor at the end of the match.
  • The problem wasn't limited to Iron Banner. Players could encounter this bug in regular Crucible matches.
  • Based on the number of player complaints, the bug appeared to be exacerbated in Iron Banner and almost non-existent in Competitive playlists.
  • The bug didn't affect all players in the same match. In a 12-player Iron Banner match, one or two players might hit this bug. The rest of the players would progress their bounties without issue.
  • The bug didn't affect all bounties. For example, if a player had an Iron Banner bounty and a Gunsmith bounty, and both bounties asked the player to get shotgun kills, affected players would earn progress on the Gunsmith bounty but not the IB bounty.

When I started digging into this bug, I found the final symptom to be the most interesting. It suggested the bug might lie in the content setup for the IB bounties. Unfortunately, this theory failed to bear fruit. There was nothing special or unique about the IB bounty content compared to other bounties.  
Around this time, a coworker said they encountered the bug, which was great because I could track down the detailed incident log for their match. While the specifics of incidents are beyond the scope of this post, I've included an example incident at the end of this post to demonstrate how much information is included in a single incident. The incident log confirmed that the game had recorded their kills, but for some mysterious reason the kills didn't progress the bounty.

Further investigation proved difficult because I was unable to reproduce the bug on a local onebox (onebox is the name we use for running all of Destiny’s services locally on our personal workstations). For the time being, I closed out the bug as not reproducible, a disappointing end.

A new clue appears! 

Not too long after closing that bug, a new bug came my way. Multiple players reported that chests weren't dropping loot in the raid. The raid bug smelled eerily similar to the IB bug.
  • Players failed to receive rewards tied to a specific activity type.
  • The bug persisted for the entirety of the activity.
  • Only a subset of players in the activity were impacted.

Interestingly, affected players continued to earn world drops (engrams from enemy kills) even though the raid chests didn't spawn loot. It was almost like the game didn't know the type of activity (raid). This was a compelling theory because it could also explain the IB bug. Each activity in Destiny is associated with various activity intrinsic flags. For example, there are flags for strikes, pvp, and raids. Within those higher-level categories, there are more specific flags, like Nightfall or Iron Banner. When a player starts a new activity, the activity intrinsic flags are marked on the player's account. Our rewards system uses those flags to determine the eligible rewards. Some rewards are not tied to activity intrinsic flags, such as world drops or Gunsmith bounties. In those cases, the game is only looking for if/how you killed an enemy, not where you killed an enemy. But if a player could get into an IB match without the IB intrinsic flag set on their account, none of their kills would count towards their IB bounties because those bounties require the IB flag. While this was an interesting theory, was it plausible? At the time, I had no idea how this could happen. Clearly the player loaded into the correct activity. How could the player get into the activity without the flag getting set?

Can I play?

Based on the forum posts, I tracked down the incident log for one of the raid instances that didn't drop loot. Two major anomalies jumped out at me. First, the ActivityHost didn’t create an ActivityJoin incident when the affected player joined. Second, the ClientHeartbeats for the affected player reported an ActivityPowerLevel of 0. The other five players in the raid reported an ActivityPowerLevel of 94. The current ActivityPowerLevel is recorded on the account at the same time as the activity intrinsic flag. All of this suggests a breakdown in communication between the ActivityHost and WorldServer.
Let's take a short detour to learn about these two services and how they communicate with one another. This is a simplified diagram showing how these services connect to one another and the game client. There are more than 20 different services in the full Destiny 2 ecosystem, and at any given time there are thousands of instances of these services.


The WorldServer (WS) is responsible for tracking the investment state of the player's account. Investment includes stuff like character sheets, gear, and progression. It's also where we write the activity intrinsic flags.
The ActivityHost (AH) manages the state of the activity and synchronizes that state between everyone playing together in the same instance. The AH is also tasked with verifying if a player is allowed to play an activity, via a process called peer validation. A player might be blocked from playing an activity if their power level is too low or if they haven't progressed far enough in a questline. As one of the final steps of starting an activity, the AH checks these permissions for each player by sending queries to the WS. Separately, it’s also the AH’s responsibility to record the incident log that contains all the incidents generated during an activity.
All players loaded into the same activity are connected to the same ActivityHost, but their accounts may be authoritative on different WS. As a result, the AH maintains an individual WS connection per player. The services talk over a proprietary communication layer called Bungie Access Protocol (BAP). Since there is a separate BAP channel between the AH and WS for each player, it's possible to encounter a communication error that only affects one player in the activity. This could fit together with how the bug doesn't hit every player in the activity.
During peer validation, the AH sends a query to the WS over BAP. The query contains the activity ID, and the WS uses the activity ID to lookup the requirements for the activity and notifies the AH if the player is allowed to join. Assuming the AH receives a positive response, it waits for the player to finish joining and then sends a subsequent StartActivity message to the WS. When the WS receives the StartActivity message, it records the activity intrinsic flag and ActivityPowerLevel to the character sheet. If instead the player doesn't have the necessary permissions, the AH boots the player to orbit and does not send the StartActivity message.
Based on the bug’s symptoms, I suspected a communication error might be occurring during peer validation. Specifically, that the StartActivity message wasn't reaching the WS.

I'm not stopping you...

I couldn't reproduce the bug with my onebox, so I began to look through code to see if I could spot the bug, starting with the peer validator. I learned that peer validation happens asynchronously and in parallel with the normal flow of joining an activity. In other words, a player is free to join any activity, but is kicked out whenever the AH receives a negative response from the WS. Suspiciously, I couldn't find a timeout mechanism in peer validator. Perhaps peer validator was blocking indefinitely while waiting for a response from the WS? I checked logs from a random retail AH, and sure enough I found signs that peer validation was waiting indefinitely after sending the query to the WS. Players were frequently loading into activities without completing the full peer validation process! Now I just needed to figure out why we were dropping the query. Back in the code, I noticed these two constants.
k_investment_bap_message_queue_element_count= 2048; 
k_server_message_default_callback_handler_count= 16; 
BAP reserves callback handlers for BAP messages that expect a response. Based on these constants, BAP can queue up to 2048 messages, but it is limited to 16 responses. While the discrepancy between those two numbers throws up some red flags, we don’t need a matching number of callback handlers. The bulk of our messaging between the AH and WS is incident data, which is one-direction and doesn’t require a response. For example, when you get a kill, the AH sends that incident to the WS so the WS can update your bounty progression. This is a one-way communication that doesn't require a response and thus shouldn’t need a callback handler. This is demonstrated by the call to the send-incident function as shown below:
activity_host_send_server_message_internal( 
       account_soid->m_guid, 
       _investment_server_message_network_identifier_incident_activity_host_incident, 
       &request_arguments, 
       h_weak_investment_reference::from(k_invalid_investment_reference_handle), // invalid handle == no response expected 
       &base_incident_name); 

The culprit appears!

Diving deeper into activity_host_send_server_message_internal(), the root cause finally manifested. Callback handlers were being reserved from a circular FIFO queue for every outgoing message even if that message didn’t require a response! To make matters worse, we never checked to see if the queue of callback handlers was full!!
static_class_function_definition 
void c_investment_server_message_interface::activity_host_send_server_message_internal( 
       uint64 world_server_routing_identifier, 
       c_investment_server_message_network_id network_id, 
       const s_server_request_arguments *request_arguments, 
       h_weak_investment_bap_message_queue_callback weak_investment_bap_message_queue_callback, 
       const c_string_hash *optional_additional_context) 

{ 
       // ... 
       h_weak_investment_bap_message_queue_callback server_message_manager_callback= 
                     c_investment_server_message_manager::get()->new_response_handler( 
                           server_message_index, 
                           weak_investment_bap_message_queue_callback, 
                           optional_additional_context); 

       investment_bap_connection_enqueue_message_with_callback( 
              _investment_bap_connection_activity_host_0_outbound_to_activity_host_proxy, 
              _investment_bap_message_policy_service_to_world_server_request, 
              &header_parameters, 
              &world_server_request, 
              sizeof(world_server_request), 
              server_message_manager_callback); 
} 

h_weak_investment_reference c_investment_server_message_manager::new_response_handler( 
       t_investment_server_message_table_index server_message_index, 
       h_weak_investment_reference callback_reference, 
       const c_string_hash *optional_additional_context) 
{ 

       c_investment_server_message_response_handler *server_message_response_handler= 
              m_response_handlers.get_element(m_fifo_index); 

       m_fifo_index= (m_fifo_index + 1) % intsize(m_response_handlers.get_element_count()).to_int16_verify(); 

       server_message_response_handler->prepare_for_send( 
              server_message_index, 
              callback_reference, 
              optional_additional_context); 

       return h_weak_investment_reference::from(server_message_response_handler->get_reference()); 
} 
When the AH sends an incident to the WS, it often sends the incident to every connected WS (a message per player). This allows you to earn progress for your teammates' actions. In activities with lots of players like Iron Banner, a single incident can easily chew through all of the available callback handlers.
Based on these discoveries, I came up with a new theory: players hit this bug when they joined an activity that was already in progress and there was a spike in incidents on the same frame that the AH tried to send a peer validation query to the WS. A callback handler was reserved for the peer validation query, but that handler was then overwritten by handlers allocated unnecessarily for incident messages. Once the handler was overwritten, it was impossible for the AH to process the validation query response from the WS. Peer validator didn’t account for the lack of response, which left it blocked between sending the validation query and sending the StartActivity message. Without the StartActivity message, the WS never recorded the necessary flags to grant rewards for the current activity. It seemed like a viable theory!
Using some debug commands to flood a dev client with incidents, I was finally able to reproduce the bug on my onebox. Using an attached debugger and logs, I confirmed that my theory was correct - overwhelming BAP's callback handler budget with a high volume of outgoing messages caused the peer validator state machine to lock up.

Resolution

A number of fixes came out of this one bug:
  1. Stop creating callback handlers for requests that don't expect a response.
  2. When we attempt to reuse a callback handler, make sure it isn't actively waiting on a response.
  3. Add additional error handling around callback handler allocation/disposal so we can detect if all available callback handlers are is use.
  4. Don't send investment BAP messages that expect a response if we fail to reserve a callback handler for that message. Higher level consumers of BAP are already built to handle send errors in the BAP layer, so it was just a matter of bubbling up this error case to the existing retry logic.
  5. Allow the investment BAP message queue to notify the callback when it fails to send a message so the callback can release its resources.
  6. Increase the number of callback handlers; 16 was arbitrary and possibly too few.
  7. Add additional logging to peer validation so it is easier to see when we transition between states.
  8. Add timeout to peer validation so players will be booted if there is no response from WS.

Beyond all these fixes, there was one more mystery – why did this bug suddenly start affecting players, when it had been lurking seemingly-harmlessly in the code since the days of Destiny 1? I investigated and found a few interesting things. For starters, IB is one of the few activities to have activity-specific rewards (which are the only kind that can be lost when this bug happens). In addition, IB sees a good amount of player churn mid-match, and more players joining mid-match means more running the peer validator during gameplay, which means more chances to hit this bug.
I found one last fascinating contributor - bounty content evolution. The AH maintains lists of incidents that should or should not be sent to the WS based on the kind of bounties that exist in the game (e.g. if there’s no bounty for jumping in the game, we don’t need to send jump incidents to the WS). Over the years, as bounty variety increased, more and more incidents were added to the list of WS-required incidents, steadily increasing the probability of overflowing the callback handler queue.
Going one step further, it turned out that one specific type of bounty had an outsized impact here. When player reports of these bugs were at their highest, we had active bounties that tracked orbs of power, which are spawned by player abilities, and can be generated pretty rapidly. For example, when a player activates Well of Radiance, they create 3 orbs of power. The AH then needs to send a message for each orb to each player's WS. In a full IB match where all players happen to be on different WS, that’s 3 * 12 = 36 incident messages generated in a single frame from a single ability activation! Remember, our callback handler queue only had 16 elements – so that single super activation would swamp the entire queue twice over, overwriting any important callback handlers that were already present.
Like all large mature software projects, Destiny 2 is a complicated system, with many interconnected components. Low level bugs can manifest as mysterious player facing issues. In this case, an error in a low-level service-to-service communication layer prevented progression on a small subset of bounties when a player joined an activity already in progress and arrived on an unlucky frame. When I started investigating this bug, I only had peripheral knowledge of some of the systems involved. I’m glad I had this opportunity to learn more about the inner workings of our amazing game.
-Brad Fawcett
P.S. As promised earlier, here’s an example of a decoded incident. This incident shows a titan getting a kill with a void shotgun. Hundreds of millions of incidents are generated every hour – roughly one per second per player.
CurrentActivityTimeMs: 113400  
==== Record      361 ====  
Id: {  
  EventTime:   1577735462000  
}  
ContentVersionId: 0  
IncidentData: {  
  Event: {  
    EventTime:     0  
  }  
  Id: {  
    IncidentBucket:     1259722579  
    IncidentId:     0  
    IncidentSpecializations: [  
      844012064 (any_kill),  
      2030840469 (kill),  
      2295196519 (player_died),  
      3406343853 (kill.pvp),  
      3737784129 (kill.weapon(any).distance(3)),  
      3805485874 (kill.pvp.audience(all_players)),  
      3502204954 (kill.pvp.class(titan)),  
      1897670327 (kill.pvp.delta(positive)),  
      1775479489 (kill.pvp.distance(close)),  
      3495643034 (kill.pvp.label(precision)),  
      1930520027 (kill.pvp.name(titan)),  
      1973036714 (kill.pvp.type(void)),  
      2986047998 (kill.pvp.weapon(any).exclude(sidearm)),  
      4093674648 (kill.pvp.weapon(any)),  
      1176454688 (kill.pvp.weapon(shotgun)),  
      208873480 (kill.pvp.weapon(shotgun).distance(close)),  
      159107573 (kill.pvp.weapon(special)),  
      3517133451 (kill.pvp.audience(killing_faction)),  
      3930641628 (kill.pvp.distance(close).weapon(any)),  
      678206791 (kill.pvp.type(void).weapon(any)),  
      2193724665 (kill.pvp.type(void).weapon(any).audience(killing_faction)),  
      1490309038 (kill.pvp.weapon(any).exclude(ammo_primary)),  
      2586008362 (kill.pvp.weapon(shotgun).audience(killing_faction)),  
      1441939486 (kill.pvp.type(void).weapon(special)),  
      369377155 (kill.pvp.weapon(special).audience(killing_faction)),  
      3027446115 (player_died_m)]  
  }  
  ContentVersionId:   0  
  Origin:   Client (1)  
  TypeOfIncident:   Kill (1)  
  IncidentHash:   0 ([missing hash 0])  
  Kill: {  
    KillingObject: {  
      IsAI:       False  
      IsVehicle:       False  
      SquadDisablesInvestmentReward:       False  
      CombatantHash:       2166136261 ([missing hash -2128831035])  
      PlayerId:       2305843009260703957  
      FactionIndex:       17  
      FireteamId:       1073880149893797  
      SquadHash:       2166136261 ([missing hash -2128831035])  
      ObjectHash:       2166136261 ([missing hash -2128831035])  
      CharacterClass:       1  
      CharacterGender:       0  
      Tier:       0  
    }  
    DeadObject: {  
      IsAI:       False  
      IsVehicle:       False  
      SquadDisablesInvestmentReward:       False  
      CombatantHash:       2166136261 ([missing hash -2128831035])  
      PlayerId:       2305843009263126106  
      FactionIndex:       18  
      FireteamId:       1073880149893797  
      SquadHash:       2166136261 ([missing hash -2128831035])  
      ObjectHash:       2166136261 ([missing hash -2128831035])  
      CharacterClass:       1  
      CharacterGender:       0  
      Tier:       0  
    }  
    KillingAbility:     -1  
    DeadObjectHeadPosition: {  
      X:       71.51681  
      Y:       17.25534  
      Z:       58.61158  
    }  
    DeadInvestmentEnemyIndex:     40  
    BubbleHash:     3523659485 (pvp_mojo)  
    MapSliceIndex:     0  
    KillerIsDead:     False  
    KillingDamageType:     3  
    KillingDamageLabels: [  
      512,  
      0,  
      0,  
      4194304,  
      0,  
      0,  
      0,  
      83886088]  
    KillingDamgePayload: {  
      AxionBoltChainId:       0  
      AxionBoltHitCount:       0  
    }  
    VictimLabels: [  
      0,  
      0,  
      0,  
      0,  
      0,  
      0,  
      0,  
      553648136]  
    KillingWeaponSlot:     8  
    KillerLabels: [  
      0,  
      0,  
      0,  
      0,  
      0,  
      0,  
      0,  
      83886088]  
    KillingDamageLabelHashes: [  
      2519638427 (precision),  
      51683977 (shotgun),  
      3655393761 (titan),  
      3111576190 (male),  
      898834093 (exo),  
      2079841625 (special death 1),  
      3167052178 (direct),  
      25675886 (special weapon)]  
    KillerLabelHashes: [  
      3655393761 (titan),  
      3111576190 (male),  
      898834093 (exo)]  
    VictimLabelHashes: [  
      3655393761 (titan),  
      3111576190 (male),  
      3887404748 (human)]  
    KillingStatusEffects: [  
      0]  
    VictimStatusEffects: [  
      0]  
  }  
  OriginContext:   2305843009263126106  
}