over 1 year ago - /u/WotC_BenFinkel - Direct link

As many are aware by now, the Shadows over Innistrad: Remastered release on March 21st introduced an unfortunate bug to Magic: The Gathering Arena. This bug affected many cards that confer an ability that mentions the title of the conferring card, such as [[Citizen's Crowbar]] and [[Ninja's Kunai]].

Equipped creature gets +1/+1 and has "{oW}, {oT}, Sacrifice Citizen's Crowbar: Destroy target artifact or enchantment."

Equipped creature has "{1}, {T}, Sacrifice Ninja's Kunai: Ninja's Kunai deals 3 damage to any target."

Instead of sacrificing the conferring object, these abilities sacrificed all permanents controlled by the ability's controller. In Kunai's case, this was also followed up by each of the sacrificed objects dealing 3 damage to the target.

...ouch. What is happening? Why is it happening? How did we miss this happening? Well, do we have a story for you!

The story of how this bug came about requires some background in how MTG Arena is coded. Join me as I break down and explain the most relevant aspects here along with what we learned.

Much of our rules engine code is machine-generated: we use a natural-language processing solution to interpret the English words on the card and create code (this is an article, or a series thereof, by itself!). This has two relevant features: one, every release involves a new generation of all the code that comes from card text - we don't just freeze the original parsed code. Two, due to being machine-generated, many components of the card behavior code are highly generic, as this example will illustrate. The buggy component that arose here is a code snippet (called a Rule in the language we use) responsible for identifying what resources are available to pay a cost, named ProposeEffectCostResource. Every card text that involves non-mana costs has its own version of this Rule:

  • "Discard a card: Draw a card." has a ProposeEffectCostResource Rule that proposes every card in your hand.

  • "As an additional cost to cast this spell, exile a red card from your graveyard." would propose each red card in your graveyard.

  • "Crew 3" proposes each untapped creature you control, weighted by their power.

Let's put a pin in ProposeEffectCostResource for now to discuss self-referential cards. In the Theros Beyond Death expansion, [[Heliod's Punishment]] was introduced, which was MTG Arena's first card that involved a self-reference in a conferred ability ("Remove a task counter from Heliod's Punishment", "destroy Heliod's Punishment").

Enchanted creature can't attack or block. It loses all abilities and has "{oT}: Remove a task counter from Heliod's Punishment. Then if it has no task counters on it, destroy Heliod's Punishment."

This is quite tricky! Most abilities that include a self-reference mean "this card", or perhaps "the card that put this ability on the stack". Heliod's Punishment attached to your [[Runeclaw Bear]] is not talking about Runeclaw Bear in its mentioning of Heliod's Punishment, even though Runeclaw Bear has the ability. So what is it talking about? It's saying "the card that conferred the ability that was activated". That is, we care about the particular ability-on-permanent to know what the self-reference means. We decided that the salient feature of these cards was that they were on Auras and Equipment and made special code to handle self-references in those cases.

Returning to the subject of effect cost resources, Streets of New Capenna introduced [[Falco Spara, Pactweaver]].

You may cast spells from the top of your library by removing a counter from a creature you control in addition to paying their other costs.

What does the ProposeEffectCostResource Rule look like here?

  1. It proposes each type of counter from among permanents you control, and it's invoked whenever you cast a spell using Falco's ability. Lovely. But what if you have multiple copies of Falco out? Legendary sure doesn't mean what it used to. . .

  2. Well, we don't want to make a separate action for each Falco you have out - we just have one action for "you're casting a particular card using a Falco ability" - we don't keep track of which ability-on-a-Falco is responsible, as it's irrelevant (and if it were displayed, perhaps misleading to a player!). But we ran into a problem here...

  3. Even though only one Falco ability is relevant for the action, ALL of them were using their cost payment Rules for that action. Your selection of a counter was filled up redundantly, and when you picked one, each Falco would remove that type of counter from the permanent you chose.

Still with us? Great – also, we're hiring.

So, we made the decision to decouple the ProposeEffectCostResource Rule from abilities-on-cards, and instead have them associated with just the ability text - all the Falcos have the same ability text, so the Rule executes only once. Our work for the conferred-self-reference stuff for Heliod's Punishment stepped in a later part of writing this Rule, so it reintroduced the ability-on-card to the Rule, and everything was awesome.

But then along came Mean Old [[Gutter Grime]] in Shadows over Innistrad: Remastered.

Whenever a nontoken creature you control dies, put a slime counter on Gutter Grime, then create a green Ooze creature token with "This creature's power and toughness are each equal to the number of slime counters on Gutter Grime."

  • Gutter Grime has a conferred ability with a self-reference, just like Heliod's Punishment.
  • Unlike Heliod's Punishment, it's not an Aura or Equipment. Our solution to the conferred self-reference had to be completely rethought.
  • After a lot of sweat and maybe a few tears, we had such a solution: it involved moving that reference to the conferred-ability-on-a-card to earlier in the code generation process. Later, ProposeEffectCostResource deletes that constraint from the Rule it creates.

And thus, the bug: Such cost-resource Rules for conferred self-referencing abilities lose track of the relevant ability. They now proposed resources without that constraint. For "sacrifice", there's still the constraints of "it's on the battlefield" and "you control it", but costs that don't involve a user selection are simply paid by using, well ... all of the qualified resources.

And with Ninja's Kunai, there's actually two different self-references in its conferred ability: * "Sacrifice Ninja's Kunai." This first one is the type we've been talking about, meaning "the card that conferred this ability". * "Ninja's Kunai deals 3 damage to any target" This second self-reference is interpreted to mean "the permanent that was sacrificed." This explains the... explosive nature of Kunai's bug: each of the sacrificed permanents is interpreted to be that latter "Ninja's Kunai", so each of them deals damage. This feature is usually useful (examples: [[Nightmare Shepherd]] triggering on a Mutated creature dying, [[Skyclave Apparition]] dying after its enters-the-battlefield ability triggered twice due to [[Panharmonicon]]).

In Ian's article, we celebrated our over 3000 regression tests, run every night to ward against releasing buggy code. You may wonder how we didn't catch this. Writing a regression test requires a good deal of effort and thought, since they take the form of scripted games of Magic: The Gathering using our rules engine. Some of these tests take over a day to write. Even the simplest ones involve at least 15 minutes of effort to ideate, write, and validate. That may not sound like a lot of time, until you multiply it by the hundreds of cards in each major card set. Therefore, we don't create such tests for every new card on MTG Arena – we focus on the cards that required specific developer effort to work correctly. For everything else, our (human) QA team tests newly added cards at the beginning of a set's implementation, and again before release. It's unreasonable to expect them to also test every other card we've ever shipped with each and every release!

With a project this big and a game this complex, bugs are inevitable. It's still truly disheartening when they're as impactful as this one, especially knowing how hard my team works to prevent them from happening. Now that we've fixed this bug, the fix's verification is part of our regression test suite. We're also already reconsidering our code analysis methodology so we can be more confident we're not wrecking old cards' behaviors by implementing new ones, making this sort of situation rarer in the first place. Last, but certainly not least - I will also continue to be incredibly proud and impressed by the work my team has produced for this game.

#wotc_staff

External link →
over 1 year ago - /u/WotC_BenFinkel - Direct link

As many are aware by now, the Shadows over Innistrad: Remastered release on March 21st introduced an unfortunate bug to Magic: The Gathering Arena. This bug affected many cards that confer an ability that mentions the title of the conferring card, such as [[Citizen's Crowbar]] and [[Ninja's Kunai]].

Equipped creature gets +1/+1 and has "{oW}, {oT}, Sacrifice Citizen's Crowbar: Destroy target artifact or enchantment."

Equipped creature has "{1}, {T}, Sacrifice Ninja's Kunai: Ninja's Kunai deals 3 damage to any target."

Instead of sacrificing the conferring object, these abilities sacrificed all permanents controlled by the ability's controller. In Kunai's case, this was also followed up by each of the sacrificed objects dealing 3 damage to the target.

...ouch. What is happening? Why is it happening? How did we miss this happening? Well, do we have a story for you!

The story of how this bug came about requires some background in how MTG Arena is coded. Join me as I break down and explain the most relevant aspects here along with what we learned.

Much of our rules engine code is machine-generated: we use a natural-language processing solution to interpret the English words on the card and create code (this is an article, or a series thereof, by itself!). This has two relevant features: one, every release involves a new generation of all the code that comes from card text - we don't just freeze the original parsed code. Two, due to being machine-generated, many components of the card behavior code are highly generic, as this example will illustrate. The buggy component that arose here is a code snippet (called a Rule in the language we use) responsible for identifying what resources are available to pay a cost, named ProposeEffectCostResource. Every card text that involves non-mana costs has its own version of this Rule:

  • "Discard a card: Draw a card." has a ProposeEffectCostResource Rule that proposes every card in your hand.

  • "As an additional cost to cast this spell, exile a red card from your graveyard." would propose each red card in your graveyard.

  • "Crew 3" proposes each untapped creature you control, weighted by their power.

Let's put a pin in ProposeEffectCostResource for now to discuss self-referential cards. In the Theros Beyond Death expansion, [[Heliod's Punishment]] was introduced, which was MTG Arena's first card that involved a self-reference in a conferred ability ("Remove a task counter from Heliod's Punishment", "destroy Heliod's Punishment").

Enchanted creature can't attack or block. It loses all abilities and has "{oT}: Remove a task counter from Heliod's Punishment. Then if it has no task counters on it, destroy Heliod's Punishment."

This is quite tricky! Most abilities that include a self-reference mean "this card", or perhaps "the card that put this ability on the stack". Heliod's Punishment attached to your [[Runeclaw Bear]] is not talking about Runeclaw Bear in its mentioning of Heliod's Punishment, even though Runeclaw Bear has the ability. So what is it talking about? It's saying "the card that conferred the ability that was activated". That is, we care about the particular ability-on-permanent to know what the self-reference means. We decided that the salient feature of these cards was that they were on Auras and Equipment and made special code to handle self-references in those cases.

Returning to the subject of effect cost resources, Streets of New Capenna introduced [[Falco Spara, Pactweaver]].

You may cast spells from the top of your library by removing a counter from a creature you control in addition to paying their other costs.

What does the ProposeEffectCostResource Rule look like here?

  1. It proposes each type of counter from among permanents you control, and it's invoked whenever you cast a spell using Falco's ability. Lovely. But what if you have multiple copies of Falco out? Legendary sure doesn't mean what it used to. . .

  2. Well, we don't want to make a separate action for each Falco you have out - we just have one action for "you're casting a particular card using a Falco ability" - we don't keep track of which ability-on-a-Falco is responsible, as it's irrelevant (and if it were displayed, perhaps misleading to a player!). But we ran into a problem here...

  3. Even though only one Falco ability is relevant for the action, ALL of them were using their cost payment Rules for that action. Your selection of a counter was filled up redundantly, and when you picked one, each Falco would remove that type of counter from the permanent you chose.

Still with us? Great – also, we're hiring.

So, we made the decision to decouple the ProposeEffectCostResource Rule from abilities-on-cards, and instead have them associated with just the ability text - all the Falcos have the same ability text, so the Rule executes only once. Our work for the conferred-self-reference stuff for Heliod's Punishment stepped in a later part of writing this Rule, so it reintroduced the ability-on-card to the Rule, and everything was awesome.

But then along came Mean Old [[Gutter Grime]] in Shadows over Innistrad: Remastered.

Whenever a nontoken creature you control dies, put a slime counter on Gutter Grime, then create a green Ooze creature token with "This creature's power and toughness are each equal to the number of slime counters on Gutter Grime."

  • Gutter Grime has a conferred ability with a self-reference, just like Heliod's Punishment.
  • Unlike Heliod's Punishment, it's not an Aura or Equipment. Our solution to the conferred self-reference had to be completely rethought.
  • After a lot of sweat and maybe a few tears, we had such a solution: it involved moving that reference to the conferred-ability-on-a-card to earlier in the code generation process. Later, ProposeEffectCostResource deletes that constraint from the Rule it creates.

And thus, the bug: Such cost-resource Rules for conferred self-referencing abilities lose track of the relevant ability. They now proposed resources without that constraint. For "sacrifice", there's still the constraints of "it's on the battlefield" and "you control it", but costs that don't involve a user selection are simply paid by using, well ... all of the qualified resources.

And with Ninja's Kunai, there's actually two different self-references in its conferred ability: * "Sacrifice Ninja's Kunai." This first one is the type we've been talking about, meaning "the card that conferred this ability". * "Ninja's Kunai deals 3 damage to any target" This second self-reference is interpreted to mean "the permanent that was sacrificed." This explains the... explosive nature of Kunai's bug: each of the sacrificed permanents is interpreted to be that latter "Ninja's Kunai", so each of them deals damage. This feature is usually useful (examples: [[Nightmare Shepherd]] triggering on a Mutated creature dying, [[Skyclave Apparition]] dying after its enters-the-battlefield ability triggered twice due to [[Panharmonicon]]).

In Ian's article, we celebrated our over 3000 regression tests, run every night to ward against releasing buggy code. You may wonder how we didn't catch this. Writing a regression test requires a good deal of effort and thought, since they take the form of scripted games of Magic: The Gathering using our rules engine. Some of these tests take over a day to write. Even the simplest ones involve at least 15 minutes of effort to ideate, write, and validate. That may not sound like a lot of time, until you multiply it by the hundreds of cards in each major card set. Therefore, we don't create such tests for every new card on MTG Arena – we focus on the cards that required specific developer effort to work correctly. For everything else, our (human) QA team tests newly added cards at the beginning of a set's implementation, and again before release. It's unreasonable to expect them to also test every other card we've ever shipped with each and every release!

With a project this big and a game this complex, bugs are inevitable. It's still truly disheartening when they're as impactful as this one, especially knowing how hard my team works to prevent them from happening. Now that we've fixed this bug, the fix's verification is part of our regression test suite. We're also already reconsidering our code analysis methodology so we can be more confident we're not wrecking old cards' behaviors by implementing new ones, making this sort of situation rarer in the first place. Last, but certainly not least - I will also continue to be incredibly proud and impressed by the work my team has produced for this game.

#wotc_staff

External link →
over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by fractalspire

Ah yes, ProposeEffectCostResource. When I first saw this bug, I had a hunch that the problem was ProposeEffectCostResource.

I had first thought the issue was with anaphora resolution (having the parser correctly identify what the self-reference meant). That is, the bug would have been "the parser thought some other phrase was what that self-reference meant". It turns out it did that correctly, we just squashed the meaning out of the result! #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by HotTakes4HotCakes

They likely still have to go through each individual one and check or tweak it.

But when you really think about how card text is written, how standardized it is, and that it has been written that way consistently (more or less) for years now, it's really not all that surprising they can do that.

In a sense, the cards are already written in code. You have an official rule set that explains the order of operations and defines specific functions for specific text, then you have card text that is written methodically to adhere to it. It's all structured logically so the results are seldom in question.

Compare that to some other digital only card games where they don't care very much about the consistent logic of the text. The cards were programmed to work as intended, doesn't matter if the player gets it.

In practice, at least 75% of the cards' generated code do not need individual attention. The advantage of the language of MTG cards is that it's very precise. This is what allows our project to not be a fool's errand in the first place. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by EmTeeEm

Hi Ben! Thanks for the write-up. I love behind the scenes stuff like this.

While we've got you here, could you answer a question I've had for a while: were Boons invented to help the card parser? They kind of confused me because they seem like a thing Alchemy and Arena were already doing for delayed triggers, but with adding the words "You get a boon with..." to everything.

It is just something I've been wondering for a long time, since it seems like it didn't add anything Arena didn't already do, otherwise.

Boons have a few things going for them:

  • We (Studio X card design and us Arena devs together) first got the idea after puzzling over Y22-MID's [[Tenacious Pup]]. We wanted an embedded trigger inside a triggered ability, and for that to be the outer trigger's only effect. But something like "When CARDNAME enters the battlefield, when you cast your next spell" was dissatisfying to read. For the Pup, that's why the 1 life is tacked in there. But we didn't want trinkets like that to be the long term solution. So advantage #1 is that it's a more pleasant read.

  • Boons are currently digital-only, as they sort of have a memory issue for paper. More digital design space is fruitful for us.

  • As they are digital-only, we have more flexibility in adjusting the game rules around them if design needs that compared to paper mechanics.

One disadvantage is that it's pretty awkward to make downside boons, given the connotation of the word! #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by ghalta

Well, we don't want to make a separate action for each Falco you have out - we just have one action for "you're casting a particular card using a Falco ability" - we don't keep track of which ability-on-a-Falco is responsible, as it's irrelevant (and if it were displayed, perhaps misleading to a player!).

Back when rune decks were a thing, and I had multiple [[Runeforge Champion]] on the field, Arena would make me pick which one's ability I was using when I wanted to case a rune for (1) instead of its casting cost.

That seems like a very similar situation. I haven't played that deck in a long while, as it has fallen from standard. Were both changed so that the player didn't have to pick? If not, why were they handled differently?

You know, I think that actually is an undesired behavior. Our rules for providing an alternative cost to an existing action, as Runeforge Champion does, do not have the logic of "is the source object meaningful" like our rules for providing novel action permissions do. I think I'll make a ticket for that, thanks for bringing it to my attention! #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by RealisticCommentBot

similar for multiple serra paragons , though it is at least relevant in some way in that case as you can only use it once per turn per serra paragon

That is absolutely intentional. You should also be able to tell which Serra Paragons have been "used up" (and may want to change which you prefer based on the state of the different Paragons). #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by Crystal__

Now I have the irresistible urge to test the behavior of [[Toralf's Hammer]] pre-bug. Does it deal 3 damage for each permanent you control? Only for each equipment you control? Only for each attached equipment? Only 3 damage regardless? Would it magically unatttach all other equipped equipment you control? Would MTGA collapse trying to unnattach permanents without Equip ability? So many questions!

When we made the Gutter Grime change, our test for Toralf's Hammer failed. That led to us actually carving out an exception to the "delete the ability constraint" for "unattach" costs, which led to that test passing again. We really should have taken that as a warning sign that other similar cards may have issues, certainly. But the upshot is Toralf's Hammer never ended up buggy in release. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by DeeBoFour20

Interesting read. I appreciate the transparency as well. I understand bugs like this can slip through and it's not reasonable to manually test some draft chaff from 4 sets ago.

Can you elaborate on the problem with the "emergency ban" system? I've seen people on here saying that's it bugged. Apparently a WotC employee made some statement to that effect. I think a lot of people got upset that there was no immediate remedy to stop people from cheating for multiple days while you're working on a patch.

I can't really, as it's outside my area of the code. I'm focused pretty much entirely on the "playing a game of MTG" side of things. I do know that getting that system working again is a priority for us. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by space20021

The rules engine was not coded by hand, but generated from machine learning and NLP...?

That's a bold move

Machine learning is not used in our parser. The generation of code is intended to be deterministic, which is a feature machine learning is not a good fit for. Our natural language processing techniques are more old-school stuff like generating syntax trees from grammatical productions and encoding semantic meaning with first-order-logic expressions. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by jmorganmartin

We got a Bug-Atog!

Thanks! It is fascinating.

What about [[Blazing Torch]]?

It was a new-to-Arena card affected by the same bug. It makes sense that you wouldn't write a specific test for this new card with (basically) the same effect as Ninja's Kunai, and it makes sense that you don't re-test every old card. But, you also said that all new cards are tested by the human QA team.

Was Blazing Torch overlooked by human QA because it was on the rotating bonus sheet? Pure speculation on my part, but perhaps they didn't have as much (or any) time to test with those cards because they were late add-ons to the set, or something like that?

That's an excellent question! The unfortunate truth is that Gutter Grime's implementation came in pretty late, apparently after Blazing Torch was retested for release. That's pretty abnormal, and mega-unfortunate. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by jasonsavory123

Can I ask why this approach to creating rules was chosen and simultaneously we don’t have a larger card pool? If the rules are generated by reading oracle rules text, why is pioneer, modern, legacy etc not available ?

I could understand the smaller card pool if rules were manually implemented as functions or equivalent, but this threw me for a loop as something that seems too complex for the limited card pool the game started with.

Well, the dream has always been for the card parser to be a massive productivity boost for backfilling MTG's card catalog. There are a few reasons why it isn't just a snap-of-the-finger though:

  • The Pareto Principle applies: the parser is excellent at handling normal MTG card text, but a sizeable number of MTG cards do things that really no other card does. For example, [[Void Winnower]]'s prohibition on casting even-mana value spells would play some havoc with casting X-cost spells. Perhaps we could just dump the large proportion of cards that work "for free" in engine...

  • ... but in-engine isn't the only concern. There's also the client experience to consider. The engine's been worked on for longer and supports some interactions that the client has never needed to implement before. Plus there's our standards of presentation: we want new content we release to meet our standards of clarity to players and to work with our auxiliary systems like autotap, automatic trigger ordering, etc.

  • And still importantly, it needs to make business sense. Even though the work is cheaper than a new set, it's not free to produce and would similarly not be free to distribute. We want to productize the back catalog and make fun experiences like the remaster sets have been, rather than just dumping thousands of cards that you would acquire... how, exactly?

#wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by ThoseThingsAreWeird

Therefore, we don't create such tests for every new card on MTG Arena

This surprises me a little bit, but it probably has a reasonable answer.

We create regression tests for each new feature, but we've done that from the start. So yeah that adds an extra bit of time onto creating each feature, but we've got a certain level of confidence that we're not breaking stuff in the future (assuming we right the tests correctly, which we always do every time ever...). In the grand scheme of things it's a lot of time, but for each release it's a relatively small amount of time.

Was there a period of time when you weren't creating regression tests? Or is it that your approach to regression tests wasn't covering every Rule? Presumably covering every Rule, would mean you cover every card with an ability? Or actually, that'd need to have regressions on every Rule interacting with every other Rule... Ok yeah I see where this is going...

Ok so I suppose my new question is: how the hell do you even go about confidently testing your rules engine? The perfectionist in me would want a mesh of tests for the entire set of Rules, but that's massively infeasible. Is it just about knowing the game, and knowing "Rule X is going to mess with Rule Y"?

What's a "new feature" for us? This has always been a pretty interesting question to me, for a code-generating system. When a vanilla creature comes out, do you recommend we make a regression test for it? What should the content of that test be? What about a french vanilla creature? What if we have tests for "Draw two cards" already but now a card comes out with "Draw five cards" - is that a new feature?

Our line is "involved developer changes to the parser or engine". This does miss bugs, but in my opinion it is rare. And the greater focus on "new work" allows us to put much more attention in testing the boundary scenarios for the riskiest new behaviors.

Slightly before I joined WotC 6 years ago, our regression test framework was much more inconvenient and brittle, but pretty much from day 1 of engine development there has been some form or another of testing.

As for our strategy for testing, our normal standard is a scripted game with assertions about the game state (or sometimes the internal engine state for stuff like memory leak fixes). They certainly involve a lot of MTG knowledge about "what are tricky interactions to verify". Sometimes, with more algorithmically complex features like autotap (or, perhaps surprisingly, the ZNR party mechanic), we develop unit tests that directly test internal components of the engine. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by RealisticCommentBot

Confusing as the Falco thing is, this exact scenario I think happens (and is a bit odd) when you have two copies of Serra Paragon out, as you have to choose which Serra Paragon you are using to cast a card from your graveyard.

It could totally be relevant mainly because that ability is activate only once each turn compared to spara, but as a user once I'd seen it happen once or twice I undertood what was happening.

I feel it would be similar for spara, but it's defeintly more confusing when they both have counters on them (which is likley the case as they ETB with counters)

The notion is it doesn't matter which Falco you use - the action behaves the exact same way for either. For Serra Paragon, it does matter which you use - that one can't be used again this turn (and maybe you'd prefer to use the one with fewer +1/+1 counters on it just in case your opponent has removal!) #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by slavazin

I'm curious about regression testing. You said that those are difficult to write due to simulating parts of a full game. Why not actually run full games (or slices of full games from state A to B) in some headless mode? Either pull the gameplay from standard tournament games, or play a few games and record the gameplay? you can mix a lot of cards with unique interactions, and after each resolution of a trigger, compare the game state with the recorded state/delta. From my extremely limited pov the downsides would be a lot of computer time spent running through somewhat meaningless actions, but if they're fast enough, you can load a lot of unique game situations in 30 minutes of playing and recording a game. An error can then display the card/trigger that caused the trigger and the mismatch in outcome.
Just curious as to the drawbacks

The problem with taking a recording of a game and saying "make sure it plays like that again" is in determining what "like that" means. We do plenty of changes to the game that don't change the gameplay outcome but do, for example, change the information in requests and responses to the client, change what information is available in the game, change autotap strategies, etc. The advantage with our "scripted game" tests is that we're able to decide precisely what is important to verify with automated assertions, and what aspects of the game's proceedings are allowed to vary over the development of Arena as a project. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by r_xy

so how do you choose what cards get a regression test?

if the conferred ability was such a headache to originally implement wouldnt that make it a good candidate for one?

We test cards that involved a developer's effort to get to work in the first place. Human QA does a pass over a set to identify what didn't automatically work from the first time we generate code for a new card set. Anything that doesn't work at that point is, well, my day job! And work we do there gets verified against regression by an automated test.

When we're closer to release, QA does another full pass to hopefully identify regressions, again focusing on the new cards due to the huge explosion of possible interactions.

I identified in the OP the relevant cards in the story: Heliod's Punishment has plenty of tests that lean on "self-reference in conferred abilities". Unfortunately Heliod's Punishment's behavior doesn't involve the ProposeEffectCostResource rule, which was the center of this bug: its conferred ability's only cost is the tap-symbol. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by Douglasjm

We decided that the salient feature of these cards was that they were on Auras and Equipment and made special code to handle self-references in those cases.

It seems obvious to me that the salient feature is which card the ability was printed on. This is not the first time I've seen a bug in Arena result from not properly considering the "printed on" relationship, though the other one I remember had to do with linked abilities. It makes me wonder if the dev team, and/or the design of the code base, need more awareness of the importance of that relationship.

Can you clarify what your suggestion is? "Printed on" is a pretty ambiguous concept:

  • What about copy effects? If card A becomes a copy of Gutter Grime and triggers to make an Ooze, the reference to "Gutter Grime" on the Ooze means "Card A".

  • That still holds true even if Card A stops being a copy of Gutter Grime.

  • Through horrible shenanigans you're able to make The Book of Vile Darkness create a Vecna token that has Gutter Grime's triggered ability. In that case the "Gutter Grime" phrase on the Ooze it creates refers to the Vecna that made the Ooze token. Was that ability "printed on" Vecna?

I think perhaps what you're trying to say is "Gutter Grime" in the conferred ability refers to "the card that conferred this ability to this Ooze". But that's the whole point of this post - identifying when a self-reference is like that is nontrivial. Our original logic, due to the cards we had covered on Arena, was myopically focused on attachment cards. Gutter Grime challenged that assumption, and us changing our code to account for that led to this oversight. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by Flyrpotacreepugmu

That's quite an interesting look behind the scenes. Ever since someone mentioned that Gutter Grime was the cause, I've been trying to think of how that could possibly break these equipment, but I never would've guessed that was how it happened.

That bit about Falco Spara was also interesting. It also reminded me that multiple copies of [[Muldrotha, the Gravetide]] don't work properly (or at least didn't a couple months ago). Casting one spell of each type removes the option even if you have multiple Muldrothas that should each be able to cast one. I wonder if that's a similar issue to Falco Spara where they all try to do the same thing, or if it's because of Muldrotha's unique UI...

I believe that had been a UI bug, where the client was improperly batching the Muldrotha permissions in its presentation of your actions. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by Juuuuuuuules

As a noncoder, I found this super interesting. I know it’s more work for you but I’d love more of these posts when it’s relevant.

I'd love to tell more stories about "challenging developments that went smoothly". I think there's a couple challenges to that:

  • Less of a narrative! With a bug, there's an immediate hook of "how did that happen", then a cool investigation, a eureka of the issue, and often an embarrassingly simple fix (this bug was fixed just by deleting a line of code!). I think that makes for a pretty clear flow. Most implementation stories don't have such a narrative structure to them, which makes them harder to write about.

  • Scope of background. Even this post had coworkers dozing off with the groundwork I presented to describe the bug. New features are often even less cleanly described.

  • When? What? It can be hard after-the-fact to decide what would make an interesting story to talk about, or when to talk about it.

Still, the reaction to Ian's post and this has us pretty interested in doing more. Heck, I've always wanted to! I'm sure we'll overcome the above challenges haha. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by gitgudds3

I thought the end of this story would be:

“Thank goodness!” said Bilbo laughing, and handed him the tobacco-jar.

Perhaps for an LTR implementation tale! #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by Un111KnoWn

How similarly does MTG Arena work compared to how MTG Online works?

Pretty broad question. In one sense, we're somewhat similar: we both make code happen starting from English strings from new cards to make a good MTG play experience. But our engineering is completely different, from code generation to the actual engine design. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by saxophoneplayingcat

How do you detect the 25% needing individual attention?

Two main ways:

  • The parser fails to generate code. This is great! It's recognizing that something is outside its current boundaries. We usually have a good idea of what we need to do from the error messaging.

  • The parser generates wrong code. Less great. Human QA needs to play the card to see that it's doing the wrong thing. The most common type of problem here is with "anaphora resolution" - figuring out what ambiguous phrases like "it" or "that creature" mean. Why, I just estimated the complexity of a few LTR bugs with that issue moments ago... #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by ThoseThingsAreWeird

What's a "new feature" for us?

In my head I've always separated Arena out into "the actual game of Magic" and then "the bits Arena adds". So I guess new Rules (e.g. Incubate, I think that fits your Rule description), but then also new Arena bits (like the new Codex of the Multiverse)?

What if we have tests for "Draw two cards" already but now a card comes out with "Draw five cards" - is that a new feature?

I guess that depends on how your parser was set up, but I'd wager you've written the parser to be smart enough to say "Draw 2" is the same as "Draw 5" as those are two different tokens1 ("Draw" and number). But then I guess that raises the question of something like is "Draw 1, then Scry 1" the same as "Draw 1" and "Scry 1" (i.e. combined vs separate)?

Our line is "involved developer changes to the parser or engine"

Yeah that makes sense to me 🤷‍♂️ At the risk of the answer being "this thread", what sort of stuff doesn't that catch?

scripted game with assertions about the game state

Are those assertions based on what the parser is saying the card should be doing? Or are they based on what the dev thinks should be the difference? I'm mostly thinking Alchemy cards here; do you need to update your tests for tweaks to those? Or do the tests "just work" because the game state deltas are coming from the parser?


1 : I'm presuming the parser is a tokeniser? Effectively a compiler for the "language" that is Magic rules text?

Tokenization is a component of our parsing process, one very early in the process. It's true that replacing one token with another similar one is often not worth considering to be a big difference. But what about one sentence structure with another? For example "If you would draw a card, draw two cards instead" vs. "If you would draw a card, instead draw two cards" should behave the same despite being worded differently (and both are in fact valid wordings). If we already handled a phrase like "If CARDNAME would deal damage, it deals twice that much damage instead" as well as "If CARDNAME would deal damage, instead it deals twice that much damage", then we've already handled that syntactical difference. Let's say we wrote tests for the latter two cards; we'd find in ad-hoc testing that once we got either version of the draw replacement working (and wrote tests verifying it), the other one would work too. Given that it worked "out of the box", how much effort should we spend testing the new wording? As much as for the wording we had to do work for? It's a tough call, but it's not economical to test every card equally.

what sort of stuff doesn't that catch?

I suppose one type of bug we sometimes miss for cards we don't work on is that we don't do as deep edge-case hunting for the human passes (after all, they have hundreds of cards to get through). When we do work on a card, multiple people brainstorm ways it could go wrong and we check those. With the file pass, there are fewer eyes on it. Internal playtesting does help find issues though.

Are those assertions based on what the parser is saying the card should be doing?

Our tests are pretty much entirely hand-written, so it's a dev's judgment about what's worth paying attention to. We have to update tests sometimes as behaviors change, but that's also true for paper cards. In terms of Alchemy changes, [[Static Discharge]] has had its test updated a lot due to its wording changes, and I'm sure we've not seen the end of that haha. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by Douglasjm

The rules have a concept of "printed on", and use it in the definition of characteristic-defining abilities and the rules for linked abilities, and also for the starting point of applying layers, as well as resolving object references by card name. To implement the rules in a way that conceptually matches how they are written, the Arena code should also recognize and use this concept. Copy effects and shenanigans with Vecna and such do complicate it a bit, though.

For reasoning about it, let's define a concept I'll call "effectively printed on", or "EPO". In the simplest case, without copy effects or shenanigans, an ability is effectively printed on the card that it is, in fact, literally physically printed on. Any reference by name to that card should be interpreted as referring to the ability's EPO card.

To determine an ability's EPO card in the presence of copy effects, it is necessary to distinguish between conferred and non-conferred abilities. A conferred ability inherits its EPO card from the ability or effect that added it. Copying a conferred ability results in an ability that has the same EPO card as the copied instance of the ability does.

A non conferred ability has an EPO card of whatever object it happens to end up on.

References to an ability's EPO card/object should be identified as EPO references by comparing the reference to the name of the actually literally printed on card. The object it actually refers to should then be identified by the principles I just described.

For your examples:

  • Card A becomes a copy of Gutter Grime, and makes an Ooze. The Ooze's ability is conferred, and therefore has the same EPO as the ability that conferred it. The ability that conferred it is on Card A, and was not conferred, so its EPO card is Card A. The Ooze's reference to "Gutter Grime" thus resolves to "Card A".
  • Card A ceases to be a copy of Gutter Grime. The Ooze's ability is still a conferred ability, and the ability that conferred it was on Card A when that ability existed, so the reference should still resolve to "Card A".
  • A Vecna token has Gutter Grime's ability through shenanigans, and creates an Ooze. The Ooze's ability is conferred, therefore it inherits its EPO from the ability that conferred it, which is effectively printed on Vecna.

...Hmm. Actually, having written all that, I think I can simplify it dramatically:

  • If an ability refers by name to the card that it is actually literally printed on, and the ability is conferred rather than inherent, then the reference is to the object that conferred the ability.

Also, side point for linked abilities, in order for two abilities to be linked they must be actually literally printed on the same card. Copy effects and shenanigans are explicitly by the rules irrelevant for that; for a copied or conferred ability to be linked, the ability it's linked with must be copied or conferred from the same source.

I guess my point I'm trying to make is that we do have such a concept (it is a bit hard to discuss given that "ability" means three separate but tangled concepts in MTG). There absolutely is a relationship between an ability-on-card and the card that possesses it, and that's normally how self-references are interpreted. A large part of the complication here (and complication -> misunderstanding -> bugs) is that self-references mean different things in different contexts in MTG text. Your summary of the correct answer is pretty accurate, and it reflects our current logic, but getting there took some iteration and seeing more examples of abilities that contradicted our understanding. #wotc_staff

over 1 year ago - /u/WotC_Jay - Direct link

Originally posted by WotC_BenFinkel

Well, the dream has always been for the card parser to be a massive productivity boost for backfilling MTG's card catalog. There are a few reasons why it isn't just a snap-of-the-finger though:

  • The Pareto Principle applies: the parser is excellent at handling normal MTG card text, but a sizeable number of MTG cards do things that really no other card does. For example, [[Void Winnower]]'s prohibition on casting even-mana value spells would play some havoc with casting X-cost spells. Perhaps we could just dump the large proportion of cards that work "for free" in engine...

  • ... but in-engine isn't the only concern. There's also the client experience to consider. The engine's been worked on for longer and supports some interactions that the client has never needed to implement before. Plus there's our standards of presentation: we want new content we release to meet our standards of clarity to players and to work with our auxiliary systems like autotap, automatic trigger ordering, etc.

  • And still importantly, it needs to make business sense. Even though the work is cheaper than a new set, it's not free to produce and would similarly not be free to distribute. We want to productize the back catalog and make fun experiences like the remaster sets have been, rather than just dumping thousands of cards that you would acquire... how, exactly?

#wotc_staff

Another factor here is QA time. Just because the parser thinks it understands a card, it doesn't necessarily mean that it's right (there are some great stories here that we'll tell sometime). We need to either write automated tests to validate behavior (which takes time) or have QA manually test the card in a variety of scenarios (also takes time).

We want to expand Arena's card pool just like players want. At the most obvious level, releasing new cards directly makes us money. But, more than that, we all work here because we love Magic, and we love Arena, and we want it to continue to grow. But we need to balance that with the people we have who can do the work required. As Ben noted, we're also hiring.

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by ClapSalientCheeks

Here's another instance of a weird "apply this effect to a permanent" occurrence: Urabrask's forge will destroy any available UF-created creature token, ignoring whether or not the token is "THAT token"

Example: Forge makes a token.

Phase the token out; UF finds nothing to sacrifice.

Next turn, a new token is both created and destroyed before the end step.

At this end step, UF will still sacrifice the first token from a turn ago, even though the oracle text does not refer to it.

Haven't yet tested this effect on copies of the token.

We are unable to reproduce this bug, on live or in our dev environments. Are you sure the reproduction steps here are accurate? #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by DonRobo

How do you fix those? Do you hardcore fixes for those or can you develop generic solutions?

We almost never make kludges for cards - our normal workflow is to build solutions that would handle variation: what similar designs would have the same issue? Can we preemptively handle those? For example, for [[Muldrotha, the Grave Tide]], our solution made it so we could handle a similarly worded card that allowed you to play different colors of cards, or different land types, etc., even though Muldrotha is one-of-a-kind. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by PhRzN

Why do you need to regenerate the rules for all cards on each release instead of freezing? Is it because new cards can change behaviors of old ones due to new mechanics?

Ack, sorry I missed this one! There are plenty of reasons to regenerate card code:

  • The biggest reason is consistency. We do not want to be in a world where two nearly identically worded abilities work totally differently because one was "locked in" years ago and behaves totally out of date. Worse yet would be something like two different printings of the same card behaving differently!

  • Frequently, updates we do to new cards want to apply to old cards too, particularly for rules-tangential behavior (client communication, autotap, trigger ordering, etc.). If we've identified behavior in an ability we want to treat differently for those tasks, it's good to apply it to all abilities with that behavior.

  • Rules changes aren't super infrequent in MTG. Having our code be regenerated makes it so that we don't have to manually identify which cards are affected by the rules change - as long as we correctly handle the new requirements generating code for the text, those changes will come "for free" to the old cards. Of course, they still need to be tested! As an example, as we've recently anounced, the new Battle card type is now choosable for "any target". Instead of needing to find every card with that phrase (as well as stuff like "any other target" etc.), we can just change the lexical definition of those phrases to include battles, and all the cards with them will automatically now accept battles as legal targets. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by ZodiacWalrus

Did anyone else think from the title that "Pierce Flesh" and "Spirit Alike" were actually MTG cards and they had somehow made new card-breaking bugs after fixing Kunai/Crowbar lmao?

Thankfully not, though Pierce Flesh does sound like it would make a great black removal spell.

It's a reference to the flavor text of the Kunai! Thanks to /u/WotC_Megan for the idea. #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by PhRzN

Thanks for the info!

Oh, one other great reason is that the whole point of our regression tests is to identify situations where we end up generating wrong code. If we make a change for a future card that would break an old card (in a way that we've tested), we really want the old card's code to change too: our test will fail and we'll notice that the change we're making is the wrong change to make. If the old card's code is frozen, then our regression tests aren't really doing much for us! #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by randommuser90

What was the parsers first response to being fed Emrakul?

A syntax error due to being unfamiliar with the phrase "After that turn". #wotc_staff

over 1 year ago - /u/WotC_BenFinkel - Direct link

Originally posted by grelgen

so why do you refuse to fix Golden Guardian?

Have you checked it this week? The fix was released with Shadows over Innistrad: Remastered. It took so long because the development team wasn't aware of it until somewhat recently; we're working on improving the user bug report pipeline. #wotc_staff






Recent Magic The Gathering: Arena Posts

1 day ago -