Magic The Gathering: Arena

Magic The Gathering: Arena Dev Tracker




30 Mar

Comment

Originally posted by ThoseThingsAreWeird

What's a "new feature" for us?

In my head I've always separated Arena out into "the actual game of Magic" and then "the bits Arena adds". So I guess new Rules (e.g. Incubate, I think that fits your Rule description), but then also new Arena bits (like the new Codex of the Multiverse)?

What if we have tests for "Draw two cards" already but now a card comes out with "Draw five cards" - is that a new feature?

I guess that depends on how your parser was set up, but I'd wager you've written the parser to be smart enough to say "Draw 2" is the same as "Draw 5" as those are two different tokens ("Draw" and number). But then I guess that raises the question of whether something like "Draw 1, then Scry 1" is the same as "Draw 1" and "Scry 1" (i.e. combined vs separate)?

Our line is "involved developer changes to the parser or engine"

Yeah that m...

Tokenization is one component of our parsing process, and it happens very early on. It's true that replacing one token with another similar one is often not worth treating as a big difference. But what about replacing one sentence structure with another? For example "If you would draw a card, draw two cards instead" vs. "If you would draw a card, instead draw two cards" should behave the same despite being worded differently (and both are in fact valid wordings). If we already handled a phrase like "If CARDNAME would deal damage, it deals twice that much damage instead" as well as "If CARDNAME would deal damage, instead it deals twice that much damage", then we've already handled that syntactical difference. Let's say we wrote tests for the latter two cards; we'd find in ad-hoc testing that once we got either version of the draw replacement working (and wrote tests verifying it), the other one would work too. Given that it worked "out of the box", how much effort should we spend testing...
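
To make the token vs. sentence-structure distinction concrete, here is a minimal Python sketch. It is not Arena's parser: the number-word table and the `tokenize`, `shape`, and `normalize_instead` helpers are all hypothetical names invented for illustration. It only shows the idea that "Draw two" and "Draw five" share a token shape, and that the two placements of "instead" can be reduced to one structure.

```python
import re

# Hypothetical number words; a real system would cover far more.
NUMBER_WORDS = {"a": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def tokenize(text):
    """Lowercase the text and tag each word as a NUMBER or a plain WORD."""
    tokens = []
    for word in re.findall(r"[a-z]+|\d+", text.lower()):
        if word in NUMBER_WORDS:
            tokens.append(("NUMBER", NUMBER_WORDS[word]))
        elif word.isdigit():
            tokens.append(("NUMBER", int(word)))
        else:
            tokens.append(("WORD", word))
    return tokens


def shape(text):
    """Erase numeric values so 'draw two' and 'draw five' compare equal."""
    return [kind if kind == "NUMBER" else value for kind, value in tokenize(text)]


def normalize_instead(text):
    """Reduce a leading or trailing 'instead' in the result clause to one form."""
    condition, _, result = text.rstrip(".").partition(", ")
    words = [w for w in result.split() if w.lower() != "instead"]
    return condition.lower(), " ".join(words).lower()


# Token-level difference: same shape, different number.
print(shape("Draw two cards") == shape("Draw five cards"))

# Sentence-structure difference: same replacement effect, different word order.
first = "If you would draw a card, draw two cards instead"
second = "If you would draw a card, instead draw two cards"
print(normalize_instead(first) == normalize_instead(second))
```

Both comparisons come out equal in this toy, which is the "worked out of the box" situation described above: once one wording is handled, the other falls out of the same structure.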

29 Mar

Comment

Originally posted by saxophoneplayingcat

How do you detect the 25% needing individual attention?

Two main ways:

  • The parser fails to generate code. This is great! It's recognizing that something is outside its current boundaries. We usually have a good idea of what we need to do from the error messaging.

  • The parser generates wrong code. Less great. Human QA needs to play the card to see that it's doing the wrong thing. The most common type of problem here is with "anaphora resolution" - figuring out what ambiguous phrases like "it" or "that creature" mean. Why, I just estimated the complexity of a few LTR bugs with that issue moments ago... #wotc_staff
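
The first of those two cases can be sketched in a few lines of Python. Everything here is a hypothetical stand-in (the `ParseError` class, the two regex "productions", the `generate` and `triage` functions); it only illustrates a parser that refuses text outside its boundaries and reports why. The second case, wrong code that still generates, is not something this kind of check can catch, which is why it needs human QA.

```python
import re


class ParseError(Exception):
    """Raised when card text falls outside the patterns the toy parser knows."""


# Hypothetical patterns standing in for a real grammar.
PATTERNS = [
    (re.compile(r"^draw (\d+) cards?\.?$", re.I), lambda m: f"draw({m.group(1)})"),
    (re.compile(r"^scry (\d+)\.?$", re.I), lambda m: f"scry({m.group(1)})"),
]


def generate(text):
    """Return generated code for known text, or raise ParseError."""
    for pattern, emit in PATTERNS:
        match = pattern.match(text)
        if match:
            return emit(match)
    raise ParseError(f"no production matches: {text!r}")


def triage(cards):
    """Split a set of cards into generated code and cards needing attention."""
    generated, needs_attention = {}, {}
    for name, text in cards.items():
        try:
            generated[name] = generate(text)
        except ParseError as error:
            needs_attention[name] = str(error)
    return generated, needs_attention


cards = {"Card A": "Draw 2 cards.", "Card B": "Scry 1.", "Card C": "Flip a coin."}
generated, needs_attention = triage(cards)
print(generated)
print(needs_attention)
```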

Comment

Originally posted by Un111KnoWn

How similarly does MTG Arena work compared to how MTG Online works?

Pretty broad question. In one sense, we're somewhat similar: we both make code happen starting from English strings from new cards to make a good MTG play experience. But our engineering is completely different, from code generation to the actual engine design. #wotc_staff

Comment

Originally posted by gitgudds3

I thought the end of this story would be:

“Thank goodness!” said Bilbo laughing, and handed him the tobacco-jar.

Perhaps for an LTR implementation tale! #wotc_staff

Comment

Originally posted by Juuuuuuuules

As a noncoder, I found this super interesting. I know it’s more work for you but I’d love more of these posts when it’s relevant.

I'd love to tell more stories about "challenging developments that went smoothly". I think there are a couple of challenges to that:

  • Less of a narrative! With a bug, there's an immediate hook of "how did that happen", then a cool investigation, a eureka of the issue, and often an embarrassingly simple fix (this bug was fixed just by deleting a line of code!). I think that makes for a pretty clear flow. Most implementation stories don't have such a narrative structure to them, which makes them harder to write about.

  • Scope of background. Even this post had coworkers dozing off with the groundwork I presented to describe the bug. New features are often even less cleanly described.

  • When? What? It can be hard after-the-fact to decide what would make an interesting story to talk about, or when to talk about it.

Still, the reaction to Ian's post and this has us pretty interested in doing more. Heck, I've always wanted to! I'm...

Comment

Originally posted by Flyrpotacreepugmu

That's quite an interesting look behind the scenes. Ever since someone mentioned that Gutter Grime was the cause, I've been trying to think of how that could possibly break these equipment, but I never would've guessed that was how it happened.

That bit about Falco Spara was also interesting. It also reminded me that multiple copies of [[Muldrotha, the Gravetide]] don't work properly (or at least didn't a couple months ago). Casting one spell of each type removes the option even if you have multiple Muldrothas that should each be able to cast one. I wonder if that's a similar issue to Falco Spara where they all try to do the same thing, or if it's because of Muldrotha's unique UI...

I believe that had been a UI bug, where the client was improperly batching the Muldrotha permissions in its presentation of your actions. #wotc_staff

Comment

Originally posted by Douglasjm

We decided that the salient feature of these cards was that they were on Auras and Equipment and made special code to handle self-references in those cases.

It seems obvious to me that the salient feature is which card the ability was printed on. This is not the first time I've seen a bug in Arena result from not properly considering the "printed on" relationship, though the other one I remember had to do with linked abilities. It makes me wonder if the dev team, and/or the design of the code base, need more awareness of the importance of that relationship.

Can you clarify what your suggestion is? "Printed on" is a pretty ambiguous concept:

  • What about copy effects? If card A becomes a copy of Gutter Grime and triggers to make an Ooze, the reference to "Gutter Grime" on the Ooze means "Card A".

  • That still holds true even if Card A stops being a copy of Gutter Grime.

  • Through horrible shenanigans you're able to make The Book of Vile Darkness create a Vecna token that has Gutter Grime's triggered ability. In that case the "Gutter Grime" phrase on the Ooze it creates refers to the Vecna that made the Ooze token. Was that ability "printed on" Vecna?

I think perhaps what you're trying to say is "Gutter Grime" in the conferred ability refers to "the card that conferred this ability to this Ooze". But that's the whole point of this post - identifying when a self-reference is like that is nontrivial. Our original logic, due to the cards we had covered on Arena, was myo...

Comment

Originally posted by r_xy

so how do you choose what cards get a regression test?

if the conferred ability was such a headache to originally implement wouldn't that make it a good candidate for one?

We test cards that involved a developer's effort to get to work in the first place. Human QA does a pass over a set to identify what didn't automatically work from the first time we generate code for a new card set. Anything that doesn't work at that point is, well, my day job! And the work we do there gets verified against regression by an automated test.

When we're closer to release, QA does another full pass to hopefully identify regressions, again focusing on the new cards due to the huge explosion of possible interactions.

I identified in the OP the relevant cards in the story: Heliod's Punishment has plenty of tests that lean on "self-reference in conferred abilities". Unfortunately Heliod's Punishment's behavior doesn't involve the ProposeEffectCostResource rule, which was the center of this bug: its conferred ability's only cost is the tap-symbol. #wotc_staff

Comment

Originally posted by slavazin

I'm curious about regression testing. You said that those are difficult to write due to simulating parts of a full game. Why not actually run full games (or slices of full games from state A to B) in some headless mode? Either pull the gameplay from standard tournament games, or play a few games and record the gameplay? You can mix a lot of cards with unique interactions, and after each resolution of a trigger, compare the game state with the recorded state/delta. From my extremely limited pov the downsides would be a lot of computer time spent running through somewhat meaningless actions, but if they're fast enough, you can load a lot of unique game situations in 30 minutes of playing and recording a game. An error can then display the card/trigger that caused the trigger and the mismatch in outcome.
Just curious as to the drawbacks.

The problem with taking a recording of a game and saying "make sure it plays like that again" is in determining what "like that" means. We do plenty of changes to the game that don't change the gameplay outcome but do, for example, change the information in requests and responses to the client, change what information is available in the game, change autotap strategies, etc. The advantage with our "scripted game" tests is that we're able to decide precisely what is important to verify with automated assertions, and what aspects of the game's proceedings are allowed to vary over the development of Arena as a project. #wotc_staff
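
As a rough illustration of that difference, here is a small Python sketch of a "scripted game" style test. The `ToyGame` class and `test_draw_two` are hypothetical and far simpler than any real engine; the point is only that the test asserts on the outcome it cares about and deliberately ignores incidental details (here, a log) that a full recording comparison would trip over.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGame:
    """Hypothetical stand-in for an engine; tracks only what the example needs."""
    hand: list = field(default_factory=list)
    library: list = field(default_factory=list)
    log: list = field(default_factory=list)  # incidental detail, free to change

    def draw(self, count):
        """Move `count` cards from the top of the library to the hand."""
        for _ in range(count):
            self.hand.append(self.library.pop(0))
            self.log.append("draw")


def test_draw_two():
    # Script the game into the exact state the behavior needs.
    game = ToyGame(library=["Island", "Forest", "Swamp"])
    game.draw(2)
    # Assert only on the outcome that matters for this behavior...
    assert game.hand == ["Island", "Forest"]
    assert game.library == ["Swamp"]
    # ...and deliberately not on game.log, which may change across releases.


test_draw_two()
```

If the log format, client messages, or autotap choices change later, a test like this keeps passing, whereas a byte-for-byte replay comparison would need to be re-recorded.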

Comment

Originally posted by RealisticCommentBot

Confusing as the Falco thing is, this exact scenario I think happens (and is a bit odd) when you have two copies of Serra Paragon out, as you have to choose which Serra Paragon you are using to cast a card from your graveyard.

It could totally be relevant mainly because that ability is activate only once each turn compared to Spara, but as a user once I'd seen it happen once or twice I understood what was happening.

I feel it would be similar for Spara, but it's definitely more confusing when they both have counters on them (which is likely the case as they ETB with counters)

The notion is it doesn't matter which Falco you use - the action behaves the exact same way for either. For Serra Paragon, it does matter which you use - that one can't be used again this turn (and maybe you'd prefer to use the one with fewer +1/+1 counters on it just in case your opponent has removal!) #wotc_staff
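
One way to picture that distinction is the following Python sketch. It is an assumption-laden toy, not Arena's action system: the `Permission` record, its `source_matters` flag, and `offered_actions` are all hypothetical. Permissions whose source is irrelevant collapse into a single offered action, while permissions whose source matters stay separate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    card_to_cast: str
    source_id: int
    source_matters: bool  # hypothetical flag: does picking this source change anything?


def offered_actions(permissions):
    """Collapse permissions that differ only by an irrelevant source."""
    seen = set()
    actions = []
    for permission in permissions:
        source = permission.source_id if permission.source_matters else None
        key = (permission.card_to_cast, source)
        if key not in seen:
            seen.add(key)
            actions.append(key)
    return actions


# Two Falco-style permissions: the source is irrelevant, so one action is offered.
falcos = [Permission("Some Spell", 1, False), Permission("Some Spell", 2, False)]
print(len(offered_actions(falcos)))

# Two Paragon-style permissions: each source is used up separately, so two actions.
paragons = [Permission("Some Spell", 1, True), Permission("Some Spell", 2, True)]
print(len(offered_actions(paragons)))
```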

Comment

Originally posted by ThoseThingsAreWeird

Therefore, we don't create such tests for every new card on MTG Arena

This surprises me a little bit, but it probably has a reasonable answer.

We create regression tests for each new feature, but we've done that from the start. So yeah that adds an extra bit of time onto creating each feature, but we've got a certain level of confidence that we're not breaking stuff in the future (assuming we write the tests correctly, which we always do every time ever...). In the grand scheme of things it's a lot of time, but for each release it's a relatively small amount of time.

Was there a period of time when you weren't creating regression tests? Or is it that your approach to regression tests wasn't covering every Rule? Presumably covering every Rule, would mean you cover every card with an ability? Or actually, that'd need to have regressions on every Rule interacting with every other Rule... Ok yeah I see where this is going...

...

What's a "new feature" for us? This has always been a pretty interesting question to me, for a code-generating system. When a vanilla creature comes out, do you recommend we make a regression test for it? What should the content of that test be? What about a french vanilla creature? What if we have tests for "Draw two cards" already but now a card comes out with "Draw five cards" - is that a new feature?

Our line is "involved developer changes to the parser or engine". This does miss bugs, but in my opinion it is rare. And the greater focus on "new work" allows us to put much more attention into testing the boundary scenarios for the riskiest new behaviors.

Slightly before I joined WotC 6 years ago, our regression test framework was much more inconvenient and brittle, but pretty much from day 1 of engine development there has been some form or another of testing.

As for our strategy for testing, our normal standard is a scripted game with assertions a...

Comment

Originally posted by jasonsavory123

Can I ask why this approach to creating rules was chosen and simultaneously we don’t have a larger card pool? If the rules are generated by reading oracle rules text, why are Pioneer, Modern, Legacy, etc. not available?

I could understand the smaller card pool if rules were manually implemented as functions or equivalent, but this threw me for a loop as something that seems too complex for the limited card pool the game started with.

Well, the dream has always been for the card parser to be a massive productivity boost for backfilling MTG's card catalog. There are a few reasons why it isn't just a snap-of-the-finger though:

  • The Pareto Principle applies: the parser is excellent at handling normal MTG card text, but a sizeable number of MTG cards do things that really no other card does. For example, [[Void Winnower]]'s prohibition on casting even-mana value spells would play some havoc with casting X-cost spells. Perhaps we could just dump the large proportion of cards that work "for free" in engine...

  • ... but in-engine isn't the only concern. There's also the client experience to consider. The engine's been worked on for longer and supports some interactions that the client has never needed to implement before. Plus there's our standards of presentation: we want new content we release to meet our standards of clarity to players and to work with our auxiliary systems like autotap...

Comment

Originally posted by jmorganmartin

We got a Bug-Atog!

Thanks! It is fascinating.

What about [[Blazing Torch]]?

It was a new-to-Arena card affected by the same bug. It makes sense that you wouldn't write a specific test for this new card with (basically) the same effect as Ninja's Kunai, and it makes sense that you don't re-test every old card. But, you also said that all new cards are tested by the human QA team.

Was Blazing Torch overlooked by human QA because it was on the rotating bonus sheet? Pure speculation on my part, but perhaps they didn't have as much (or any) time to test with those cards because they were late add-ons to the set, or something like that?

That's an excellent question! The unfortunate truth is that Gutter Grime's implementation came in pretty late, apparently after Blazing Torch was retested for release. That's pretty abnormal, and mega-unfortunate. #wotc_staff

Comment

Originally posted by space20021

The rules engine was not coded by hand, but generated from machine learning and NLP...?

That's a bold move

Machine learning is not used in our parser. The generation of code is intended to be deterministic, which is a feature machine learning is not a good fit for. Our natural language processing techniques are more old-school stuff like generating syntax trees from grammatical productions and encoding semantic meaning with first-order-logic expressions. #wotc_staff
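
For readers unfamiliar with that "old-school" approach, here is a tiny Python sketch of the general technique: hand-written grammar productions build a syntax tree, and the tree is then translated into a logic-style expression. The grammar, the vocabulary tables, and the `Then(...)`/`Draw(you, n)` notation are all invented for illustration and are not taken from Arena's parser. The property worth noticing is determinism: the same text always yields the same tree and the same expression.

```python
NUMBERS = {"one": 1, "two": 2, "three": 3}
VERBS = {"draw": "Draw", "scry": "Scry", "discard": "Discard"}


def parse_effect(tokens, pos):
    """Production: effect -> VERB NUMBER ["card" | "cards"]"""
    if pos >= len(tokens) or tokens[pos] not in VERBS:
        raise ValueError(f"expected a verb at position {pos}")
    verb = tokens[pos]
    if pos + 1 >= len(tokens) or tokens[pos + 1] not in NUMBERS:
        raise ValueError(f"expected a number after {verb!r}")
    count = NUMBERS[tokens[pos + 1]]
    pos += 2
    if pos < len(tokens) and tokens[pos] in ("card", "cards"):
        pos += 1
    return ("Effect", verb, count), pos


def parse_sentence(text):
    """Production: sentence -> effect ("then" effect)*"""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    node, pos = parse_effect(tokens, 0)
    effects = [node]
    while pos < len(tokens):
        if tokens[pos] != "then":
            raise ValueError(f"unexpected word {tokens[pos]!r}")
        node, pos = parse_effect(tokens, pos + 1)
        effects.append(node)
    return ("Sentence", effects)


def to_logic(tree):
    """Translate the syntax tree into a nested, logic-style expression string."""
    _, effects = tree
    terms = [f"{VERBS[verb]}(you, {count})" for _, verb, count in effects]
    expression = terms[0]
    for term in terms[1:]:
        expression = f"Then({expression}, {term})"
    return expression


tree = parse_sentence("Draw two cards, then scry one.")
print(tree)
print(to_logic(tree))
```

Text outside the grammar raises an error rather than producing a guess, which is the behavior a statistical model would not guarantee.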

Comment

Originally posted by DeeBoFour20

Interesting read. I appreciate the transparency as well. I understand bugs like this can slip through and it's not reasonable to manually test some draft chaff from 4 sets ago.

Can you elaborate on the problem with the "emergency ban" system? I've seen people on here saying that it's bugged. Apparently a WotC employee made some statement to that effect. I think a lot of people got upset that there was no immediate remedy to stop people from cheating for multiple days while you're working on a patch.

I can't really, as it's outside my area of the code. I'm focused pretty much entirely on the "playing a game of MTG" side of things. I do know that getting that system working again is a priority for us. #wotc_staff

Comment

Originally posted by Crystal__

Now I have the irresistible urge to test the behavior of [[Toralf's Hammer]] pre-bug. Does it deal 3 damage for each permanent you control? Only for each equipment you control? Only for each attached equipment? Only 3 damage regardless? Would it magically unattach all other equipped equipment you control? Would MTGA collapse trying to unattach permanents without Equip ability? So many questions!

When we made the Gutter Grime change, our test for Toralf's Hammer failed. That led to us actually carving out an exception to the "delete the ability constraint" for "unattach" costs, which led to that test passing again. We really should have taken that as a warning sign that other similar cards may have issues, certainly. But the upshot is Toralf's Hammer never ended up buggy in release. #wotc_staff

Comment

Originally posted by RealisticCommentBot

similar for multiple Serra Paragons, though it is at least relevant in some way in that case as you can only use it once per turn per Serra Paragon

That is absolutely intentional. You should also be able to tell which Serra Paragons have been "used up" (and may want to change which you prefer based on the state of the different Paragons). #wotc_staff

Comment

Originally posted by ghalta

Well, we don't want to make a separate action for each Falco you have out - we just have one action for "you're casting a particular card using a Falco ability" - we don't keep track of which ability-on-a-Falco is responsible, as it's irrelevant (and if it were displayed, perhaps misleading to a player!).

Back when rune decks were a thing, and I had multiple [[Runeforge Champion]] on the field, Arena would make me pick which one's ability I was using when I wanted to cast a rune for (1) instead of its casting cost.

That seems like a very similar situation. I haven't played that deck in a long while, as it has fallen from standard. Were both changed so that the player didn't have to pick? If not, why were they handled differently?

You know, I think that actually is an undesired behavior. Our rules for providing an alternative cost to an existing action, as Runeforge Champion does, do not have the logic of "is the source object meaningful" like our rules for providing novel action permissions do. I think I'll make a ticket for that, thanks for bringing it to my attention! #wotc_staff

Comment

Originally posted by EmTeeEm

Hi Ben! Thanks for the write-up. I love behind the scenes stuff like this.

While we've got you here, could you answer a question I've had for a while: were Boons invented to help the card parser? They kind of confused me because they seem like a thing Alchemy and Arena were already doing for delayed triggers, but with adding the words "You get a boon with..." to everything.

It is just something I've been wondering for a long time, since it seems like it didn't add anything Arena didn't already do, otherwise.

Boons have a few things going for them:

  • We (Studio X card design and us Arena devs together) first got the idea after puzzling over Y22-MID's [[Tenacious Pup]]. We wanted an embedded trigger inside a triggered ability, and for that to be the outer trigger's only effect. But something like "When CARDNAME enters the battlefield, when you cast your next spell" was dissatisfying to read. For the Pup, that's why the 1 life is tacked in there. But we didn't want trinkets like that to be the long term solution. So advantage #1 is that it's a more pleasant read.

  • Boons are currently digital-only, as they sort of have a memory issue for paper. More digital design space is fruitful for us.

  • As they are digital-only, we have more flexibility in adjusting the game rules around them if design needs that compared to paper mechanics.

One disadvantage is that it's pretty awkward to make downside boons, given the connotation of the w...

Comment

Originally posted by HotTakes4HotCakes

They likely still have to go through each individual one and check or tweak it.

But when you really think about how card text is written, how standardized it is, and that it has been written that way consistently (more or less) for years now, it's really not all that surprising they can do that.

In a sense, the cards are already written in code. You have an official rule set that explains the order of operations and defines specific functions for specific text, then you have card text that is written methodically to adhere to it. It's all structured logically so the results are seldom in question.

Compare that to some other digital only card games where they don't care very much about the consistent logic of the text. The cards were programmed to work as intended, doesn't matter if the player gets it.

In practice, at least 75% of the cards' generated code does not need individual attention. The advantage of the language of MTG cards is that it's very precise. This is what allows our project to not be a fool's errand in the first place. #wotc_staff