Methodology

How we rate cards and classify decks

Most card rankings are one person’s tier list - a single letter grade, assigned by feel, that drifts with the meta and can’t be argued with. We do it differently. We break a card into a handful of observable properties, then derive its rating from a fixed rule. Change the rule, and every card re-rates instantly. This page explains the whole pipeline, including where it falls short.

Part 1 — Rating a card

A holistic tier (“this is an A”) hides its reasoning. When two people disagree, there’s nothing to point at. Our fix is “derive, don’t judge”: a language model reads the card and tags seven small, concrete atoms - each a narrow question with a fixed set of answers. The rating itself is computed from those atoms by deterministic code, not chosen by the model.

The seven atoms

rate
under · on · over

Effect-for-cost against the curve - is the card below, on, or above what its mana cost should buy?

floor
dead · incremental · standalone

What you get with no synergy, cast alone on an empty board the turn it comes down.

dependence
modular · synergy · build-around

How much the payoff is a function of the rest of the deck. Modular works anywhere; build-around needs a dedicated shell.

ceiling
caps · scales · takes-over

Realistic best case. Caps is a fixed bounded effect; scales grows with the game; takes-over wins on its own if unanswered.

flexibility
rigid · flexible

Flexible cards are modal, instant-speed, repeatable, or able to hit multiple target types; rigid cards are none of these.

breadth
narrow · broad

How many game-states it is good in - developing, at parity, ahead, behind.

resilience
fragile · sticky · recurring · n/a

For permanents: how hard it is to answer. Fragile dies to common removal; sticky resists it; recurring comes back.
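As a minimal sketch, the seven atoms could be stored as a small validated record. The allowed values come from the list above; the names `ATOM_VALUES` and `Atoms` are illustrative assumptions, not the site's actual schema.

```python
from dataclasses import dataclass

# Allowed answers per atom, copied from the list above.
ATOM_VALUES = {
    "rate": ("under", "on", "over"),
    "floor": ("dead", "incremental", "standalone"),
    "dependence": ("modular", "synergy", "build-around"),
    "ceiling": ("caps", "scales", "takes-over"),
    "flexibility": ("rigid", "flexible"),
    "breadth": ("narrow", "broad"),
    "resilience": ("fragile", "sticky", "recurring", "n/a"),
}


@dataclass(frozen=True)
class Atoms:
    """Hypothetical container for one card's tags."""
    rate: str
    floor: str
    dependence: str
    ceiling: str
    flexibility: str
    breadth: str
    resilience: str = "n/a"  # non-permanents have nothing to answer

    def __post_init__(self):
        # Reject any tag outside the fixed answer set for its atom.
        for name, allowed in ATOM_VALUES.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {allowed}")


removal = Atoms("on", "incremental", "modular", "caps", "flexible", "broad")
```

Keeping every atom a closed set is what lets the later steps stay deterministic: a tag is either one of the listed answers or an error.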

From atoms to a tier range

A deterministic rollup turns the atoms into a result. Crucially, it produces a range, not a single grade: a tier_floor (honest baseline with no synergy) and a tier_ceiling (best case with the right shell), each one of trap · fringe · tier-2 · tier-1. A modular staple sits at a tight tier-2-to-tier-1 band; a build-around payoff gets a wide spread from fringe to tier-1, which is exactly the point - its power is conditional, and the range says so.

The rollup runs once per format. Brawl (1-v-1, 25 life) and Commander (multiplayer, 40 life) use different constants, because the same card is worth different amounts in each: beatdown is stronger at 25 life, tax effects scale with more opponents, and a fixed-size effect is diluted at 40 life. So a card can read tier-1 in Brawl and tier-2 in Commander from the same atoms.

The rollup also assigns a kind - a one-word role - and a trap_risk flag for cards whose best case is still a trap:

filler
Low power across the board - a deck-filler, not a reason to play it.

role-player
Playable but not premium; narrow or slightly below rate.

strong-staple
High, reliable power - the backbone of a deck.

game-defining
Takes over the game if left unanswered.

build-around-bomb
A build-around whose ceiling takes over - worth building toward.

niche-payoff
A build-around that rewards a shell but does not take over.
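To make the shape of the rollup concrete, here is a toy version. The actual rules and per-format constants are not given on this page, so every point value, the `CAPS_DISCOUNT` table, and the kind thresholds below are invented for illustration only. What the sketch does preserve is the structure described above: the same atoms go in, and a tier range, a kind, and a trap_risk flag come out, with the format changing only the constants.

```python
TIERS = ("trap", "fringe", "tier-2", "tier-1")

# Hypothetical point values; the real rollup constants are not published here.
RATE_PTS = {"under": -1, "on": 0, "over": 1}
FLOOR_PTS = {"dead": 0, "incremental": 1, "standalone": 2}
CEILING_PTS = {"caps": 0, "scales": 1, "takes-over": 2}
# Hypothetical per-format discount for fixed-size ("caps") effects.
CAPS_DISCOUNT = {"brawl": 0, "commander": 1}


def _tier(score):
    """Clamp a point score onto the four-step tier ladder."""
    return TIERS[max(0, min(score, len(TIERS) - 1))]


def rollup(atoms, fmt="brawl"):
    """Derive tier range, kind, and trap_risk from a dict of atom tags."""
    base = FLOOR_PTS[atoms["floor"]] + RATE_PTS[atoms["rate"]]
    if atoms["ceiling"] == "caps":
        base -= CAPS_DISCOUNT[fmt]
    top = base + CEILING_PTS[atoms["ceiling"]]  # never below base
    tier_floor, tier_ceiling = _tier(base), _tier(top)

    if atoms["dependence"] == "build-around":
        kind = "build-around-bomb" if atoms["ceiling"] == "takes-over" else "niche-payoff"
    elif atoms["ceiling"] == "takes-over":
        kind = "game-defining"
    elif TIERS.index(tier_floor) >= 2:
        kind = "strong-staple"
    elif TIERS.index(tier_ceiling) >= 2:
        kind = "role-player"
    else:
        kind = "filler"

    return {
        "tier_floor": tier_floor,
        "tier_ceiling": tier_ceiling,
        "kind": kind,
        "trap_risk": tier_ceiling == "trap",  # best case is still a trap
    }


staple = {"rate": "on", "floor": "standalone", "dependence": "modular", "ceiling": "scales"}
payoff = {"rate": "on", "floor": "incremental", "dependence": "build-around", "ceiling": "takes-over"}
fixed = {"rate": "over", "floor": "standalone", "dependence": "modular", "ceiling": "caps"}
```

Under these made-up constants, `staple` lands in a tight tier-2 to tier-1 band, `payoff` spreads from fringe to tier-1, and `fixed` reads tier-1 in Brawl but tier-2 in Commander from identical atoms.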

Lands and rocks are rated by rule, not by AI

Mana-fixing lands and formulaic mana rocks don’t need a language model - they come in known cycles. A pure-fixing land (fetch, shock, pain, check, triome…) or a formula rock (Sol Ring, Signets, Talismans…) is rated directly from its cycle by deterministic classifiers, skipping the LLM entirely. Every rating records where it came from: a source of atoms, land:<cycle>, or rock:<cycle>, so the provenance of any number is always inspectable.
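A sketch of how that provenance string might be chosen. The lookup tables hold only a few example cards, and the cycle keys (`fetch`, `shock`, `signet`, and so on) and the function name `rating_source` are assumptions for illustration.

```python
# Hypothetical cycle tables; a real catalog would list every card in each cycle.
LAND_CYCLES = {"Flooded Strand": "fetch", "Steam Vents": "shock"}
ROCK_CYCLES = {
    "Sol Ring": "sol-ring",
    "Izzet Signet": "signet",
    "Talisman of Creativity": "talisman",
}


def rating_source(card_name, has_atoms):
    """Return the provenance tag for a rating, or None if the card is unrated."""
    if card_name in LAND_CYCLES:
        return f"land:{LAND_CYCLES[card_name]}"  # rated by rule, no LLM call
    if card_name in ROCK_CYCLES:
        return f"rock:{ROCK_CYCLES[card_name]}"  # rated by rule, no LLM call
    return "atoms" if has_atoms else None
```

Checking the rule-based tables first means a land or rock never reaches the language model, even if atom tags happen to exist for it.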

Because the rollup is pure code, it is versioned and free to re-run. The rollup logic carries a version (currently v6) and the AI tagging prompt carries its own. Find a card the rules get wrong, fix the rule, bump the version, and the entire catalog re-derives at zero AI cost - no re-reading thousands of cards. That tunability, not reproducibility, turned out to be the real payoff (more on that below).

Part 2 — Classifying a deck

Once every card has atoms, a deck has a measurable shape. We build a 23-axis fingerprint: for each non-land card we read off its atom values and its kind, then take the quantity-weighted percentage of the deck sitting on each axis - what share is above-rate, what share takes over the game, what share is build-around, and so on. The result is a 23-number vector describing how the deck is built, independent of which specific cards fill it.
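The fingerprint step can be sketched as a quantity-weighted tally. This is an illustration under stated assumptions: the exact 23 axes are not listed on this page, so the sketch simply emits one axis per atom value or kind it encounters, and it assumes percentages are taken over tagged non-land cards. The card dict layout and the function name `fingerprint` are hypothetical.

```python
from collections import defaultdict


def fingerprint(deck):
    """deck: list of (quantity, card) pairs.

    Each card is a dict with 'is_land', 'atoms' (dict or None), and 'kind'.
    Returns (axis -> percentage of tagged non-land cards, coverage fraction).
    """
    totals = defaultdict(float)
    nonland_qty = 0
    tagged_qty = 0
    for qty, card in deck:
        if card["is_land"]:
            continue
        nonland_qty += qty
        if card["atoms"] is None:
            continue  # untagged cards are skipped, but still count against coverage
        tagged_qty += qty
        for atom, value in card["atoms"].items():
            totals[f"{atom}:{value}"] += qty
        totals[f"kind:{card['kind']}"] += qty
    if tagged_qty == 0:
        return {}, 0.0
    shares = {axis: 100.0 * q / tagged_qty for axis, q in totals.items()}
    return shares, tagged_qty / nonland_qty


deck = [
    (3, {"is_land": False, "atoms": {"rate": "over", "ceiling": "takes-over"}, "kind": "game-defining"}),
    (1, {"is_land": False, "atoms": {"rate": "on", "ceiling": "caps"}, "kind": "role-player"}),
    (2, {"is_land": False, "atoms": None, "kind": None}),
    (4, {"is_land": True, "atoms": None, "kind": None}),
]
shares, coverage = fingerprint(deck)
```

Here three of the four tagged copies are above rate, so `shares["rate:over"]` is 75.0, and coverage is 4 of 6 non-land cards.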

Centroids from real tournament decks

To learn what an archetype looks like, we fingerprinted 676 constructed Pro Tour decklists (Standard, Pioneer, and Modern events, 2023–2026) and averaged them per archetype into centroids - a mean vector plus a per-axis spread. The library has five macro-archetypes:

control 176 decks · aggro 161 decks · combo 141 decks · ramp 62 decks · tempo 56 decks
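Building a centroid from labelled fingerprints might look like the sketch below. It assumes each fingerprint is already a fixed-order list of numbers and takes "spread" to mean the population standard deviation per axis; the page does not say which spread statistic is used, and `build_centroids` is a hypothetical name.

```python
from statistics import mean, pstdev


def build_centroids(labelled):
    """labelled: list of (archetype, vector) pairs with equal-length vectors."""
    by_archetype = {}
    for archetype, vector in labelled:
        by_archetype.setdefault(archetype, []).append(vector)

    centroids = {}
    for archetype, vectors in by_archetype.items():
        columns = list(zip(*vectors))  # one tuple of values per axis
        centroids[archetype] = {
            "mean": [mean(col) for col in columns],
            "spread": [pstdev(col) for col in columns],  # assumed spread measure
            "n": len(vectors),
        }
    return centroids


toy = build_centroids([
    ("aggro", [0.0, 10.0]),
    ("aggro", [10.0, 10.0]),
    ("control", [40.0, 2.0]),
])
```

Keeping `n` alongside each centroid makes thin archetypes easy to spot later.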

Midrange is deliberately absent. When we tried it, midrange decks self-classified only 11% of the time - they scatter across aggro and control because midrange is that blend, not a distinct atom-shape. Rather than ship a phantom cluster, we drop it; a midrange deck classifies to whichever lean dominates it. Unmappable event labels (“other”) are excluded too.

Nearest-centroid matching

To classify a deck, we measure its distance to each centroid as a variance-weighted RMS z-score: each axis’s gap is divided by how much that axis naturally varies within the archetype, so a deck is not penalized for differing on a noisy axis but is penalized for differing on a tight, defining one. The nearest centroid wins. Confidence is 1 / (1 + max(0, z - 1)), which is 100% for any match with z at or below 1 and falls off as the deck drifts further. If fewer than half of a deck’s non-land cards carry atoms, we decline to classify rather than guess.
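The matching step can be sketched directly from that description. The confidence formula and the 50% coverage gate are taken from the text; the `MIN_SPREAD` floor (to avoid dividing by zero on an axis with no variation) and the function names are assumptions.

```python
import math

MIN_SPREAD = 1.0  # assumed floor so a zero-variance axis cannot divide by zero


def rms_z(vector, centroid):
    """Root-mean-square of per-axis z-scores against one centroid."""
    zs = [
        (x - m) / max(s, MIN_SPREAD)
        for x, m, s in zip(vector, centroid["mean"], centroid["spread"])
    ]
    return math.sqrt(sum(z * z for z in zs) / len(zs))


def confidence(z):
    """1 / (1 + max(0, z - 1)): full confidence up to z = 1, then decaying."""
    return 1.0 / (1.0 + max(0.0, z - 1.0))


def classify(vector, centroids, coverage):
    """Return the nearest archetype, or None when under half the deck is tagged."""
    if coverage < 0.5:
        return None
    name, z = min(
        ((n, rms_z(vector, c)) for n, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return {"archetype": name, "z": z, "confidence": confidence(z)}


centroids = {
    "aggro": {"mean": [60.0, 10.0], "spread": [10.0, 5.0]},
    "control": {"mean": [20.0, 40.0], "spread": [10.0, 10.0]},
}
result = classify([50.0, 20.0], centroids, coverage=0.8)
```

In this toy case the deck is one spread away on the first axis and two on the second, so its RMS z to aggro is sqrt(2.5), roughly 1.58, and confidence comes out near 63%.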

Does it actually work? (the candid part)

Held against its own cohort, the model self-classifies 77% of decks to the right archetype. Per macro:

aggro 91%
tempo 79%
control 78%
ramp 71%
combo 60%

Aggro is crisp; combo is the weak point at 60%, because “combo” spans wildly different builds that don’t share one atom-shape. We also tested whether an archetype’s shape holds across formats: a Standard control centroid sits a tiny z = 0.71 from a Modern one - effectively the same shape. That format-invariance is what lets a cross-format Pro Tour cohort describe a Brawl deck at all.

Tuning toward an archetype

The same math runs in reverse. Point at a target archetype and we compute the direction from your deck’s fingerprint to that centroid, weight each axis by how defining it is, and score every card by how much it moves you along that direction. The most counter-aligned cards become suggested cuts; the best-aligned legal candidates in your colors become suggested adds. You can try it on the Deck Tuner.
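One way to read that paragraph in code: treat the weighted gap between the centroid and the deck as a direction, then score each card by its dot product with that direction. This is a sketch under assumptions, not the Deck Tuner's implementation: "how defining an axis is" is modeled as one over the axis spread, each card is represented by a 0/1 vector over the same axes, and legality and color filtering are left out.

```python
def tuning_scores(deck_vector, centroid, card_vectors, min_spread=1.0):
    """Score cards by how far they move the deck toward the target centroid.

    card_vectors: name -> per-axis 0/1 membership list (hypothetical layout).
    Positive scores point toward the archetype, negative scores away from it.
    """
    direction = [
        (m - x) / max(s, min_spread)  # tight axes get more weight (assumed)
        for x, m, s in zip(deck_vector, centroid["mean"], centroid["spread"])
    ]
    return {
        name: sum(d * v for d, v in zip(direction, vec))
        for name, vec in card_vectors.items()
    }


target = {"mean": [60.0, 20.0], "spread": [10.0, 10.0]}
scores = tuning_scores(
    deck_vector=[30.0, 50.0],
    centroid=target,
    card_vectors={"card_a": [1, 0], "card_b": [0, 1], "card_c": [1, 1]},
)
suggested_cut = min(scores, key=scores.get)
suggested_add = max(scores, key=scores.get)
```

With the deck short on the first axis and heavy on the second, `card_a` scores 3.0 and `card_b` scores -3.0, so the first is the suggested add and the second the suggested cut.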

Directional, not “strictly better.” A swap moves your deck’s shape toward the archetype. It does not prove the new card wins more games - that depends on curve, meta, and synergy the fingerprint can’t see.

Part 3 — What this is not

Coverage is partial.

As of June 2026, roughly 42% of the ~14,800 Brawl-legal non-land cards carry atom tags (about 6,150 of them), plus every fixing land and formula rock by rule. Tagging the rest is ongoing. Untagged cards are simply skipped in a deck’s fingerprint - which is why the classifier refuses a verdict below 50% coverage, and why a sparsely-tagged deck reads as “unclassified” rather than wrong.

It reads shape, not coherence.

The fingerprint answers “what archetype does this look like?” - not “is this a good, synergistic deck?” Two decks with identical atom shapes can be miles apart in how well their cards actually work together. Coherence needs simulated play; atoms only see the cards in isolation.

Some centroids are thin.

Ramp (62 decks) and tempo (56) are built from fewer examples than control or aggro, so their boundaries are looser and marginal decks near them deserve more skepticism. Use the confidence score as a quality gate.

Magnitude is approximated.

A 3-damage removal spell and a 5-damage one can tag identically (both on-rate, capped, incremental). We proxy for format magnitude - discounting fixed effects at 40 life, for instance - but atoms are categorical, so fine-grained magnitude is a known blind spot. Limited is deliberately out of scope; the atom model is tuned for singleton formats where the card pool is deep.

Reproducibility was not the win.

We originally expected atoms to be more reproducible than holistic tiers. Testing didn’t bear that out - holistic re-rating was already ~100% stable. The wins that did hold up were correctness (atoms fixed cards a single tier mis-rated), tunability (fix a rule, re-derive for free), and range (conditional cards get an honest spread instead of a falsely precise grade). We keep the system because of those, and we’re telling you the hypothesis that didn’t pan out because that’s the honest thing to do.
