Round 5 — How We Got These Results

This document describes the evaluation process for R5 (20 players, 190 matchups). We're sharing this because we want your expertise — if you see errors in our methodology or results, corrections are welcome.

Final numbers: 190 matchups total. 147 resolved by AI agent agreement. 39 resolved by mod review. 4 from historical database.
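
As a quick sanity check, the totals above can be reproduced in a few lines of Python. The variable names are illustrative only; the figures are the ones quoted in this document.

```python
from math import comb

players = 20
matchups = comb(players, 2)  # every unordered pair of decks meets once

# Resolution counts as quoted above (the dictionary keys are our own labels).
resolved = {"agent_agreement": 147, "mod_review": 39, "historical": 4}

print(matchups)                # 190
print(sum(resolved.values()))  # 190
```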

Step 1: Deck Plans

Before evaluating individual matchups, we generated a "deck plan" for each of the 20 decks — a 2-3 sentence summary of what the deck does, its mana sequencing, and key vulnerabilities. These plans were reviewed by a human before proceeding, because an earlier version contained errors (e.g., claiming a deck could play two lands on T1).

You can see the deck plans we used: R5 Deck Plans

Step 2: Per-Deck Agent Evaluation

For each of the 20 decks, we launched a Claude AI agent with:

The full 3CB rules (empty library doesn't lose, worst-outcome convention, best-play search)
Complete Oracle text for all 60 cards in the field
Deck plans for every deck (so the agent understands each deck's strategy)
Instructions to evaluate all 19 opponents from that deck's perspective
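
A rough sketch of how such a per-deck prompt could be assembled is shown here. It is hypothetical: the function name build_prompt, the shape of the decks dictionary, and the placeholder strings are assumptions for illustration, not the code we ran, and it makes no API calls.

```python
def build_prompt(rules, decks, me):
    """Assemble one agent prompt (hypothetical helper).

    `decks` maps a handle to {"oracle": str, "plan": str}; `me` is the
    handle of the deck whose perspective the agent takes.
    """
    def section(title, handle):
        deck = decks[handle]
        return f"## {title} (@{handle})\n{deck['oracle']}\n{deck['plan']}"

    parts = [f"## 3CB Rules\n{rules}", section("Your Deck", me)]
    # One section per opponent: every other deck in the field.
    parts += [section("Opponent", h) for h in decks if h != me]
    parts.append("## Instructions\nEvaluate every opponent in both directions.")
    return "\n\n".join(parts)


# Placeholder data; real prompts carry full Oracle text and reviewed deck plans.
example_decks = {
    h: {"oracle": f"[Oracle text for {h}]", "plan": f"[Deck plan for {h}]"}
    for h in ("deck_a", "deck_b", "deck_c")
}
prompt = build_prompt("[3CB rules]", example_decks, "deck_a")
```

With 20 decks in the field, each prompt built this way would contain 19 opponent sections.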

Each agent produced per-direction verdicts (on the play and on the draw) plus a 1-2 sentence narrative for each direction. This means every matchup was evaluated twice — once from each side.

The prompt instructs agents to think step by step about mana sequencing, interaction timing, and combat math. Both players are assumed to play optimally: win if possible, otherwise force a draw, and accept a loss only when neither is available.
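
To make the "win, else draw, else lose" preference concrete, here is a minimal game-tree sketch. It is an illustration of the convention only: the agents reason in natural language and do not run a search like this, and the toy tree, the function name best_outcome, and the node encoding are all assumptions.

```python
# Outcome values from the point of view of the player to move: win > draw > loss.
WIN, DRAW, LOSS = 3, 1, 0
FLIP = {WIN: LOSS, DRAW: DRAW, LOSS: WIN}  # same result seen by the other player


def best_outcome(node):
    """Return the best result the player to move can force.

    A node is either a terminal outcome (for the player to move at that node)
    or a list of child nodes, each of which is the opponent's turn.
    """
    if isinstance(node, int):
        return node
    # A child's value is from the opponent's view, so flip it back to ours.
    return max(FLIP[best_outcome(child)] for child in node)


# Hypothetical tree: the first line lets the opponent hold us to a draw,
# the second line lets the opponent win outright.
toy_tree = [[WIN, DRAW], [LOSS, DRAW]]
print(best_outcome(toy_tree) == DRAW)  # True: the mover takes the drawing line
```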

Step 3: Crosscheck

We compared the two agents' verdicts for each matchup, direction by direction. The game in which X is on the play is the same game in which Y is on the draw, so if Agent A (evaluating Deck X) says "X wins on the play" and Agent B (evaluating Deck Y) says "Y loses on the draw," they agree. If they disagree, the matchup is flagged.

Initial crosscheck: 126 agreements, 64 disagreements (34% disagreement rate)
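
The comparison can be sketched as follows. This is a hypothetical illustration rather than the pipeline's code: the WIN/LOSS/DRAW labels, the report dictionaries, and the function name crosscheck are assumptions made for the example (the agents themselves emit P0_WINS, P1_WINS, or DRAW), and flagging a matchup when either direction disagrees is our reading of the rule above.

```python
# The same result seen from the other player's side of the table.
MIRROR = {"WIN": "LOSS", "LOSS": "WIN", "DRAW": "DRAW"}


def crosscheck(x_report, y_report):
    """Compare two agents' verdicts for one matchup.

    Each report is {"play": verdict, "draw": verdict} from that deck's own
    point of view. Returns the list of directions where the agents disagree.
    """
    flagged = []
    # X on the play is the same game as Y on the draw, and vice versa.
    if x_report["play"] != MIRROR[y_report["draw"]]:
        flagged.append("X on the play")
    if x_report["draw"] != MIRROR[y_report["play"]]:
        flagged.append("Y on the play")
    return flagged


agree = crosscheck({"play": "WIN", "draw": "LOSS"}, {"play": "WIN", "draw": "LOSS"})
differ = crosscheck({"play": "WIN", "draw": "LOSS"}, {"play": "WIN", "draw": "DRAW"})
print(agree)   # []
print(differ)  # ['X on the play']

# The initial disagreement rate quoted above: 64 of 190 matchups.
print(round(64 / 190 * 100))  # 34
```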

Step 4: Corrections and Re-evaluation

We identified several categories of errors:

Deck plan errors propagating to agents (e.g., wrong mana sequencing)
Card rules errors (e.g., overlooking that Phyrexian Revoker can only name nonland cards)
Strategic errors (e.g., not considering Alpine Moon T1 as an option)

We re-ran 5 of the most problematic decks with corrected deck plans and additional guidance in the prompt. This reduced disagreements from 64 to 40.

We also ran 9 targeted re-evaluations for specific matchups where deck plan errors were most likely to have changed the outcome. 3 of these flipped the result.

Step 5: Human Review

The remaining 40 disagreements were resolved by mod review. For each, we examined both agents' reasoning and narratives, considered the card interactions, and picked the correct result.

These 39 matchups (plus 4 historical) have a placeholder narrative: "Mod-resolved outcome, no narrative. Please argue your case in the thread."

This is where you come in. If you think a mod-resolved matchup is wrong — or any matchup, really — please reply to the results thread with the matchup, your proposed result, and the key card interactions that support it. We'll re-evaluate and apply corrections.

What the Agents See

Here's a simplified version of what each agent receives. The actual prompt is ~27,000 characters and includes full Oracle text for all cards.

You are evaluating Three Card Blind (3CB) matchups
for one deck against all opponents.

## 3CB Rules
- 3-card hand, no library. Drawing from empty library
  does NOT cause a loss.
- Normal Magic rules. Starting life: 20.
- Both players play optimally (3 pts win, 1 draw, 0 loss).
- Coin flips/dice = worst outcome for controller.
- Evaluate EACH DIRECTION independently.
- WL (each wins on play) ≠ DD (neither can win).

## Your Deck (@handle)
[Full Oracle text + deck plan]

## Opponent: @handle
[Full Oracle text + deck plan]
... (19 opponents)

## Instructions
For each opponent:
1. On-the-play verdict + narrative (you go first)
2. On-the-draw verdict + narrative (opponent goes first)
Think step by step about mana sequencing,
interaction timing, and combat math.

VERDICT: P0_WINS | P1_WINS | DRAW
NARRATIVE: [1-2 sentences, under 200 chars]
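
For readers who want to process output in this shape, here is a minimal parsing sketch. It assumes each verdict appears as a VERDICT line immediately followed by a NARRATIVE line, as in the template above; the function name parse_verdicts, the too_long field, and the sample text are hypothetical, and real agent output may need more tolerant handling.

```python
import re

VERDICTS = {"P0_WINS", "P1_WINS", "DRAW"}
# One VERDICT line followed by one single-line NARRATIVE.
PATTERN = re.compile(r"VERDICT:\s*(\w+)\s*\nNARRATIVE:\s*(.+)")


def parse_verdicts(text):
    """Extract (verdict, narrative) pairs and flag narratives of 200+ chars."""
    results = []
    for verdict, narrative in PATTERN.findall(text):
        if verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {verdict}")
        narrative = narrative.strip()
        results.append({
            "verdict": verdict,
            "narrative": narrative,
            "too_long": len(narrative) >= 200,  # the prompt asks for under 200
        })
    return results


# Made-up sample text, not real agent output.
sample = (
    "VERDICT: P0_WINS\nNARRATIVE: Example narrative one.\n\n"
    "VERDICT: DRAW\nNARRATIVE: Example narrative two.\n"
)
parsed = parse_verdicts(sample)
print([p["verdict"] for p in parsed])  # ['P0_WINS', 'DRAW']
```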

Known Limitations

AI agents make mistakes. Complex card interactions, unusual timing windows, and layer-dependent effects are all error-prone. The crosscheck catches many but not all errors.
A 34% initial disagreement rate (64 of 190 matchups) is high. This suggests the agents are uncertain about many matchups, and the "correct" result for some matchups may be genuinely debatable.
Mod review is fallible too. Some of the 39 mod-resolved matchups may be wrong. That's why we're inviting corrections.
For agent-agreed matchups, the narrative may not match the outcome exactly: the two agents can agree on the verdict while describing slightly different play sequences.

Correction Process

Reply to the results thread on Bluesky with your correction. Include:

Which matchup (e.g., "nickchk vs brythefryguy")
Your proposed result (e.g., "WL": nickchk wins on the play and loses on the draw, so each deck wins when it goes first)
Why — the key card interactions that determine the outcome
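
When proposing a result, it may help to see how a two-letter code turns into points. The sketch below assumes each direction is scored separately at 3 points for a win, 1 for a draw, and 0 for a loss, and that the two directions are summed; the summing and the function name score are our assumptions for illustration.

```python
POINTS = {"W": 3, "D": 1, "L": 0}
OPPOSITE = {"W": "L", "D": "D", "L": "W"}


def score(result):
    """Score a two-letter result (on the play, then on the draw).

    The letters are from the first-named deck's point of view. Returns
    (first deck's points, second deck's points).
    """
    mine = sum(POINTS[r] for r in result)
    theirs = sum(POINTS[OPPOSITE[r]] for r in result)
    return mine, theirs


print(score("WL"))  # (3, 3): each deck wins once
print(score("DD"))  # (2, 2): two draws
print(score("WW"))  # (6, 0)
```

Under these assumptions WL and DD are both symmetric but award different totals, which is why the two must not be conflated.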

We'll re-evaluate and update the dashboard. All corrections are tracked with an audit trail.
