# Challenge Gate — Bug Validity Review

## Purpose

The challenge gate is a self-adversarial review that every confirmed bug must survive before receiving a writeup and regression test. It catches false positives, over-classified feature gaps, and findings where pattern-matching overrode common sense.

The gate can be invoked two ways:

1. **During a playbook run** — automatically applied to bugs matching trigger patterns (see below).
2. **Standalone** — pointed at a `quality/` directory from a prior run to challenge specific bugs. Example: `"Read quality/writeups/BUG-042.md and the source code it references. Run the challenge gate on this bug."`

## The two-round challenge

For each bug under review, run exactly two rounds. Each round uses a fresh sub-agent so the challenger has no investment in the finding.

### Round 1: "Does this strike you as a real bug?"

Provide the sub-agent with:

- The bug writeup (or BUGS.md entry if no writeup yet)
- The actual source code at the cited file:line (read it fresh — do not trust the writeup's code snippet)
- All comments within 10 lines above and below the cited location
- The project's README section on the relevant feature (if any)

Prompt the sub-agent:

> You are reviewing a bug report filed against an open-source project. Read the source code and the bug report below. Then answer: **does this strike you as a real bug?**
>
> **Before analyzing anything, apply common sense.** Step back from the details and ask yourself: if you showed this code and this bug report to a senior developer who has never seen either before, would they say "yes, that's a bug" — or would they say "that's obviously not a bug"? If the answer is obviously not a bug, say so immediately and explain why. Do not rationalize your way past a common-sense answer. The goal of this review is to catch findings where pattern-matching overrode judgment.
>
> Then consider:
>
> - Is the developer aware of this behavior? (Look for comments, TODO markers, design decision notes, WHY annotations, OODA references.)
> - Is this a documented limitation or intentional trade-off? (Check if other code paths handle this differently by design, not by accident.)
> - Would the project maintainer respond "that's not a bug, that's how it works" or "that's a known limitation we documented"?
> - Is the "expected behavior" in the bug report actually required by any spec, or is it the auditor's opinion about what the code should do?
> - Is this development scaffolding? Values with names like "change-me", "placeholder", "example", "default", "TODO" are not defects — they are self-documenting markers that exist to make the project buildable during development. A feature that is disabled by default and uses placeholder values is an incomplete feature, not a vulnerability.
>
> Give your honest assessment. If it's a real bug, say so and explain why. If it's not, say so and explain why. A finding can be "not a bug" even if the code could be improved — the question is whether a reasonable maintainer would accept this as a defect report.

### Round 2: Targeted follow-up

Based on the Round 1 response, generate a single pointed follow-up question. The goal is to stress-test whatever position the sub-agent took in Round 1.

**If Round 1 said "real bug":** The follow-up should challenge the finding from the maintainer's perspective. Use a fresh sub-agent with this framing:

> You are the maintainer of this project. A contributor filed this bug report. You wrote the code being criticized. Read the code, the bug report, and the Round 1 assessment below.
>
> Write the single most compelling argument for why this is NOT a bug. Consider: intentional design decisions, documented limitations, deployment context, common patterns in this language/framework, and whether the "expected behavior" is actually specified anywhere authoritative.
>
> Then, after making that argument, state whether you still believe it's a real bug or whether the argument convinced you it's not.

**If Round 1 said "not a bug":** The follow-up should challenge the dismissal. Use a fresh sub-agent with this framing:

> You are a security researcher reviewing this codebase. Another reviewer dismissed this finding as "not a bug." Read the code, the bug report, and the Round 1 dismissal below.
>
> Write the single most compelling argument for why this IS a real bug despite the dismissal. Consider: edge cases the dismissal didn't address, downstream consequences, what happens when the code interacts with other components, and whether "intentional" and "correct" are the same thing.
>
> Then, after making that argument, state whether you believe the finding should be confirmed or dismissed.
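The control flow of the two rounds can be summarized in a short sketch. This is illustrative only, not part of the playbook tooling: `run_subagent` is a hypothetical stand-in for dispatching a fresh sub-agent, and the check on Round 1's position is deliberately crude (in practice the orchestrating agent reads the assessment itself rather than string-matching it).

```python
def run_subagent(prompt: str) -> str:
    """Hypothetical helper: dispatch a fresh sub-agent and return its reply."""
    raise NotImplementedError

# Round 2 framings, abridged from the prompts above.
MAINTAINER_FRAMING = "You are the maintainer of this project..."
RESEARCHER_FRAMING = "You are a security researcher reviewing this codebase..."

def challenge_bug(bug_id: str, writeup: str, source_excerpt: str) -> dict:
    # Round 1: a fresh sub-agent with no investment in the finding.
    round1 = run_subagent(
        "You are reviewing a bug report filed against an open-source project...\n\n"
        f"Bug report ({bug_id}):\n{writeup}\n\nSource code:\n{source_excerpt}"
    )

    # Round 2: the framing depends on the position taken in Round 1.
    # (Crude stand-in for reading the assessment and picking the opposing framing.)
    said_real_bug = "not a bug" not in round1.lower()
    framing = MAINTAINER_FRAMING if said_real_bug else RESEARCHER_FRAMING
    round2 = run_subagent(f"{framing}\n\nRound 1 assessment:\n{round1}")

    return {"bug_id": bug_id, "round_1": round1, "round_2": round2}
```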
### Verdict

After both rounds, assign one of three verdicts:

- **CONFIRMED** — Both rounds agree it's a real bug, or Round 2's challenge failed to undermine Round 1's confirmation. The bug proceeds to writeup and regression test as normal.
- **DOWNGRADED** — The bug is real but the severity was inflated. Adjust severity and update the writeup. Common case: a CRITICAL that should be MEDIUM, or a security finding that's actually a code quality issue.
- **REJECTED** — The finding is not a bug. It's a documented design choice, intentional scaffolding, a known limitation, the auditor's opinion rather than a spec violation, or something that common sense says is obviously not a defect. Remove the bug ID. If useful, record it in a "Reviewed and dismissed" appendix in BUGS.md with the challenge reasoning.

Write the verdict and both rounds' reasoning to `quality/challenge/BUG-NNN-challenge.md`. This file is the audit trail — it shows reviewers that each finding was stress-tested.
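The playbook fixes the path but not the file's internal layout. The sketch below shows one way the record could be assembled; the section headings written into the file are illustrative assumptions, not a prescribed format.

```python
from pathlib import Path

def write_challenge_record(bug_id: str, round1: str, round2: str, verdict: str) -> Path:
    """Assemble the audit-trail file for one challenged bug (illustrative layout)."""
    body = (
        f"# {bug_id} challenge\n\n"
        f"## Verdict\n\n{verdict}\n\n"
        f"## Round 1: does this strike you as a real bug?\n\n{round1}\n\n"
        f"## Round 2: targeted follow-up\n\n{round2}\n"
    )
    path = Path("quality/challenge") / f"{bug_id}-challenge.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body)
    return path
```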
## Auto-trigger patterns

During a playbook run, automatically apply the challenge gate to any bug matching one or more of these patterns. These patterns are where false positives concentrate:

| Pattern | Why it triggers | Example |
|---------|----------------|---------|
| **Security-class finding** (credential leak, auth bypass, injection) | Severity calibration auto-escalates these; the model is incentivized to defend them | BUG-041: "hardcoded JWT secret" that was a development placeholder |
| **Code contains design-decision comments at the cited location** | WHY comments, OODA references, TODO-with-explanation, or design decision docs within 10 lines of the cited code suggest the developer made a conscious choice | BUG-007/008: `// WHY-OODA81: Batch upload uses "default" workspace` |
| **The "expected behavior" has no spec basis** | Bug's `spec_basis` field says "code inconsistency" rather than citing a spec document, or the requirement was invented by the auditor (Tier 3 / REQ-NNN created during the run) | BUG-041: REQ-019 was created by the auditor, not derived from project docs |
| **Another code path handles the same concern differently** | If `text_upload` does X but `file_upload` doesn't, that might be a real inconsistency — or it might be intentional divergence. The challenge sorts out which. | BUG-001/002: `text_upload` merges `source_ids`, `file_upload` overwrites — challenge confirms this is a real bug because `text_upload` has an explicit fix comment |
| **The finding is about missing functionality rather than incorrect behavior** | "This handler doesn't do X" is often a feature gap, not a bug. The challenge checks whether X was ever promised. | BUG-009/029: batch upload "missing" graph writes that were never part of the batch upload's documented scope |

The pattern list is intentionally conservative — it triggers on categories with historically high false-positive rates. Bugs that don't match any pattern skip the challenge gate and proceed directly to writeup.

To add new patterns: append a row to the table above with the pattern description, the reasoning, and a concrete example from a prior run.

## Standalone invocation

When invoked standalone (not during a playbook run), the challenge gate:

1. Reads the specified bug writeup from `quality/writeups/BUG-NNN.md`
2. Reads the source code at the cited file:line (fresh read, not from the writeup)
3. Runs both rounds as described above
4. Writes the verdict to `quality/challenge/BUG-NNN-challenge.md`
5. If the verdict is REJECTED, suggests removing the bug from BUGS.md and tdd-results.json

Example prompt for standalone use:

```
Read the quality playbook skill at .github/skills/SKILL.md and
.github/skills/references/challenge_gate.md. Run the challenge gate on BUG-042
using the writeup at quality/writeups/BUG-042.md and the source code in this repo.
```

## Token budget

Each bug costs roughly 2 sub-agent calls. For a typical run with 5-10 auto-triggered bugs, that's 10-20 sub-agent calls. This is significantly cheaper than a full iteration cycle and catches the highest-value false positives.

For runs with many security findings (>15 auto-triggered), consider batching: run Round 1 on all triggered bugs first, then only run Round 2 on bugs where Round 1 was ambiguous or where the confidence was low.
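A sketch of that batching strategy, with hypothetical placeholders (`run_round1`, `run_round2`, `is_ambiguous`) standing in for the sub-agent calls and the confidence check described above:

```python
def run_round1(bug_id: str) -> str:
    """Hypothetical: Round 1 sub-agent call for one bug."""
    raise NotImplementedError

def run_round2(bug_id: str, round1_assessment: str) -> str:
    """Hypothetical: Round 2 sub-agent call, framed against the Round 1 position."""
    raise NotImplementedError

def is_ambiguous(assessment: str) -> bool:
    """Hypothetical: did Round 1 hedge, or express low confidence?"""
    raise NotImplementedError

def challenge_in_batches(triggered_bug_ids: list[str]) -> dict[str, str]:
    # Round 1 on every triggered bug first.
    round1_results = {bug_id: run_round1(bug_id) for bug_id in triggered_bug_ids}

    # Round 2 only where Round 1 was ambiguous or low-confidence.
    assessments = {}
    for bug_id, assessment in round1_results.items():
        if is_ambiguous(assessment):
            assessments[bug_id] = run_round2(bug_id, assessment)
        else:
            assessments[bug_id] = assessment
    return assessments
```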