Challenge Gate — Bug Validity Review

Purpose

The challenge gate is a self-adversarial review that every confirmed bug must survive before receiving a writeup and regression test. It catches false positives, feature gaps misclassified as bugs, and findings where pattern-matching overrode common sense.

The gate can be invoked two ways:

  1. During a playbook run — automatically applied to bugs matching trigger patterns (see below).
  2. Standalone — pointed at a quality/ directory from a prior run to challenge specific bugs. Example: "Read quality/writeups/BUG-042.md and the source code it references. Run the challenge gate on this bug."

The two-round challenge

For each bug under review, run exactly two rounds. Each round uses a fresh sub-agent so the challenger has no investment in the finding.

Round 1: "Does this strike you as a real bug?"

Provide the sub-agent with:

  • The bug writeup (or BUGS.md entry if no writeup yet)
  • The actual source code at the cited file:line (read it fresh — do not trust the writeup's code snippet)
  • All comments within 10 lines above and below the cited location (a context-extraction sketch follows this list)
  • The project's README section on the relevant feature (if any)
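
The second and third items can be assembled mechanically. A minimal sketch, assuming the citation is a plain "path:line" string; the helper name, the default window, and the example path below are illustrative, not part of the playbook's bundled tooling:

```python
# Hypothetical helper: read the cited source location fresh and return
# everything within `window` lines above and below the cited line.
from pathlib import Path

def cited_context(citation: str, window: int = 10) -> str:
    path, _, line_no = citation.rpartition(":")
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    center = int(line_no) - 1                          # cited line, zero-based
    start = max(0, center - window)
    end = min(len(lines), center + window + 1)
    return "\n".join(lines[start:end])
```

For example, cited_context("src/upload.py:142") hands the challenger the 21 lines surrounding the citation verbatim, rather than the writeup's snippet.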

Prompt the sub-agent:

You are reviewing a bug report filed against an open-source project. Read the source code and the bug report below. Then answer: does this strike you as a real bug?

Before analyzing anything, apply common sense. Step back from the details and ask yourself: if you showed this code and this bug report to a senior developer who has never seen either before, would they say "yes, that's a bug" — or would they say "that's obviously not a bug"? If the answer is obviously not a bug, say so immediately and explain why. Do not rationalize your way past a common-sense answer. The goal of this review is to catch findings where pattern-matching overrode judgment.

Then consider:

  • Is the developer aware of this behavior? (Look for comments, TODO markers, design decision notes, WHY annotations, OODA references.)
  • Is this a documented limitation or intentional trade-off? (Check if other code paths handle this differently by design, not by accident.)
  • Would the project maintainer respond "that's not a bug, that's how it works" or "that's a known limitation we documented"?
  • Is the "expected behavior" in the bug report actually required by any spec, or is it the auditor's opinion about what the code should do?
  • Is this development scaffolding? Values with names like "change-me", "placeholder", "example", "default", "TODO" are not defects — they are self-documenting markers that exist to make the project buildable during development. A feature that is disabled by default and uses placeholder values is an incomplete feature, not a vulnerability.

Give your honest assessment. If it's a real bug, say so and explain why. If it's not, say so and explain why. A finding can be "not a bug" even if the code could be improved — the question is whether a reasonable maintainer would accept this as a defect report.

Round 2: Targeted follow-up

Based on the Round 1 response, issue a single pointed follow-up. The goal is to stress-test whatever position the sub-agent took in Round 1.

If Round 1 said "real bug": The follow-up should challenge the finding from the maintainer's perspective. Use a fresh sub-agent with this framing:

You are the maintainer of this project. A contributor filed this bug report. You wrote the code being criticized. Read the code, the bug report, and the Round 1 assessment below.

Write the single most compelling argument for why this is NOT a bug. Consider: intentional design decisions, documented limitations, deployment context, common patterns in this language/framework, and whether the "expected behavior" is actually specified anywhere authoritative.

Then, after making that argument, state whether you still believe it's a real bug or whether the argument convinced you it's not.

If Round 1 said "not a bug": The follow-up should challenge the dismissal. Use a fresh sub-agent with this framing:

You are a security researcher reviewing this codebase. Another reviewer dismissed this finding as "not a bug." Read the code, the bug report, and the Round 1 dismissal below.

Write the single most compelling argument for why this IS a real bug despite the dismissal. Consider: edge cases the dismissal didn't address, downstream consequences, what happens when the code interacts with other components, and whether "intentional" and "correct" are the same thing.

Then, after making that argument, state whether you believe the finding should be confirmed or dismissed.

Verdict

After both rounds, assign one of three verdicts:

  • CONFIRMED — Both rounds agree it's a real bug, or Round 2's challenge failed to undermine Round 1's confirmation. The bug proceeds to writeup and regression test as normal.
  • DOWNGRADED — The bug is real but the severity was inflated. Adjust severity and update the writeup. Common case: a CRITICAL that should be MEDIUM, or a security finding that's actually a code quality issue.
  • REJECTED — The finding is not a bug. It's a documented design choice, intentional scaffolding, a known limitation, the auditor's opinion rather than a spec violation, or something that common sense says is obviously not a defect. Remove the bug ID. If useful, record it in a "Reviewed and dismissed" appendix in BUGS.md with the challenge reasoning.

Write the verdict and both rounds' reasoning to quality/challenge/BUG-NNN-challenge.md. This file is the audit trail — it shows reviewers that each finding was stress-tested.
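
The per-bug flow, from Round 1 through the audit-trail file, can be sketched as a small orchestration loop. This is a minimal sketch, not the playbook's bundled tooling: run_subagent stands in for whatever sub-agent mechanism the run uses, the prompts dict holds the framings quoted above, and decide_verdict is a placeholder for the judgment call described in the verdict list, not an automatable rule.

```python
# Hypothetical orchestration sketch for challenging a single bug.
# Assumes run_subagent(prompt, *context_blocks) -> str spawns a fresh sub-agent.
from pathlib import Path

def decide_verdict(round1: str, round2: str) -> str:
    """Placeholder: the real verdict is a judgment call, not a keyword match."""
    return "CONFIRMED"  # or "DOWNGRADED" / "REJECTED"

def challenge_bug(bug_id, writeup, source_context, run_subagent, prompts):
    # Round 1: a fresh sub-agent with no investment in the finding.
    round1 = run_subagent(prompts["round1"], writeup, source_context)

    # Round 2: stress-test whatever position Round 1 took.
    # Simplification: in practice, read Round 1's position rather than keyword-matching.
    framing = "researcher" if "not a bug" in round1.lower() else "maintainer"
    round2 = run_subagent(prompts[framing], writeup, source_context, round1)

    verdict = decide_verdict(round1, round2)

    # Audit trail: verdict plus both rounds' reasoning.
    out = Path("quality/challenge") / f"{bug_id}-challenge.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(
        f"# {bug_id} challenge\n\n## Verdict\n{verdict}\n\n"
        f"## Round 1\n{round1}\n\n## Round 2\n{round2}\n"
    )
    return verdict
```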

Auto-trigger patterns

During a playbook run, automatically apply the challenge gate to any bug matching one or more of these patterns. These patterns are where false positives concentrate:

| Pattern | Why it triggers | Example |
| --- | --- | --- |
| Security-class finding (credential leak, auth bypass, injection) | Severity calibration auto-escalates these; the model is incentivized to defend them | BUG-041: "hardcoded JWT secret" that was a development placeholder |
| Code contains design-decision comments at the cited location | WHY comments, OODA references, TODO-with-explanation, or design decision docs within 10 lines of the cited code suggest the developer made a conscious choice | BUG-007/008: // WHY-OODA81: Batch upload uses "default" workspace |
| The "expected behavior" has no spec basis | Bug's spec_basis field says "code inconsistency" rather than citing a spec document, or the requirement was invented by the auditor (Tier 3 / REQ-NNN created during the run) | BUG-041: REQ-019 was created by the auditor, not derived from project docs |
| Another code path handles the same concern differently | If text_upload does X but file_upload doesn't, that might be a real inconsistency — or it might be intentional divergence. The challenge sorts out which. | BUG-001/002: text_upload merges source_ids, file_upload overwrites — challenge confirms this is a real bug because text_upload has an explicit fix comment |
| The finding is about missing functionality rather than incorrect behavior | "This handler doesn't do X" is often a feature gap, not a bug. The challenge checks whether X was ever promised. | BUG-009/029: batch upload "missing" graph writes that were never part of the batch upload's documented scope |

The pattern list is intentionally conservative — it triggers on categories with historically high false-positive rates. Bugs that don't match any pattern skip the challenge gate and proceed directly to writeup.

To add new patterns: append a row to the table above with the pattern description, the reasoning, and a concrete example from a prior run.
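
For a run that tracks findings as structured records, the trigger check itself can be sketched as below. The field names are hypothetical; map them to whatever the run actually records about each bug:

```python
# Hypothetical sketch: does this bug record match an auto-trigger pattern?
# Field names are illustrative, not a fixed schema.
SECURITY_CATEGORIES = {"credential-leak", "auth-bypass", "injection"}

def should_challenge(bug: dict) -> bool:
    # Security-class findings: severity calibration auto-escalates these.
    if bug.get("category") in SECURITY_CATEGORIES:
        return True
    # Design-decision comments (WHY, OODA, explained TODOs) near the cited line.
    if bug.get("design_comment_nearby"):
        return True
    # "Expected behavior" with no spec basis, or an auditor-invented requirement.
    if bug.get("spec_basis") in (None, "code inconsistency") or bug.get("req_invented_during_run"):
        return True
    # Another code path handles the same concern differently.
    if bug.get("divergent_code_path"):
        return True
    # Missing functionality rather than incorrect behavior.
    if bug.get("finding_type") == "missing-functionality":
        return True
    return False
```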

Standalone invocation

When invoked standalone (not during a playbook run), the challenge gate:

  1. Reads the specified bug writeup from quality/writeups/BUG-NNN.md
  2. Reads the source code at the cited file:line (fresh read, not from the writeup)
  3. Runs both rounds as described above
  4. Writes the verdict to quality/challenge/BUG-NNN-challenge.md
  5. If the verdict is REJECTED, suggests removing the bug from BUGS.md and tdd-results.json

Example prompt for standalone use:

Read the quality playbook skill at .github/skills/SKILL.md and .github/skills/references/challenge_gate.md.
Run the challenge gate on BUG-042 using the writeup at quality/writeups/BUG-042.md
and the source code in this repo.

Token budget

Each bug costs roughly 2 sub-agent calls. For a typical run with 5-10 auto-triggered bugs, that's 10-20 sub-agent calls. This is significantly cheaper than a full iteration cycle and catches the highest-value false positives.

For runs with many security findings (>15 auto-triggered), consider batching: run Round 1 on all triggered bugs first, then only run Round 2 on bugs where Round 1 was ambiguous or where the confidence was low.
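
A sketch of that batching, using the same hypothetical run_subagent and prompts as the orchestration sketch above; is_ambiguous is a placeholder for whatever ambiguity signal the run records (a hedged verdict, a low confidence score, split reasoning):

```python
# Hypothetical batching sketch: Round 1 everywhere, Round 2 only where needed.
def is_ambiguous(round1: str) -> bool:
    """Placeholder: flag hedged or low-confidence Round 1 assessments."""
    return any(word in round1.lower() for word in ("unsure", "unclear", "borderline"))

def batched_challenge(bugs, run_subagent, prompts):
    # Round 1 on every auto-triggered bug first.
    round1 = {
        bug["id"]: run_subagent(prompts["round1"], bug["writeup"], bug["source_context"])
        for bug in bugs
    }
    # Round 2 only where Round 1 was ambiguous or low-confidence.
    for bug in bugs:
        r1 = round1[bug["id"]]
        if not is_ambiguous(r1):
            continue
        framing = "researcher" if "not a bug" in r1.lower() else "maintainer"
        run_subagent(prompts[framing], bug["writeup"], bug["source_context"], r1)
```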