Update quality-playbook skill to v1.5.6 + add agent (#1402)

Rebuilds branch from upstream/staged (was previously merged from upstream/main, which brought in materialized plugin files that fail Check Plugin Structure on PRs targeting staged). Changes vs. staged: - Update skills/quality-playbook/ to v1.5.6 (31 bundled assets: SKILL.md + LICENSE.txt + 16 references/ + 9 phase_prompts/ + 3 agents/ + bin/citation_verifier.py + quality_gate.py). - Add agents/quality-playbook.agent.md (top-level orchestrator). name: quality-playbook (validator-compliant). - Update docs/README.skills.md quality-playbook row description + bundled-assets list to v1.5.6. - Fix 'unparseable' → 'unparsable' in quality_gate.py (5 instances; codespell preference, both spellings valid). Closes the v1.4.0 → v1.5.6 update in a single clean commit on top of upstream/staged. The preserved backup branch backup-bedbe84-pre-rebuild (SHA bedbe848fa3c0f0eda8e653c42b599a17dd2e354) holds the prior history for reference.
2026-05-15 19:21:45 +00:00 · 2026-05-10 21:31:53 -04:00
parent e7755069e9
commit b8441d218b
32 changed files with 9639 additions and 543 deletions
@@ -65,6 +65,31 @@ Requirements are tagged with `[Req: tier — source]`. Weight your findings by t

 ---

+## Pre-audit docs validation (required triage section)
+
+The triage report must include a `## Pre-audit docs validation` section regardless of whether `reference_docs/` exists. This section documents what the auditors used as their factual baseline.
+
+**If `reference_docs/` exists:** Spot-check the gathered docs for factual accuracy before running the audit. Stale or incorrect docs can skew audit confidence — a model that reads "the library handles X by doing Y" in the docs will rate a divergent finding higher even if the docs are wrong.
+
+**Quick validation procedure (5 minutes max):**
+1. Pick 2–3 factual claims from `reference_docs/` that describe specific runtime behavior (e.g., "invalid input raises ValueError", "field X defaults to Y", "format Z is not supported").
+2. Grep the source code for the cited behavior. Does the code match the docs?
+3. If any claim is wrong, note it in the triage header: "reference_docs/ contains N known inaccuracies: [list]. Findings that rely on these claims are downgraded to NEEDS REVIEW."
+
+**Spot-check claims about code contents must extract, not assert.** When the spec audit prompt or pre-validation includes claims like "function X handles constant Y at line Z," the triage must read the cited lines and report what they actually contain. Do not confirm a claim by checking that the function exists or that the constant is defined somewhere — confirm it by showing the exact text at the cited lines. Format each spot-check result as:
+
+```
+Claim: "vring_transport_features() preserves VIRTIO_F_RING_RESET at line 3527"
+Actual line 3527: `default:`
+Result: CLAIM IS FALSE — line 3527 is the default branch, not a RING_RESET case label
+```
+
+Spot-check claims derived from generated requirements or gathered docs (rather than from the code) are **hypotheses to test**, not facts to confirm. This rule prevents the contamination chain observed in v1.3.17 where a false spot-check claim was accepted as "accurate" without reading the actual lines, causing three auditors to inherit a hallucinated code-presence claim.
+
+**If `reference_docs/` does not exist:** State this explicitly: "No supplemental docs provided. Auditors relied on in-repo specs and code only." This confirms the absence is intentional, not an oversight.
+
+This section fires in every triage, not just when docs are present. In v1.3.5 cross-repo testing, it only fired in 1/8 repos because it was conditional — making it required ensures the audit trail always documents the factual baseline.
+
 ## Running the Audit

 1. Give the identical prompt to three AI tools
@@ -73,7 +98,18 @@ Requirements are tagged with `[Req: tier — source]`. Weight your findings by t

 ## Triage Process

-After all three models report, merge findings:
+After all three models report, merge findings.
+
+**Log the effective council size.** If a model did not return a usable report (timeout, empty output, refusal), record this in the triage header:
+
+```
+## Council Status
+- Model A: Fresh report received (YYYY-MM-DD)
+- Model B: Fresh report received (YYYY-MM-DD)
+- Model C: TIMEOUT — no usable report. Effective council: 2/3.
+```
+
+When the effective council is 2/3, downgrade the confidence tier: "All three" becomes impossible, "Two of three" becomes the ceiling. When the effective council is 1/3, all findings are "Needs verification" regardless of how confident that single model is. Do not silently substitute stale reports from prior runs — if a model didn't produce a fresh report for this run, it didn't participate.

 | Confidence | Found By | Action |
 |------------|----------|--------|
@@ -81,10 +117,36 @@ After all three models report, merge findings:
 | High | Two of three | Likely real — verify and fix |
 | Needs verification | One only | Could be real or hallucinated — deploy verification probe |

+**When the effective council is 2/3 or less:** Distinguish single-auditor findings from multi-auditor findings explicitly in the triage. With a 2/3 council, a finding from both present auditors has "High" confidence. A finding from only one present auditor has "Needs verification" — it cannot be promoted to confirmed BUG without a verification probe, because the missing auditor might have contradicted it. Do not treat all findings as equivalent just because the council is incomplete.
+
+In the triage summary table, add a column for auditor agreement: "2/2 present", "1/2 present", etc. This makes the confidence tier visible and auditable.
+
+**Incomplete council gate for enumeration/dispatch checks.** If the effective council is less than 3/3 and the run includes whitelist/enumeration/dispatch-function checks (claims about which constants a function handles), the audit may not conclude "no confirmed defects" for those checks without executed mechanical proof. Check whether `quality/mechanical/<function>_cases.txt` exists for each relevant function. If it does and shows the constant is present, the claim is confirmed. If it does and shows the constant is absent, the claim is false regardless of what any auditor wrote. If no mechanical artifact exists, generate one before closing the enumeration check. This rule exists because v1.3.18 had an effective council of 1/3, and the single model's triage fabricated line contents for enumeration claims — a mechanical artifact would have caught the contradiction.
+
 ### The Verification Probe

 When models disagree on factual claims, deploy a read-only probe: give one model the disputed claim and ask it to read the code and report ground truth. Never resolve factual disputes by majority vote — the majority can be wrong about what code actually does.

+**Verification probes must produce executable evidence.** Prose reasoning is not sufficient for either confirmations or rejections. Every verification probe must produce a test assertion that mechanically proves the determination:
+
+**For rejections** (finding is false positive): Write an assertion that PASSES, proving the auditor's claim is wrong:
+```python
+# Rejection proof: function X does check for null at line 247
+assert "if (ptr == NULL)" in source_of("X"), "X has null check at line 247"
+```
+If you cannot write a passing assertion, **do not reject the finding**. The inability to produce mechanical proof is itself evidence that the finding may be real.
+
+**For confirmations** (finding is a real bug): Write an assertion that FAILS (expected-failure), proving the bug exists:
+```python
+# Confirmation proof: RING_RESET is not a case label in the whitelist
+assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), \
+    "RING_RESET should be in the switch but is not — cleared by default at line 3527"
+```
+
+**Every assertion must cite an exact line number** for the evidence it references. Not "lines 3527-3528" but "line 3527: `default:`" — showing what the line actually contains.
+
+**Why this rule exists:** In v1.3.16 virtio testing, the triage received a correct minority finding that VIRTIO_F_RING_RESET was missing from a switch/case whitelist. The triage performed a verification probe that claimed lines 3527-3528 "explicitly preserve VIRTIO_F_RING_RESET" — but those lines contained the `default:` branch. The probe hallucinated compliance. Had it been required to write `assert "case VIRTIO_F_RING_RESET:" in source`, the assertion would have failed, exposing the hallucination. Requiring executable evidence makes hallucinated rejections self-defeating.
+
 ### Categorize Each Confirmed Finding

 - **Spec bug** — Spec is wrong, code is fine → update spec
@@ -96,6 +158,45 @@ When models disagree on factual claims, deploy a read-only probe: give one model

 That last category is the bridge between the spec audit and the test suite. Every confirmed finding not already covered by a test should become one.

+### Legacy and historical scripts
+
+Scripts documented as "historical," "deprecated," or "not part of current workflow" are sometimes downgraded during triage on the theory that they don't affect current operations. This is correct when the script genuinely never runs. But if the script's bug has already materialized in canonical artifacts — duplicate entries in a published file, stale data in a checked-in cache, incorrect mappings that downstream tools consume — the bug is not historical. It's a live defect in the repository's published state.
+
+**Rule: If a legacy script's bug is already visible in canonical artifacts, promote it to confirmed BUG regardless of the script's status.** The script may be historical, but the damage it left behind is current. The regression test should target the artifact (the duplicate entry, the stale mapping), not the script — because the artifact is what users encounter.
+
+This rule exists because v1.3.5 bootstrap runs on QPB found duplicate changelog entries and stale cache mappings produced by a "historical" script. Both triages downgraded the findings because the script was historical. But the duplicate entries were already in the published library, visible to every user.
+
+### Cross-artifact consistency check
+
+After triage, compare the spec audit findings against the code review findings from `quality/code_reviews/`. If the code review and spec audit disagree on the same factual claim (one says a bug is real, the other calls it a false positive), flag the disagreement and deploy a verification probe. The code review and spec audit use different methods (structural reading vs. spec comparison), so disagreements are informative, not errors. But a factual contradiction about what the code actually does needs to be resolved before either report is trusted.
+
+## Detecting partial sessions and carried-over artifacts
+
+### Partial session detection
+
+A session that terminates early (timeout, context exhaustion, crash) may generate scaffolding (directory structure, empty templates) without producing the actual review or audit content. The retry mechanism in the run script can regenerate scaffolding but cannot recover the analytical work.
+
+**After any session completes, check for partial results:**
+1. If `quality/code_reviews/` exists but contains no `.md` files with actual findings (or only contains template headers with no BUG/VIOLATED/INCONSISTENT entries), the code review did not run. Mark this as FAILED in PROGRESS.md, not as "complete with no findings."
+2. If `quality/spec_audits/` exists but contains no triage summary, the spec audit did not run.
+3. If `quality/test_regression.*` exists but contains only imports and no test functions, regression tests were not written.
+
+A partial session is not a "clean run with no findings" — it's a failed run that needs to be re-executed. PROGRESS.md should record this clearly: "Phase 6: FAILED — code review session terminated before producing findings. Re-run required."
+
+### Provenance headers on carried-over artifacts
+
+When a new playbook run finds existing artifacts from a previous run (after archiving), or when artifacts survive from a failed session, they must carry provenance headers so readers know their origin.
+
+**If any artifact was NOT generated fresh in the current run**, add a provenance header:
+
+```markdown
+<!-- PROVENANCE: This file was carried over from a previous run ([date]).
+     It was NOT regenerated by the current v1.3.5 run.
+     Treat findings as potentially stale — verify against current source before acting. -->
+```
+
+This prevents the failure mode observed in v1.3.4 where express and zod silently preserved v1.3.3 code reviews and spec audits without marking them as archival. Users reading those artifacts assumed they were fresh v1.3.4 results.
+
 ## Fix Execution Rules

 - Group fixes by subsystem, not by defect number
@@ -130,6 +231,10 @@ Different models have different audit strengths. In practice:

 The specific models that excel will change over time. The principle holds: use multiple models with different strengths, and always include the four guardrails.

+### Minimum model capability
+
+The audit protocol requires reading function bodies, citing line numbers, grepping before claiming missing, and classifying defect types. Lightweight or speed-optimized models (Haiku-class, GPT-4o-mini-class) are not suitable as auditors. They tend to skim rather than read, skip the grep step, and produce shallow or empty reports ("No defects found") on codebases where stronger models find real bugs. Use models with strong code-reading ability for all three auditor slots. A weak auditor doesn't just miss findings — it reduces the Council from three independent perspectives to two.
+
 ## Tips for Writing Scrutiny Areas

 The scrutiny areas are the most important part of the prompt. Generic questions like "check if the code matches the spec" produce generic answers. Specific questions that name functions, files, and edge cases produce specific findings.