Mirror of https://github.com/github/awesome-copilot.git, synced 2026-05-15 19:21:45 +00:00
Update quality-playbook skill to v1.5.6 + add agent (#1402)
Rebuilds branch from upstream/staged (was previously merged from upstream/main, which brought in materialized plugin files that fail Check Plugin Structure on PRs targeting staged). Changes vs. staged:

- Update skills/quality-playbook/ to v1.5.6 (31 bundled assets: SKILL.md + LICENSE.txt + 16 references/ + 9 phase_prompts/ + 3 agents/ + bin/citation_verifier.py + quality_gate.py).
- Add agents/quality-playbook.agent.md (top-level orchestrator). name: quality-playbook (validator-compliant).
- Update docs/README.skills.md quality-playbook row description + bundled-assets list to v1.5.6.
- Fix 'unparseable' → 'unparsable' in quality_gate.py (5 instances; codespell preference, both spellings valid).

Closes the v1.4.0 → v1.5.6 update in a single clean commit on top of upstream/staged. The preserved backup branch backup-bedbe84-pre-rebuild (SHA bedbe848fa3c0f0eda8e653c42b599a17dd2e354) holds the prior history for reference.
@@ -1,21 +1,190 @@
-MIT License
-
-Copyright (c) 2025 Andrew Stellman
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
+Apache License
+Version 2.0, January 2004
+http://www.apache.org/licenses/
+
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+1. Definitions.
+
+"License" shall mean the terms and conditions for use, reproduction,
+and distribution as defined by Sections 1 through 9 of this document.
+
+"Licensor" shall mean the copyright owner or entity authorized by
+the copyright owner that is granting the License.
+
+"Legal Entity" shall mean the union of the acting entity and all
+other entities that control, are controlled by, or are under common
+control with that entity. For the purposes of this definition,
+"control" means (i) the power, direct or indirect, to cause the
+direction or management of such entity, whether by contract or
+otherwise, or (ii) ownership of fifty percent (50%) or more of the
+outstanding shares, or (iii) beneficial ownership of such entity.
+
+"You" (or "Your") shall mean an individual or Legal Entity
+exercising permissions granted by this License.
+
+"Source" form shall mean the preferred form for making modifications,
+including but not limited to software source code, documentation
+source, and configuration files.
+
+"Object" form shall mean any form resulting from mechanical
+transformation or translation of a Source form, including but
+not limited to compiled object code, generated documentation,
+and conversions to other media types.
+
+"Work" shall mean the work of authorship, whether in Source or
+Object form, made available under the License, as indicated by a
+copyright notice that is included in or attached to the work
+(an example is provided in the Appendix below).
+
+"Derivative Works" shall mean any work, whether in Source or Object
+form, that is based on (or derived from) the Work and for which the
+editorial revisions, annotations, elaborations, or other modifications
+represent, as a whole, an original work of authorship. For the purposes
+of this License, Derivative Works shall not include works that remain
+separable from, or merely link (or bind by name) to the interfaces of,
+the Work and Derivative Works thereof.
+
+"Contribution" shall mean any work of authorship, including
+the original version of the Work and any modifications or additions
+to that Work or Derivative Works thereof, that is intentionally
+submitted to the Licensor for inclusion in the Work by the copyright owner
+or by an individual or Legal Entity authorized to submit on behalf of
+the copyright owner. For the purposes of this definition, "submitted"
+means any form of electronic, verbal, or written communication sent
+to the Licensor or its representatives, including but not limited to
+communication on electronic mailing lists, source code control systems,
+and issue tracking systems that are managed by, or on behalf of, the
+Licensor for the purpose of discussing and improving the Work, but
+excluding communication that is conspicuously marked or otherwise
+designated in writing by the copyright owner as "Not a Contribution."
+
+"Contributor" shall mean Licensor and any individual or Legal Entity
+on behalf of whom a Contribution has been received by the Licensor and
+subsequently incorporated within the Work.
+
+2. Grant of Copyright License. Subject to the terms and conditions of
+this License, each Contributor hereby grants to You a perpetual,
+worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+copyright license to reproduce, prepare Derivative Works of,
+publicly display, publicly perform, sublicense, and distribute the
+Work and such Derivative Works in Source or Object form.
+
+3. Grant of Patent License. Subject to the terms and conditions of
+this License, each Contributor hereby grants to You a perpetual,
+worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+(except as stated in this section) patent license to make, have made,
+use, offer to sell, sell, import, and otherwise transfer the Work,
+where such license applies only to those patent claims licensable
+by such Contributor that are necessarily infringed by their
+Contribution(s) alone or by combination of their Contribution(s)
+with the Work to which such Contribution(s) was submitted. If You
+institute patent litigation against any entity (including a
+cross-claim or counterclaim in a lawsuit) alleging that the Work
+or a Contribution incorporated within the Work constitutes direct
+or contributory patent infringement, then any patent licenses
+granted to You under this License for that Work shall terminate
+as of the date such litigation is filed.
+
+4. Redistribution. You may reproduce and distribute copies of the
+Work or Derivative Works thereof in any medium, with or without
+modifications, and in Source or Object form, provided that You
+meet the following conditions:
+
+(a) You must give any other recipients of the Work or
+Derivative Works a copy of this License; and
+
+(b) You must cause any modified files to carry prominent notices
+stating that You changed the files; and
+
+(c) You must retain, in the Source form of any Derivative Works
+that You distribute, all copyright, patent, trademark, and
+attribution notices from the Source form of the Work,
+excluding those notices that do not pertain to any part of
+the Derivative Works; and
+
+(d) If the Work includes a "NOTICE" text file as part of its
+distribution, then any Derivative Works that You distribute must
+include a readable copy of the attribution notices contained
+within such NOTICE file, excluding any notices that do not
+pertain to any part of the Derivative Works, in at least one
+of the following places: within a NOTICE text file distributed
+as part of the Derivative Works; within the Source form or
+documentation, if provided along with the Derivative Works; or,
+within a display generated by the Derivative Works, if and
+wherever such third-party notices normally appear. The contents
+of the NOTICE file are for informational purposes only and
+do not modify the License. You may add Your own attribution
+notices within Derivative Works that You distribute, alongside
+or as an addendum to the NOTICE text from the Work, provided
+that such additional attribution notices cannot be construed
+as modifying the License.
+
+You may add Your own copyright statement to Your modifications and
+may provide additional or different license terms and conditions
+for use, reproduction, or distribution of Your modifications, or
+for any such Derivative Works as a whole, provided Your use,
+reproduction, and distribution of the Work otherwise complies with
+the conditions stated in this License.
+
+5. Submission of Contributions. Unless You explicitly state otherwise,
+any Contribution intentionally submitted for inclusion in the Work
+by You to the Licensor shall be under the terms and conditions of
+this License, without any additional terms or conditions.
+Notwithstanding the above, nothing herein shall supersede or modify
+the terms of any separate license agreement you may have executed
+with Licensor regarding such Contributions.
+
+6. Trademarks. This License does not grant permission to use the trade
+names, trademarks, service marks, or product names of the Licensor,
+except as required for reasonable and customary use in describing the
+origin of the Work and reproducing the content of the NOTICE file.
+
+7. Disclaimer of Warranty. Unless required by applicable law or
+agreed to in writing, Licensor provides the Work (and each
+Contributor provides its Contributions) on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+implied, including, without limitation, any warranties or conditions
+of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+PARTICULAR PURPOSE. You are solely responsible for determining the
+appropriateness of using or redistributing the Work and assume any
+risks associated with Your exercise of permissions under this License.
+
+8. Limitation of Liability. In no event and under no legal theory,
+whether in tort (including negligence), contract, or otherwise,
+unless required by applicable law (such as deliberate and grossly
+negligent acts) or agreed to in writing, shall any Contributor be
+liable to You for damages, including any direct, indirect, special,
+incidental, or consequential damages of any character arising as a
+result of this License or out of the use or inability to use the
+Work (including but not limited to damages for loss of goodwill,
+work stoppage, computer failure or malfunction, or any and all
+other commercial damages or losses), even if such Contributor
+has been advised of the possibility of such damages.
+
+9. Accepting Warranty or Additional Liability. While redistributing
+the Work or Derivative Works thereof, You may choose to offer,
+and charge a fee for, acceptance of support, warranty, indemnity,
+or other liability obligations and/or rights consistent with this
+License. However, in accepting such obligations, You may act only
+on Your own behalf and on Your sole responsibility, not on behalf
+of any other Contributor, and only if You agree to indemnify,
+defend, and hold each Contributor harmless for any liability
+incurred by, or claims asserted against, such Contributor by reason
+of your accepting any such warranty or additional liability.
+
+END OF TERMS AND CONDITIONS
+
+Copyright 2025 Andrew Stellman
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+2324 −65 (file diff suppressed because it is too large)
@@ -0,0 +1,222 @@
# Calibration Orchestrator — autonomous cycle prompt template (v1.5.6)

*Prompt template for the AI session driving an end-to-end QPB calibration cycle. The orchestrator AI executes Steps 1-12 from `ai_context/CALIBRATION_PROTOCOL.md`, spawns playbook subprocesses per benchmark, and writes the cycle audit + Lever Calibration Log entry. Designed for Claude Code sessions but will work in any tool with bash + file tools.*

*This prompt builds on `ai_context/CALIBRATION_PROTOCOL.md` Mode 1 (autonomous). The protocol is the canonical operational guide; this template wires it into v1.5.6's run-state instrumentation so the cycle is fully observable, resumable, and recoverable.*

*Schema for cycle-level events: `references/run_state_schema.md`.*

*Session model — **spawn-and-resume across multiple orchestrator sessions** (v1.5.6 cluster F.1 finding from the 2026-05-02 Pattern 7 cycle). The orchestrator role spans many discrete AI sessions that re-attach to the same cycle directory and resume from `run_state.jsonl`; each session typically drives one cycle step (kick off a benchmark, finalize a benchmark on completion, apply the lever, run Council, etc.) and exits. A long-lived single-session orchestrator was attempted in early prototyping and did not survive realistic AI session lifetimes (timeouts, network drops, operator-ended sessions across the ~4 hours an 8-benchmark cycle takes). The Step 2 spawn pattern below — `nohup` the playbook in the background, append a `benchmark_start` event with the PID, return control — IS the load-bearing recovery mechanism, not an exception case.*

*Compare with `ai_context/AI_ORCHESTRATION_PATTERNS.md`. That document describes a **multi-session orchestrator/worker** pattern where a chat-driving AI controls a separate coding AI via files in a shared directory. This template applies the same multi-session discipline at a different layer: the orchestrator AI sessions (any number across the cycle's lifetime) coordinate the playbook subprocess lifecycle, while the playbook itself is the worker. Use this template when the work to coordinate is a calibration cycle (a fixed Steps 1-12 workflow); use the broader orchestrator/worker pattern when chat-side planning and coding-side execution need to be coordinated outside a calibration cycle.*

---

## Role

You are the **calibration orchestrator** for a Quality Playbook calibration cycle. Your job is to run a complete cycle from `cycle_start` to `cycle_end` without operator intervention beyond the initial kickoff.

You are NOT the playbook AI. You spawn playbook AI sessions (via `python3 -m bin.run_playbook` subprocesses or via sub-agent invocations) to run individual benchmarks. You drive the cycle-level workflow above the playbook.

---

## Inputs (operator provides at kickoff)

The operator launches you with these inputs filled in:

- **`<cycle_name>`** — short kebab-case identifier. Format: `<YYYY-MM-DD>-<lever-or-test-shorthand>`. Example: `2026-05-15-pattern7-displacement-recovery`.
- **`<lever_id>`** — the lever from `ai_context/IMPROVEMENT_LOOP.md` you're calibrating. Example: `lever-1-exploration-breadth-depth`.
- **`<lever_change_description>`** — what you'll actually edit. Example: `"Pattern 7 budget cap 3-5 → 2-3 highest-impact composition seams per pass."`
- **`<benchmarks>`** — comma-separated benchmark list. Example: `chi-1.3.45,chi-1.5.1,virtio-1.5.1,express-1.3.50`.
- **`<hypothesis>`** — the testable claim. Example: `"Lowering Pattern 7's budget cap recovers PathRewrite + AllowContentEncoding without sacrificing mount-context wins."`
- **`<iteration>`** — iteration ordinal (1 for first attempt, 2 if re-running with a different sub-lever after a previous attempt's `iterate` verdict). Default: 1.
- **`<iterate_cap>`** — maximum iterations before halt. Default: 3.

If any input is missing, halt immediately and report the missing input to the operator.

---

## Cycle directory layout

Working directory: `~/Documents/AI-Driven Development/Quality Playbook/Calibration Cycles/<cycle_name>/`

Files you produce:
- `run_state.jsonl` — cycle-level event log (your own append-only output). Schema: `references/run_state_schema.md` "Cycle-level events" section.
- `audit.md` — human-readable cycle audit. Written at cycle close.
- `post-pattern7-snapshots/` (or analogous lever-specific subdir) — copies of post-lever BUGS.md per benchmark, in case canonical paths get overwritten.
- `visualizations/` — populated by `bin/visualize_calibration.py` (available in current releases; may not exist yet during early cycles).

Files you write to elsewhere:
- `metrics/regression_replay/<timestamp>/<bench>-<bench>-all.json` — per-benchmark cell.json (one per pre/post pair).
- `docs/process/Lever_Calibration_Log.md` — append a new cycle entry at cycle close.
---

## Resume semantics

Before doing anything else, check whether `Calibration Cycles/<cycle_name>/run_state.jsonl` exists.

- **No file:** fresh cycle. Proceed to Step 0 below.
- **File exists:** read all events. Find the last event. Pick up where the prior session stopped (a minimal resume-point sketch follows this list):
  - If the last event is `cycle_start`: redo Step 1 (pre-flight), since the prior session crashed before any benchmark work.
  - If the last event is `benchmark_start <bench>` without a matching `benchmark_end`: that benchmark was in flight when the prior session crashed. Check whether `repos/archive/<bench>/quality/run_state.jsonl` shows a `run_end` event. If yes: parse the BUGS.md, append `benchmark_end`, continue to the next benchmark. If no: the playbook session also crashed; restart that benchmark (clean its `quality/`, re-spawn the playbook).
  - If the last event is `lever_change_applied`: pre-lever benchmarks are complete and the lever change is committed; post-lever runs are next.
  - If the last event is `benchmark_end <bench>` for the last bench in the list: all benchmarks are done; proceed to delta computation + cycle close.

Trust artifacts (BUGS.md content, commit history) more than events. If events claim a benchmark is complete but its BUGS.md is empty, re-run it.
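A minimal sketch of the resume-point check, assuming one JSON object per line in `run_state.jsonl` and the event names defined above (the helper names are illustrative, not part of the protocol):

```python
import json
import os
from pathlib import Path

def resume_point(cycle_dir: Path) -> str:
    """Classify where a resumed orchestrator session should pick up."""
    log = cycle_dir / "run_state.jsonl"
    if not log.exists():
        return "fresh-cycle"            # no file: proceed to Step 0
    events = [json.loads(line) for line in log.read_text().splitlines() if line.strip()]
    last = events[-1]
    if last["event"] == "cycle_start":
        return "redo-preflight"         # crashed before any benchmark work
    if last["event"] == "benchmark_start":
        return f"in-flight:{last['benchmark']}"  # check that benchmark's own run_state.jsonl
    if last["event"] == "lever_change_applied":
        return "post-lever-runs"
    if last["event"] == "benchmark_end":
        return "next-benchmark-or-close"
    return "inspect-manually"

def pid_alive(pid: int) -> bool:
    """Probe an in-flight playbook PID without signaling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False                    # dead: clean and re-spawn the benchmark
    except PermissionError:
        return True                     # exists, owned by another user
    return True
```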
---

## Steps

### Step 0: Initialize cycle run-state

If fresh cycle:

1. Create the `Calibration Cycles/<cycle_name>/` directory if absent.
2. Write `run_state.jsonl` with two events (a minimal append-helper sketch follows this list):
   - `_index`: `{"event":"_index","ts":"<now>","schema_version":"1.5.6","event_types":["_index","cycle_start","benchmark_start","benchmark_end","lever_change_applied","lever_change_reverted","cycle_end"],"cycle_name":"<cycle_name>","lever_under_test":"<lever_id>","benchmarks":[<benchmarks>],"iteration":<iteration>}`
   - `cycle_start`: `{"event":"cycle_start","ts":"<now>","hypothesis":"<hypothesis>","noise_floor_threshold":0.05}`
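A sketch of the append-only writer these events (and every later event append in Steps 2-11) can go through; `append_event` is an illustrative helper, not an existing module:

```python
import datetime
import json
from pathlib import Path

def append_event(cycle_dir: Path, event: dict) -> None:
    """Append one cycle-level event as a single JSONL line."""
    event.setdefault("ts", datetime.datetime.now(datetime.timezone.utc).isoformat())
    with open(cycle_dir / "run_state.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

# Step 0, fresh cycle (fields abbreviated; use the full events above):
# append_event(cycle_dir, {"event": "_index", "schema_version": "1.5.6", ...})
# append_event(cycle_dir, {"event": "cycle_start", "hypothesis": hypothesis,
#                          "noise_floor_threshold": 0.05})
```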
### Step 1: Pre-flight

Verify environment per `CALIBRATION_PROTOCOL.md` Step 1 checks:

- `git status --porcelain` clean (or only contains expected scratch files; document any).
- Current branch is `1.5.6` (or whichever development branch you're on); record the HEAD SHA.
- `bin/run_playbook.py --help` runs cleanly.
- `claude --version` (or whichever runner you're using) reports a usable version.
- For each benchmark in `<benchmarks>`: verify `repos/archive/<bench>/` exists; verify `repos/archive/<bench>/quality/previous_runs/<latest>/quality/BUGS.md` exists (this is the historical baseline used for recall computation).

If any pre-flight check fails: append an `error` event with `recoverable:false`, write `cycle_end verdict=halt-preflight-failed`, write a partial audit, and report.
### Step 2: Pre-lever benchmark runs

For each benchmark in `<benchmarks>`:

1. Append `benchmark_start`: `{"event":"benchmark_start","ts":"<now>","benchmark":"<bench>","lever_state":"pre-lever"}`.
2. Verify or restore the canonical pre-lever state of the QPB working tree (the lever change must NOT yet be applied at this point).
3. Reset the benchmark's `quality/` to a known-empty state: `cp -r repos/archive/<bench>/quality/previous_runs/<latest> /tmp/save-<bench> && rm -rf repos/archive/<bench>/quality && mkdir -p repos/archive/<bench>/quality/previous_runs && cp -r /tmp/save-<bench> repos/archive/<bench>/quality/previous_runs/<latest>` (or equivalent — the goal is a fresh `quality/` tree with `previous_runs/` preserved).
4. Spawn the playbook. The realistic mechanism for AI-session-driven cycles is **spawn + resume on re-invocation**:
   - Launch the playbook in the background with output redirected to a log file: `nohup python3 -m bin.run_playbook --claude --phase 1,2,3 repos/archive/<bench> > <bench>-playbook.log 2>&1 &`. Capture the PID.
   - Append a `benchmark_start` event with the PID and log path so a resumed orchestrator can find them.
   - Return control to the operator (or to the calling shell). The orchestrator session ends; the playbook continues running.
   - The operator (or a watchdog) re-invokes the orchestrator periodically (e.g., every 30-60 minutes). On each re-invocation, the orchestrator reads its cycle's `run_state.jsonl`, finds the in-flight benchmark, and checks `repos/archive/<bench>/quality/run_state.jsonl` for `run_end`. If complete: parse BUGS.md, compute recall, append `benchmark_end`, advance to the next benchmark (or next cycle step). If incomplete and the playbook PID is still alive: re-launch the orchestrator later. If incomplete and the PID is dead: the playbook crashed; clean and re-spawn.
   - **Why not synchronous blocking:** AI sessions (Claude Code, Cowork sub-agents) don't reliably block for 30-minute subprocess durations across 8 benchmarks (~4 hours total). The session would time out, drop network, or be ended by the operator. Spawn + resume is the only pattern that survives realistic session lifetimes.
   - **Watchdog timeout:** if a benchmark's playbook hasn't produced a `run_end` event after 90 minutes wall-clock, treat it as hung. Kill the PID, clean the benchmark's `quality/`, append `error recoverable:true`, and re-spawn. After 3 hung-and-restart cycles on the same benchmark, halt with `cycle_end verdict:"halt-playbook-hang"`.
5. When the playbook reports complete: read `repos/archive/<bench>/quality/BUGS.md`. Compute recall: the count of bug IDs in the new BUGS.md that match (by file:line or canonical bug name) any bug ID in `repos/archive/<bench>/quality/previous_runs/<latest>/quality/BUGS.md`. Recall = `|found ∩ baseline| / |baseline|` (a minimal sketch follows this list).
6. Append `benchmark_end`: `{"event":"benchmark_end","ts":"<now>","benchmark":"<bench>","lever_state":"pre-lever","recall":<r>,"bugs_found":[...],"bugs_missed":[...],"historical_baseline_path":"<path>"}`.
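A minimal sketch of step 5's recall computation. The bug-ID regex is an assumption about how BUGS.md names its findings; adapt the matcher to the actual file:line or canonical-bug-name format:

```python
import re
from pathlib import Path

BUG_ID = re.compile(r"^#+\s*(BUG-[\w.-]+)", re.MULTILINE)  # assumed heading format

def bug_ids(bugs_md: Path) -> set[str]:
    return set(BUG_ID.findall(bugs_md.read_text()))

def recall(new_bugs: Path, baseline_bugs: Path) -> float:
    """Recall = |found ∩ baseline| / |baseline|."""
    baseline = bug_ids(baseline_bugs)
    if not baseline:
        return 0.0
    return len(bug_ids(new_bugs) & baseline) / len(baseline)
```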
### Step 3: Apply lever change

1. Edit the file(s) per `<lever_change_description>`. Example for the Pattern 7 displacement recovery cycle: edit `references/exploration_patterns.md` Pattern 7 budget-cap line.
2. Commit to the working branch (1.5.6 or current development branch): `git add <files> && git commit -m "v1.5.6 lever pull (<lever_id>): <change description>" -m "Cycle: <cycle_name>" -m "Iteration: <iteration>" -m "Hypothesis: <hypothesis>"` (separate `-m` flags produce the multi-paragraph message; a literal `\n` inside one double-quoted `-m` string would not).
3. Capture the commit SHA.
4. Append `lever_change_applied`: `{"event":"lever_change_applied","ts":"<now>","lever_id":"<lever_id>","files_changed":[<files>],"commit_sha":"<sha>","description":"<lever_change_description>"}`.

### Step 4: Post-lever benchmark runs

Repeat Step 2's loop with `lever_state:"post-lever"` for each benchmark. Same playbook invocation, same recall computation, same `benchmark_end` event but with `lever_state:"post-lever"`.

After each `benchmark_end`, copy the post-lever BUGS.md aside into `Calibration Cycles/<cycle_name>/post-lever-snapshots/<bench>.md` so it survives any subsequent cleanup.

### Step 5: Compute deltas + cross-benchmark check

1. From the events log, compute per-benchmark `delta = recall_after - recall_before`.
2. Check the cross-benchmark invariant: NO benchmark should regress beyond `noise_floor_threshold` (0.05). If `delta < -0.05` on any benchmark, the lever pull caused a regression there — this is a Block condition (a minimal check sketch follows this list).
3. Build the cell.json output: write to `metrics/regression_replay/<cycle-timestamp>/<lever-bench>-all.json` per the cell.json schema. Include `lever_under_test`, `benchmarks`, `recall_before`, `recall_after`, `delta`, `regression_check.status` (clean/regression), `noise_floor_threshold:0.05`.
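A sketch of the delta computation and cross-benchmark invariant, matching the cell.json fields listed in step 3:

```python
NOISE_FLOOR = 0.05  # noise_floor_threshold

def regression_check(recall_before: dict[str, float],
                     recall_after: dict[str, float]) -> dict:
    """Per-benchmark delta plus the cross-benchmark regression invariant."""
    delta = {b: recall_after[b] - recall_before[b] for b in recall_before}
    regressions = [b for b, d in delta.items() if d < -NOISE_FLOOR]
    return {
        "delta": delta,
        "regression_check": {"status": "regression" if regressions else "clean",
                             "regressions": regressions},
        "noise_floor_threshold": NOISE_FLOOR,
    }
```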
### Step 6: Council review (Mode 1: sub-agent fan-out, three lenses)

Per `CALIBRATION_PROTOCOL.md` Step 7. Spawn three parallel sub-agents using your tool's parallel-agent mechanism (Cowork's Agent tool with `general-purpose` subagent_type, parallel `claude` CLI invocations from bash, etc.). **Three flat lenses, not nested 9-perspective** — Mode 1's autonomous Council is intentionally lighter than the operator-driven nested Council in `CALIBRATION_PROTOCOL.md`'s Mode 2. The full 9-perspective nested panel requires `gh copilot` invocations the orchestrator can't run.

Each of the three sub-agents gets:

- The cycle's hypothesis, lever change diff, pre/post recall numbers per benchmark, and regression check status.
- A focused review lens, one per sub-agent:
  - **Sub-agent 1 (Diagnosis lens):** "Is the lever change well-targeted at the diagnosed symptom?" Reads the cycle's hypothesis and the lever-change diff. Verdict: targets the symptom / doesn't / partial.
  - **Sub-agent 2 (Scope lens):** "Are the recall numbers honest given run conditions?" Reads the per-benchmark `benchmark_end` events and the underlying BUGS.md files. Verdict: numbers reflect reality / numbers may be an artifact of run conditions / inconclusive.
  - **Sub-agent 3 (Regression-risk lens):** "Does any benchmark regress beyond the noise floor? Are wins on one benchmark coming at the cost of losses elsewhere?" Verdict: clean / regression-detected / partial-recovery.

Synthesize into a Council verdict: Ship (all three positive, or two of three positive with no Block), Block (any sub-agent issues a Block, or two of three negative), Iterate (Council surfaces a clearly better sub-lever). Document each sub-agent's verdict in the cycle audit.

### Step 7: Decide verdict

Based on Council outcome + measurement results (a decision sketch follows this list):

- **Ship:** Council Ship + delta > noise floor + cross-benchmark check clean. Lever change stays committed; cycle closes with `verdict:"ship"`.
- **Revert:** Council Block + delta ≤ noise floor, OR cross-benchmark regression. Revert the lever change with a NEW commit: `git revert <sha>`. Do NOT use `git reset --hard` — that destroys history on shared branches and will break any in-flight work or downstream clones (the safety hole the workspace verify-before-claiming rule is built to catch). The revert commit becomes part of the cycle's audit trail. Cycle closes with `verdict:"revert"`.
- **Iterate:** Council suggests a different sub-lever, or measurement results are ambiguous. If `<iteration> < <iterate_cap>`: relaunch yourself with `<iteration> + 1` and a new sub-lever description. If `<iteration> >= <iterate_cap>`: halt with `verdict:"halt-iterate-cap"` — you've exhausted iterations without convergence.
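The same decision table as a sketch (argument names are illustrative; `NOISE_FLOOR` is the 0.05 threshold from Step 5):

```python
NOISE_FLOOR = 0.05

def decide_verdict(council: str, min_delta: float, regressions: list[str],
                   iteration: int, iterate_cap: int) -> str:
    """Map the Council outcome plus measurements onto the cycle verdict."""
    if council == "ship" and min_delta > NOISE_FLOOR and not regressions:
        return "ship"
    if regressions or (council == "block" and min_delta <= NOISE_FLOOR):
        return "revert"   # revert with `git revert <sha>`, never `git reset --hard`
    if iteration < iterate_cap:
        return "iterate"  # relaunch with iteration + 1 and a new sub-lever
    return "halt-iterate-cap"
```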
### Step 8: Write cycle audit

At `Calibration Cycles/<cycle_name>/audit.md`. Sections:

- Header (cycle name, dates, lever, benchmarks, hypothesis, iteration, verdict).
- Pre-flight summary.
- Pre-lever results (per-benchmark recall, BUGS.md summary).
- Lever change applied (commit SHA, files changed, diff stats).
- Post-lever results (per-benchmark recall, deltas, regression check).
- Council synthesis.
- Verdict + rationale.
- Reduced-scope acknowledgment (required whenever a benchmark was dropped from the original cycle scope, i.e., whenever the actual benchmark list is shorter than `<benchmarks>` from the cycle inputs): name the benchmark, the reason, and the follow-up cycle that will close it. v1.5.6 finding: the 2026-05-02 cycle dropped chi-1.5.1 for time budget; the audit explicitly documented the reduced scope and pointed at a follow-up cycle.
- Cycle Findings (anything notable that surfaced — protocol gaps, runtime quirks, follow-on work). **Required even if empty — write `(none)` rather than omitting the section.** v1.5.6 finding: the 2026-05-02 cycle audit did not include this section despite the protocol calling for it; future cycles must include it explicitly so the file's structure is grep-able.

Use the Cycle 1 (chi-1.3.45) audit at `Calibration Cycles/2026-05-01-chi-1.3.45/audit.md` as the template format.

### Step 9: Append Lever Calibration Log entry

At `~/Documents/QPB/docs/process/Lever_Calibration_Log.md`. Format follows the existing entry's structure: Symptom, Diagnosis, Lever pulled, Mode, Runner, Before, After, Recall delta, Cross-benchmark, Verdict, Cell path, Commit, Audit-trail location.

### Step 10: Generate visualizations (if `bin/visualize_calibration.py` exists)

Run `python3 -m bin.visualize_calibration <cycle-dir>`. Produces 4 PNGs into `Calibration Cycles/<cycle_name>/visualizations/`. If the script is unavailable in the checkout you're using, skip with a note in the audit.

### Step 11: Write `cycle_end` event

Append to `Calibration Cycles/<cycle_name>/run_state.jsonl`:

```json
{"event":"cycle_end","ts":"<now>","verdict":"<ship|revert|iterate|halt-iterate-cap>","recall_before":{<bench>:<r>,...},"recall_after":{<bench>:<r>,...},"delta":{<bench>:<d>,...},"cross_benchmark_check":{"clean":<bool>,"regressions":[...]}}
```

### Step 12: Final report to operator

Print a summary block to stdout:

- Cycle name, iteration, verdict.
- Per-benchmark before/after/delta recall in tabular form.
- Council synthesis one-liner.
- Paths to audit.md, cell.json, the calibration log entry, and visualizations.
- Next steps (if `iterate` and below cap: spawning iteration N+1; if `halt-iterate-cap`: the operator should review and decide whether to intervene manually; if `ship` or `revert`: cycle complete).

---

## Failure modes and recovery

- **Playbook subprocess crashes mid-run:** the per-benchmark `quality/run_state.jsonl` will show no `run_end`. Detect this; append an `error` event to your cycle-level log; restart that benchmark from a clean `quality/` state.
- **Council sub-agents fail to return:** retry once. If still failing, fall back to a 3-perspective flat review, or skip Council and close the cycle as `iterate` so the operator can run the Council manually.
- **Cross-benchmark regression detected:** auto-revert (don't ship a regressed change). Document the regression in the audit.
- **Iterate cap reached:** halt with `verdict:"halt-iterate-cap"`. Don't keep trying — surface to the operator that the lever space hasn't yielded a fix in `<iterate_cap>` attempts.
- **Disk space, network, or auth errors:** append an `error` event with `recoverable:false`; write a partial audit; halt.
- **You realize mid-cycle that a step assumption is wrong (e.g., benchmark archive missing):** halt at the next safe boundary; document; surface to the operator.
- **Orchestrator-side API budget exhausted mid-cycle (v1.5.6 finding from the 2026-05-02 Pattern 7 cycle):** the cycle log stays consistent (last `benchmark_start` for the in-flight target with no matching `benchmark_end`), but the orchestrator session itself is dead. **Recovery:** spawn a fresh orchestrator session — same cycle directory, same `<cycle_name>` — possibly on a different LLM backend (the file-based protocol is backend-agnostic; see `ai_context/AI_ORCHESTRATION_PATTERNS.md` §9.5). The new session reads `run_state.jsonl`, finds the in-flight benchmark, checks its `quality/run_state.jsonl` for `run_end`, and either (a) finalizes that benchmark (compute recall, append `benchmark_end`) if the playbook completed during the orchestrator outage, or (b) treats the benchmark as needing a clean re-spawn. **Reduced-scope option:** if budget pressure makes completing the original benchmark list infeasible, the cycle MAY drop a benchmark and ship a reduced-scope verdict — but the dropped benchmark MUST be (i) named explicitly in audit.md's "Reduced-scope acknowledgment" section, (ii) flagged for a follow-up single-benchmark cycle in the next release window, and (iii) chosen so the cycle's load-bearing benchmark (the one most directly tied to the hypothesis) is NOT the one dropped. The 2026-05-02 cycle exemplified this — chi-1.5.1 was dropped on time-budget grounds, and the displacement-recovery story was concentrated on chi-1.3.45 (which was completed); chi-1.5.1 is closed by a follow-up single-benchmark cycle in the next release window.
- **Express-style mid-benchmark interruption (post-lever drop):** if a benchmark's pre-lever cell completed but the post-lever run was interrupted before producing a replayable cell snapshot (e.g., the express-1.3.50 case in 2026-05-02), audit.md MUST acknowledge it as `n/a` for that benchmark's delta — do NOT extrapolate from the pre-lever data alone. A follow-up post-lever-only run (with the lever applied to recreate the post-lever state) closes the gap.

---

## Discipline reminders

- **Trust artifacts more than events.** If your event log says a benchmark completed but the BUGS.md is empty, re-run that benchmark.
- **Calibrated reporting.** Don't claim recall numbers without computing them from actual BUGS.md files. Don't claim a Ship verdict without an actual Council synthesis.
- **No wall-clock estimates.** When reporting time-to-completion, use phase counts (`3 benchmarks remaining`), not durations.
- **Verify before claiming.** Before saying "lever change committed," confirm the commit SHA via `git log`. Before saying "audit written," confirm the file exists and is non-empty.
- **No per-phase briefs.** This template is the brief. Don't produce intermediate planning docs for individual benchmarks.

---

## Out of scope for this orchestrator

- Designing the lever change. The operator provides `<lever_change_description>`; you apply it, you don't invent it.
- Modifying the playbook prose (SKILL.md, references/exploration_patterns.md beyond the documented lever change). If the cycle reveals a non-lever defect (e.g., the runner-side "Phase 1 archived as complete with 0-line EXPLORATION.md" finding), document it in the audit's "Cycle Findings" section but don't auto-fix it; that's a separate cycle or a v1.5.7 cleanup item.
- Promoting a Ship verdict to a release tag. The cycle's commit ships the lever change; the release happens separately when v1.5.6 (or whichever version) is ready to ship.
@@ -0,0 +1,117 @@
---
name: quality-playbook
description: "Run a complete quality engineering audit on any codebase. Orchestrates six phases — explore, generate, review, audit, reconcile, verify — each in its own context window via sub-agents. Then runs iteration strategies to find even more bugs. Finds the 35% of real defects that structural code review alone cannot catch."
tools:
  - Agent
  - Read
  - Glob
  - Grep
  - Bash
model: inherit
---

# Quality Playbook — Claude Code Orchestrator

## You are the orchestrator

If you are reading this file, your Claude Code session IS the orchestrator. Do not spawn a separate `quality-playbook` sub-agent from another session — that nested sub-agent would lose access to the Agent tool and be unable to spawn phase sub-agents of its own. Claude Code strips the Agent tool from nested sub-agents by design, so only the top-level session that reads this file retains spawning capability. Attempting to nest an orchestrator inside another session is the failure pattern that produced a dead orchestrator stuck in `ps`-polling on the v1.4.3→v1.4.4 casbin run.

The playbook architecture uses exactly one level of sub-agents: you (the top-level orchestrator) spawn one sub-agent per phase, and each sub-agent does its work in a fresh context window and returns its summary. That's the full nesting depth — and it's all we need. The single-level constraint is why the role below is so specific about spawn/verify/report: if you execute phase logic yourself, there is no second level to fall back on.

## Your role

Your ONLY jobs are: (1) spawn sub-agents to execute phases, (2) verify their output files exist on disk, (3) report progress to the user. You do NOT execute phase logic yourself. Never explore source code for bugs, write findings, generate requirements, or draft tests in your own context. If you find yourself doing any of that, you have violated your role.

## File-writing override

The user's invocation of this playbook is explicit authorization for you and every sub-agent you spawn to write .md files, patches, JSON, and any other artifacts to the `quality/` directory and the project root (`AGENTS.md`). This overrides any base harness rules discouraging documentation or .md file creation. No sub-agent should skip file writes citing harness constraints.

## Rationalization patterns to watch for

If you catch yourself producing text like any of these, stop — you're about to collapse into single-context execution:

- "per system constraint: no report .md files" (or any invented harness restriction)
- "I'll do the analytical work in-context and summarize for the user"
- "spawning a sub-agent is unnecessary overhead for this step"
- "I can cover multiple phases in one pass"
- "the artifacts are optional / can be described rather than written"

Any of these means you're about to replicate the casbin failure. Spawn the sub-agent instead.

## Read the protocol file before Phase 1

`references/orchestrator_protocol.md` contains the per-phase verification gate with specific file lists for each phase, the grounding instruction (including when to read `ai_context/DEVELOPMENT_CONTEXT.md`), and the error recovery procedure. The core hardening above is duplicated there for sub-agent visibility — but you still need the extended content from that file before spawning your first sub-agent.

## Setup: find the skill

Look for SKILL.md in these locations, in order:

1. `SKILL.md`
2. `.claude/skills/quality-playbook/SKILL.md`
3. `.github/skills/SKILL.md` (Copilot, flat layout)
4. `.cursor/skills/quality-playbook/SKILL.md` (Cursor)
5. `.continue/skills/quality-playbook/SKILL.md` (Continue)
6. `.github/skills/quality-playbook/SKILL.md` (Copilot, nested layout)

Also check for a `references/` directory alongside SKILL.md.

**If not found**, tell the user to install it from https://github.com/andrewstellman/quality-playbook and stop.

## Pre-flight checks

1. **Check for documentation.** Look for `docs/`, `reference_docs/`, or `documentation/`. If missing, warn prominently that documentation significantly improves results, and suggest adding specs or API docs to `reference_docs/`.

2. **Ask about scope.** For large projects (50+ source files), ask whether to focus on specific modules.

## Orchestration protocol

Use the Agent tool to spawn a sub-agent for each phase. Each sub-agent gets its own context window automatically. Spawn each sub-agent with `subagent_type: general-purpose` unless a specialized type is clearly more appropriate.

**Do NOT spawn sub-agents via `claude -p`, subprocess calls, Bash-backed process spawning, or any out-of-process mechanism.** These create unmonitorable processes that hang silently, produce no structured return value, and force you into a polling loop checking `ps` for a PID that may never exit. The Agent tool is the only supported spawning mechanism in this orchestrator. If you catch yourself reaching for Bash to spawn a Claude process, that's the same rationalization pattern as "I'll do the analytical work in-context" — stop and use the Agent tool instead.

The sub-agent — not you — does all the phase work. Pass it a prompt along these lines:

> Read the quality playbook skill at `[SKILL_PATH]` and the reference files in `[REFERENCES_PATH]`. Read `quality/PROGRESS.md` for context from prior phases. Execute Phase N following the skill's instructions exactly. Write all artifacts to the `quality/` directory. Update `quality/PROGRESS.md` with the phase checkpoint when done.

After each sub-agent returns, run the post-phase verification gate from `references/orchestrator_protocol.md` BEFORE reporting the phase as complete.

## Two modes

### Mode 1: Phase by phase (default)

Spawn Phase 1 as a sub-agent. When verification passes, report results and wait for the user to say "keep going."

### Mode 2: Full orchestrated run

When the user says "run the full playbook" or "run all phases," spawn all six phases sequentially as sub-agents. Verify after each phase. Report a brief summary between phases. Every phase is still its own sub-agent — the full run is six spawns, not one.

## Iteration strategies

After Phase 6, ask if the user wants iterations. Read `references/iteration.md` for details. Four strategies in recommended order:

1. **gap** — Explore areas the baseline missed
2. **unfiltered** — Fresh-eyes re-review without structural constraints
3. **parity** — Compare parallel code paths
4. **adversarial** — Challenge prior dismissals, recover Type II errors

Each iteration runs Phases 1-6 as sub-agents, same as the baseline. Iterations typically add 40-60% more confirmed bugs.

"Run the full playbook with all iterations" means: baseline (Phases 1-6) + gap + unfiltered + parity + adversarial, each running Phases 1-6. Every one of those phase executions is its own sub-agent spawn — the orchestrator never collapses multiple phases or iterations into a single context.

## The six phases

1. **Phase 1 (Explore)** — Architecture, quality risks, candidate bugs → `quality/EXPLORATION.md`
2. **Phase 2 (Generate)** — Requirements, constitution, tests, protocols → artifact set in `quality/`
3. **Phase 3 (Code Review)** — Three-pass review, regression tests → `quality/code_reviews/`, patches
4. **Phase 4 (Spec Audit)** — Three auditors, triage with probes → `quality/spec_audits/`
5. **Phase 5 (Reconciliation)** — TDD red-green verification → `quality/BUGS.md`, TDD logs
6. **Phase 6 (Verify)** — 45 self-check benchmarks → final PROGRESS.md checkpoint

## Responding to user questions

- **"help"** — Explain the six phases and two modes. Mention documentation improves results.
- **"status" / "what happened"** — Read `quality/PROGRESS.md`, report what's done and what's next.
- **"keep going"** — Spawn the next phase as a sub-agent.
- **"run phase N"** — Spawn that specific phase (check prerequisites first).
- **"run iterations"** — Spawn the first iteration strategy as a sub-agent.
- **"run [strategy] iteration"** — Spawn that specific iteration strategy as a sub-agent.
@@ -0,0 +1,167 @@
---
|
||||
name: quality-playbook
|
||||
description: "Run a complete quality engineering audit on any codebase. Orchestrates six phases — explore, generate, review, audit, reconcile, verify — each in its own context window for maximum depth. Then runs iteration strategies to find even more bugs. Finds the 35% of real defects that structural code review alone cannot catch."
|
||||
tools:
|
||||
- search/codebase
|
||||
- web/fetch
|
||||
---
|
||||
|
||||
# Quality Playbook — Orchestrator Agent
|
||||
|
||||
## Your role
|
||||
|
||||
Your ONLY jobs are: (1) spawn sub-agents (or new contexts/chats — see tool-specific guidance below) to execute phases, (2) verify their output files exist on disk, (3) report progress to the user. You do NOT execute phase logic yourself. Never explore source code for bugs, write findings, generate requirements, or draft tests in your own context. If you find yourself doing any of that, you have violated your role.
|
||||
|
||||
## File-writing override
|
||||
|
||||
The user's invocation of this playbook is explicit authorization for you and every sub-agent you spawn to write .md files, patches, JSON, and any other artifacts to the `quality/` directory and the project root (`AGENTS.md`). This overrides any base harness rules discouraging documentation or .md file creation. No sub-agent should skip file writes citing harness constraints.
|
||||
|
||||
## Rationalization patterns to watch for
|
||||
|
||||
If you catch yourself producing text like any of these, stop — you're about to collapse into single-context execution:
|
||||
|
||||
- "per system constraint: no report .md files" (or any invented harness restriction)
|
||||
- "I'll do the analytical work in-context and summarize for the user"
|
||||
- "spawning a sub-agent is unnecessary overhead for this step"
|
||||
- "I can cover multiple phases in one pass"
|
||||
- "the artifacts are optional / can be described rather than written"
|
||||
|
||||
Any of these means you're about to replicate the casbin failure. Spawn the sub-agent instead.
|
||||
|
||||
## Read the protocol file before Phase 1
|
||||
|
||||
`references/orchestrator_protocol.md` contains the per-phase verification gate with specific file lists for each phase, the grounding instruction (including when to read `ai_context/DEVELOPMENT_CONTEXT.md`), and the error recovery procedure. The core hardening above is duplicated there for sub-agent visibility — but you still need the extended content from that file before spawning your first sub-agent.
|
||||
|
||||
## Setup: find the skill
|
||||
|
||||
Check that the quality playbook skill is installed. Look for SKILL.md in these locations, in order:
|
||||
|
||||
1. `SKILL.md` (source checkout / repo root)
|
||||
2. `.claude/skills/quality-playbook/SKILL.md` (Claude Code)
|
||||
3. `.github/skills/SKILL.md` (Copilot, flat layout)
|
||||
4. `.cursor/skills/quality-playbook/SKILL.md` (Cursor)
|
||||
5. `.continue/skills/quality-playbook/SKILL.md` (Continue)
|
||||
6. `.github/skills/quality-playbook/SKILL.md` (Copilot, nested layout)
|
||||
|
||||
Also check for a `references/` directory alongside SKILL.md. It should contain .md files (the full set includes iteration.md, review_protocols.md, spec_audit.md, verification.md, requirements_pipeline.md, exploration_patterns.md, defensive_patterns.md, schema_mapping.md, constitution.md, functional_tests.md, orchestrator_protocol.md, and others). Verify the directory exists and has at least 6 .md files.
|
||||
|
||||
**If the skill is not installed**, tell the user:
|
||||
|
||||
> The quality playbook skill isn't installed in this repository yet. Install it from the [quality-playbook repository](https://github.com/andrewstellman/quality-playbook):
|
||||
>
|
||||
> ```bash
|
||||
> # For Copilot
|
||||
> mkdir -p .github/skills/references .github/skills/phase_prompts
|
||||
> cp SKILL.md .github/skills/SKILL.md
|
||||
> cp .github/skills/quality_gate/quality_gate.py .github/skills/quality_gate.py
|
||||
> cp references/* .github/skills/references/
|
||||
> cp phase_prompts/*.md .github/skills/phase_prompts/
|
||||
>
|
||||
> # For Claude Code
|
||||
> mkdir -p .claude/skills/quality-playbook/references .claude/skills/quality-playbook/phase_prompts
|
||||
> cp SKILL.md .claude/skills/quality-playbook/SKILL.md
|
||||
> cp .github/skills/quality_gate/quality_gate.py .claude/skills/quality-playbook/quality_gate.py
|
||||
> cp references/* .claude/skills/quality-playbook/references/
|
||||
> cp phase_prompts/*.md .claude/skills/quality-playbook/phase_prompts/
|
||||
>
|
||||
> # v1.5.2: single reference_docs/ tree at the target repo root.
|
||||
> mkdir -p reference_docs reference_docs/cite
|
||||
> ```
|
||||
|
||||
Then stop and wait for the user to install it.
|
||||
|
||||
**If the skill is installed**, read SKILL.md and every file in the `references/` directory. Then follow the instructions below.
|
||||
|
||||
## Pre-flight checks
|
||||
|
||||
1. **Check for documentation.** Look for a `docs/`, `reference_docs/`, or `documentation/` directory. If none exists, give a prominent warning:
|
||||
|
||||
> **Documentation improves results significantly.** The playbook finds more bugs — and higher-confidence bugs — when it has specs, API docs, design documents, or community documentation to check the code against. Consider adding documentation to `reference_docs/` before running. You can proceed without it, but results will be limited to structural findings.
|
||||
|
||||
2. **Ask about scope.** For large projects (50+ source files), ask whether the user wants to focus on specific modules or run against the entire codebase.
|
||||
|
||||
## How to run
|
||||
|
||||
The playbook has two modes. Ask the user which they want, or infer from their prompt:
|
||||
|
||||
### Mode 1: Phase by phase (recommended for first run)
|
||||
|
||||
Start a fresh session or context for Phase 1. When it completes, show the end-of-phase summary and tell the user to say "keep going" or "run phase N" to continue. Each subsequent phase should also run in a **new session or context window** so it gets maximum depth.
|
||||
|
||||
This is the default if the user says "run the quality playbook."
|
||||
|
||||
### Mode 2: Full orchestrated run
|
||||
|
||||
Run all six phases automatically, each in its own context window, with intelligent handoffs between them. Use this when the user says "run the full playbook" or "run all phases."
|
||||
|
||||
**Orchestration protocol:**
|
||||
|
||||
For each phase (1 through 6):
|
||||
|
||||
1. **Start a new context.** Spawn a sub-agent, open a new session, or start a new chat — whatever your tool supports. The goal is a clean context window.
|
||||
2. **Pass the phase prompt.** Tell the new context:
|
||||
- Read SKILL.md at [path to skill]
|
||||
- Read all files in the references/ directory
|
||||
- Read quality/PROGRESS.md (if it exists) for context from prior phases
|
||||
- Execute Phase N
|
||||
3. **Wait for completion.** The phase is done when it writes its checkpoint to quality/PROGRESS.md.
|
||||
4. **Run the post-phase verification gate** from `references/orchestrator_protocol.md`. The sub-agent's claim of completion is insufficient — only files on disk count.
|
||||
5. **Report progress.** Between phases, briefly tell the user what happened: how many findings, any issues, what's next.
|
||||
6. **Continue to next phase.** Repeat from step 1.
|
||||
|
||||
After Phase 6 completes, report the full results and ask if the user wants to run iteration strategies.
|
||||
|
||||
**Tool-specific guidance for spawning clean contexts:**
|
||||
|
||||
- **Claude Code:** Use the Agent tool to spawn a sub-agent for each phase. Each sub-agent gets its own context window automatically.
|
||||
- **Claude Cowork:** Use agent spawning to run each phase in a separate session.
|
||||
- **GitHub Copilot:** Start a new chat for each phase. Include the phase prompt as your first message.
|
||||
- **Cursor:** Open a new Composer for each phase with the phase prompt.
|
||||
- **Windsurf / other tools:** Start a new conversation or chat for each phase.
|
||||
|
||||
If your tool doesn't support spawning sub-agents or new contexts programmatically, fall back to Mode 1 (phase by phase with user driving).
|
||||
|
||||
### Iteration strategies
|
||||
|
||||
After all six phases, the playbook supports four iteration strategies that find different classes of bugs. Each strategy re-explores the codebase with a different approach, then re-runs Phases 2-6 on the merged findings. Read `references/iteration.md` for full details.
|
||||
|
||||
The four strategies, in recommended order:
|
||||
|
||||
1. **gap** — Explore areas the baseline missed
|
||||
2. **unfiltered** — Fresh-eyes re-review without structural constraints
|
||||
3. **parity** — Compare parallel code paths (setup vs. teardown, encode vs. decode)
|
||||
4. **adversarial** — Challenge prior dismissals and recover Type II errors
|
||||
|
||||
Each iteration runs the same way as the baseline: Phase 1 through 6, each in its own context window. Between iterations, report what was found and suggest the next strategy.
|
||||
|
||||
Iterations typically add 40-60% more confirmed bugs on top of the baseline.
|
||||
|
||||
## The six phases
|
||||
|
||||
1. **Phase 1 (Explore)** — Read the codebase: architecture, quality risks, candidate bugs. Output: `quality/EXPLORATION.md`
|
||||
2. **Phase 2 (Generate)** — Produce quality artifacts: requirements, constitution, contracts, coverage matrix, completeness report, four review/execution protocols, functional test file. Output: nine files in `quality/` (REQUIREMENTS.md, QUALITY.md, CONTRACTS.md, COVERAGE_MATRIX.md, COMPLETENESS_REPORT.md, RUN_CODE_REVIEW.md, RUN_INTEGRATION_TESTS.md, RUN_SPEC_AUDIT.md, RUN_TDD_TESTS.md) plus a `quality/test_functional.<ext>` functional test file. **AGENTS.md is generated post-Phase-6 by the orchestrator, NOT by Phase 2** — writing AGENTS.md in Phase 2 trips the source-edit guardrail and aborts the run.
|
||||
3. **Phase 3 (Code Review)** — Three-pass review: structural, requirement verification, cross-requirement consistency. Regression tests for every confirmed bug. Output: `quality/code_reviews/`, patches
|
||||
4. **Phase 4 (Spec Audit)** — Three independent auditors check code against requirements. Triage with verification probes. Output: `quality/spec_audits/`, additional regression tests
|
||||
5. **Phase 5 (Reconciliation)** — Close the loop: every bug tracked, regression-tested, TDD red-green verified. Output: `quality/BUGS.md`, TDD logs, completeness report
|
||||
6. **Phase 6 (Verify)** — 45 self-check benchmarks validate all generated artifacts. Output: final PROGRESS.md checkpoint
|
||||
|
||||
Each phase has entry gates (prerequisites from prior phases) and exit gates (what must be true before the phase is considered complete). SKILL.md defines these gates precisely — follow them exactly.
|
||||
|
||||
## Responding to user questions

- **"help" / "how does this work"** — Explain the six phases and two run modes. Mention that documentation improves results. Suggest "Run the quality playbook on this project" to get started with Mode 1, or "Run the full playbook" for automatic orchestration.
- **"what happened" / "what's going on" / "status"** — Read `quality/PROGRESS.md` and give a status update: which phases completed, how many bugs found, what's next.
- **"keep going" / "continue" / "next"** — Run the next phase in sequence.
- **"run phase N"** — Run the specified phase (check prerequisites first).
- **"run iterations"** — Start the iteration cycle. Read `references/iteration.md` and run the gap strategy first.
- **"run [strategy] iteration"** — Run a specific iteration strategy.

## Example prompts

- "Run the quality playbook on this project" — Mode 1, starts Phase 1
- "Run the full playbook" — Mode 2, orchestrates all six phases
- "Run the full playbook with all iterations" — Mode 2 + all four iteration strategies
- "Keep going" — Continue to next phase
- "What happened?" — Status check
- "Run the adversarial iteration" — Specific iteration strategy
- "Help" — Explain how it works

@@ -0,0 +1,47 @@
# phase_prompts/

Externalized phase prompt bodies for the Quality Playbook.

v1.5.4 F-1 (Bootstrap_Findings 2026-04-30) extracted these from `bin/run_playbook.py`'s inline string templates so both execution modes — UI-context skill-direct (a coding agent walking through SKILL.md inline) and CLI-automation runner-driven (`python -m bin.run_playbook`) — read from the same single source of truth. Without externalization the two modes drift; with it, an edit to a phase prompt lands once and benefits both.

## File layout

- `phase1.md` ... `phase6.md` — one file per pipeline phase. Loaded by `bin/run_playbook.py::_load_phase_prompt` (sketched after this list).
- `single_pass.md` — the legacy single-prompt invocation (used when the operator wants the LLM to drive all six phases inline rather than via the per-phase orchestrator).
- `iteration.md` — the iteration-strategy prompt (gap, unfiltered, parity, adversarial — see `bin/run_playbook.py::next_strategy`).
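
For orientation, a minimal sketch of what a loader shaped like `_load_phase_prompt` could look like — the real implementation in `bin/run_playbook.py` may differ in signature and error handling, and the `PROMPT_DIR` layout below is an assumption:

```python
# Sketch only: a phase-prompt loader in the spirit of
# bin/run_playbook.py::_load_phase_prompt. Layout and names are assumed.
from pathlib import Path

PROMPT_DIR = Path(__file__).parent / "phase_prompts"  # assumed location

def _load_phase_prompt(phase: int) -> str:
    """Return the raw prompt body for a pipeline phase (1-6)."""
    if not 1 <= phase <= 6:
        raise ValueError(f"phase must be 1-6, got {phase}")
    return (PROMPT_DIR / f"phase{phase}.md").read_text(encoding="utf-8")
```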

## Substitution conventions

Most files are pure-literal markdown — the loader returns them unchanged. Three files use `str.format()` substitution with named placeholders:

- `phase1.md` — `{seed_instruction}` (skip Phase 0/0b prelude when `--no-seeds`) and `{role_taxonomy}` (rendered from `bin.role_map.ROLE_DESCRIPTIONS`).
- `single_pass.md` — `{skill_fallback_guide}` and `{seed_instruction}`.
- `iteration.md` — `{skill_fallback_guide}` and `{strategy}`.

Inside files that go through `.format()`, JSON braces and other literal `{` / `}` characters MUST be doubled (`{{` / `}}`) per Python's format-string escaping rules. Pure-literal files do not need any escaping.
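
A quick illustration of the doubling rule (plain Python, no playbook code involved):

```python
# A template destined for str.format() must escape literal JSON braces,
# or format() will treat them as placeholders and raise KeyError.
template = 'Emit {{"strategy": "{strategy}"}} as the first line.'
print(template.format(strategy="gap"))
# -> Emit {"strategy": "gap"} as the first line.
```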

## Editing discipline

When you change a phase prompt, the loader picks up the new content at the next invocation — there is no caching layer to invalidate. The test suite at `bin/tests/test_phase_prompts_externalized.py` pins the loader's contract; if you add a new substitution variable, extend those tests.

@@ -0,0 +1 @@
{skill_fallback_guide} Run the next iteration using the {strategy} strategy. Any updates to quality/PROGRESS.md must keep the existing phase tracker in checkbox format (`- [x] Phase N - <name>`) — do not rewrite it as a table. The orchestrator appends `## Iteration: <strategy> started/complete` sections itself; iteration work should not touch the existing phase tracker lines.

@@ -0,0 +1,229 @@
You are a quality engineer. {skill_fallback_guide} For this phase read ONLY the sections up through Phase 1 (stop at the "---" line before "Phase 2"). Also read the reference files (under whichever references/ directory matches the install path you resolved) that are relevant to exploration.

{seed_instruction}

Execute Phase 1: Explore the codebase. The reference_docs/ directory contains gathered documentation - read it to supplement your exploration. Top-level files are Tier 4 context (AI chats, design notes, retrospectives). Files under reference_docs/cite/ are citable sources (project specs, RFCs). If reference_docs/ is missing or empty, proceed with Tier 3 evidence (source tree) alone and note this in EXPLORATION.md.

### MANDATORY FILE-ROLE TAGGING (v1.5.4 Part 1)

Before (or as part of) writing EXPLORATION.md, produce `quality/exploration_role_map.json`. Begin by reading `SKILL.md` at the repository root if present (also check for any other top-level skill-shaped entry file — the indicator is content + name, not extension; a `README.md` is NOT a skill-shaped entry just because it sits at the root). The prose context informs every subsequent file's role tag.

**File source (v1.5.4 Phase 3.6.1, codex-prevention).** Use `git ls-files` as the canonical file list when the target is a git repo — this respects `.gitignore` automatically and is the ONLY supported enumeration source. Do NOT use `os.walk`, `find`, `os.listdir`, or any recursive directory walker — those will pull in `.git/`, `.venv/`, `node_modules/`, build outputs, and vendored dependencies, all of which are FORBIDDEN in the role map (the validator rejects them and aborts the run). When the target is not a git repo, use a filesystem walk that explicitly skips the disallowed paths listed below; record this fallback in the role map's `provenance` field.
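
A minimal sketch of the two enumeration paths, assuming the skip list mirrors the disallowed-paths list below; the canonical validator lives in `bin/role_map.py`, and the `.egg-info` / `.dist-info` checks here are simplified approximations:

```python
# Sketch of the enumeration contract above (illustrative only).
import subprocess
from pathlib import Path

SKIP_PREFIXES = (".git/", ".venv/", "venv/", "node_modules/", "__pycache__/",
                 ".pytest_cache/", ".mypy_cache/", ".ruff_cache/", ".tox/")

def enumerate_files(root: Path) -> tuple[list[str], str]:
    """Return (paths, provenance) per the rules above."""
    if (root / ".git").exists():
        out = subprocess.run(["git", "ls-files"], cwd=root, check=True,
                             capture_output=True, text=True).stdout
        return out.splitlines(), "git-ls-files"
    paths = [str(p.relative_to(root)).replace("\\", "/")
             for p in root.rglob("*") if p.is_file()]
    kept = [p for p in paths
            if not p.startswith(SKIP_PREFIXES)
            and ".egg-info" not in p and ".dist-info" not in p]
    return kept, "filesystem-walk-with-skips"
```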

**Disallowed paths (MUST NOT appear in the role map under any role):** `.git/`, `.venv/`, `venv/`, `node_modules/`, `__pycache__/`, `.pytest_cache/`, `.mypy_cache/`, `.ruff_cache/`, `.tox/`, plus any path with a component ending in `.egg-info` or `.dist-info`. The validator at `bin/role_map.py::DISALLOWED_PATH_PREFIXES` enforces this — if your role map contains any such path, the run aborts. There is also a hard ceiling of 2000 entries; a role map with more is treated as evidence Phase 1 walked .gitignored content.

**Provenance (v1.5.4 Phase 3.6.1).** The role map's top-level `provenance` field MUST be one of:
- `"git-ls-files"` — preferred. Target is a git repo; you ran `git ls-files` to enumerate.
- `"filesystem-walk-with-skips"` — fallback. Target is not a git repo; you walked the filesystem with explicit skips for every entry in the disallowed-paths list above.
- `"unknown"` — accepted only on legacy role maps; do NOT emit this for fresh runs.

For each in-scope file, emit a record with the role taxonomy below. The judgment is content-based: read the file (or enough of it to judge), do NOT pattern-match on extension or directory name alone.

**Sentinel files (v1.5.4 Phase 3.6.1).** Files named `.gitkeep` (or similar empty-directory markers) in the repository's tracked tree MUST NOT be deleted. They keep otherwise-empty directories present in git history. If you find such a file and don't understand its purpose, leave it alone. The pre-flight check verifies all `.gitignore !`-rule sentinels are present and aborts the run if any are missing.
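
The pre-flight sentinel check could be sketched like this — a simplification that only handles literal `!path` negation rules, not full gitignore glob semantics:

```python
# Sketch: every path re-included by a .gitignore negation rule ("!path")
# should still exist on disk. Glob patterns are not handled here.
from pathlib import Path

def missing_sentinels(root: Path) -> list[str]:
    gitignore = root / ".gitignore"
    if not gitignore.exists():
        return []
    negated = [line[1:].strip().rstrip("/")
               for line in gitignore.read_text().splitlines()
               if line.startswith("!")]
    return [p for p in negated if not (root / p).exists()]
```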

**If you encounter a bug in QPB itself during this run** (e.g., an exception from `bin/run_playbook.py`, a missing import, a broken assertion in QPB source), STOP the run immediately and report:
1. The exact error and where it occurred (file:line + traceback)
2. A diagnosis of the likely root cause
3. A proposed fix shape (do NOT apply it)

Do NOT patch QPB source code yourself. QPB source changes go through Council review (see `~/Documents/AI-Driven Development/CLAUDE.md`). A structural backstop captures the QPB source tree's git SHA at run start and verifies it unchanged at every phase boundary; an autonomous source patch will fail the gate with a diagnostic naming the modified files.

Role taxonomy (single source of truth: `bin/role_map.py::ROLE_DESCRIPTIONS`):
{role_taxonomy}

If a file genuinely doesn't fit any of these, you may add a new role — but document the addition in your role map's first entry as a comment-style rationale.

The output file `quality/exploration_role_map.json` MUST conform to this schema:

```
{{
  "schema_version": "1.0",
  "timestamp_start": "<ISO 8601 UTC timestamp at the start of Phase 1>",
  "provenance": "git-ls-files",
  "files": [
    {{
      "path": "<repo-relative POSIX path>",
      "role": "<one of the role taxonomy values>",
      "size_bytes": <int>,
      "rationale": "<one or two sentences justifying the tag, content-based>"
    }}
    // ... one entry per in-scope file. When role == "skill-tool", also
    // include a "skill_prose_reference" string pointing at the SKILL.md /
    // reference-file location that names this script (e.g., "SKILL.md:47"
    // or "references/forms.md:section-3"); the prose-to-code divergence
    // check in Phase 4 reads this back to find the cited prose.
  ]
}}
```

**You only produce `files[]` and `provenance`.** The two mechanically-derivable fields — `breakdown` and `summary` — are computed by the runner between Phase 1 LLM exit and the Phase 2 entry-gate (v1.5.6 cluster 047 architectural fix). The runner calls `bin.role_map.compute_breakdown(files)` and `bin.role_map.summarize_role_map(...)` and writes the canonical values into the on-disk file before validation. Don't include `breakdown` or `summary` in your output — even if you do, the runner will overwrite them. Your job is the analytical work (per-file role tagging in `files[]` plus `provenance`); the deterministic aggregations are runner-owned. (Pre-v1.5.6 the LLM was instructed to compute these too, which produced a class of failures where the LLM reverted to intuitive summarization that drifted from the strict mechanical contract; runner-side computation removes the failure mode.)
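
As a rough sketch of the runner-owned aggregation (the canonical functions are `bin.role_map.compute_breakdown` and `summarize_role_map`; the exact output shape here is an assumption):

```python
# Sketch: deterministic per-role aggregation over the LLM-produced
# files[] array. The real field layout may differ.
from collections import Counter

def compute_breakdown(files: list[dict]) -> dict:
    """Count files and total bytes per role."""
    counts = Counter(f["role"] for f in files)
    sizes: Counter = Counter()
    for f in files:
        sizes[f["role"]] += f["size_bytes"]
    return {role: {"files": counts[role], "bytes": sizes[role]}
            for role in sorted(counts)}
```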

Tagging discipline:
1. The distinction between `skill-tool` and `code` is the load-bearing one. A script is only `skill-tool` if SKILL.md (or a doc SKILL.md cites) explicitly names it and tells the agent to invoke it. Independent code modules — even small ones in a `scripts/` directory — are `code` if no SKILL.md prose directs the agent to use them.
2. Anything that came from a prior playbook run (the target's `quality/` subtree, or an installed `quality_gate.py` from QPB itself — the file the installer copies next to SKILL.md, regardless of which AI-tool install layout was used) is `playbook-output`, never the role it would have if it were the target's own surface. This prevents the v1.5.3 LOC-pollution failure mode where a target's apparent code surface was inflated by QPB's own infrastructure.
3. If SKILL.md is absent at the root and no other skill-shaped entry file exists, the role map will have zero `skill-prose` entries. That's fine — the four-pass derivation pipeline will no-op for this target.

Handling edge cases (v1.5.4 Phase 1 edge-case discipline):
- **No SKILL.md at root, no other skill-shaped entry.** Tag every file by content as usual. The role map will carry zero `skill-prose` and `skill-reference` entries; the four-pass pipeline will no-op. Do NOT invent a synthetic SKILL.md or label something `skill-prose` for a project that genuinely has no skill surface.
- **SKILL.md references a script that does not exist.** Add a top-level `broken_references` array to the role map carrying `{{"prose_location": "<file>:<line>", "missing_script": "<path-as-cited>"}}` entries. Do NOT add a synthetic file entry for the missing script. Note the broken reference in EXPLORATION.md so Phase 4's prose-to-code divergence check can register it as a known gap. (This field is additive; the gate's role-map validator does not require it.)
- **Target with a very large file count (1000+).** Process in batches. The `files` array can grow incrementally as you walk the tree; once you've made all per-file judgments, write the file once. Do not write a partial role map mid-walk — the validator considers the file complete when it appears, and the runner-side `normalize_role_map_for_gate` step (v1.5.6 cluster 047) computes `breakdown` and `summary` after you exit Phase 1.
- **Ambiguous prose ("the helper script", "the validator").** Default to `code`. `skill-tool` requires an unambiguous citation: SKILL.md or a referenced doc must name the file (or a path-suffix that uniquely identifies it) AND direct the agent to invoke it. When in doubt, tag `code` and capture the ambiguity in `rationale` — it's better to under-tag `skill-tool` than to inflate the surface area Phase 4's prose-to-code check operates on.
- **Generated files (build outputs, vendored dependencies, lockfiles).** Skip them at the ignore-rule layer; do not include them in the role map. If you can't tell whether a file is generated, look for a generation marker (header comment naming the generator, sibling `.generated` file, presence in `.gitignore`); if generated, omit from the role map.

When Phase 1 is complete, write your full exploration findings to `quality/EXPLORATION.md`. The file MUST contain ALL of the following section titles VERBATIM (the Phase 1 gate at SKILL.md:1257-1273 enforces each mechanically; `bin/run_state_lib.validate_phase_artifacts(quality_dir, phase=1)` is the programmatic enforcer — your artifact has to pass it before Phase 2 will start). The exact titles are load-bearing — do NOT substitute "equivalent" headings:

1. `## Open Exploration Findings` — at least 8 numbered entries (`1.`, `2.`, ...). Each entry has at least one file:line citation in the body (e.g., `bin/foo.py:120-135`). At least 3 of these entries trace behavior across 2 or more distinct file:line locations (multi-location traces — the entry cites two or more different file:line ranges).

2. `## Quality Risks` — domain-knowledge risk analysis. Numbered or bulleted; cite file:line where risks are concretely visible in code or docs.

3. `## Pattern Applicability Matrix` — a Markdown table with one row per exploration pattern from `references/exploration_patterns.md`. Decision column values are `FULL` or `SKIP`. Between 3 and 4 patterns must be marked `FULL` (inclusive — the gate rejects below 3 because exploration didn't pick enough patterns, and above 4 because exploration ran every pattern instead of selecting). Skipped patterns are still listed with `SKIP` and a brief reason, so the matrix is exhaustive.

4. `## Pattern Deep Dive — <pattern-name>` — at least 3 sections, one per `FULL` pattern. Each deep dive enumerates concrete findings with file:line citations. At least 2 of these sections trace code paths across 2 or more distinct identifiers (e.g., backtick-quoted function or symbol names like `\`docs_present\``, `\`_evaluate_documentation_state\``) OR across 2 or more distinct file:line locations — that's how the gate detects "multi-function trace" rather than a one-anchor finding.

5. `## Candidate Bugs for Phase 2` — numbered list of bug hypotheses promoted from the deep dives + open exploration. Each entry has a `Stage:` line attributing the source (e.g., `Stage: open exploration`, `Stage: quality risks`, or `Stage: <Pattern Name>`). At least 2 entries must be sourced from `open exploration` / `quality risks` AND at least 1 entry must be sourced from a pattern deep dive. Combo stages (`Stage: open exploration + Cross-Implementation Consistency`) count toward both buckets.

6. `## Gate Self-Check` — proves you ran the Phase 1 gate. List each of the 13 checks (≥120 lines + six required headings + ≥3 Pattern Deep Dive sections + PROGRESS.md mark + ≥8 findings with citations + ≥3 multi-location findings + 3-4 FULL pattern matrix rows + ≥2 multi-function deep dives + candidate-bug source mix) and mark whether the artifact satisfies each.

In addition, ensure `quality/PROGRESS.md` exists and its Phase 1 line is marked `[x]` (the gate's check 8) before declaring Phase 1 complete.
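
A sketch of the mechanical heading portion of that gate — the real enforcer is `bin/run_state_lib.validate_phase_artifacts`, which checks far more than headings:

```python
# Sketch: the verbatim-title subset of the Phase 1 gate. Illustrative only.
REQUIRED_TITLES = [
    "## Open Exploration Findings",
    "## Quality Risks",
    "## Pattern Applicability Matrix",
    "## Candidate Bugs for Phase 2",
    "## Gate Self-Check",
]

def missing_titles(exploration_md: str) -> list[str]:
    lines = exploration_md.splitlines()
    missing = [t for t in REQUIRED_TITLES if t not in lines]
    deep_dives = sum(1 for l in lines
                     if l.startswith("## Pattern Deep Dive — "))
    if deep_dives < 3:
        missing.append("## Pattern Deep Dive — <pattern-name> (need >= 3)")
    return missing
```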

The exploration content the prior versions of this prompt asked for (domain and stack identification, architecture map, existing test inventory, specification summary, skeleton/dispatch analysis, derived requirements `REQ-NNN`, derived use cases `UC-NN`, file-role tagging summary) lives WITHIN these required sections — for example, the architecture map and module enumeration belong under `## Open Exploration Findings` as multi-location findings; the file-role tagging summary and the `exploration_role_map.json` breakdown summary belong under `## Open Exploration Findings` or `## Quality Risks` as analytical content; derived REQ-NNN and UC-NN sections may appear after `## Gate Self-Check` as additional analytical material the playbook's downstream phases consume. Do NOT use these alternative names as TOP-level section titles — the gate requires the six exact titles above and the Pattern Deep Dive prefix; additional `## ` sections beyond these are tolerated for analytical extension but the six gate-required titles MUST appear verbatim.

### MANDATORY CARTESIAN UC RULE (Lever 1, v1.5.2)

For every requirement with a `References` field naming ≥2 files (or ≥2 file:line ranges in distinct files), apply the **Cartesian eligibility check** before deciding whether to emit a single umbrella UC or per-site UCs:

**Gate 1 — Path-suffix match.** At least two references must share a path-suffix role: the last segment before the extension, or a matching function-name pattern that appears across the files.
- Example of a match: `virtio_mmio.c`, `virtio_vdpa.c`, `virtio_pci_modern.c` all implement `_finalize_features`. The `_finalize_features` function is the shared role.
- Example of a non-match: `CONFIG_FOO`, `CONFIG_BAR` flags in the same kconfig file — same kind of thing, but not parallel implementations.

**Gate 2 — Function-level similarity.** Each matching reference must cite a line range of similar size (within 2× of the median) and each range must be inside a function body — not a file-header, a kconfig block, or a macro expansion list.
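
The 2×-median size check might look like the following sketch; reading "within 2×" as the band [median/2, 2·median] is one interpretation, and the function-body requirement is left out:

```python
# Sketch of Gate 2's size-similarity test. Interpretation assumed as above.
from statistics import median

def sizes_similar(ranges: list[tuple[int, int]]) -> bool:
    """ranges are (start_line, end_line) citations, one per reference."""
    spans = [end - start + 1 for start, end in ranges]
    med = median(spans)
    return all(med / 2 <= s <= med * 2 for s in spans)
```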

**Decision:**
- **Both gates pass →** emit one UC per site, numbered `UC-N.a`, `UC-N.b`, `UC-N.c`, … Each per-site UC has its own Actors, Preconditions, Flow, Postconditions. The parent REQ-N remains as the umbrella.
- **Only Gate 1 passes →** keep a single umbrella UC and mark the reference cluster `heterogeneous` in a `<!-- cluster: heterogeneous -->` HTML comment in the UC body. Phase 3 can still override if it finds per-site divergence.
- **Neither gate passes →** single umbrella UC, no special marking.

### Worked example — REQ-010 / VIRTIO_F_RING_RESET (virtio)

Suppose Phase 1 derives:

### REQ-010: Virtio transports must honor VIRTIO_F_RING_RESET negotiation
- References: drivers/virtio/virtio_mmio.c, drivers/virtio/virtio_vdpa.c, drivers/virtio/virtio_pci_modern.c
- Pattern: whitelist

Applying the Cartesian check:
- Gate 1: all three files contain `_finalize_features` functions — matches.
- Gate 2: each cited range is inside a function body of similar size — matches.

Both gates pass → emit per-site UCs:

### UC-10.a: VIRTIO_F_RING_RESET on PCI modern transport
- Actors: virtio_pci_modern driver, guest kernel
- Preconditions: device advertises VIRTIO_F_RING_RESET
- Flow: vp_modern_finalize_features propagates bit through config space …
- Postconditions: feature_bit reflected in final config

### UC-10.b: VIRTIO_F_RING_RESET on MMIO transport
- Actors: virtio_mmio driver, guest kernel
- Preconditions: device advertises VIRTIO_F_RING_RESET
- Flow: vm_finalize_features must mirror PCI modern behavior …
- Postconditions: feature_bit survives finalize call

### UC-10.c: VIRTIO_F_RING_RESET on vDPA transport
- Actors: virtio_vdpa driver, vdpa device backend
- Preconditions: device advertises VIRTIO_F_RING_RESET
- Flow: virtio_vdpa_finalize_features forwards through set_driver_features …
- Postconditions: feature_bit visible to vdpa backend

### CONFIRMATION CHECKLIST (Cartesian UC rule)

Before completing Phase 1, confirm each item explicitly in EXPLORATION.md under a section titled "Cartesian UC rule confirmation":

1. For every REQ with ≥2 References, I ran Gate 1 (path-suffix match).
2. For every REQ that passed Gate 1, I ran Gate 2 (function-level similarity).
3. Where both gates passed, I emitted per-site UCs (UC-N.a, UC-N.b, …).
4. Where only Gate 1 passed, I marked the cluster `<!-- cluster: heterogeneous -->`.
5. Where neither gate passed, I kept a single umbrella UC without marking.
6. For each REQ with a pattern match in Gate 1, I added `Pattern: whitelist|parity|compensation` to the REQ block.

Also initialize quality/PROGRESS.md with the run metadata and the phase tracker in the EXACT checkbox format below. This format is a hard contract: the Phase 5 gate checks for the substring `- [x] Phase 4` before allowing reconciliation to start, and it only matches the checkbox form. Do NOT substitute a Markdown table, bulleted prose, or any other layout — table-format runs have aborted mid-pipeline because the gate does not see "Complete" in a table cell as equivalent.

Template for the phase tracker section of PROGRESS.md (fill in the Skill version from SKILL.md metadata):

```
# Quality Playbook Progress

Skill version: <vX.Y.Z>
Date: <YYYY-MM-DD>

## Phase tracker

- [x] Phase 1 - Explore
- [ ] Phase 2 - Generate
- [ ] Phase 3 - Code Review
- [ ] Phase 4 - Spec Audit
- [ ] Phase 5 - Reconciliation
- [ ] Phase 6 - Verify
```

As each later phase completes it will flip its own `- [ ]` to `- [x]` — keep the line text (including the phase name after the dash) stable so substring matching in the Phase 5 gate and downstream tooling works.
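
The substring contract reduces to a sketch like this (a hypothetical helper, not part of the playbook's tooling):

```python
# Sketch: flip one phase checkbox while keeping the line text byte-stable
# so downstream substring matches (e.g., "- [x] Phase 4") keep working.
from pathlib import Path

def mark_phase_complete(progress: Path, phase: int, name: str) -> None:
    text = progress.read_text(encoding="utf-8")
    todo = f"- [ ] Phase {phase} - {name}"
    done = f"- [x] Phase {phase} - {name}"
    if todo not in text and done not in text:
        raise RuntimeError(f"phase tracker line not found: {todo!r}")
    progress.write_text(text.replace(todo, done, 1), encoding="utf-8")

# e.g. mark_phase_complete(Path("quality/PROGRESS.md"), 4, "Spec Audit")
```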

IMPORTANT: Do NOT proceed to Phase 2. Your only job is exploration and writing findings to disk. Write thorough, detailed findings - the next phase will read EXPLORATION.md to generate artifacts, so everything important must be captured in that file.

@@ -0,0 +1,27 @@
{skill_fallback_guide}

You are a quality engineer continuing a phase-by-phase quality playbook run. Phase 1 (exploration) is already complete.

Read these files to get context:
1. quality/EXPLORATION.md - your Phase 1 findings (requirements, risks, architecture)
2. quality/PROGRESS.md - run metadata and phase status
3. SKILL.md - read the Phase 2 section (from "Phase 2: Generate the Quality Playbook" through the "Checkpoint: Update PROGRESS.md after artifact generation" section). Also read the reference files cited in that section. Resolve SKILL.md and reference files via the documented fallback list above; do NOT assume any single install layout (`.github/skills/`, `.claude/skills/quality-playbook/`, `.cursor/skills/quality-playbook/`, `.continue/skills/quality-playbook/`, or root).

**Field preservation rule (v1.5.2, Lever 2).** When transcribing REQ hypotheses from EXPLORATION.md into `quality/REQUIREMENTS.md` and `quality/requirements_manifest.json`, every `- Pattern: <value>` field present on the source hypothesis MUST appear on the corresponding REQ in both output files. Pattern values are `whitelist | parity | compensation`. Phase 1's Cartesian UC rule (confirmation checklist item 6) requires Pattern tagging for every REQ where both UC gates match; Phase 2 must not silently drop these tags. If a hypothesis lacks Pattern but you believe it should have one (per-site UCs emitted with `UC-N.a`/`UC-N.b` suffixes, multi-file `References` suggesting a parallel structure), add Pattern during Phase 2 — do not omit the field. The Phase 5 cardinality gate cannot enforce coverage on a REQ it doesn't know is pattern-tagged; silent omission is a documented v1.4.5-regression vector.

Execute Phase 2: Generate all quality artifacts. Use the exploration findings in EXPLORATION.md as your source - do not re-explore the codebase from scratch. Generate:
- quality/QUALITY.md (quality constitution)
- quality/CONTRACTS.md (behavioral contracts)
- quality/REQUIREMENTS.md (with REQ-NNN and UC-NN identifiers from EXPLORATION.md)
- quality/COVERAGE_MATRIX.md
- Functional tests (quality/test_functional.*)
- quality/RUN_CODE_REVIEW.md (code review protocol)
- quality/RUN_INTEGRATION_TESTS.md (integration test protocol)
- quality/RUN_SPEC_AUDIT.md (spec audit protocol)
- quality/RUN_TDD_TESTS.md (TDD verification protocol)
- quality/COMPLETENESS_REPORT.md (baseline, without verdict)
- If dispatch/enumeration contracts exist: quality/mechanical/ with verify.sh and extraction artifacts. Run verify.sh immediately and save receipts.

Update PROGRESS.md: mark Phase 2 complete (use the checkbox format `- [x] Phase 2 - Generate` — do NOT switch to a table), update artifact inventory.

IMPORTANT: Do NOT proceed to Phase 3 (code review). Your job is artifact generation only. The next phase will execute the review protocols you generated.

@@ -0,0 +1,154 @@
{skill_fallback_guide}

You are a quality engineer continuing a phase-by-phase quality playbook run. Phases 1-2 are complete.

Read these files to get context:
1. quality/PROGRESS.md - run metadata, phase status, artifact inventory
2. quality/EXPLORATION.md - Phase 1 findings (especially the "Candidate Bugs for Phase 2" section)
3. quality/REQUIREMENTS.md - derived requirements and use cases
4. quality/CONTRACTS.md - behavioral contracts
5. SKILL.md - read the Phase 3 section ("Phase 3: Code Review and Regression Tests"). Also read references/review_protocols.md. Resolve SKILL.md and the references/ directory via the documented fallback list above; do NOT assume any single install layout.

Execute Phase 3: Code Review + Regression Tests.
Run the 3-pass code review per quality/RUN_CODE_REVIEW.md. For every confirmed bug:
- Add to quality/BUGS.md with ### BUG-NNN heading format
- Write a regression test (xfail-marked)
- Generate quality/patches/BUG-NNN-regression-test.patch (MANDATORY for every confirmed bug)
- Generate quality/patches/BUG-NNN-fix.patch (strongly encouraged)
- Write code review reports to quality/code_reviews/
- Update PROGRESS.md BUG tracker

### MANDATORY GRID STEP (Lever 2, v1.5.2) — pattern-tagged REQs only

For every REQ in quality/REQUIREMENTS.md that has a `Pattern:` field (`whitelist`, `parity`, or `compensation`), you MUST produce a compensation grid BEFORE writing any BUG entries for that REQ.

**Step 1. Enumerate the authoritative item set.** Mechanical extraction from source — uapi header, spec section, documented constants. Do NOT invent. Example: for VIRTIO_F_RING_RESET-family, grep `include/uapi/linux/virtio_config.h` for `VIRTIO_F_*` and list the bits the REQ covers.

**Step 2. Enumerate the sites.** From the REQ's per-site UCs (UC-N.a, UC-N.b, …). If the REQ has a single umbrella UC but is pattern-tagged, the grid is 1-dimensional over items.

**Step 3. Produce the grid.** Write `quality/compensation_grid.json` with one entry per REQ:

```json
{
  "schema_version": "1.5.2",
  "reqs": {
    "REQ-010": {
      "pattern": "whitelist",
      "items": ["RING_RESET", "ADMIN_VQ", "NOTIF_CONFIG_DATA", "SR_IOV"],
      "sites": ["PCI", "MMIO", "vDPA"],
      "cells": [
        {"cell_id": "REQ-010/cell-RING_RESET-PCI", "item": "RING_RESET", "site": "PCI", "present": true, "evidence": "drivers/virtio/virtio_pci_modern.c:XXX-YYY"},
        {"cell_id": "REQ-010/cell-RING_RESET-MMIO", "item": "RING_RESET", "site": "MMIO", "present": false, "evidence": "drivers/virtio/virtio_mmio.c: no match for RING_RESET"}
      ]
    }
  }
}
```

Cell IDs are mechanical: `REQ-<N>/cell-<item>-<site>`. No whitespace, uppercase item/site identifiers where natural.

**Step 4. Apply the BUG-default rule.** For every cell where:
- the item is defined in authoritative source AND
- the item is absent from any shared filter AND
- the item is absent from the site's compensation path

→ the cell DEFAULTS to BUG. Emit one `### BUG-NNN` entry with the cell's file:line citation, spec basis, and expected-vs-actual behavior. Include a `- Covers: [REQ-N/cell-<item>-<site>]` line (see schemas.md §8 for the field contract).
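
Expressed as a predicate over one grid cell (field names follow the JSON sketch above; the three set arguments are assumptions about how the shared filter and compensation paths get enumerated):

```python
# Sketch of the BUG-default rule for a single cell. Illustrative only.
def defaults_to_bug(cell: dict,
                    authoritative_items: set[str],
                    shared_filter: set[str],
                    site_compensation: set[str]) -> bool:
    item = cell["item"]
    return (item in authoritative_items
            and item not in shared_filter
            and item not in site_compensation
            and not cell["present"])
```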

**Step 5. Downgrade to QUESTION requires a structured JSON record.** Append one record per downgraded cell to `quality/compensation_grid_downgrades.json`:

```json
{
  "schema_version": "1.5.2",
  "downgrades": [
    {
      "cell_id": "REQ-010/cell-RING_RESET-MMIO",
      "authority_ref": "include/uapi/linux/virtio_config.h:116",
      "site_citation": "drivers/virtio/virtio_mmio.c:109-131",
      "reason_class": "intentionally-partial",
      "falsifiable_claim": "MMIO does not support RING_RESET because the MMIO transport predates the feature bit and kernel docs at Documentation/virtio/virtio_mmio.rst:42-55 state the transport is frozen at its v1.0 feature set; falsifiable by showing MMIO re-sets bit 40 under any kernel release."
    }
  ]
}
```

- `reason_class` enum: `out-of-scope | deprecated | platform-gated | handled-upstream | intentionally-partial`.
- `authority_ref`, `site_citation`, `falsifiable_claim` are required and non-empty.
- `falsifiable_claim` must state an observable condition that would make the claim wrong.
- Missing any required field, or `reason_class` outside the enum, or zero-length `falsifiable_claim` → cell REVERTS to BUG at Phase 5 gate time. There is no re-prompt loop.

**Step 6. Self-check.** Before finalizing BUGS.md for this REQ, verify that every cell in the grid appears in either:
- some BUG's `- Covers: [...]` list, OR
- a downgrade record in `quality/compensation_grid_downgrades.json`.

Any cell missing from both will fail the Phase 5 cardinality gate. This self-check is advisory in Phase 3; the blocking gate runs in Phase 5.
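
A sketch of that union check over the files named in Steps 3-5 (illustrative; the blocking implementation is Phase 5's cardinality gate in `quality_gate.py`):

```python
# Sketch: every absent grid cell must be claimed by a BUG's Covers list
# or by a downgrade record.
import json
import re
from pathlib import Path

def uncovered_cells(quality: Path) -> set[str]:
    grid = json.loads((quality / "compensation_grid.json").read_text())
    absent = {c["cell_id"] for req in grid["reqs"].values()
              for c in req["cells"] if not c["present"]}
    covered: set[str] = set()
    bugs = (quality / "BUGS.md").read_text()
    for m in re.finditer(r"- Covers: \[([^\]]+)\]", bugs):
        covered.update(x.strip() for x in m.group(1).split(","))
    dg_path = quality / "compensation_grid_downgrades.json"
    if dg_path.exists():
        dg = json.loads(dg_path.read_text())
        covered.update(d["cell_id"] for d in dg["downgrades"])
    return absent - covered
```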

### Worked example — RING_RESET grid (virtio)

REQ-010 pattern: whitelist. Items: {RING_RESET, ADMIN_VQ, NOTIF_CONFIG_DATA, SR_IOV}. Sites: {PCI, MMIO, vDPA}. Grid: 4 × 3 = 12 cells.

Code inspection reveals PCI implements all four; MMIO implements none of the four (frozen at v1.0 feature set); vDPA implements NOTIF_CONFIG_DATA but not the other three.

Grid (present=T, absent=F):

|                   | PCI | MMIO | vDPA |
|-------------------|-----|------|------|
| RING_RESET        | T   | F    | F    |
| ADMIN_VQ          | T   | F    | F    |
| NOTIF_CONFIG_DATA | T   | F    | T    |
| SR_IOV            | T   | F    | F    |

BUG-default applies to every F cell (7 total). Possible consolidation:

### BUG-001: MMIO ignores VIRTIO_F_RING_RESET
- Primary requirement: REQ-010
- Covers: [REQ-010/cell-RING_RESET-MMIO]

### BUG-002: vDPA ignores VIRTIO_F_RING_RESET
- Primary requirement: REQ-010
- Covers: [REQ-010/cell-RING_RESET-vDPA]

### BUG-003: vDPA missing ADMIN_VQ hookup
- Primary requirement: REQ-010
- Covers: [REQ-010/cell-ADMIN_VQ-vDPA]

### BUG-004: MMIO ignores NOTIF_CONFIG_DATA negotiation (common filter gap)
- Primary requirement: REQ-010
- Covers: [REQ-010/cell-NOTIF_CONFIG_DATA-MMIO]

### BUG-005: MMIO + vDPA both miss SR_IOV propagation
- Primary requirement: REQ-010
- Covers: [REQ-010/cell-SR_IOV-MMIO, REQ-010/cell-SR_IOV-vDPA]
- Consolidation rationale: shared fix path in both transports goes through the same feature-bit filter; single patch on the shared helper closes both cells.
If the reviewer concluded MMIO ADMIN_VQ is intentionally out-of-scope because ADMIN_VQ is a PCI-only spec feature, the downgrade record would be:

```json
{
  "cell_id": "REQ-010/cell-ADMIN_VQ-MMIO",
  "authority_ref": "include/uapi/linux/virtio_pci.h:NN",
  "site_citation": "drivers/virtio/virtio_mmio.c: no admin virtqueue implementation",
  "reason_class": "out-of-scope",
  "falsifiable_claim": "ADMIN_VQ is PCI-scoped — falsifiable by citing any virtio-spec normative text requiring ADMIN_VQ on non-PCI transports."
}
```

Union check: 6 BUG-covered cells + 1 downgrade cell = 7. Grid has 12 cells; the 5 present cells don't need coverage. Total: 6 F cells covered via BUGs + 1 via downgrade = all 7 absent cells accounted for. Grid → clean.

### ITERATION mode addendum (MANDATORY INCREMENTAL WRITE, Phase 8)

When running in iteration mode (gap / unfiltered / parity / adversarial), write candidate BUG stubs to disk immediately on identification, not at end-of-review. Path: `quality/code_reviews/<iteration>-candidates.md`. One `### CANDIDATE-NNN` heading per candidate, with at least a file:line citation. Reviewer upgrades candidates to confirmed BUGs in BUGS.md only after full triage.

### CONFIRMATION CHECKLIST (Lever 2, v1.5.2)

Before writing the Phase 3 completion checkpoint to PROGRESS.md, confirm each item explicitly in your Phase 3 summary:

1. For every pattern-tagged REQ, I produced a compensation grid in `quality/compensation_grid.json`.
2. For every grid, I applied the BUG-default rule mechanically.
3. Every BUG emitted for a pattern-tagged REQ has a `- Covers: [...]` field with valid cell IDs.
4. Every BUG whose Covers list has ≥2 entries has a non-empty `- Consolidation rationale: ...` field.
5. For every downgraded cell, I wrote a complete structured record in `quality/compensation_grid_downgrades.json` with all five required fields and a valid `reason_class`.
6. For every pattern-tagged REQ, the union of Covers lists + downgrade cells equals the grid's cell set.

Mark Phase 3 (Code review + regression tests) complete in PROGRESS.md (use the checkbox format `- [x] Phase 3 - Code Review` — do NOT switch to a table).

IMPORTANT: Do NOT proceed to Phase 4 (spec audit). The next phase will run the spec audit with a fresh context window.

@@ -0,0 +1,54 @@
{skill_fallback_guide}

You are a quality engineer continuing a phase-by-phase quality playbook run. Phases 1-3 are complete.

Read these files to get context:
1. quality/PROGRESS.md - run metadata, phase status, BUG tracker
2. quality/REQUIREMENTS.md - derived requirements
3. quality/BUGS.md - bugs found in Phase 3 (code review)
4. SKILL.md - read the Phase 4 section ("Phase 4: Spec Audit and Triage"). Also read references/spec_audit.md. Resolve SKILL.md and the references/ directory via the documented fallback list above; do NOT assume any single install layout.

Execute Phase 4: Spec Audit + Triage + Layer-2 semantic citation check.

Part A — spec audit:
Run the spec audit per quality/RUN_SPEC_AUDIT.md. Produce:
- Individual auditor reports at quality/spec_audits/YYYY-MM-DD-auditor-N.md (one per auditor)
- Triage synthesis at quality/spec_audits/YYYY-MM-DD-triage.md
- Executable triage probes at quality/spec_audits/triage_probes.sh
- Regression tests and patches for any net-new spec audit bugs
- Update BUGS.md and PROGRESS.md BUG tracker with any new findings

Part B — Layer-2 semantic citation check (v1.5.1):
The gate's invariant #17 (schemas.md §10) requires three Council members to vote on each Tier 1/2 REQ's citation_excerpt. Execute these steps:

1. Generate per-Council-member prompts:

       python3 -m bin.quality_playbook semantic-check plan .

   This writes one or more prompt files to quality/council_semantic_check_prompts/<member>.txt per member in the Council roster (bin/council_config.py: claude-opus-4.7, gpt-5.4, gemini-2.5-pro). For >15 Tier 1/2 REQs, prompts are split into batches of 5 (<member>-batch<N>.txt).
   If no Tier 1/2 REQs exist (Spec Gap run), this step writes an empty quality/citation_semantic_check.json directly — skip steps 2-4.

2. For each Council member's prompt file, feed the prompt to that model (the same roster that ran Part A) and capture its JSON-array response to quality/council_semantic_check_responses/<member>.json. If the member was batched, concatenate the per-batch responses into a single array in the response file. Every entry must have req_id, verdict (supports|overreaches|unclear), and reasoning.

3. Assemble the semantic-check output:

       python3 -m bin.quality_playbook semantic-check assemble . \
         --member claude-opus-4.7 --response quality/council_semantic_check_responses/claude-opus-4.7.json \
         --member gpt-5.4 --response quality/council_semantic_check_responses/gpt-5.4.json \
         --member gemini-2.5-pro --response quality/council_semantic_check_responses/gemini-2.5-pro.json

   This writes quality/citation_semantic_check.json per schemas.md §9.

4. Verify the output file exists. Phase 6's gate invariant #17 requires it on every Tier 1/2 run.

Mark Phase 4 (Spec audit + triage + semantic check) complete in PROGRESS.md (use the checkbox format `- [x] Phase 4 - Spec Audit` — the Phase 5 entry gate looks for that exact substring and will abort if it finds a table row or any other layout).

IMPORTANT: Do NOT proceed to Phase 5 (reconciliation). The next phase will handle reconciliation and TDD.

@@ -0,0 +1,119 @@
{skill_fallback_guide}

You are a quality engineer continuing a phase-by-phase quality playbook run. Phases 1-4 are complete.

Read these files to get context:
1. quality/PROGRESS.md - run metadata, phase status, cumulative BUG tracker
2. quality/BUGS.md - all confirmed bugs from code review and spec audit
3. quality/REQUIREMENTS.md - derived requirements
4. SKILL.md - read the Phase 5 section ("Phase 5: Post-Review Reconciliation and Closure Verification"). Also read references/requirements_pipeline.md, references/review_protocols.md, and references/spec_audit.md. Resolve SKILL.md and the references/ directory via the documented fallback list above; do NOT assume any single install layout.

Execute Phase 5: Reconciliation + TDD + Closure.

1. Run the Post-Review Reconciliation per references/requirements_pipeline.md. Update COMPLETENESS_REPORT.md.
2. Run closure verification: every BUG in the tracker must have either a regression test or an explicit exemption.
3. Write bug writeups at quality/writeups/BUG-NNN.md for EVERY confirmed bug. The canonical template is the "Bug writeup generation" section of SKILL.md (resolve via the fallback list above) — read that section before writing. Use the exact field headings listed there: **Summary, Spec reference, The code, Observable consequence, Depth judgment, The fix, The test, Related issues**. Sections 1–4, 6, 7 are required in every writeup; section 5 (Depth judgment) fires only when the consequence isn't self-evident from the immediate code; section 8 (Related issues) is included only when related bugs exist. Do NOT introduce fields that aren't in the template (no "Minimal reproduction" as a top-level field, no "Patch path:" as a top-level field — those belong inside Spec reference and The test respectively).

**MANDATORY HYDRATION STEP.** Before writing a writeup, re-open quality/BUGS.md and locate the `### BUG-NNN:` entry for the bug you are about to write up. Every confirmed bug in BUGS.md already has the content you need — your job is to copy it into the writeup's sections, not to invent it. If a field is missing from BUGS.md, that is a reconciliation error to surface in PROGRESS.md, not a field to fabricate. Use this field map:

| BUGS.md field | Writeup section | How to use it |
|---|---|---|
| Title line (### BUG-NNN:…) | Summary | One sentence naming the function/code path and the observable failure. |
| Primary requirement | Spec reference | `- Requirement: REQ-NNN` |
| Spec basis | Spec reference | `- Spec basis: <doc path + line range(s), semicolon-separated if multiple>` plus a ≤15-word contract quote copied verbatim from the cited lines. |
| Location | The code | Cite `file:line` and describe what the current path does there. |
| Minimal reproduction | Observable consequence | Weave into the consequence paragraph as the triggering input. |
| Expected + Actual behavior | Observable consequence | The actual behavior is the observable failure; the expected defines the gap. |
| Regression test | The test | `- Regression test: <function name>` — verbatim from BUGS.md. |
| Patches (regression) | The test | `- Regression patch: <path>` — verbatim from BUGS.md. |
| Patches (fix) | The fix + The test | If a fix patch file exists, read it and paste the unified diff inside ```diff; also list the patch path as `- Fix patch: <path>` under The test. If no fix patch exists (confirmed-open bug), write the minimal concrete unified diff directly in The fix anyway — SKILL.md requires an inline diff in every writeup. In the no-patch case, omit the `Fix patch:` bullet from The test. |
| Red/green logs | The test | `- Red receipt: quality/results/BUG-NNN.red.log` and the matching green path. |
**Worked example.** The BUGS.md entry for BUG-004 is:

### BUG-004: naive upstream timestamps crash ETA math
- Source: Code Review
- Severity: HIGH
- Primary requirement: REQ-006
- Location: bus_tracker.py:138-144
- Spec basis: quality/REQUIREMENTS.md:163-172; quality/QUALITY.md:57-65
- Minimal reproduction: Return a visit whose ExpectedArrivalTime is an ISO string without timezone information, such as 2026-04-21T12:00:00.
- Expected behavior: The affected arrival degrades to unknown-time while the rest of the stop remains usable.
- Actual behavior: datetime.fromisoformat() returns a naive datetime and subtracting it from datetime.now(timezone.utc) raises TypeError, aborting the stop/request path.
- Regression test: quality.test_regression.TestPhase3Regressions.test_bug_004_fetch_stop_arrivals_degrades_naive_timestamps
- Patches: quality/patches/BUG-004-regression-test.patch, quality/patches/BUG-004-fix.patch

The hydrated writeup sections look like this (sketch — paste the real diff from the fix patch file into ```diff, don't make one up):

## Summary
fetch_stop_arrivals() crashes the whole stop/request path when an upstream visit carries a naive ExpectedArrivalTime, instead of degrading that arrival to unknown-time.

## Spec reference
- Requirement: REQ-006
- Spec basis: quality/REQUIREMENTS.md:163-172; quality/QUALITY.md:57-65
- Behavioral contract quote: "degrade a bad per-arrival timestamp to unknown-time instead of aborting the whole response path"

## The code
At bus_tracker.py:138-144, the parser calls datetime.fromisoformat(...) on ExpectedArrivalTime and subtracts the result from datetime.now(timezone.utc)…

## Observable consequence
When the upstream visit returns ExpectedArrivalTime="2026-04-21T12:00:00" (no timezone), fromisoformat() returns a naive datetime, the subtraction raises TypeError, and the entire stop/request path aborts rather than the single affected arrival degrading to unknown-time.

## The fix
```diff
<paste the real unified diff from quality/patches/BUG-004-fix.patch here>
```

## The test
- Regression test: quality.test_regression.TestPhase3Regressions.test_bug_004_fetch_stop_arrivals_degrades_naive_timestamps
- Regression patch: quality/patches/BUG-004-regression-test.patch
- Fix patch: quality/patches/BUG-004-fix.patch
- Red receipt: quality/results/BUG-004.red.log
- Green receipt: quality/results/BUG-004.green.log

**Confirmation checklist (per writeup, before moving to the next bug).** (a) Every required section has populated content copied from BUGS.md or the patch files — no empty backticks, no sentinel filler like "is a confirmed code bug in ``" or "The affected implementation lives at ``" or "Patch path: ``". (b) The ```diff fence contains at least one `+` or `-` line from the actual fix patch. (c) The Summary names a real function or code path, not the BUG identifier. (d) No angle-bracket placeholders (e.g., `<...>`) remain in the final writeup — those are pedagogical markers from the worked example and from SKILL.md, never acceptable output.
4. Run the TDD red-green cycle: for each confirmed bug, run the regression test against unpatched code -> quality/results/BUG-NNN.red.log. If a fix patch exists, run against patched code -> quality/results/BUG-NNN.green.log. If the test runner is unavailable, create the log with NOT_RUN on the first line.
5. Generate sidecar JSON: quality/results/tdd-results.json and quality/results/integration-results.json (schema_version "1.1", canonical fields: id, requirement, red_phase, green_phase, verdict, fix_patch_present, writeup_path — see the sketch after this list).
6. If mechanical verification artifacts exist, run quality/mechanical/verify.sh and save receipts.
7. Run terminal gate verification, write it to PROGRESS.md.
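
For concreteness, one plausible tdd-results.json payload using the canonical fields from step 5 — the values, the `verdict` string, and the top-level `results` wrapper are illustrative assumptions, not from a real run:

```python
# Sketch: assembling one sidecar entry with the canonical field names.
import json

entry = {
    "id": "BUG-004",
    "requirement": "REQ-006",
    "red_phase": "quality/results/BUG-004.red.log",
    "green_phase": "quality/results/BUG-004.green.log",
    "verdict": "red-green-verified",      # assumed verdict vocabulary
    "fix_patch_present": True,
    "writeup_path": "quality/writeups/BUG-004.md",
}
print(json.dumps({"schema_version": "1.1", "results": [entry]}, indent=2))
```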

### MANDATORY CARDINALITY GATE (Lever 3, v1.5.2)

Before finalizing this phase, run the cardinality reconciliation gate against the current repo state. Locate `quality_gate.py` via the same fallback list used for SKILL.md (it sits in the same directory as SKILL.md in every install layout), then invoke it as a script — `quality_gate.py` runs `check_v1_5_2_cardinality_gate(repo_dir)` as part of its standard pass:

    python3 <resolved_quality_gate_path> .

Where `<resolved_quality_gate_path>` is the first hit when walking the documented install-location fallback list, with `SKILL.md` swapped for `quality_gate.py` (e.g., `quality_gate.py`, `.claude/skills/quality-playbook/quality_gate.py`, `.github/skills/quality_gate.py`, `.cursor/skills/quality-playbook/quality_gate.py`, `.continue/skills/quality-playbook/quality_gate.py`, `.github/skills/quality-playbook/quality_gate.py`).
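
Resolution reduces to a first-hit walk over that list — a sketch:

```python
# Sketch: resolve quality_gate.py across the documented install layouts.
from pathlib import Path

FALLBACK_PATHS = [
    "quality_gate.py",
    ".claude/skills/quality-playbook/quality_gate.py",
    ".github/skills/quality_gate.py",
    ".cursor/skills/quality-playbook/quality_gate.py",
    ".continue/skills/quality-playbook/quality_gate.py",
    ".github/skills/quality-playbook/quality_gate.py",
]

def resolve_quality_gate(repo: Path) -> Path:
    for rel in FALLBACK_PATHS:
        candidate = repo / rel
        if candidate.exists():
            return candidate
    raise FileNotFoundError("quality_gate.py not found in any install layout")

# usage: subprocess.run(["python3", str(resolve_quality_gate(Path("."))), "."])
```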

If the gate output contains any line beginning with `cardinality gate:`, or reports uncovered cells, malformed cell IDs, missing consolidation rationale on multi-cell Covers, or malformed downgrade records, STOP. Fix the BUGS.md entries or the `compensation_grid_downgrades.json` file. Do NOT proceed to completion until those failure lines no longer appear.

For every pattern-tagged REQ, the Phase 5 contract is:
- Every grid cell with `"present": false` appears in either a BUG's `Covers:` list or a downgrade record.
- Every `Covers:` entry uses the canonical cell ID form `REQ-N/cell-<item>-<site>`.
- Every BUG with ≥2 `Covers:` entries has a non-empty `Consolidation rationale:` line.
- Every downgrade record has `cell_id`, `authority_ref`, `site_citation`, `reason_class` (in the enum), `falsifiable_claim` (non-empty).

The cardinality gate is blocking. It is intentionally stricter than the Phase 3 advisory self-check; the advisory check is meant to surface problems early, but Phase 5 is where they become fatal.

Mark Phase 5 complete in PROGRESS.md (use the checkbox format `- [x] Phase 5 - Reconciliation` — do NOT switch to a table).

IMPORTANT: quality_gate.py will FAIL Phase 5 if any writeup is missing a non-empty ```diff block or contains any of these sentinel phrases verbatim: "is a confirmed code bug in ``", "The affected implementation lives at ``", "Patch path: ``", "- Regression test: ``", "- Regression patch: ``". Those two checks are the hard gate. Skipping the BUGS.md hydration step above is not gate-enforced but will produce writeups that read as unpopulated stubs and fail a human review — do not skip it.
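
Those two hard checks can be sketched as a scan like this (illustrative; `quality_gate.py` is authoritative):

```python
# Sketch: scan one writeup for the sentinel phrases and for a diff fence
# that actually contains +/- lines. The phrase list is copied from the
# gate contract above.
import re
from pathlib import Path

SENTINELS = [
    'is a confirmed code bug in ``',
    'The affected implementation lives at ``',
    'Patch path: ``',
    '- Regression test: ``',
    '- Regression patch: ``',
]

def writeup_failures(path: Path) -> list[str]:
    text = path.read_text(encoding="utf-8")
    fails = [f"sentinel phrase present: {s!r}" for s in SENTINELS if s in text]
    diffs = re.findall(r"```diff\n(.*?)```", text, flags=re.DOTALL)
    if not any(re.search(r"^[+-]", d, flags=re.MULTILINE) for d in diffs):
        fails.append("no ```diff block with +/- lines")
    return fails
```
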
@@ -0,0 +1,23 @@
{skill_fallback_guide}

You are a quality engineer doing the verification phase of a quality playbook run. Phases 1-5 are complete.

Read SKILL.md - the Phase 6 section ("Phase 6: Verify"). Resolve SKILL.md via the documented fallback list above; do NOT assume any single install layout. Follow the incremental verification steps (6.1 through 6.5).

Step 6.1: If quality/mechanical/verify.sh exists, run it. Record exit code.
Step 6.2: Run quality_gate.py. Locate it via the same fallback list used for SKILL.md (`quality_gate.py` sits in the same directory as SKILL.md in every install layout — e.g., `quality_gate.py`, `.claude/skills/quality-playbook/quality_gate.py`, `.github/skills/quality_gate.py`, `.cursor/skills/quality-playbook/quality_gate.py`, `.continue/skills/quality-playbook/quality_gate.py`, `.github/skills/quality-playbook/quality_gate.py`). Then run:

    python3 <resolved_quality_gate_path> .

Read the output carefully. For every FAIL result, fix the issue:
- Missing regression-test patches: generate quality/patches/BUG-NNN-regression-test.patch
- Missing inline diffs in writeups: add a ```diff block
- Non-canonical JSON fields: fix tdd-results.json (use 'id' not 'bug_id', etc.)
- Missing files: create them
After fixing all FAILs, run quality_gate.py again. Repeat until 0 FAIL.
Save final output to quality/results/quality-gate.log.

Step 6.3: Run functional tests if a test runner is available.
Step 6.4: File-by-file verification checklist (read one file at a time, check, move on).
Step 6.5: Metadata consistency check.

Append each step's result to quality/results/phase6-verification.log.
Mark Phase 6 complete in PROGRESS.md (use the checkbox format `- [x] Phase 6 - Verify` — do NOT switch to a table).

@@ -0,0 +1 @@
{skill_fallback_guide} Execute the quality playbook for this project.{seed_instruction}

Executable, +3385 — File diff suppressed because it is too large
@@ -0,0 +1,106 @@
# Challenge Gate — Bug Validity Review

## Purpose

The challenge gate is a self-adversarial review that every confirmed bug must survive before receiving a writeup and regression test. It catches false positives, over-classified feature gaps, and findings where pattern-matching overrode common sense.

The gate can be invoked two ways:

1. **During a playbook run** — automatically applied to bugs matching trigger patterns (see below).
2. **Standalone** — pointed at a `quality/` directory from a prior run to challenge specific bugs. Example: `"Read quality/writeups/BUG-042.md and the source code it references. Run the challenge gate on this bug."`

## The two-round challenge

For each bug under review, run exactly two rounds. Each round uses a fresh sub-agent so the challenger has no investment in the finding.

### Round 1: "Does this strike you as a real bug?"

Provide the sub-agent with:
- The bug writeup (or BUGS.md entry if no writeup yet)
- The actual source code at the cited file:line (read it fresh — do not trust the writeup's code snippet)
- All comments within 10 lines above and below the cited location
- The project's README section on the relevant feature (if any)

Prompt the sub-agent:

||||
>
|
||||
> **Before analyzing anything, apply common sense.** Step back from the details and ask yourself: if you showed this code and this bug report to a senior developer who has never seen either before, would they say "yes, that's a bug" — or would they say "that's obviously not a bug"? If the answer is obviously not a bug, say so immediately and explain why. Do not rationalize your way past a common-sense answer. The goal of this review is to catch findings where pattern-matching overrode judgment.
|
||||
>
|
||||
> Then consider:
|
||||
> - Is the developer aware of this behavior? (Look for comments, TODO markers, design decision notes, WHY annotations, OODA references.)
|
||||
> - Is this a documented limitation or intentional trade-off? (Check if other code paths handle this differently by design, not by accident.)
|
||||
> - Would the project maintainer respond "that's not a bug, that's how it works" or "that's a known limitation we documented"?
|
||||
> - Is the "expected behavior" in the bug report actually required by any spec, or is it the auditor's opinion about what the code should do?
|
||||
> - Is this development scaffolding? Values with names like "change-me", "placeholder", "example", "default", "TODO" are not defects — they are self-documenting markers that exist to make the project buildable during development. A feature that is disabled by default and uses placeholder values is an incomplete feature, not a vulnerability.
|
||||
>
|
||||
> Give your honest assessment. If it's a real bug, say so and explain why. If it's not, say so and explain why. A finding can be "not a bug" even if the code could be improved — the question is whether a reasonable maintainer would accept this as a defect report.
|
||||
|
||||
### Round 2: Targeted follow-up

Based on the Round 1 response, generate a single pointed follow-up question. The goal is to stress-test whatever position the sub-agent took in Round 1.

**If Round 1 said "real bug":** The follow-up should challenge the finding from the maintainer's perspective. Use a fresh sub-agent with this framing:

> You are the maintainer of this project. A contributor filed this bug report. You wrote the code being criticized. Read the code, the bug report, and the Round 1 assessment below.
>
> Write the single most compelling argument for why this is NOT a bug. Consider: intentional design decisions, documented limitations, deployment context, common patterns in this language/framework, and whether the "expected behavior" is actually specified anywhere authoritative.
>
> Then, after making that argument, state whether you still believe it's a real bug or whether the argument convinced you it's not.

**If Round 1 said "not a bug":** The follow-up should challenge the dismissal. Use a fresh sub-agent with this framing:

> You are a security researcher reviewing this codebase. Another reviewer dismissed this finding as "not a bug." Read the code, the bug report, and the Round 1 dismissal below.
>
> Write the single most compelling argument for why this IS a real bug despite the dismissal. Consider: edge cases the dismissal didn't address, downstream consequences, what happens when the code interacts with other components, and whether "intentional" and "correct" are the same thing.
>
> Then, after making that argument, state whether you believe the finding should be confirmed or dismissed.

### Verdict
|
||||
|
||||
After both rounds, assign one of three verdicts:
|
||||
|
||||
- **CONFIRMED** — Both rounds agree it's a real bug, or Round 2's challenge failed to undermine Round 1's confirmation. The bug proceeds to writeup and regression test as normal.
|
||||
- **DOWNGRADED** — The bug is real but the severity was inflated. Adjust severity and update the writeup. Common case: a CRITICAL that should be MEDIUM, or a security finding that's actually a code quality issue.
|
||||
- **REJECTED** — The finding is not a bug. It's a documented design choice, intentional scaffolding, a known limitation, the auditor's opinion rather than a spec violation, or something that common sense says is obviously not a defect. Remove the bug ID. If useful, record it in a "Reviewed and dismissed" appendix in BUGS.md with the challenge reasoning.
|
||||
|
||||
Write the verdict and both rounds' reasoning to `quality/challenge/BUG-NNN-challenge.md`. This file is the audit trail — it shows reviewers that each finding was stress-tested.

## Auto-trigger patterns

During a playbook run, automatically apply the challenge gate to any bug matching one or more of these patterns. These patterns are where false positives concentrate:

| Pattern | Why it triggers | Example |
|---------|----------------|---------|
| **Security-class finding** (credential leak, auth bypass, injection) | Severity calibration auto-escalates these; the model is incentivized to defend them | BUG-041: "hardcoded JWT secret" that was a development placeholder |
| **Code contains design-decision comments at the cited location** | WHY comments, OODA references, TODO-with-explanation, or design decision docs within 10 lines of the cited code suggest the developer made a conscious choice | BUG-007/008: `// WHY-OODA81: Batch upload uses "default" workspace` |
| **The "expected behavior" has no spec basis** | Bug's spec_basis field says "code inconsistency" rather than citing a spec document, or the requirement was invented by the auditor (Tier 3 / REQ-NNN created during the run) | BUG-041: REQ-019 was created by the auditor, not derived from project docs |
| **Another code path handles the same concern differently** | If text_upload does X but file_upload doesn't, that might be a real inconsistency — or it might be intentional divergence. The challenge sorts out which. | BUG-001/002: text_upload merges source_ids, file_upload overwrites — challenge confirms this is a real bug because text_upload has an explicit fix comment |
| **The finding is about missing functionality rather than incorrect behavior** | "This handler doesn't do X" is often a feature gap, not a bug. The challenge checks whether X was ever promised. | BUG-009/029: batch upload "missing" graph writes that were never part of the batch upload's documented scope |

The pattern list is intentionally conservative — it triggers on categories with historically high false-positive rates. Bugs that don't match any pattern skip the challenge gate and proceed directly to writeup.

To add new patterns: append a row to the table above with the pattern description, the reasoning, and a concrete example from a prior run.

## Standalone invocation

When invoked standalone (not during a playbook run), the challenge gate:

1. Reads the specified bug writeup from `quality/writeups/BUG-NNN.md`
2. Reads the source code at the cited file:line (fresh read, not from the writeup)
3. Runs both rounds as described above
4. Writes the verdict to `quality/challenge/BUG-NNN-challenge.md`
5. If the verdict is REJECTED, suggests removing the bug from BUGS.md and tdd-results.json

Example prompt for standalone use:

```
Read the quality playbook skill at .github/skills/SKILL.md and .github/skills/references/challenge_gate.md.
Run the challenge gate on BUG-042 using the writeup at quality/writeups/BUG-042.md
and the source code in this repo.
```

## Token budget

Each bug costs roughly 2 sub-agent calls. For a typical run with 5-10 auto-triggered bugs, that's 10-20 sub-agent calls. This is significantly cheaper than a full iteration cycle and catches the highest-value false positives.

For runs with many security findings (>15 auto-triggered), consider batching: run Round 1 on all triggered bugs first, then only run Round 2 on bugs where Round 1 was ambiguous or where the confidence was low.

@@ -0,0 +1,59 @@
# Code-only mode

*Last updated: 2026-05-03 (v1.5.6 Phase 3 — initial publication).*

When the Quality Playbook runs against a target repo whose `reference_docs/` directory is absent or empty, it operates in **code-only mode**. This document explains what that means, why it matters, and how to upgrade a code-only run into a full-documentation run for the next pass.

## What "code-only mode" means

The playbook's normal Phase 1 derivation reads two kinds of evidence:

- **Code evidence (Tier 3+)** — the source tree itself, plus inline comments, defensive patterns, tests, and any inline documentation co-located with the code.
- **Documentation evidence (Tier 1/2)** — plaintext files the operator drops into `reference_docs/` (free-form notes, design docs, retrospectives, AI chats) and `reference_docs/cite/` (project specs, RFCs, API contracts that requirements should be traceable back to).

Code-only mode is the run state where no documentation evidence is available. The playbook proceeds — it does not abort — but every requirement it derives leans entirely on code evidence. The Phase 1 EXPLORATION.md gets a "Documentation status: code-only mode" opening section that surfaces the mode so reviewers see it on first read.

## What to expect from a code-only run

In our benchmark runs, code-only passes consistently produce:

- **Fewer requirements derived overall.** Without spec-language to anchor, Phase 1 has no Tier 1/2 evidence to cite, so the requirements set falls back to Tier 3 (code-as-spec) entirely.
- **Possibly fewer bugs found.** Code review (Phase 3) is most effective when the reviewer knows what the code is *supposed* to do — bugs that violate documented intent are easier to surface than bugs that hide behind ambiguous code-as-spec. With no documentation, the reviewer has to infer intent from the code itself, which leaves a class of intent-violation defects undetected.
- **Higher reliance on code-internal signals.** Defensive patterns (error checks, validation), test names, and comment-style annotations carry more weight in the absence of external docs.

The bug counts in code-only mode are still useful — they reflect what's discoverable from the code alone — but they are a lower bound on what a fully-documented run would produce.

## How to upgrade to a full-documentation run

Place plaintext documentation files in the target repo's `reference_docs/` tree before re-running Phase 1:

```
<target-repo>/
  reference_docs/
    project_notes.md       # Tier 4 — informal notes, AI chats
    design_overview.md     # Tier 3-4 — internal design decisions
    cite/
      api_spec.md          # Tier 1/2 — citable specs, RFCs, contracts
      protocol_v3.txt      # Tier 1/2 — formal specifications
```

Files at the top level of `reference_docs/` count as informal context (Tier 4). Files under `reference_docs/cite/` count as citable evidence (Tier 1 or 2 depending on the source's authority — see `schemas.md` §3.1). Both `.md` and `.txt` are recognized; other formats are ignored.

After dropping in documentation, re-run the playbook. Phase 1 will detect the populated `reference_docs/` and skip the code-only-mode downgrade. The new run's EXPLORATION.md, REQUIREMENTS.md, and BUGS.md will reflect the richer evidence base.

## Opt-out: `--require-docs`

Operators who want runs to abort instead of proceeding in code-only mode can pass `--require-docs` to `python3 -m bin.run_playbook` (v1.5.6+). When `--require-docs` is set and `reference_docs/` is empty at Phase 1 entry, the playbook:

1. Appends an `aborted_missing_docs` event to `quality/run_state.jsonl` (event type registered in `references/run_state_schema.md`).
2. Writes a clear `ERROR: aborted_missing_docs — reference_docs/ empty and --require-docs set` block to `quality/PROGRESS.md`.
3. Aborts before any LLM work (exit non-zero, same as a gate-fail).

The flag is off by default. Use it for compliance/policy contexts where a quiet code-only-mode downgrade would mask a real process gap (e.g., "every release run must cite a spec; no spec means the run shouldn't have started"). It is the strict counterpart of `--no-formal-docs`, which goes the opposite direction: that flag suppresses the WARN banner for the same code-only-mode case but still allows the run to continue.
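
For concreteness, a minimal sketch of the abort step in Python. This is illustrative only, not the code in `bin/run_playbook`; every field name other than the `aborted_missing_docs` event type is an assumption, and the authoritative event shape lives in `references/run_state_schema.md`.

```python
import json
import sys
import time
from pathlib import Path

def abort_missing_docs(repo: Path) -> None:
    """Hypothetical sketch: record the abort, then exit non-zero like a gate-fail."""
    event = {
        "event": "aborted_missing_docs",  # the registered event type
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "reason": "reference_docs/ empty and --require-docs set",  # illustrative field
    }
    with (repo / "quality" / "run_state.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    with (repo / "quality" / "PROGRESS.md").open("a", encoding="utf-8") as fh:
        fh.write("\nERROR: aborted_missing_docs — reference_docs/ empty and --require-docs set\n")
    sys.exit(1)
```
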
## Cross-references

- **README** — Step 1 of "How to use the Quality Playbook" describes documentation as the first thing to provide.
- **`SKILL.md`** — Phase 1 prose describes how documentation evidence is used during exploration.
- **`bin/reference_docs_ingest.py`** — the implementation that ingests the `reference_docs/` tree.
- **`references/run_state_schema.md`** — defines the `documentation_state` event the playbook emits when code-only mode triggers, so the downgrade is searchable in audit trails.

@@ -155,6 +155,26 @@ State machines are a special category of defensive pattern. When you find status
3. Look for states you can enter but never leave (terminal state without cleanup)
4. Look for operations that should be available in a state but are blocked by an incomplete guard

## Enumeration and Whitelist Completeness

When a function uses `switch`/`case`, `match`, if-else chains, or any dispatch construct to handle a set of named constants (feature bits, enum values, command codes, event types, permission flags), perform the **two-list enumeration check**:

1. **List A (defined):** Extract every constant from the relevant header, enum, or spec that the code should handle. Use grep — do not list from memory.
2. **List B (handled):** Extract every case label, branch condition, or map key from the dispatch code. Use grep or line-by-line read — do not summarize.
3. **Diff:** Compare the two lists. Any constant in A but not in B is a potential gap. Any constant in B but not in A is a potential dead case.
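
A minimal sketch of the mechanical check in Python, for a C-style codebase. The paths, the constant prefix, and the regexes are assumptions for illustration; adapt them to the language at hand.

```python
import re
from pathlib import Path

def two_list_check(header: Path, dispatch_src: Path, prefix: str) -> None:
    """Diff defined constants (List A) against handled case labels (List B)."""
    # List A: constants defined in the authoritative header
    defined = set(re.findall(rf"#define\s+({prefix}\w+)", header.read_text()))
    # List B: case labels actually present in the dispatch code
    handled = set(re.findall(rf"case\s+({prefix}\w+)\s*:", dispatch_src.read_text()))
    for name in sorted(defined - handled):
        print(f"potential gap: {name} defined but never handled")
    for name in sorted(handled - defined):
        print(f"potential dead case: {name} handled but never defined")
```
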
**Why this exists:** AI models reliably hallucinate completeness for switch/case constructs. The model sees a function with many case labels, sees constants defined elsewhere, and concludes all constants are handled without actually checking. In one observed case, the model asserted that a kernel feature-bit whitelist "preserves supported ring transport bits including VIRTIO_F_RING_RESET" when that constant was entirely absent from the switch — the model hallucinated coverage because the constant existed in a header the function's callers used. The mechanical two-list check is the only reliable countermeasure.

**Triage verification probes must produce executable evidence.** When triage confirms or rejects an enumeration finding via verification probe, prose reasoning alone is insufficient. The probe must produce a test assertion for each constant: `assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), "RING_RESET at line NNN"`. This rule exists because in v1.3.16, the triage correctly received a minority finding about RING_RESET but rejected it with a hallucinated claim that "lines 3527-3528 explicitly preserve RING_RESET" — those lines were actually the `default:` branch. Had the triage been forced to write an assertion, it would have failed, exposing the hallucination.

**Code-side lists must be extracted from the code, not copied from requirements.** When performing the two-list check in the code review or spec audit, the "handled" list must be extracted directly from the function body with per-item line numbers. Do not copy from REQUIREMENTS.md, CONTRACTS.md, the audit prompt, or any other generated artifact. If the two lists (code-extracted vs. requirements-claimed) are word-for-word identical, that is a red flag that the code list was copied — redo the extraction. In v1.3.17, the code review's "case labels present" list was identical to the requirements list, proving it was copied rather than extracted. Three spec auditors then inherited this false list and none independently verified. The per-item line-number citation prevents this: you cannot cite "line 3527: `case VIRTIO_F_RING_RESET:`" when line 3527 actually contains `default:`.

**Mechanical verification artifacts outrank prose lists.** If `quality/mechanical/<function>_cases.txt` exists for a dispatch function, use it as the authoritative source for what the function handles. Do not replace it with a hand-written list. If no mechanical artifact exists, generate one using a non-interactive shell pipeline (e.g., `awk` + `grep`) before writing contracts or requirements about the function's coverage.

**Artifact integrity risk:** In v1.3.19 testing, the model executed the correct extraction command but wrote its own fabricated output to the file instead of letting the shell redirect capture it. The fabricated file included a hallucinated `case VIRTIO_F_RING_RESET:` line that the real command does not produce. To mitigate: `quality/mechanical/verify.sh` re-runs every extraction command and diffs against saved files. If any diff is non-empty, the artifact was tampered with and must be regenerated.

**Where to apply:** Feature-bit negotiation functions, protocol message dispatchers, permission check switches, configuration option handlers, codec/format registration tables, HTTP method/status code handlers, and any function where a `default:` or `else` clause silently drops unrecognized values.

**Converting state machine gaps to scenarios:**

```markdown

@@ -0,0 +1,339 @@
# Exploration Patterns for Bug Discovery

This reference defines the exploration patterns that Phase 1 applies during codebase exploration. These patterns target bug classes most commonly missed when exploration stays at the subsystem or architecture level.

Requirements problems are the most expensive to fix because they are not caught until after implementation. The exploration phase is requirements elicitation — it determines what the code review and spec audit will look for. A requirement that is never derived is a bug that is never found. These patterns exist to systematically surface requirements that broad exploration misses.

Each pattern includes a definition, the bug class it targets, diverse examples from different domains, and the expected output format for EXPLORATION.md.

**Important: These patterns supplement free exploration — they do not replace it.** Phase 1 begins with open-ended exploration driven by domain knowledge and codebase understanding. After that open exploration, apply the patterns below as a structured second pass to catch specific bug classes. If you find yourself only looking for things the patterns describe, you are using them wrong. The patterns are a checklist to run after you have already formed your own understanding of the codebase's risks.

---

## Pattern 1: Fallback and Degradation Path Parity

### Definition

When code provides multiple strategies for accomplishing the same goal — a primary path and one or more fallback paths — each fallback must preserve the same behavioral invariants as the primary. The fallback may use a different mechanism, but the observable contract must be equivalent.

### Bug class

Fallback paths are written later, tested less, and reviewed with less scrutiny than primary paths. They often omit steps the primary path performs (validation, cleanup, index assignment, resource release) because the developer copied the primary path and simplified it for the "degraded" case. The result is a function that works correctly in the common case but violates its contract when the fallback activates.

### Examples across domains

- **Authentication:** A web service tries OAuth token validation, falls back to API key lookup, falls back to session cookie. Each fallback must enforce the same authorization scope. Bug: the API key fallback skips scope validation and grants full access.
- **Connection pooling:** A database client tries the primary connection pool, falls back to a secondary pool, falls back to creating a one-off connection. Each path must apply the same timeout and transaction isolation settings. Bug: the one-off connection fallback uses the driver default isolation level instead of the configured one.
- **Resource allocation:** A memory allocator tries a fast slab path, falls back to a slow page-level path. Both must zero-initialize sensitive fields. Bug: the slow path returns uninitialized memory because zero-fill was only in the slab fast path.
- **HTTP redirect handling:** A client follows a redirect and must strip security-sensitive headers (Authorization, Proxy-Authorization, cookies) when the redirect crosses an origin boundary. Bug: the redirect path strips Authorization but not Proxy-Authorization, leaking proxy credentials to the redirected origin.
- **Serialization fallback:** A message broker tries binary serialization, falls back to JSON, falls back to string encoding. Each path must preserve the same field ordering and null-handling semantics. Bug: the JSON fallback silently drops null fields that binary serialization preserves.

### How to apply

For each core module, look for: conditional chains that try one approach then fall through to another, strategy/adapter patterns where multiple implementations are selected at runtime, retry logic with different strategies per attempt, feature-negotiation cascades where capabilities determine which code path runs, HTTP redirect/retry logic that must preserve or strip headers.

For each cascade found:
1. List the primary path and every fallback.
2. For each fallback, check whether it performs the same critical operations as the primary (validation, resource setup, index assignment, cleanup, error reporting, header stripping, resource release).
3. Any operation present in the primary but missing in a fallback is a candidate requirement.
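
As a concrete illustration, a minimal Python sketch of the parity gap from the authentication example above. All names are hypothetical; the point is the shape of the defect, not any real API.

```python
def validate_oauth(token): ...       # hypothetical: validates the token signature
def check_scope(claims, path): ...   # hypothetical: enforces authorization scope
def lookup_api_key(key): ...         # hypothetical: maps an API key to claims

def authenticate(request: dict):
    token = request.get("oauth_token")
    if token is not None:
        claims = validate_oauth(token)
        check_scope(claims, request["path"])  # primary path enforces scope
        return claims
    api_key = request.get("api_key")
    if api_key is not None:
        # Fallback copied from the primary and "simplified": the
        # check_scope() call was dropped, so the fallback grants full
        # access. This is the parity gap to record as a candidate REQ.
        return lookup_api_key(api_key)
    raise PermissionError("no credentials presented")
```
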
### EXPLORATION.md output format

```
## Fallback Path Analysis

### [Name of cascade]
- **Primary path:** [function, file:line] — [what it does]
- **Fallback 1:** [function, file:line] — [what it does, what differs]
- **Fallback 2:** [function, file:line] — [what it does, what differs]
- **Parity gaps:** [specific operations present in primary but missing in fallback]
- **Candidate requirements:** REQ-NNN: [fallback must do X]
```

---

## Pattern 2: Dispatcher Return-Value Correctness

### Definition

When a function dispatches on input type or condition and must return a status value, the return value must be correct for every combination of inputs — not just the primary case. Dispatchers that handle multiple event types, request types, or state transitions are particularly prone to return-value bugs in edge combinations.

### Bug class

Dispatchers are typically written and tested for the common case. The return value is correct when the primary event fires. But when an unusual combination occurs (only a secondary event, no events at all, multiple concurrent events), the return-value logic may be wrong — returning "not handled" for a handled event, returning success for a partial failure, or returning a stale value from a previous iteration.

### Examples across domains

- **HTTP middleware:** A request dispatcher checks for authentication, rate-limiting, and routing. When rate-limiting triggers but authentication was already set, the dispatcher returns the auth status code instead of the rate-limit status code. Bug: rate-limited requests get 401 instead of 429.
- **CORS handler chain:** A CORS preflight handler sets 400 (rejected), then the missing-OPTIONS-handler path sets 404, then an AFTER handler normalizes 404→200 (meant for allowed origins). Bug: rejected preflights get 200 because the status was overwritten by downstream handlers.
- **Event loop:** A poll/select loop handles read-ready, write-ready, and error conditions. When only an error condition fires on a socket with no pending reads, the loop returns "no events" because the read-ready check was false. Bug: connection errors are silently ignored.
- **State machine transition:** A state machine dispatch function handles valid transitions, invalid transitions, and no-op transitions. When a no-op transition occurs (current state == target state), the function returns an error code intended for invalid transitions. Bug: idempotent operations fail when they should succeed.
- **Interrupt handler:** A hardware interrupt handler checks for multiple event types (data-ready, configuration-change, error). When only a secondary event fires (e.g., config change with no data), the handler returns "not mine" because the primary event check failed and the secondary path doesn't set the handled flag. Bug: legitimate secondary events are reported as spurious.

### How to apply

For each core module, look for: functions with switch/case or if-else chains that return a status, interrupt/event handlers that handle multiple event types, request dispatchers that check multiple conditions before returning, state machine transition functions, middleware chains where multiple handlers write to the same response status.

For each dispatcher found:
1. Enumerate all input combinations (not just the ones with explicit case labels — also the implicit "else" and "default" paths).
2. For each combination, trace the return value through the entire handler chain (not just the immediate function).
3. Any combination where the return value doesn't match the expected semantics is a candidate requirement.
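
A minimal Python sketch of the event-loop example above, with hypothetical names. Enumerating the flag combinations is exactly what exposes the wrong return value.

```python
READ, ERROR = 0x1, 0x2  # hypothetical event bits

def consume_data(): ...
def handle_error(): ...

def poll_socket(events: int) -> str:
    if events & READ:
        consume_data()
        if events & ERROR:
            handle_error()
        return "handled"
    # BUG: the ERROR-only combination falls through here and is
    # reported as if nothing happened.
    return "no events"

# Enumerate the combinations:
assert poll_socket(READ) == "handled"           # common case: correct
assert poll_socket(READ | ERROR) == "handled"   # correct
assert poll_socket(0) == "no events"            # correct
print(poll_socket(ERROR))  # prints "no events": the candidate requirement
```
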
### EXPLORATION.md output format

```
## Dispatcher Return-Value Analysis

### [Function name] at [file:line]
- **Input types:** [list of conditions/events the function dispatches on]
- **Combinations checked:**
  - [Condition A only]: returns [X] — correct/incorrect because [reason]
  - [Condition B only]: returns [X] — correct/incorrect because [reason]
  - [Both A and B]: returns [X] — correct/incorrect because [reason]
  - [Neither A nor B]: returns [X] — correct/incorrect because [reason]
- **Candidate requirements:** REQ-NNN: [function must return Y when only B fires]
```

---

## Pattern 3: Cross-Implementation Contract Consistency

### Definition

When multiple functions implement the same logical operation for different contexts (different transports, different backends, different protocol versions), they should all satisfy the same specification requirement. A step that is mandatory in the specification must appear in every implementation — a missing step in one implementation that is present in another is a strong bug signal.

### Bug class

When the same operation is implemented in multiple places, each implementation is typically written by a different developer or at a different time. The specification says "reset must wait for completion," and the developer of implementation A writes the wait loop, but the developer of implementation B writes only the reset trigger and forgets the wait. The bug is invisible when testing implementation B in isolation because it "works" on fast hardware — the race condition only manifests under load or on slow devices.

### Examples across domains

- **Device reset:** A spec says "the driver must write zero and then poll until the status register reads back zero." The PCI implementation includes the poll loop. The MMIO implementation writes zero but does not poll. Bug: MMIO reset can race with reinitialization.
- **Database driver:** A connection-close spec says "the driver must send a termination message, wait for acknowledgment, then release the socket." The PostgreSQL driver does all three. The MySQL driver sends the termination message and releases the socket without waiting for acknowledgment. Bug: the server may process the termination after the socket is reused.
- **HTTP header encoding:** A Headers class constructor decodes raw bytes as Latin-1 per RFC 7230. The mutation method (`__setitem__`) encodes values as UTF-8. Bug: round-tripping a Latin-1 header through get-then-set corrupts the value because the encoding changed.
- **Cache invalidation:** A cache spec says "invalidation must remove the entry and notify all subscribers." The in-memory cache does both. The distributed cache removes the entry but does not broadcast the notification. Bug: other nodes serve stale data.
- **File locking:** A storage spec says "lock acquisition must set a timeout and clean up on failure." The local filesystem implementation sets the timeout. The NFS implementation uses blocking lock with no timeout. Bug: NFS lock contention can hang the process indefinitely.

### How to apply

For each core module, look for: the same operation name implemented in multiple files or classes, interface/trait implementations across different backends, protocol-version-specific implementations of the same message, transport-specific implementations of the same lifecycle operation, constructor vs. mutation implementations of the same logical operation.

For each pair (or set) of implementations:
1. Identify the specification requirement they share.
2. List the mandatory steps from the spec.
3. Check each implementation for each step.
4. Any step present in one but missing in another is a candidate requirement.
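
A minimal Python sketch of the device-reset example, with hypothetical transport classes. Walking the spec's step list against each implementation is what surfaces the missing poll; a prose summary of "both implement reset" would have hidden it.

```python
class PciTransport:
    """Spec: (1) write zero to status, (2) poll until status reads back zero."""

    def reset(self, dev):
        dev.write_status(0)            # step 1
        while dev.read_status() != 0:  # step 2: wait for completion
            pass

class MmioTransport:
    """Same spec, different transport."""

    def reset(self, dev):
        dev.write_status(0)            # step 1
        # step 2 is absent: reset can race with reinitialization on slow
        # devices. Candidate REQ: all transports must poll for completion.
```
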
**Check every cross-transport operation, not just the most obvious one.** If a codebase has multiple transports (PCI, MMIO, vDPA) or backends (PostgreSQL, MySQL), enumerate all operations that have cross-implementation equivalents — reset, interrupt handling, feature negotiation, queue setup, configuration access — and check each one. The first cross-implementation gap you find is rarely the only one. A common failure mode is analyzing reset thoroughly and then skipping interrupt dispatch, which has the same cross-transport structure.

### EXPLORATION.md output format

```
## Cross-Implementation Consistency

### [Operation name] — [spec reference]
- **Implementation A:** [function, file:line] — performs steps: [1, 2, 3]
- **Implementation B:** [function, file:line] — performs steps: [1, 3] (missing step 2)
- **Gap:** [Implementation B missing step 2: description]
- **Candidate requirements:** REQ-NNN: [all implementations of X must perform step 2]
```

---

## Pattern 4: Enumeration and Representation Completeness

### Definition

When a codebase maintains a closed set of recognized values — a switch/case whitelist, an array of valid constants, an enum/tagged-union definition, a trait/visitor method family, a set of schema keywords, a registry of accepted entries — every value that the specification, upstream definition, or the library's own public API surface says should be accepted must appear in the set. Values not in the set are silently dropped, rejected, or mishandled, and the absence of an entry is invisible at the call site.

### Bug class

Closed sets are written once and rarely revisited. When a new capability is added to the specification or upstream header, the code that defines the capability (the constant, the feature flag, the enum variant) is updated, and the code that uses the capability is updated, but the closed set that gates whether the capability survives a filtering step is forgotten. The feature appears to be supported — it's defined, it's negotiated, it's used — but it's silently stripped by a filter function that nobody remembered to update. The bug is invisible in normal testing because the feature simply doesn't activate, and the absence of activation looks like "the other end doesn't support it."

This pattern also covers **internal representations** that must mirror a public API. If a library's public API accepts i128/u128 integers but an internal buffered representation only has variants for i64/u64, values that pass through the buffer are silently truncated or rejected — even though the public API promises to handle them.

### Examples across domains

- **Feature negotiation filter:** A transport layer maintains a switch/case whitelist of feature bits that should survive filtering. A new feature (`RING_RESET`) is added to the UAPI header and used by higher-level code, but never added to the whitelist. Bug: the feature is silently cleared during negotiation, disabling a capability the driver claims to support.
- **Serialization internal representation:** A serialization library's public `Deserializer` trait supports `deserialize_i128()`/`deserialize_u128()`. An internal buffered representation (`Content` enum) used by untagged and internally-tagged enum deserialization has variants only for `I64`/`U64`. Bug: 128-bit integers that pass through the buffer are rejected with a "no variant for i128" error, even though the public API claims to support them.
- **Schema keyword importer:** A validation library imports JSON Schema documents. The spec defines `uniqueItems`, `contains`, `minContains`, `maxContains` for arrays. The importer recognizes these keywords (no parse error) but doesn't enforce them. Bug: imported schemas silently accept arrays that violate the original constraints.
- **Permission system:** An authorization middleware maintains an array of recognized permission strings. A new permission (`audit:write`) is added to the role definitions but not to the middleware's whitelist. Bug: users with the `audit:write` role are silently denied access because the middleware doesn't recognize the permission.
- **Protocol message types:** A message router maintains a switch/case dispatch for recognized message types. A new message type is added to the protocol spec and the serialization layer, but not to the router. Bug: the new message type is silently dropped by the router's default case, and the sender receives no error.

### How to apply

For each core module, look for: switch/case statements with explicit case labels and a default that drops/clears/rejects, arrays or sets of accepted values used for filtering or validation, registration functions where new entries must be added manually, enum/tagged-union definitions that mirror a specification or public API, trait/visitor method families where each method handles one variant, schema importers that must handle every keyword the spec defines, internal representations (buffers, IR, AST) that must cover the full range of the public interface.

For each closed set found:
1. **Identify the authoritative source that defines what values should be valid.** This could be: a spec, a header file, an upstream enum, a protocol definition, **or the library's own public API surface** (trait methods, function signatures, type definitions).
2. **Extract the closed set mechanically** (save the case labels, enum variants, visitor methods, array entries, or schema keywords to a file).
3. Compare the extracted set against the authoritative source. Every value in the authoritative source that is absent from the closed set is a candidate requirement.

**Caller compensation does not excuse a missing entry.** If a closed set in a shared/generic function is missing an entry, that is a bug — even if specific callers compensate by restoring the value after the function runs. The compensation is a workaround, not a fix. Any new caller that doesn't know to compensate silently inherits the bug. Report each missing entry as a finding and note which callers (if any) compensate, but do not dismiss the finding because of compensation.
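
A minimal Python sketch of the feature-negotiation example, with hypothetical bit names. Note that the passing assertion at the end is exactly the silent failure: the unlisted bit negotiates as "unsupported" rather than producing any error.

```python
# Hypothetical feature bits; RING_RESET was added to the spec later.
INDIRECT = 1 << 0
EVENT_IDX = 1 << 1
RING_RESET = 1 << 2

# The closed set that gates negotiation; RING_RESET was never added.
TRANSPORT_WHITELIST = INDIRECT | EVENT_IDX

def filter_features(offered: int) -> int:
    """Anything outside the whitelist is silently cleared."""
    return offered & TRANSPORT_WHITELIST

# The capability vanishes without any error; it just looks unsupported.
assert filter_features(INDIRECT | RING_RESET) == INDIRECT
```
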
### EXPLORATION.md output format

```
## Enumeration/Representation Completeness

### [Function/type name] at [file:line]
- **Purpose:** [what this closed set gates — e.g., "feature bits that survive transport filtering" or "integer variants the buffer can hold"]
- **Authoritative source:** [where valid values are defined — e.g., "include/uapi/linux/virtio_config.h" or "public Deserializer trait methods"]
- **Extracted entries:** [list of values in the closed set, or reference to mechanical extraction file]
- **Missing entries:** [values present in the authoritative source but absent from the closed set]
- **Candidate requirements:** REQ-NNN: [closed set must include X]
```

---

## Pattern 5: API Surface Consistency

### Definition

When the same logical operation can be performed through multiple API surfaces — direct method vs. view/wrapper, constructor vs. mutator, sync vs. async variant, primary API vs. convenience alias — all surfaces must produce equivalent observable behavior for the same input. A divergence between two paths to the same operation is a bug, because callers reasonably expect consistent behavior regardless of which surface they use.

### Bug class

Libraries often expose the same underlying data through multiple interfaces: a direct method and a collection view (`add()` vs. `asList().add()`), a constructor and a setter, a sync and async variant. These surfaces are implemented at different times, often by different developers, and their edge-case handling diverges — especially around null/sentinel values, encoding, ordering, and error reporting. The divergence is invisible in normal testing because tests typically exercise only one surface per operation.

### Examples across domains

- **JSON null handling:** `JsonArray.add(null)` converts null to `JsonNull.INSTANCE` and succeeds. `JsonArray.asList().add(null)` throws `NullPointerException` because the view's wrapper unconditionally rejects null. Bug: two methods for the same operation have contradictory null semantics.
- **HTTP header encoding:** `Headers([(b"X-Custom", b"\xe9")])` constructs a header from Latin-1 bytes. `headers["X-Custom"] = b"\xe9"` stores the value as UTF-8. Bug: round-tripping a header through get-then-set changes the encoding silently.
- **WebSocket protocol negotiation:** `WebSocketUpgrade::protocols()` returns a `BTreeSet<HeaderValue>`, which sorts and deduplicates the client's preference-ordered protocol list. Bug: the application sees a different order than the client sent, breaking preference-based negotiation.
- **Configuration option propagation:** `res.sendFile(path, { etag: false })` should disable ETag for this response. But the code converts the option to a boolean before passing to the underlying `send` module, losing the "strong" vs "weak" ETag mode. Bug: per-call ETag configuration is silently ignored or lossy-converted.
- **Map duplicate detection:** `map.put(key, value)` returns the previous value to signal duplicates. When the previous value is legitimately `null`, `put()` returns `null` — the same value it returns for "no previous entry." Bug: duplicate keys go undetected when the first value is null.

### How to apply

For each core module, look for: view/wrapper objects returned by methods like `asList()`, `asMap()`, `unmodifiableView()`, `stream()`, `iterator()`; constructor vs. mutation method pairs; sync vs. async variants of the same operation; convenience aliases that delegate to a primary implementation; methods that accept options/configuration objects.

For each pair of surfaces:
1. Identify the logical operation they share.
2. Test the same edge-case inputs on both surfaces (null, empty, boundary values, special characters, ordering-sensitive data).
3. Any divergence in behavior (different exceptions, different encoding, different ordering, one succeeds and the other fails) is a candidate requirement.
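
A minimal Python sketch of the JSON-null example, with hypothetical class names. Probing both surfaces with the same edge input (None) is what exposes the contradiction.

```python
class JsonArray:
    def __init__(self):
        self._items = []

    def add(self, value):
        # Direct surface: None is normalized to a JSON null sentinel.
        self._items.append("null" if value is None else value)

    def as_list(self):
        return _ListView(self._items)

class _ListView:
    """Wrapper surface over the same underlying data."""

    def __init__(self, items):
        self._items = items

    def add(self, value):
        if value is None:  # view surface rejects what the direct surface accepts
            raise ValueError("value must not be None")
        self._items.append(value)

arr = JsonArray()
arr.add(None)  # succeeds
try:
    arr.as_list().add(None)  # the same logical operation...
except ValueError:
    print("divergence: view surface rejects None")  # ...contradictory semantics
```
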
### EXPLORATION.md output format

```
## API Surface Consistency

### [Operation name] — [two surfaces compared]
- **Surface A:** [method, file:line] — [behavior on edge input]
- **Surface B:** [method, file:line] — [behavior on same edge input]
- **Divergence:** [what differs — exception type, encoding, ordering, null handling]
- **Candidate requirements:** REQ-NNN: [both surfaces must behave equivalently for input X]
```

---

## Pattern 6: Spec-Structured Parsing Fidelity

### Definition

When code parses values defined by a formal grammar or specification — HTTP headers, URLs, MIME types, CLI flags, JSON Schema keywords, file paths — the parsing must match the grammar's actual rules. Shortcuts (substring matching, exact equality, wrong delimiter, prefix matching without boundary checks) produce parsers that work for common inputs but fail on valid edge cases or accept invalid inputs.

### Bug class

Developers frequently implement "good enough" parsers that handle the common case: `header.contains("gzip")` instead of tokenizing by comma and trimming whitespace, `url.startsWith("/api")` instead of checking path segment boundaries, `connection == "Upgrade"` instead of case-insensitive token list membership. These shortcuts pass all unit tests because tests use well-formed inputs, but they break on real-world edge cases like `gzip;q=0` (explicitly rejected), `Connection: keep-alive, Upgrade` (token list), or `/api-docs` (prefix match without boundary).

### Examples across domains

- **HTTP Accept-Encoding:** Middleware checks `accept.contains("gzip")` to decide whether to compress. This matches `gzip;q=0` (client explicitly rejects gzip) and `xgzip` (not a valid encoding). Bug: responses are compressed when the client said not to.
- **WebSocket Connection header:** Code checks `connection == "Upgrade"` (exact match). Per RFC 7230, `Connection` is a comma-separated token list; `Connection: keep-alive, Upgrade` is valid but fails exact match. Bug: valid WebSocket upgrades are rejected.
- **SPA fallback routing:** A single-page-app handler matches paths with `path.startsWith("/app")`. This matches both `/app/users` (correct) and `/api-docs` (incorrect sibling route). Bug: API documentation requests are swallowed by the SPA handler.
- **MIME type parameter handling:** Content negotiation compares `text/html;level=1` against handler keys but strips parameters before matching. Bug: the `level=1` parameter selected during negotiation is lost from the response Content-Type.
- **URL host normalization:** Code detects internationalized domain names by checking `host.startsWith("xn--")`. Per IDNA, only individual labels start with `xn--`; `foo.xn--example.com` has the punycode label in the middle. Bug: internationalized subdomains are not decoded.

### How to apply

For each core module, look for: string comparisons on values defined by RFCs or specs (headers, URLs, MIME types, encoding names), `contains()` / `indexOf()` / `startsWith()` / `endsWith()` on structured values, case-sensitive comparisons where the spec requires case-insensitive, splitting on the wrong delimiter or not splitting at all, prefix/suffix matching without path-segment or token boundaries.

For each parser found:
1. Identify the spec that defines the grammar (RFC, ABNF, JSON Schema spec, POSIX, etc.).
2. Check whether the implementation handles: token lists (comma-separated), quoted strings, parameters (semicolon-separated), case folding, whitespace trimming, boundary conditions.
3. Construct an input that is valid per the spec but would fail the implementation's shortcut parser. That input is a candidate test case and the parsing gap is a candidate requirement.
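
A minimal Python sketch of the Accept-Encoding example, contrasting the substring shortcut with a tokenizer that respects q-values. The tokenizer is a simplified illustration, not a complete implementation of the HTTP grammar (it ignores quoted strings and wildcard codings, for instance).

```python
def accepts_gzip_shortcut(header: str) -> bool:
    return "gzip" in header  # the "good enough" parser

def accepts_gzip_tokenized(header: str) -> bool:
    """Split comma-separated codings, honor ';q=' weights; q=0 means rejected."""
    for token in header.split(","):
        coding, _, params = token.strip().partition(";")
        if coding.strip().lower() == "gzip":
            q = 1.0
            params = params.strip()
            if params.startswith("q="):
                q = float(params[2:])
            return q > 0
    return False

# Spec-valid input that breaks the shortcut: the client explicitly rejects gzip.
assert accepts_gzip_shortcut("gzip;q=0") is True       # wrong: would compress anyway
assert accepts_gzip_tokenized("gzip;q=0") is False     # correct
assert accepts_gzip_tokenized("br, gzip;q=0.5") is True
```
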
### EXPLORATION.md output format

```
## Spec-Structured Parsing

### [Parser location] at [file:line]
- **Spec:** [which grammar/RFC/standard defines the format]
- **Implementation technique:** [contains/equals/startsWith/split-on-X]
- **Spec-valid input that breaks the parser:** [concrete example]
- **Why it breaks:** [substring match includes invalid case / missing case folding / etc.]
- **Candidate requirements:** REQ-NNN: [parser must tokenize per RFC NNNN §N.N]
```

---

## Pattern 7: Composition and Mount-Context Awareness

### Definition

When code operates inside a composed context — mounted at a sub-route, nested inside a parent module, scoped to a child container, wrapped by a framework adapter — the framework typically maintains a *canonical* representation of the active state (what the active context says is true right now) alongside the *raw* representation from the outer call site (what the outer caller passed in originally). Code that reads or writes the raw representation when canonical was needed (or vice versa) works correctly at the outer level, where they happen to be identical, but fails silently under composition, where they diverge.

### Bug class

Code is written and tested at the outer level — top-level routes, root module, single-tenant deployment, default scope — where canonical and raw state are identical. When the same code runs inside a composed context, the framework updates the canonical state (mounted child path, scoped logger, transaction-scoped connection, locale-aware comparator) but the raw state still reflects the outer call. The defect manifests in two symmetric directions: a function that *reads* raw state where canonical was needed sees stale data and produces silent drift (never matches, leaks parent context, returns the wrong output); a function that *writes* an outward-facing value from canonical state where raw is needed produces output the consumer can't use (drops the mount prefix, returns a child-relative path the parent's clients can't follow). Either way, the test suite typically exercises the outer level only and never sees the divergence.

### Examples across domains

- **HTTP routing middleware (mount-context):** A middleware comparing the request path against a configured endpoint reads `r.URL.Path`. When mounted at a sub-route, the framework's canonical "active routing path" (e.g., `RoutePath` in chi, `req.url` in Express sub-app) is the child-relative path while `r.URL.Path` remains the full URL path. Bug: middleware never matches inside the mounted child because it reads the wrong path representation.
- **Database transaction context:** A repository method opens its own connection via the connection pool. When called inside an explicit transaction, the framework's canonical "current transaction" context is the explicit one, but the method reads from the connection pool directly. Bug: the method's writes don't participate in the surrounding transaction; rollback leaves orphan rows.
- **Logging context propagation:** A library logs via `logging.getLogger(__name__)`. When invoked inside an async task or worker pool that has scoped a contextvar-based correlation ID, the logger doesn't read the contextvar. Bug: the library's log lines lack the correlation ID the framework was propagating, breaking traceability.
- **Locale-sensitive comparison:** A sort function uses `str.lower()` for case-insensitive comparison. When called inside a locale-aware context (Turkish "i" / "İ" / "ı" semantics), the framework's canonical locale is set but `str.lower()` reads the default locale. Bug: equality comparisons silently differ depending on which locale is canonically active.
- **Authorization scope inheritance:** An ACL check reads `request.user` (the raw authenticated principal). When invoked inside an impersonation context, the framework's canonical `request.effective_user` is the impersonated principal but the check still reads the original. Bug: privilege escalation — the check authorizes the wrong principal.

### How to apply

Identify every function or component that reads or writes state that *can be canonical-vs-raw under composition*. The check is: does this code path run unchanged when its caller is composed inside a larger context, and if so, does the state it observes (or produces) change accordingly?

**Disambiguation from Pattern 4.** Pattern 4 (Enumeration and Representation Completeness) is about closed sets of values: the bug is "value missing from the recognizer's closed set." Pattern 7 is about choice of state variable: the bug is "function reads or writes the wrong representation of state under composition." If both frames seem to apply, prefer the one whose REQ is more testable. The two patterns rarely overlap on the same defect; when they do, the canonical-vs-raw framing usually points more directly at the fix.

**Budget.** Cap candidates at 3-5 highest-impact composition seams per pass. If more than 10 candidates emerge from this pattern alone, the net is too wide and the pattern is being over-applied — revisit Step 1 with a tighter "what does this framework actually maintain canonically under composition" filter.

For each candidate found:

1. **Identify the canonical and raw representations, both for reads and writes.** What does the framework maintain as "the active state for this concern" under composition? What does the function actually read? Then ask the symmetric question for outputs: when this function constructs a value that flows outward (a redirect target, a derived path, a logged correlation ID, an authorized resource handle), is that value being built from the right representation for its consumer? Read-side and write-side defects are equally common; check both directions.
2. **Trace the composition seam.** Where does the framework update canonical state? Is the function downstream of that update site? Does it read from the canonical or the raw representation?
3. **Construct the composition test.** What is the smallest example where this code runs inside a composed parent (mounted router, nested transaction, scoped logger, impersonation context)? Does the function's behavior match the outer-level behavior, or does it silently drift?
4. **Record what happens.** A function that drifts under composition is a candidate requirement: "function `<X>` MUST read `<canonical_state>` (not `<raw_state>`) [or write `<raw_state>` for outward-facing output] so that behavior remains correct when composed inside `<parent_context>`."
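
A minimal Python sketch of the routing example, with hypothetical names standing in for a real framework's request object. The drift only appears when the middleware runs mounted, which is why outer-level tests never catch it.

```python
def make_health_middleware(endpoint: str):
    def middleware(request: dict, next_handler):
        # BUG: reads the raw URL path. Under mounting, the framework's
        # canonical route path is child-relative ("/ping") while the raw
        # path keeps the mount prefix ("/api/ping"), so this never matches.
        if request["raw_path"] == endpoint:
            return "200 OK"
        return next_handler(request)
    return middleware

mw = make_health_middleware("/ping")

top_level = {"raw_path": "/ping", "route_path": "/ping"}
mounted = {"raw_path": "/api/ping", "route_path": "/ping"}

print(mw(top_level, lambda r: "404"))  # "200 OK": works where it was tested
print(mw(mounted, lambda r: "404"))    # "404": silent drift under composition
```
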
Common composition seams worth checking explicitly:

- **Routing frameworks:** `RoutePath` / `req.url` / `request.path_info` versus the original URL. Each level of mounting updates the canonical path; raw URLs do not.
- **Transaction managers:** explicit transaction context versus connection pool's auto-commit default. Composed code must read the active transaction.
- **Logging / tracing:** contextvar-scoped correlation IDs versus thread-local or default loggers. Composed code must read the contextvar.
- **Authorization / impersonation:** effective principal versus raw user. Composed code must read the effective principal.

### EXPLORATION.md output format

```
## Composition and Mount-Context Analysis

### [Function/component name] at [file:line]
- **Composes inside:** [parent context — e.g., "any chi router that calls Mount() to attach this middleware"]
- **Canonical representation:** [what the framework maintains under composition — e.g., "rctx.RoutePath, the active mounted path"]
- **Raw representation read by this code:** [what the function actually reads — e.g., "r.URL.Path, the full request URL"]
- **Drift scenario:** [smallest composition example that exposes the divergence — e.g., "router.Mount("/api", child) where child uses this middleware to serve /ping"]
- **Observable failure:** [what wrong behavior results — e.g., "404 instead of the expected response"]
- **Candidate requirements:** REQ-NNN: [function MUST read canonical state under composition]
```

---

## Extending This List

These patterns were derived from analyzing 56 confirmed bugs across 11 open-source repositories spanning 7 languages. Each pattern represents a class of requirements that broad architectural summaries consistently miss.

To add a new pattern:
1. Identify a confirmed bug that was missed by exploration but would have been found with a specific analysis technique.
2. Generalize the technique: what question should the explorer have asked about the code?
3. Provide at least 5 diverse examples from different domains (not all from the same project).
4. Define the expected output format for EXPLORATION.md.
5. Add the pattern to this file and add the corresponding section to the EXPLORATION.md template in SKILL.md.

The goal is a library of systematic exploration techniques that accumulate over time as new bug classes are discovered.

@@ -32,94 +32,26 @@ For a medium-sized project (5–15 source files), this typically yields 35–50

Before writing any test code, read 2–3 existing test files and identify how they import project modules. This is critical — projects handle imports differently and getting it wrong means every test fails with resolution errors.

Identify the import convention used in the project. Whatever pattern the existing tests use, copy it exactly. Do not guess or invent a different pattern.

Common patterns by language:

**Python:**
- `sys.path.insert(0, "src/")` then bare imports (`from module import func`)
- Package imports (`from myproject.module import func`)
- Relative imports with conftest.py path manipulation

**Java:**
- `import com.example.project.Module;` matching the package structure
- Test source root must mirror main source root

**Scala:**
- `import com.example.project._` or `import com.example.project.{ClassA, ClassB}`
- SBT project layout: `src/test/scala/` mirrors `src/main/scala/`

**TypeScript/JavaScript:**
- `import { func } from '../src/module'` with relative paths
- Path aliases from `tsconfig.json` (e.g., `@/module`)

**Go:**
- Same package: test files in the same directory with `package mypackage`
- Black-box testing: `package mypackage_test` with explicit imports
- Internal packages may require specific import paths

**Rust:**
- `use crate::module::function;` for unit tests in the same crate
- `use myproject::module::function;` for integration tests in `tests/`

Whatever pattern the existing tests use, copy it exactly. Do not guess or invent a different pattern.

- **Python:** `sys.path.insert(0, "src/")` then bare imports; package imports (`from myproject.module import func`); relative imports with conftest.py path manipulation
- **Go:** Same-package tests (`package mypackage`) give access to unexported identifiers; black-box tests (`package mypackage_test`) test only exported API; internal packages may require specific import paths
- **Java:** `import com.example.project.Module;` matching the package structure; test source root must mirror main source root
- **TypeScript:** `import { func } from '../src/module'` with relative paths; path aliases from `tsconfig.json` (e.g., `@/module`)
- **Rust:** `use crate::module::function;` for unit tests in the same crate; `use myproject::module::function;` for integration tests in `tests/`
- **Scala:** `import com.example.project._` or `import com.example.project.{ClassA, ClassB}`; SBT layout mirrors `src/main/scala/` in `src/test/scala/`

## Create Test Setup BEFORE Writing Tests

Every test framework has a mechanism for shared setup. If your tests use shared fixtures or test data, you MUST create the setup file before writing tests. Test frameworks do not auto-discover fixtures from other directories.

**By language:**

**Python (pytest):** Create `quality/conftest.py` defining every fixture. Fixtures in `tests/conftest.py` are NOT available to `quality/test_functional.py`. Preferred: write tests that create data inline using `tmp_path` to eliminate conftest dependency.

**Java (JUnit):** Use `@BeforeEach`/`@BeforeAll` methods in the test class, or create a shared `TestFixtures` utility class in the same package.

**Scala (ScalaTest):** Mix in a trait with `before`/`after` blocks, or use inline data builders. If using SBT, ensure the test file is in the correct source tree.

**TypeScript (Jest):** Use `beforeAll`/`beforeEach` in the test file, or create a `quality/testUtils.ts` with factory functions.

**Go (testing):** Helper functions in the same `_test.go` file with `t.Helper()`. Use `t.TempDir()` for temporary directories. Go convention strongly prefers inline setup — avoid shared test state.

**Rust (cargo test):** Helper functions in a `#[cfg(test)] mod tests` block or a `test_utils.rs` module. Use builder patterns for constructing test data. For integration tests, place files in `tests/`.

Identify your framework's setup mechanism (fixtures, `@BeforeEach`, `beforeAll`, helper functions, builder patterns, etc.) and follow the conventions already used in the project's existing tests.

**Rule: Every fixture or test helper referenced must be defined.** If a test depends on shared setup that doesn't exist, the test errors during setup rather than failing at an assertion: it never exercises the code under test, producing a broken test instead of a meaningful result.

**Preferred approach across all languages:** Write tests that create their own data inline. This eliminates cross-file dependencies:

```python
# Python
def test_config_validation(tmp_path):
    config = {"pipeline": {"name": "Test", "steps": [...]}}
```

```java
// Java
@Test
void testConfigValidation(@TempDir Path tempDir) {
    var config = Map.of("pipeline", Map.of("name", "Test"));
}
```

```typescript
// TypeScript
test('config validation', () => {
  const config = { pipeline: { name: 'Test', steps: [] } };
});
```

```go
// Go
func TestConfigValidation(t *testing.T) {
    tmpDir := t.TempDir()
    config := Config{Pipeline: Pipeline{Name: "Test"}}
}
```

```rust
// Rust
#[test]
fn test_config_validation() {
    let config = Config { pipeline: Pipeline { name: "Test".into() } };
}
```


Whatever the language, create the test data directly in each test function using the framework's temporary directory support and literal data structures.

**After writing all tests, run the test suite and check for setup errors.** Setup errors (fixture not found, import failures) count as broken tests regardless of how the framework categorizes them.

@@ -133,14 +65,14 @@ If you genuinely cannot write a meaningful test for a defensive pattern (e.g., i

Before writing a single test, build a function call map. For every function you plan to test:

1. **Read the function/method signature** — not just the name, but every parameter, its type, and default value. In Python, read the `def` line and type hints. In Java, read the method signature and generics. In Scala, read the method definition and implicit parameters. In TypeScript, read the type annotations.
2. **Read the documentation** — docstrings, Javadoc, TSDoc, ScalaDoc. They often specify return types, exceptions, and edge case behavior.
3. **Read one existing test that calls it** — existing tests show you the exact calling convention, fixture shape, and assertion pattern.
4. **Read real data files** — if the function processes configs, schemas, or data files, read an actual file from the project. Your test fixtures must match this shape exactly.

**Common failure pattern:** The agent explores the architecture, understands conceptually what a function does, then writes a test call with guessed parameters. The test fails because the real function takes `(config, items_data, limit)` not `(items, seed, strategy)`. Reading the actual signature takes 5 seconds and prevents this entirely.
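
One cheap way to do that in Python is to print the real signature before writing the call (a sketch; `myproject.pipeline.process` is a hypothetical import):

```python
import inspect
from myproject.pipeline import process  # hypothetical module under test

# Prints the actual parameter list, e.g. "(config, items_data, limit=100)"
print(inspect.signature(process))
```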

**Library version awareness:** Check the project's dependency manifest (`requirements.txt`, `build.sbt`, `package.json`, `pom.xml`, `build.gradle`, `Cargo.toml`) to verify what's available. Use the test framework's skip mechanism for optional dependencies: Python `pytest.importorskip()`, JUnit `Assumptions.assumeTrue()`, ScalaTest `assume()`, Jest conditional `describe.skip`, Go `t.Skip()`, Rust `#[ignore]` with a comment explaining the prerequisite.
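
The pytest variant, as a sketch (here `yaml` stands in for whatever optional dependency the project declares):

```python
import pytest

# Skips every test in this module if PyYAML is not installed
yaml = pytest.importorskip("yaml")
```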

## Writing Spec-Derived Tests

@@ -151,68 +83,9 @@ Each test should:

2. **Execute** — Call the function, run the pipeline, make the request
3. **Assert specific properties** the spec requires

```python
# Python (pytest)
class TestSpecRequirements:
    def test_requirement_from_spec_section_N(self, fixture):
        """[Req: formal — Design Doc §N] X should produce Y."""
        result = process(fixture)
        assert result.property == expected_value
```

Each test should include a traceability annotation (via docstring, display name, or comment) citing the spec section it verifies, e.g., `[Req: formal — Design Doc §N] X should produce Y`.

```java
// Java (JUnit 5)
class SpecRequirementsTest {
    @Test
    @DisplayName("[Req: formal — Design Doc §N] X should produce Y")
    void testRequirementFromSpecSectionN() {
        var result = process(fixture);
        assertEquals(expectedValue, result.getProperty());
    }
}
```

```scala
// Scala (ScalaTest)
class SpecRequirements extends FlatSpec with Matchers {
  // [Req: formal — Design Doc §N] X should produce Y
  "Section N requirement" should "produce Y from X" in {
    val result = process(fixture)
    result.property should equal (expectedValue)
  }
}
```

```typescript
// TypeScript (Jest)
describe('Spec Requirements', () => {
  test('[Req: formal — Design Doc §N] X should produce Y', () => {
    const result = process(fixture);
    expect(result.property).toBe(expectedValue);
  });
});
```

```go
// Go (testing)
func TestSpecRequirement_SectionN_XProducesY(t *testing.T) {
    // [Req: formal — Design Doc §N] X should produce Y
    result := Process(fixture)
    if result.Property != expectedValue {
        t.Errorf("expected %v, got %v", expectedValue, result.Property)
    }
}
```

```rust
// Rust (cargo test)
#[test]
fn test_spec_requirement_section_n_x_produces_y() {
    // [Req: formal — Design Doc §N] X should produce Y
    let result = process(&fixture);
    assert_eq!(result.property, expected_value);
}
```

## What Makes a Good Functional Test

@@ -226,72 +99,9 @@ fn test_spec_requirement_section_n_x_produces_y()

If the project handles multiple input types, cross-variant coverage is where silent bugs hide. Aim for roughly 30% of tests exercising all variants — the exact percentage matters less than ensuring every cross-cutting property is tested across all variants.

Use your framework's parametrization mechanism (e.g., `@pytest.mark.parametrize`, `@ParameterizedTest`, `test.each`, table-driven tests, iterating over cases) to run the same assertion logic across all variants.

```python
# Python (pytest)
@pytest.mark.parametrize("variant", [variant_a, variant_b, variant_c])
def test_feature_works(variant):
    output = process(variant.input)
    assert output.has_expected_property
```

```java
// Java (JUnit 5)
@ParameterizedTest
@MethodSource("variantProvider")
void testFeatureWorks(Variant variant) {
    var output = process(variant.getInput());
    assertTrue(output.hasExpectedProperty());
}
```

```scala
// Scala (ScalaTest)
Seq(variantA, variantB, variantC).foreach { variant =>
  it should s"work for ${variant.name}" in {
    val output = process(variant.input)
    output should have ('expectedProperty (true))
  }
}
```

```typescript
// TypeScript (Jest)
test.each([variantA, variantB, variantC])(
  'feature works for %s', (variant) => {
    const output = process(variant.input);
    expect(output).toHaveProperty('expectedProperty');
  });
```

```go
// Go (testing) — table-driven tests
func TestFeatureWorksAcrossVariants(t *testing.T) {
    variants := []Variant{variantA, variantB, variantC}
    for _, v := range variants {
        t.Run(v.Name, func(t *testing.T) {
            output := Process(v.Input)
            if !output.HasExpectedProperty() {
                t.Errorf("variant %s: missing expected property", v.Name)
            }
        })
    }
}
```

```rust
// Rust (cargo test) — iterate over cases
#[test]
fn test_feature_works_across_variants() {
    let variants = [variant_a(), variant_b(), variant_c()];
    for v in &variants {
        let output = process(&v.input);
        assert!(output.has_expected_property(),
            "variant {}: missing expected property", v.name);
    }
}
```

If parametrization doesn't fit, loop explicitly within a single test.
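
A minimal sketch of that fallback, reusing the hypothetical variant objects from the examples above:

```python
# Python — explicit loop when parametrization doesn't fit
def test_feature_works_across_variants():
    for variant in (variant_a, variant_b, variant_c):
        output = process(variant.input)
        assert output.has_expected_property, f"variant {variant.name} failed"
```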

@@ -312,68 +122,15 @@ These patterns look like tests but don't catch real bugs:

### The Exception-Catching Anti-Pattern in Detail

```java
// Java — WRONG: tests the validation mechanism
@Test
void testBadValueRejected() {
    fixture.setField("invalid"); // Schema rejects this!
    assertThrows(ValidationException.class, () -> process(fixture));
    // Tells you nothing about output
}

// Java — RIGHT: tests the requirement
@Test
void testBadValueNotInOutput() {
    fixture.setField(null); // Schema accepts null for Optional
    var output = process(fixture);
    assertFalse(output.contains(badProperty)); // Bad data absent
    assertTrue(output.contains(expectedType)); // Rest still works
}
```

```scala
// Scala — WRONG: tests the decoder, not the requirement
"bad value" should "be rejected" in {
  val input = fixture.copy(field = "invalid") // Circe decoder fails!
  a [DecodingFailure] should be thrownBy process(input)
  // Tells you nothing about output
}

// Scala — RIGHT: tests the requirement
"missing optional field" should "not produce bad output" in {
  val input = fixture.copy(field = None) // Option[String] accepts None
  val output = process(input)
  output should not contain badProperty // Bad data absent
  output should contain (expectedType) // Rest still works
}
```

```typescript
// TypeScript — WRONG: tests the validation mechanism
test('bad value rejected', () => {
  fixture.field = 'invalid'; // Zod schema rejects this!
  expect(() => process(fixture)).toThrow(ZodError);
  // Tells you nothing about output
});

// TypeScript — RIGHT: tests the requirement
test('bad value not in output', () => {
  fixture.field = undefined; // Schema accepts undefined for optional
  const output = process(fixture);
  expect(output).not.toContain(badProperty); // Bad data absent
  expect(output).toContain(expectedType); // Rest still works
});
```

```python
# Python — WRONG: tests the validation mechanism
def test_bad_value_rejected(fixture):
    fixture.field = "invalid"  # Schema rejects this!
    with pytest.raises(ValidationError):
        process(fixture)
    # Tells you nothing about output

# Python — RIGHT: tests the requirement
def test_bad_value_not_in_output(fixture):
    fixture.field = None  # Schema accepts None for Optional
    output = process(fixture)
    assert bad_property not in output  # Bad data absent
    assert expected_type in output  # Rest still works
```

```go
// Go — WRONG: tests the error, not the outcome
func TestBadValueRejected(t *testing.T) {
    fixture.Field = "invalid" // Validator rejects this!
    _, err := Process(fixture)
    if err == nil { t.Fatal("expected error") }
    // Tells you nothing about output
}

// Go — RIGHT: tests the requirement
func TestBadValueNotInOutput(t *testing.T) {
    fixture.Field = "" // Zero value is valid
    output, err := Process(fixture)
    if err != nil { t.Fatalf("unexpected error: %v", err) }
    if containsBadProperty(output) { t.Error("bad data should be absent") }
    if !containsExpectedType(output) { t.Error("expected data should be present") }
}
```

```rust
// Rust — WRONG: tests the error, not the outcome
#[test]
fn test_bad_value_rejected() {
    let input = Fixture { field: "invalid".into(), ..Default::default() };
    assert!(process(&input).is_err()); // Tells you nothing about output
}

// Rust — RIGHT: tests the requirement
#[test]
fn test_bad_value_not_in_output() {
    let input = Fixture { field: None, ..Default::default() }; // Option accepts None
    let output = process(&input).expect("should succeed");
    assert!(!output.contains(bad_property)); // Bad data absent
    assert!(output.contains(expected_type)); // Rest still works
}
```

The pattern is the same in every language: don't test that the validation mechanism rejects bad input — test that the system produces correct output when given edge-case input the schema accepts. The WRONG approach tests the implementation (the validator); the RIGHT approach tests the requirement (the output).

Always check your Step 5b schema map before choosing mutation values.

@@ -428,154 +152,20 @@ Ask: "What does the *spec* say should happen?" The spec says "invalid data shoul

## Fitness-to-Purpose Scenario Tests

For each scenario in QUALITY.md, write a test. This is a 1:1 mapping. Each test should include a traceability annotation citing the scenario, e.g., `[Req: formal — QUALITY.md Scenario 1]`, and be named to match the scenario's memorable name.

```scala
// Scala (ScalaTest)
class FitnessScenarios extends FlatSpec with Matchers {
  // [Req: formal — QUALITY.md Scenario 1]
  "Scenario 1: [Name]" should "prevent [failure mode]" in {
    val result = process(fixture)
    result.property should equal (expectedValue)
  }
}
```

```python
# Python (pytest)
class TestFitnessScenarios:
    """Tests for fitness-to-purpose scenarios from QUALITY.md."""

    def test_scenario_1_memorable_name(self, fixture):
        """[Req: formal — QUALITY.md Scenario 1] [Name].

        Requirement: [What the code must do].
        """
        result = process(fixture)
        assert condition_that_prevents_the_failure
```

```java
// Java (JUnit 5)
class FitnessScenariosTest {
    @Test
    @DisplayName("[Req: formal — QUALITY.md Scenario 1] [Name]")
    void testScenario1MemorableName() {
        var result = process(fixture);
        assertTrue(conditionThatPreventsFailure(result));
    }
}
```

```typescript
// TypeScript (Jest)
describe('Fitness Scenarios', () => {
  test('[Req: formal — QUALITY.md Scenario 1] [Name]', () => {
    const result = process(fixture);
    expect(conditionThatPreventsFailure(result)).toBe(true);
  });
});
```

```go
// Go (testing)
func TestScenario1_MemorableName(t *testing.T) {
    // [Req: formal — QUALITY.md Scenario 1] [Name]
    // Requirement: [What the code must do]
    result := Process(fixture)
    if !conditionThatPreventsFailure(result) {
        t.Error("scenario 1 failed: [describe expected behavior]")
    }
}
```

```rust
// Rust (cargo test)
#[test]
fn test_scenario_1_memorable_name() {
    // [Req: formal — QUALITY.md Scenario 1] [Name]
    // Requirement: [What the code must do]
    let result = process(&fixture);
    assert!(condition_that_prevents_the_failure(&result));
}
```

## Boundary and Negative Tests

One test per defensive pattern from Step 5. Each test should include a traceability annotation citing the defensive pattern, e.g., `[Req: inferred — from function_name() guard] guards against X`.

For each boundary test:

1. Mutate input to trigger the defensive code path (using a value the schema accepts)
2. Process the mutated input
3. Assert graceful handling — the result is valid despite the edge-case input

```typescript
// TypeScript (Jest)
describe('Boundaries and Edge Cases', () => {
  test('[Req: inferred — from functionName() guard] guards against X', () => {
    const input = { ...validFixture, field: null };
    const result = process(input);
    expect(result).not.toContainBadOutput();
  });
});
```
```python
# Python (pytest)
class TestBoundariesAndEdgeCases:
    """Tests for boundary conditions, malformed input, error handling."""

    def test_defensive_pattern_name(self, fixture):
        """[Req: inferred — from function_name() guard] guards against X."""
        # Mutate to trigger defensive code path
        # Assert graceful handling
```

```java
// Java (JUnit 5)
class BoundariesAndEdgeCasesTest {
    @Test
    @DisplayName("[Req: inferred — from methodName() guard] guards against X")
    void testDefensivePatternName() {
        fixture.setField(null); // Trigger defensive code path
        var result = process(fixture);
        assertNotNull(result); // Assert graceful handling
        assertFalse(result.containsBadData());
    }
}
```

```scala
// Scala (ScalaTest)
class BoundariesAndEdgeCases extends FlatSpec with Matchers {
  // [Req: inferred — from methodName() guard]
  "defensive pattern: methodName()" should "guard against X" in {
    val input = fixture.copy(field = None) // Trigger defensive code path
    val result = process(input)
    result shouldBe defined
    result.get should not contain badData
  }
}
```

```go
// Go (testing)
func TestDefensivePattern_FunctionName_GuardsAgainstX(t *testing.T) {
    // [Req: inferred — from FunctionName() guard] guards against X
    input := defaultFixture()
    input.Field = nil // Trigger defensive code path
    result, err := Process(input)
    if err != nil {
        t.Fatalf("expected graceful handling, got: %v", err)
    }
    // Assert result is valid despite edge-case input
}
```

```rust
// Rust (cargo test)
#[test]
fn test_defensive_pattern_function_name_guards_against_x() {
    // [Req: inferred — from function_name() guard] guards against X
    let input = Fixture { field: None, ..default_fixture() };
    let result = process(&input).expect("expected graceful handling");
    // Assert result is valid despite edge-case input
}
```

Use your Step 5b schema map when choosing mutation values. Every mutation must use a value the schema accepts.

@@ -0,0 +1,191 @@

# Iteration Mode Reference

> This file contains the detailed instructions for each iteration strategy.
> The agent reads this file when running an iteration — all operational detail lives here,
> not in the prompt or in the benchmark runner.

## Iteration cycle

The recommended iteration order is: **gap → unfiltered → parity → adversarial**. Each strategy finds different bug classes, and running them in this order maximizes cumulative yield. After each iteration, the skill prints a suggested prompt for the next strategy — follow the cycle until you hit diminishing returns or decide to stop.

```
Baseline run                                           # structured three-stage exploration
→ gap          scan previous coverage, explore gaps    # finds bugs in uncovered subsystems
→ unfiltered   pure domain-driven, no structure        # finds bugs that structure suppresses
→ parity       cross-path comparison and diffing       # finds inconsistencies between parallel implementations
→ adversarial  challenge dismissed/demoted findings    # recovers Type II errors from previous triage
```

## Shared rules for all strategies

These rules apply to every iteration strategy:

1. **ITER file naming.** Write findings to `quality/EXPLORATION_ITER{N}.md` — check which iteration files already exist and use the next number (e.g., `EXPLORATION_ITER2.md` for the first iteration, `EXPLORATION_ITER3.md` for the second).
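
   For example, the next number can be computed mechanically (a sketch, assuming the `quality/` layout described above):

   ```python
   import re
   from pathlib import Path

   nums = [int(m.group(1))
           for p in Path("quality").glob("EXPLORATION_ITER*.md")
           if (m := re.search(r"ITER(\d+)\.md$", p.name))]
   next_n = max(nums, default=1) + 1  # first iteration file is EXPLORATION_ITER2.md
   ```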

2. **Do NOT delete or archive quality/.** You are building on the existing run, not replacing it.

3. **Context budget discipline.** A first-run EXPLORATION.md can be 200–400 lines. Loading it all into context before starting your own exploration leaves too little room for deep investigation. The previous-run scan should consume ~20–30 lines of context. Targeted deep-reads should consume ~40–60 lines total. This leaves the bulk of your context budget for new exploration.

4. **Merge.** After completing the strategy-specific exploration, create or update `quality/EXPLORATION_MERGED.md` that combines findings from ALL iterations. For each section, concatenate the findings with clear attribution (`[Iteration 1]` / `[Iteration 2: gap]` / `[Iteration 3: unfiltered]` / etc.). Include the strategy name in the attribution so downstream phases can see which approach surfaced each finding. The Candidate Bugs section should be re-consolidated from all findings across all iterations. If `EXPLORATION_MERGED.md` already exists from a previous iteration, merge the new iteration's findings into it rather than starting from scratch.

   **Demoted Candidates Manifest (mandatory in EXPLORATION_MERGED.md).** After re-consolidating the Candidate Bugs section, add or update a `## Demoted Candidates` section at the end of EXPLORATION_MERGED.md. This section tracks findings that were dismissed, demoted, or deprioritized during any iteration — they are the raw material for the adversarial strategy. For each demoted candidate, record:

   ```
   ### DC-NNN: [short title]
   - **Source:** [which iteration and strategy first surfaced this]
   - **Dismissal reason:** [why it was demoted — e.g., "classified as design choice," "insufficient evidence," "needs runtime confirmation"]
   - **Code location:** [file:line references]
   - **Re-promotion criteria:** [specific evidence that would flip this to a confirmed candidate — e.g., "show that the permissive behavior violates a documented contract," "trace the code path to prove the edge case is reachable," "demonstrate that the output differs from what the spec requires"]
   - **Status:** DEMOTED | RE-PROMOTED [iteration] | FALSE POSITIVE [iteration]
   ```

   The re-promotion criteria are the most important field — they tell the adversarial strategy exactly what evidence to gather. Vague criteria like "needs more investigation" are not acceptable; write criteria that a different agent session could act on without additional context. If a subsequent iteration re-promotes or definitively falsifies a demoted candidate, update its status and add a note explaining the resolution.

5. **Continue with Phases 2–6.** Use `EXPLORATION_MERGED.md` as the primary input for Phase 2 artifact generation. All downstream artifacts (REQUIREMENTS.md, code review, spec audit) should reference the merged exploration.

   **TDD is mandatory for iteration runs (v1.3.49).** Iteration runs must execute the full TDD red-green cycle for every newly confirmed bug, exactly as baseline runs do. This means: for each new BUG-NNN confirmed in this iteration, create a regression test patch, run it against unpatched code to produce `quality/results/BUG-NNN.red.log`, and if a fix patch exists, run it against patched code to produce `quality/results/BUG-NNN.green.log`. The TDD Log Closure Gate in Phase 5 applies equally to iteration runs — missing log files will cause quality_gate.sh to FAIL. Do not skip TDD because this is "just an iteration" or because prior bugs already have logs. New bugs need new logs. If the test runner is not available for the project's language, create the log file with `NOT_RUN` on the first line and an explanation — the file must still exist.

6. **Iteration mode completion gate.** Before proceeding to Phase 2 (applies to all strategies):
   - `quality/ITERATION_PLAN.md` exists and names the strategy used
   - `quality/EXPLORATION_ITER{N}.md` exists for this iteration with at least 80 lines of substantive content
   - `quality/EXPLORATION_MERGED.md` exists and contains findings from all iterations
   - The merged Candidate Bugs section has at least 2 new candidates not present in previous iterations
   - At least 1 finding covers a code area not explored in previous iterations OR re-confirms a previously dismissed finding with fresh evidence

7. **Suggested next iteration.** At the end of Phase 6, after writing the final PROGRESS.md summary, print a suggested prompt for the next iteration strategy in the cycle. If the current strategy was:
   - **gap** → suggest: `Run the next iteration of the quality playbook using the unfiltered strategy.`
   - **unfiltered** → suggest: `Run the next iteration of the quality playbook using the parity strategy.`
   - **parity** → suggest: `Run the next iteration of the quality playbook using the adversarial strategy.`
   - **adversarial** → suggest: `Run the quality playbook from scratch.` (cycle complete)
   - **baseline (no strategy)** → suggest: `Run the next iteration of the quality playbook using the gap strategy.`

Format the suggestion clearly so the user can copy-paste it:

```
────────────────────────────────────────────────────────
Next iteration suggestion:
"Run the next iteration of the quality playbook using the [strategy] strategy."
────────────────────────────────────────────────────────
```

## Meta-strategy: `all` — run every strategy in sequence

The `all` strategy is a runner-level convenience that executes gap → unfiltered → parity → adversarial in order, each as a separate agent session. A single agent session cannot run multiple strategies (context budget), so `all` is implemented by the orchestrator agent or benchmark runner as a loop of iteration calls. If any strategy finds zero new bugs, stop early (diminishing returns).

Usage (orchestrator agent): "Run all iterations" — the agent runs gap → unfiltered → parity → adversarial sequentially.
Usage (benchmark runner): `python3 bin/run_playbook.py --next-iteration --strategy all <targets>` (benchmark tooling, not shipped with the skill). `--strategy` also accepts a comma-separated ordered subset, e.g. `--strategy unfiltered,parity,adversarial`.

---

## Strategy: `gap` (default) — find what the previous run missed

Scan the previous run's coverage and deliberately explore elsewhere. Best when the first run was structurally sound but only covered a subset of the codebase.

1. **Coverage scan (lightweight).** Read the previous `quality/EXPLORATION.md` using a divide-and-conquer strategy — do NOT load the entire file into context at once. Instead:
   - Read just the section headers and first 2–3 lines of each section to build a coverage map
   - For each section, record: section name, subsystems covered, number of findings, depth level (shallow = single-function mentions, deep = multi-function traces)
   - Write the coverage map to `quality/ITERATION_PLAN.md`

2. **Gap identification.** From the coverage map, identify:
   - Subsystems or modules that were not explored at all
   - Sections with shallow findings (few lines, single-function mentions, no code-path traces)
   - Quality Risks scenarios that were listed but never traced to specific code
   - Pattern deep dives that could apply but weren't selected (from the applicability matrix)
   - Domain-knowledge questions from Step 6 that weren't addressed

3. **Targeted deep-read.** For only the 2–3 thinnest or most gap-rich sections, read the full section content from the previous EXPLORATION.md. This gives you specific context about what was already found without consuming your entire context budget on previous findings.

4. **Gap exploration.** Run a focused Phase 1 exploration targeting only the identified gaps. Use the same three-stage approach (open exploration → quality risks → selected patterns) but scoped to the uncovered areas. Write findings to `quality/EXPLORATION_ITER{N}.md` using the same template structure.

---

## Strategy: `unfiltered` — pure domain-driven exploration without structural constraints

Ignore the three-stage gated structure entirely. Explore the codebase the way an experienced developer would — reading code, following hunches, tracing suspicious paths — with no pattern templates, applicability matrices, or section format requirements. This strategy deliberately removes the structural scaffolding to let domain expertise drive discovery without constraint.

**Why this strategy exists:** In benchmarking, the unfiltered domain-driven approach used in skill versions v1.3.25–v1.3.26 found bugs that the structured three-stage approach consistently missed, particularly in web frameworks and HTTP libraries. The structured approach excels at systematic coverage but can over-constrain exploration, causing the model to spend context on format compliance rather than deep code reading. The unfiltered strategy recovers that lost discovery power.

1. **Lightweight previous-run scan.** Read just the `## Candidate Bugs for Phase 2` section and `quality/BUGS.md` from the previous run to know what was already found. Do NOT read the full EXPLORATION.md — you want a fresh perspective, not to be anchored by previous exploration paths. Write a brief note to `quality/ITERATION_PLAN.md` listing what the previous run found and confirming you are using the unfiltered strategy.

2. **Unfiltered exploration.** Explore the codebase from scratch using pure domain knowledge. No required sections, no pattern applicability matrix, no gate self-check. Instead:
   - Read source code deeply — entry points, hot paths, error handling, edge cases
   - Follow your domain expertise: "What would an expert in [this domain] find suspicious?"
   - For each suspicious finding, trace the code path across 2+ functions with file:line citations
   - Generate bug hypotheses directly — not "areas to investigate" but "this specific code at file:line produces wrong behavior because [reason]"
   - Write findings to `quality/EXPLORATION_ITER{N}.md` as a flat list of findings, each with file:line references and a bug hypothesis. No structural template required — depth and specificity matter, not section formatting.
   - Minimum: 10 concrete findings with file:line references, at least 5 of which trace code paths across 2+ functions

3. **Domain-knowledge questions.** Complete these questions using the code you just explored AND your domain knowledge. Write your answers inline with your findings, not in a separate gated section:
   - What API surface inconsistencies exist between similar methods?
   - Where does the code do ad-hoc string parsing of structured formats?
   - What inputs would a domain expert try that a developer might not test?
   - What metadata or configuration values could be silently wrong?

---

## Strategy: `parity` — cross-path comparison and diffing

Systematically enumerate parallel implementations of the same contract and diff them for inconsistencies. This strategy finds bugs by comparing code paths that should behave the same way but don't.

**Why this strategy exists:** In benchmarking, the v1.3.40 skill version found 5 bugs in virtio using "fallback path parity" and "cross-implementation consistency" as explicit exploration patterns. Three of those bugs (MSI-X slow_virtqueues reattach, GFP_KERNEL under spinlock, INTx admin queue_idx) were found by lining up parallel code paths and spotting differences — not by exploring individual subsystems. The gap, unfiltered, and adversarial strategies all explore areas or challenge decisions, but none explicitly compare parallel paths. This strategy fills that gap.

1. **Enumerate parallel paths.** Scan the codebase for groups of code that implement the same contract or handle the same logical operation via different paths. Common categories:
   - **Transport/backend variants:** multiple implementations of the same interface (e.g., PCI vs MMIO vs vDPA, sync vs async, HTTP/1.1 vs HTTP/2)
   - **Fallback chains:** primary path → fallback → last-resort (e.g., MSI-X → shared → INTx, rich error → generic error)
   - **Setup vs teardown/reset:** initialization path vs cleanup/reset path for the same resource
   - **Happy path vs error path:** normal flow vs exception/error handling for the same operation
   - **Public API variants:** overloaded methods, convenience wrappers, format-specific parsers that should produce equivalent results
   - Write the enumeration to `quality/ITERATION_PLAN.md` with a brief description of each parallel group.

2. **Pairwise comparison.** For each parallel group, read the code paths side by side and systematically check each comparison sub-type below. Not every sub-type applies to every parallel group — but explicitly considering each one prevents the strategy from only finding "obvious" discrepancies while missing structural ones.

   **Comparison sub-type checklist** (check each one for every parallel group):

   - **Resource lifecycle parity:** Compare what setup/init does with a resource vs. what teardown/reset/cleanup does with the same resource. Every resource acquired in setup must be released in teardown — and in the same order, with the same scope. Look for resources that setup creates but reset forgets (e.g., a list populated during probe but not drained during reset).
   - **Allocation context parity:** Compare allocation flags, lock context, and interrupt state across parallel paths. If one path allocates with `GFP_KERNEL` (sleepable) but runs under a spinlock that another path doesn't hold, that's a bug. Check: what locks are held? What allocation flags are valid in that context? Do parallel paths agree?
   - **Identifier and index parity:** Compare how parallel paths compute indices, offsets, or identifiers for the same logical entity. If setup uses `queue_index + admin_offset` but reset uses `raw_queue_index`, the mismatch is a bug candidate.
   - **Capability/feature-bit parity:** Compare which feature bits, flags, or capabilities each parallel path checks or sets. If the MSI-X path checks a slow-path vector list but the INTx fallback path doesn't, vectors may be misrouted after fallback.
   - **Error/exception parity:** Compare error handling between paths. If the primary path handles an error gracefully but the fallback path lets it propagate, the fallback is less robust than the primary — which is backwards.
   - **Iteration/collection parity:** Compare what collections each path iterates over. If setup iterates over `all_queues` but reset iterates over `active_queues`, resources for inactive queues leak.

   For each discrepancy found, trace both code paths with file:line citations and determine whether the difference is intentional (documented, tested, or structurally necessary) or a bug.

3. **Cross-file contract tracing.** For the most promising discrepancies, trace the call chain across files to verify:
   - What lock/interrupt context each path runs in
   - What allocation flags are valid in that context
   - Whether the contract (documented in specs, comments, or headers) requires parity
   - Write findings to `quality/EXPLORATION_ITER{N}.md` with both code paths cited for each finding.

4. **Minimum output:** At least 5 parallel groups enumerated, at least 8 pairwise comparisons traced with file:line references, at least 3 concrete discrepancy findings.

---

## Strategy: `adversarial` — challenge the previous run's conclusions

Re-investigate what the previous run dismissed, demoted, or marked SATISFIED. This strategy assumes the previous run made Type II errors (missed real bugs by being too conservative) and systematically challenges those decisions.

**Why this strategy exists:** In benchmarking, the triage step reliably demotes legitimate findings by demanding excessive evidence, marking ambiguous cases as "design choice," or accepting code-review SATISFIED verdicts without deep verification. The adversarial strategy specifically targets these failure modes.

1. **Load previous decisions.** Read these files from the previous run (use divide-and-conquer — section headers first, then targeted deep reads):
   - `quality/EXPLORATION_MERGED.md` — specifically the `## Demoted Candidates` section (this is your primary input — it contains structured re-promotion criteria for each dismissed finding)
   - `quality/BUGS.md` — what was confirmed (to avoid re-finding the same bugs)
   - `quality/spec_audits/*triage*` — what was dismissed or demoted during triage, and why
   - `quality/code_reviews/*.md` — Pass 2 SATISFIED/VIOLATED verdicts
   - `quality/EXPLORATION.md` — just the `## Candidate Bugs for Phase 2` section to see which candidates didn't become confirmed bugs
   - Write a summary to `quality/ITERATION_PLAN.md` listing: (a) demoted candidates from the manifest with their re-promotion criteria, (b) additional dismissed triage findings not yet in the manifest, (c) candidates that weren't promoted, (d) requirements marked SATISFIED that had thin evidence

2. **Re-investigate dismissed findings with a lower evidentiary bar.** The adversarial strategy uses a deliberately lower evidentiary standard than earlier strategies. The baseline and gap strategies rightly demand strong evidence to avoid false positives during initial discovery. But by the adversarial iteration, remaining undiscovered bugs are precisely the ones that conservative triage keeps rejecting — they look ambiguous, they could be "design choices," they lack dramatic runtime failures. For these findings:
   - A code-path trace showing observable semantic drift (output differs from what spec or contract requires) is sufficient to confirm — you do not need a runtime crash or dramatic failure
   - "Permissive behavior" is not automatically a design choice — check whether the spec, docs, or API contract defines the expected behavior. If the code deviates from a documented contract, it's a bug regardless of whether the deviation is "permissive"
   - If the Demoted Candidates Manifest includes re-promotion criteria, attempt to satisfy those criteria specifically. Each criterion was written to be actionable — follow it
   - Read the specific code location cited in the finding
   - Trace the code path independently — do not rely on the previous run's analysis
   - Make an explicit CONFIRMED/FALSE-POSITIVE determination with fresh evidence
   - Update the Demoted Candidates Manifest: change status to RE-PROMOTED or FALSE POSITIVE with the iteration attribution

3. **Challenge SATISFIED verdicts.** For each requirement the code review marked SATISFIED with thin evidence (single-line citation, no code-path trace, or grouped with 3+ other requirements under one citation):
   - Re-verify the requirement by reading the cited code and tracing the behavior
   - Check whether the requirement is actually satisfied or whether the review took a shallow pass

4. **Explore adjacent code.** For each re-confirmed or newly confirmed finding, explore the surrounding code for related bugs — bugs cluster. If a function has one bug, its callers and siblings likely have related issues.

5. Write all findings to `quality/EXPLORATION_ITER{N}.md`. Each finding must include: the original source (triage dismissal, candidate demotion, or SATISFIED challenge), the fresh evidence, and the new determination.

@@ -0,0 +1,63 @@

# Orchestrator Protocol

Shared rules for all orchestrator agent files (Claude Code, Copilot, Cursor, Windsurf). Platform-specific agent files reference this file for the hardening rules and verification gates.

## Role Definition

Your ONLY jobs are: (1) spawn sub-agents to execute phases, (2) verify their output files exist on disk, (3) report progress to the user. You do NOT execute phase logic yourself. Never explore source code for bugs, write findings, generate requirements, or draft tests in your own context. If you find yourself doing any of that, you have violated your role.

**Why this is strict.** The Quality Playbook is intentionally multi-session: each phase needs the full context window for deep analysis. Running phases in the orchestrator's context is the single most common failure mode — the orchestrator collapses into single-context execution, produces shallow summaries, and writes zero files to disk. This happened on a real casbin run and is why this protocol was hardened.

## File-Writing Override

The user's invocation of the playbook is explicit authorization for you and every sub-agent you spawn to write .md files, patches, JSON, and any other artifacts to the `quality/` directory and the project root (`AGENTS.md`). This overrides any base harness rules discouraging documentation or .md file creation. No sub-agent should skip file writes citing harness constraints.

## Rationalization Patterns

If you catch yourself producing text like any of these, stop — that's the tell that you're about to collapse into single-context execution:

- "per system constraint: no report .md files" (or any invented harness restriction)
- "I'll do the analytical work in-context and summarize for the user"
- "spawning a sub-agent is unnecessary overhead for this step"
- "I can cover multiple phases in one pass"
- "the artifacts are optional / can be described rather than written"

Any of these means you're about to replicate the casbin failure. Spawn the sub-agent instead.

## Grounding

If `ai_context/DEVELOPMENT_CONTEXT.md` exists in the skill repo or the working directory, read it before Phase 1. It contains the three-axes improvement model and the design intent behind phase separation. Grounding in this document materially reduces the chance of collapsing into single-context execution.

## Post-Phase Verification Gate (Mandatory)

After each sub-agent returns, confirm that the expected output files exist and contain real content — not empty scaffolding or placeholder text. If any required file is missing or trivially small, the phase failed regardless of what the sub-agent reported. The sub-agent's claim of completion is insufficient evidence — only files on disk count.

Express each check as content criteria ("verify that `quality/EXPLORATION.md` exists and has at least 120 lines"), not as specific tool invocations. Use whatever file-reading and directory-listing capability is available.
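
For example, the EXPLORATION.md check could be realized like this (a sketch of one possible check, not a required tool invocation):

```python
from pathlib import Path

path = Path("quality/EXPLORATION.md")
lines = path.read_text().splitlines() if path.exists() else []
assert len(lines) >= 120, f"Phase 1 failed: {path} missing or too short ({len(lines)} lines)"
```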

### Expected outputs per phase

Cross-reference SKILL.md's Complete Artifact Contract for the authoritative list.

- **Phase 1 (Explore):** `quality/EXPLORATION.md` exists with at least 120 lines of substantive content; `quality/PROGRESS.md` exists with Phase 1 marked complete.
- **Phase 2 (Generate):** All of these exist: `quality/REQUIREMENTS.md`, `quality/QUALITY.md`, `quality/CONTRACTS.md`, `quality/COVERAGE_MATRIX.md`, `quality/COMPLETENESS_REPORT.md`, `quality/RUN_CODE_REVIEW.md`, `quality/RUN_INTEGRATION_TESTS.md`, `quality/RUN_SPEC_AUDIT.md`, `quality/RUN_TDD_TESTS.md`. A functional test file exists in `quality/` (naming varies by language: `quality/test_functional.<ext>`). **AGENTS.md is NOT a Phase 2 output** — it is generated post-Phase-6 by the orchestrator (see SKILL.md Phase 2 source-modification guardrail). Phase 2 writes ONLY into `quality/`.
- **Phase 3 (Code Review):** `quality/code_reviews/` contains at least one review file. If bugs were confirmed: `quality/BUGS.md` has at least one `### BUG-` entry, `quality/patches/` contains a regression-test patch per confirmed bug, and `quality/test_regression.*` exists.
- **Phase 4 (Spec Audit):** `quality/spec_audits/` contains at least one triage file AND at least one individual auditor file.
- **Phase 5 (Reconciliation):** If bugs were confirmed: `quality/results/tdd-results.json` exists, a writeup at `quality/writeups/BUG-NNN.md` exists for every confirmed bug, and a red-phase log exists at `quality/results/BUG-NNN.red.log` for every confirmed bug.
- **Phase 6 (Verify):** `quality/results/quality-gate.log` exists and PROGRESS.md marks Phase 6 complete with a Terminal Gate Verification section.

### After verification passes

Report the phase's key findings to the user. Continue to the next phase (or stop if in phase-by-phase mode).

### If verification fails

Report what files are missing or empty. Do NOT spawn the next phase — the missing output must be repaired first. Offer to retry the failed phase in a fresh sub-agent.

## Error Recovery

If a sub-agent fails or runs out of context:

1. Assess what was saved to disk (PROGRESS.md and the `quality/` directory).
2. Report the failure with specifics.
3. Suggest retrying in a fresh sub-agent — phase writes are preserved incrementally, so a retry can pick up where the previous attempt left off.
4. Never skip phases — each depends on prior output.

@@ -0,0 +1,427 @@

# Requirements Pipeline

## Overview

This document defines the five-phase requirements generation pipeline for Step 7 of the Quality Playbook. The pipeline separates contract discovery from requirement derivation, uses file-based external memory so the model doesn't need to hold everything in context simultaneously, and includes mechanical verification with a completeness gate.

**Why a pipeline?** Single-pass requirement generation runs out of attention after ~70 requirements because the model is simultaneously discovering contracts and writing formal requirements. Separating these into distinct phases with file-based handoffs produces significantly more complete coverage. In testing on Gson (81 source files, ~21K lines), single-pass produced 48 requirements; the pipeline produced 110.

## Files produced

| File | Purpose |
|------|---------|
| `quality/CONTRACTS.md` | Raw behavioral contracts extracted from source |
| `quality/REQUIREMENTS.md` | Testable requirements with narrative (the primary deliverable) |
| `quality/COVERAGE_MATRIX.md` | Contract-to-requirement traceability |
| `quality/COMPLETENESS_REPORT.md` | Final completeness assessment with verdict |
| `quality/VERSION_HISTORY.md` | Review log with version table and provenance |
| `quality/REFINEMENT_HINTS.md` | Review progress and feedback (created during review) |

Versioned backups go in `quality/history/vX.Y/`.

---

## Phase A: Extract behavioral contracts

**Input:** All source files in the project (or a scoped subsystem — see scaling check below).
**Output:** `quality/CONTRACTS.md`

### Scaling check

Before starting extraction, count the source files in the project (exclude tests, generated code, vendored dependencies, and build artifacts).

- **Standard project (≤300 source files):** Proceed normally — extract contracts from all files. Projects in this range have been tested end-to-end (e.g., Gson at ~81 source files produced 110 requirements with full coverage).
- **Large project (301–500 source files):** Focus on the 3–5 core subsystems identified in Phase 1, Step 2. Extract contracts from those modules and their internal dependencies. Note the scope in the CONTRACTS.md header so reviewers know what was covered.
- **Very large project (>500 source files):** Recommend that the user scope the pipeline to one subsystem at a time. Each subsystem gets its own pipeline run producing its own REQUIREMENTS.md, CONTRACTS.md, etc. Tell the user: "This project has N source files. For best results, run the requirements pipeline separately for each major subsystem (e.g., 'Generate requirements for the authentication module'). A single pipeline run across the full codebase will miss contracts due to context limits."

If the user explicitly asks for full-project scope on a large codebase, honor the request but warn that coverage will be thinner than subsystem-level runs.

### Scope breadth on the initial pass

On the first pipeline run, favor breadth over depth. Cover all major subsystems and modules rather than going deep on a few. The goal is a broad baseline that the self-refinement loop and later review/refinement passes can deepen. If you focus on 3 modules and skip 8 others, the completeness check can't find gaps in modules it never saw.

For projects with both a core library and supporting modules (middleware, plugins, adapters, extensions), include at least the core and the highest-risk supporting modules in Phase A. Note the scope in the CONTRACTS.md header so it's clear what was covered and what wasn't. Refinement passes can expand scope later, but the initial pass should cast the widest net the context window allows.

### Contract extraction

Read every source file (within scope) and list every behavioral contract it implements or should implement. A behavioral contract is any promise the code makes to its callers:

- **METHOD**: What a public method guarantees about return value, side effects, exceptions, thread safety
- **NULL**: What happens when null is passed, returned, or stored
- **CONFIG**: What effect a configuration option has at its boundaries
- **ERROR**: What exceptions are thrown, when, and with what diagnostic information
- **INVARIANT**: Properties that must always hold
- **COMPAT**: Behaviors preserved for backward compatibility
- **ORDER**: Whether output/iteration order is stable, documented, or undefined
- **LIFECYCLE**: Resource creation/cleanup, initialization sequencing
- **THREAD**: Thread-safety guarantees or requirements

### Contract extraction rules

- **Be thorough.** For a 200-line file, expect 5–15 contracts. For a 1000-line file, expect 20–40. If you're finding fewer than 3 contracts in a file with real logic, you're skipping things.
- **Include internal files.** Internal contracts matter because the public API depends on them.
- **Include "should exist" contracts** — things the code doesn't do but should based on its domain. These catch absence bugs.
- **Read the code, not just the Javadoc/docstrings.** When documentation and code disagree, list both.
- **This is discovery, not judgment.** List everything, even if it seems obvious.

### Output format

```
# Behavioral Contract Extraction
Generated: [date]
Source files analyzed: N
Total contracts extracted: N

## Summary by category
- METHOD: N
- NULL: N
- CONFIG: N
[etc.]

### path/to/file.ext (N contracts)

1. [METHOD] ClassName.methodName(): description of what it guarantees
2. [NULL] ClassName.methodName(): what happens when null is passed/returned
[etc.]
```

---

### Requirement heading format

All requirements in REQUIREMENTS.md must use the format `### REQ-NNN: Title` where NNN is a zero-padded three-digit number and Title is a short descriptive name. Do not use alternative formats like `### REQ-NNN — Title`, `### REQ-NNN. Title`, `**REQ-NNN**: Title`, or freeform headings without a number. Consistent formatting enables automated tooling to parse and cross-reference requirements.
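
As an illustration of what that tooling can then do, a minimal parser sketch (hypothetical, not part of the skill):

```python
import re
from pathlib import Path

REQ_HEADING = re.compile(r"^### (REQ-\d{3}): (.+)$", re.MULTILINE)

text = Path("quality/REQUIREMENTS.md").read_text()
requirements = dict(REQ_HEADING.findall(text))  # {"REQ-001": "Title", ...}
```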

---

## Phase B: Derive requirements from contracts

**Input:** `quality/CONTRACTS.md`, project documentation, SKILL.md Step 7 template.
**Output:** `quality/REQUIREMENTS.md`

### How to work

**B.1 — Group related contracts.** Many contracts across different files serve the same behavioral requirement. Group them by behavioral concern, not by file. Don't merge unrelated contracts just because they're in the same file.

**B.2 — Enrich with intent.** For each group, find the user story from documentation: GitHub issues state what users expect, the user guide states intended behavior, troubleshooting docs reveal known edge cases, design docs explain design goals. The "so that" clause must come from understanding who cares and why.

**B.3 — Write requirements.** Use the 7-field template from SKILL.md Step 7. Conditions of satisfaction come from the individual contracts in the group — each contract becomes a condition of satisfaction.

**B.4 — Check for orphan contracts.** After writing all requirements, verify every contract in CONTRACTS.md is covered. Uncovered contracts become new requirements or get added to existing requirements' conditions of satisfaction.

### Rules

- **Do not cap the requirement count.** Write as many as the contracts warrant.
- **Every contract must map to at least one requirement.**
- **One requirement per distinct behavioral concern.** Don't merge "thread safety" with "null handling" just because they're in the same class.
- **Do not modify CONTRACTS.md.** Only read it.

---

## Phase C: Verify coverage (loop, max 3 iterations)

**Input:** `quality/CONTRACTS.md`, `quality/REQUIREMENTS.md`
**Output:** `quality/COVERAGE_MATRIX.md`, updated `quality/REQUIREMENTS.md`

For every contract in CONTRACTS.md, determine whether it is covered by a requirement. A contract is "covered" if a requirement's conditions of satisfaction explicitly test the behavior. A contract is NOT covered if it's only tangentially mentioned, implied but not stated, or if a different aspect of the same file is covered but this specific contract isn't.

### Output format

```
# Contract Coverage Matrix
Generated: [date]
Total contracts: N
Covered: N (percentage)
Uncovered: N (percentage)
Partially covered: N (percentage)

## Fully covered contracts
[file]: [contract summary] → REQ-NNN (conditions of satisfaction #M)

## Partially covered contracts
[file]: [contract summary] → REQ-NNN covers the general area but misses [specific aspect]

## Uncovered contracts
[file]: [contract summary] → No requirement addresses this behavior
```

After writing the matrix, fix gaps in REQUIREMENTS.md: add missing conditions to existing requirements or create new requirements. Report changes.

**Loop termination:** If uncovered count reaches 0, proceed to Phase D. Otherwise, regenerate the matrix and check again. Maximum 3 iterations.

---

## Phase D: Completeness check

**Input:** `quality/REQUIREMENTS.md`, `quality/CONTRACTS.md`, `quality/COVERAGE_MATRIX.md`, source tree.
**Output:** `quality/COMPLETENESS_REPORT.md`, updated `quality/REQUIREMENTS.md`

This is the final gate before the narrative pass. Run the three core checks below, plus Check 4 when prior review or audit results exist:

### Check 1: Domain completeness

The following behavioral domains MUST have requirements. Check each one. This checklist is a minimum — if you notice a behavioral domain not listed that matters for this project, add it.

- [ ] **Null handling:** explicit null, absent fields, null keys, null values in collections
- [ ] **Type coercion:** string↔number, string↔boolean, number precision, overflow
- [ ] **Primitive vs wrapper:** primitive vs object null semantics during deserialization (for languages with this distinction)
- [ ] **Generic types:** erasure boundaries, wildcard handling, recursive generics (for languages with generics)
- [ ] **Thread safety:** concurrent access, publication safety, cache visibility
- [ ] **Error diagnostics:** exception types, path context, location information
- [ ] **Resource management:** stream closing, reader/writer lifecycle
- [ ] **Backward compatibility:** wire format stability, API behavioral stability
- [ ] **Security:** DoS protection (nesting depth, string length), injection prevention
- [ ] **Encoding:** Unicode, BOM, surrogate pairs, escape sequences
- [ ] **Date/time:** format precedence, timezone handling, precision
- [ ] **Collections:** arrays, lists, sets, maps, queues — empty, null elements, ordering
- [ ] **Enums:** name resolution, aliases, unknown values
- [ ] **Polymorphism:** runtime type vs declared type, adapter/handler delegation
- [ ] **Tree model / intermediate representation:** mutation semantics, deep copy structural independence, null normalization
- [ ] **Configuration:** builder immutability, instance isolation, option composition
- [ ] **Entry points:** every distinct public entry point must have its own contract — string-based, stream-based, tree-based, standalone parsing, multi-value parsing. If the library has N ways to start a read or write, there must be N sets of contracts.
- [ ] **Output escaping:** which characters are escaped by default, what disabling escaping changes, how builder-level and writer-level controls interact
- [ ] **Built-in type handler contracts:** for each built-in handler that processes a standard library type, state what it promises about format, precision, normalization, and round-trip fidelity. The requirement should specify the handler's promise, not just that a handler exists.
- [ ] **Field/property serialization ordering:** whether output order follows declaration order, inheritance order, alphabetical order, or is undefined. State whether ordering is a promised contract or merely observed behavior.
- [ ] **Identity contracts for public types:** `toString()`, `hashCode()`/`equals()` (or language equivalent) on public model types. These are behavioral contracts users depend on for comparison, logging, and collection key usage.
- [ ] **Input validation:** for every configuration field with domain constraints, state the valid range and whether validation exists.

For each domain, either cite the REQ-NNN numbers that cover it or flag it as a gap.

### Check 2: Testability audit

For each requirement, check whether its conditions of satisfaction are actually testable. Can a reviewer write a concrete test case from this condition? Is pass/fail unambiguous? Does the condition cover failure modes, not just the happy path?

### Check 3: Cross-requirement consistency

Check pairs of requirements that reference the same concept. Do ranges agree? Do null-handling rules agree? Do thread-safety guarantees conflict with lifecycle contracts? Do configuration defaults match across requirements?

### Check 4: Cross-artifact consistency (if code review or spec audit results exist)

If `quality/code_reviews/` or `quality/spec_audits/` contain results from a previous or current run, read them. For every finding with status VIOLATED, BUG, or INCONSISTENT, check whether the requirements address the behavioral concern that finding targets. If a code review found a bug in compression header parsing that the requirements don't cover, that's a completeness gap — add a requirement or conditions of satisfaction to close it.

**The completeness report cannot say COMPLETE if unaddressed findings exist.** If any VIOLATED/BUG/INCONSISTENT finding from code review or spec audit targets behavior not covered by requirements, the verdict must be INCOMPLETE with the specific gaps listed.

This check exists because earlier versions of the pipeline produced completeness reports that said "COMPLETE" while the code review in the same run found requirement violations. The completeness report must be consistent with all other quality artifacts.

### Post-review completeness refresh (mandatory)

**After the code review and spec audit are complete**, re-read `quality/COMPLETENESS_REPORT.md` and update it. The initial completeness report was written before the code review and spec audit ran, so it cannot reflect their findings. This refresh step reconciles the completeness verdict with the actual review results.

**Procedure:**

1. Read the combined summary from `quality/code_reviews/` — count VIOLATED and BUG findings.
2. Read the triage summary from `quality/spec_audits/` — count confirmed code bugs.
3. For each finding, check whether REQUIREMENTS.md has a requirement covering that behavior.
4. Append a `## Post-Review Reconciliation` section to COMPLETENESS_REPORT.md:

```
## Post-Review Reconciliation
Updated: [date]

### Code review findings: N VIOLATED, M BUG
- [finding summary] → covered by REQ-NNN / NOT COVERED (gap)
- ...

### Spec audit findings: N confirmed code bugs
- [finding summary] → covered by REQ-NNN / NOT COVERED (gap)
- ...

### Updated verdict
[COMPLETE if all findings are covered by requirements, INCOMPLETE if gaps remain]
```

5. If the original verdict was COMPLETE but unaddressed findings exist, change the verdict to INCOMPLETE.

### Resolving code review vs spec audit conflicts

When the code review and spec audit disagree about the same behavioral claim — one says BUG, the other says design choice or false positive — the reconciliation must resolve the conflict, not paper over it.

**Resolution procedure:**

1. Identify the factual claim at the center of the disagreement. What does the code actually do?
2. Deploy a verification probe: give a model the disputed claim and the relevant source code, and ask it to report ground truth. (See `spec_audit.md` § "The Verification Probe.")
3. Record the resolution in the Post-Review Reconciliation section:

```
### Conflicts resolved
- [finding description]: Code review said [X], spec audit said [Y].
  Verification probe: [what the code actually does].
  Resolution: [BUG CONFIRMED / FALSE POSITIVE / DESIGN CHOICE]. [Explanation.]
```

4. If the resolution confirms a BUG, ensure it has a regression test. If the resolution overturns a BUG, clean up the regression test per `review_protocols.md` § "Cleaning up after spec audit reversals."

**Do not resolve conflicts by defaulting to one source.** Neither the code review nor the spec audit is automatically more authoritative — they use different methods (structural reading vs. spec comparison) and have different blind spots. The verification probe is the tiebreaker.

**This refresh is not optional.** A completeness report that predates the code review is a timestamp, not a quality gate. The refresh turns it into an actual reconciliation.

### Output format

```
# Completeness Report
Generated: [date]

## Domain coverage
[For each domain: COVERED (REQ-NNN, REQ-NNN) or GAP (description)]

## Testability issues
[For each vague requirement: REQ-NNN — condition N is not testable because...]

## Consistency issues
[For each conflict: REQ-NNN and REQ-NNN disagree about...]

## Cross-artifact gaps (if code review/spec audit results exist)
[For each unaddressed finding: finding summary → missing requirement or condition]

## Verdict
COMPLETE or INCOMPLETE with recommended actions
```

Then fix what you can: add requirements for domain gaps, sharpen vague conditions, resolve consistency issues, and close cross-artifact gaps.

**Important:** This is the final check. Be adversarial. Assume previous passes were imperfect. For each domain marked COVERED, verify that the cited requirements actually address the checklist item — don't just check the box.

### Self-refinement loop (max 3 iterations)

After the initial completeness check, run up to 3 refinement iterations to close the gaps Phase D identified:

1. **Read the completeness report.** Identify all GAP entries, testability issues, and consistency issues.
2. **Fix gaps in REQUIREMENTS.md.** For each GAP: add a new requirement using the 7-field template, or add conditions of satisfaction to an existing requirement. For testability issues: sharpen the condition. For consistency issues: resolve the conflict.
3. **Re-run all three checks** (domain completeness, testability audit, cross-requirement consistency). Write the updated results to COMPLETENESS_REPORT.md.
4. **Count the delta.** How many new requirements were added or existing requirements modified in this iteration?
5. **Short-circuit check:** If the delta is fewer than 3 changes, stop — you've hit diminishing returns. Proceed to Phase E.

**Why this works:** The initial completeness check identifies gaps, but the model may not fix all of them in one pass, especially conceptual gaps where the model needs to re-read source files to understand what's missing. Each iteration shrinks the gap. Three iterations is enough to close the mechanical gaps; the remaining conceptual gaps are where cross-model audit and human review earn their keep.

**Why it has limits:** This is self-refinement — the same model checking its own work. It catches gaps the model can see once they're pointed out (uncovered domains, vague conditions, numeric inconsistencies) but won't catch blind spots the model doesn't recognize as gaps. That's by design. The review and refinement protocols exist for closing those deeper gaps with different models or human input.

After the loop completes (or short-circuits), proceed to Phase E.

---

## Phase E: Narrative pass

**Input:** `quality/REQUIREMENTS.md`, `quality/CONTRACTS.md`, project documentation, source tree.

**Output:** Restructured `quality/REQUIREMENTS.md`

**Before starting:** Save a backup: `cp quality/REQUIREMENTS.md quality/REQUIREMENTS_pre_narrative.md`

This phase transforms the specification into a guide. Add explanatory tissue so a new team member, code reviewer, or AI agent can read the document top-to-bottom and understand the software.

### E.1 — Project overview (new, top of document)

Write 400–600 words of connected prose explaining: what the software is, who uses it and why (primary personas and goals), how data flows through the major components, and the design philosophy (key architectural decisions and why they were made).

### E.2 — Use cases (new, after overview)

Write 6–8 use cases in the style of Applied Software Project Management (Stellman & Greene). Each has:

- **Name**: Short descriptive name
- **Actor**: Who initiates it
- **Preconditions**: What must be true before this begins
- **Steps**: Numbered actor/system action sequence
- **Postconditions**: What is true on success
- **Alternative paths**: Variations and error cases
- **Requirements**: Which REQ-NNN numbers this use case exercises

Cover the major usage patterns. The use cases are the bridge between "what the software does" and "what the requirements specify."

### E.3 — Cross-cutting concerns (new, after use cases)

Document architectural invariants that span multiple categories: threading model, null contract, error philosophy, backward compatibility strategy, configuration composition. Each references specific REQ-NNN numbers. Write as prose paragraphs.

### E.4 — Category narratives (augment existing)

For each requirement category, add 2–4 sentences before the first requirement explaining what the category covers, how it relates to other categories, and what a reviewer should keep in mind.

### E.5 — Reorder for top-down flow

Reorder categories from user-facing (entry points, configuration) to infrastructure (error handling, backward compatibility). Fold any catch-all sections into proper categories.

### E.6 — Renumber sequentially

After reordering, renumber all requirements REQ-001 through REQ-NNN following document order. Update all internal cross-references.
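
A minimal renumbering sketch, assuming each requirement introduces its ID as a `### REQ-NNN` heading — that heading convention is an assumption; adapt the pattern to the document's actual structure:

```python
import re
from pathlib import Path

path = Path("quality/REQUIREMENTS.md")
text = path.read_text(encoding="utf-8")

# Pass 1: collect old IDs in document order and assign new sequential numbers.
old_ids = re.findall(r"^### (REQ-\d+)", text, re.MULTILINE)
mapping = {old: f"REQ-{new:03d}" for new, old in enumerate(old_ids, start=1)}

# Pass 2: rewrite every REQ-NNN token — headings and cross-references alike —
# in a single sweep, so references stay consistent with the new numbering.
text = re.sub(r"REQ-\d+", lambda m: mapping.get(m.group(0), m.group(0)), text)
path.write_text(text, encoding="utf-8")
```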

### Rules

- **Do not delete, merge, or weaken any existing requirement.**
- **Do not add new requirements in this pass.**
- **Write the overview and use cases from the user's perspective.**
- **Use cases must cite specific REQ numbers.**

---

## Versioning protocol

### Version scheme: major.minor

- **Major** bump: structural changes (new pipeline architecture, narrative pass added, major scope expansion). Bumped by the user.
- **Minor** bump: refinement passes, gap fills, sharpened conditions. Increments automatically on each pipeline run or refinement pass.

### VERSION_HISTORY.md

Maintain a version history file at `quality/VERSION_HISTORY.md`:

```markdown
# Requirements Version History

## Current version: vX.Y

| Version | Date | Model | Author | Reqs | Summary |
|---------|------|-------|--------|------|---------|
| v1.0 | YYYY-MM-DD | [model] | Quality Playbook | N | Initial pipeline generation |
| v1.1 | YYYY-MM-DD | [model] | [author] | N | [what changed] |

## Pending review
[status from REFINEMENT_HINTS.md if review is in progress]
```

The **Author** column records provenance: "Quality Playbook" for automated pipeline runs, a person's name for manual edits, a model name for refinement passes.

### Backup protocol

Before each version change, copy all quality files to `quality/history/vX.Y/`:

```
quality/history/
├── v1.0/
│   ├── REQUIREMENTS.md
│   ├── CONTRACTS.md
│   ├── COVERAGE_MATRIX.md
│   └── COMPLETENESS_REPORT.md
├── v1.1/
│   └── ...
└── v2.0/
    └── ...
```

Each version folder is a complete snapshot. Users can diff any two versions.
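
A minimal sketch of the snapshot step, assuming the top-level `.md` artifacts in `quality/` are the snapshot set:

```python
import shutil
from pathlib import Path

def snapshot(version: str, quality_dir: Path = Path("quality")) -> None:
    """Copy the top-level quality artifacts into quality/history/<version>/."""
    dest = quality_dir / "history" / version
    dest.mkdir(parents=True, exist_ok=True)
    for artifact in quality_dir.glob("*.md"):
        shutil.copy2(artifact, dest / artifact.name)

snapshot("v1.1")  # run this before bumping to v1.2
```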

### Version stamping

The REQUIREMENTS.md header includes the current version:

```markdown
# Behavioral Requirements — [Project Name]
Version: vX.Y
Generated: [date]
Pipeline: contract-extraction v2 with narrative pass
```

---

## After the pipeline: review and refinement

The pipeline produces a solid baseline, but AI isn't 100% reliable. The skill provides two standalone tools for iterative improvement:

### Requirements review (`quality/REVIEW_REQUIREMENTS.md`)

An interactive or guided review of requirements organized by use case. Three modes:

- **Self-guided**: Pick use cases to drill into
- **Fully guided**: Walk through use cases sequentially
- **Cross-model audit**: A different model fact-checks the completeness report

Progress and feedback are tracked in `quality/REFINEMENT_HINTS.md`. See the generated `quality/REVIEW_REQUIREMENTS.md` for the full protocol.

### Requirements refinement (`quality/REFINE_REQUIREMENTS.md`)

Reads `quality/REFINEMENT_HINTS.md` and updates `quality/REQUIREMENTS.md` to close identified gaps. Can be run with any model. Backs up the current version, bumps the minor version, reports all changes. See the generated `quality/REFINE_REQUIREMENTS.md` for the full protocol.

### Multi-model refinement

Users can run refinement passes with different models to catch different blind spots. Each pass: backup → refine → version bump → log in VERSION_HISTORY.md. Run as many models as desired until returns diminish.

---
# Requirements Refinement Protocol

## Overview

This is the template for `quality/REFINE_REQUIREMENTS.md`. The playbook generates this file alongside the requirements pipeline output. It provides a structured process for updating requirements based on review feedback, and can be run with any model.

## Generated file template

The playbook should generate the following as `quality/REFINE_REQUIREMENTS.md`:

---

```markdown
# Requirements Refinement Protocol: [Project Name]

## How to use

This protocol reads feedback from `quality/REFINEMENT_HINTS.md` and updates `quality/REQUIREMENTS.md` to close identified gaps. It can be run with any AI model — the protocol is self-contained.

**Multi-model refinement:** You can run this protocol multiple times with different models. Each run backs up the current version, makes targeted improvements, bumps the minor version, and logs the changes. Run as many models as you want until you hit diminishing returns.

---

## Before starting

1. Read `quality/REFINEMENT_HINTS.md` — this contains the review feedback to address.
2. Read `quality/REQUIREMENTS.md` — the current requirements to update.
3. Read `quality/CONTRACTS.md` — for contract-level detail when adding new conditions.
4. Read `quality/VERSION_HISTORY.md` — to determine the current version number.

## Step 1: Backup and version

1. Read the current version from `quality/VERSION_HISTORY.md`.
2. Copy all files in `quality/` to `quality/history/vX.Y/` (current version number).
3. Bump the minor version: v1.2 becomes v1.3.
4. Update the version stamp at the top of `quality/REQUIREMENTS.md`.

## Step 2: Process feedback

Read each item in REFINEMENT_HINTS.md and categorize it:

- **Gap — missing requirement:** A behavioral contract or domain area has no requirement. Create a new requirement using the 7-field template.
- **Gap — missing condition:** An existing requirement doesn't cover a specific scenario. Add a condition of satisfaction to the existing requirement.
- **Gap — missing use case coverage:** A use case doesn't link to a requirement that governs one of its steps. Add the REQ-NNN to the use case's Requirements line.
- **Sharpening — vague condition:** A condition of satisfaction is too vague to test. Rewrite it with concrete pass/fail criteria.
- **Correction — wrong content:** A requirement states something incorrect. Fix the specific field.
- **Cross-model audit finding:** A domain was marked COVERED in the completeness report but the cited requirements don't actually address it. Add the missing requirements.
- **Removal (user-directed only):** The user explicitly states a requirement is incorrect and should be removed (e.g., "REQ-047 is incorrect because X — remove it"). Only process removals when the hint clearly comes from the user, not from an automated pass. Log the removal and reason in the change report.

## Step 3: Make changes

For each feedback item:

1. **New requirements:** Add at the end of the appropriate category section. Continue the existing numbering sequence. Follow the 7-field template exactly.
2. **Modified requirements:** Edit the specific field that needs changing. Do not rewrite requirements that aren't flagged.
3. **Use case updates:** Add newly created REQ numbers to the relevant use case's Requirements line.
4. **Cross-cutting concerns:** If new requirements affect cross-cutting concerns, update those sections.

### Rules

- **Do not delete or weaken existing requirements during automated refinement.** Every requirement that exists today must exist after refinement with at least the same conditions of satisfaction — unless the user has explicitly marked a requirement for removal with a reason. User-directed removals are the only exception.
- **Do not renumber existing requirements.** New requirements get the next available number. This preserves traceability across versions.
- **Do not restructure the document.** The narrative pass already established the structure. Refinement is surgical — add, sharpen, or fix individual items.
- **Each change must be traceable to a feedback item.** Don't make changes that weren't asked for.

## Step 4: Report changes

After all changes, append a summary to `quality/REFINEMENT_HINTS.md`:

```
## Refinement Pass — v[new version]
Date: [date]
Model: [model name]

### Changes made
- REQ-NNN (NEW): [brief description] — addresses feedback: "[quoted hint]"
- REQ-NNN: Added condition of satisfaction for [what] — addresses feedback: "[quoted hint]"
- REQ-NNN: Sharpened condition #N: [what changed] — addresses feedback: "[quoted hint]"
- Use Case N: Added REQ-NNN to requirements list

### Feedback items not addressed
- "[quoted hint]" — reason: [why this wasn't actionable or was out of scope]

### Summary
Added N new requirements, modified N existing requirements, updated N use cases.
Total requirements: N (was N).
```

## Step 5: Update version history

Add a row to `quality/VERSION_HISTORY.md`:

```
| vX.Y | YYYY-MM-DD | [model] | [author] | N | [summary of changes] |
```

## Step 6: Update completeness report

If new requirements were added that address domain checklist gaps, update the relevant domain entries in `quality/COMPLETENESS_REPORT.md` to cite the new REQ numbers.

---

## Running multiple refinement passes

Each pass follows the same protocol:

1. Read the latest REFINEMENT_HINTS.md (which now includes the previous pass's report)
2. Focus only on feedback items marked "not addressed" or new feedback added since the last pass
3. Backup, bump version, make changes, report

The user can add new hints between passes by editing REFINEMENT_HINTS.md directly. The next refinement pass picks them up automatically.

The user can also run a fresh cross-model audit (Mode 3 of the review protocol) between refinement passes to find new gaps that the previous refinement didn't catch. This creates a review → refine → review → refine cycle that converges on completeness.
```

---
# Requirements Review Protocol

## Overview

This is the template for `quality/REVIEW_REQUIREMENTS.md`. The playbook generates this file alongside the requirements pipeline output. It provides three modes for reviewing requirements interactively after generation.

## Generated file template

The playbook should generate the following as `quality/REVIEW_REQUIREMENTS.md`:

---

```markdown
# Requirements Review Protocol: [Project Name]

## How to use

This protocol helps you review the generated requirements for completeness and accuracy. Run it with any AI model — the review is self-contained and reads from the files in `quality/`.

**Before starting:** Make sure `quality/REQUIREMENTS.md` exists (from the pipeline) and that you've read the Project Overview and Use Cases sections at the top.

### Choose a review mode

**Mode 1 — Self-guided review.** You pick which use cases to examine. Best when you already know which areas of the project need the most scrutiny.

**Mode 2 — Fully guided review.** The AI walks you through every use case in order, drilling into each linked requirement. Best for a thorough first review.

**Mode 3 — Cross-model audit.** A different AI model fact-checks the completeness report by verifying that every domain marked COVERED actually has requirements addressing the checklist item. Best run with a different model than the one that generated the requirements.

All three modes track progress in `quality/REFINEMENT_HINTS.md`.

---

## Mode 1: Self-guided review

Read `quality/REQUIREMENTS.md` and present the user with a numbered list of use cases:

```
Use cases in REQUIREMENTS.md:
1. [x] Use Case 1: [name] (reviewed)
2. [ ] Use Case 2: [name]
3. [ ] Use Case 3: [name]
...
```

Check `quality/REFINEMENT_HINTS.md` for review progress — use cases marked `[x]` have already been reviewed. Present the list and ask the user which use case to examine.

When the user picks a use case:

1. Show the use case (actor, steps, postconditions, alternative paths)
2. List the linked REQ-NNN numbers
3. Ask: "Want to drill into any of these requirements, or does this use case look complete?"

When drilling into a requirement:

1. Show the full requirement (summary, user story, conditions of satisfaction, alternative paths)
2. Ask: "Does this capture the right behavior? Anything missing or wrong?"
3. Record feedback in REFINEMENT_HINTS.md under the use case heading

After reviewing a use case, mark it `[x]` in REFINEMENT_HINTS.md and return to the use case list.

Also offer: "Are there any cross-cutting concerns or requirements NOT linked to a use case that you'd like to review?"

---

## Mode 2: Fully guided review

Same as Mode 1, but instead of asking the user to pick, start at Use Case 1 and proceed sequentially.

For each use case:

1. Present the use case overview
2. Walk through each linked requirement one by one
3. For each requirement, ask: "Does this look right? Anything missing?"
4. Record any feedback in REFINEMENT_HINTS.md
5. Mark the use case as reviewed
6. Move to the next use case

After all use cases:

1. Present the Cross-Cutting Concerns section
2. Ask: "Any concerns about threading, null handling, errors, compatibility, or configuration composition?"
3. Ask: "Are there any requirements you expected to see that aren't here?"
4. Record feedback and present a summary of all hints collected

---

## Mode 3: Cross-model audit

Read `quality/COMPLETENESS_REPORT.md` and `quality/REQUIREMENTS.md`. For each domain in the completeness report:

1. Read the domain checklist item (from the report's domain coverage section)
2. Read each cited REQ-NNN
3. Verify: does this requirement actually address the domain checklist item?
4. If the citation is wrong (the requirement covers something else), flag it as a gap

Also check:

- Are there requirements that don't appear in any use case's Requirements list? If so, flag them as potentially orphaned.
- Does every use case's alternative paths section have corresponding requirements for the error/edge cases it mentions?
- Do the cross-cutting concerns reference requirements that actually exist and address the stated concern?

Write findings to `quality/REFINEMENT_HINTS.md` under a `## Cross-Model Audit` heading:

```
## Cross-Model Audit
Date: [date]
Model: [model name]

### Verified domains
- Null handling: CONFIRMED (REQ-054, REQ-055 correctly address null semantics)
- ...

### Gaps found
- Entry points: COMPLETENESS_REPORT cites REQ-100, REQ-101 but these are about
  pretty printing, not entry point contracts. JsonStreamParser has no coverage.
- ...

### Orphaned requirements
- REQ-NNN is not linked to any use case
- ...
```

Present findings to the user and ask which gaps should be addressed in a refinement pass.

---

## REFINEMENT_HINTS.md format

The review protocol creates and maintains this file:

```markdown
# Refinement Hints

## Review Progress
- [x] Use Case 1: [name] — reviewed, no issues
- [x] Use Case 2: [name] — reviewed, see feedback below
- [ ] Use Case 3: [name]
- [ ] Use Case 4: [name]
...

## Cross-Cutting Concerns
- [ ] Threading model — not yet reviewed
- [ ] Null contract — not yet reviewed
- [ ] Error philosophy — not yet reviewed
- [ ] Backward compatibility — not yet reviewed
- [ ] Configuration composition — not yet reviewed

## Feedback

### Use Case 2: [name]
- REQ-NNN: [specific feedback about what's missing or wrong]
- General: [broader observation about this use case's coverage]

### Cross-Model Audit
[if Mode 3 was run]

## Additional hints
[freeform feedback from the user, not tied to a specific use case]
```

This file serves a dual purpose: it tracks review progress (so the user can resume across sessions) AND accumulates feedback that the refinement pass reads.
```

---

Before reviewing, read these files for context:

1. `quality/QUALITY.md` — Quality constitution and fitness-to-purpose scenarios
2. `quality/REQUIREMENTS.md` — Testable requirements derived during playbook generation
3. [Main architectural doc]
4. [Key design decisions doc]
5. [Any other essential context]

## Pass 1: Structural Review

Read the code and report anything that looks wrong. No requirements, no focus areas — use your own knowledge of code correctness. Look for: race conditions, null pointer hazards, resource leaks, off-by-one errors, type mismatches, error handling gaps, and any code that looks suspicious.

### Guardrails

- **Line numbers are mandatory.** If you cannot cite a specific line, do not include the finding.
- **Read function bodies, not just signatures.** Don't assume a function works correctly based on its name.
- **Grep before claiming missing.** If you think a feature is absent, search the codebase. If found in a different file, that's a location defect, not a missing feature.
- **Do NOT suggest style changes, refactors, or improvements.** Only flag things that are incorrect or could cause failures.

Save findings to `quality/code_reviews/YYYY-MM-DD-reviewer.md`

### Output

For each file reviewed:

#### filename.ext
- **Line NNN:** [BUG / QUESTION / INCOMPLETE] Description. Expected vs. actual. Why it matters.

## Pass 2: Requirement Verification

Read `quality/REQUIREMENTS.md`. For each requirement, check whether the code satisfies it. This is a pure verification pass — your only job is "does the code satisfy this requirement?"

Do NOT also do a general code review. Do NOT look for other bugs. Do NOT evaluate code quality. Just check each requirement.

For each requirement, report one of:

- **SATISFIED**: The code implements this requirement. Quote the specific code.
- **VIOLATED**: The code does NOT satisfy this requirement. Explain what the code does vs. what the requirement says. Quote the code.
- **PARTIALLY SATISFIED**: Some aspects implemented, others missing. Explain both.
- **NOT ASSESSABLE**: Can't be checked from the files under review.

### Output

For each requirement:

#### REQ-N: [requirement text]
**Status**: SATISFIED / VIOLATED / PARTIALLY SATISFIED / NOT ASSESSABLE
**Evidence**: [file:line] — [code quote]
**Analysis**: [explanation]
[If VIOLATED] **Severity**: [impact description]

## Pass 3: Cross-Requirement Consistency

Compare pairs of requirements from `quality/REQUIREMENTS.md` that reference the same field, constant, range, or security policy. For each pair, check whether their constraints are mutually consistent.

What to look for:

- **Numeric range vs bit width**: If one requirement says the valid range is [0, N) and another says the field is M bits wide, does N = 2^M?
- **Security policy propagation**: If one requirement says a CA file is configured, do all requirements about connections that should use it actually reference using it?
- **Validation bounds vs encoding limits**: Does a validation check in one file agree with the storage capacity in another?
- **Lifecycle consistency**: If a resource is created by one requirement's code, is it cleaned up by another's?

For each pair that shares a concept, verify consistency against the actual code.

### Output

For each shared concept:

#### Shared Concept: [name]
**Requirements**: REQ-X, REQ-Y
**What REQ-X claims**: [summary]
**What REQ-Y claims**: [summary]
**Consistency**: CONSISTENT / INCONSISTENT
**Code evidence**: [quotes from both locations]
**Analysis**: [explanation]
[If INCONSISTENT] **Impact**: [what happens when the contradiction is triggered]

## Combined Summary

| Source | Finding | Severity | Status |
|--------|---------|----------|--------|
| Pass 1 | [structural finding] | [severity] | BUG / QUESTION |
| Pass 2, REQ-N | [requirement violation] | [severity] | VIOLATED |
| Pass 3, REQ-X vs REQ-Y | [consistency issue] | [severity] | INCONSISTENT |

- Total findings by pass and severity
- Overall assessment: SHIP / FIX BEFORE MERGE / BLOCK
```

### Execution requirements

**All three passes are mandatory.** Do not consolidate passes into a single review. Each pass produces distinct findings because it uses a different lens:

- **Pass 1** finds structural bugs (race conditions, null hazards, resource leaks)
- **Pass 2** finds requirement violations (missing behavior, spec deviations)
- **Pass 3** finds cross-requirement contradictions (inconsistent ranges, conflicting guarantees)

**Write each pass as a clearly labeled section** in the output file. Use the headers `## Pass 1: Structural Review`, `## Pass 2: Requirement Verification`, `## Pass 3: Cross-Requirement Consistency`, and `## Combined Summary`.

**If a pass has no findings, explain why.** Do not just write "No findings." Write what you checked and why nothing was wrong. For example: "Reviewed 12 functions in lib/response.js for null hazards, resource leaks, and error handling gaps. No confirmed bugs — all error paths either throw or return a well-defined default." A pass with no findings and no explanation is a pass that wasn't done.

**Scoping for large codebases:** If the project has more than 50 requirements, Pass 2 does not need to verify every requirement against every file. Instead, focus Pass 2 on the requirements most relevant to the files being reviewed — check the requirements that reference those files or that govern the behavioral domain those files implement. The goal is depth on the files under review, not breadth across all requirements.

**Self-check before finishing:** After writing all three passes and the combined summary, verify: (1) all three pass sections exist in the output, (2) Pass 2 references specific REQ-NNN numbers with SATISFIED/VIOLATED verdicts, (3) Pass 3 identifies at least one shared concept between requirements (even if consistent), (4) every BUG finding has a corresponding regression test in `quality/test_regression.*` (see Phase 2 below), (5) every regression test exercises the actual code path cited in the finding (see the test-finding alignment check below). If any check fails, go back and complete the missing work.

### Adversarial stance when documentation is available

If the playbook was generated with supplemental documentation (reference_docs/, community docs, user guides, API references), the code review must use that documentation *against* the code, not in its defense. Documentation tells you what the code is supposed to do. Your job is to find where it doesn't.

**Do not let documentation explanations excuse code defects.** If the docs say "the library handles X gracefully" but the code doesn't check for X, that's a bug — the documentation makes it *more* of a bug, not less. A richer understanding of intent should make you *harder* on the code, not softer.

The failure mode this addresses: when models have access to documentation, they build a richer mental model of the software and become more *forgiving* of code that approximately matches that model. The documentation gives the model reasons to believe the code works, which suppresses detections. Fight this by treating documentation as the prosecution's evidence — it defines what the code promised, and your job is to find broken promises.

### Test-finding alignment check

For each regression test that claims to reproduce a specific finding, verify that the test actually exercises the cited code path. A test that targets a different function, a different branch, or a different failure mode than the finding it claims to reproduce is worse than no test — it creates false confidence.

**Verification procedure:** For each regression test:

1. Read the finding: note the specific file, line number, function, and failure condition
2. Read the test: identify which function it calls and what condition it asserts
3. Confirm alignment: the test must call the function cited in the finding, trigger the specific condition the finding describes, and assert on the behavior the finding says is wrong

If the test doesn't exercise the cited code path, either fix the test or mark the finding as UNCONFIRMED. Do not ship a regression test that passes or fails for reasons unrelated to the finding.

### Closure mandate

Every confirmed BUG finding must produce a regression test in `quality/test_regression.*`. The test must be an executable source file in the project's language — not a Markdown file, not prose documentation, not a comment block describing what a test would do. If the project uses Java, write a `.java` file. If Python, a `.py` file. The test must compile (or parse) and be runnable by the project's test framework.

**No language exemptions.** If introducing failing tests before fixes is a concern, use the language's expected-failure mechanism. The guard must be the **earliest syntactic guard for the framework** — a decorator or annotation where idiomatic, otherwise the first executable line in the test body:

- **Python (pytest):** `@pytest.mark.xfail(strict=True, reason="BUG-NNN: [description]")` — decorator above `def test_...():`. When the bug is present: XFAIL (expected). When the bug is fixed but the marker is not removed: XPASS → strict mode fails, signaling the guard should be removed.
- **Python (unittest):** `@unittest.expectedFailure` — decorator above the test method.
- **Go:** `t.Skip("BUG-NNN: [description] — unskip after applying quality/patches/BUG-NNN-fix.patch")` — first line inside the test function. Note: Go's `t.Skip` hides the test (reports SKIP, not FAIL), which is weaker than Python's xfail.
- **Rust:** `#[ignore]` attribute on the test function — the standard "don't run in default suite" mechanism. Use `#[should_panic]` only for panic-shaped bugs.
- **Java (JUnit 5):** `@Disabled("BUG-NNN: [description]")` — annotation above the test method.
- **TypeScript/JavaScript (Jest):** `test.failing("BUG-NNN: [description]", () => { ... })`
- **TypeScript/JavaScript (Vitest):** `test.fails("BUG-NNN: [description]", () => { ... })`
- **JavaScript (Mocha):** `it.skip("BUG-NNN: [description]", () => { ... })` or `this.skip()` inside the test body for conditional skipping.

Every guard must reference the bug ID (BUG-NNN format) and the fix patch path so that someone encountering a skipped test knows how to resolve it.

These patterns ensure every bug has an executable test that can be enabled when the fix lands, without polluting CI with expected failures.

**TDD red/green interaction with skip guards.** During TDD verification, the red and green phases must temporarily bypass the skip guard:

- **Red phase (NEVER SKIPPED):** Remove or disable the guard, run against unpatched code. Must fail. Re-enable the guard after recording the result. **The red phase is mandatory for every confirmed bug, even when no fix patch exists.** Record `verdict: "confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"`. Do not use `verdict: "skipped"` — that value is deprecated.
- **Green phase:** Remove or disable the guard, apply the fix patch, run. Must pass. Re-enable the guard if the fix will be reverted. If no fix patch exists, record `green_phase: "skipped"`.
- **After successful red→green:** Generate a per-bug writeup at `quality/writeups/BUG-NNN.md` (see SKILL.md File 7, "Bug writeup generation"). Record the path in `tdd-results.json` as `writeup_path`. After writing `tdd-results.json`, reopen it and verify all required fields, enum values, and no extra undocumented root keys (see the SKILL.md post-write validation step). Both sidecar JSON files must use `schema_version: "1.1"`.
- **After the TDD cycle:** The guard remains in the committed regression test file, removed only when the fix is permanently merged.

**The only acceptable exemption** is when a regression test genuinely cannot be written — for example, the bug requires multi-threaded timing that can't be reliably reproduced, or requires an external service not available in the test environment. In that case, write an explicit exemption note in the combined summary explaining why, and include a minimal code sketch showing what you would test if you could.

Findings without either an executable regression test or an explicit exemption note are incomplete. The combined summary must not include unresolved findings — every BUG must have closure.

### Regression test semantic convention

All regression tests must assert **desired correct behavior** and be marked as expected-to-fail on the current code. Do not write tests that assert the current broken behavior and pass. The distinction matters:

- **Correct:** Test says "this input should produce X" → test fails because buggy code produces Y → marked `xfail`/`@Disabled`/`t.Skip` → when the bug is fixed, the test passes and the skip marker is removed.
- **Wrong:** Test says "this input produces Y (the buggy output)" → test passes on buggy code → when the bug is fixed, the test fails silently → stale test that now asserts wrong behavior.

The `xfail(strict=True)` pattern (Python/pytest) is the gold standard: while the bug is present the test fails and is reported as an expected failure (XFAIL), and once the bug is fixed the suite fails if the marker wasn't removed (strict XPASS). Other languages should approximate this with skip + reason.
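
For example, a pytest regression test following this convention might look like the sketch below — the bug ID, module, and function are hypothetical, not taken from any real finding:

```python
import pytest
from mylib.header import parse_header  # hypothetical module and function

@pytest.mark.xfail(strict=True, reason="BUG-012: header length read as big-endian")
def test_header_length_is_little_endian():
    # Asserts the DESIRED behavior. While BUG-012 is present, this test
    # fails and is reported XFAIL; once the bug is fixed, strict=True
    # turns the unexpected pass into a failure until the marker is removed.
    assert parse_header(b"\x04\x00\x00\x00").length == 4
```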

### Post-review closure verification

After writing all regression tests and the combined summary, run this checklist:

1. **Count BUGs in the combined summary.** This is the expected count.
2. **Count test functions in `quality/test_regression.*`.** This should equal or exceed the BUG count (some BUGs may need multiple tests).
3. **For each BUG row in the summary**, verify it has either:
   - A `REGRESSION TEST:` line citing the test function name, OR
   - An `EXEMPTION:` line explaining why no test was written
4. **If any BUG lacks both**, go back and write the test or the exemption before declaring the review complete.

This checklist is the enforcement mechanism for the closure mandate. Without it, the mandate is aspirational — agents document bugs fully in the pass summaries but skip the regression test and move on.
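
A rough sketch of steps 1–3 as a script, assuming the summary-row and closure-line formats described above — the paths and line formats are illustrative placeholders:

```python
import re
from pathlib import Path

# Placeholder paths — substitute the actual review and regression-test files.
review = Path("quality/code_reviews/YYYY-MM-DD-reviewer.md").read_text(encoding="utf-8")
tests = Path("quality/test_regression.py").read_text(encoding="utf-8")

# Step 1: BUG rows in the combined summary table.
bug_rows = [ln for ln in review.splitlines() if "| BUG" in ln]
# Step 2: test functions in the regression file.
test_count = len(re.findall(r"^def test_", tests, re.MULTILINE))
# Step 3: closure lines (one per BUG — a test citation or an exemption).
closures = sum(1 for ln in review.splitlines()
               if "REGRESSION TEST:" in ln or "EXEMPTION:" in ln)

print(f"BUG rows: {len(bug_rows)}, tests: {test_count}, closure lines: {closures}")
if test_count < len(bug_rows) or closures < len(bug_rows):
    print("FAIL: at least one BUG lacks a regression test or exemption")
```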

### Post-spec-audit regression tests

The closure mandate applies to spec-audit confirmed code bugs, not just code review bugs. After the spec audit triage categorizes findings, every finding classified as "Real code bug" must get a regression test — using the same conventions as code review regression tests (executable source file, expected-failure marker, test-finding alignment).

**Why this is a separate step:** Code review regression tests are written immediately after the code review, before the spec audit runs. This means spec-audit bugs are systematically orphaned — they appear in the triage report but never enter the regression test file. Across v1.3.4 runs on 8 repos, spec-audit bugs accounted for ~30% of all findings, and only 1 of 8 repos (httpx) wrote regression tests for them.

**Procedure:**

1. After spec audit triage, read the triage summary for findings classified as "Real code bug."
2. For each, write a regression test in `quality/test_regression.*` using the same format as code review regression tests. Use the spec audit report as the source citation: `[BUG from spec_audits/YYYY-MM-DD-triage.md]`.
3. Run the test to confirm it fails (expected) or passes (needs investigation).
4. Update the cumulative BUG tracker in PROGRESS.md with the test reference.

If the spec audit produced no confirmed code bugs, skip this step — but document that in PROGRESS.md so the audit trail is complete.

### Cleaning up after spec audit reversals

When the spec audit overturns a code review finding (classifies a BUG as a design choice or false positive), the corresponding regression test must be either deleted or moved to a separate file (`quality/design_behavior_tests.*`) that documents intentional behavior. A failing test that points at documented-correct behavior is worse than no test — it creates noise and erodes trust in the regression suite.

After spec audit triage, check: does any test in `quality/test_regression.*` correspond to a finding that was reclassified as non-defect? If so, remove it from the regression file — delete it or move it to the design-behavior file.

### Why Three Passes Instead of Focus Areas

Previous experiments (the QPB NSQ benchmark) showed that focus areas don't reliably improve AI code review. A generic "review for bugs" prompt scored 65.5%, while a playbook with 7 named focus areas scored 48.3% — the focus areas narrowed the model's attention and suppressed detections.

The three-pass pipeline works because each pass does one thing well with no cross-contamination:

- **Pass 1** lets the model do what it's already good at (structural review, ~65% of defects)
- **Pass 2** catches individual requirement violations that structural review misses (absence bugs, spec deviations)
- **Pass 3** catches contradictions between individually-correct pieces of code (cross-file arithmetic bugs, security policy gaps)

Experiments on the NSQ codebase showed this pipeline finding 2 of 3 defects that were invisible to all structural review conditions — with zero knowledge of the specific bugs. The defects found were a cross-file numeric mismatch (validation bound vs bit field width) and a security design gap (configured CA not propagated to outbound auth client).

### Phase 2: Regression Tests for Confirmed Bugs

After the code review produces findings, write regression tests that reproduce each BUG finding. This transforms the review from "here are potential bugs" into "here are proven bugs with failing tests."

---

After all tests complete, show a summary table and a recommendation:

**Passed:** 7/8 | **Failed:** 1/8

**Recommendation:** FIX BEFORE MERGE — Rate limit handling needs investigation.
```

Then save the detailed results to `quality/results/YYYY-MM-DD-integration.md`.

---

Save results to `quality/results/YYYY-MM-DD-integration.md`

[Specific failures, unexpected behavior, performance observations]

### Recommendation
[SHIP / FIX BEFORE MERGE / BLOCK]
```

### Tips for Writing Good Integration Checks

---

The number of units/records/iterations per integration test run matters:

Look for `chunk_size`, `batch_size`, or similar configuration in the project to calibrate. When in doubt, 10–30 records is usually the right range for integration testing — enough to catch real issues without burning API budget.

### Integration Testing for Skills and LLM-Automated Tools

When the project under test is an AI skill, a CLI tool that wraps LLM calls, or any software whose primary execution path involves invoking an AI model, the integration test protocol must include **LLM-automated integration tests** — tests that run the tool end-to-end via a command-line AI agent and structurally verify the output.

This is distinct from standard integration tests because the system under test doesn't have a deterministic API to call. The "integration" is: install the skill into a test repo, invoke it through a CLI agent (GitHub Copilot CLI, Claude Code, or similar), and verify the output artifacts meet structural and content expectations.

**Why this matters:** Skills and LLM tools cannot be tested by calling functions directly — their execution path goes through an AI agent that interprets instructions, reads files, and produces artifacts. The only way to test whether the skill works is to run it. Manual execution is fine for development, but a quality playbook should encode the test as a repeatable protocol.

**Protocol structure for skill/LLM integration tests:**

```markdown
## Skill Integration Test Protocol

### Prerequisites
- CLI agent installed and configured (e.g., `gh copilot`, `claude`, `npx @anthropic-ai/claude-code`)
- Test repo prepared with skill installed at `.github/skills/SKILL.md` (or equivalent)
- Clean `quality/` directory (no artifacts from prior runs)
- Optional: `reference_docs/` folder for with-docs comparison runs

### Test Matrix

| Test | Method | Pass Criteria |
|------|--------|---------------|
| Full execution | Run skill via CLI with "execute" prompt | All expected artifacts exist in `quality/` |
| PROGRESS.md completeness | Read `quality/PROGRESS.md` | All phases checked complete, BUG tracker populated |
| Artifact structural check | Verify each expected file | Files are non-empty, contain expected sections |
| BUG tracker closure | Count BUG entries vs regression tests | Every BUG has a test reference or exemption |
| Baseline vs with-docs (optional) | Run twice: without and with reference_docs/ | With-docs run produces >= baseline requirement count |

### Execution

```bash
# Install skill into test repo
cp -r path/to/skill/.github test-repo/.github

# Run via CLI agent (adapt command to your agent)
cd test-repo
gh copilot -p "Read .github/skills/SKILL.md and its reference files. Execute the quality playbook for this project." \
  --model gpt-5.4 --yolo > quality_run.output.txt 2>&1
```

### Structural Verification (automated)

After the run, verify output structurally:

```bash
# Required artifacts exist and are non-empty
for f in quality/QUALITY.md quality/REQUIREMENTS.md quality/CONTRACTS.md \
         quality/COVERAGE_MATRIX.md quality/COMPLETENESS_REPORT.md \
         quality/PROGRESS.md quality/RUN_CODE_REVIEW.md \
         quality/RUN_INTEGRATION_TESTS.md quality/RUN_SPEC_AUDIT.md; do
  [ -s "$f" ] || echo "FAIL: $f missing or empty"
done

# Functional test file exists (language-appropriate name)
ls quality/test_functional.* quality/FunctionalSpec.* quality/functional.test.* 2>/dev/null \
  || echo "FAIL: no functional test file"

# PROGRESS.md has all phases checked
grep -c '\[x\]' quality/PROGRESS.md  # should equal total phase count

# BUG tracker has entries (if bugs were found)
grep -c '^| [0-9]' quality/PROGRESS.md

# Code reviews and spec audits produced substantive files
find quality/code_reviews -name "*.md" -size +500c | wc -l   # should be >= 1
find quality/spec_audits -name "*triage*" -size +500c | wc -l  # should be >= 1
```
```

**Baseline vs with-docs comparison pattern:** Run the skill twice on the same repo — once without supplemental docs, once with a `reference_docs/` folder containing project history. Compare: requirement count, scenario count, bug count, and pipeline completion. The with-docs run should produce equal or more requirements and equal or more bugs. If the baseline outperforms the with-docs run on bug detection, that's a finding about the docs quality, not a skill failure.

**When to generate this protocol:** Generate a skill integration test section in `RUN_INTEGRATION_TESTS.md` whenever the project being analyzed is a skill, a CLI tool that wraps AI calls, or a framework for building AI-powered tools. Look for: `SKILL.md` files, prompt templates, LLM client configurations, agent orchestration code, or references to AI models in the codebase.

### Post-Run Verification Depth

A run that completes without errors may still be wrong. For each integration test run, verify at multiple levels:

---
# Run-State Schema (v1.5.6)
|
||||
|
||||
*Authoritative schema for `quality/run_state.jsonl`, `quality/PROGRESS.md`, and `Calibration Cycles/<cycle>/run_state.jsonl`. The playbook AI writes these files directly via the file-tool layer; the orchestrator AI reads them to drive multi-benchmark calibration cycles.*
|
||||
|
||||
*Companion to: `docs/design/QPB_v1.5.5_Design.md` ("Design — Run-state event taxonomy" section).*
|
||||
|
||||
---
|
||||
|
||||
## File locations and ownership
|
||||
|
||||
- `<benchmark>/quality/run_state.jsonl` — per-run event log. Append-only. Written by the AI executing the playbook.
|
||||
- `<benchmark>/quality/PROGRESS.md` — human-readable run status. Atomically rewritten by the AI on each event.
|
||||
- `Calibration Cycles/<cycle>/run_state.jsonl` — cycle-level event log. Append-only. Written by the orchestrator AI.
|
||||
|
||||
All three live in the bind-mounted workspace owned by the user. The AI writes via Edit/Write file tools, never via shell redirection or `tee` (which routes through a different UID layer in some sandbox runtimes).
|
||||
|
||||
---

## Schema versioning

Every `run_state.jsonl` opens with an `_index` event recording `schema_version`. Current version: `"1.5.6"`. Schema bumps preserve backward compatibility — older files remain readable by newer parsers. Breaking schema changes bump the major number.

---

## Required fields (every event)

Every event object MUST have:

- `ts` — ISO 8601 UTC timestamp with `Z` suffix (e.g. `"2026-05-15T14:32:01Z"`). Sub-second precision allowed but not required.
- `event` — string, the event-type name. Must match one of the names listed in `_index.event_types`.

Events MAY have additional fields per their type's spec below. Unknown fields are tolerated by readers (forward-compatible).

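For concreteness, here is a minimal sketch of what the opening two lines of a fresh per-run log look like. The field values are invented for illustration, and the playbook itself writes these lines through the file-tool layer rather than a script; the snippet only demonstrates the required shape.

```python
import json
from datetime import datetime, timezone

def now() -> str:
    # ISO 8601 UTC with Z suffix, as the `ts` field requires
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

ts = now()
opening = [
    {"event": "_index", "ts": ts, "schema_version": "1.5.6",
     "event_types": ["_index", "run_start", "phase_start", "phase_end", "run_end"],
     "benchmark": "chi-1.3.45", "lever_state": "baseline", "started_at": ts},
    {"event": "run_start", "ts": now(), "runner": "claude",
     "playbook_version": "1.5.6", "target_path": "benchmarks/chi-1.3.45"},
]
for line in opening:
    print(json.dumps(line))  # one JSON object per JSONL line
```
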
---

## Per-run events (`<benchmark>/quality/run_state.jsonl`)

### `_index`

ALWAYS the first line. Records schema metadata.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | Always `"_index"` |
| `ts` | string | yes | ISO 8601 UTC |
| `schema_version` | string | yes | `"1.5.6"` |
| `event_types` | array of string | yes | Every event type this file uses |
| `benchmark` | string | yes | E.g. `"chi-1.3.45"`, `"virtio-1.5.1"` |
| `lever_state` | string | yes | E.g. `"pre-pattern7"`, `"post-pattern7"`, `"baseline"` |
| `started_at` | string | yes | ISO 8601 UTC, equals `ts` of this event |

### `run_start`

Marks the beginning of a playbook run.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"run_start"` |
| `ts` | string | yes | |
| `runner` | string | yes | One of `"claude"`, `"codex"`, `"copilot"`, `"cursor"` |
| `playbook_version` | string | yes | E.g. `"1.5.6-pre"`, `"1.5.6"` (matches `bin.benchmark_lib.RELEASE_VERSION`) |
| `target_path` | string | yes | Relative path to benchmark target |

### `phase_start`

Marks the beginning of one of the six playbook phases.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"phase_start"` |
| `ts` | string | yes | |
| `phase` | integer | yes | 1, 2, 3, 4, 5, or 6 |

### `pattern_walked`

Phase 1 only. Records that one of the seven exploration patterns was walked.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"pattern_walked"` |
| `ts` | string | yes | |
| `phase` | integer | yes | Always 1 |
| `pattern` | integer | yes | 1 through 7 |
| `findings_count` | integer | yes | Number of findings produced by this pattern |
| `duration_seconds` | number | optional | Wall-clock for this pattern walk |

### `pass_started` / `pass_ended`

Phase 4 only. Records start/end of one of the four skill-derivation passes.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"pass_started"` or `"pass_ended"` |
| `ts` | string | yes | |
| `phase` | integer | yes | Always 4 |
| `pass` | string | yes | One of `"A"`, `"B"`, `"C"`, `"D"` |
| `output_artifact` | string | optional | Relative path to pass artifact (on `pass_ended`) |

### `finding_logged`

Records that a finding (skill-divergence, code-bug, etc.) was logged in the current phase.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"finding_logged"` |
| `ts` | string | yes | |
| `phase` | integer | yes | 1-6 |
| `finding_id` | string | yes | E.g. `"BUG-007"`, `"REQ-042"` |
| `category` | string | yes | E.g. `"code-bug"`, `"skill-divergence"`, `"missing-citation"`, `"prose-to-code-mismatch"` |

### `artifact_written`

Records that an artifact file was produced/updated.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"artifact_written"` |
| `ts` | string | yes | |
| `relative_path` | string | yes | Path relative to benchmark target (e.g. `"quality/EXPLORATION.md"`) |
| `byte_size` | integer | optional | Size of the file at write time |
| `line_count` | integer | optional | Line count |

### `gate_check`

Records the outcome of a single quality-gate check.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"gate_check"` |
| `ts` | string | yes | |
| `gate_name` | string | yes | Identifier from `quality_gate.py` |
| `verdict` | string | yes | One of `"pass"`, `"fail"`, `"warn"`, `"skip"` |
| `reason` | string | optional | Human-readable explanation |

### `phase_end`

Marks the end of a phase. Cross-validated against the phase's expected artifacts before being written (see "Cross-validation rules" below).

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"phase_end"` |
| `ts` | string | yes | |
| `phase` | integer | yes | 1-6 |
| `key_counts` | object | yes | Phase-specific counts (see below) |
| `artifacts_produced` | array of string | yes | Relative paths of artifacts produced this phase |
| `duration_seconds` | number | optional | Wall-clock for the whole phase |

`key_counts` per phase:

- Phase 1: `{"findings_total": N, "patterns_walked": M}` (M should be 7 for full Phase 1)
- Phase 2: `{"findings_promoted": N, "findings_dropped": M}`
- Phase 3: `{"bugs_identified": N, "bug_writeups": M}`
- Phase 4: `{"req_count": N, "uc_count": M, "passes_complete": K}` (K should be 4)
- Phase 5: `{"gate_checks_total": N, "gate_failures": M}`
- Phase 6: `{"bugs_md_count": N, "gate_verdict": "pass|fail|partial"}`

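A minimal reader-side sanity check over these expectations might look like the following sketch; the helper and its messages are illustrative, not part of the shipped tooling.

```python
# Illustrative check: does a phase_end event carry the key_counts keys
# this schema expects for its phase?
REQUIRED_KEY_COUNTS = {
    1: {"findings_total", "patterns_walked"},
    2: {"findings_promoted", "findings_dropped"},
    3: {"bugs_identified", "bug_writeups"},
    4: {"req_count", "uc_count", "passes_complete"},
    5: {"gate_checks_total", "gate_failures"},
    6: {"bugs_md_count", "gate_verdict"},
}

def check_phase_end(event: dict) -> list[str]:
    problems = []
    phase = event.get("phase")
    counts = event.get("key_counts", {})
    missing = REQUIRED_KEY_COUNTS.get(phase, set()) - set(counts)
    if missing:
        problems.append(f"phase_end phase={phase} missing key_counts: {sorted(missing)}")
    if phase == 1 and counts.get("patterns_walked") != 7:
        problems.append("Phase 1 ended without walking all 7 patterns")
    return problems
```
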
### `error`

Records an error during the run.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"error"` |
| `ts` | string | yes | |
| `phase` | integer | optional | If error is phase-scoped |
| `message` | string | yes | Human-readable description |
| `recoverable` | boolean | yes | If true, the run will retry the affected phase; if false, the run is aborting |

### `documentation_state`

v1.5.6+. Records the documentation-availability state at Phase 1 entry. Currently the only emitted state is `"code_only"`, indicating that `reference_docs/` and `reference_docs/cite/` carry no recognized plaintext content (`.md` or `.txt`) and Phase 1 is proceeding in code-only mode (see `references/code-only-mode.md`). A `"with_docs"` value is reserved for future explicit emission; today the absence of a `documentation_state` event implies docs were present.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"documentation_state"` |
| `ts` | string | yes | |
| `state` | string | yes | Currently `"code_only"`. Future values may include `"with_docs"`. |
| `reason` | string | yes | Free-form (e.g. `"reference_docs/ empty"`) |

When `documentation_state state="code_only"` is emitted, the playbook also prepends a "Documentation status: code-only mode" section to `quality/EXPLORATION.md` and adds a "Documentation state: code_only" line to `quality/PROGRESS.md` so the downgrade is visible to anyone reading either artifact. New runs adding the `documentation_state` event must include it in the `_index.event_types` list.

### `aborted_missing_docs`

v1.5.6+. Records that the run aborted at Phase 1 entry because `--require-docs` was set and `reference_docs/` was empty. Mutually exclusive with `documentation_state state="code_only"` for the same Phase 1 entry — `--require-docs` is the opt-IN abort path; the absence of the flag preserves the documented code-only-mode downgrade. After this event the runner returns non-zero without invoking any LLM work, so no `phase_start phase=1` is recorded.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"aborted_missing_docs"` |
| `ts` | string | yes | |
| `reason` | string | yes | Free-form (e.g. `"reference_docs/ empty and --require-docs set"`) |

When `aborted_missing_docs` is emitted, the playbook also writes an `ERROR: aborted_missing_docs — <reason>` block to `quality/PROGRESS.md` so the abort is visible without reading the JSONL. New runs that pass `--require-docs` against an empty `reference_docs/` must include `aborted_missing_docs` in the `_index.event_types` list.

### `run_end`

Marks the end of the playbook run.

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"run_end"` |
| `ts` | string | yes | |
| `status` | string | yes | One of `"success"`, `"aborted"`, `"failed"` |
| `total_findings` | integer | optional | Sum across all phases |
| `final_verdict` | string | optional | The Phase 6 gate verdict |

---

## Cycle-level events (`Calibration Cycles/<cycle>/run_state.jsonl`)

### `_index` (cycle-level)

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"_index"` |
| `ts` | string | yes | |
| `schema_version` | string | yes | `"1.5.6"` |
| `event_types` | array of string | yes | |
| `cycle_name` | string | yes | E.g. `"2026-05-15-pattern7-displacement-recovery"` |
| `lever_under_test` | string | yes | E.g. `"lever-1-exploration-breadth-depth"` |
| `benchmarks` | array of string | yes | Cycle's pinned benchmark list |
| `iteration` | integer | yes | Iteration ordinal (1, 2, or 3 — see iterate-cap) |

### `cycle_start`

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"cycle_start"` |
| `ts` | string | yes | |
| `hypothesis` | string | yes | The cycle's testable hypothesis |
| `noise_floor_threshold` | number | yes | Recall delta below this is treated as noise (default 0.05) |

### `benchmark_start`

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"benchmark_start"` |
| `ts` | string | yes | |
| `benchmark` | string | yes | |
| `lever_state` | string | yes | `"pre-lever"` or `"post-lever"` |

### `lever_change_applied`

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"lever_change_applied"` |
| `ts` | string | yes | |
| `lever_id` | string | yes | E.g. `"lever-1-exploration-breadth-depth"` |
| `files_changed` | array of string | yes | Paths relative to QPB repo root |
| `commit_sha` | string | yes | Commit SHA on the implementing branch |
| `description` | string | yes | What the change is (e.g. `"Pattern 7 budget cap 3-5 → 2-3"`) |

### `lever_change_reverted`

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"lever_change_reverted"` |
| `ts` | string | yes | |
| `files_changed` | array of string | yes | |
| `commit_sha` | string | optional | Null/absent if revert is uncommitted |

### `benchmark_end`

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"benchmark_end"` |
| `ts` | string | yes | |
| `benchmark` | string | yes | |
| `lever_state` | string | yes | |
| `recall` | number | yes | 0.0-1.0 |
| `bugs_found` | array of string | yes | Bug IDs found this run |
| `bugs_missed` | array of string | yes | Bug IDs in baseline missed this run |
| `historical_baseline_path` | string | yes | Path to the baseline BUGS.md used for recall computation |

### `cycle_end`

| Field | Type | Required | Notes |
|---|---|---|---|
| `event` | string | yes | `"cycle_end"` |
| `ts` | string | yes | |
| `verdict` | string | yes | One of `"ship"`, `"revert"`, `"iterate"`, `"halt-iterate-cap"` |
| `recall_before` | object | yes | Per-benchmark recall before lever change |
| `recall_after` | object | yes | Per-benchmark recall after lever change |
| `delta` | object | yes | Per-benchmark delta (recall_after - recall_before) |
| `cross_benchmark_check` | object | yes | `{"clean": bool, "regressions": [list of bench/bug pairs that regressed]}` |

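To make the delta arithmetic concrete, here is an illustrative sketch of how per-benchmark deltas and a candidate verdict could be derived from paired pre-/post-lever recalls. It simplifies the real decision rules (for example, it tracks regressing benchmarks rather than bench/bug pairs, and it omits the iterate-cap halt), so treat it as a reading aid only.

```python
# Illustrative only: derive per-benchmark recall deltas and a candidate
# verdict, using the cycle's noise-floor threshold.
def cycle_summary(pre: dict[str, float], post: dict[str, float],
                  noise_floor: float = 0.05) -> dict:
    delta = {b: post[b] - pre[b] for b in pre}
    regressions = [b for b, d in delta.items() if d < -noise_floor]
    improvements = [b for b, d in delta.items() if d > noise_floor]
    if regressions:
        verdict = "revert"
    elif improvements:
        verdict = "ship"
    else:
        verdict = "iterate"  # every delta is within the noise floor
    return {"recall_before": pre, "recall_after": post, "delta": delta,
            "verdict": verdict,
            "cross_benchmark_check": {"clean": not regressions,
                                      "regressions": regressions}}
```
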
---

## Cross-validation rules (per `phase_end`)

The AI verifies these conditions before appending a `phase_end` event. If any check fails, the AI appends an `error` event with `recoverable: true` and re-runs the failing phase.

| Phase | Required conditions |
|---|---|
| 1 | `quality/EXPLORATION.md` exists, ≥ 120 lines (aligned with the Phase 2 startup gate in `bin/run_playbook.check_phase_gate`), contains at least one finding section (regex `^##\s+(Finding\|Open Exploration Findings\|\d+\.)` — accepts `## Finding ...`, the SKILL-prescribed exact heading `## Open Exploration Findings`, and numbered `## N.` headings) |
| 2 | All nine fixed-name Generate-contract artifacts exist non-empty under `quality/`: `REQUIREMENTS.md`, `QUALITY.md`, `CONTRACTS.md`, `COVERAGE_MATRIX.md`, `COMPLETENESS_REPORT.md`, `RUN_CODE_REVIEW.md`, `RUN_INTEGRATION_TESTS.md`, `RUN_SPEC_AUDIT.md`, `RUN_TDD_TESTS.md`. Plus at least one non-empty `quality/test_functional.<ext>` (extension varies by primary language). Pre-v1.5.6 this row described the v1.5.5-design triage model (`EXPLORATION_MERGED.md` / `triage.md`); that mapping was never adopted by shipped SKILL.md / orchestrator_protocol.md / agent files, which always documented Phase 2 as Generate. |
| 3 | `quality/code_reviews/` directory contains at least one review file. If `quality/BUGS.md` has any `### BUG-` heading, `quality/patches/` contains at least one `BUG-*-regression-test.patch` file. Pre-v1.5.6 this row checked `quality/RUN_CODE_REVIEW.md` (a Phase 2 Generate output, not a Phase 3 review result) — same v1.5.5-design / shipped-Generate drift class as the Phase 2 row. Cluster B reconciled. |
| 4 | `quality/spec_audits/` directory contains at least one `*-triage.md` file AND at least one `*-auditor-*.md` file (per orchestrator_protocol.md naming convention). When neither name pattern matches, the validator falls back to a weaker "≥2 files" check — older bootstrap runs with arbitrary `.md` names still pass; the gate at Phase 6 enforces deeper conformance. Pre-v1.5.6 this row checked `quality/REQUIREMENTS.md` + `COVERAGE_MATRIX.md` (Phase 2 outputs) — same v1.5.5-design drift class. Cluster B reconciled. |
| 5 | If `quality/BUGS.md` has confirmed `### BUG-` entries: `quality/results/tdd-results.json` exists non-empty; for every confirmed bug, `quality/writeups/BUG-NNN.md` exists AND `quality/results/BUG-NNN.red.log` exists. With no confirmed bugs the row is vacuously satisfied. Pre-v1.5.6 this row checked `quality/results/quality-gate.log` (a Phase 6 output) — same v1.5.5-design drift class. Cluster B reconciled. |
| 6 | `quality/results/quality-gate.log` exists non-empty AND `quality/PROGRESS.md` contains a `Terminal Gate Verification` section (the orchestrator-protocol marker that Phase 6 ran the script-verified gate to completion). Pre-v1.5.6 this row checked `quality/BUGS.md` + `quality/INDEX.md` — BUGS.md is a Phase 3 output, INDEX.md was never adopted in the shipped contract. Same v1.5.5-design drift class. Cluster B reconciled. |

The `run_end` event additionally requires: all 6 `phase_end` events present in the log; the final BUGS.md count matches `phase_end phase=6 key_counts.bugs_md_count`.

---

## Resume semantics

When an AI session starts on a run directory:

1. If `quality/run_state.jsonl` does not exist: fresh run. Write `_index` + `run_start` + `phase_start phase=1`.
2. If it exists: read all events. Find the last `phase_start` not followed by a matching `phase_end`. Call it the "in-progress phase".
3. Verify the in-progress phase's expected artifacts (per cross-validation rules above):
   - If artifacts complete: append the missing `phase_end` event and proceed to the next phase. Note: this is the "session crashed mid-phase but the work is done" recovery path.
   - If artifacts incomplete: re-run that phase from scratch. The prior session left a partial state that can't be safely resumed.
4. If all 6 `phase_end` events are present but no `run_end`: append `run_end status=success` and finalize.

The policy is "trust artifacts more than events." If events claim phase 4 done but `REQUIREMENTS.md` doesn't exist, the AI re-runs phase 4. If events stop mid-phase but artifacts are complete, the AI catches up the events.

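The steps above reduce to a small amount of log inspection. A minimal sketch, with the artifact verification itself left to the caller:

```python
import json
import os

def resume_point(run_dir: str) -> str:
    path = os.path.join(run_dir, "quality", "run_state.jsonl")
    if not os.path.exists(path):
        return "fresh run: write _index + run_start + phase_start phase=1"
    with open(path, encoding="utf-8") as fh:
        events = [json.loads(line) for line in fh if line.strip()]
    started = {e["phase"] for e in events if e["event"] == "phase_start"}
    ended = {e["phase"] for e in events if e["event"] == "phase_end"}
    in_progress = sorted(started - ended)
    if not in_progress:
        if len(ended) == 6 and not any(e["event"] == "run_end" for e in events):
            return "append run_end status=success"
        return f"start phase {len(ended) + 1}"
    # "Trust artifacts more than events": the caller must now check the
    # in-progress phase's expected artifacts before deciding whether to
    # catch up the log or re-run the phase from scratch.
    return f"verify artifacts for in-progress phase {in_progress[-1]}"
```
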
---

## PROGRESS.md format

Atomically rewritten on every event. Markdown.

```markdown
# QPB Run Progress

**Started:** 2026-05-15T14:32:01Z **Benchmark:** chi-1.5.1 **Lever:** post-pattern7
**Runner:** claude **Playbook version:** 1.5.6

## Phases

- [x] Phase 1 — Explore (10:10, 12 findings, patterns 1-7 walked)
- [x] Phase 2 — Generate (0:42, 9 artifacts produced)
- [x] Phase 3 — Code Review (15:31, 6 bugs identified)
- [x] Phase 4 — Spec Audit (3 auditors, 1 triage)
- [ ] Phase 5 — Reconciliation *(in progress, started 14:58:31Z)*
- [ ] Phase 6 — Verify

## Recent events (last 10)

- 2026-05-15T14:58:31Z — phase_start phase=5
- 2026-05-15T14:58:30Z — phase_end phase=4 passes=[A,B,C,D] req_count=89
- 2026-05-15T14:42:11Z — phase_end phase=1 findings=12

## Artifacts produced

- quality/EXPLORATION.md (12,034 bytes)
- quality/REQUIREMENTS.md (28,891 bytes)
- quality/COVERAGE_MATRIX.md (3,022 bytes)
```

All four sections (header, phase checklist, recent events, artifacts produced) are required. The phase checklist uses `[x]` for complete phases (with summary stats) and `[ ]` for incomplete ones, with the in-progress phase noted explicitly along with its start time. The recent-events section shows the last 10 event lines from `run_state.jsonl` in human-readable form. The artifacts-produced section lists the files written this run with their byte sizes.

---

## Format invariants (enforced by `bin/run_state_lib.py` validators)

1. `_index` is line 1.
2. Every line is valid JSON (one object per line).
3. Every event has `ts` and `event` fields.
4. Every `event` value appears in `_index.event_types`.
5. Append-only: events are added, never edited. Editing a prior event is a schema violation.
6. `phase_start` and `phase_end` events for a given phase appear at most once per run (no out-of-order or duplicate phase markers).
7. `run_start` is the second line (after `_index`); `run_end` is the last line if the run completed.

Validators are read-only checks. They surface violations as findings; they don't auto-correct.

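A minimal sketch of a validator enforcing the statically checkable invariants (1-4, 6, and the first half of 7; invariant 5 needs file history, not a single read). It is illustrative and not the shipped `bin/run_state_lib.py`:

```python
import json

def validate_run_state(path: str) -> list[str]:
    violations = []
    events = []
    with open(path, encoding="utf-8") as fh:
        for n, raw in enumerate(fh, start=1):
            try:
                events.append(json.loads(raw))              # invariant 2
            except json.JSONDecodeError:
                violations.append(f"line {n}: not valid JSON")
    if not events or events[0].get("event") != "_index":
        violations.append("_index is not line 1")           # invariant 1
        return violations
    known = set(events[0].get("event_types", []))
    seen = set()
    for n, ev in enumerate(events, start=1):
        if "ts" not in ev or "event" not in ev:             # invariant 3
            violations.append(f"event {n}: missing ts or event")
            continue
        if ev["event"] not in known:                        # invariant 4
            violations.append(f"event {n}: {ev['event']!r} not in _index.event_types")
        if ev["event"] in ("phase_start", "phase_end"):
            marker = (ev["event"], ev.get("phase"))
            if marker in seen:                              # invariant 6
                violations.append(f"duplicate phase marker {marker}")
            seen.add(marker)
    if len(events) > 1 and events[1].get("event") != "run_start":
        violations.append("run_start is not line 2")        # invariant 7
    return violations
```
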
@@ -65,6 +65,31 @@ Requirements are tagged with `[Req: tier — source]`. Weight your findings by t

---

## Pre-audit docs validation (required triage section)

The triage report must include a `## Pre-audit docs validation` section regardless of whether `reference_docs/` exists. This section documents what the auditors used as their factual baseline.

**If `reference_docs/` exists:** Spot-check the gathered docs for factual accuracy before running the audit. Stale or incorrect docs can skew audit confidence — a model that reads "the library handles X by doing Y" in the docs will rate a divergent finding higher even if the docs are wrong.

**Quick validation procedure (5 minutes max):**
1. Pick 2–3 factual claims from `reference_docs/` that describe specific runtime behavior (e.g., "invalid input raises ValueError", "field X defaults to Y", "format Z is not supported").
2. Grep the source code for the cited behavior. Does the code match the docs?
3. If any claim is wrong, note it in the triage header: "reference_docs/ contains N known inaccuracies: [list]. Findings that rely on these claims are downgraded to NEEDS REVIEW."

**Spot-checks of claims about code contents must extract, not assert.** When the spec audit prompt or pre-validation includes a claim like "function X handles constant Y at line Z," the triage must read the cited lines and report what they actually contain. Do not confirm a claim by checking that the function exists or that the constant is defined somewhere — confirm it by showing the exact text at the cited lines. Format each spot-check result as:

```
Claim: "vring_transport_features() preserves VIRTIO_F_RING_RESET at line 3527"
Actual line 3527: `default:`
Result: CLAIM IS FALSE — line 3527 is the default branch, not a RING_RESET case label
```

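A hypothetical helper for producing that format, shown only to illustrate the extract-don't-assert rule (the function name and the example call are invented):

```python
from pathlib import Path

def spot_check(path: str, line_no: int, expected_fragment: str) -> str:
    """Report what the cited line actually contains; never confirm by proxy."""
    actual = Path(path).read_text(encoding="utf-8").splitlines()[line_no - 1]
    verdict = "CLAIM SUPPORTED" if expected_fragment in actual else "CLAIM IS FALSE"
    return (f'Claim: "{expected_fragment}" at {path}:{line_no}\n'
            f"Actual line {line_no}: `{actual.strip()}`\n"
            f"Result: {verdict}")

# e.g. spot_check("drivers/virtio/virtio_ring.c", 3527, "case VIRTIO_F_RING_RESET:")
```
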
Spot-check claims derived from generated requirements or gathered docs (rather than from the code) are **hypotheses to test**, not facts to confirm. This rule prevents the contamination chain observed in v1.3.17 where a false spot-check claim was accepted as "accurate" without reading the actual lines, causing three auditors to inherit a hallucinated code-presence claim.

**If `reference_docs/` does not exist:** State this explicitly: "No supplemental docs provided. Auditors relied on in-repo specs and code only." This confirms the absence is intentional, not an oversight.

This section fires in every triage, not just when docs are present. In v1.3.5 cross-repo testing, it only fired in 1/8 repos because it was conditional — making it required ensures the audit trail always documents the factual baseline.

## Running the Audit

1. Give the identical prompt to three AI tools
@@ -73,7 +98,18 @@ Requirements are tagged with `[Req: tier — source]`. Weight your findings by t

## Triage Process

After all three models report, merge findings.

**Log the effective council size.** If a model did not return a usable report (timeout, empty output, refusal), record this in the triage header:

```
## Council Status
- Model A: Fresh report received (YYYY-MM-DD)
- Model B: Fresh report received (YYYY-MM-DD)
- Model C: TIMEOUT — no usable report. Effective council: 2/3.
```

When the effective council is 2/3, downgrade the confidence tier: "All three" becomes impossible, "Two of three" becomes the ceiling. When the effective council is 1/3, all findings are "Needs verification" regardless of how confident that single model is. Do not silently substitute stale reports from prior runs — if a model didn't produce a fresh report for this run, it didn't participate.

| Confidence | Found By | Action |
|------------|----------|--------|
@@ -81,10 +117,36 @@ After all three models report, merge findings:
| High | Two of three | Likely real — verify and fix |
| Needs verification | One only | Could be real or hallucinated — deploy verification probe |

**When the effective council is 2/3 or less:** Distinguish single-auditor findings from multi-auditor findings explicitly in the triage. With a 2/3 council, a finding from both present auditors has "High" confidence. A finding from only one present auditor has "Needs verification" — it cannot be promoted to confirmed BUG without a verification probe, because the missing auditor might have contradicted it. Do not treat all findings as equivalent just because the council is incomplete.

In the triage summary table, add a column for auditor agreement: "2/2 present", "1/2 present", etc. This makes the confidence tier visible and auditable.

**Incomplete council gate for enumeration/dispatch checks.** If the effective council is less than 3/3 and the run includes whitelist/enumeration/dispatch-function checks (claims about which constants a function handles), the audit may not conclude "no confirmed defects" for those checks without executed mechanical proof. Check whether `quality/mechanical/<function>_cases.txt` exists for each relevant function. If it does and shows the constant is present, the claim is confirmed. If it does and shows the constant is absent, the claim is false regardless of what any auditor wrote. If no mechanical artifact exists, generate one before closing the enumeration check. This rule exists because v1.3.18 had an effective council of 1/3, and the single model's triage fabricated line contents for enumeration claims — a mechanical artifact would have caught the contradiction.

### The Verification Probe

When models disagree on factual claims, deploy a read-only probe: give one model the disputed claim and ask it to read the code and report ground truth. Never resolve factual disputes by majority vote — the majority can be wrong about what code actually does.

**Verification probes must produce executable evidence.** Prose reasoning is not sufficient for either confirmations or rejections. Every verification probe must produce a test assertion that mechanically proves the determination:

**For rejections** (finding is false positive): Write an assertion that PASSES, proving the auditor's claim is wrong:
```python
# Rejection proof: function X does check for null at line 247
assert "if (ptr == NULL)" in source_of("X"), "X has null check at line 247"
```
If you cannot write a passing assertion, **do not reject the finding**. The inability to produce mechanical proof is itself evidence that the finding may be real.

**For confirmations** (finding is a real bug): Write an assertion that FAILS (expected-failure), proving the bug exists:
```python
# Confirmation proof: RING_RESET is not a case label in the whitelist
assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), \
    "RING_RESET should be in the switch but is not — cleared by default at line 3527"
```

**Every assertion must cite an exact line number** for the evidence it references. Not "lines 3527-3528" but "line 3527: `default:`" — showing what the line actually contains.

**Why this rule exists:** In v1.3.16 virtio testing, the triage received a correct minority finding that VIRTIO_F_RING_RESET was missing from a switch/case whitelist. The triage performed a verification probe that claimed lines 3527-3528 "explicitly preserve VIRTIO_F_RING_RESET" — but those lines contained the `default:` branch. The probe hallucinated compliance. Had it been required to write `assert "case VIRTIO_F_RING_RESET:" in source`, the assertion would have failed, exposing the hallucination. Requiring executable evidence makes hallucinated rejections self-defeating.

### Categorize Each Confirmed Finding

- **Spec bug** — Spec is wrong, code is fine → update spec
@@ -96,6 +158,45 @@ When models disagree on factual claims, deploy a read-only probe: give one model

That last category is the bridge between the spec audit and the test suite. Every confirmed finding not already covered by a test should become one.

### Legacy and historical scripts

Scripts documented as "historical," "deprecated," or "not part of current workflow" are sometimes downgraded during triage on the theory that they don't affect current operations. This is correct when the script genuinely never runs. But if the script's bug has already materialized in canonical artifacts — duplicate entries in a published file, stale data in a checked-in cache, incorrect mappings that downstream tools consume — the bug is not historical. It's a live defect in the repository's published state.

**Rule: If a legacy script's bug is already visible in canonical artifacts, promote it to confirmed BUG regardless of the script's status.** The script may be historical, but the damage it left behind is current. The regression test should target the artifact (the duplicate entry, the stale mapping), not the script — because the artifact is what users encounter.

This rule exists because v1.3.5 bootstrap runs on QPB found duplicate changelog entries and stale cache mappings produced by a "historical" script. Both triages downgraded the findings because the script was historical. But the duplicate entries were already in the published library, visible to every user.

### Cross-artifact consistency check

After triage, compare the spec audit findings against the code review findings from `quality/code_reviews/`. If the code review and spec audit disagree on the same factual claim (one says a bug is real, the other calls it a false positive), flag the disagreement and deploy a verification probe. The code review and spec audit use different methods (structural reading vs. spec comparison), so disagreements are informative, not errors. But a factual contradiction about what the code actually does needs to be resolved before either report is trusted.

## Detecting partial sessions and carried-over artifacts

### Partial session detection

A session that terminates early (timeout, context exhaustion, crash) may generate scaffolding (directory structure, empty templates) without producing the actual review or audit content. The retry mechanism in the run script can regenerate scaffolding but cannot recover the analytical work.

**After any session completes, check for partial results:**
1. If `quality/code_reviews/` exists but contains no `.md` files with actual findings (or only contains template headers with no BUG/VIOLATED/INCONSISTENT entries), the code review did not run. Mark this as FAILED in PROGRESS.md, not as "complete with no findings."
2. If `quality/spec_audits/` exists but contains no triage summary, the spec audit did not run.
3. If `quality/test_regression.*` exists but contains only imports and no test functions, regression tests were not written.

A partial session is not a "clean run with no findings" — it's a failed run that needs to be re-executed. PROGRESS.md should record this clearly: "Phase 6: FAILED — code review session terminated before producing findings. Re-run required."

### Provenance headers on carried-over artifacts

When a new playbook run finds existing artifacts from a previous run (after archiving), or when artifacts survive from a failed session, they must carry provenance headers so readers know their origin.

**If any artifact was NOT generated fresh in the current run**, add a provenance header:

```markdown
<!-- PROVENANCE: This file was carried over from a previous run ([date]).
It was NOT regenerated by the current v1.3.5 run.
Treat findings as potentially stale — verify against current source before acting. -->
```

This prevents the failure mode observed in v1.3.4 where express and zod silently preserved v1.3.3 code reviews and spec audits without marking them as archival. Users reading those artifacts assumed they were fresh v1.3.4 results.

## Fix Execution Rules

- Group fixes by subsystem, not by defect number
@@ -130,6 +231,10 @@ Different models have different audit strengths. In practice:

The specific models that excel will change over time. The principle holds: use multiple models with different strengths, and always include the four guardrails.

### Minimum model capability

The audit protocol requires reading function bodies, citing line numbers, grepping before claiming missing, and classifying defect types. Lightweight or speed-optimized models (Haiku-class, GPT-4o-mini-class) are not suitable as auditors. They tend to skim rather than read, skip the grep step, and produce shallow or empty reports ("No defects found") on codebases where stronger models find real bugs. Use models with strong code-reading ability for all three auditor slots. A weak auditor doesn't just miss findings — it reduces the Council from three independent perspectives to two.

## Tips for Writing Scrutiny Areas

The scrutiny areas are the most important part of the prompt. Generic questions like "check if the code matches the spec" produce generic answers. Specific questions that name functions, files, and edge cases produce specific findings.

@@ -1,4 +1,4 @@
# Verification Checklist (Phase 6: Verify)

Before declaring the quality playbook complete, check every benchmark below. If any fails, go back and fix it.

@@ -53,6 +53,8 @@ Run the test suite using the project's test runner:

**Check for both failures AND errors.** Most test frameworks distinguish between test failures (assertion errors) and test errors (setup failures, missing fixtures, import/resolution errors, exceptions during initialization). Both are broken tests. A common mistake: generating tests that reference shared fixtures or helpers that don't exist. These show up as setup errors, not assertion failures — but they are just as broken.

**Expected-failure (xfail) tests do not count against this benchmark.** Regression tests in `quality/test_regression.*` use expected-failure markers (`@pytest.mark.xfail(strict=True)`, `@Disabled`, `t.Skip`, `#[ignore]`) to confirm that known bugs are still present. These tests are *supposed* to fail — that's the point. The "zero failures and zero errors" benchmark applies to `quality/test_functional.*` (the functional test suite), not to `quality/test_regression.*` (the bug confirmation suite). If your test runner reports failures from xfail-marked regression tests, that's correct behavior, not a benchmark violation. If an xfail test unexpectedly *passes*, that means the bug was fixed and the xfail marker should be removed — treat that as a finding to investigate, not a test failure.

After running, check:
- All tests passed — count must equal total test count
- Zero failures
@@ -70,7 +72,7 @@ Run the project's full test suite (not just your new tests). Your new files shou

Every scenario should mention actual function names, file names, or patterns that exist in the codebase. Grep for each reference to confirm it exists.

If working from non-formal requirements, verify that each scenario and test includes a requirement tag using the canonical format: `[Req: formal — README §3]`, `[Req: inferred — from validate_input() behavior]`, `[Req: user-confirmed — "must handle empty input"]`. Inferred requirements should be flagged for user review in the Phase 7 interactive session.

### 11. RUN_CODE_REVIEW.md Is Self-Contained

@@ -93,6 +95,158 @@ If any field name, count, or type is wrong, fix it before proceeding. The table

The definitive audit prompt should work when pasted into Claude Code, Cursor, and Copilot without modification (except file reference syntax).

### 14. Structured Output Schemas Are Valid and Conformant

Verify that `RUN_TDD_TESTS.md` and `RUN_INTEGRATION_TESTS.md` both instruct the agent to produce:
- JUnit XML output using the framework's native reporter (pytest `--junitxml`, gotestsum `--junitxml`, Maven Surefire reports, `jest-junit`, `cargo2junit`)
- A sidecar JSON file (`tdd-results.json` or `integration-results.json`) in `quality/results/`

Check that each protocol's JSON schema includes all mandatory fields:
- **tdd-results.json:** `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. Per-bug: `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`.
- **integration-results.json:** `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. Per-group: `group`, `name`, `use_cases`, `result`.

Verify that the protocol does NOT contain flat command-list schemas (a `"results"` or `"commands_run"` array without `"groups"` is non-conformant). Verify that verdict/result enum values use only the allowed values defined in SKILL.md (e.g., `"TDD verified"`, `"red failed"`, `"green failed"`, `"confirmed open"` for TDD verdicts; `"pass"`, `"fail"`, `"skipped"`, `"error"` for integration results; `"SHIP"`, `"FIX BEFORE MERGE"`, `"BLOCK"` for recommendations). The TDD verdict `"skipped"` is deprecated — use `"confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"` instead. The TDD summary must include a `confirmed_open` count alongside `verified`, `red_failed`, and `green_failed`.

Both sidecar JSON templates must use `schema_version: "1.1"` (v1.1 change: `verdict: "skipped"` deprecated in favor of `"confirmed open"`). Both protocols must include a **post-write validation step** instructing the agent to reopen the sidecar JSON after writing it and verify required fields, enum values, and no extra undocumented root keys.

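As an illustration of that post-write validation step, a sketch for `tdd-results.json` follows; the field and enum lists mirror the requirements above, but the helper itself is hypothetical:

```python
import json

TDD_ROOT = {"schema_version", "skill_version", "date", "project", "bugs", "summary"}
TDD_BUG = {"id", "requirement", "red_phase", "green_phase", "verdict",
           "fix_patch_present", "writeup_path"}
TDD_VERDICTS = {"TDD verified", "red failed", "green failed", "confirmed open"}

def validate_tdd_sidecar(path: str) -> list[str]:
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    problems = []
    if missing := TDD_ROOT - data.keys():
        problems.append(f"missing root keys: {sorted(missing)}")
    if extra := data.keys() - TDD_ROOT:
        problems.append(f"undocumented root keys: {sorted(extra)}")
    if data.get("schema_version") != "1.1":
        problems.append('schema_version must be "1.1"')
    for bug in data.get("bugs", []):
        if missing := TDD_BUG - bug.keys():
            problems.append(f"{bug.get('id', '?')}: missing {sorted(missing)}")
        if bug.get("verdict") not in TDD_VERDICTS:
            problems.append(f"{bug.get('id', '?')}: invalid verdict {bug.get('verdict')!r}")
    if "confirmed_open" not in data.get("summary", {}):
        problems.append("summary missing confirmed_open count")
    return problems
```
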
### 15. Patch Validation Gate Is Executable

For each confirmed bug with patches, verify:
1. The `git apply --check` commands specified in the patch validation gate use the correct patch paths (`quality/patches/BUG-NNN-*.patch`)
2. The compile/syntax check command matches the project's actual build system — not a generic placeholder
3. For interpreted languages (Python, JavaScript), the gate specifies the appropriate syntax check (`python -m py_compile`, `node --check`, `pytest --collect-only`, or equivalent)
4. The gate includes a temporary worktree or stash-and-revert instruction to comply with the source boundary rule

### 16. Regression Test Skip Guards Are Present

Grep `quality/test_regression.*` for the language-appropriate skip/xfail mechanism. Every test function must have a guard:
- Python: `@pytest.mark.xfail` or `@unittest.expectedFailure`
- Go: `t.Skip(`
- Java: `@Disabled`
- Rust: `#[ignore]`
- TypeScript/JavaScript: `test.failing(`, `test.fails(`, or `it.skip(`

A regression test without a skip guard will cause unexpected failures when the test suite runs on unpatched code. Every guard must reference the bug ID (BUG-NNN format) and the fix patch path.

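As an illustration, a guarded Python regression test might look like this (the bug ID, module, and function are hypothetical):

```python
import pytest

# Hypothetical guarded regression test: xfail(strict=True) documents that
# BUG-007 is still present. The test reports XFAIL on unpatched code and
# flips to an unexpected pass (a finding to investigate) once the fix
# patch is applied.
@pytest.mark.xfail(strict=True,
                   reason="BUG-007: empty input not rejected; "
                          "fix: quality/patches/BUG-007-fix.patch")
def test_bug_007_empty_input_rejected():
    from mypackage import validate_input  # hypothetical module under test
    with pytest.raises(ValueError):
        validate_input("")
```
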
### 17. Integration Group Commands Pass Pre-Flight Discovery

For each integration test group command in `RUN_INTEGRATION_TESTS.md`, verify that the command discovers at least one test using the framework's dry-run mode (`pytest --collect-only`, `go test -list`, `vitest list`, `jest --listTests`, `cargo test -- --list`). A group whose command fails discovery will produce a `covered_fail` result that masks a selector bug as a code bug. If a command cannot be validated (no dry-run mode available), note the limitation.

### 18. Version Stamps Present on All Generated Files

Grep every generated Markdown file in `quality/` for the attribution line: `Generated by [Quality Playbook]`. Grep every generated code file for `Generated by Quality Playbook`. Every file must have the stamp with the correct version number. Files without stamps are not traceable to the tool and version that created them. **Exemptions:** sidecar JSON files (use `skill_version` field), JUnit XML files (framework-generated), and `.patch` files (stamp would break `git apply`). For Python files with shebang or encoding pragma, verify the stamp comes after the pragma, not before.

### 19. Enumeration Completeness Checks Performed

Verify that the code review (Pass 1 and Pass 2) performed mechanical two-list enumeration checks wherever the code uses `switch`/`case`, `match`, or if-else chains to dispatch on named constants. For each such check, the review must show: (a) the list of constants defined in headers/enums/specs, (b) the list of case labels actually present in the code, (c) any gaps. A review that claims "the whitelist covers all values" or "all cases are handled" without showing the two-list comparison is non-conformant — this is the specific hallucination pattern the check prevents.

### 20. Bug Writeups Generated for All Confirmed Bugs

For each bug in `tdd-results.json` (both `verdict: "TDD verified"` and `verdict: "confirmed open"`), verify that a corresponding `quality/writeups/BUG-NNN.md` file exists and that `tdd-results.json` has a non-null `writeup_path` for that bug. Each writeup must include: summary, spec reference, code citation, observable consequence, fix diff, and test description. A confirmed bug without a writeup is incomplete.

### 21. Triage Verification Probes Include Executable Evidence

Open the triage report (`quality/spec_audits/YYYY-MM-DD-triage.md`). For every finding that was confirmed or rejected via a verification probe, verify that the triage entry includes a test assertion (not just prose reasoning). Rejections must include a PASSING assertion proving the finding is wrong. Confirmations must include a FAILING assertion proving the bug exists. Every assertion must cite an exact line number. A triage decision based on prose reasoning alone ("lines 3527-3528 explicitly preserve X") without a mechanical assertion is non-conformant.

### 22. Enumeration Lists Extracted From Code, Not Copied From Requirements

When the code review includes an enumeration check (e.g., "case labels present in function X"), verify that the code-side list includes per-item line numbers from the actual source. If the list matches the requirements list word-for-word without line numbers, the enumeration was likely copied rather than extracted and must be redone. Also verify that the triage pre-audit spot-checks report the actual contents of cited lines ("line 3527 contains `default:`") rather than merely confirming claims ("line 3527 preserves RING_RESET").

### 23. Mechanical Verification Artifacts Exist and Pass Integrity Check

For every contract or requirement that asserts a function handles/preserves/dispatches a set of named constants (feature bits, enum values, opcode tables), verify that a corresponding `quality/mechanical/<function>_cases.txt` file exists and was generated by a non-interactive shell pipeline. Contracts that reference dispatch-function coverage without citing a mechanical artifact are non-conformant.

**Integrity check (mandatory):** Run `bash quality/mechanical/verify.sh`. This script re-executes the same extraction commands that generated each mechanical artifact and diffs the results. If ANY diff is non-empty, the artifact was tampered with — the model may have written expected output instead of capturing actual shell output. A mismatched artifact must be regenerated by re-running the extraction command (not by editing the file). This check exists because in v1.3.19, the model executed the correct awk/grep command but wrote a fabricated 9-line output (including a hallucinated `case VIRTIO_F_RING_RESET:`) to the file, when the actual command only produces 8 lines.

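The re-execution logic is small. A sketch in Python follows (the shipped `verify.sh` is a bash script; this version additionally assumes, purely for illustration, that each artifact records its generating pipeline on a leading `# cmd:` line, which is a hypothetical convention):

```python
import subprocess
from pathlib import Path

def recheck_artifact(artifact: Path) -> bool:
    """Re-run the generating pipeline and diff against the saved output."""
    lines = artifact.read_text(encoding="utf-8").splitlines()
    cmd = lines[0].removeprefix("# cmd: ")   # assumed header convention
    saved = "\n".join(lines[1:]).strip()
    fresh = subprocess.run(["bash", "-c", cmd], capture_output=True,
                           text=True, check=True).stdout.strip()
    if fresh != saved:
        print(f"MISMATCH: {artifact}: regenerate by re-running the command")
        return False
    print(f"OK: {artifact}")
    return True
```
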
### 24. Source-Inspection Regression Tests Execute (No `run=False`)

Grep `quality/test_regression.*` for `run=False` (Python), `t.Skip` with a source-inspection comment, or equivalent skip mechanisms. Any regression test whose purpose is source-structure verification (string presence in function bodies, case label existence, enum extraction) must execute — it must NOT use `run=False`. These tests are safe, deterministic string-match operations. An `xfail(strict=True)` test that actually fails reports as XFAIL (expected), which is correct behavior. A source-inspection test with `run=False` is the worst possible state: the correct check exists but never fires.

### 25. Contradiction Gate Passed (Executed Evidence vs. Prose)

Verify that no executed artifact contradicts a prose artifact at closure. Specifically: (a) if any `quality/mechanical/*` file shows a constant as absent, no prose artifact (`CONTRACTS.md`, `REQUIREMENTS.md`, code review, triage) may claim it is present; (b) if any regression test with `xfail` actually fails (XFAIL), `BUGS.md` may not claim that bug is "fixed in working tree" without a commit reference; (c) if TDD traceability shows a red-phase failure, the triage may not claim the corresponding code is compliant. Any contradiction must be resolved before closure.

### 26. Version Stamp Consistency

Read the `version:` field from the SKILL.md metadata (locate SKILL.md in the skill installation directory — typically `.github/skills/SKILL.md` or `.claude/skills/quality-playbook/SKILL.md`). Check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure. This check exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions due to a hardcoded template.

### 27. Mechanical Directory Conformance

If `quality/mechanical/` exists, it must contain at minimum a `verify.sh` file. An empty `quality/mechanical/` directory is non-conformant. If no dispatch-function contracts exist, the directory should not exist — instead record `Mechanical verification: NOT APPLICABLE` in PROGRESS.md. If the directory exists with extraction artifacts, `verify.sh` must include one verification block per saved file (not just one). A verify.sh that checks only one artifact when multiple exist is incomplete.

### 28. TDD Artifact Closure

If `quality/BUGS.md` contains any confirmed bugs, `quality/results/tdd-results.json` is mandatory. If any bug has a red-phase result, `quality/TDD_TRACEABILITY.md` is also mandatory. Zero-bug repos may omit both files. For repos where TDD cannot execute, tdd-results.json must exist with `verdict: "deferred"` and a `notes` field explaining why.

### 29. Triage-to-BUGS.md Sync

After spec audit triage, every finding confirmed as a code bug must appear in `quality/BUGS.md`. A triage report with confirmed code bugs and no corresponding BUGS.md entries is non-conformant. If BUGS.md does not exist when confirmed bugs exist, it must be created.

### 30. Writeups for All Confirmed Bugs

Every confirmed bug (TDD-verified or confirmed-open) must have a writeup at `quality/writeups/BUG-NNN.md`. For confirmed-open bugs without fix patches, the writeup notes the absence of fix/green-phase evidence. A run with confirmed bugs and no writeups directory is incomplete.

### 31. Phase 4 Triage File Exists

Phase 4 is not complete until a triage file exists at `quality/spec_audits/YYYY-MM-DD-triage.md`. If only auditor reports exist with no triage synthesis, Phase 4 is incomplete.

### 32. Seed Checks Executed Mechanically (Continuation Mode)

When `quality/previous_runs/` exists and Phase 0 runs, verify that `quality/SEED_CHECKS.md` was generated with one entry per unique bug from prior runs. Each seed must have a mechanical verification result (FAIL = bug still present, PASS = bug fixed) obtained by actually running the assertion — not by reading prose from the prior run. If a seed's regression test exists in a prior run, the assertion must be re-executed against the current source tree. A seed marked FAIL without executing the assertion is non-conformant. This benchmark only applies when continuation mode is active (prior runs exist).

### 33. Convergence Status Recorded in PROGRESS.md (Continuation Mode)

When Phase 0 runs, verify that PROGRESS.md contains a `## Convergence` section with: run number, seed count, net-new bug count, and a CONVERGED/NOT CONVERGED verdict. The net-new count must equal the number of bugs in BUGS.md that don't match any seed by file:line. A missing convergence section when `SEED_CHECKS.md` exists is non-conformant. This benchmark only applies when continuation mode is active.

### 34. BUGS.md Always Exists

Every completed run must produce `quality/BUGS.md`. If the run confirmed source-code bugs, BUGS.md must list them. If the run found zero source-code bugs, BUGS.md must contain a `## Summary` with a positive assertion: "No confirmed source-code bugs found" with counts of candidates evaluated and eliminated. A completed run (Phase 5 marked complete) with no BUGS.md is non-conformant. This benchmark exists because in v1.3.22 benchmarking, express completed all phases with zero source bugs but produced no BUGS.md, making it ambiguous whether the file was intentionally omitted or accidentally skipped.

### 35. Immediate Mechanical Integrity Gate (Phase 2a)

If `quality/mechanical/` exists, verify that `bash quality/mechanical/verify.sh` was executed immediately after each `*_cases.txt` was written — before any contract, requirement, or triage artifact cites the extraction. Evidence: `quality/results/mechanical-verify.log` and `quality/results/mechanical-verify.exit` exist, and the exit file contains `0`. If these receipt files are missing or the exit code is non-zero, the mechanical extraction was not verified at the point of creation. This benchmark exists because v1.3.23 deferred verification to Phase 6, allowing downstream artifacts (CONTRACTS.md, REQUIREMENTS.md, triage probes) to build on a forged extraction for the entire run before the mismatch was (not) caught.

### 36. Mechanical Artifacts Not Used as Evidence in Triage Probes

Grep all triage and verification probe files (`quality/spec_audits/*`) for `open('quality/mechanical/` or `cat quality/mechanical/`. If any probe reads a `quality/mechanical/*.txt` file as sole evidence for what a source file contains, it is circular verification and the benchmark fails. Probes must read the source file directly or re-execute the extraction pipeline. This benchmark exists because v1.3.23 Probe C validated the forged mechanical artifact instead of the source code, passing with fabricated data.

### 37. Phase 6 Mechanical Closure Uses Bash (Not Python Substitution)

If `quality/mechanical/` exists, verify that Phase 6 ran `bash quality/mechanical/verify.sh` as a literal shell command — not a Python script reading the artifact file. Evidence: `quality/results/mechanical-verify.log` contains output from the bash script (lines like "OK: ..." or "MISMATCH: ..."), not Python tracebacks or `pathlib` output. PROGRESS.md must include a `## Phase 6 Mechanical Closure` heading with the recorded stdout and exit code. This benchmark exists because v1.3.23 substituted Python `Path.read_text()` for `bash verify.sh`, creating a circular check that passed despite the artifact being fabricated.

### 38. Individual Auditor Report Artifacts Exist

If Phase 4 (spec audit) ran, verify that individual auditor report files exist at `quality/spec_audits/YYYY-MM-DD-auditor-N.md` (one per auditor), not just the triage synthesis. A single triage file without individual reports conflates discovery with reconciliation. This benchmark exists to ensure pre-reconciliation findings are preserved for independent verification.

### 39. BUGS.md Uses Canonical Heading Format

Every confirmed bug in BUGS.md must use the heading level `### BUG-NNN`. Grep for `^### BUG-` and count; grep for other bug heading patterns (`^## BUG-`, `^\*\*BUG-`, `^- BUG-`) and verify zero matches. Inconsistent heading levels cause machine-readable counts to disagree with the document.

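The grep pair translates directly into a scripted check; a minimal sketch:

```python
import re
from pathlib import Path

text = Path("quality/BUGS.md").read_text(encoding="utf-8")
canonical = re.findall(r"^### BUG-\d+", text, flags=re.M)
stray = re.findall(r"^(?:## BUG-|\*\*BUG-|- BUG-)", text, flags=re.M)
print(f"canonical headings: {len(canonical)}; non-canonical: {len(stray)}")
assert not stray, "BUGS.md uses non-canonical bug heading formats"
```
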
### 40. Artifact File-Existence Gate Passed

Before Phase 5 is marked complete, verify that all required artifacts exist as files on disk — not just referenced in PROGRESS.md. Required files: EXPLORATION.md, BUGS.md, REQUIREMENTS.md, QUALITY.md, PROGRESS.md, COVERAGE_MATRIX.md, COMPLETENESS_REPORT.md, CONTRACTS.md, test_functional.* (or language-appropriate alternative: FunctionalSpec.*, FunctionalTest.*, functional.test.*), RUN_CODE_REVIEW.md, RUN_INTEGRATION_TESTS.md, RUN_SPEC_AUDIT.md, RUN_TDD_TESTS.md, and AGENTS.md (at project root). If Phase 3 ran: at least one file in code_reviews/. If Phase 4 ran: at least one auditor file and a triage file in spec_audits/. If Phase 0 or 0b ran: SEED_CHECKS.md as a standalone file. If confirmed bugs exist: tdd-results.json in results/. If any bug has a red-phase result: TDD_TRACEABILITY.md. This benchmark exists because v1.3.24 benchmarking showed express writing a terminal gate section to PROGRESS.md claiming 1 confirmed bug, but BUGS.md, code review files, and spec audit files were never written to disk.

### 41. Sidecar JSON Post-Write Validation
|
||||
|
||||
After `tdd-results.json` and/or `integration-results.json` are written, verify that each file contains all required keys with conformant values. For `tdd-results.json`: required root keys are `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. Each `bugs` entry must have `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. The `summary` must include `confirmed_open`. For `integration-results.json`: required root keys are `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. Both must have `schema_version: "1.1"`. A sidecar JSON with missing required keys, non-standard root keys, or invalid enum values is non-conformant. This benchmark exists because v1.3.25 benchmarking showed 6 of 8 repos with non-conformant sidecar JSON — httpx invented an alternate schema, serde used legacy shape, javalin omitted `summary` and per-bug fields, express used invalid phase values, and others used invalid verdict/result enum values.
|
||||
|
||||
### 42. Script-Verified Closure Gate Passed
|
||||
|
||||
Before Phase 5 is marked complete, `quality_gate.sh` must be executed from the project root and must exit 0. The script's full output must be saved to `quality/results/quality-gate.log`. A Phase 5 completion with no `quality-gate.log` or with a log showing FAIL results is non-conformant. This benchmark exists because v1.3.21–v1.3.25 relied entirely on model self-attestation for artifact conformance checks, and benchmarking showed persistent non-compliance (heading format, sidecar schema, use case identifiers, version stamps) that a script catches mechanically.
### 43. Canonical Use Case Identifiers Present

REQUIREMENTS.md must contain use cases labeled with canonical identifiers in the format `UC-01`, `UC-02`, etc. Grep for `UC-[0-9]` and count the matches. A repo with use case content but no canonical identifiers is non-conformant. This benchmark exists because v1.3.25 benchmarking showed 7 of 8 repos with use case sections but no machine-readable identifiers; downstream tooling cannot count or cross-reference use cases without a canonical format.
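The grep as a sketch; the two-digit `UC-NN` pattern follows the `UC-01` format above, and the REQUIREMENTS.md path is assumed to match the document's `quality/` layout:

```python
import re
from pathlib import Path

# Extract and count canonical use case identifiers, equivalent to the
# grep described above.
def check_uc_identifiers(path: str = "quality/REQUIREMENTS.md") -> bool:
    text = Path(path).read_text(encoding="utf-8")
    ids = sorted(set(re.findall(r"UC-\d{2}", text)))
    if not ids:
        print("FAIL: no canonical UC-NN identifiers found")
        return False
    print(f"OK: {len(ids)} use case(s): {', '.join(ids)}")
    return True
```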
### 44. Regression-Test Patches Exist for Every Confirmed Bug

For every confirmed bug (any BUG-NNN entry in BUGS.md), verify that `quality/patches/BUG-NNN-regression-test.patch` exists. A confirmed bug without a regression-test patch is incomplete: the patch is the strongest independent evidence that the bug exists. Fix patches (`BUG-NNN-fix.patch`) are optional but strongly encouraged for simple fixes. This benchmark exists because v1.3.25 and v1.3.26 benchmarking showed 4 of 8 repos with 0 patch files despite having confirmed bugs, with writeups that described what the fixes should look like instead of generating actual patch files.
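A sketch of the cross-reference, assuming the `### BUG-NNN` heading format from benchmark 39 and the patch path convention above:

```python
import re
from pathlib import Path

# Cross-reference confirmed bug IDs in BUGS.md against regression-test
# patch files; returns the IDs that lack a patch.
def check_regression_patches(
    bugs_path: str = "quality/BUGS.md",
    patch_dir: str = "quality/patches",
) -> list[str]:
    text = Path(bugs_path).read_text(encoding="utf-8")
    bug_ids = re.findall(r"^### (BUG-\d+)", text, flags=re.MULTILINE)
    return [
        bug_id for bug_id in bug_ids
        if not (Path(patch_dir) / f"{bug_id}-regression-test.patch").exists()
    ]
```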
### 45. Writeup Inline Fix Diffs

Every writeup at `quality/writeups/BUG-NNN.md` must contain a `diff`-fenced code block with the proposed fix in unified diff format. This is section 6 ("The fix") of the writeup template. A writeup that says "see patch file" or "no fix patch included" without an inline diff is incomplete: the inline diff is what makes the writeup actionable for a maintainer reading only the writeup, without access to the patch directory. This benchmark exists because v1.3.27 benchmarking showed virtio producing 4 writeups with 0 inline diffs despite having fix patches in `quality/patches/`; the model wrote prose descriptions of the fix instead of pasting the actual diff.
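A sketch of the fence check; it looks only for a literal `diff`-fence opener, which is enough for the template's section 6 requirement:

```python
from pathlib import Path

# Verify each writeup carries an inline diff-fenced block, per the
# writeup template's section 6; returns the writeups missing one.
def check_inline_diffs(writeup_dir: str = "quality/writeups") -> list[str]:
    missing = []
    for writeup in sorted(Path(writeup_dir).glob("BUG-*.md")):
        if "```diff" not in writeup.read_text(encoding="utf-8"):
            missing.append(writeup.name)
    return missing
```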
## Quick Checklist Format

Use this as a final sign-off:
- [ ] Integration test quality gates were written from a Field Reference Table (not memory)
- [ ] Integration tests have specific pass criteria
- [ ] Spec audit prompt is copy-pasteable and uses `[Req: tier — source]` tag format
- [ ] Structured output schemas include all mandatory fields and valid enum values
- [ ] Patch validation gate uses correct commands for the project's build system
- [ ] Every regression test has a skip/xfail guard referencing the bug ID
- [ ] Integration group commands pass pre-flight discovery (dry-run finds tests)
- [ ] Every generated file has a version stamp with the correct version number
- [ ] Enumeration completeness checks show two-list comparisons (not just assertions of coverage)
- [ ] Every TDD-verified bug has a writeup at `quality/writeups/BUG-NNN.md`
- [ ] Triage verification probes include test assertions (not just prose) for confirmations and rejections
- [ ] Enumeration code-side lists include per-item line numbers (not copied from requirements)
- [ ] Dispatch-function contracts cite `quality/mechanical/` artifacts (not hand-written lists)
- [ ] `bash quality/mechanical/verify.sh` passes (artifacts match re-extracted output)
- [ ] Source-inspection regression tests execute (no `run=False` for string-match tests)
- [ ] No executed artifact contradicts any prose artifact at closure (contradiction gate passed)
- [ ] All generated artifact version stamps match the SKILL.md metadata version exactly
- [ ] `quality/mechanical/` is either absent (no dispatch contracts) or contains verify.sh plus all extraction artifacts
- [ ] If BUGS.md has confirmed bugs: tdd-results.json exists (mandatory); TDD_TRACEABILITY.md exists if any bug has a red-phase result
- [ ] Every confirmed bug in triage appears in BUGS.md (triage-to-BUGS.md sync)
- [ ] Every confirmed bug (TDD-verified or confirmed-open) has a writeup at `quality/writeups/BUG-NNN.md`
- [ ] Phase 4 has a triage file at `quality/spec_audits/YYYY-MM-DD-triage.md`
- [ ] (Continuation mode) Seed checks in `SEED_CHECKS.md` were executed mechanically, not inferred from prose
- [ ] Mechanical verification receipt files exist (`mechanical-verify.log` + `mechanical-verify.exit`) when `quality/mechanical/` exists
- [ ] No triage probe reads `quality/mechanical/*.txt` as sole evidence for source code contents
- [ ] Phase 6 mechanical closure used `bash verify.sh` (not Python substitution)
- [ ] Individual auditor reports exist at `quality/spec_audits/*-auditor-N.md` (not just triage)
- [ ] All BUGS.md bug headings use the `### BUG-NNN` format
- [ ] All required artifact files exist on disk before Phase 5 is marked complete (not just referenced in PROGRESS.md)
- [ ] (Continuation mode) PROGRESS.md contains a `## Convergence` section with net-new count and verdict
- [ ] `quality/BUGS.md` exists (zero-bug runs include a summary of candidates evaluated and eliminated)
- [ ] Sidecar JSON files (`tdd-results.json`, `integration-results.json`) contain all required keys with `schema_version: "1.1"`
- [ ] `quality_gate.sh` was executed and exited 0; output saved to `quality/results/quality-gate.log`
- [ ] REQUIREMENTS.md contains canonical use case identifiers (`UC-01`, `UC-02`, etc.)
- [ ] Every confirmed bug has `quality/patches/BUG-NNN-regression-test.patch`
- [ ] Every writeup has an inline fix diff (a `diff`-fenced block in section 6)