Rebuilds branch from upstream/staged (was previously merged from upstream/main, which brought in materialized plugin files that fail Check Plugin Structure on PRs targeting staged). Changes vs. staged: - Update skills/quality-playbook/ to v1.5.6 (31 bundled assets: SKILL.md + LICENSE.txt + 16 references/ + 9 phase_prompts/ + 3 agents/ + bin/citation_verifier.py + quality_gate.py). - Add agents/quality-playbook.agent.md (top-level orchestrator). name: quality-playbook (validator-compliant). - Update docs/README.skills.md quality-playbook row description + bundled-assets list to v1.5.6. - Fix 'unparseable' → 'unparsable' in quality_gate.py (5 instances; codespell preference, both spellings valid). Closes the v1.4.0 → v1.5.6 update in a single clean commit on top of upstream/staged. The preserved backup branch backup-bedbe84-pre-rebuild (SHA bedbe848fa3c0f0eda8e653c42b599a17dd2e354) holds the prior history for reference.
18 KiB
Run-State Schema (v1.5.6)
Authoritative schema for quality/run_state.jsonl, quality/PROGRESS.md, and Calibration Cycles/<cycle>/run_state.jsonl. The playbook AI writes these files directly via the file-tool layer; the orchestrator AI reads them to drive multi-benchmark calibration cycles.
Companion to: docs/design/QPB_v1.5.5_Design.md ("Design — Run-state event taxonomy" section).
File locations and ownership
<benchmark>/quality/run_state.jsonl— per-run event log. Append-only. Written by the AI executing the playbook.<benchmark>/quality/PROGRESS.md— human-readable run status. Atomically rewritten by the AI on each event.Calibration Cycles/<cycle>/run_state.jsonl— cycle-level event log. Append-only. Written by the orchestrator AI.
All three live in the bind-mounted workspace owned by the user. The AI writes via Edit/Write file tools, never via shell redirection or tee (which routes through a different UID layer in some sandbox runtimes).
Schema versioning
Every run_state.jsonl opens with an _index event recording schema_version. Current version: "1.5.6". Schema bumps preserve backward compatibility — older files remain readable by newer parsers. Breaking schema changes bump the major number.
Required fields (every event)
Every event object MUST have:
ts— ISO 8601 UTC timestamp withZsuffix (e.g."2026-05-15T14:32:01Z"). Sub-second precision allowed but not required.event— string, the event-type name. Must match one of the names listed in_index.event_types.
Events MAY have additional fields per their type's spec below. Unknown fields are tolerated by readers (forward-compatible).
Per-run events (<benchmark>/quality/run_state.jsonl)
_index
ALWAYS the first line. Records schema metadata.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | Always "_index" |
ts |
string | yes | ISO 8601 UTC |
schema_version |
string | yes | "1.5.6" |
event_types |
array of string | yes | Every event type this file uses |
benchmark |
string | yes | E.g. "chi-1.3.45", "virtio-1.5.1" |
lever_state |
string | yes | E.g. "pre-pattern7", "post-pattern7", "baseline" |
started_at |
string | yes | ISO 8601 UTC, equals ts of this event |
run_start
Marks the beginning of a playbook run.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "run_start" |
ts |
string | yes | |
runner |
string | yes | One of "claude", "codex", "copilot", "cursor" |
playbook_version |
string | yes | E.g. "1.5.6-pre", "1.5.6" (matches bin.benchmark_lib.RELEASE_VERSION) |
target_path |
string | yes | Relative path to benchmark target |
phase_start
Marks the beginning of one of the six playbook phases.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "phase_start" |
ts |
string | yes | |
phase |
integer | yes | 1, 2, 3, 4, 5, or 6 |
pattern_walked
Phase 1 only. Records that one of the seven exploration patterns was walked.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "pattern_walked" |
ts |
string | yes | |
phase |
integer | yes | Always 1 |
pattern |
integer | yes | 1 through 7 |
findings_count |
integer | yes | Number of findings produced by this pattern |
duration_seconds |
number | optional | Wall-clock for this pattern walk |
pass_started / pass_ended
Phase 4 only. Records start/end of one of the four skill-derivation passes.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "pass_started" or "pass_ended" |
ts |
string | yes | |
phase |
integer | yes | Always 4 |
pass |
string | yes | One of "A", "B", "C", "D" |
output_artifact |
string | optional | Relative path to pass artifact (on pass_ended) |
finding_logged
Records that a finding (skill-divergence, code-bug, etc.) was logged in the current phase.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "finding_logged" |
ts |
string | yes | |
phase |
integer | yes | 1-6 |
finding_id |
string | yes | E.g. "BUG-007", "REQ-042" |
category |
string | yes | E.g. "code-bug", "skill-divergence", "missing-citation", "prose-to-code-mismatch" |
artifact_written
Records that an artifact file was produced/updated.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "artifact_written" |
ts |
string | yes | |
relative_path |
string | yes | Path relative to benchmark target (e.g. "quality/EXPLORATION.md") |
byte_size |
integer | optional | Size of the file at write time |
line_count |
integer | optional | Line count |
gate_check
Records the outcome of a single quality-gate check.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "gate_check" |
ts |
string | yes | |
gate_name |
string | yes | Identifier from quality_gate.py |
verdict |
string | yes | One of "pass", "fail", "warn", "skip" |
reason |
string | optional | Human-readable explanation |
phase_end
Marks the end of a phase. Cross-validated against the phase's expected artifacts before being written (see "Cross-validation rules" below).
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "phase_end" |
ts |
string | yes | |
phase |
integer | yes | 1-6 |
key_counts |
object | yes | Phase-specific counts (see below) |
artifacts_produced |
array of string | yes | Relative paths of artifacts produced this phase |
duration_seconds |
number | optional | Wall-clock for the whole phase |
key_counts per phase:
- Phase 1:
{"findings_total": N, "patterns_walked": M}(M should be 7 for full Phase 1) - Phase 2:
{"findings_promoted": N, "findings_dropped": M} - Phase 3:
{"bugs_identified": N, "bug_writeups": M} - Phase 4:
{"req_count": N, "uc_count": M, "passes_complete": K}(K should be 4) - Phase 5:
{"gate_checks_total": N, "gate_failures": M} - Phase 6:
{"bugs_md_count": N, "gate_verdict": "pass|fail|partial"}
error
Records an error during the run.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "error" |
ts |
string | yes | |
phase |
integer | optional | If error is phase-scoped |
message |
string | yes | Human-readable description |
recoverable |
boolean | yes | If true, the run will retry the affected phase; if false, the run is aborting |
documentation_state
v1.5.6+. Records the documentation-availability state at Phase 1 entry. Currently the only emitted state is "code_only", indicating that reference_docs/ and reference_docs/cite/ carry no recognized plaintext content (.md or .txt) and Phase 1 is proceeding in code-only mode (see references/code-only-mode.md). A "with_docs" value is reserved for future explicit emission; today the absence of a documentation_state event implies docs were present.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "documentation_state" |
ts |
string | yes | |
state |
string | yes | Currently "code_only". Future values may include "with_docs". |
reason |
string | yes | Free-form (e.g. "reference_docs/ empty") |
When documentation_state state="code_only" is emitted, the playbook also prepends a "Documentation status: code-only mode" section to quality/EXPLORATION.md and adds a "Documentation state: code_only" line to quality/PROGRESS.md so the downgrade is visible to anyone reading either artifact. New runs adding the documentation_state event must include it in the _index.event_types list.
aborted_missing_docs
v1.5.6+. Records that the run aborted at Phase 1 entry because --require-docs was set and reference_docs/ was empty. Mutually exclusive with documentation_state state="code_only" for the same Phase 1 entry — --require-docs is the opt-IN abort path; the absence of the flag preserves the documented code-only-mode downgrade. After this event the runner returns non-zero without invoking any LLM work, so no phase_start phase=1 is recorded.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "aborted_missing_docs" |
ts |
string | yes | |
reason |
string | yes | Free-form (e.g. "reference_docs/ empty and --require-docs set") |
When aborted_missing_docs is emitted, the playbook also writes an ERROR: aborted_missing_docs — <reason> block to quality/PROGRESS.md so the abort is visible without reading the JSONL. New runs that pass --require-docs against an empty reference_docs/ must include aborted_missing_docs in the _index.event_types list.
run_end
Marks the end of the playbook run.
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "run_end" |
ts |
string | yes | |
status |
string | yes | One of "success", "aborted", "failed" |
total_findings |
integer | optional | Sum across all phases |
final_verdict |
string | optional | The Phase 6 gate verdict |
Cycle-level events (Calibration Cycles/<cycle>/run_state.jsonl)
_index (cycle-level)
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "_index" |
ts |
string | yes | |
schema_version |
string | yes | "1.5.6" |
event_types |
array of string | yes | |
cycle_name |
string | yes | E.g. "2026-05-15-pattern7-displacement-recovery" |
lever_under_test |
string | yes | E.g. "lever-1-exploration-breadth-depth" |
benchmarks |
array of string | yes | Cycle's pinned benchmark list |
iteration |
integer | yes | Iteration ordinal (1, 2, or 3 — see iterate-cap) |
cycle_start
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "cycle_start" |
ts |
string | yes | |
hypothesis |
string | yes | The cycle's testable hypothesis |
noise_floor_threshold |
number | yes | Recall delta below this is treated as noise (default 0.05) |
benchmark_start
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "benchmark_start" |
ts |
string | yes | |
benchmark |
string | yes | |
lever_state |
string | yes | "pre-lever" or "post-lever" |
lever_change_applied
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "lever_change_applied" |
ts |
string | yes | |
lever_id |
string | yes | E.g. "lever-1-exploration-breadth-depth" |
files_changed |
array of string | yes | Paths relative to QPB repo root |
commit_sha |
string | yes | Commit SHA on the implementing branch |
description |
string | yes | What the change is (e.g. "Pattern 7 budget cap 3-5 → 2-3") |
lever_change_reverted
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "lever_change_reverted" |
ts |
string | yes | |
files_changed |
array of string | yes | |
commit_sha |
string | optional | Null/absent if revert is uncommitted |
benchmark_end
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "benchmark_end" |
ts |
string | yes | |
benchmark |
string | yes | |
lever_state |
string | yes | |
recall |
number | yes | 0.0-1.0 |
bugs_found |
array of string | yes | Bug IDs found this run |
bugs_missed |
array of string | yes | Bug IDs in baseline missed this run |
historical_baseline_path |
string | yes | Path to the baseline BUGS.md used for recall computation |
cycle_end
| Field | Type | Required | Notes |
|---|---|---|---|
event |
string | yes | "cycle_end" |
ts |
string | yes | |
verdict |
string | yes | One of "ship", "revert", "iterate", "halt-iterate-cap" |
recall_before |
object | yes | Per-benchmark recall before lever change |
recall_after |
object | yes | Per-benchmark recall after lever change |
delta |
object | yes | Per-benchmark delta (recall_after - recall_before) |
cross_benchmark_check |
object | yes | {"clean": bool, "regressions": [list of bench/bug pairs that regressed]} |
Cross-validation rules (per phase_end)
The AI verifies these conditions before appending a phase_end event. If any check fails, the AI appends an error event with recoverable: true and re-runs the failing phase.
| Phase | Required conditions |
|---|---|
| 1 | quality/EXPLORATION.md exists, ≥ 120 lines (aligned with the Phase 2 startup gate in bin/run_playbook.check_phase_gate), contains at least one finding section (regex ^##\s+(Finding|Open Exploration Findings|\d+\.) — accepts ## Finding ..., the SKILL-prescribed exact heading ## Open Exploration Findings, and numbered ## N. headings) |
| 2 | All nine fixed-name Generate-contract artifacts exist non-empty under quality/: REQUIREMENTS.md, QUALITY.md, CONTRACTS.md, COVERAGE_MATRIX.md, COMPLETENESS_REPORT.md, RUN_CODE_REVIEW.md, RUN_INTEGRATION_TESTS.md, RUN_SPEC_AUDIT.md, RUN_TDD_TESTS.md. Plus at least one non-empty quality/test_functional.<ext> (extension varies by primary language). Pre-v1.5.6 this row described the v1.5.5-design triage model (EXPLORATION_MERGED.md / triage.md); that mapping was never adopted by shipped SKILL.md / orchestrator_protocol.md / agent files, which always documented Phase 2 as Generate. |
| 3 | quality/code_reviews/ directory contains at least one review file. If quality/BUGS.md has any ### BUG- heading, quality/patches/ contains at least one BUG-*-regression-test.patch file. Pre-v1.5.6 this row checked quality/RUN_CODE_REVIEW.md (a Phase 2 Generate output, not a Phase 3 review result) — same v1.5.5-design / shipped-Generate drift class as the Phase 2 row. Cluster B reconciled. |
| 4 | quality/spec_audits/ directory contains at least one *-triage.md file AND at least one *-auditor-*.md file (per orchestrator_protocol.md naming convention). When neither name pattern matches, the validator falls back to a weaker "≥2 files" check — older bootstrap runs with arbitrary .md names still pass; the gate at Phase 6 enforces deeper conformance. Pre-v1.5.6 this row checked quality/REQUIREMENTS.md + COVERAGE_MATRIX.md (Phase 2 outputs) — same v1.5.5-design drift class. Cluster B reconciled. |
| 5 | If quality/BUGS.md has confirmed ### BUG- entries: quality/results/tdd-results.json exists non-empty; for every confirmed bug, quality/writeups/BUG-NNN.md exists AND quality/results/BUG-NNN.red.log exists. With no confirmed bugs the row is vacuously satisfied. Pre-v1.5.6 this row checked quality/results/quality-gate.log (a Phase 6 output) — same v1.5.5-design drift class. Cluster B reconciled. |
| 6 | quality/results/quality-gate.log exists non-empty AND quality/PROGRESS.md contains a Terminal Gate Verification section (the orchestrator-protocol marker that Phase 6 ran the script-verified gate to completion). Pre-v1.5.6 this row checked quality/BUGS.md + quality/INDEX.md — BUGS.md is a Phase 3 output, INDEX.md was never adopted in the shipped contract. Same v1.5.5-design drift class. Cluster B reconciled. |
The run_end event additionally requires: all 6 phase_end events present in the log; the final BUGS.md count matches phase_end phase=6 key_counts.bugs_md_count.
Resume semantics
When an AI session starts on a run directory:
- If
quality/run_state.jsonldoes not exist: fresh run. Write_index+run_start+phase_start phase=1. - If it exists: read all events. Find the last
phase_startnot followed by a matchingphase_end. Call it the "in-progress phase". - Verify the in-progress phase's expected artifacts (per cross-validation rules above):
- If artifacts complete: append the missing
phase_endevent and proceed to the next phase. Note: this is the "session crashed mid-phase but the work is done" recovery path. - If artifacts incomplete: re-run that phase from scratch. The prior session left a partial state that can't be safely resumed.
- If artifacts complete: append the missing
- If all 6
phase_endevents are present but norun_end: appendrun_end status=successand finalize.
The policy is "trust artifacts more than events." If events claim phase 4 done but REQUIREMENTS.md doesn't exist, the AI re-runs phase 4. If events stop mid-phase but artifacts are complete, the AI catches up the events.
PROGRESS.md format
Atomically rewritten on every event. Markdown.
# QPB Run Progress
**Started:** 2026-05-15T14:32:01Z **Benchmark:** chi-1.5.1 **Lever:** post-pattern7
**Runner:** claude **Playbook version:** 1.5.6
## Phases
- [x] Phase 1 — Explore (10:10, 12 findings, patterns 1-7 walked)
- [x] Phase 2 — Generate (0:42, 9 artifacts produced)
- [x] Phase 3 — Code Review (15:31, 6 bugs identified)
- [x] Phase 4 — Spec Audit (3 auditors, 1 triage)
- [ ] Phase 5 — Reconciliation *(in progress, started 14:58:31Z)*
- [ ] Phase 6 — Verify
## Recent events (last 10)
- 2026-05-15T14:58:31Z — phase_start phase=5
- 2026-05-15T14:58:30Z — phase_end phase=4 passes=[A,B,C,D] req_count=89
- 2026-05-15T14:42:11Z — phase_end phase=1 findings=12
## Artifacts produced
- quality/EXPLORATION.md (12,034 bytes)
- quality/REQUIREMENTS.md (28,891 bytes)
- quality/COVERAGE_MATRIX.md (3,022 bytes)
Sections (header, phase checklist, recent events, artifacts produced) are required. Phase checklist uses [x] for complete phases (with summary stats), [ ] for incomplete, with in-progress phase noted explicitly with start time. Recent events shows last 10 event lines from run_state.jsonl in human-readable form. Artifacts produced shows files written this run with byte sizes.
Format invariants (enforced by bin/run_state_lib.py validators)
_indexis line 1.- Every line is valid JSON (one object per line).
- Every event has
tsandeventfields. - Every
eventvalue appears in_index.event_types. - Append-only: events are added, never edited. Editing a prior event is a schema violation.
phase_startandphase_endevents for a given phase appear at most once per run (no out-of-order or duplicate phase markers).run_startis the second line (after_index);run_endis the last line if the run completed.
Validators are read-only checks. They surface violations as findings; they don't auto-correct.