mirror of https://github.com/github/awesome-copilot.git synced 2026-05-15 11:11:48 +00:00

Andrew Stellman b8441d218b Update quality-playbook skill to v1.5.6 + add agent (#1402)

Rebuilds branch from upstream/staged (was previously merged from
upstream/main, which brought in materialized plugin files that
fail Check Plugin Structure on PRs targeting staged).

Changes vs. staged:
- Update skills/quality-playbook/ to v1.5.6 (31 bundled assets:
  SKILL.md + LICENSE.txt + 16 references/ + 9 phase_prompts/ +
  3 agents/ + bin/citation_verifier.py + quality_gate.py).
- Add agents/quality-playbook.agent.md (top-level orchestrator).
  name: quality-playbook (validator-compliant).
- Update docs/README.skills.md quality-playbook row description
  + bundled-assets list to v1.5.6.
- Fix 'unparseable' → 'unparsable' in quality_gate.py (5 instances;
  codespell preference, both spellings valid).

Closes the v1.4.0 → v1.5.6 update in a single clean commit on top of
upstream/staged. The preserved backup branch backup-bedbe84-pre-rebuild
(SHA bedbe848fa3c0f0eda8e653c42b599a17dd2e354) holds the prior history for reference.

2026-05-11 11:31:53 +10:00

Run-State Schema (v1.5.6)

Authoritative schema for quality/run_state.jsonl, quality/PROGRESS.md, and Calibration Cycles/<cycle>/run_state.jsonl. The playbook AI writes these files directly via the file-tool layer; the orchestrator AI reads them to drive multi-benchmark calibration cycles.

Companion to: docs/design/QPB_v1.5.5_Design.md ("Design — Run-state event taxonomy" section).

File locations and ownership

<benchmark>/quality/run_state.jsonl — per-run event log. Append-only. Written by the AI executing the playbook.
<benchmark>/quality/PROGRESS.md — human-readable run status. Atomically rewritten by the AI on each event.
Calibration Cycles/<cycle>/run_state.jsonl — cycle-level event log. Append-only. Written by the orchestrator AI.

All three live in the bind-mounted workspace owned by the user. The AI writes via Edit/Write file tools, never via shell redirection or tee, which route through a different UID layer in some sandbox runtimes.

Schema versioning

Every run_state.jsonl opens with an _index event recording schema_version. Current version: "1.5.6". Non-breaking schema bumps preserve backward compatibility, so older files remain readable by newer parsers. Breaking schema changes bump the major number.
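A reader can apply the versioning rule before parsing the rest of the file. The sketch below is a minimal illustration, assuming schema_version is always three dot-separated integers; `can_read` is a hypothetical helper name, not part of the shipped tooling.

```python
def schema_major(version: str) -> int:
    """Return the major component of a 'major.minor.patch' version string."""
    major, _minor, _patch = (int(part) for part in version.split("."))
    return major


def can_read(file_version: str, parser_version: str) -> bool:
    """Hypothetical check: only a major-number change is treated as breaking."""
    return schema_major(file_version) == schema_major(parser_version)
```

Under this reading, a parser built for "1.5.6" accepts a file stamped "1.4.0" and rejects one stamped "2.0.0".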

Required fields (every event)

Every event object MUST have:

ts — ISO 8601 UTC timestamp with Z suffix (e.g. "2026-05-15T14:32:01Z"). Sub-second precision allowed but not required.
event — string, the event-type name. Must match one of the names listed in _index.event_types.

Events MAY have additional fields per their type's spec below. Unknown fields are tolerated by readers (forward-compatible).
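The two required fields can be checked with a few lines of code. The following is a minimal sketch, not the shipped validator: `make_event` and `check_required_fields` are hypothetical names, and the timestamp pattern is one possible reading of "ISO 8601 UTC with Z suffix, sub-second precision optional".

```python
import re
from datetime import datetime, timezone

# ISO 8601 UTC with a trailing Z; fractional seconds are optional.
TS_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")


def make_event(event_type: str, **fields) -> dict:
    """Build an event dict stamped with the current UTC time (hypothetical helper)."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"ts": ts, "event": event_type, **fields}


def check_required_fields(event: dict, declared_types: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the event passes."""
    problems = []
    ts = event.get("ts")
    if not isinstance(ts, str) or not TS_PATTERN.match(ts):
        problems.append("ts missing or not ISO 8601 UTC with Z suffix")
    name = event.get("event")
    if not isinstance(name, str):
        problems.append("event missing or not a string")
    elif name not in declared_types:
        problems.append(f"event type {name!r} not listed in _index.event_types")
    # Unknown extra fields are deliberately ignored (forward-compatible).
    return problems
```

For example, `check_required_fields(make_event("phase_start", phase=1), ["_index", "phase_start"])` returns an empty list.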

Per-run events (`<benchmark>/quality/run_state.jsonl`)

`_index`

ALWAYS the first line. Records schema metadata.

Field	Type	Required	Notes
`event`	string	yes	Always `"_index"`
`ts`	string	yes	ISO 8601 UTC
`schema_version`	string	yes	`"1.5.6"`
`event_types`	array of string	yes	Every event type this file uses
`benchmark`	string	yes	E.g. `"chi-1.3.45"`, `"virtio-1.5.1"`
`lever_state`	string	yes	E.g. `"pre-pattern7"`, `"post-pattern7"`, `"baseline"`
`started_at`	string	yes	ISO 8601 UTC, equals `ts` of this event

`run_start`

Marks the beginning of a playbook run.

Field	Type	Required	Notes
`event`	string	yes	`"run_start"`
`ts`	string	yes
`runner`	string	yes	One of `"claude"`, `"codex"`, `"copilot"`, `"cursor"`
`playbook_version`	string	yes	E.g. `"1.5.6-pre"`, `"1.5.6"` (matches `bin.benchmark_lib.RELEASE_VERSION`)
`target_path`	string	yes	Relative path to benchmark target

`phase_start`

Marks the beginning of one of the six playbook phases.

Field	Type	Required	Notes
`event`	string	yes	`"phase_start"`
`ts`	string	yes
`phase`	integer	yes	1, 2, 3, 4, 5, or 6
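Taken together, `_index`, `run_start`, and `phase_start` are the events a fresh run opens with. The sketch below serializes those three lines from the field tables above; `fresh_run_lines` is a hypothetical helper, and a real run would declare every event type it later emits in `event_types`, not just these three.

```python
import json


def fresh_run_lines(ts, benchmark, lever_state, runner, playbook_version, target_path):
    """Return the first three JSONL lines of a new run_state.jsonl (illustrative only)."""
    # Assumption: only the opening event types are declared here for brevity.
    event_types = ["_index", "run_start", "phase_start"]
    events = [
        {
            "event": "_index",
            "ts": ts,
            "schema_version": "1.5.6",
            "event_types": event_types,
            "benchmark": benchmark,
            "lever_state": lever_state,
            "started_at": ts,  # equals ts of the _index event
        },
        {
            "event": "run_start",
            "ts": ts,
            "runner": runner,
            "playbook_version": playbook_version,
            "target_path": target_path,
        },
        {"event": "phase_start", "ts": ts, "phase": 1},
    ]
    # One JSON object per line, in order.
    return [json.dumps(event) for event in events]
```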

`pattern_walked`

Phase 1 only. Records that one of the seven exploration patterns was walked.

Field	Type	Required	Notes
`event`	string	yes	`"pattern_walked"`
`ts`	string	yes
`phase`	integer	yes	Always 1
`pattern`	integer	yes	1 through 7
`findings_count`	integer	yes	Number of findings produced by this pattern
`duration_seconds`	number	optional	Wall-clock for this pattern walk

`pass_started` / `pass_ended`

Phase 4 only. Records start/end of one of the four skill-derivation passes.

Field	Type	Required	Notes
`event`	string	yes	`"pass_started"` or `"pass_ended"`
`ts`	string	yes
`phase`	integer	yes	Always 4
`pass`	string	yes	One of `"A"`, `"B"`, `"C"`, `"D"`
`output_artifact`	string	optional	Relative path to pass artifact (on `pass_ended`)

`finding_logged`

Records that a finding (skill-divergence, code-bug, etc.) was logged in the current phase.

Field	Type	Required	Notes
`event`	string	yes	`"finding_logged"`
`ts`	string	yes
`phase`	integer	yes	1-6
`finding_id`	string	yes	E.g. `"BUG-007"`, `"REQ-042"`
`category`	string	yes	E.g. `"code-bug"`, `"skill-divergence"`, `"missing-citation"`, `"prose-to-code-mismatch"`

`artifact_written`

Records that an artifact file was produced/updated.

Field	Type	Required	Notes
`event`	string	yes	`"artifact_written"`
`ts`	string	yes
`relative_path`	string	yes	Path relative to benchmark target (e.g. `"quality/EXPLORATION.md"`)
`byte_size`	integer	optional	Size of the file at write time
`line_count`	integer	optional	Line count

`gate_check`

Records the outcome of a single quality-gate check.

Field	Type	Required	Notes
`event`	string	yes	`"gate_check"`
`ts`	string	yes
`gate_name`	string	yes	Identifier from `quality_gate.py`
`verdict`	string	yes	One of `"pass"`, `"fail"`, `"warn"`, `"skip"`
`reason`	string	optional	Human-readable explanation

`phase_end`

Marks the end of a phase. Cross-validated against the phase's expected artifacts before being written (see "Cross-validation rules" below).

Field	Type	Required	Notes
`event`	string	yes	`"phase_end"`
`ts`	string	yes
`phase`	integer	yes	1-6
`key_counts`	object	yes	Phase-specific counts (see below)
`artifacts_produced`	array of string	yes	Relative paths of artifacts produced this phase
`duration_seconds`	number	optional	Wall-clock for the whole phase

key_counts per phase:

Phase 1: {"findings_total": N, "patterns_walked": M} (M should be 7 for full Phase 1)
Phase 2: {"findings_promoted": N, "findings_dropped": M}
Phase 3: {"bugs_identified": N, "bug_writeups": M}
Phase 4: {"req_count": N, "uc_count": M, "passes_complete": K} (K should be 4)
Phase 5: {"gate_checks_total": N, "gate_failures": M}
Phase 6: {"bugs_md_count": N, "gate_verdict": "pass|fail|partial"}
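As an illustration of how `key_counts` might be filled in, the sketch below derives the Phase 1 counts from the `pattern_walked` events already in the log. It assumes that `findings_total` is the sum of each walk's `findings_count` and that `patterns_walked` counts distinct patterns; the schema does not state either rule, and both function names are hypothetical.

```python
def phase1_key_counts(events: list[dict]) -> dict:
    """Aggregate pattern_walked events into the Phase 1 key_counts object."""
    walks = [e for e in events if e.get("event") == "pattern_walked"]
    return {
        # Assumption: total findings is the sum across pattern walks.
        "findings_total": sum(e["findings_count"] for e in walks),
        # Assumption: a pattern walked twice still counts once.
        "patterns_walked": len({e["pattern"] for e in walks}),
    }


def phase1_end_event(events: list[dict], ts: str, artifacts: list[str]) -> dict:
    """Build a phase_end event for Phase 1 with the required fields."""
    return {
        "event": "phase_end",
        "ts": ts,
        "phase": 1,
        "key_counts": phase1_key_counts(events),
        "artifacts_produced": list(artifacts),
    }
```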

`error`

Records an error during the run.

Field	Type	Required	Notes
`event`	string	yes	`"error"`
`ts`	string	yes
`phase`	integer	optional	If error is phase-scoped
`message`	string	yes	Human-readable description
`recoverable`	boolean	yes	If true, the run will retry the affected phase; if false, the run is aborting

`documentation_state`

v1.5.6+. Records the documentation-availability state at Phase 1 entry. Currently the only emitted state is "code_only", indicating that reference_docs/ and reference_docs/cite/ carry no recognized plaintext content (.md or .txt) and Phase 1 is proceeding in code-only mode (see references/code-only-mode.md). A "with_docs" value is reserved for future explicit emission; today the absence of a documentation_state event implies docs were present.

Field	Type	Required	Notes
`event`	string	yes	`"documentation_state"`
`ts`	string	yes
`state`	string	yes	Currently `"code_only"`. Future values may include `"with_docs"`.
`reason`	string	yes	Free-form (e.g. `"reference_docs/ empty"`)

When documentation_state state="code_only" is emitted, the playbook also prepends a "Documentation status: code-only mode" section to quality/EXPLORATION.md and adds a "Documentation state: code_only" line to quality/PROGRESS.md so the downgrade is visible to anyone reading either artifact. New runs adding the documentation_state event must include it in the _index.event_types list.

`aborted_missing_docs`

v1.5.6+. Records that the run aborted at Phase 1 entry because --require-docs was set and reference_docs/ was empty. Mutually exclusive with documentation_state state="code_only" for the same Phase 1 entry — --require-docs is the opt-IN abort path; the absence of the flag preserves the documented code-only-mode downgrade. After this event the runner returns non-zero without invoking any LLM work, so no phase_start phase=1 is recorded.

Field	Type	Required	Notes
`event`	string	yes	`"aborted_missing_docs"`
`ts`	string	yes
`reason`	string	yes	Free-form (e.g. `"reference_docs/ empty and --require-docs set"`)

When aborted_missing_docs is emitted, the playbook also writes an ERROR: aborted_missing_docs — <reason> block to quality/PROGRESS.md so the abort is visible without reading the JSONL. New runs that pass --require-docs against an empty reference_docs/ must include aborted_missing_docs in the _index.event_types list.

`run_end`

Marks the end of the playbook run.

Field	Type	Required	Notes
`event`	string	yes	`"run_end"`
`ts`	string	yes
`status`	string	yes	One of `"success"`, `"aborted"`, `"failed"`
`total_findings`	integer	optional	Sum across all phases
`final_verdict`	string	optional	The Phase 6 gate verdict

Cycle-level events (`Calibration Cycles/<cycle>/run_state.jsonl`)

`_index` (cycle-level)

Field	Type	Required	Notes
`event`	string	yes	`"_index"`
`ts`	string	yes
`schema_version`	string	yes	`"1.5.6"`
`event_types`	array of string	yes
`cycle_name`	string	yes	E.g. `"2026-05-15-pattern7-displacement-recovery"`
`lever_under_test`	string	yes	E.g. `"lever-1-exploration-breadth-depth"`
`benchmarks`	array of string	yes	Cycle's pinned benchmark list
`iteration`	integer	yes	Iteration ordinal (1, 2, or 3 — see iterate-cap)

`cycle_start`

Field	Type	Required	Notes
`event`	string	yes	`"cycle_start"`
`ts`	string	yes
`hypothesis`	string	yes	The cycle's testable hypothesis
`noise_floor_threshold`	number	yes	Recall delta below this is treated as noise (default 0.05)

`benchmark_start`

Field	Type	Required	Notes
`event`	string	yes	`"benchmark_start"`
`ts`	string	yes
`benchmark`	string	yes
`lever_state`	string	yes	`"pre-lever"` or `"post-lever"`

`lever_change_applied`

Field	Type	Required	Notes
`event`	string	yes	`"lever_change_applied"`
`ts`	string	yes
`lever_id`	string	yes	E.g. `"lever-1-exploration-breadth-depth"`
`files_changed`	array of string	yes	Paths relative to QPB repo root
`commit_sha`	string	yes	Commit SHA on the implementing branch
`description`	string	yes	What the change is (e.g. `"Pattern 7 budget cap 3-5 → 2-3"`)

`lever_change_reverted`

Field	Type	Required	Notes
`event`	string	yes	`"lever_change_reverted"`
`ts`	string	yes
`files_changed`	array of string	yes
`commit_sha`	string	optional	Null/absent if revert is uncommitted

`benchmark_end`

Field	Type	Required	Notes
`event`	string	yes	`"benchmark_end"`
`ts`	string	yes
`benchmark`	string	yes
`lever_state`	string	yes
`recall`	number	yes	0.0-1.0
`bugs_found`	array of string	yes	Bug IDs found this run
`bugs_missed`	array of string	yes	Bug IDs in baseline missed this run
`historical_baseline_path`	string	yes	Path to the baseline BUGS.md used for recall computation
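The schema does not spell out the recall formula. A minimal sketch, assuming recall is the fraction of baseline bug IDs that the run rediscovered, could look like this; `recall_summary` is a hypothetical helper, and extracting the IDs from the baseline BUGS.md is left out.

```python
def recall_summary(baseline_ids: list[str], found_ids: list[str]) -> dict:
    """Compute recall and the found/missed lists for a benchmark_end event."""
    baseline = set(baseline_ids)
    if not baseline:
        # Assumption: an empty baseline is an error rather than recall 0 or 1.
        raise ValueError("baseline has no bug IDs; recall is undefined")
    found = set(found_ids)
    return {
        # Assumed formula: |baseline ∩ found| / |baseline|
        "recall": len(baseline & found) / len(baseline),
        "bugs_found": sorted(found),
        "bugs_missed": sorted(baseline - found),
    }
```

With a four-bug baseline of which two are rediscovered, the sketch reports a recall of 0.5 and lists the other two IDs under `bugs_missed`.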

`cycle_end`

Field	Type	Required	Notes
`event`	string	yes	`"cycle_end"`
`ts`	string	yes
`verdict`	string	yes	One of `"ship"`, `"revert"`, `"iterate"`, `"halt-iterate-cap"`
`recall_before`	object	yes	Per-benchmark recall before lever change
`recall_after`	object	yes	Per-benchmark recall after lever change
`delta`	object	yes	Per-benchmark delta (recall_after - recall_before)
`cross_benchmark_check`	object	yes	`{"clean": bool, "regressions": [list of bench/bug pairs that regressed]}`
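The `delta` object is a per-benchmark subtraction, and `noise_floor_threshold` from `cycle_start` decides which deltas count as signal. The sketch below shows both steps; comparing the absolute value of the delta against the threshold is an assumption, as are the function names and the rounding to six decimal places.

```python
def recall_delta(recall_before: dict, recall_after: dict) -> dict:
    """Per-benchmark delta (recall_after - recall_before) for shared benchmarks."""
    return {
        bench: round(recall_after[bench] - recall_before[bench], 6)
        for bench in recall_before
        if bench in recall_after
    }


def within_noise(delta: dict, threshold: float = 0.05) -> dict:
    """Flag deltas whose magnitude falls below the noise floor (assumed rule)."""
    return {bench: abs(value) < threshold for bench, value in delta.items()}
```

For instance, a move from 0.80 to 0.82 on one benchmark is flagged as noise at the default threshold, while a move from 0.50 to 0.75 is not.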

Cross-validation rules (per `phase_end`)

The AI verifies these conditions before appending a phase_end event. If any check fails, the AI appends an error event with recoverable: true and re-runs the failing phase.

Phase	Required conditions
1	`quality/EXPLORATION.md` exists, ≥ 120 lines (aligned with the Phase 2 startup gate in `bin/run_playbook.check_phase_gate`), contains at least one finding section (regex `^##\s+(Finding|Open Exploration Findings|\d+\.)`, which accepts `## Finding ...`, the SKILL-prescribed exact heading `## Open Exploration Findings`, and numbered `## N.` headings)
2	All nine fixed-name Generate-contract artifacts exist non-empty under `quality/`: `REQUIREMENTS.md`, `QUALITY.md`, `CONTRACTS.md`, `COVERAGE_MATRIX.md`, `COMPLETENESS_REPORT.md`, `RUN_CODE_REVIEW.md`, `RUN_INTEGRATION_TESTS.md`, `RUN_SPEC_AUDIT.md`, `RUN_TDD_TESTS.md`. Plus at least one non-empty `quality/test_functional.<ext>` (extension varies by primary language). Pre-v1.5.6 this row described the v1.5.5-design triage model (`EXPLORATION_MERGED.md` / `triage.md`); that mapping was never adopted by shipped SKILL.md / orchestrator_protocol.md / agent files, which always documented Phase 2 as Generate.
3	`quality/code_reviews/` directory contains at least one review file. If `quality/BUGS.md` has any `### BUG-` heading, `quality/patches/` contains at least one `BUG-*-regression-test.patch` file. Pre-v1.5.6 this row checked `quality/RUN_CODE_REVIEW.md` (a Phase 2 Generate output, not a Phase 3 review result) — same v1.5.5-design / shipped-Generate drift class as the Phase 2 row. Cluster B reconciled.
4	`quality/spec_audits/` directory contains at least one `-triage.md` file AND at least one `-auditor-*.md` file (per orchestrator_protocol.md naming convention). When neither name pattern matches, the validator falls back to a weaker "≥2 files" check — older bootstrap runs with arbitrary `.md` names still pass; the gate at Phase 6 enforces deeper conformance. Pre-v1.5.6 this row checked `quality/REQUIREMENTS.md` + `COVERAGE_MATRIX.md` (Phase 2 outputs) — same v1.5.5-design drift class. Cluster B reconciled.
5	If `quality/BUGS.md` has confirmed `### BUG-` entries: `quality/results/tdd-results.json` exists non-empty; for every confirmed bug, `quality/writeups/BUG-NNN.md` exists AND `quality/results/BUG-NNN.red.log` exists. With no confirmed bugs the row is vacuously satisfied. Pre-v1.5.6 this row checked `quality/results/quality-gate.log` (a Phase 6 output) — same v1.5.5-design drift class. Cluster B reconciled.
6	`quality/results/quality-gate.log` exists non-empty AND `quality/PROGRESS.md` contains a `Terminal Gate Verification` section (the orchestrator-protocol marker that Phase 6 ran the script-verified gate to completion). Pre-v1.5.6 this row checked `quality/BUGS.md` + `quality/INDEX.md` — BUGS.md is a Phase 3 output, INDEX.md was never adopted in the shipped contract. Same v1.5.5-design drift class. Cluster B reconciled.
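The Phase 1 row is the most mechanical of the six and can be sketched directly. The code below is an illustration rather than the shipped check: the function names are hypothetical, and the regex is the one from the table written as plain alternation, applied per line.

```python
import re
from pathlib import Path

# Heading pattern from the Phase 1 row, matched at the start of any line.
FINDING_HEADING = re.compile(
    r"^##\s+(Finding|Open Exploration Findings|\d+\.)", re.MULTILINE
)
MIN_LINES = 120


def exploration_problems(text: str) -> list[str]:
    """Check EXPLORATION.md content against the Phase 1 conditions."""
    problems = []
    line_count = len(text.splitlines())
    if line_count < MIN_LINES:
        problems.append(f"only {line_count} lines, need at least {MIN_LINES}")
    if not FINDING_HEADING.search(text):
        problems.append("no finding section heading found")
    return problems


def check_phase1(quality_dir: Path) -> list[str]:
    """Return the list of unmet Phase 1 conditions for a quality/ directory."""
    path = quality_dir / "EXPLORATION.md"
    if not path.is_file():
        return ["quality/EXPLORATION.md does not exist"]
    return exploration_problems(path.read_text(encoding="utf-8"))
```

An empty result means a `phase_end phase=1` event may be appended; otherwise the rules above call for an `error` event with `recoverable: true`.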

The run_end event additionally requires: all 6 phase_end events present in the log; the final BUGS.md count matches phase_end phase=6 key_counts.bugs_md_count.
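The two `run_end` preconditions can be sketched in the same style. This example assumes the "final BUGS.md count" is the number of `### BUG-` headings in quality/BUGS.md, which matches how the Phase 3 and Phase 5 rows identify bugs but is not stated outright; `run_end_problems` is a hypothetical name.

```python
import re

BUG_HEADING = re.compile(r"^### BUG-", re.MULTILINE)


def run_end_problems(events: list[dict], bugs_md_text: str) -> list[str]:
    """Return unmet run_end preconditions; an empty list means run_end may be written."""
    problems = []
    ended = {e.get("phase") for e in events if e.get("event") == "phase_end"}
    missing = sorted(set(range(1, 7)) - ended)
    if missing:
        problems.append(f"missing phase_end for phases {missing}")
    phase6 = next(
        (e for e in events if e.get("event") == "phase_end" and e.get("phase") == 6),
        None,
    )
    if phase6 is not None:
        recorded = phase6.get("key_counts", {}).get("bugs_md_count")
        # Assumption: one '### BUG-' heading per bug in BUGS.md.
        actual = len(BUG_HEADING.findall(bugs_md_text))
        if recorded != actual:
            problems.append(f"bugs_md_count is {recorded} but BUGS.md has {actual}")
    return problems
```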

Resume semantics

When an AI session starts on a run directory:

1. If quality/run_state.jsonl does not exist: fresh run. Write _index + run_start + phase_start phase=1.
2. If it exists: read all events. Find the last phase_start not followed by a matching phase_end. Call it the "in-progress phase".
3. Verify the in-progress phase's expected artifacts (per cross-validation rules above):
   - If artifacts complete: append the missing phase_end event and proceed to the next phase. Note: this is the "session crashed mid-phase but the work is done" recovery path.
   - If artifacts incomplete: re-run that phase from scratch. The prior session left a partial state that can't be safely resumed.
4. If all 6 phase_end events are present but no run_end: append run_end status=success and finalize.

The policy is "trust artifacts more than events." If events claim phase 2 is done but REQUIREMENTS.md doesn't exist, the AI re-runs phase 2. If events stop mid-phase but the artifacts are complete, the AI catches up the events.
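The resume decision can be expressed as a small pure function over the parsed events. The sketch below is one possible reading of the steps above: the action names are invented for illustration, `artifacts_complete` stands in for the cross-validation check of a given phase, and the final `"unspecified"` branch covers states the text does not address.

```python
def in_progress_phase(events: list[dict]):
    """Return the phase of the last phase_start without a phase_end, else None."""
    # Relies on the invariant that each phase marker appears at most once.
    ended = {e["phase"] for e in events if e.get("event") == "phase_end"}
    for event in reversed(events):
        if event.get("event") == "phase_start" and event["phase"] not in ended:
            return event["phase"]
    return None


def resume_action(events: list[dict], artifacts_complete) -> tuple:
    """Map a run_state.jsonl event list to (action, phase)."""
    if not events:
        return ("fresh_run", None)
    phase = in_progress_phase(events)
    if phase is not None:
        # Trust artifacts more than events.
        if artifacts_complete(phase):
            return ("append_phase_end", phase)
        return ("rerun_phase", phase)
    ended = {e["phase"] for e in events if e.get("event") == "phase_end"}
    has_run_end = any(e.get("event") == "run_end" for e in events)
    if ended == set(range(1, 7)) and not has_run_end:
        return ("append_run_end", None)
    # Assumption: anything else is outside the documented resume cases.
    return ("unspecified", None)
```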

PROGRESS.md format

Atomically rewritten on every event. Markdown.

# QPB Run Progress

**Started:** 2026-05-15T14:32:01Z  **Benchmark:** chi-1.5.1  **Lever:** post-pattern7
**Runner:** claude  **Playbook version:** 1.5.6

## Phases

- [x] Phase 1 — Explore (10:10, 12 findings, patterns 1-7 walked)
- [x] Phase 2 — Generate (0:42, 9 artifacts produced)
- [x] Phase 3 — Code Review (15:31, 6 bugs identified)
- [x] Phase 4 — Spec Audit (3 auditors, 1 triage)
- [ ] Phase 5 — Reconciliation *(in progress, started 14:58:31Z)*
- [ ] Phase 6 — Verify

## Recent events (last 10)

- 2026-05-15T14:58:31Z — phase_start phase=5
- 2026-05-15T14:58:30Z — phase_end phase=4 passes=[A,B,C,D] req_count=89
- 2026-05-15T14:42:11Z — phase_end phase=1 findings=12

## Artifacts produced

- quality/EXPLORATION.md (12,034 bytes)
- quality/REQUIREMENTS.md (28,891 bytes)
- quality/COVERAGE_MATRIX.md (3,022 bytes)

All four sections (header, phase checklist, recent events, artifacts produced) are required. The phase checklist uses [x] for complete phases (with summary stats) and [ ] for incomplete ones; the in-progress phase is marked explicitly with its start time. Recent events shows the last 10 event lines from run_state.jsonl in human-readable form. Artifacts produced lists the files written this run with their byte sizes.
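"Atomically rewritten" means a reader never observes a half-written PROGRESS.md. The playbook AI performs the rewrite through its file tools, so the following is only a sketch of the same guarantee in script form, using a temporary file in the same directory followed by `os.replace`; `atomic_rewrite` is a hypothetical helper.

```python
import os
import tempfile


def atomic_rewrite(path: str, text: str) -> None:
    """Replace the file at path with text so readers see old or new, never partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".progress-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```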

Format invariants (enforced by `bin/run_state_lib.py` validators)

_index is line 1.
Every line is valid JSON (one object per line).
Every event has ts and event fields.
Every event value appears in _index.event_types.
Append-only: events are added, never edited. Editing a prior event is a schema violation.
phase_start and phase_end events for a given phase appear at most once per run (no out-of-order or duplicate phase markers).
run_start is the second line (after _index); run_end is the last line if the run completed.

Validators are read-only checks. They surface violations as findings; they don't auto-correct.
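To make the invariants concrete, here is a minimal read-only validator sketch for a per-run log. It is not the code in `bin/run_state_lib.py`; the function name and finding messages are invented, and the append-only rule is omitted because it cannot be checked from a single snapshot of the file.

```python
import json
from collections import Counter


def validate_run_state(lines: list[str]) -> list[str]:
    """Return findings for a per-run log given as a list of lines; never modifies input."""
    findings: list[str] = []
    parsed: dict[int, dict] = {}
    for number, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            findings.append(f"line {number}: not valid JSON")
            continue
        if isinstance(obj, dict):
            parsed[number] = obj
        else:
            findings.append(f"line {number}: not a JSON object")

    index = parsed.get(1, {})
    if index.get("event") != "_index":
        findings.append("line 1: expected _index event")
    types = index.get("event_types")
    declared = {t for t in types if isinstance(t, str)} if isinstance(types, list) else set()
    if len(lines) >= 2 and parsed.get(2, {}).get("event") != "run_start":
        findings.append("line 2: expected run_start event")

    markers = Counter()
    for number, obj in parsed.items():
        for field in ("ts", "event"):
            if field not in obj:
                findings.append(f"line {number}: missing {field}")
        name = obj.get("event")
        if isinstance(name, str) and name not in declared:
            findings.append(f"line {number}: event {name!r} not listed in event_types")
        if name in ("phase_start", "phase_end"):
            markers[(name, obj.get("phase"))] += 1
        if name == "run_end" and number != len(lines):
            findings.append(f"line {number}: run_end is not the last line")
    for (name, phase), count in markers.items():
        if count > 1:
            findings.append(f"{name} phase={phase} appears {count} times")
    return findings
```

Callers would typically pass `text.splitlines()` so that a trailing newline does not show up as an invalid final line.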
