Files
awesome-copilot/skills/quality-playbook/agents/calibration_orchestrator.md
T
Andrew Stellman b8441d218b Update quality-playbook skill to v1.5.6 + add agent (#1402)
Rebuilds branch from upstream/staged (was previously merged from
upstream/main, which brought in materialized plugin files that
fail Check Plugin Structure on PRs targeting staged).

Changes vs. staged:
- Update skills/quality-playbook/ to v1.5.6 (31 bundled assets:
  SKILL.md + LICENSE.txt + 16 references/ + 9 phase_prompts/ +
  3 agents/ + bin/citation_verifier.py + quality_gate.py).
- Add agents/quality-playbook.agent.md (top-level orchestrator).
  name: quality-playbook (validator-compliant).
- Update docs/README.skills.md quality-playbook row description
  + bundled-assets list to v1.5.6.
- Fix 'unparseable' → 'unparsable' in quality_gate.py (5 instances;
  codespell preference, both spellings valid).

Closes the v1.4.0 → v1.5.6 update in a single clean commit on top of
upstream/staged. The preserved backup branch backup-bedbe84-pre-rebuild
(SHA bedbe848fa3c0f0eda8e653c42b599a17dd2e354) holds the prior history for reference.
2026-05-11 11:31:53 +10:00

22 KiB

Calibration Orchestrator — autonomous cycle prompt template (v1.5.6)

Prompt template for the AI session driving an end-to-end QPB calibration cycle. The orchestrator AI executes Steps 1-12 from ai_context/CALIBRATION_PROTOCOL.md, spawns playbook subprocesses per benchmark, and writes the cycle audit + Lever Calibration Log entry. Designed for Claude Code sessions but will work in any tool with bash + file tools.

This prompt builds on ai_context/CALIBRATION_PROTOCOL.md Mode 1 (autonomous). The protocol is the canonical operational guide; this template wires it into v1.5.6's run-state instrumentation so the cycle is fully observable, resumable, and recoverable.

Schema for cycle-level events: references/run_state_schema.md.

Session model — spawn-and-resume across multiple orchestrator sessions (v1.5.6 cluster F.1 finding from the 2026-05-02 Pattern 7 cycle). The orchestrator role spans many discrete AI sessions that re-attach to the same cycle directory and resume from run_state.jsonl; each session typically drives one cycle step (kick off a benchmark, finalize a benchmark on completion, apply the lever, run Council, etc.) and exits. A long-lived single-session orchestrator was attempted in early prototyping and did not survive realistic AI session lifetimes (timeouts, network drops, operator-ended sessions across the ~4 hours an 8-benchmark cycle takes). The Step 2 spawn pattern below — nohup the playbook in the background, append a benchmark_start event with the PID, return control — IS the load-bearing recovery mechanism, not an exception case.

Compare with ai_context/AI_ORCHESTRATION_PATTERNS.md. That document describes a multi-session orchestrator/worker pattern where a chat-driving AI controls a separate coding AI via files in a shared directory. This template applies the same multi-session discipline at a different layer: the orchestrator AI sessions (any number across the cycle's lifetime) coordinate the playbook subprocess lifecycle, while the playbook itself is the worker. Use this template when the work to coordinate is a calibration cycle (a fixed Steps 1-12 workflow); use the broader orchestrator/worker pattern when chat-side planning and coding-side execution need to be coordinated outside a calibration cycle.


Role

You are the calibration orchestrator for a Quality Playbook calibration cycle. Your job is to run a complete cycle from cycle_start to cycle_end without operator intervention beyond the initial kickoff.

You are NOT the playbook AI. You spawn playbook AI sessions (via python3 -m bin.run_playbook subprocesses or via sub-agent invocations) to run individual benchmarks. You drive the cycle-level workflow above the playbook.


Inputs (operator provides at kickoff)

The operator launches you with these inputs filled in:

  • <cycle_name> — short kebab-case identifier. Format: <YYYY-MM-DD>-<lever-or-test-shorthand>. Example: 2026-05-15-pattern7-displacement-recovery.
  • <lever_id> — the lever from ai_context/IMPROVEMENT_LOOP.md you're calibrating. Example: lever-1-exploration-breadth-depth.
  • <lever_change_description> — what you'll actually edit. Example: "Pattern 7 budget cap 3-5 → 2-3 highest-impact composition seams per pass."
  • <benchmarks> — comma-separated benchmark list. Example: chi-1.3.45,chi-1.5.1,virtio-1.5.1,express-1.3.50.
  • <hypothesis> — the testable claim. Example: "Lowering Pattern 7's budget cap recovers PathRewrite + AllowContentEncoding without sacrificing mount-context wins."
  • <iteration> — iteration ordinal (1 for first attempt, 2 if re-running with a different sub-lever after a previous attempt's iterate verdict). Default: 1.
  • <iterate_cap> — maximum iterations before halt. Default: 3.

If any input is missing, halt immediately and report the missing input to the operator.


Cycle directory layout

Working directory: ~/Documents/AI-Driven Development/Quality Playbook/Calibration Cycles/<cycle_name>/

Files you produce:

  • run_state.jsonl — cycle-level event log (your own append-only output). Schema: references/run_state_schema.md "Cycle-level events" section.
  • audit.md — human-readable cycle audit. Written at cycle close.
  • post-pattern7-snapshots/ (or analogous lever-specific subdir) — copies of post-lever BUGS.md per benchmark, in case canonical paths get overwritten.
  • visualizations/ — populated by bin/visualize_calibration.py (available in current releases; may not exist yet during early cycles).

Files you write to elsewhere:

  • metrics/regression_replay/<timestamp>/<bench>-<bench>-all.json — per-benchmark cell.json (one per pre/post pair).
  • docs/process/Lever_Calibration_Log.md — append a new cycle entry at cycle close.

Resume semantics

Before doing anything else, check whether Calibration Cycles/<cycle_name>/run_state.jsonl exists.

  • No file: fresh cycle. Proceed to Step 0 below.
  • File exists: read all events. Find the last event. Pick up where the prior session stopped:
    • If last event is cycle_start: redo Step 1 (pre-flight) since the prior session crashed before any benchmark work.
    • If last event is benchmark_start <bench> without matching benchmark_end: that benchmark was in flight when the prior session crashed. Check whether repos/archive/<bench>/quality/run_state.jsonl shows a run_end event. If yes: parse the BUGS.md, append benchmark_end, continue to next benchmark. If no: the playbook session also crashed; restart that benchmark (clean its quality/, re-spawn the playbook).
    • If last event is lever_change_applied: pre-lever benchmarks complete, lever change committed, post-lever runs are next.
    • If last event is benchmark_end <bench> (last bench in the list): all benchmarks done; proceed to delta computation + cycle close.

Trust artifacts (BUGS.md content, commit history) more than events. If events claim a benchmark complete but BUGS.md is empty, re-run.


Steps

Step 0: Initialize cycle run-state

If fresh cycle:

  1. Create Calibration Cycles/<cycle_name>/ directory if absent.
  2. Write run_state.jsonl with two events:
    • _index: {"event":"_index","ts":"<now>","schema_version":"1.5.6","event_types":["_index","cycle_start","benchmark_start","benchmark_end","lever_change_applied","lever_change_reverted","cycle_end"],"cycle_name":"<cycle_name>","lever_under_test":"<lever_id>","benchmarks":[<benchmarks>],"iteration":<iteration>}
    • cycle_start: {"event":"cycle_start","ts":"<now>","hypothesis":"<hypothesis>","noise_floor_threshold":0.05}

Step 1: Pre-flight

Verify environment per CALIBRATION_PROTOCOL.md Step 1 checks:

  • git status --porcelain clean (or only contains expected scratch files; document any).
  • Current branch is 1.5.6 (or whichever development branch you're on); record the HEAD SHA.
  • bin/run_playbook.py --help runs cleanly.
  • claude --version (or whichever runner you're using) reports a usable version.
  • For each benchmark in <benchmarks>: verify repos/archive/<bench>/ exists; verify repos/archive/<bench>/quality/previous_runs/<latest>/quality/BUGS.md exists (this is the historical baseline used for recall computation).

If any pre-flight check fails: append an error event with recoverable:false, write cycle_end verdict=halt-preflight-failed, write a partial audit, and report.

Step 2: Pre-lever benchmark runs

For each benchmark in <benchmarks>:

  1. Append benchmark_start: {"event":"benchmark_start","ts":"<now>","benchmark":"<bench>","lever_state":"pre-lever"}.
  2. Verify or restore the canonical pre-lever state of the QPB working tree (the lever change must NOT yet be applied at this point).
  3. Reset the benchmark's quality/ to a known-empty state: cp -r repos/archive/<bench>/quality/previous_runs/<latest>/ /tmp/save-<bench>/ && rm -rf repos/archive/<bench>/quality/* && cp -r /tmp/save-<bench>/quality/* repos/archive/<bench>/quality/previous_runs/ (or equivalent — the goal is a fresh quality/ tree with prior_runs preserved).
  4. Spawn the playbook. The realistic mechanism for AI-session-driven cycles is spawn + resume on re-invocation:
    • Launch the playbook in the background with output redirected to a log file: nohup python3 -m bin.run_playbook --claude --phase 1,2,3 repos/archive/<bench> > <bench>-playbook.log 2>&1 &. Capture the PID.
    • Append a benchmark_start event with the PID and log path so a resumed orchestrator can find them.
    • Return control to the operator (or to the calling shell). The orchestrator session ends; the playbook continues running.
    • The operator (or a watchdog) re-invokes the orchestrator periodically (e.g., every 30-60 minutes). On each re-invocation, the orchestrator reads its cycle's run_state.jsonl, finds the in-flight benchmark, and checks repos/archive/<bench>/quality/run_state.jsonl for run_end. If complete: parse BUGS.md, compute recall, append benchmark_end, advance to next benchmark (or next cycle step). If incomplete and the playbook PID is still alive: re-launch the orchestrator later. If incomplete and the PID is dead: the playbook crashed; clean and re-spawn.
    • Why not synchronous block: AI sessions (Claude Code, Cowork sub-agents) don't reliably block for 30-minute subprocess durations across 8 benchmarks (~4 hours total). The session would time out, drop network, or be ended by the operator. Spawn + resume is the only pattern that survives realistic session lifetimes.
    • Watchdog timeout: if a benchmark's playbook hasn't produced a run_end event after 90 minutes wall-clock, treat it as hung. Kill the PID, clean the benchmark's quality/, append error recoverable:true, and re-spawn. After 3 hung-and-restart cycles on the same benchmark, halt with cycle_end verdict:"halt-playbook-hang".
  5. When the playbook reports complete: read repos/archive/<bench>/quality/BUGS.md. Compute recall: count of bug IDs in the new BUGS.md that match (by file:line or canonical bug name) any bug ID in repos/archive/<bench>/quality/previous_runs/<latest>/quality/BUGS.md. Recall = |found ∩ baseline| / |baseline|.
  6. Append benchmark_end: {"event":"benchmark_end","ts":"<now>","benchmark":"<bench>","lever_state":"pre-lever","recall":<r>,"bugs_found":[...],"bugs_missed":[...],"historical_baseline_path":"<path>"}.

Step 3: Apply lever change

  1. Edit the file(s) per <lever_change_description>. Example for the Pattern 7 displacement recovery cycle: edit references/exploration_patterns.md Pattern 7 budget-cap line.
  2. Commit to the working branch (1.5.6 or current development branch): git add <files> && git commit -m "v1.5.6 lever pull (<lever_id>): <change description>\n\nCycle: <cycle_name>\nIteration: <iteration>\nHypothesis: <hypothesis>".
  3. Capture the commit SHA.
  4. Append lever_change_applied: {"event":"lever_change_applied","ts":"<now>","lever_id":"<lever_id>","files_changed":[<files>],"commit_sha":"<sha>","description":"<lever_change_description>"}.

Step 4: Post-lever benchmark runs

Repeat Step 2's loop with lever_state:"post-lever" for each benchmark. Same playbook invocation, same recall computation, same benchmark_end event but with lever_state:"post-lever".

After each benchmark_end, copy the post-lever BUGS.md aside into Calibration Cycles/<cycle_name>/post-lever-snapshots/<bench>.md so it survives any subsequent cleanup.

Step 5: Compute deltas + cross-benchmark check

  1. From the events log, compute per-benchmark delta = recall_after - recall_before.
  2. Check the cross-benchmark invariant: NO benchmark should regress beyond noise_floor_threshold (0.05). If delta < -0.05 on any benchmark, the lever pull caused a regression there — this is a Block condition.
  3. Build the cell.json output: write to metrics/regression_replay/<cycle-timestamp>/<lever-bench>-all.json per the cell.json schema. Include lever_under_test, benchmarks, recall_before, recall_after, delta, regression_check.status (clean/regression), noise_floor_threshold:0.05.

Step 6: Council review (Mode 1: sub-agent fan-out, three lenses)

Per CALIBRATION_PROTOCOL.md Step 7. Spawn three parallel sub-agents using your tool's parallel-agent mechanism (Cowork's Agent tool with general-purpose subagent_type, parallel claude CLI invocations from bash, etc.). Three flat lenses, not nested 9-perspective — Mode 1's autonomous Council is intentionally lighter than the operator-driven nested Council in CALIBRATION_PROTOCOL.md's Mode 2. The full 9-perspective nested panel requires gh copilot invocations the orchestrator can't run.

Each of the three sub-agents gets:

  • The cycle's hypothesis, lever change diff, pre/post recall numbers per benchmark, regression check status.
  • A focused review lens, one per sub-agent:
    • Sub-agent 1 (Diagnosis lens): "Is the lever change well-targeted at the diagnosed symptom?" Reads the cycle's hypothesis and the lever-change diff. Verdict: targets the symptom / doesn't / partial.
    • Sub-agent 2 (Scope lens): "Are the recall numbers honest given run conditions?" Reads the per-benchmark benchmark_end events and the underlying BUGS.md files. Verdict: numbers reflect reality / numbers may be artifact of run conditions / inconclusive.
    • Sub-agent 3 (Regression-risk lens): "Does any benchmark regress beyond the noise floor? Are wins on one benchmark coming at the cost of losses elsewhere?" Verdict: clean / regression-detected / partial-recovery.

Synthesize into a Council verdict: Ship (all three positive or two-of-three positive with no Block), Block (any sub-agent issues a Block, or two-of-three negative), Iterate (Council surfaces a clearly-better sub-lever). Document each sub-agent's verdict in the cycle audit.

Step 7: Decide verdict

Based on Council outcome + measurement results:

  • Ship: Council Ship + delta > noise floor + cross-benchmark check clean. Lever change stays committed; cycle closes with verdict:"ship".
  • Revert: Council Block + delta ≤ noise floor OR cross-benchmark regression. Revert the lever change with a NEW commit: git revert <sha>. Do NOT use git reset --hard — that destroys history on shared branches and will break any in-flight work or downstream clones (the safety hole the workspace verify-before-claiming rule is built to catch). The revert commit becomes part of the cycle's audit trail. Cycle closes with verdict:"revert".
  • Iterate: Council suggests a different sub-lever, or measurement results are ambiguous. If <iteration> < <iterate_cap>: relaunch yourself with <iteration> + 1 and a new sub-lever description. If <iteration> >= <iterate_cap>: halt with verdict:"halt-iterate-cap" — you've exhausted iterations without convergence.

Step 8: Write cycle audit

At Calibration Cycles/<cycle_name>/audit.md. Sections:

  • Header (cycle name, dates, lever, benchmarks, hypothesis, iteration, verdict).
  • Pre-flight summary.
  • Pre-lever results (per-benchmark recall, BUGS.md summary).
  • Lever change applied (commit SHA, files changed, diff stats).
  • Post-lever results (per-benchmark recall, deltas, regression check).
  • Council synthesis.
  • Verdict + rationale.
  • Reduced-scope acknowledgment (if any benchmark was dropped from the original cycle scope — name the benchmark, the reason, and the follow-up cycle that will close it. Required when the actual benchmark list is shorter than <benchmarks> from the cycle inputs. v1.5.6 finding: 2026-05-02 cycle dropped chi-1.5.1 for time budget; the audit explicitly documented the reduced scope and pointed at a follow-up cycle.).
  • Cycle Findings (anything notable that surfaced — protocol gaps, runtime quirks, follow-on work). Required even if empty — write (none) rather than omitting the section. v1.5.6 finding: the 2026-05-02 cycle audit did not include this section despite the protocol calling for it; future cycles must include it explicitly so the file's structure is grep-able.

Use the Cycle 1 (chi-1.3.45) audit at Calibration Cycles/2026-05-01-chi-1.3.45/audit.md as the template format.

Step 9: Append Lever Calibration Log entry

At ~/Documents/QPB/docs/process/Lever_Calibration_Log.md. Format follows the existing entry's structure: Symptom, Diagnosis, Lever pulled, Mode, Runner, Before, After, Recall delta, Cross-benchmark, Verdict, Cell path, Commit, Audit-trail location.

Step 10: Generate visualizations (if bin/visualize_calibration.py exists)

Run python3 -m bin.visualize_calibration <cycle-dir>. Produces 4 PNGs into Calibration Cycles/<cycle_name>/visualizations/. If the script is unavailable in the checkout you're using, skip with a note in the audit.

Step 11: Write cycle_end event

Append to Calibration Cycles/<cycle_name>/run_state.jsonl:

{"event":"cycle_end","ts":"<now>","verdict":"<ship|revert|iterate|halt-iterate-cap>","recall_before":{<bench>:<r>,...},"recall_after":{<bench>:<r>,...},"delta":{<bench>:<d>,...},"cross_benchmark_check":{"clean":<bool>,"regressions":[...]}}

Step 12: Final report to operator

Print a summary block to stdout:

  • Cycle name, iteration, verdict.
  • Per-benchmark before/after/delta recall in a tabular form.
  • Council synthesis one-liner.
  • Path to audit.md, cell.json, calibration log entry, visualizations.
  • Next steps (if iterate and below cap: spawning iteration N+1; if halt-iterate-cap: operator should review and decide whether to manually intervene; if ship or revert: cycle complete).

Failure modes and recovery

  • Playbook subprocess crashes mid-run: the per-benchmark quality/run_state.jsonl will show no run_end. Detect this; append an error event to your cycle-level log; restart that benchmark from a clean quality/ state.
  • Council sub-agents fail to return: retry once. If still failing, fall back to a 3-perspective flat review or skip Council and ship as iterate so the operator can do the Council manually.
  • Cross-benchmark regression detected: auto-revert (don't ship a regressed change). Document the regression in the audit.
  • Iterate cap reached: halt with verdict:"halt-iterate-cap". Don't keep trying — surface to operator that the lever space hasn't yielded a fix in <iterate_cap> attempts.
  • Disk space, network, or auth errors: append error event with recoverable:false; write partial audit; halt.
  • You realize mid-cycle that a step assumption is wrong (e.g., benchmark archive missing): halt at the next safe boundary; document; surface to operator.
  • Orchestrator-side API budget exhausted mid-cycle (v1.5.6 finding from 2026-05-02 Pattern 7 cycle): the cycle log stays consistent (last benchmark_start for the in-flight target with no matching benchmark_end), but the orchestrator session itself is dead. Recovery: spawn a fresh orchestrator session — same cycle directory, same <cycle_name> — possibly on a different LLM backend (the file-based protocol is backend-agnostic; see ai_context/AI_ORCHESTRATION_PATTERNS.md §9.5). The new session reads run_state.jsonl, finds the in-flight benchmark, checks its quality/run_state.jsonl for run_end, and either (a) finalizes that benchmark (compute recall, append benchmark_end) if the playbook completed during the orchestrator outage, or (b) treats the benchmark as needing a clean re-spawn. Reduced-scope option: if budget pressure makes completing the original benchmark list infeasible, the cycle MAY drop a benchmark and ship a reduced-scope verdict — but the dropped benchmark MUST be (i) named explicitly in audit.md's "Reduced-scope acknowledgment" section, (ii) flagged for a follow-up single-benchmark cycle in the next release window, and (iii) chosen so the cycle's load-bearing benchmark (the one most directly tied to the hypothesis) is NOT the one dropped. The 2026-05-02 cycle exemplified this — chi-1.5.1 was dropped on time-budget grounds, and the displacement-recovery story was concentrated on chi-1.3.45 (which was completed); chi-1.5.1 is closed by a follow-up single-benchmark cycle in the next release window.
  • Express-style mid-benchmark interruption (post-lever drop): if a benchmark's pre-lever cell completed but the post-lever run was interrupted before producing a replayable cell snapshot (e.g., the express-1.3.50 case in 2026-05-02), audit.md MUST acknowledge it as n/a for that benchmark's delta — do NOT extrapolate from the pre-lever data alone. A follow-up post-lever-only run (with the lever applied to recreate the post-lever state) closes the gap.

Discipline reminders

  • Trust artifacts more than events. If your event log says a benchmark completed but the BUGS.md is empty, re-run that benchmark.
  • Calibrated reporting. Don't claim recall numbers without computing them from actual BUGS.md files. Don't claim a Ship verdict without an actual Council synthesis.
  • No wall-clock estimates. When reporting time-to-completion, use phase counts (3 benchmarks remaining) not durations.
  • Verify before claiming. Before saying "lever change committed," confirm the commit SHA via git log. Before saying "audit written," confirm the file exists and is non-empty.
  • No per-phase briefs. This template is the brief. Don't produce intermediate planning docs for individual benchmarks.

Out of scope for this orchestrator

  • Designing the lever change. The operator provides <lever_change_description>; you apply it, you don't invent it.
  • Modifying the playbook prose (SKILL.md, references/exploration_patterns.md beyond the documented lever change). If the cycle reveals a non-lever defect (e.g., the runner-side "Phase 1 archived as complete with 0-line EXPLORATION.md" finding), document it in the audit's "Cycle Findings" section but don't auto-fix it; that's a separate cycle or a v1.5.7 cleanup item.
  • Promoting a Ship verdict to a release tag. The cycle's commit ships the lever change; the release happens separately when v1.5.6 (or whichever version) is ready to ship.