# Step 6: Analyze Outcomes

**Why this step**: `pixie test` produced raw scores. Now you analyze those results to understand what they mean — completing pending evaluations, identifying patterns, validating hypotheses, and producing an actionable improvement plan. The analysis is structured in three phases that build on each other: entry-level → dataset-level → action plan.

---

## Result directory structure

After `pixie test`, the result directory looks like:

```text
{PIXIE_ROOT}/results/<test_id>/
  meta.json
  dataset-{idx}/
    metadata.json
    entry-{idx}/
      config.json         # evaluators, description, expectation
      eval-input.jsonl    # input data fed to evaluators
      eval-output.jsonl   # output data captured from app
      evaluations.jsonl   # scored + pending evaluations
      trace.jsonl         # LLM call traces
```

Read `meta.json` to find the `<test_id>`. All the data you need for analysis is in this directory.
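
As a concrete starting point, here is a minimal sketch of locating the newest run and loading its `meta.json`, assuming `{PIXIE_ROOT}` is `pixie_qa` (as in the verifier example at the end of this document); adjust the path and run selection to your setup:

```python
import json
from pathlib import Path

PIXIE_ROOT = Path("pixie_qa")  # assumption: substitute your actual {PIXIE_ROOT}

# Pick the most recently modified run; pass an explicit <test_id> instead if you
# are analyzing an earlier cycle.
runs = sorted((PIXIE_ROOT / "results").iterdir(), key=lambda p: p.stat().st_mtime)
run_dir = runs[-1]

meta = json.loads((run_dir / "meta.json").read_text())
print(f"Analyzing {run_dir.name}: {meta}")

# Enumerate the dataset and entry directories that Phase 1 will grade.
for dataset_dir in sorted(run_dir.glob("dataset-*")):
    for entry_dir in sorted(dataset_dir.glob("entry-*")):
        print(dataset_dir.name, entry_dir.name)
```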

---

## Hard completion gate

You are the grader for Step 6. **Pending evaluations are not a handoff to the user, and the web UI is not a substitute for grading.** You may use the web UI to browse traces and outputs, but completion happens by writing files on disk.

Step 6 is incomplete until all of the following are true:

- Every `"status": "pending"` entry in every `evaluations.jsonl` has been replaced with a scored entry that contains both `score` and `reasoning`.
- Every dataset directory contains `analysis.md` and `analysis-summary.md`.
- The test run root contains `action-plan.md` and `action-plan-summary.md`.
- The verifier script in this skill's `resources/` directory passes for the target results directory.
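
A rough self-check along these lines can catch obvious gaps early; it is only a sketch of the gate above, assuming the directory layout shown earlier, and it does not replace the official verifier in `resources/`:

```python
import json
from pathlib import Path

run_dir = Path("pixie_qa/results/<test_id>")  # substitute the real test id

problems = []
# Any remaining pending row means Phase 1 is unfinished.
for eval_file in run_dir.glob("dataset-*/entry-*/evaluations.jsonl"):
    for line in eval_file.read_text().splitlines():
        if line.strip() and json.loads(line).get("status") == "pending":
            problems.append(f"pending evaluation in {eval_file}")
# Dataset-level and run-level artifacts must exist.
for dataset_dir in run_dir.glob("dataset-*"):
    for name in ("analysis.md", "analysis-summary.md"):
        if not (dataset_dir / name).exists():
            problems.append(f"missing {dataset_dir.name}/{name}")
for name in ("action-plan.md", "action-plan-summary.md"):
    if not (run_dir / name).exists():
        problems.append(f"missing {name}")

print("\n".join(problems) or "no obvious gaps; now run the official verifier")
```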

**Forbidden shortcuts**:

- Leaving any `"status": "pending"` entries in place
- Telling the user to review pending evaluations in the web UI
- Writing a single top-level substitute file such as `pixie_qa/06-analysis.md`
- Writing phrases like "likely passes" or "probably fails" without scoring the evaluation and updating `evaluations.jsonl`

If you do any of the above, Step 6 is not done.

## Iteration rule

If you are iterating across multiple fix/test cycles, every successful `pixie test` run creates a new `pixie_qa/results/<test_id>` directory and a new Step 6 obligation. The moment that directory exists, it becomes the analysis target for the current cycle.

Before you edit application code, prompts, datasets, evaluators, or rerun `pixie test`, complete Step 6 for that exact results directory. Do not skip earlier cycles and analyze only the last run.

**Additional forbidden shortcut**:

- Do not create a newer `pixie_qa/results/<test_id>` and leave an older one from the same task without Step 6 artifacts.
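
One way to spot an older run that still owes a Step 6 pass is to scan for missing action plans; this is a sketch that assumes `{PIXIE_ROOT}` is `pixie_qa` and uses `action-plan.md` as the completion marker:

```python
from pathlib import Path

results_root = Path("pixie_qa/results")  # assumption: {PIXIE_ROOT} is pixie_qa

# Any run directory without an action plan has an outstanding Step 6 obligation.
for run_dir in sorted(results_root.iterdir()):
    if run_dir.is_dir() and not (run_dir / "action-plan.md").exists():
        print(f"Step 6 incomplete for {run_dir.name}")
```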

---

## Writing principles

Every **detailed** analysis artifact you produce must follow these principles:

- **Data-driven**: Every opinion or statement must be backed by concrete data from the evaluation run. Quote scores, cite entry indices, reference specific eval input/output content. No hand-waving. It is better to write nothing than to write something unsubstantiated.
- **Evidence-first**: Present the raw data and evidence before drawing conclusions. The reader (another coding agent) should be able to independently verify your conclusions from the evidence you cite.
- **Traceable**: For every conclusion, provide the chain: data source → observation → reasoning → conclusion. Another agent should be able to follow this chain backward to verify or challenge any claim.
- **No selling**: Do not advocate, promote, or use value-laden language ("excellent", "robust", "impressive", "well-designed"). State what the data shows and what actions it implies. Let the reader form quality judgments.
- **Action-oriented**: Every analysis should contribute to the end goal of concrete improvements to the evaluation pipeline or application. Do not write observations that don't lead somewhere.

Every persisted analysis **summary** artifact must follow these principles:

- **Concise**: The human reader should be able to understand the key findings and actions in under 2 minutes for any single artifact.
- **Conclusions-first**: Lead with what the reader needs to know (results, findings, actions), not with methodology or background.
- **Plain language**: Avoid jargon. A non-technical stakeholder should be able to follow the summary.
- **Consistent**: Summary conclusions must match the detailed version's evidence. Never add claims in the summary that aren't supported in the detailed version.

### Dual-variant pattern

Every persisted analysis artifact in this step has two files:

| Artifact         | Detailed file (for agent)   | Summary file (for human)            |
| ---------------- | --------------------------- | ----------------------------------- |
| Dataset analysis | `dataset-{idx}/analysis.md` | `dataset-{idx}/analysis-summary.md` |
| Action plan      | `action-plan.md`            | `action-plan-summary.md`            |

**Always write the detailed version first**, then derive the summary from it. The summary is a strict subset of the detailed version's content — it should never contain claims or conclusions not present in the detailed version.

---

## Phase 1: Entry-level grading pass

Process each dataset entry individually. For each `dataset-{idx}/entry-{idx}/`:

### 1a. Read the entry data

Read these files for the entry (a minimal loading sketch follows the list):

- `config.json` — what evaluators were configured, the description, the expectation
- `eval-input.jsonl` — what data was fed to the app/evaluators
- `eval-output.jsonl` — what the app produced
- `evaluations.jsonl` — current evaluation results (scored and pending)
- `trace.jsonl` — what LLM calls the app made (if available)
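
The sketch below loads one entry directory, assuming each JSONL file holds one JSON object per line; no field names beyond those shown in this document are assumed:

```python
import json
from pathlib import Path

def read_jsonl(path: Path) -> list[dict]:
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

entry_dir = Path("pixie_qa/results/<test_id>/dataset-0/entry-0")  # substitute real paths

config = json.loads((entry_dir / "config.json").read_text())
eval_input = read_jsonl(entry_dir / "eval-input.jsonl")
eval_output = read_jsonl(entry_dir / "eval-output.jsonl")
evaluations = read_jsonl(entry_dir / "evaluations.jsonl")

trace_path = entry_dir / "trace.jsonl"
trace = read_jsonl(trace_path) if trace_path.exists() else []  # trace may be absent
```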

### 1b. Complete pending evaluations

If `evaluations.jsonl` contains entries with `"status": "pending"`, you must grade them:

1. Read the `criteria` field of the pending evaluation
2. Apply the criteria to the entry's eval input, eval output, and trace data
3. Assign a **score** between 0.0 and 1.0:
   - `1.0` — fully meets the criteria
   - `0.5`–`0.9` — partially meets criteria (explain what's missing)
   - `0.0`–`0.4` — does not meet criteria
4. Write a **reasoning** string (1–3 sentences citing specific evidence from the output or trace)
5. Replace the pending entry in `evaluations.jsonl` with the scored result. **Do not append a second row and leave the pending row in place. Overwrite the pending row itself.**

**Before** (pending):

```json
{
  "evaluator": "ResponseQuality",
  "status": "pending",
  "criteria": "The response should..."
}
```

**After** (scored):

```json
{
  "evaluator": "ResponseQuality",
  "score": 0.85,
  "reasoning": "Response addresses the main question but omits..."
}
```
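
A minimal sketch of this in-place overwrite, assuming the row shapes shown above; the score and reasoning themselves come from your judgment against each row's `criteria`, and the hypothetical `grade` helper below only marks where that judgment plugs in:

```python
import json
from pathlib import Path

def grade(pending: dict, entry_dir: Path) -> dict:
    """Placeholder: apply pending["criteria"] to the entry's eval input, eval
    output, and trace, then return the resulting score and reasoning."""
    return {"score": 0.85, "reasoning": "Response addresses the main question but omits..."}

entry_dir = Path("pixie_qa/results/<test_id>/dataset-0/entry-0")  # substitute real paths
eval_path = entry_dir / "evaluations.jsonl"

rows = [json.loads(line) for line in eval_path.read_text().splitlines() if line.strip()]
for row in rows:
    if row.get("status") == "pending":
        row.pop("status")                  # the pending marker is removed, not duplicated
        row.update(grade(row, entry_dir))  # adds "score" and "reasoning"

# Rewrite the file with the same number of rows: scored rows replace pending ones.
eval_path.write_text("\n".join(json.dumps(row) for row in rows) + "\n")
```

Whether to keep the original `criteria` field on the scored row is up to you; the sketch above leaves it in place.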

**Grading guidelines**:

- Be evidence-based — every score must reference specific output or trace content
- Use the criteria literally — do not expand or reinterpret beyond what's written
- Consider the trace — distinguish between app logic problems and LLM quality issues
- Be calibrated — reserve 1.0 for outputs that genuinely satisfy criteria fully
- Do not penalize LLM non-determinism — different phrasing of a correct answer is not a failure
- Do not defer to the user — if the evidence is sufficient to write "likely passes", it is sufficient to assign a score and update `evaluations.jsonl`

### 1c. Do not persist entry-level analysis files

In this trimmed workflow, **do not write `entry-{idx}/analysis.md` or `entry-{idx}/analysis-summary.md`**. Phase 1 is only for reading evidence and converting every pending evaluation into a scored row in `evaluations.jsonl`.

You may take temporary scratch notes while reasoning, but they are not deliverables. Persist only:

- updated `evaluations.jsonl` in each entry directory
- dataset-level analysis files in Phase 2
- run-level action plan files in Phase 3

---

## Phase 2: Dataset-level analysis

After all entries in a dataset are analyzed, produce the dataset-level analysis. Write `analysis.md` in the dataset directory (`dataset-{idx}/analysis.md`), along with the summary file described in 2c.

### 2a. Aggregate the data

Summarize across all entries in the dataset (a sketch of this aggregation follows the list):

- Pass/fail counts and overall pass rate
- Per-evaluator statistics (pass rate, min/max/mean scores)
- Which entries failed which evaluators (failure clusters)
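
The sketch below computes these aggregates, assuming scored rows carry `evaluator` and `score` fields as in the example in 1b, and treating a score at or above 0.7 as a pass purely for illustration (use whatever pass threshold your evaluators actually define):

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean

dataset_dir = Path("pixie_qa/results/<test_id>/dataset-0")  # substitute real paths
PASS = 0.7  # assumption: illustrative threshold only

scores = defaultdict(list)    # evaluator -> scores across entries
failures = defaultdict(list)  # entry -> evaluators it failed
for eval_file in sorted(dataset_dir.glob("entry-*/evaluations.jsonl")):
    entry = eval_file.parent.name
    for line in eval_file.read_text().splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        if "score" not in row:
            continue  # still pending: finish Phase 1 first
        scores[row["evaluator"]].append(row["score"])
        if row["score"] < PASS:
            failures[entry].append(row["evaluator"])

# Per-evaluator statistics.
for evaluator, vals in sorted(scores.items()):
    pass_rate = sum(v >= PASS for v in vals) / len(vals)
    print(f"{evaluator}: pass {pass_rate:.0%}, min {min(vals)}, max {max(vals)}, mean {mean(vals):.2f}")

# Failure clusters: entries grouped by the exact set of evaluators they failed.
clusters = defaultdict(list)
for entry, failed in failures.items():
    clusters[tuple(sorted(failed))].append(entry)
print(dict(clusters))
```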

### 2b. Form and validate hypotheses

Form **exactly 3 high-confidence hypotheses**, one for each of these three dimensions:

1. **Test cases quality** — Does the set of test cases sufficiently and efficiently verify the application's capabilities? Does it cover the important failure modes? Are there blind spots?

2. **Evaluation criteria/evaluator quality** — Do the evaluators have proper granularity and grading to catch real issues? Are there rubber-stamp evaluators (all 1.0)? Are there flaky evaluators (high variance without code changes)? Are criteria too vague or too strict?

3. **Application quality** — Based on the evaluation results, what are the application's strengths and weaknesses? Where does it produce high-quality output? Where does it fail?

For each hypothesis:

- **State the hypothesis** clearly in one sentence
- **Cite the evidence** — entry indices, evaluator names, scores, reasoning quotes, trace data
- **Validate or invalidate** — look at the actual eval input/output data and code to confirm or refute
- **Conclusion** — what action does this hypothesis imply?

It is always possible to produce 3 hypotheses even when the data is limited. If the evaluation data doesn't give a conclusive answer on application quality, that itself is a signal about test case or evaluator gaps.

### 2c. Write the dataset analysis (two files)

Produce **two files** for the dataset analysis. Write the detailed version first, then derive the summary.

#### Detailed version: `dataset-{idx}/analysis.md`

This file is for **agent consumption** — it provides the complete data aggregation, hypothesis formation with evidence chains, and validated conclusions that a coding agent can act on directly.

**Writing principles:**

- **Show all the data before interpreting it.** Start with the raw aggregation (pass/fail, per-evaluator stats, failure clusters) before any hypotheses. The data should stand on its own.
- **For each hypothesis, present: data → reasoning → conclusion.** The reader should be able to follow your logic step by step and arrive at the same conclusion independently.
- **Cross-reference raw entry evidence directly.** When citing evidence, reference the specific entry index and the underlying files/data points (for example: `entry-3/evaluations.jsonl`, `entry-3/eval-output.jsonl`, or `entry-3/trace.jsonl`).
- **Distinguish correlation from causation.** If two entries fail the same evaluator, that's a pattern. But the root cause might differ — verify by checking the actual output data, don't assume.
- **Do not speculate without marking it.** If a conclusion is uncertain, say "Hypothesis (unvalidated): ..." and explain what additional data would confirm or refute it.

**Content:**

1. **Overview** — dataset name, entry count, overall pass rate
2. **Raw aggregation data**
   - Per-evaluator statistics table (pass rate, score range, mean, standard deviation)
   - Failure matrix: entries × evaluators showing scores, highlighting failures
   - Failure clusters: entries grouped by shared failed evaluators
3. **Hypothesis 1: Test cases** — hypothesis statement, evidence with entry/evaluator references, validation steps taken, conclusion with specific action
4. **Hypothesis 2: Evaluators** — same structure
5. **Hypothesis 3: Application** — same structure
6. **Open questions** — anything the data doesn't conclusively answer, with suggestions for what additional data would help

#### Summary version: `dataset-{idx}/analysis-summary.md`

This file is for **human review** — a scannable overview of the dataset results, key findings, and recommended actions.

**Template:**

```markdown
# Dataset Analysis — Summary

**Dataset**: <name> | **Entries**: <N> | **Pass rate**: <X/N (Y%)>

## Results at a glance

| Evaluator | Pass rate | Avg score | Notes                  |
| --------- | --------- | --------- | ---------------------- |
| ...       | ...       | ...       | <one-liner if notable> |

## Key findings

1. <Finding>: <1-2 sentences with the conclusion and its implication>
2. ...
3. ...

## Recommended actions (priority order)

1. <Action>: <what to do and expected impact, 1-2 sentences>
2. ...
3. ...
```

Maximum ~40 lines for the summary.

---

## Phase 3: Action plan (two files)

After all datasets are analyzed, produce the action plan. Write **two files** at the test run root. Write the detailed version first, then derive the summary.

### Detailed version: `{PIXIE_ROOT}/results/<test_id>/action-plan.md`

This file is for **agent consumption** — it provides specific, implementable improvement items with full evidence trails, so a coding agent can pick up any item and execute it without additional context-gathering.

**Writing principles:**

- **Each item must be self-contained.** A coding agent reading just one priority item should have enough context (evidence references, file paths, expected changes) to implement it.
- **Trace every item back to evidence.** Each priority must reference: which hypothesis (from which dataset analysis), which entries/evaluators provided the evidence, and what the specific data showed.
- **Be concrete about "How".** Don't say "improve the prompt" — say "In `scrapegraphai/prompts/generate_answer.py` line 45, add instruction: '...'". The more specific, the more actionable.
- **Do not include speculative items.** Every item must have validated evidence. If an item is based on an unvalidated hypothesis, either validate it first or exclude it.

**Structure:**

```markdown
# Action Plan (Detailed)

## Summary

- X datasets analyzed, Y total entries, Z% overall pass rate
- [1-2 sentence high-level assessment]

## Priority 1: [Most impactful improvement]

- **What**: [specific change to make]
- **Why**: [which hypothesis from which dataset analysis, with entry/evaluator references]
- **Evidence**: [specific scores, output excerpts, trace data that support this]
- **Expected impact**: [which entries/evaluators this will improve, and predicted score change]
- **How**: [concrete implementation steps with file paths and line numbers]
- **Verification**: [how to verify the fix worked — which entries to re-run, what scores to expect]

## Priority 2: ...

...
```

### Summary version: `{PIXIE_ROOT}/results/<test_id>/action-plan-summary.md`

This file is for **human review** — a prioritized list of improvements that a human can understand and approve in under 2 minutes.

**Template:**

```markdown
# Action Plan — Summary

**Overall**: <X entries, Y% pass rate. 1-sentence assessment.>

## Actions (priority order)

1. **<Action title>**: <What to change and why, 2-3 sentences. Expected impact.>
2. **<Action title>**: <What to change and why, 2-3 sentences. Expected impact.>
3. ...
```

Maximum ~30 lines for the summary.

**Prioritization criteria**:

- Systemic issues (affecting multiple entries/datasets) before isolated ones
- Issues with clear, validated evidence before speculative ones
- Application quality gaps before evaluator refinements before test case additions
- Quick fixes before large refactors

The action plan should have 3–5 items. Each must trace back to a validated hypothesis from Phase 2. Do not include items that are speculative or lack evidence.

---

## Process summary

1. **Phase 1** (per entry): Read data → grade pending evaluations → update `evaluations.jsonl`
2. **Phase 2** (per dataset): Aggregate → form 3 hypotheses → validate → write `dataset-{idx}/analysis.md` + `dataset-{idx}/analysis-summary.md`
3. **Phase 3** (per test run): Synthesize → prioritize → write `action-plan.md` + `action-plan-summary.md`

Process entries within a dataset concurrently (using subagents if available). Process phases sequentially — Phase 2 depends on Phase 1 outputs, Phase 3 depends on Phase 2 outputs.

---

## Final verification

Before you end your turn, run the Step 6 verifier script that ships beside `setup.sh` in this skill's `resources/` directory against the exact test run directory you analyzed.

Example shape:

```bash
python /path/to/eval-driven-dev/resources/verify_step6_completion.py pixie_qa/results/<test_id>
```

If the verifier reports any error, keep working. Step 6 is not complete until the verifier passes.