update eval-driven-dev skill (#1434)

* update eval-driven-dev skill

* fix: update skill update command to use correct repository path

* address comments.

* update eval driven dev
Yiou Li
2026-04-27 18:27:48 -07:00
committed by GitHub
parent 9933f65e6b
commit 2860790bc9
23 changed files with 1881 additions and 700 deletions


# Step 5: Run `pixie test` and Fix Mechanical Issues
**Why this step**: Run `pixie test` and fix mechanical issues in your QA components — dataset format problems, runnable implementation bugs, and custom evaluator errors — until every entry produces real scores. This step is NOT about assessing result quality or fixing the application itself.
---
## 5a. Run tests
```bash
uv run pixie test
```
For verbose output with per-case scores and evaluator reasoning:
```bash
uv run pixie test -v
```
`pixie test` automatically loads the `.env` file before running tests.
The evaluation harness:
1. Resolves the `Runnable` class from the dataset's `runnable` field
2. Calls `Runnable.create()` to construct an instance, then `setup()` once
3. Runs all dataset entries **concurrently** (up to 4 in parallel):
   a. Reads `input_data` and `eval_input` from the entry
   b. Populates the wrap input registry with `eval_input` data
   c. Initialises the capture registry
   d. Validates `input_data` into the Pydantic model and calls `Runnable.run(args)`
   e. `wrap(purpose="input")` calls in the app return registry values instead of calling external services
   f. `wrap(purpose="output"/"state")` calls capture data for evaluation
   g. Builds `Evaluable` from captured data
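Conceptually, item 3 is plain asyncio fan-out over the dataset entries. The sketch below is illustrative only and is not pixie's implementation: the wrap registries and the `Evaluable` are reduced to plain dictionaries, and every name in it is an assumption.
```python
import asyncio
from typing import Any, Type

from pydantic import BaseModel


async def run_entries(runnable: Any, args_model: Type[BaseModel],
                      entries: list[dict], max_parallel: int = 4) -> list[dict]:
    """Sketch of the per-entry flow above; not the real harness."""
    sem = asyncio.Semaphore(max_parallel)                # at most 4 entries in flight

    async def run_one(entry: dict) -> dict:
        async with sem:
            input_registry = dict(entry["eval_input"])   # stand-in for the wrap input registry
            captures: dict[str, Any] = {}                # stand-in for the capture registry
            args = args_model(**entry["input_data"])     # validate input_data into the Pydantic model
            await runnable.run(args)                     # in the real harness, wrap() calls consult the registries here
            return {"inputs": input_registry, "captured": captures}  # stand-in for an Evaluable

    return await asyncio.gather(*(run_one(e) for e in entries))
```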
Because entries run concurrently, the Runnable's `run()` method must be concurrency-safe. If you see `sqlite3.OperationalError`, `"database is locked"`, or similar errors, add a `Semaphore(1)` to your Runnable (see the concurrency section in Step 2 reference).
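For example, a Runnable that keeps a single SQLite connection can serialise access with the `Semaphore(1)` mentioned above. This is a minimal sketch, assuming the `create()`/`setup()`/`run()` shape described earlier; the class name, `Args` model, async signatures, and database path are all illustrative, not the real pixie protocol.
```python
import asyncio
import sqlite3

from pydantic import BaseModel


class Args(BaseModel):
    question: str          # illustrative input field


class MyRunnable:
    @classmethod
    async def create(cls) -> "MyRunnable":
        return cls()

    async def setup(self) -> None:
        self._db = sqlite3.connect("app.db")   # one shared connection (path is illustrative)
        self._lock = asyncio.Semaphore(1)      # serialises concurrent run() calls

    async def run(self, args: Args) -> None:
        async with self._lock:                 # prevents "database is locked" under 4-way concurrency
            ...                                # call the application with args; wrap() captures outputs
```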
## 5b. Fix mechanical issues only
This step is strictly about fixing what you built in previous steps — the dataset, the runnable, and any custom evaluators. You are fixing mechanical problems that prevent the pipeline from running, NOT assessing or improving the application's output quality.
**What counts as a mechanical issue** (fix these):
| Error | Cause | Fix |
| ------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| Runnable resolution failure | `runnable` path or class name is wrong, or the class doesn't implement the `Runnable` protocol | Fix `filepath:ClassName` in the dataset; ensure the class has `create()` and `run()` methods |
| Import error | Module path or syntax error in runnable/evaluator | Fix the referenced file |
| `ModuleNotFoundError: pixie_qa` | `pixie_qa/` directory missing `__init__.py` | Run `pixie init` to recreate it |
| `TypeError: ... is not callable` | Evaluator name points to a non-callable attribute | Evaluators must be functions, classes, or callable instances |
| `sqlite3.OperationalError` | Concurrent `run()` calls sharing a SQLite connection | Add `asyncio.Semaphore(1)` to the Runnable (see Step 2 concurrency section) |
| Custom evaluator crashes | Bug in your custom evaluator implementation | Fix the evaluator code |
**What is NOT a mechanical issue** (do NOT fix these here):
- Application produces wrong/low-quality output → that's the application's behavior, analyzed in Step 6
- Evaluator scores are low → that's a quality signal, analyzed in Step 6
- LLM calls fail inside the application → report in Step 6, do not mock or work around
- Evaluator scores fluctuate between runs → normal LLM non-determinism, not a bug
Iterate — fix errors, re-run, fix the next error — until `pixie test` runs to completion with real evaluator scores for all entries.
## Output
After `pixie test` completes successfully, results are stored in the per-entry directory structure:
```
{PIXIE_ROOT}/results/<test_id>/
    meta.json                  # test run metadata
    dataset-{idx}/
        metadata.json          # dataset name, path, runnable
        entry-{idx}/
            config.json        # evaluators, description, expectation
            eval-input.jsonl   # input data fed to evaluators
            eval-output.jsonl  # output data captured from app
            evaluations.jsonl  # evaluation results (scored + pending)
            trace.jsonl        # LLM call traces (if captured)
```
The `<test_id>` is printed in console output. You will reference this directory in Step 6.
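The per-entry files are plain JSONL, so the results can be skimmed with a few lines of Python. A minimal sketch, assuming `PIXIE_ROOT` is the `pixie_qa/` directory and using a placeholder test id; the field names inside `evaluations.jsonl` are not assumed here, so each record is printed raw.
```python
import json
from pathlib import Path

results = Path("pixie_qa") / "results" / "20250615-120000"    # substitute your PIXIE_ROOT and <test_id>

for eval_file in sorted(results.glob("dataset-*/entry-*/evaluations.jsonl")):
    print(eval_file.parent.relative_to(results))               # e.g. dataset-0/entry-3
    with eval_file.open() as fh:
        for line in fh:
            print("  ", json.loads(line))                      # one evaluation result per line
```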
---