## Step 5: Run `pixie test` and Fix Mechanical Issues
Why this step: Run `pixie test` and fix mechanical issues in your QA components (dataset format problems, runnable implementation bugs, and custom evaluator errors) until every entry produces real scores. This step is NOT about assessing result quality or fixing the application itself.
### 5a. Run tests
```bash
uv run pixie test
```
For verbose output with per-case scores and evaluator reasoning:
```bash
uv run pixie test -v
```
`pixie test` automatically loads the `.env` file before running tests.
The evaluation harness:
- Resolves the `Runnable` class from the dataset's `runnable` field
- Calls `Runnable.create()` to construct an instance, then `setup()` once
- Runs all dataset entries concurrently (up to 4 in parallel):
  - a. Reads `input_data` and `eval_input` from the entry
  - b. Populates the wrap input registry with `eval_input` data
  - c. Initialises the capture registry
  - d. Validates `input_data` into the Pydantic model and calls `Runnable.run(args)`
  - e. `wrap(purpose="input")` calls in the app return registry values instead of calling external services
  - f. `wrap(purpose="output"/"state")` calls capture data for evaluation
  - g. Builds `Evaluable` from captured data
  - h. Runs evaluators
- Calls `Runnable.teardown()` once
Because entries run concurrently, the Runnable's `run()` method must be concurrency-safe. If you see `sqlite3.OperationalError`, "database is locked", or similar errors, add a `Semaphore(1)` to your Runnable (see the concurrency section in the Step 2 reference).
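As a concrete illustration, here is a minimal sketch of a Runnable whose `run()` is guarded by a single-slot semaphore. It assumes the lifecycle methods are async and that the shared resource is a SQLite connection; the class name, the `TicketArgs` model, the table, and the SQL are hypothetical, not part of the pixie API.

```python
import asyncio
import sqlite3

from pydantic import BaseModel


class TicketArgs(BaseModel):
    # Hypothetical input model; the entry's input_data is validated into this.
    question: str


class TicketRunnable:
    @classmethod
    async def create(cls) -> "TicketRunnable":
        return cls()

    async def setup(self) -> None:
        # Called once before any entry runs.
        self._lock = asyncio.Semaphore(1)
        self._conn = sqlite3.connect("tickets.db")
        self._conn.execute("CREATE TABLE IF NOT EXISTS questions (text TEXT)")

    async def run(self, args: TicketArgs) -> None:
        # Entries run up to 4 at a time, so serialise anything that cannot
        # tolerate concurrent access, such as a shared SQLite connection.
        async with self._lock:
            self._conn.execute("INSERT INTO questions (text) VALUES (?)", (args.question,))
            self._conn.commit()

    async def teardown(self) -> None:
        # Called once after all entries finish.
        self._conn.close()
```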
### 5b. Fix mechanical issues only
This step is strictly about fixing what you built in previous steps — the dataset, the runnable, and any custom evaluators. You are fixing mechanical problems that prevent the pipeline from running, NOT assessing or improving the application's output quality.
What counts as a mechanical issue (fix these):
| Error | Cause | Fix |
|---|---|---|
| `WrapRegistryMissError: name='<key>'` | Dataset entry missing an `eval_input` item with the name that the app's `wrap(purpose="input", name="<key>")` expects | Add the missing `{"name": "<key>", "value": ...}` to `eval_input` in every affected entry (see the dataset sketch after this table) |
| `WrapTypeMismatchError` | Deserialized type doesn't match what the app expects | Fix the value in the dataset |
| Runnable resolution failure | `runnable` path or class name is wrong, or the class doesn't implement the `Runnable` protocol | Fix `filepath:ClassName` in the dataset; ensure the class has `create()` and `run()` methods |
| Import error | Module path or syntax error in runnable/evaluator | Fix the referenced file |
| `ModuleNotFoundError: pixie_qa` | `pixie_qa/` directory missing `__init__.py` | Run `pixie init` to recreate it |
| `TypeError: ... is not callable` | Evaluator name points to a non-callable attribute | Evaluators must be functions, classes, or callable instances |
| `sqlite3.OperationalError` | Concurrent `run()` calls sharing a SQLite connection | Add `asyncio.Semaphore(1)` to the Runnable (see Step 2 concurrency section) |
| Custom evaluator crashes | Bug in your custom evaluator implementation | Fix the evaluator code |
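For the `WrapRegistryMissError` row above, the fix lives entirely in the dataset. Below is a minimal sketch of one corrected entry, written as a Python dict that mirrors the on-disk JSON; the `ticket_history` key, the `input_data` contents, and the assumption that `eval_input` is a list of name/value items are illustrative, not a fixed schema.

```python
# One dataset entry after the fix. The added eval_input item's "name" must
# match the key the application's wrap(purpose="input", name="ticket_history")
# call asks for; its "value" is what that wrap call returns during tests.
entry = {
    "input_data": {"question": "Where is my order?"},
    "eval_input": [
        {"name": "ticket_history", "value": ["Order #123 shipped on Tuesday."]},
    ],
}
```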
What is NOT a mechanical issue (do NOT fix these here):
- Application produces wrong/low-quality output → that's the application's behavior, analyzed in Step 6
- Evaluator scores are low → that's a quality signal, analyzed in Step 6
- LLM calls fail inside the application → report in Step 6, do not mock or work around
- Evaluator scores fluctuate between runs → normal LLM non-determinism, not a bug
Fix an error, re-run, and fix the next; iterate until `pixie test` runs to completion with real evaluator scores for all entries.
### Output
After `pixie test` completes successfully, results are stored in the per-entry directory structure:
```
{PIXIE_ROOT}/results/<test_id>/
  meta.json                 # test run metadata
  dataset-{idx}/
    metadata.json           # dataset name, path, runnable
    entry-{idx}/
      config.json           # evaluators, description, expectation
      eval-input.jsonl      # input data fed to evaluators
      eval-output.jsonl     # output data captured from app
      evaluations.jsonl     # evaluation results (scored + pending)
      trace.jsonl           # LLM call traces (if captured)
```
The `<test_id>` is printed in the console output. You will reference this directory in Step 6.
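If you want to skim scores before moving on, a short sketch like the one below works; the results root, the test id, and the field names inside `evaluations.jsonl` (`evaluator`, `score`) are assumptions to adjust after inspecting one file by hand, not documented keys.

```python
import json
from pathlib import Path

# Substitute your PIXIE_ROOT and the <test_id> printed by `uv run pixie test`;
# both values here are placeholders.
results = Path("pixie_qa") / "results" / "<test_id>"

for eval_file in sorted(results.glob("dataset-*/entry-*/evaluations.jsonl")):
    for line in eval_file.read_text().splitlines():
        record = json.loads(line)
        # Field names are assumptions about the record layout.
        print(eval_file.parent.name, record.get("evaluator"), record.get("score"))
```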
If you hit an unexpected error when running tests (wrong parameter names, import failures, API mismatch), read `wrap-api.md`, `evaluators.md`, or `testing-api.md` for the authoritative API reference before guessing at a fix.