Step 4: Build the Dataset
Why this step: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1c) — into concrete test scenarios. At test time, `pixie test` calls the runnable with `input_data`, the wrap registry is populated with `eval_input`, and evaluators score the resulting captured outputs.
Before building entries, review:
- `pixie_qa/00-project-analysis.md` — the capability inventory and failure modes. Dataset entries should cover the capability inventory and include entries targeting the listed failure modes.
- `pixie_qa/02-eval-criteria.md` — use cases and their capability coverage. Ensure every listed use case has representative entries.
Understanding `input_data`, `eval_input`, and `expectation`
Before building the dataset, understand what these terms mean:
- `input_data` = the kwargs passed to `Runnable.run()` as a Pydantic model. These are the input data (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined for `run(args: T)`.
- `eval_input` = a list of `{"name": ..., "value": ...}` objects corresponding to `wrap(purpose="input")` calls in the app. At test time, these are injected automatically by the wrap registry; `wrap(purpose="input")` calls in the app return the registry value instead of calling the real external dependency. `eval_input` may be an empty list only when the app has no `wrap(purpose="input")` calls. If the app HAS input wraps, every dataset entry MUST provide corresponding `eval_input` values with pre-captured content — otherwise the app makes live external calls during eval, which is slow, flaky, and non-reproducible. See section 4b′ for how to capture this content. Each item is a `NamedData` object with `name` (str) and `value` (any JSON-serializable value).
- `expectation` (optional) = case-specific evaluation reference: what a correct output should look like for this scenario. Used by evaluators that compare output against a reference (e.g., `Factuality`, `ClosedQA`). Not needed for output-quality evaluators that don't require a reference.
- eval output = what the app actually produces, captured at runtime by `wrap(purpose="output")` and `wrap(purpose="state")` calls. Not stored in the dataset — it's produced when `pixie test` runs the app.
The reference trace at `pixie_qa/reference-trace.jsonl` is your primary source for data shapes:
- Filter it to see the exact serialized format for `eval_input` values
- Read the `kwargs` record to understand the `input_data` structure
- Read `purpose="output"`/`"state"` events to understand what outputs the app produces, so you can write meaningful `expectation` values
4a. Derive evaluator assignments
The eval criteria artifact (`pixie_qa/02-eval-criteria.md`) maps each criterion to use cases. The evaluator mapping artifact (`pixie_qa/03-evaluator-mapping.md`) maps each criterion to a concrete evaluator name. Combine these:
- Dataset-level default evaluators: Criteria marked as applying to "All" use cases → their evaluator names go in the top-level `"evaluators"` array.
- Item-level evaluators: Criteria that apply to only a subset → their evaluator names go in `"evaluators"` on the relevant rows only, using `"..."` to also include the defaults.
4b. Inspect data shapes with `pixie format`
Use `pixie format` on the reference trace to see the exact data shapes and the real app output in dataset-entry format:
```shell
uv run pixie format --input reference-trace.jsonl --output dataset-sample.json
```
The output looks like:
```json
{
  "input_data": {
    "user_message": "What are your business hours?"
  },
  "eval_input": [
    {
      "name": "customer_profile",
      "value": { "name": "Alice", "tier": "gold" }
    },
    {
      "name": "conversation_history",
      "value": [{ "role": "user", "content": "What are your hours?" }]
    }
  ],
  "expectation": null,
  "eval_output": {
    "response": "Our business hours are Monday to Friday, 9am to 5pm..."
  }
}
```
Important: The `eval_output` in this template is the full real output produced by the running app. Do NOT copy `eval_output` into your dataset entries — it would make tests trivially pass by giving evaluators the real answer. Instead:
- Use `input_data` and `eval_input` as exact templates for data keys and format
- Look at `eval_output` to understand what the app produces — then write a concise `expectation` description that captures the key quality criteria for each scenario

Example: if `eval_output.response` is "Our business hours are Monday to Friday, 9 AM to 5 PM, and Saturday 10 AM to 2 PM.", write `expectation` as "Should mention weekday hours (Mon–Fri 9am–5pm) and Saturday hours" — a short description a human or LLM evaluator can compare against.
4b′. Capture external content for eval_input (mandatory)
CRITICAL: If the app has ANY `wrap(purpose="input")` calls, every dataset entry MUST provide corresponding `eval_input` values with pre-captured real content. An empty `eval_input` list means the app will make live external calls (HTTP requests, database queries, API calls) during every eval run — this makes evals slow, flaky, and non-reproducible.
Why this matters
During `pixie test`, each `wrap(purpose="input", name="X")` call in the app checks the wrap registry for a value named `"X"`:
- If found: the registered value is returned directly (no external call)
- If not found: the real external call executes (non-deterministic, slow, may fail)

An `eval_input: []` entry means NOTHING is in the registry, so every external dependency runs live. This defeats the purpose of instrumentation.
How to capture content
For each `wrap(purpose="input", name="X")` in the app, you must capture the real data once and embed it in the dataset. Choose one of these approaches:
Option A — Use the reference trace (preferred):
The reference trace from Step 2c already contains captured values for every `purpose="input"` wrap. Extract them:
```shell
# View the reference trace to find input wrap values
grep '"purpose": "input"' pixie_qa/reference-trace.jsonl
```
Or use `pixie format` to see the data in dataset-entry format — the `eval_input` array in the output already has the captured values with correct names and shapes.
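If you prefer to script the extraction, a small helper like the following can pull the input-wrap values out of the trace and emit them in `eval_input` shape. The record field names (`purpose`, `name`, `data`) are assumptions based on the grep pattern above — verify them against your actual reference trace before relying on this.

```python
import json

def extract_eval_input(trace_path: str) -> list[dict]:
    """Collect purpose="input" wrap events from a JSONL trace as
    {"name": ..., "value": ...} items. Field names are assumed —
    check them against your reference trace."""
    items = []
    with open(trace_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("purpose") == "input":
                items.append({"name": record["name"], "value": record["data"]})
    return items
```

The result can be pasted directly into a dataset entry's `eval_input` array.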
Option B — Fetch content directly (for new entries with different inputs):
When creating dataset entries with different input sources (e.g., different URLs, different queries), capture the content by running the dependency code once:
```python
# Example: for a web scraper, run the app's own fetch logic once
from myapp.fetcher import fetch_page
page_content = fetch_page(target_url)  # use the app's real code path
```
Then include the captured content in the entry's `eval_input`:
```json
{
  "eval_input": [
    {
      "name": "fetch_result",
      "value": "<captured page content here>"
    }
  ]
}
```
Option C — Run pixie trace with each input (most thorough):
For each set of `input_data`, run `pixie trace` to execute the app with real dependencies and capture all values:
```shell
pixie trace --runnable pixie_qa/run_app.py:AppRunnable --input trace-input.json
```
Then extract the `purpose="input"` values from the resulting trace and use them as `eval_input`.
Content format
The `eval_input` value must match the exact type and format that the `wrap()` call returns. Check the reference trace to see what format the app produces:
- If the wrap captures a string (e.g., HTML content, markdown text), the value is a string
- If the wrap captures a dict (e.g., database record), the value is a JSON object
- If the wrap captures a list, the value is a JSON array
Do NOT skip this step. Every `wrap(purpose="input")` in the app must have a corresponding `eval_input` entry in every dataset row. If you proceed with empty `eval_input` when the app has input wraps, evals will be unreliable.
4c. Generate dataset items
Create diverse entries guided by the reference trace and use cases:
- `input_data` keys must match the fields of the Pydantic model used in `Runnable.run(args: T)`
- `eval_input` must be a list of `{"name": ..., "value": ...}` objects matching the `name` values of `wrap(purpose="input")` calls in the app
- Cover each use case from `pixie_qa/02-eval-criteria.md` — at least one entry per use case, with meaningfully diverse inputs across entries
If the user specified a dataset or data source in the prompt (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the `input_data` / `eval_input` shape, and incorporate them into the dataset. Do NOT ignore specified data.
Entry quality checklist
Before finalizing the dataset, verify each entry against these criteria:
Input realism:
- Does `eval_input` contain world data that respects the synthesization boundary (see Step 2c)? User-authored parameters are fine; world data should be sourced, not fabricated from scratch.
- Does the world data in `eval_input` match the scale and complexity described in `00-project-analysis.md` "Realistic input characteristics"? If the analysis says inputs are typically 5KB–500KB, a 200-char input is not realistic.
- Is the answer to the prompt non-trivial to extract from the input? A test where the answer is in a clearly labeled HTML tag or the first sentence doesn't test extraction quality.
Scenario diversity:
- Do entries cover meaningfully different difficulty levels — not just different topics with the same difficulty?
- Does at least one entry target a failure mode from `00-project-analysis.md` that you expect might actually cause degraded scores (not a guaranteed pass)?
- Do entries use different structural patterns in the input data (not just different content poured into the same template)?
Difficulty calibration:
- Is there at least one entry you are genuinely uncertain whether the app will handle correctly? If you're confident every entry will pass trivially, the dataset is too easy.
- Consider including one intentionally challenging entry that probes a known limitation — a "stress test" entry. If it passes, great. If it fails, the eval has demonstrated it can catch real issues.
Anti-patterns for dataset entries
- Fabricating world data: Hand-authoring content the app would normally fetch from external sources (e.g., writing HTML for a web scraper, writing "retrieved documents" for a RAG system). This removes real-world complexity.
- Uniform difficulty: All entries have the same complexity level. Real workloads have a distribution — some easy, some hard, some edge cases.
- Obvious answers: Every entry has the target information cleanly labeled and unambiguous. Real data often has the answer scattered, partially present, duplicated with variations, or embedded in noise.
- Round-trip authorship: You wrote both the input and the expected output, so you know exactly what's there. A real evaluator tests whether the app can find information it hasn't seen before.
- Only happy paths: No entry tests error conditions, edge cases, or known failure modes.
- Building all entries from the same toy trace with minor rephrasing: If all entries have similar `input_data` and similar `eval_input` data, the dataset tests nothing meaningful. Each entry should represent a meaningfully different scenario.
- Reusing the project's own test fixtures as eval data: The project's `tests/`, `fixtures/`, `examples/`, and `mock_server/` directories contain data designed for unit/integration tests — small, clean, deterministic, and trivially easy. Using them as `eval_input` data guarantees 100% pass rates and zero quality signal. Even if these fixtures look convenient, they bypass every real-world difficulty that makes the app's job hard. Run the production code to capture realistic data instead, or generate synthetic data that matches the scale/complexity from `00-project-analysis.md`.
- Using a project's mock/fake implementations: If the project includes mock LLMs, fake HTTP servers, or stub services in its test infrastructure, do NOT use them in your eval pipeline. Your eval must exercise the app's real code paths with realistically complex data — not the project's own test shortcuts.
4c′. Verify coverage against project analysis
Before writing the final dataset JSON, open `pixie_qa/00-project-analysis.md` and check:
- Realistic input characteristics: For each characteristic listed (size, complexity, noise, variety), confirm at least one dataset entry reflects it. If the analysis says "messy inputs with navigation and ads," at least one entry's `eval_input` should contain messy data with navigation and ads.
- Failure modes: For each failure mode listed, confirm at least one dataset entry is designed to exercise it. The entry doesn't need to guarantee failure — but it should create conditions where that failure mode could manifest. If a failure mode cannot be exercised with the current instrumentation setup, add a note in `02-eval-criteria.md` explaining why.
- Capability coverage: Confirm the dataset covers the capabilities listed in the eval criteria (Step 1c). Each covered capability should have at least one entry.
If any gap is found, add entries to close it before proceeding to 4d.
4c″. STOP CHECK — Dataset realism audit (hard gate)
This is a hard gate. Do NOT proceed to 4d until every check passes. If any check fails, revise the dataset and re-audit.
Before writing the final dataset JSON, perform this self-audit:
- Cross-reference `00-project-analysis.md`: Open the "Realistic input characteristics" section. For each characteristic (size, complexity, noise, structure), verify at least one dataset entry's `eval_input` reflects it. If the analysis says "5KB–500KB HTML pages with navigation chrome and ads" and your largest `eval_input` is 1KB of clean HTML, the dataset is not realistic — add harder entries.
- Count distinct sources: How many unique `eval_input` data sources are in the dataset? If more than 50% of entries share the same `eval_input` content (even with different prompts), the dataset lacks diversity. Prompt variations on the same input test the LLM's interpretation, not the app's data processing.
- Difficulty distribution (mandatory threshold): For each entry, label it as "routine" (confident it will pass), "moderate" (likely passes but non-trivial), or "challenging" (genuinely uncertain or targeting a known failure mode).
  - Maximum 60% "routine" entries. If you have 5 entries, at most 3 can be routine.
  - At least one "challenging" entry that targets a failure mode from `00-project-analysis.md` where you are genuinely uncertain about the outcome. If every entry is a guaranteed pass, the dataset cannot distinguish a good app from a broken one.
- Capability coverage (mandatory threshold): Count how many capabilities from `00-project-analysis.md` are exercised by at least one dataset entry.
  - Must cover ≥50% of listed capabilities. If the analysis lists 6 capabilities, the dataset must exercise at least 3.
  - If coverage is below threshold, add entries targeting the uncovered capabilities.
- Project fixture contamination check: Scan every `eval_input` value. Did any data originate from the project's `tests/`, `fixtures/`, `examples/`, or mock server directories? If yes, replace it with real-world data. These fixtures are designed for development convenience, not evaluation realism.
- Tautology check: Will the test pipeline produce meaningful scores, or is it a closed loop? If you authored both the input data and the evaluator logic such that passing is guaranteed by construction (e.g., regex extractor + exact-match evaluator on hand-authored HTML), the pipeline is tautological and cannot catch real issues. The app's real LLM should produce the output, and evaluators should assess quality dimensions that can genuinely fail.
- `eval_input` completeness check: For every `wrap(purpose="input", name="X")` call in the instrumented app code, verify that EVERY dataset entry provides a corresponding `eval_input` item with `"name": "X"` and a non-empty `"value"`. If any entry has `eval_input: []` while the app has input wraps, the dataset is incomplete — captured content is missing. Go back to step 4b′ and capture the content.
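The thresholded checks in this audit (difficulty mix, capability coverage, `eval_input` completeness) are mechanical enough to script. Below is a sketch of such a helper — it assumes you supply the per-entry difficulty labels, the capability counts, and the set of input-wrap names yourself; the subjective checks (realism, source diversity, tautology) still need human judgment.

```python
def audit_dataset(entries: list[dict], labels: list[str],
                  capabilities_covered: int, capabilities_total: int,
                  app_input_wrap_names: set[str]) -> list[str]:
    """Run the mechanical parts of the dataset audit. `labels` holds your
    own routine/moderate/challenging judgment per entry; the wrap names
    come from reading the instrumented app code."""
    problems = []
    # Difficulty distribution: at most 60% routine, at least one challenging.
    if labels.count("routine") > 0.6 * len(labels):
        problems.append("too many routine entries (>60%)")
    if "challenging" not in labels:
        problems.append("no challenging entry targeting a failure mode")
    # Capability coverage: at least half of the listed capabilities.
    if capabilities_covered < capabilities_total / 2:
        problems.append("capability coverage below 50%")
    # eval_input completeness: every input wrap needs a value in every entry.
    for i, entry in enumerate(entries):
        provided = {item["name"] for item in entry.get("eval_input", [])}
        missing = app_input_wrap_names - provided
        if missing:
            problems.append(f"entry {i} missing eval_input for {sorted(missing)}")
    return problems
```

An empty return list means the mechanical gates pass; any string in the list points at the check to fix before proceeding to 4d.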
4d. Build the dataset JSON file
Create the dataset at `pixie_qa/datasets/<name>.json`:
```json
{
  "name": "qa-golden-set",
  "runnable": "pixie_qa/run_app.py:AppRunnable",
  "evaluators": ["Factuality", "pixie_qa/evaluators.py:ConciseVoiceStyle"],
  "entries": [
    {
      "input_data": {
        "user_message": "What are your business hours?"
      },
      "description": "Customer asks about business hours with gold tier account",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Alice Johnson", "tier": "gold" }
        }
      ],
      "expectation": "Should mention Mon-Fri 9am-5pm and Sat 10am-2pm"
    },
    {
      "input_data": {
        "user_message": "I want to change something"
      },
      "description": "Ambiguous change request from basic tier customer",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Bob Smith", "tier": "basic" }
        }
      ],
      "expectation": "Should ask for clarification",
      "evaluators": ["...", "ClosedQA"]
    },
    {
      "input_data": {
        "user_message": "I want to end this call"
      },
      "description": "User requests call end after failed verification",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Charlie Brown", "tier": "basic" }
        }
      ],
      "expectation": "Agent should call endCall tool and end the conversation",
      "eval_metadata": {
        "expected_tool": "endCall",
        "expected_call_ended": true
      },
      "evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
    }
  ]
}
```
Key fields
Entry structure — all fields are top-level on each entry (flat structure — no nesting):
```
entry:
├── input_data (required) — args for Runnable.run()
├── eval_input (optional) — list of {"name": ..., "value": ...} objects (default: [])
├── description (required) — human-readable label for the test case
├── expectation (optional) — reference for comparison-based evaluators
├── eval_metadata (optional) — extra per-entry data for custom evaluators
└── evaluators (optional) — evaluator names for THIS entry
```
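The flat entry layout above can be enforced with a small validator before running `pixie test`. This is a sketch of the documented shape only — pixie presumably performs its own validation, and the exact rules may differ.

```python
def validate_entry(entry: dict) -> list[str]:
    """Check one dataset entry against the flat field layout documented
    above. A sketch of the documented shape, not pixie's own validator."""
    errors = []
    for field in ("input_data", "description"):
        if field not in entry:
            errors.append(f"missing required field: {field}")
    if not isinstance(entry.get("input_data", {}), dict):
        errors.append("input_data must be an object")
    for item in entry.get("eval_input", []):
        if not {"name", "value"} <= set(item):
            errors.append(f"eval_input item missing name/value: {item}")
    allowed = {"input_data", "eval_input", "description",
               "expectation", "eval_metadata", "evaluators"}
    for key in entry:
        if key not in allowed:
            errors.append(f"unexpected field (accidental nesting?): {key}")
    return errors
```

The "unexpected field" check catches the common mistake of nesting entry fields under a wrapper key instead of keeping them top-level.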
Top-level fields:
- `runnable` (required): `filepath:ClassName` reference to the `Runnable` class from Step 2 (e.g., `"pixie_qa/run_app.py:AppRunnable"`). Path is relative to the project root.
- `evaluators` (dataset-level, optional): Default evaluator names applied to every entry — the evaluators for criteria that apply to ALL use cases.
Per-entry fields (all top-level on each entry):
- `input_data` (required): Keys match the Pydantic model fields for `Runnable.run(args: T)`. These are the app's input data.
- `eval_input` (optional, default `[]`): List of `{"name": ..., "value": ...}` objects. Names match `wrap(purpose="input")` names in the app. The runner automatically prepends `input_data` when building the `Evaluable`.
- `description` (required): Use case one-liner from `pixie_qa/02-eval-criteria.md`.
- `expectation` (optional): Case-specific expectation text for evaluators that need a reference.
- `eval_metadata` (optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessible in evaluators as `evaluable.eval_metadata`.
- `evaluators` (optional): Row-level evaluator override.
Evaluator assignment rules
- Evaluators that apply to ALL items go in the top-level `"evaluators"` array.
- Items that need additional evaluators use `"evaluators": ["...", "ExtraEval"]` — `"..."` expands to the defaults.
- Items that need a completely different set use `"evaluators": ["OnlyThis"]` without `"..."`.
- Items using only defaults: omit the `"evaluators"` field.
Dataset Creation Reference
Using eval_input values
The `eval_input` values are `{"name": ..., "value": ...}` objects. Use the reference trace as templates — copy the `"data"` field from the relevant `purpose="input"` event and adapt the values:
Simple dict:
{ "name": "customer_profile", "value": { "name": "Alice", "tier": "gold" } }
List of dicts (e.g., conversation history):
```json
{
  "name": "conversation_history",
  "value": [
    { "role": "user", "content": "Hello" },
    { "role": "assistant", "content": "Hi there!" }
  ]
}
```
Important: The exact format depends on what the `wrap(purpose="input")` call captures. Always copy from the reference trace rather than constructing from scratch.
Crafting diverse eval scenarios
Cover different aspects of each use case. Refer to `pixie_qa/00-project-analysis.md` for the capability inventory and failure modes:
- Cover each capability — at least one entry per capability from the capability inventory, not just the primary capability
- Target failure modes — include entries that exercise the hard problems / failure modes listed in the project analysis (e.g., malformed input, edge cases, complex scenarios)
- Different user phrasings of the same request
- Edge cases (ambiguous input, missing information, error conditions)
- Entries that stress-test specific eval criteria
- At least one entry per use case from Step 1c
Output
`pixie_qa/datasets/<name>.json` — the dataset file.