diff --git a/docs/README.skills.md b/docs/README.skills.md
index 1f24d0b3..d21fc6c0 100644
--- a/docs/README.skills.md
+++ b/docs/README.skills.md
@@ -134,7 +134,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
| [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None |
| [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
-| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-entry-point.md`
`references/1-b-eval-criteria.md`
`references/2-wrap-and-trace.md`
`references/3-define-evaluators.md`
`references/4-build-dataset.md`
`references/5-run-tests.md`
`references/6-investigate.md`
`references/evaluators.md`
`references/testing-api.md`
`references/wrap-api.md`
`resources` |
+| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Improve an AI application with evaluation-driven development. Define eval criteria, instrument the application, build golden datasets, observe and evaluate application runs, analyze results, and produce a concrete action plan for improvements. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-project-analysis.md`
`references/1-b-entry-point.md`
`references/1-c-eval-criteria.md`
`references/2a-instrumentation.md`
`references/2b-implement-runnable.md`
`references/2c-capture-and-verify-trace.md`
`references/3-define-evaluators.md`
`references/4-build-dataset.md`
`references/5-run-tests.md`
`references/6-analyze-outcomes.md`
`references/evaluators.md`
`references/runnable-examples`
`references/testing-api.md`
`references/wrap-api.md`
`resources` |
| [exam-ready](../skills/exam-ready/SKILL.md) | Activate this skill when a student provides study material (PDF or pasted notes) and a syllabus, and wants to prepare for an exam. Extracts key definitions, points, keywords, diagrams, exam-ready sentences, and practice questions strictly from the provided material. | None |
| [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`
`references/excalidraw-schema.md`
`scripts/.gitignore`
`scripts/README.md`
`scripts/add-arrow.py`
`scripts/add-icon-to-diagram.py`
`scripts/split-excalidraw-library.py`
`templates` |
| [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`
`references/pyspark.md` |
diff --git a/skills/eval-driven-dev/SKILL.md b/skills/eval-driven-dev/SKILL.md
index 71da823c..53b80589 100644
--- a/skills/eval-driven-dev/SKILL.md
+++ b/skills/eval-driven-dev/SKILL.md
@@ -1,26 +1,27 @@
---
name: eval-driven-dev
description: >
- Set up eval-based QA for Python LLM applications: instrument the app,
- build golden datasets, write and run eval tests, and iterate on failures.
+  Improve an AI application with evaluation-driven development. Define eval criteria, instrument the application, build golden datasets, observe and evaluate application runs, analyze results, and produce a concrete action plan for improvements.
ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals,
evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model.
license: MIT
-compatibility: Python 3.11+
+compatibility: Python 3.10+
metadata:
- version: 0.6.1
- pixie-qa-version: ">=0.6.1,<0.7.0"
+ version: 0.8.4
+ pixie-qa-version: ">=0.8.4,<0.9.0"
pixie-qa-source: https://github.com/yiouli/pixie-qa/
---
# Eval-Driven Development for Python LLM Applications
-You're building an **automated QA pipeline** that tests a Python application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via `pixie test`.
+You're building an **automated evaluation pipeline** that tests a Python-based AI application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via `pixie test`.
**What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM.
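+
+As a toy illustration of that difference, in plain Python rather than the pixie-qa evaluator API (the strings and the 0.6 threshold are invented for the sketch): an exact-match assertion fails on harmless rewording of a non-deterministic output, while a score-based check tolerates it.
+
+```python
+# Toy illustration only, not the pixie-qa evaluator API. Exact matching breaks on
+# harmless rewording of an LLM output; a scored check does not.
+from difflib import SequenceMatcher
+
+expected = "Your plan renews on March 3 and costs $20 per month."
+actual = "The plan renews on March 3rd and costs twenty dollars a month."
+
+# assert actual == expected  # brittle: fails even though the answer is acceptable
+similarity = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
+assert similarity > 0.6      # score-based: passes for acceptable rephrasings
+```
+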
During evaluation, the app's own code runs for real — routing, prompt assembly, LLM calls, response formatting — nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.
+**Rule: The app's LLM calls must go to a real LLM.** Do not replace, mock, stub, or intercept the LLM with a fake implementation. The LLM is the core value-generating component — replacing it makes the eval tautological (you control both inputs and outputs, so scores are meaningless). If the project's test suite contains LLM mocking patterns, those are for the project's own unit tests — do NOT adopt them for the eval Runnable.
+
**The deliverable is a working `pixie test` run with real scores** — not a plan, not just instrumentation, not just a dataset.
This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.
@@ -29,8 +30,16 @@ This skill is about doing the work, not describing it. Read code, edit files, ru
## Before you start
-**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, then run the setup.sh included in the skill's resources.
-The script updates the `eval-driven-dev` skill and `pixie-qa` python package to the latest version, initializes the pixie working directory if it's not already initialized, and starts a web server in the background to show user updates. If the skill or package update fails, continue — do not let these failures block the rest of the workflow.
+**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, run the setup.sh included in the skill's resources.
+The script updates the `eval-driven-dev` skill and the `pixie-qa` Python package to the latest version, initializes the pixie working directory if it's not already initialized, and starts a web server in the background to show the user progress updates.
+
+**Setup error handling — what you can skip vs. what must succeed:**
+
+- **Skill update fails** → OK to continue. The existing skill version is sufficient.
+- **pixie-qa upgrade fails but was already installed** → OK to continue with the existing version.
+- **pixie-qa is NOT installed and installation fails** → **STOP.** Ask the user for help. The workflow cannot proceed without the `pixie` package.
+- **`pixie init` fails** → **STOP.** Ask the user for help.
+- **`pixie start` (web server) fails** → **STOP.** Ask the user for help. Check `server.log` in the pixie root directory for diagnostics. Common causes: port conflict, missing dependency, slow environment. Do NOT proceed without the web server — the user needs it to see eval results.
---
@@ -45,6 +54,16 @@ Follow Steps 1–6 straight through without stopping. Do not ask the user for co
- **Create artifacts immediately.** After reading code for a sub-step, write the output file for that sub-step before moving on. Don't accumulate understanding across multiple sub-steps before writing anything.
- **Verify, then move on.** Each step has a checkpoint. Verify it, then proceed to the next step. Don't plan future steps while verifying the current one.
+**When to stop and ask for help:**
+
+Some blockers cannot and should not be worked around. When you encounter any of the following, **stop immediately and ask the user for help** — do not attempt workarounds:
+
+- **Application won't run due to missing environment variables or configuration**: The app requires environment variables or configuration that are not set and cannot be inferred. Do NOT work around this by mocking, faking, or replacing application components — the eval must exercise real production code. Ask the user to fix the environment setup.
+- **App import failures that indicate a broken project**: If the app's core modules cannot be imported due to missing system dependencies or incompatible Python versions (not just missing pip packages you can install), ask the user to fix the project setup.
+- **Ambiguous entry point**: If the app has multiple equally plausible entry points and the project analysis doesn't clarify which one matters most, ask the user which to target.
+
+Blockers you SHOULD resolve yourself (do not ask): missing Python packages (install them), missing `pixie` package (install it), port conflicts (pick a different port), file permission issues (fix them).
+
**Run Steps 1–6 in sequence.** If the user's prompt makes it clear that earlier steps are already done (e.g., "run the existing tests", "re-run evals"), skip to the appropriate step. When in doubt, start from Step 1.
---
@@ -59,33 +78,61 @@ Follow Steps 1–6 straight through without stopping. Do not ask the user for co
If the prompt specifies any of the above, they take priority. Read and incorporate them before proceeding.
-Step 1 has two sub-steps. Each reads its own reference file and produces its own output file. **Complete each sub-step fully before starting the next.**
+Step 1 has three sub-steps. Each reads its own reference file and produces its own output file. **Complete each sub-step fully before starting the next.**
-#### Sub-step 1a: Entry point & execution flow
+#### Sub-step 1a: Project analysis
-> **Reference**: Read `references/1-a-entry-point.md` now.
+> **Reference**: Read `references/1-a-project-analysis.md` now.
-Read the source code to understand how the app starts and how a real user invokes it. Write your findings to `pixie_qa/01-entry-point.md` before moving on.
+Before looking at code structure or entry points, understand what this software does in the real world — its purpose, its users, the complexity of real inputs, and where it fails. This understanding drives every downstream decision: which entry points matter most, what eval criteria to define, what trace inputs to use, and what dataset entries to create. Write the detailed context file before moving on. **Note**: the project may contain `tests/`, `fixtures/`, `examples/`, mock servers, and documentation — these are the project's own development infrastructure, NOT data sources for your eval pipeline. Ignore them when sourcing trace inputs and dataset content.
-> **Checkpoint**: `pixie_qa/01-entry-point.md` written with entry point, execution flow, user-facing interface, and env requirements.
+> **Checkpoint**: `pixie_qa/00-project-analysis.md` written — covering what the software does, target users, capability inventory (at least 3 capabilities if the project has them), realistic input characteristics, and hard problems / failure modes (at least 2).
-#### Sub-step 1b: Eval criteria
+#### Sub-step 1b: Entry point & execution flow
-> **Reference**: Read `references/1-b-eval-criteria.md` now.
+> **Reference**: Read `references/1-b-entry-point.md` now.
-Define the app's use cases and eval criteria. Use cases drive dataset creation (Step 4); eval criteria drive evaluator selection (Step 3). Write your findings to `pixie_qa/02-eval-criteria.md` before moving on.
+Read the source code to understand how the app starts and how a real user invokes it. Use the **capability inventory** from `pixie_qa/00-project-analysis.md` to prioritize entry points — focus on the entry point(s) that exercise the most valuable capabilities, not just the first one found. Write the detailed context file before moving on.
-> **Checkpoint**: `pixie_qa/02-eval-criteria.md` written with use cases, eval criteria, and their applicability scope. Do NOT read Step 2 instructions yet.
+> **Checkpoint**: `pixie_qa/01-entry-point.md` written — covering entry point, execution flow, user-facing interface, and env requirements.
+
+#### Sub-step 1c: Eval criteria
+
+> **Reference**: Read `references/1-c-eval-criteria.md` now.
+
+Define the app's use cases and eval criteria. Derive use cases from the **capability inventory** in `pixie_qa/00-project-analysis.md`. Derive eval criteria from the **hard problems / failure modes** — not generic quality dimensions. Use cases drive dataset creation (Step 4); eval criteria drive evaluator selection (Step 3). Write the detailed context file before moving on.
+
+> **Checkpoint**: `pixie_qa/02-eval-criteria.md` written — covering use cases, eval criteria, and their applicability scope. Do NOT read Step 2 instructions yet.
---
-### Step 2: Instrument with `wrap` and capture a reference trace
+### Step 2: Instrument, run application, and capture a reference trace
-> **Reference**: Read `references/2-wrap-and-trace.md` now for the detailed sub-steps.
+Step 2 has three sub-steps. Each reads its own reference file. **Complete each sub-step before starting the next.**
-**Goal**: Make the app testable by controlling its external data and capturing its outputs. `wrap()` calls at data boundaries let the test harness inject controlled inputs (replacing real DB/API calls) and capture outputs for scoring. The `Runnable` class provides the lifecycle interface that `pixie test` uses to set up, invoke, and tear down the app. A reference trace captured with `pixie trace` proves the instrumentation works and provides the exact data shapes needed for dataset creation in Step 4.
+#### Sub-step 2a: Instrument with `wrap`
-> **Checkpoint**: `pixie_qa/scripts/run_app.py` written and verified. `pixie_qa/reference-trace.jsonl` exists and all expected data points appear when formatted with `pixie format`. Do NOT read Step 3 instructions yet.
+> **Reference**: Read `references/2a-instrumentation.md` now.
+
+Add `wrap()` calls at the app's data boundaries so the eval harness can inject controlled inputs and capture outputs. This makes the app testable without changing its logic.
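+
+The sketch below shows the kind of data boundaries meant here, using a hypothetical billing-support app (the file layout, field names, and model are invented; the exact `wrap()` call syntax is documented in `references/2a-instrumentation.md` and `references/wrap-api.md`):
+
+```python
+# Hypothetical app code, not from any real project. Comments mark the values that
+# wrap() calls would capture; see references/wrap-api.md for the actual call syntax.
+import json
+from openai import OpenAI
+
+def load_account(account_id: str) -> dict:
+    # INPUT boundary: data read from an external store. This returned value is what
+    # a wrap with purpose="input" (e.g. name="account_record") captures, so each
+    # dataset entry can inject its own record instead of using real external data.
+    with open(f"accounts/{account_id}.json") as f:
+        return json.load(f)
+
+def answer(account_id: str, question: str) -> str:
+    record = load_account(account_id)
+    prompt = f"Plan: {record['plan']}\nQuestion: {question}"
+    # Real LLM call: never mocked or replaced (see the rule near the top of this file).
+    reply = OpenAI().chat.completions.create(
+        model="gpt-4o-mini",
+        messages=[{"role": "user", "content": prompt}],
+    ).choices[0].message.content
+    # OUTPUT boundary: the final response. A wrap with purpose="output" (e.g.
+    # name="response") exposes this value to the evaluators defined in Step 3.
+    return reply
+```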
+
+> **Checkpoint**: `wrap()` calls added at all data boundaries. Every eval criterion from `pixie_qa/02-eval-criteria.md` has a corresponding data point.
+
+#### Sub-step 2b: Implement the Runnable
+
+> **Reference**: Read `references/2b-implement-runnable.md` now.
+
+Write a Runnable class that lets the eval harness invoke the app exactly as a real user would. The Runnable should be simple — it just wires up the app's real entry point to the harness interface. If it's getting complicated, something is wrong.
+
+> **Checkpoint**: `pixie_qa/run_app.py` written. The Runnable calls the app's real entry point with real LLM configuration — no mocking, no faking, no component replacement.
+
+#### Sub-step 2c: Capture and verify a reference trace
+
+> **Reference**: Read `references/2c-capture-and-verify-trace.md` now.
+
+Run the app through the Runnable and capture a trace. The trace proves instrumentation and the Runnable are working correctly, and provides the data shapes needed for dataset creation in Step 4.
+
+> **Checkpoint**: `pixie_qa/reference-trace.jsonl` exists. All expected `wrap` entries and `llm_span` entries appear. `pixie format` shows all data points needed for evaluation. Do NOT read Step 3 instructions yet.
---
@@ -93,9 +140,9 @@ Define the app's use cases and eval criteria. Use cases drive dataset creation (
> **Reference**: Read `references/3-define-evaluators.md` now for the detailed sub-steps.
-**Goal**: Turn the qualitative eval criteria from Step 1b into concrete, runnable scoring functions. Each criterion maps to either a built-in evaluator or a custom one you implement. The evaluator mapping artifact bridges between criteria and the dataset, ensuring every quality dimension has a scorer.
+**Goal**: Turn the qualitative eval criteria from Step 1c into concrete, runnable scoring functions. Each criterion maps to either a built-in evaluator, an **agent evaluator** (the default for any semantic or qualitative criterion), or a manual custom function (only for mechanical/deterministic checks like regex or field existence). The evaluator mapping artifact bridges between criteria and the dataset, ensuring every quality dimension has a scorer. Select evaluators that measure the **hard problems** identified in `pixie_qa/00-project-analysis.md` — not just generic quality dimensions.
-> **Checkpoint**: All evaluators implemented. `pixie_qa/03-evaluator-mapping.md` written with criterion-to-evaluator mapping. Do NOT read Step 4 instructions yet.
+> **Checkpoint**: All evaluators implemented. `pixie_qa/03-evaluator-mapping.md` written with criterion-to-evaluator mapping and decision rationale. Do NOT read Step 4 instructions yet.
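+
+For the mechanical/deterministic case, a manual custom function can be as small as the sketch below. It is plain Python with an illustrative signature; how such a function gets registered as a pixie evaluator is covered in `references/3-define-evaluators.md` and `references/evaluators.md`. Semantic or qualitative criteria should go to agent evaluators instead.
+
+```python
+# Illustrative manual checks for mechanical criteria; the registration mechanism and
+# the real evaluator signature are in references/3-define-evaluators.md.
+import re
+
+def cites_account_id(response_text: str) -> float:
+    """Deterministic criterion: the reply must quote an account ID like AC-12345."""
+    return 1.0 if re.search(r"\bAC-\d{5}\b", response_text) else 0.0
+
+def has_required_fields(payload: dict) -> float:
+    """Field-existence criterion: structured output must contain the required keys."""
+    return 1.0 if {"summary", "sources"}.issubset(payload) else 0.0
+```
+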
---
@@ -103,32 +150,48 @@ Define the app's use cases and eval criteria. Use cases drive dataset creation (
> **Reference**: Read `references/4-build-dataset.md` now for the detailed sub-steps.
-**Goal**: Create the test scenarios that tie everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1b). Each dataset entry defines what to send to the app, what data the app should see from external services, and how to score the result. Use the reference trace from Step 2 as the source of truth for data shapes and field names.
+**Goal**: Create the test scenarios that tie everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1c). Each dataset entry defines what to send to the app, what data the app should see from external services, and how to score the result. Use the reference trace from Step 2 as the source of truth for data shapes and field names. Cover the capabilities from the **capability inventory** in `pixie_qa/00-project-analysis.md` and include entries targeting the **failure modes** identified there. **Do NOT use the project's own test fixtures, mock servers, or example data as dataset `eval_input` content** — source real-world data instead. **Every `wrap(purpose="input")` in the app must have pre-captured content in each entry's `eval_input`** — do NOT leave `eval_input` empty when the app has input wraps.
-> **Checkpoint**: Dataset JSON created at `pixie_qa/datasets/.json` with diverse entries covering all use cases. Do NOT read Step 5 instructions yet.
+> **Checkpoint**: Dataset JSON created at `pixie_qa/datasets/.json` with diverse entries covering all use cases. **Dataset realism audit passed** — entries use real-world data at representative scale, no project test fixtures contamination, at least one entry targets a failure mode with uncertain outcome, and every `eval_input` has captured content for all input wraps. Do NOT read Step 5 instructions yet.
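+
+For orientation only, a dataset entry conceptually bundles the pieces below. Every field name except `eval_input` is a placeholder invented for this sketch; the real schema and file format are defined in `references/4-build-dataset.md` and `references/testing-api.md`.
+
+```python
+# Conceptual sketch of one dataset entry (placeholder field names, not the real schema).
+entry = {
+    "name": "billing-question-mid-tier-plan",                    # placeholder
+    "input": {"question": "Why did my bill go up this month?"},  # what to send to the app
+    "eval_input": {
+        # Pre-captured content for every wrap(purpose="input") boundary in the app,
+        # at realistic scale and messiness rather than cleaned-up fixture data.
+        "account_record": {"plan": "Pro", "monthly_usd": 49, "overage_usd": 12.5},
+    },
+    "evaluators": ["grounded_in_account_record", "cites_account_id"],  # placeholder wiring
+}
+```
+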
---
-### Step 5: Run evaluation-based tests
+### Step 5: Run `pixie test` and fix mechanical issues
> **Reference**: Read `references/5-run-tests.md` now for the detailed sub-steps.
-**Goal**: Execute the full pipeline end-to-end and verify it produces real scores. This step is about getting the machinery running — fixing any setup or data issues until every dataset entry runs and gets scored. Once tests produce results, run `pixie analyze` for pattern analysis.
+**Goal**: Execute the full pipeline end-to-end and get it running without mechanical errors. This step is strictly about fixing setup and data issues in the pixie QA components (dataset, runnable, custom evaluators) — NOT about fixing the application itself or evaluating result quality. Once `pixie test` completes without errors and produces real evaluator scores for every entry, this step is done.
-> **Checkpoint**: Tests run and produce real scores. Analysis generated.
+> **Checkpoint**: `pixie test` runs to completion. Every dataset entry has evaluator scores (real `EvaluationResult` or `PendingEvaluation`). No setup errors, no import failures, no data validation errors.
>
-> If the test errors out, that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
->
-> **STOP GATE — read this before doing anything else after tests produce scores:**
->
-> - If the user's original prompt asks only for setup ("set up QA", "add tests", "add evals", "set up evaluations"), **STOP HERE**. Report the test results to the user: "QA setup is complete. Tests show N/M passing. [brief summary]. Want me to investigate the failures and iterate?" Do NOT proceed to Step 6.
-> - If the user's original prompt explicitly asks for iteration ("fix", "improve", "debug", "iterate", "investigate failures", "make tests pass"), proceed to Step 6.
+> If the test errors out, that's a mechanical bug in your QA components — fix and re-run. But once tests produce scores, move on. Do NOT assess result quality here — that's Step 6.
+
+**Always proceed to Step 6 after tests produce scores.** Analysis is the essential final step — without it, pending evaluations are never completed and the user gets uninterpreted raw scores with no actionable insights. Do NOT stop here and ask the user whether to continue.
+
+**Cycle rule for iterative runs**: Every successful `pixie test` invocation creates a concrete `pixie_qa/results/` directory and starts a new analysis cycle. Before you edit application code, prompts, datasets, evaluators, or rerun `pixie test`, complete Step 6 for that exact results directory. Do not skip earlier cycles and analyze only the last run.
---
-### Step 6: Investigate and iterate
+### Step 6: Analyze outcomes
-> **Reference**: Read `references/6-investigate.md` now — it has the stop/continue decision, analysis review, root-cause patterns, and investigation procedures. **Follow its instructions before doing any investigation work.**
+> **Reference**: Read `references/6-analyze-outcomes.md` now — it has the complete three-phase analysis process, writing guidelines, and output format requirements.
+
+**Goal**: Analyze `pixie test` results in a structured, data-driven process to produce actionable insights on test case quality, evaluator quality, and application quality. This step completes pending evaluations, writes per-entry and per-dataset analysis, and produces a prioritized action plan. Every statement must be backed by concrete data from the evaluation run — no speculation, no hand-waving.
+
+**Persisted analysis artifacts**: In this trimmed workflow, persist analysis only at the dataset level and test-run level. Those artifacts still use a **detailed version** (for agent consumption: data points, evidence trails, reasoning chains) plus a **summary version** (for human review: concise TLDR readable in under 2 minutes). Do not create per-entry analysis files.
+
+**Hard completion gate**: Step 6 is **not complete** until all of the following are true:
+
+- Every `"status": "pending"` entry in every `pixie_qa/results//dataset-*/entry-*/evaluations.jsonl` has been replaced with a scored result containing `score` and `reasoning`.
+- Every dataset directory has `analysis.md` and `analysis-summary.md`.
+- The test run root has `action-plan.md` and `action-plan-summary.md`.
+- You have run the Step 6 verifier script from this skill's `resources/` directory against `pixie_qa/results/`, and it reports success.
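+
+A minimal sketch of the kind of check behind the first requirement (it only scans for leftover pending entries; the verifier script in `resources/` remains the authoritative check and must still be run):
+
+```python
+# Scan all evaluations.jsonl files for entries still marked pending. This is a
+# convenience sketch, not a replacement for the Step 6 verifier in resources/.
+import json
+from pathlib import Path
+
+pending = []
+for path in Path("pixie_qa/results").rglob("evaluations.jsonl"):
+    for line in path.read_text().splitlines():
+        if not line.strip():
+            continue
+        record = json.loads(line)
+        if record.get("status") == "pending":
+            pending.append(str(path))
+
+assert not pending, f"{len(pending)} pending evaluation(s) remain - Step 6 is not complete"
+```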
+
+**Explicitly not sufficient**:
+
+- Writing a single top-level file such as `pixie_qa/06-analysis.md`
+- Saying pending evaluations are for the user to review in the web UI
+- Saying an entry "likely passes" without updating `evaluations.jsonl`
---
diff --git a/skills/eval-driven-dev/references/1-a-project-analysis.md b/skills/eval-driven-dev/references/1-a-project-analysis.md
new file mode 100644
index 00000000..92a7d68f
--- /dev/null
+++ b/skills/eval-driven-dev/references/1-a-project-analysis.md
@@ -0,0 +1,102 @@
+# Step 1a: Project Analysis
+
+Before looking at code structure, entry points, or writing any instrumentation, understand what this software does in the real world. This analysis is the foundation for every subsequent step — it determines which entry points to prioritize, what eval criteria to define, what trace inputs to use, and what dataset entries to build.
+
+---
+
+## What to investigate
+
+Read the project's README, documentation, and top-level source files. You're looking for answers to five questions:
+
+### 1. What does this software do?
+
+Write a one-paragraph plain-language summary. What problem does it solve? What does a successful run look like?
+
+### 2. Who uses it and why?
+
+Who are the target users? What's the primary use case? What problem does this solve that alternatives don't? This helps you understand what "quality" means for this app — a chatbot that chats with customers has different quality requirements than a research agent that synthesizes multi-source reports.
+
+### 3. Capability inventory
+
+List the distinct capabilities, modes, or features the app offers. Be specific. For example:
+
+- For a scraping library: single-page scraping, multi-page scraping, search-based scraping, speech output, script generation
+- For a voice agent: greeting, FAQ handling, account lookup, transfer to human, call summarization
+- For a research agent: topic research, multi-source synthesis, citation generation, report formatting
+
+Each capability may need its own entry point, its own trace, and its own dataset entries. This list directly feeds Step 1c (use cases) and Step 4 (dataset diversity).
+
+### 4. What are realistic inputs?
+
+Characterize the real-world inputs the app processes — not toy examples:
+
+- For a web scraper: "messy HTML pages with navigation, ads, dynamic content, tables, nested structures — typically 5KB-500KB of HTML"
+- For a research agent: "open-ended research questions requiring multi-source synthesis, with 3-10 sub-questions"
+- For a voice agent: "multi-turn conversations with background noise, interruptions, and ambiguous requests"
+
+Be specific about **scale** (how large), **complexity** (how messy/diverse), and **variety** (what kinds). This directly feeds trace input selection (Step 2) — if you don't characterize realistic inputs here, you'll end up using toy inputs that bypass the app's real logic.
+
+**This section is an operational constraint, not just documentation.** Steps 2c (trace input) and 4c (dataset entries) will cross-reference these characteristics to verify that trace inputs and dataset entries match real-world scale and complexity. Be concrete and quantitative — write "5KB–500KB HTML pages," not "various HTML pages."
+
+### 5. What are the hard problems / failure modes?
+
+What makes this app's job difficult? Where does it fail in practice? These become the most valuable eval scenarios:
+
+- For a scraper: "malformed HTML, dynamic JS-rendered content, complex nested schemas, very large pages that exceed context windows"
+- For a research agent: "conflicting sources, questions requiring multi-step reasoning, hallucinating citations"
+- For a voice agent: "ambiguous caller intent, account lookup failures, simultaneous tool calls"
+
+Each failure mode should map to at least one eval criterion (Step 1c) and at least one dataset entry (Step 4).
+
+---
+
+## Output: `pixie_qa/00-project-analysis.md`
+
+Write your findings to this file. **Complete all five sections before moving to sub-step 1b.** This document is referenced by every subsequent step.
+
+### Template
+
+```markdown
+# Project Analysis
+
+## What this software does
+
+
+
+## Target users and value proposition
+
+
+
+## Capability inventory
+
+1. <capability>: <one-line description>
+2. <capability>: <one-line description>
+3. ...
+
+## Realistic input characteristics
+
+
+
+## Hard problems and failure modes
+
+1. <failure mode>: <why it is hard or how it shows up>
+2. <failure mode>: <why it is hard or how it shows up>
+3. ...
+```
+
+### Quality check
+
+Before moving on, verify:
+
+- The "What this software does" section describes the app's purpose in terms a non-technical user would understand — not just "it runs a graph" or "it calls OpenAI"
+- The capability inventory lists at least 3 capabilities (if the project has them) — if you only found 1, you may have only looked at one part of the codebase
+- The realistic input characteristics describe real-world scale and complexity, not the simplest possible input
+- The failure modes are specific to this app's domain, not generic ("bad input" is not a failure mode; "malformed HTML with unclosed tags that breaks the parser" is)
+
+### What to ignore in the project
+
+The project may contain directories and files that are part of its own development/test infrastructure — `tests/`, `fixtures/`, `examples/`, `mock_server/`, `docs/`, demo scripts, etc. These exist for the project's developers, not for your eval pipeline.
+
+**Critical**: Do NOT use the project's test fixtures, mock servers, example data, or unit test infrastructure as inputs for your eval traces or dataset entries. They are designed for development speed and isolation — small, clean, deterministic data that bypasses every real-world difficulty. Using them produces trivially easy evaluations that cannot catch real quality issues.
+
+When you encounter these directories during analysis, note their existence but treat them as implementation details of the project — not as data sources for your QA pipeline. Your QA pipeline must test the app against real-world conditions, not against the project's own test shortcuts.
diff --git a/skills/eval-driven-dev/references/1-a-entry-point.md b/skills/eval-driven-dev/references/1-b-entry-point.md
similarity index 85%
rename from skills/eval-driven-dev/references/1-a-entry-point.md
rename to skills/eval-driven-dev/references/1-b-entry-point.md
index c5576333..c70adc75 100644
--- a/skills/eval-driven-dev/references/1-a-entry-point.md
+++ b/skills/eval-driven-dev/references/1-b-entry-point.md
@@ -1,6 +1,6 @@
-# Step 1a: Entry Point & Execution Flow
+# Step 1b: Entry Point & Execution Flow
-Identify how the application starts and how a real user invokes it.
+Identify how the application starts and how a real user invokes it. Use the **capability inventory** from `pixie_qa/00-project-analysis.md` to prioritize — focus on the entry point(s) that exercise the most valuable and frequently-used capabilities, not just the first one you find.
---
@@ -27,7 +27,7 @@ How does a real user or client invoke the app? This is what the eval must exerci
### 3. Environment and configuration
-- What env vars does the app require? (API keys, database URLs, feature flags)
+- What env vars does the app require? (service endpoints, database URLs, feature flags)
- What config files does it read?
- What has sensible defaults vs. what must be explicitly set?
diff --git a/skills/eval-driven-dev/references/1-b-eval-criteria.md b/skills/eval-driven-dev/references/1-b-eval-criteria.md
deleted file mode 100644
index 0550c568..00000000
--- a/skills/eval-driven-dev/references/1-b-eval-criteria.md
+++ /dev/null
@@ -1,82 +0,0 @@
-# Step 1b: Eval Criteria
-
-Define what quality dimensions matter for this app — based on the entry point (`01-entry-point.md`) you've already documented.
-
-This document serves two purposes:
-
-1. **Dataset creation (Step 4)**: The use cases tell you what kinds of items to generate — each use case should have representative items in the dataset.
-2. **Evaluator selection (Step 3)**: The eval criteria tell you what evaluators to choose and how to map them.
-
-Keep this concise — it's a planning artifact, not a comprehensive spec.
-
----
-
-## What to define
-
-### 1. Use cases
-
-List the distinct scenarios the app handles. Each use case becomes a category of dataset items. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.
-
-**Good use case descriptions:**
-
-- "Reroute to human agent on account lookup difficulties"
-- "Answer billing question using customer's plan details from CRM"
-- "Decline to answer questions outside the support domain"
-- "Summarize research findings including all queried sub-topics"
-
-**Bad use case descriptions (too vague):**
-
-- "Handle billing questions"
-- "Edge case"
-- "Error handling"
-
-### 2. Eval criteria
-
-Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 3.
-
-**Good criteria are specific to the app's purpose.** Examples:
-
-- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
-- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
-- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"
-
-**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.
-
-At this stage, don't pick evaluator classes or thresholds. That comes in Step 3.
-
-### 3. Check criteria applicability and observability
-
-For each criterion:
-
-1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 4 (dataset creation) because:
- - **Universal criteria** → become dataset-level default evaluators
- - **Case-specific criteria** → become item-level evaluators on relevant rows only
-
-2. **Verify observability** — for each criterion, identify what data point in the app needs to be captured as a `wrap()` call to evaluate it. This drives the wrap coverage in Step 2.
- - If the criterion is about the app's final response → captured by `wrap(purpose="output", name="response")`
- - If it's about a routing decision → captured by `wrap(purpose="state", name="routing_decision")`
- - If it's about data the app fetched and used → captured by `wrap(purpose="input", name="...")`
-
----
-
-## Output: `pixie_qa/02-eval-criteria.md`
-
-Write your findings to this file. **Keep it short** — the template below is the maximum length.
-
-### Template
-
-```markdown
-# Eval Criteria
-
-## Use cases
-
-1.