Update eval-driven-dev skill (#1092)

* Update eval-driven-dev skill.

* Update skills/eval-driven-dev/SKILL.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update skills/eval-driven-dev/SKILL.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update skills/eval-driven-dev/resources/check_version.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update skills/eval-driven-dev/resources/check_version.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update skills/eval-driven-dev/resources/version.json

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove skill version check, only upgrade python package.

* run npm start to update readme

* Rename "eval-driven-dev" to "qa-eval" to better reflect the functionality of the skill.

* run npm start

* revert the skill name change, run npm start.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Authored by Yiou Li on 2026-03-22 16:38:54 -07:00, committed by GitHub
parent 919fdb3f8e
commit 371d0dd072
2 changed files with 269 additions and 24 deletions


@@ -115,7 +115,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [editorconfig](../skills/editorconfig/SKILL.md) | Generates a comprehensive and best-practice-oriented .editorconfig file based on project analysis and user preferences. | None |
| [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
| [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Add instrumentation, build golden datasets, write eval-based tests, run them, root-cause failures, and iterate to ensure your Python LLM application works correctly. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM. Use for making sure an LLM application works correctly, catching regressions after prompt changes, fixing unexpected behavior, or validating output quality before shipping. | `references/pixie-api.md` |
| [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`<br />`references/excalidraw-schema.md`<br />`scripts/.gitignore`<br />`scripts/README.md`<br />`scripts/add-arrow.py`<br />`scripts/add-icon-to-diagram.py`<br />`scripts/split-excalidraw-library.py`<br />`templates` |
| [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`<br />`references/pyspark.md` |
| [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None |


@@ -1,12 +1,26 @@
---
name: eval-driven-dev
description: Add instrumentation, build golden datasets, write eval-based tests, run them, root-cause failures, and iterate to ensure your Python LLM application works correctly. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM. Use for making sure an LLM application works correctly, catching regressions after prompt changes, fixing unexpected behavior, or validating output quality before shipping.
license: MIT
compatibility: Python 3.11+
metadata:
  version: 0.1.11
---
# Evaluation-Driven Development for Python LLM Applications
This skill is about doing the work, not describing it. When a user asks you to set up evals for their app, you should be reading their code, editing their files, running commands, and producing a working test pipeline — not writing a plan for them to follow later.
## Startup checks (always first)
Attempt to upgrade the `pixie-qa` package in the user's environment. Detect the package manager from the project (check for `uv.lock`, `poetry.lock`, `requirements.txt`, or a plain `pip` environment) and run the appropriate upgrade command:
- **uv**: `uv add pixie-qa --upgrade` (or `uv sync --upgrade-package pixie-qa`)
- **poetry**: `poetry add pixie-qa@latest`
- **pip**: `pip install --upgrade pixie-qa`
If the upgrade fails (e.g., no network, version conflict), log the error and continue — a failed upgrade must not block the rest of the skill.
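A minimal sketch of this startup check, in Python for illustration (lock-file detection and the exact commands mirror the list above; adjust for the project's actual layout):
```python
# Hypothetical helper: pick an upgrade command based on which lock file exists,
# run it, and swallow failures so they never block the rest of the skill.
import subprocess
from pathlib import Path

def upgrade_pixie_qa(project_root: str = ".") -> None:
    root = Path(project_root)
    if (root / "uv.lock").exists():
        cmd = ["uv", "add", "pixie-qa", "--upgrade"]
    elif (root / "poetry.lock").exists():
        cmd = ["poetry", "add", "pixie-qa@latest"]
    else:  # requirements.txt or a plain pip environment
        cmd = ["pip", "install", "--upgrade", "pixie-qa"]
    try:
        subprocess.run(cmd, cwd=root, check=True, capture_output=True, text=True)
        print("pixie-qa upgraded via:", " ".join(cmd))
    except (subprocess.CalledProcessError, FileNotFoundError) as exc:
        print("pixie-qa upgrade failed, continuing anyway:", exc)

if __name__ == "__main__":
    upgrade_pixie_qa()
```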
**All pixie-generated files live in a single `pixie_qa` directory** at the project root:
```
@@ -15,7 +29,7 @@ pixie_qa/
observations.db   # SQLite trace DB (auto-created by enable_storage)
datasets/         # golden datasets (JSON files)
tests/            # eval test files (test_*.py)
scripts/          # helper scripts (run_harness.py, build_dataset.py, etc.)
```
---
@@ -26,21 +40,21 @@ pixie_qa/
### "Setup QA" / "set up evals" / "add tests" (setup intent) ### "Setup QA" / "set up evals" / "add tests" (setup intent)
The user wants a **working eval pipeline**. Your job is Stages 06: install, understand, instrument, write tests, build dataset, run tests. **Stop after the first test run**, regardless of whether tests pass or fail. Report: The user wants a **working eval pipeline**. Your job is Stages 07: install, understand, instrument, build a run harness, capture real traces, write tests, build dataset, run tests. **Stop after the first test run**, regardless of whether tests pass or fail. Report:
1. What you set up (instrumentation, test file, dataset) 1. What you set up (instrumentation, run harness, test file, dataset)
2. The test results (pass/fail, scores) 2. The test results (pass/fail, scores)
3. If tests failed: a **brief summary** of what failed and likely causes — but do NOT fix anything 3. If tests failed: a **brief summary** of what failed and likely causes — but do NOT fix anything
Then ask: _"QA setup is complete. Tests show N/M passing. Want me to investigate the failures and start iterating?"_ Then ask: _"QA setup is complete. Tests show N/M passing. Want me to investigate the failures and start iterating?"_
Only proceed to Stage 7 (investigation and fixes) if the user confirms. Only proceed to Stage 8 (investigation and fixes) if the user confirms.
**Exception**: If the test run itself errors out (import failures, missing API keys, configuration bugs) — those are **setup problems**, not eval failures. Fix them and re-run until you get a clean test execution where pass/fail reflects actual app quality, not broken plumbing. **Exception**: If the test run itself errors out (import failures, missing API keys, configuration bugs) — those are **setup problems**, not eval failures. Fix them and re-run until you get a clean test execution where pass/fail reflects actual app quality, not broken plumbing.
### "Fix" / "improve" / "debug" / "why is X failing" (iteration intent) ### "Fix" / "improve" / "debug" / "why is X failing" (iteration intent)
The user wants you to investigate and fix. Proceed through all stages including Stage 7 — investigate failures, root-cause them, apply fixes, rebuild dataset, re-run tests, iterate. The user wants you to investigate and fix. Proceed through all stages including Stage 8 — investigate failures, root-cause them, apply fixes, rebuild dataset, re-run tests, iterate.
### Ambiguous requests ### Ambiguous requests
@@ -48,6 +62,34 @@ If the intent is unclear, default to **setup only** and ask before iterating. It
---
## Hard gates: when to STOP and get the user involved
Some blockers cannot be worked around. When you hit one, **stop working and tell the user what you need** — do not guess, fabricate data, or skip ahead to later stages.
### Missing API keys or credentials
If the app or evaluators need an API key (e.g. `OPENAI_API_KEY`) and it's not set in the environment or `.env`, tell the user exactly which key is missing and wait for them to provide it. Do not:
- Proceed with running the app or evals (they will fail)
- Hardcode a placeholder key
- Skip to later stages hoping it won't matter
### Cannot run the app from a script
If after reading the code (Stage 1) you cannot figure out how to invoke the app's core LLM-calling function from a standalone script — because it requires a running server, a webhook trigger, complex authentication, or external infrastructure you can't mock — **stop and ask the user**:
> "I've identified `<function_name>` in `<file>` as the core function to evaluate, but it requires `<dependency>` which I can't easily mock. Can you either (a) show me how to call this function standalone, or (b) run the app yourself with a few representative inputs so I can capture traces?"
### App errors during run harness execution
If the run harness script (Stage 4) errors out and you can't fix it after two attempts, stop and share the error with the user. Common blockers include database connections, missing configuration files, authentication/OAuth flows, and hardware-specific dependencies.
### Why stopping matters
Every subsequent stage depends on having real traces from the actual app. If you can't run the app, you can't capture traces. If you can't capture traces, you can't build a real dataset. If you fabricate a dataset, the entire eval pipeline is testing a fiction, not the user's app. It's better to stop early and get the user's help than to produce an eval pipeline that tests the wrong thing.
---
## The eval boundary: what to evaluate
**Eval-driven development focuses on LLM-dependent behaviour.** The purpose is to catch quality regressions in the parts of the system that are non-deterministic and hard to test with traditional unit tests — namely, LLM calls and the decisions they drive.
@@ -97,7 +139,7 @@ The application under test almost certainly needs an LLM provider API key (e.g.
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing" [ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
``` ```
If not set, ask the user. Do not proceed with running the app or evals without it — you'll get silent failures or import-time errors. If the key is not set: check whether the project uses a `.env` file. If it does, note that `python-dotenv` only loads `.env` when the app explicitly calls `load_dotenv()` — shell commands and the `pixie` CLI will not see variables from `.env` unless they're exported. Tell the user which key is missing and how to set it. **Do not proceed** with running the app or evals without a confirmed API key — you'll get failures that waste time and look like app bugs.
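A small sketch of that check, assuming `python-dotenv` is available (the key name `OPENAI_API_KEY` is just the common case; substitute whatever the app actually reads):
```python
# Load .env into this process only; the shell and the pixie CLI are unaffected.
import os
from dotenv import load_dotenv

load_dotenv()  # no-op if there is no .env file

if os.getenv("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is set")
else:
    # Stop here and ask the user; do not run the app or evals without it.
    raise SystemExit("OPENAI_API_KEY is missing. Ask the user to provide it.")
```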
---
@@ -110,6 +152,7 @@ Before touching any code, spend time actually reading the source. The code will
1. **How the software runs**: What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
2. **All inputs to the LLM**: This is not limited to the user's message. Trace every piece of data that gets incorporated into any LLM prompt:
   - User input (queries, messages, uploaded files)
   - System prompts (hardcoded or templated)
   - Retrieved context (RAG chunks, search results, database records)
@@ -118,6 +161,7 @@ Before touching any code, spend time actually reading the source. The code will
   - Configuration or feature flags that change prompt behavior
3. **All intermediate steps and outputs**: Walk through the code path from input to final output and document each stage:
   - Retrieval / search results
   - Tool calls and their results
   - Agent routing / handoff decisions
@@ -128,6 +172,47 @@ Before touching any code, spend time actually reading the source. The code will
5. **Use cases and expected behaviors**: What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?
### Identify the eval-boundary function
This is the single most important decision you'll make, and getting it right determines whether the eval pipeline tests the real app or a fiction.
The **eval-boundary function** is the function in the actual production code that:
1. Takes structured input (text, dict, message list) — not raw HTTP requests, audio streams, or webhook payloads
2. Calls the LLM (directly or through a chain of internal calls)
3. Returns the LLM's response (or a processed version of it)
Everything **upstream** of this function (webhook handlers, voice-to-text processing, request parsing, authentication, session management) will be mocked or bypassed when building the run harness. Everything **at and below** this function is the real code you're evaluating.
**Example**: In a Twilio voice AI app:
- Twilio sends a webhook with audio → **upstream, mock this**
- Audio processing converts speech to text → **upstream, mock this**
- Call state is loaded from Redis → **upstream, mock or simplify this**
- `agent.respond(user_text, conversation_history)` calls the LLM → **eval-boundary function**
- Response text is converted to speech → **downstream, not part of eval**
**Example**: In a FastAPI RAG chatbot:
- HTTP endpoint receives POST request → **upstream, bypass this**
- Request validation and auth → **upstream, bypass this**
- `chatbot.answer(question, context)` retrieves docs and calls LLM → **eval-boundary function**
- Response is formatted as JSON → **downstream, not part of eval**
**Example**: In a simple CLI Q&A tool:
- `main()` reads user input from stdin → **upstream, bypass this**
- `answer_question(question)` calls the LLM → **eval-boundary function**
When identifying the eval-boundary function, record:
- The exact function name and file location
- Its signature (parameter names and types)
- What upstream dependencies it needs (clients, config objects, state)
- Which of those dependencies require real credentials vs. can be mocked
If you cannot identify a clear eval-boundary function — if the LLM call is deeply entangled with infrastructure code that can't be separated — **stop and ask the user**. See "Hard gates" above.
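To make the boundary concrete, here is a hypothetical sketch of the CLI Q&A example above (the module, function, and model names are assumptions, not real project code; the Responses API call stands in for whatever the app actually uses):
```python
# qa_tool.py (hypothetical app code)
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def answer_question(question: str) -> str:
    """Eval-boundary function: structured input in, LLM-derived output back.
    The run harness and eval tests import and call this directly."""
    response = client.responses.create(
        model="gpt-4.1",
        input=[{"role": "user", "content": question}],
    )
    return response.output_text

def main() -> None:
    """Upstream stdin loop: bypassed by the harness and tests."""
    print(answer_question(input("Q: ")))

if __name__ == "__main__":
    main()
```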
### Write MEMORY.md
Write your findings down in `pixie_qa/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.
@@ -168,6 +253,14 @@ The understanding section should include:
<List each distinct scenario the app handles, with examples of good/bad outputs>
### Eval-boundary function
- **Function**: `<class.method or function_name>`
- **Location**: `<file:line>`
- **Signature**: `<parameters and return type>`
- **Upstream dependencies to mock**: <list what needs mocking for standalone execution>
- **Why this boundary**: <explain why this is the right function to evaluate>
## Evaluation plan
### What to evaluate and why
@@ -273,9 +366,23 @@ async def run_for_eval(user_messages: list[str]) -> str:
    ...
```
```python
# ❌ WRONG — calling the LLM directly instead of calling the app's function
@observe(name="agent_answer_question")
def answer_question(question: str) -> str:
# This bypasses the entire app and calls OpenAI directly.
# You're testing a script you just wrote, not the user's app.
response = client.responses.create(
model="gpt-4.1",
input=[{"role": "user", "content": question}],
)
return response.output_text
```
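For contrast, a minimal sketch of the additive pattern (the `observe` import path is an assumption based on the other `pixie` helpers shown in this skill; the function and span names are hypothetical):
```python
# ✅ RIGHT — decorate the app's existing function; change nothing else about it
from pixie import observe  # assumed import location

@observe(name="agent_answer_question")
def answer_question(question: str) -> str:
    # The app's original body stays exactly as it was: prompt assembly,
    # retrieval, the real LLM call, post-processing.
    ...
```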
**Rules:**
- **Never add new wrapper functions** to the application code for eval purposes.
- **Never bypass the app by calling the LLM provider directly** — if you find yourself writing `client.responses.create(...)` or `openai.ChatCompletion.create(...)` in a test or run harness, you're not testing the app. Import and call the app's own function instead.
- **Never change the function's interface** (arguments, return type, behavior).
- **Never duplicate production logic** into a separate "testable" function.
- The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
@@ -286,7 +393,129 @@ async def run_for_eval(user_messages: list[str]) -> str:
---
## Stage 4: Create a Run Harness and Verify Traces
**This stage is a hard gate.** You cannot proceed to writing tests or building datasets until you have successfully run the app's real code through the run harness and confirmed that traces appear in the database.
The run harness is a short script that calls the eval-boundary function you identified in Stage 1, bypassing external infrastructure that isn't relevant to LLM evaluation.
### When the app is simple
If the eval-boundary function is a straightforward call with no complex dependencies (e.g., `answer_question(question: str) -> str`), the harness can be minimal:
```python
# pixie_qa/scripts/run_harness.py
from pixie import enable_storage, flush
from myapp import answer_question
enable_storage()
result = answer_question("What is the capital of France?")
print(f"Result: {result}")
flush()
```
Run it, verify traces appear, and move on.
### When the app has complex dependencies
Most real-world apps need more setup. The eval-boundary function often requires configuration objects, database connections, API clients, or state objects to run. Your job is to mock or stub the **minimum** necessary to call the real production function.
```python
# pixie_qa/scripts/run_harness.py
"""Exercises the actual app code through the eval-boundary function.
Mocks upstream infrastructure (webhooks, voice processing, call state, etc.)
and calls the real production function with representative text inputs.
"""
from pixie import enable_storage, flush
# Load .env if the project uses one for API keys
from dotenv import load_dotenv
load_dotenv()
# Import the ACTUAL production function — not a copy, not a re-implementation
from myapp.agents.llm.openai import OpenAILLM
def run_one_case(question: str) -> str:
"""Call the actual production function with minimal mocked dependencies."""
enable_storage()
# Construct the minimum context the function needs.
# Use real API client (needs real key), mock everything else.
llm = OpenAILLM(...)
# Call the ACTUAL function — the same one production uses
result = llm.run_normal_ai_response(
prompt=question,
messages=[{"role": "user", "content": question}],
)
flush()
return result
if __name__ == "__main__":
test_inputs = [
"What are your business hours?",
"I need to update my account information.",
]
for q in test_inputs:
print(f"Q: {q}")
print(f"A: {run_one_case(q)}")
print("---")
```
**Critical rules for the run harness:**
- **Call the real function.** The same function production uses. If you find yourself writing `client.responses.create(...)` or `openai.ChatCompletion.create(...)` in the harness instead of calling the app's own function, you are bypassing the app and testing something else entirely.
- **Mock only upstream infrastructure.** Database connections, webhook payloads, session state, audio processing — these can be mocked or stubbed. The LLM call itself must be real because that's what you're evaluating.
- **The LLM API key must be real.** If it's missing, stop and ask the user. See "Hard gates."
- **Keep it minimal.** This is not a full integration test. It's a way to exercise the real LLM-calling code path and capture traces.
- **If you can't create a working harness after two attempts**, stop and ask the user for help.
### Verify traces are captured
After running the harness, verify that traces were actually captured:
```bash
python pixie_qa/scripts/run_harness.py
```
Then check the database:
```python
import asyncio
from pixie import ObservationStore
async def check():
store = ObservationStore()
traces = await store.list_traces(limit=5)
print(f"Found {len(traces)} traces")
for t in traces:
print(t)
asyncio.run(check())
```
**What to check:**
- At least one trace appears in the database
- The trace contains a span for the eval-boundary function (the span name should match the `@observe(name=...)` you added in Stage 3)
- The span has captured `eval_input` and `eval_output` with sensible values
**If no traces appear:**
- Is `enable_storage()` being called before the instrumented function runs?
- Is `flush()` being called after the function returns?
- Is the `@observe` decorator on the correct function?
- Is the function actually being executed (not just defined/imported)?
**Do not proceed to Stage 5 until you have seen real traces from the actual app in the database.** If traces don't appear, debug the issue now or ask the user for help. This is a setup problem and must be resolved before anything else.
---
## Stage 5: Write the Eval Test File
Write the test file before building the dataset. This might seem backwards, but it forces you to decide what you're actually measuring before you start collecting data — otherwise the data collection has no direction.
@@ -320,7 +549,7 @@ async def test_factuality():
Note that `enable_storage()` belongs inside the `runnable`, not at module level in the test file — it needs to fire on each invocation so the trace is captured for that specific run.
The `runnable` imports and calls **the same function that production uses**: the eval-boundary function you identified in Stage 1 and verified in Stage 4. If the `runnable` calls a different function than what the run harness calls, something is wrong.
The test runner is `pixie test` (not `pytest`):
@@ -335,9 +564,11 @@ pixie test -v # verbose: shows per-case scores and reason
---
## Stage 6: Build the Dataset
**Prerequisite**: You must have successfully run the app and verified traces in Stage 4. If you skipped Stage 4 or it failed, go back — do not proceed.
Create the dataset, then populate it by **actually running the app** with representative inputs. Dataset items must contain real app outputs captured from actual execution.
```bash
pixie dataset create <dataset-name>
@@ -346,9 +577,10 @@ pixie dataset list # verify it exists
### Run the app and capture traces to the dataset
The easiest approach is to extend the run harness from Stage 4 into a dataset builder. Since you already have a working script that calls the real app code and produces traces, adapt it to save results:
```python
# pixie_qa/scripts/build_dataset.py
import asyncio
from pixie import enable_storage, flush, DatasetStore, Evaluable
@@ -380,6 +612,8 @@ async def build_dataset():
asyncio.run(build_dataset())
```
Note that `eval_output=result` is the **actual return value from running the app** — not a string you typed in.
Alternatively, use the CLI for per-case capture:
```bash
@@ -399,16 +633,22 @@ pixie dataset save <dataset-name> --notes "basic geography question"
echo '"Paris"' | pixie dataset save <dataset-name> --expected-output echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
``` ```
**Key rules for dataset building:** ### The cardinal sin of dataset building
- **Always run the app** — never fabricate `eval_output` manually. The whole point is capturing what the app actually produces. **Never fabricate `eval_output` values by hand.** If you type `"eval_output": "4"` into a dataset JSON file without the app actually producing that output, the dataset is testing a fiction. A fabricated dataset is worse than no dataset because it gives false confidence — the user thinks their app is being tested, but it isn't.
If you catch yourself writing or editing `eval_output` values directly in a JSON file, stop. Go back to Stage 4, run the app, and capture real outputs.
### Key rules for dataset building
- **Every `eval_output` must come from a real execution** of the eval-boundary function. No exceptions.
- **Include expected outputs** for comparison-based evaluators like `FactualityEval`. Expected outputs should reflect the **correct LLM response given what the tools/system actually return** — not an idealized answer predicated on fixing non-LLM bugs.
- **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
- When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.
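As a quick sanity check, the sketch below flags items that may have been typed in by hand. It assumes dataset files are JSON arrays of items whose `eval_metadata` carries a `trace_id` when they were captured from a real run; that assumption may not match the actual on-disk format, so adapt it before relying on it.
```python
# Flag dataset items that lack trace metadata (possible hand-written outputs).
import json
from pathlib import Path

for path in Path("pixie_qa/datasets").glob("*.json"):
    items = json.loads(path.read_text())
    for i, item in enumerate(items):
        meta = item.get("eval_metadata") or {}
        if "trace_id" not in meta:
            print(f"{path.name}[{i}]: no trace_id; was this output captured from a real run?")
```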
---
## Stage 7: Run the Tests
```bash
pixie test pixie_qa/tests/ -v
@@ -420,7 +660,7 @@ The `-v` flag shows per-case scores and reasoning, which makes it much easier to
---
## Stage 8: Investigate Failures
**Only proceed here if the user asked for iteration/fixing, or explicitly confirmed after setup.**
@@ -471,9 +711,6 @@ asyncio.run(inspect("the-trace-id-here"))
Walk through the trace and identify exactly where the failure originates. Common patterns:
**LLM-related failures (fix with prompt/model/eval changes):**
| Symptom | Likely cause |
@@ -551,7 +788,7 @@ pixie test pixie_qa/tests/test_<feature>.py -v
## Memory Template
```markdown
# Eval Notes: <Project Name>
## How the application works
@@ -581,6 +818,14 @@ pixie test pixie_qa/tests/test_<feature>.py -v
- Good output: ...
- Bad output: ...
### Eval-boundary function
- **Function**: `<fully qualified name>`
- **Location**: `<file:line>`
- **Signature**: `<params and return type>`
- **Upstream dependencies to mock**: <what needs mocking/stubbing>
- **Why this boundary**: <rationale>
## Evaluation plan
### What to evaluate and why
@@ -607,8 +852,8 @@ pixie test pixie_qa/tests/test_<feature>.py -v
### <date> — <test_name> failure
<Full structured investigation as described in Stage 8>
```
---