update eval-driven-dev skill (#1434)

* update eval-driven-dev skill

* fix: update skill update command to use correct repository path

* address comments.

* update eval driven dev
Yiou Li
2026-04-27 18:27:48 -07:00
committed by GitHub
parent 9933f65e6b
commit 2860790bc9
23 changed files with 1881 additions and 700 deletions

# Step 2a: Instrument with `wrap`
> For the full `wrap()` API reference, see `wrap-api.md`.
**Goal**: Add `wrap()` calls at data boundaries so the eval harness can (1) inject controlled inputs in place of real external dependencies, and (2) capture outputs for scoring.
---
## Data-flow analysis
Starting from LLM call sites, trace backwards and forwards through the code to find:
- **Dependency input**: data from external systems (databases, APIs, caches, file systems, network fetches)
- **App output**: data going out to users or external systems
- **Intermediate state**: internal decisions relevant to evaluation (routing, tool calls)
You do **not** need to wrap LLM call arguments or responses — those are already captured by OpenInference auto-instrumentation.
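The three kinds of data points can be seen together in a toy request handler. This is a sketch only: every name is hypothetical, and the stub functions stand in for a real database, cache, router, and LLM call.

```python
# Toy request handler with the three boundary kinds annotated. All names are
# hypothetical stubs; a real app would call an actual database, cache, and LLM.
def get_profile(user_id):    return {"id": user_id}  # database (stub)
def get_history(session_id): return ["hi"]           # cache (stub)
def route(message):          return "faq_agent"      # routing decision (stub)
def respond(agent, profile, history, message):       # LLM call (stub);
    return f"[{agent}] ok"                           # auto-instrumented in reality

def handle_request(user_id, session_id, message):
    profile = get_profile(user_id)     # dependency input -> wrap (function form)
    history = get_history(session_id)  # dependency input -> wrap (function form)
    agent = route(message)             # intermediate state -> purpose="state"
    reply = respond(agent, profile, history, message)  # LLM call: no wrap needed
    return reply                       # app output -> purpose="output"
```

In the real application, each annotated line is where a `wrap()` call from the next section would go.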
## Adding `wrap()` calls
For each data point found, add a `wrap()` call in the application code:
```python
import pixie
# External dependency data — function form (prevents the real call in eval mode)
profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile",
description="Customer profile fetched from database")(user_id)
# External dependency data — function form (prevents the real call in eval mode)
history = pixie.wrap(redis.get_history, purpose="input", name="conversation_history",
description="Conversation history from Redis")(session_id)
# App output — what the user receives
response = pixie.wrap(response_text, purpose="output", name="response",
description="The assistant's response to the user")
# Intermediate state — internal decision relevant to evaluation
selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision",
description="Which agent was selected to handle this request")
```
### Value vs. function wrapping
```python
# Value form: wrap a data value (result already computed)
profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile")
# Function form: wrap the callable — in eval mode the original function is
# NOT called; the registry value is returned instead.
profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile")(user_id)
```
**CRITICAL: Always use function form for `purpose="input"` wraps on external calls** — HTTP requests, database queries, API calls, file reads, cache lookups. Function form prevents the real call from executing in eval mode, so the dataset value is returned directly without making a live network request or database query. Value form still executes the real call first and only replaces the result afterwards — this wastes time, creates flaky tests, and makes evals dependent on external service availability.
The only case where value form is acceptable for `purpose="input"` is when the wrapped value is a local computation (no I/O, no side effects) that is cheap to recompute.
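The difference can be made concrete with a minimal sketch of the documented semantics. This is an illustration only, not pixie's real implementation: `wrap_fn`, `EVAL_MODE`, and `REGISTRY` are hypothetical stand-ins for the library's internals.

```python
# Illustrative sketch of the documented semantics; NOT pixie's implementation.
# In eval mode, a function-form wrap returns the registry value without calling
# the wrapped function; a value-form wrap can only replace a result that has
# already been computed.
EVAL_MODE = True
REGISTRY = {"customer_profile": {"name": "Ada"}}  # injected dataset values
calls = []  # records which "external" calls actually executed

def get_profile(user_id):
    calls.append(user_id)        # stands in for a real database query
    return {"name": "from-db"}

def wrap_fn(fn, *, name):
    """Function form: skip fn in eval mode and return the registry value."""
    def inner(*args, **kwargs):
        if EVAL_MODE and name in REGISTRY:
            return REGISTRY[name]
        return fn(*args, **kwargs)
    return inner

# Function form: the query is skipped entirely, so `calls` stays empty.
profile = wrap_fn(get_profile, name="customer_profile")(42)

# Value form: get_profile(42) runs BEFORE any replacement can happen,
# so the external call still executes in eval mode.
computed = get_profile(42)
```

After this runs, `profile` holds the injected registry value and `calls` records only the value-form invocation, which is exactly why function form is required for external calls.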
### Placement rules
1. **Wrap at the data boundary** — where data enters or exits the application, not deep inside utility functions.
2. **Names must be unique** across the entire application (used as registry keys and dataset field names).
3. **Use `lower_snake_case`** for names.
4. **Don't change the function's interface**: `wrap()` is purely additive and returns the same type.
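Rules 2 and 3 are mechanical enough to check in review. A hypothetical helper (not part of pixie) that scans source text for wrap names and flags duplicates or names that aren't `lower_snake_case` might look like:

```python
# Hypothetical review helper, not a pixie API: extract name="..." arguments
# from wrap() calls and check uniqueness and lower_snake_case.
import re
from collections import Counter

source = '''
pixie.wrap(db.get_profile, purpose="input", name="customer_profile")
pixie.wrap(response_text, purpose="output", name="response")
pixie.wrap(selected_agent, purpose="state", name="routing_decision")
'''

names = re.findall(r'name="([^"]+)"', source)
duplicates = [n for n, count in Counter(names).items() if count > 1]
bad_case = [n for n in names if not re.fullmatch(r"[a-z][a-z0-9_]*", n)]
```

An empty `duplicates` and `bad_case` means the names satisfy rules 2 and 3.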
### Placement by purpose
#### `purpose="input"` — where external data enters
Place input wraps at the **boundary where external data enters the app**, not at intermediate processing stages. In a pipeline architecture (fetch → process → extract → format):
- **Correct**: `wrap(fetch_page, purpose="input", name="fetched_page")(url)` using **function form** at the HTTP fetch boundary — in eval mode, the fetch is skipped entirely and the dataset value is returned; in trace mode, the real fetch runs and the result is captured.
- **Incorrect**: `wrap(html_content, purpose="input", name="fetched_page")` using value form — the HTTP fetch still runs in eval mode (wasting time and creating flaky tests), and only the result is replaced afterwards.
- **Incorrect**: `wrap(processed_chunks, purpose="input", name="chunks")` after parsing — eval mode bypasses parsing and chunking entirely.
**Principle**: `wrap(purpose="input")` replaces the _minimum external dependency_ while exercising the _maximum internal logic_. Push the boundary as far upstream as possible. **Always use function form** for input wraps on external calls — this prevents the real call from executing in eval mode.
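The pipeline placement above can be sketched end to end. This toy pipeline (hypothetical names, with a stand-in `wrap_input` instead of the real pixie call) shows that wrapping at the fetch boundary lets eval mode skip only the network call while parsing and chunking still run on the injected value:

```python
# Toy fetch -> parse -> chunk pipeline; a sketch, not pixie itself.
# The input wrap sits at the fetch boundary, so in eval mode only the
# network call is replaced and all downstream logic still executes.
EVAL_MODE = True
REGISTRY = {"fetched_page": "<p>hello</p><p>world</p>"}

def wrap_input(fn, *, name):
    def inner(*args):
        return REGISTRY[name] if EVAL_MODE else fn(*args)
    return inner

def fetch_page(url):
    raise RuntimeError("no network in eval mode")  # would be an HTTP GET

def parse(html):
    return html.replace("<p>", "").split("</p>")[:-1]

def chunk(paragraphs, size=1):
    return [paragraphs[i:i + size] for i in range(0, len(paragraphs), size)]

html = wrap_input(fetch_page, name="fetched_page")("https://example.com")
chunks = chunk(parse(html))  # parse and chunk run on the injected HTML
```

Had the wrap been placed on `chunks` instead, eval mode would bypass `parse` and `chunk` entirely, so those stages would never be exercised.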
#### `purpose="output"` — where processed data exits
Track **downstream** from the LLM response to find where data leaves the app — sent to the user, written to storage, rendered in UI, or passed to an external system. Wrap at that exit boundary.
- Don't wrap raw LLM responses — those are already captured by OpenInference auto-instrumentation as `llm_span` entries.
- Wrap the app's **final processed result** — after any post-processing, formatting, or transformation the app applies to the LLM output.
- If the app has multiple output channels (e.g., a response to the user AND a side-effect write to a database), wrap each one separately.
```python
# Final response after the app's formatting pipeline
response = pixie.wrap(formatted_response, purpose="output", name="response",
description="Final response sent to the user")
# Side-effect output — data written to external storage
pixie.wrap(saved_record, purpose="output", name="saved_summary",
description="Summary record saved to the database")
```
**Principle**: output wraps are observation-only — they capture what the app produced so evaluators can score it. They are never mocked or injected during eval runs.
#### `purpose="state"` — internal decisions relevant to evaluation
Some eval criteria need to judge the app's internal reasoning — not just what went in or came out, but _how_ the app made decisions. Wrap internal state when an eval criterion requires it and the data isn't visible in inputs or outputs.
Common examples:
- **Agent routing**: which sub-agent or tool was selected to handle a request
- **Plan/step decisions**: what steps the agent chose to execute
- **Memory updates**: what the agent added to or removed from its working memory
- **Retrieval results**: which documents/chunks were retrieved before being fed to the LLM
```python
# Agent routing decision
selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision",
description="Which agent was selected to handle this request")
# Retrieved context fed to LLM
pixie.wrap(retrieved_chunks, purpose="state", name="retrieved_context",
description="Document chunks retrieved by RAG before LLM call")
```
**Principle**: only wrap state that an eval criterion actually needs. Don't wrap every variable — state wraps are for internal data that evaluators must see but that doesn't appear in the app's inputs or outputs.
### Coverage check
After adding all `wrap()` calls, go through each eval criterion from `pixie_qa/02-eval-criteria.md` and verify:
1. Every criterion that judges **what went in** has a corresponding `input` or `entry` wrap.
2. Every criterion that judges **what came out** has a corresponding `output` wrap.
3. Every criterion that judges **how the app decided** has a corresponding `state` wrap.
If a criterion needs data that isn't captured, add the wrap now — don't defer.
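One way to make the coverage check explicit is to map each criterion to the wrap names that supply its data. The criterion and wrap names below are hypothetical examples, not a pixie API:

```python
# Hypothetical coverage check: for each eval criterion, list the wrap names
# it needs, then surface criteria whose data is not captured by any wrap.
wrapped = {"customer_profile", "conversation_history",
           "response", "routing_decision"}

criterion_needs = {
    "response_grounded_in_profile": ["customer_profile", "response"],
    "correct_agent_selected": ["routing_decision"],
    "summary_saved_verbatim": ["saved_summary"],  # not wrapped yet
}

missing = {criterion: [n for n in needs if n not in wrapped]
           for criterion, needs in criterion_needs.items()
           if any(n not in wrapped for n in needs)}
```

Any entry in `missing` is a wrap to add before moving on.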
---
## Output
Modified application source files with `wrap()` calls at data boundaries.