awesome-copilot/skills/eval-driven-dev/references/2a-instrumentation.md

Step 2a: Instrument with wrap

For the full wrap() API reference, see wrap-api.md.

Goal: Add wrap() calls at data boundaries so the eval harness can (1) inject controlled inputs in place of real external dependencies, and (2) capture outputs for scoring.


Data-flow analysis

Starting from LLM call sites, trace backwards and forwards through the code to find:

  • Dependency input: data from external systems (databases, APIs, caches, file systems, network fetches)
  • App output: data going out to users or external systems
  • Intermediate state: internal decisions relevant to evaluation (routing, tool calls)

You do not need to wrap LLM call arguments or responses — those are already captured by OpenInference auto-instrumentation.
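The three categories above can be seen in a minimal sketch. All names here are illustrative stubs (a hypothetical handler, not any real app): `get_profile` stands in for a database read, `call_llm` for the LLM call.

```python
def get_profile(user_id):          # stands in for a real database read
    return {"id": user_id, "tier": "gold"}

def call_llm(profile):             # stands in for the real LLM call
    return f"Hello, {profile['tier']} customer!"

def handle_request(user_id):
    profile = get_profile(user_id)     # dependency input: external DB read
    # intermediate state: routing decision an evaluator may need to judge
    selected_agent = "support" if profile["tier"] == "gold" else "sales"
    response = call_llm(profile)       # LLM call itself: auto-instrumented, no wrap
    return response                    # app output: what the user receives
```

The data-flow analysis marks `profile` as a dependency input, `selected_agent` as intermediate state, and `response` as the app output; the `call_llm` arguments and result need no wrap.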

Adding wrap() calls

For each data point found, add a wrap() call in the application code:

import pixie

# External dependency data — function form (prevents the real call in eval mode)
profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile",
    description="Customer profile fetched from database")(user_id)

# External dependency data — function form (prevents the real call in eval mode)
history = pixie.wrap(redis.get_history, purpose="input", name="conversation_history",
    description="Conversation history from Redis")(session_id)

# App output — what the user receives
response = pixie.wrap(response_text, purpose="output", name="response",
    description="The assistant's response to the user")

# Intermediate state — internal decision relevant to evaluation
selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision",
    description="Which agent was selected to handle this request")

Value vs. function wrapping

# Value form: wrap a data value (result already computed)
profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile")

# Function form: wrap the callable — in eval mode the original function is
# NOT called; the registry value is returned instead.
profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile")(user_id)

CRITICAL: Always use function form for purpose="input" wraps on external calls — HTTP requests, database queries, API calls, file reads, cache lookups. Function form prevents the real call from executing in eval mode, so the dataset value is returned directly without making a live network request or database query. Value form still executes the real call first and only replaces the result afterwards — this wastes time, creates flaky tests, and makes evals dependent on external service availability.

The only case where value form is acceptable for purpose="input" is when the wrapped value is a local computation (no I/O, no side effects) that is cheap to recompute.
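The difference can be made concrete with a toy model. This is a conceptual sketch only, not the pixie implementation: `wrap_fn` and `wrap_value` are hypothetical stand-ins for the two forms, and `CALLS` records which external calls actually executed.

```python
EVAL_MODE = True
REGISTRY = {"customer_profile": {"id": "u-1", "tier": "gold"}}
CALLS = []  # records which external calls actually ran

def get_profile(user_id):
    CALLS.append("db.get_profile")   # a real call would hit the database here
    return {"id": user_id, "tier": "free"}

def wrap_fn(fn, name):
    """Function form: in eval mode, return the registry value; never call fn."""
    def wrapped(*args, **kwargs):
        if EVAL_MODE:
            return REGISTRY[name]    # real call is skipped entirely
        return fn(*args, **kwargs)
    return wrapped

def wrap_value(value, name):
    """Value form: the call already ran to compute `value`; eval mode can
    only swap the result after the fact."""
    return REGISTRY[name] if EVAL_MODE else value

profile_a = wrap_fn(get_profile, "customer_profile")("u-1")    # no live call
profile_b = wrap_value(get_profile("u-1"), "customer_profile")  # live call ran
```

After running this, `CALLS` contains one entry: only the value form triggered the live call, even though both forms return the registry value.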

Placement rules

  1. Wrap at the data boundary — where data enters or exits the application, not deep inside utility functions.
  2. Names must be unique across the entire application (used as registry keys and dataset field names).
  3. Use lower_snake_case for names.
  4. Don't change the function's interface: wrap() is purely additive and returns the same type.

Placement by purpose

purpose="input" — where external data enters

Place input wraps at the boundary where external data enters the app, not at intermediate processing stages. In a pipeline architecture (fetch → process → extract → format):

  • Correct: wrap(fetch_page, purpose="input", name="fetched_page")(url) using function form at the HTTP fetch boundary — in eval mode, the fetch is skipped entirely and the dataset value is returned; in trace mode, the real fetch runs and the result is captured.
  • Incorrect: wrap(html_content, purpose="input", name="fetched_page") using value form — the HTTP fetch still runs in eval mode (wasting time and creating flaky tests), and only the result is replaced afterwards.
  • Incorrect: wrap(processed_chunks, purpose="input", name="chunks") after parsing — eval mode bypasses parsing and chunking entirely.

Principle: wrap(purpose="input") replaces the minimum external dependency while exercising the maximum internal logic. Push the boundary as far upstream as possible. Always use function form for input wraps on external calls — this prevents the real call from executing in eval mode.
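A placement sketch for the pipeline above, under stated assumptions: `wrap_input_fn` is a hypothetical shim playing the role of function-form `pixie.wrap(..., purpose="input")` in eval mode, and the fetch/parse steps are stubs.

```python
EVAL_MODE = True
DATASET = {"fetched_page": "<html><p>from dataset</p></html>"}

def fetch_page(url):
    raise RuntimeError("live HTTP fetch — must never run in eval mode")

def wrap_input_fn(fn, name):
    """Shim for function-form input wrapping: skip fn in eval mode."""
    def wrapped(*args):
        return DATASET[name] if EVAL_MODE else fn(*args)
    return wrapped

def run_pipeline(url):
    # Correct placement: wrap at the HTTP boundary, so eval mode skips the
    # fetch but still exercises the parsing and chunking below.
    html = wrap_input_fn(fetch_page, "fetched_page")(url)
    text = html.replace("<html><p>", "").replace("</p></html>", "")  # "parse"
    chunks = text.split()                                            # "chunk"
    return chunks
```

Running `run_pipeline("https://example.com")` in eval mode never raises: the fetch is replaced, but the downstream parsing and chunking logic is still exercised against the dataset value.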

purpose="output" — where processed data exits

Track downstream from the LLM response to find where data leaves the app — sent to the user, written to storage, rendered in UI, or passed to an external system. Wrap at that exit boundary.

  • Don't wrap raw LLM responses — those are already captured by OpenInference auto-instrumentation as llm_span entries.
  • Wrap the app's final processed result — after any post-processing, formatting, or transformation the app applies to the LLM output.
  • If the app has multiple output channels (e.g., a response to the user AND a side-effect write to a database), wrap each one separately.

# Final response after the app's formatting pipeline
response = pixie.wrap(formatted_response, purpose="output", name="response",
    description="Final response sent to the user")

# Side-effect output — data written to external storage
pixie.wrap(saved_record, purpose="output", name="saved_summary",
    description="Summary record saved to the database")

Principle: output wraps are observation-only — they capture what the app produced so evaluators can score it. They are never mocked or injected during eval runs.
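The observation-only behavior can be sketched with a hypothetical shim (not the pixie internals): an output wrap records the value for scoring and hands back the same value, so the surrounding code is unaffected.

```python
CAPTURED = {}  # stands in for whatever store the eval harness reads

def wrap_output(value, name):
    """Shim for purpose="output": record the value, pass it through unchanged."""
    CAPTURED[name] = value   # captured for the evaluator to score
    return value             # same value, same type — interface unchanged

response = wrap_output("Hello, gold customer!", "response")
```

`response` is untouched by the wrap; only the capture side effect happens.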

purpose="state" — internal decisions relevant to evaluation

Some eval criteria need to judge the app's internal reasoning — not just what went in or came out, but how the app made decisions. Wrap internal state when an eval criterion requires it and the data isn't visible in inputs or outputs.

Common examples:

  • Agent routing: which sub-agent or tool was selected to handle a request
  • Plan/step decisions: what steps the agent chose to execute
  • Memory updates: what the agent added to or removed from its working memory
  • Retrieval results: which documents/chunks were retrieved before being fed to the LLM

# Agent routing decision
selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision",
    description="Which agent was selected to handle this request")

# Retrieved context fed to LLM
pixie.wrap(retrieved_chunks, purpose="state", name="retrieved_context",
    description="Document chunks retrieved by RAG before LLM call")

Principle: only wrap state that an eval criterion actually needs. Don't wrap every variable — state wraps are for internal data that evaluators must see but that doesn't appear in the app's inputs or outputs.

Coverage check

After adding all wrap() calls, go through each eval criterion from pixie_qa/02-eval-criteria.md and verify:

  1. Every criterion that judges what went in has a corresponding input or entry wrap.
  2. Every criterion that judges what came out has a corresponding output wrap.
  3. Every criterion that judges how the app decided has a corresponding state wrap.

If a criterion needs data that isn't captured, add the wrap now — don't defer.
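One way to run this check mechanically (a hypothetical helper, not part of pixie): map each criterion to the wrap name that must supply its data, then flag criteria with no registered wrap. The wrap names and criteria below are illustrative.

```python
registered_wraps = {
    "customer_profile": "input",
    "response": "output",
    "routing_decision": "state",
}

criteria = [
    ("answer reflects the customer's tier", "customer_profile"),   # judges what went in
    ("response is polite and on-topic", "response"),               # judges what came out
    ("gold customers route to support agent", "routing_decision"), # judges how it decided
]

missing = [c for c, wrap_name in criteria if wrap_name not in registered_wraps]
```

An empty `missing` list means every criterion has a wrap backing it; anything left in the list needs a wrap added now.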


Output

Modified application source files with wrap() calls at data boundaries.