awesome-copilot/skills/eval-driven-dev/references/pixie-api.md
Yiou Li 47f544a09c Add eval-driven development skill (#1013)
* Add eval-driven development skill

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

2026-03-16 10:48:08 +11:00


pixie API Reference

Configuration

All settings are read from environment variables at call time. By default, every artefact lives inside a single pixie_qa project directory:

| Variable | Default | Description |
| --- | --- | --- |
| `PIXIE_ROOT` | `pixie_qa` | Root directory for all artefacts |
| `PIXIE_DB_PATH` | `pixie_qa/observations.db` | SQLite database file path |
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently `sqlite`) |
| `PIXIE_DATASET_DIR` | `pixie_qa/datasets` | Directory for dataset JSON files |
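
To make "read at call time" concrete, here is a minimal sketch of a settings lookup that consults the environment on every call rather than once at import. `read_settings` and `DEFAULTS` are hypothetical names used for illustration only; this is not pixie's own code, and the source does not say whether overriding `PIXIE_ROOT` also moves the other defaults, so the sketch treats each variable independently.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "PIXIE_ROOT": "pixie_qa",
    "PIXIE_DB_PATH": "pixie_qa/observations.db",
    "PIXIE_DB_ENGINE": "sqlite",
    "PIXIE_DATASET_DIR": "pixie_qa/datasets",
}


def read_settings(environ=None):
    """Hypothetical helper: resolve every variable when called, not at import."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


# Two calls with different environments give different results,
# because nothing is cached between calls.
before = read_settings({})
after = read_settings({"PIXIE_DB_PATH": "/tmp/run-42/observations.db"})
print(before["PIXIE_DB_PATH"])
print(after["PIXIE_DB_PATH"])
```

In practice this means a test can set a variable such as `PIXIE_DB_PATH` just before the first pixie call and expect it to take effect.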

Instrumentation API (pixie)

from pixie import enable_storage, observe, start_observation, flush, init, add_handler

| Function / Decorator | Signature | Notes |
| --- | --- | --- |
| `enable_storage()` | `() → StorageHandler` | Idempotent. Creates DB, registers handler. Call at app startup. |
| `init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. |
| `observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
| `start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
| `flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. |
| `add_handler` | `(handler) → None` | Register a custom handler (must call `init()` first). |
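
The capture behaviour described for `observe` (keyword arguments become `eval_input`, the return value becomes `eval_output`, for both sync and async functions) can be illustrated with a stand-in decorator. Everything below is an assumption made for illustration: `observe_sketch` and the `CAPTURED` list are hypothetical, and pixie's real decorator writes to its own storage rather than to a list.

```python
import asyncio
import functools
import inspect

CAPTURED = []  # hypothetical stand-in for pixie's storage


def observe_sketch(name=None):
    """Stand-in for the idea behind @observe; not pixie's implementation."""
    def decorator(fn):
        span_name = name or fn.__name__

        def record(kwargs, result):
            # Keyword arguments are stored as eval_input, the result as eval_output.
            CAPTURED.append(
                {"name": span_name, "eval_input": dict(kwargs), "eval_output": result}
            )
            return result

        if inspect.iscoroutinefunction(fn):
            @functools.wraps(fn)
            async def async_wrapper(*args, **kwargs):
                return record(kwargs, await fn(*args, **kwargs))
            return async_wrapper

        @functools.wraps(fn)
        def sync_wrapper(*args, **kwargs):
            return record(kwargs, fn(*args, **kwargs))
        return sync_wrapper
    return decorator


@observe_sketch(name="answer")
def answer(*, question):
    return question.upper()


@observe_sketch()
async def answer_async(*, question):
    return question[::-1]


answer(question="hi")
asyncio.run(answer_async(question="abc"))
```

Since the table only mentions kwargs, calling observed functions with keyword arguments is the safe way to make sure the inputs end up in `eval_input`.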

CLI Commands

# Dataset management
pixie dataset create <name>
pixie dataset list
pixie dataset save <name>                              # root span (default)
pixie dataset save <name> --select last_llm_call       # last LLM call
pixie dataset save <name> --select by_name --span-name <name>
pixie dataset save <name> --notes "some note"
echo '"expected value"' | pixie dataset save <name> --expected-output

# Run eval tests
pixie test [path] [-k filter_substring] [-v]

pixie dataset save selection modes:

  • root (default) — the outermost @observe or start_observation span
  • last_llm_call — the most recent LLM API call span in the trace
  • by_name — a span matching the --span-name argument (takes the last matching span)
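
The three selection modes can be sketched against a simplified trace. The `Span` record and `select_span` function below are hypothetical and exist only to show the selection rules listed above; pixie's real span types and lookup code are not shown in this reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical flat span record; pixie's real span types are richer."""
    name: str
    kind: str              # "observe" or "llm"
    parent: Optional[str]  # name of the parent span, None for the root


def select_span(spans, mode="root", span_name=None):
    """Pick one span from a trace whose spans are listed in start order."""
    if mode == "root":
        candidates = [s for s in spans if s.parent is None]
    elif mode == "last_llm_call":
        candidates = [s for s in spans if s.kind == "llm"]
    elif mode == "by_name":
        candidates = [s for s in spans if s.name == span_name]
    else:
        raise ValueError(f"unknown selection mode: {mode}")
    if not candidates:
        raise LookupError(f"no span matches mode {mode!r}")
    return candidates[-1]  # when several spans match, the last one wins


trace = [
    Span("handle_request", "observe", None),
    Span("retrieve", "observe", "handle_request"),
    Span("draft", "llm", "handle_request"),
    Span("retrieve", "observe", "handle_request"),
    Span("final", "llm", "handle_request"),
]

print(select_span(trace).name)
print(select_span(trace, "last_llm_call").name)
print(select_span(trace, "by_name", "retrieve") is trace[3])
```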

Eval Harness (pixie)

from pixie import (
    assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
    EvalAssertionError, Evaluation, ScoreThreshold,
    capture_traces, MemoryTraceHandler,
    last_llm_call, root,
)

Key functions

assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)

  • Loads dataset by name, runs assert_pass with all items.
  • runnable: callable (eval_input) → None (sync or async). Must instrument itself.
  • evaluators: list of evaluator callables.
  • pass_criteria: defaults to ScoreThreshold() (all scores >= 0.5).
  • from_trace: last_llm_call or root — selects which span to evaluate.

assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)

  • Same as assert_dataset_pass, but takes explicit inputs (and optionally Evaluable items for expected outputs) instead of a dataset name.

run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)

  • Runs runnable(eval_input), captures traces, evaluates. Returns one Evaluation.

ScoreThreshold(threshold=0.5, pct=1.0)

  • threshold: min score per item (default 0.5).
  • pct: fraction of items that must meet threshold (default 1.0 = all).
  • Example: ScoreThreshold(0.7, pct=0.8) = 80% of cases must score ≥ 0.7.
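
The rule described by `threshold` and `pct` amounts to a small piece of arithmetic, sketched below. `meets_threshold` is a hypothetical function written for illustration; it works on a flat list of scores and ignores how pixie combines multiple evaluators or multiple passes, which this reference does not specify.

```python
def meets_threshold(scores, threshold=0.5, pct=1.0):
    """Return True when at least `pct` of the scores reach `threshold`."""
    if not scores:
        raise ValueError("need at least one score")
    passing = sum(1 for score in scores if score >= threshold)
    return passing / len(scores) >= pct


scores = [0.9, 0.8, 0.75, 0.7, 0.4]

# Default criteria: every score must be at least 0.5, so the 0.4 fails the run.
print(meets_threshold(scores))

# Relaxed criteria: 4 of 5 scores reach 0.7, which is exactly 80%.
print(meets_threshold(scores, threshold=0.7, pct=0.8))
```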

Evaluation(score, reasoning, details={}): frozen result. score ranges from 0.0 to 1.0.

capture_traces(): context manager; use for in-memory trace capture without DB.

last_llm_call(trace) / root(trace): from_trace helpers.


Evaluators

Heuristic (no LLM needed)

| Evaluator | Use when |
| --- | --- |
| `ExactMatchEval(expected=...)` | Output must exactly equal the expected string |
| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) |
| `NumericDiffEval(expected=...)` | Normalised numeric difference |
| `JSONDiffEval(expected=...)` | Structural JSON comparison |
| `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) |
| `ListContainsEval(expected=...)` | Output list contains expected items |
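
For intuition about what an edit-distance score looks like, here is a minimal sketch of Levenshtein distance turned into a 0.0 to 1.0 similarity. The normalisation by the longer string's length is an assumption chosen for illustration; the exact formula used by `LevenshteinMatch` is not given in this reference.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def similarity(output, expected):
    """Assumed normalisation: 1.0 for identical strings, lower as edits grow."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(output, expected) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("abc", "abc"))
```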

LLM-as-judge (require OpenAI key or compatible client)

| Evaluator | Use when |
| --- | --- |
| `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs reference |
| `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison |
| `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality |
| `TranslationEval(expected=..., language=..., ...)` | Translation quality |
| `PossibleEval(model=..., client=...)` | Output is feasible / plausible |
| `SecurityEval(model=..., client=...)` | No security vulnerabilities in output |
| `ModerationEval(threshold=..., client=...)` | Content moderation |
| `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison |

RAG / retrieval

| Evaluator | Use when |
| --- | --- |
| `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to query |
| `FaithfulnessEval(client=...)` | Answer is faithful to the provided context |
| `AnswerRelevancyEval(client=...)` | Answer addresses the question |
| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference |

Custom evaluator template

from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input  — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")

Dataset Python API

from pixie import DatasetStore, Evaluable

store = DatasetStore()                               # reads PIXIE_DATASET_DIR
store.create("my-dataset")                          # create empty
store.create("my-dataset", items=[...])             # create with items
store.append("my-dataset", Evaluable(...))          # add one item
store.get("my-dataset")                             # returns Dataset
store.list()                                        # list names
store.remove("my-dataset", index=2)                 # remove by index
store.delete("my-dataset")                          # delete entirely

Evaluable fields:

  • eval_input: the input (what @observe captured as function kwargs)
  • eval_output: the output (return value of the observed function)
  • eval_metadata: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes trace_id and span_id
  • expected_output: reference answer for comparison (UNSET if not provided)

ObservationStore Python API

from pixie import ObservationStore

store = ObservationStore()   # reads PIXIE_DB_PATH
await store.create_tables()

# Read traces
await store.list_traces(limit=10, offset=0)         # → list of trace summaries
await store.get_trace(trace_id)                     # → list[ObservationNode] (tree)
await store.get_root(trace_id)                      # → root ObserveSpan
await store.get_last_llm(trace_id)                  # → most recent LLMSpan
await store.get_by_name(name, trace_id=None)        # → list of spans

# ObservationNode
node.to_text()          # pretty-print span tree
node.find(name)         # find a child span by name
node.children           # list of child ObservationNode
node.span               # the underlying span (ObserveSpan or LLMSpan)
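
The shape of the span tree can be pictured with a small stand-in class. `NodeSketch` is hypothetical and mirrors only the `children`, `find`, and `to_text` members listed above. Two details are assumptions, since the reference does not state them: `find` here searches all descendants depth-first, and `to_text` uses two-space indentation.

```python
from dataclasses import dataclass, field


@dataclass
class NodeSketch:
    """Hypothetical stand-in for ObservationNode, holding only a name and children."""
    name: str
    children: list = field(default_factory=list)

    def find(self, name):
        """Depth-first search below this node (assumed behaviour); None if absent."""
        for child in self.children:
            if child.name == name:
                return child
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def to_text(self, indent=0):
        """Render the subtree, one span per line, indented by depth."""
        lines = ["  " * indent + self.name]
        for child in self.children:
            lines.append(child.to_text(indent + 1))
        return "\n".join(lines)


tree = NodeSketch("handle_request", [
    NodeSketch("retrieve"),
    NodeSketch("generate", [NodeSketch("chat")]),
])

print(tree.to_text())
print(tree.find("chat").name)
```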