# pixie API Reference

## Configuration

All settings are read from environment variables at call time. By default,
every artefact lives inside a single `pixie_qa` project directory:
| Variable | Default | Description |
|---|---|---|
| `PIXIE_ROOT` | `pixie_qa` | Root directory for all artefacts |
| `PIXIE_DB_PATH` | `pixie_qa/observations.db` | SQLite database file path |
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently `sqlite`) |
| `PIXIE_DATASET_DIR` | `pixie_qa/datasets` | Directory for dataset JSON files |
## Instrumentation API (pixie)

```python
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
```

| Function / Decorator | Signature | Notes |
|---|---|---|
| `enable_storage()` | `() → StorageHandler` | Idempotent. Creates the DB and registers the handler. Call at app startup. |
| `init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. |
| `observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input` and the return value as `eval_output`. |
| `start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
| `flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. |
| `add_handler` | `(handler) → None` | Registers a custom handler (must call `init()` first). |
## CLI Commands

```shell
# Dataset management
pixie dataset create <name>
pixie dataset list
pixie dataset save <name>                                # root span (default)
pixie dataset save <name> --select last_llm_call         # last LLM call
pixie dataset save <name> --select by_name --span-name <name>
pixie dataset save <name> --notes "some note"
echo '"expected value"' | pixie dataset save <name> --expected-output

# Run eval tests
pixie test [path] [-k filter_substring] [-v]
```

`pixie dataset save` selection modes:

- `root` (default) — the outermost `@observe` or `start_observation` span
- `last_llm_call` — the most recent LLM API call span in the trace
- `by_name` — a span matching the `--span-name` argument (takes the last matching span)
## Eval Harness (pixie)

```python
from pixie import (
    assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
    EvalAssertionError, Evaluation, ScoreThreshold,
    capture_traces, MemoryTraceHandler,
    last_llm_call, root,
)
```
### Key functions

`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`

- Loads the dataset by name and runs `assert_pass` with all items.
- `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
- `evaluators`: list of evaluator callables.
- `pass_criteria`: defaults to `ScoreThreshold()` (all scores ≥ 0.5).
- `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.

`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`

- Same, but takes explicit inputs (and optionally `Evaluable` items for expected outputs).

`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`

- Runs `runnable(eval_input)`, captures traces, and evaluates. Returns one `Evaluation`.

`ScoreThreshold(threshold=0.5, pct=1.0)`

- `threshold`: minimum score per item (default 0.5).
- `pct`: fraction of items that must meet the threshold (default 1.0 = all).
- Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.
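The threshold/pct arithmetic can be illustrated in plain Python. This is a sketch of the semantics described above, not pixie's implementation, and `meets_criteria` is a hypothetical helper name:

```python
def meets_criteria(scores, threshold=0.5, pct=1.0):
    """Return True when at least `pct` of scores are >= `threshold`."""
    passing = sum(1 for s in scores if s >= threshold)
    return passing >= pct * len(scores)

# ScoreThreshold() default: every item must score >= 0.5.
print(meets_criteria([0.6, 0.9, 0.5]))   # True
print(meets_criteria([0.6, 0.4, 0.9]))   # False (one item below 0.5)

# ScoreThreshold(0.7, pct=0.8): 80% of cases must score >= 0.7.
print(meets_criteria([0.9, 0.8, 0.75, 0.7, 0.2], threshold=0.7, pct=0.8))  # True
```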
`Evaluation(score, reasoning, details={})` — frozen result. `score` is 0.0–1.0.

`capture_traces()` — context manager; use for in-memory trace capture without the DB.

`last_llm_call(trace)` / `root(trace)` — `from_trace` helpers.
## Evaluators

### Heuristic (no LLM needed)

| Evaluator | Use when |
|---|---|
| `ExactMatchEval(expected=...)` | Output must exactly equal the expected string |
| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) |
| `NumericDiffEval(expected=...)` | Normalised numeric difference |
| `JSONDiffEval(expected=...)` | Structural JSON comparison |
| `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) |
| `ListContainsEval(expected=...)` | Output list contains the expected items |
### LLM-as-judge (require an OpenAI key or compatible client)

| Evaluator | Use when |
|---|---|
| `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs the reference |
| `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison |
| `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality |
| `TranslationEval(expected=..., language=..., ...)` | Translation quality |
| `PossibleEval(model=..., client=...)` | Output is feasible / plausible |
| `SecurityEval(model=..., client=...)` | No security vulnerabilities in output |
| `ModerationEval(threshold=..., client=...)` | Content moderation |
| `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison |
### RAG / retrieval

| Evaluator | Use when |
|---|---|
| `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to the query |
| `FaithfulnessEval(client=...)` | Answer is faithful to the provided context |
| `AnswerRelevancyEval(client=...)` | Answer addresses the question |
| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs the reference |
### Custom evaluator template

```python
from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
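Because an evaluator is just an async callable, its scoring logic can be exercised on its own. The sketch below uses minimal hypothetical stand-ins for `Evaluation` and `Evaluable` (not pixie's real classes) so the template's logic runs without pixie installed:

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical minimal stand-ins, shaped after the fields documented above.
@dataclass(frozen=True)
class Evaluation:
    score: float
    reasoning: str
    details: dict = field(default_factory=dict)

@dataclass
class Evaluable:
    eval_input: object = None
    eval_output: object = None
    expected_output: object = None

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # Same substring check as the template above.
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="substring check")

result = asyncio.run(my_evaluator(Evaluable(eval_output="... expected pattern ...")))
print(result.score)  # 1.0
```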
## Dataset Python API

```python
from pixie import DatasetStore, Evaluable

store = DatasetStore()                       # reads PIXIE_DATASET_DIR
store.create("my-dataset")                   # create empty
store.create("my-dataset", items=[...])      # create with items
store.append("my-dataset", Evaluable(...))   # add one item
store.get("my-dataset")                      # returns Dataset
store.list()                                 # list names
store.remove("my-dataset", index=2)          # remove by index
store.delete("my-dataset")                   # delete entirely
```
`Evaluable` fields:

- `eval_input`: the input (what `@observe` captured as function kwargs)
- `eval_output`: the output (return value of the observed function)
- `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
- `expected_output`: reference answer for comparison (`UNSET` if not provided)
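Assuming the on-disk dataset format simply mirrors the `Evaluable` fields (a hypothetical layout, not confirmed by this reference; the question, answer, and "..." IDs are placeholders), one saved item might look like:

```json
{
  "eval_input": {"question": "What is 2 + 2?"},
  "eval_output": "4",
  "eval_metadata": {"trace_id": "...", "span_id": "..."},
  "expected_output": "4"
}
```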
## ObservationStore Python API

```python
from pixie import ObservationStore

store = ObservationStore()                     # reads PIXIE_DB_PATH
await store.create_tables()

# Read traces
await store.list_traces(limit=10, offset=0)    # → list of trace summaries
await store.get_trace(trace_id)                # → list[ObservationNode] (tree)
await store.get_root(trace_id)                 # → root ObserveSpan
await store.get_last_llm(trace_id)             # → most recent LLMSpan
await store.get_by_name(name, trace_id=None)   # → list of spans

# ObservationNode
node.to_text()    # pretty-print span tree
node.find(name)   # find a child span by name
node.children     # list of child ObservationNode
node.span         # the underlying span (ObserveSpan or LLMSpan)
```