awesome-copilot/skills/eval-driven-dev/references/pixie-api.md
Yiou Li 47f544a09c Add eval-driven development skill (#1013)
* Add eval-driven development skill

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

2026-03-16 10:48:08 +11:00

# pixie API Reference
## Configuration
All settings are read from environment variables at call time. By default,
every artefact lives inside a single `pixie_qa` project directory:
| Variable | Default | Description |
| ------------------- | -------------------------- | ---------------------------------- |
| `PIXIE_ROOT` | `pixie_qa` | Root directory for all artefacts |
| `PIXIE_DB_PATH` | `pixie_qa/observations.db` | SQLite database file path |
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) |
| `PIXIE_DATASET_DIR` | `pixie_qa/datasets` | Directory for dataset JSON files |
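
Because the variables are read at call time, an override has to be in the environment before the relevant pixie function runs. The sketch below shows one way to set an override from Python and to see which value would be in effect. `resolve_pixie_settings` and `PIXIE_DEFAULTS` are hypothetical names introduced for illustration and are not part of pixie. The sketch also assumes each default is independent; the table does not say whether changing `PIXIE_ROOT` moves the other paths.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the configuration table.
PIXIE_DEFAULTS = {
    "PIXIE_ROOT": "pixie_qa",
    "PIXIE_DB_PATH": "pixie_qa/observations.db",
    "PIXIE_DB_ENGINE": "sqlite",
    "PIXIE_DATASET_DIR": "pixie_qa/datasets",
}


def resolve_pixie_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: return the effective value of every variable.

    Each variable falls back to its documented default on its own
    (an assumption; pixie may derive paths from PIXIE_ROOT instead).
    """
    source = os.environ if env is None else env
    return {name: source.get(name, default) for name, default in PIXIE_DEFAULTS.items()}


# Override the database path for this process, then read the settings back.
os.environ["PIXIE_DB_PATH"] = "/tmp/pixie_run/observations.db"
settings = resolve_pixie_settings()
print(settings["PIXIE_DB_PATH"])
```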
---
## Instrumentation API (`pixie`)
```python
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
```
| Function / Decorator | Signature | Notes |
| -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- |
| `enable_storage()` | `() → StorageHandler` | Idempotent. Creates DB, registers handler. Call at app startup. |
| `init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. |
| `observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
| `start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
| `flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. |
| `add_handler` | `(handler) → None` | Register a custom handler (must call `init()` first). |
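
To make the `observe` row concrete, here is a stand-in decorator that mimics the documented behaviour: it wraps a sync or async function and records the keyword arguments as `eval_input` and the return value as `eval_output`. This is an illustrative sketch and not pixie's implementation. `observe_sketch` and the `captured` list are hypothetical, the fallback to the function name when `name` is `None` is an assumption, and positional arguments are simply ignored because the table only mentions kwargs.

```python
import asyncio
import functools
import inspect

captured = []  # stand-in for pixie's storage; each entry mimics one span


def observe_sketch(name=None):
    """Illustrative stand-in for `observe`; same call shape, simplified behaviour."""

    def decorator(fn):
        span_name = name or fn.__name__  # assumption: default to the function name

        def record(kwargs, output):
            captured.append(
                {"name": span_name, "eval_input": dict(kwargs), "eval_output": output}
            )

        if inspect.iscoroutinefunction(fn):

            @functools.wraps(fn)
            async def async_wrapper(*args, **kwargs):
                output = await fn(*args, **kwargs)
                record(kwargs, output)
                return output

            return async_wrapper

        @functools.wraps(fn)
        def sync_wrapper(*args, **kwargs):
            output = fn(*args, **kwargs)
            record(kwargs, output)
            return output

        return sync_wrapper

    return decorator


@observe_sketch(name="answer")
def answer(*, question: str) -> str:
    return question.upper()


@observe_sketch()
async def answer_async(*, question: str) -> str:
    return question[::-1]


answer(question="hi")
asyncio.run(answer_async(question="abc"))
```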
---
## CLI Commands
```bash
# Dataset management
pixie dataset create <name>
pixie dataset list
pixie dataset save <name> # root span (default)
pixie dataset save <name> --select last_llm_call # last LLM call
pixie dataset save <name> --select by_name --span-name <name>
pixie dataset save <name> --notes "some note"
echo '"expected value"' | pixie dataset save <name> --expected-output
# Run eval tests
pixie test [path] [-k filter_substring] [-v]
```
**`pixie dataset save` selection modes:**
- `root` (default) — the outermost `@observe` or `start_observation` span
- `last_llm_call` — the most recent LLM API call span in the trace
- `by_name` — a span matching the `--span-name` argument (takes the last matching span)
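
The three selection modes can be pictured as filters over the spans of one trace. The sketch below models that logic on a plain list. `SpanSketch` and `select_span` are hypothetical names, not pixie types, and the assumption that spans are listed in recording order is mine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpanSketch:
    """Hypothetical stand-in for a recorded span; not a pixie type."""

    name: str
    is_llm_call: bool = False
    parent: Optional[str] = None  # name of the enclosing span; None for the outermost


def select_span(
    spans: List[SpanSketch], mode: str = "root", span_name: Optional[str] = None
) -> SpanSketch:
    """Pick one span from a trace whose spans are listed in recording order."""
    if mode == "root":
        candidates = [s for s in spans if s.parent is None]
    elif mode == "last_llm_call":
        candidates = [s for s in spans if s.is_llm_call]
    elif mode == "by_name":
        candidates = [s for s in spans if s.name == span_name]
    else:
        raise ValueError(f"unknown selection mode: {mode!r}")
    if not candidates:
        raise LookupError(f"no span matches mode {mode!r}")
    # `root` takes the outermost span; the other modes take the last match.
    return candidates[0] if mode == "root" else candidates[-1]


trace = [
    SpanSketch("pipeline"),
    SpanSketch("retrieve", parent="pipeline"),
    SpanSketch("llm.chat", is_llm_call=True, parent="pipeline"),
    SpanSketch("retrieve", parent="pipeline"),
    SpanSketch("llm.chat", is_llm_call=True, parent="pipeline"),
]
```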
---
## Eval Harness (`pixie`)
```python
from pixie import (
    assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
    EvalAssertionError, Evaluation, ScoreThreshold,
    capture_traces, MemoryTraceHandler,
    last_llm_call, root,
)
```
### Key functions
**`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`**
- Loads dataset by name, runs `assert_pass` with all items.
- `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
- `evaluators`: list of evaluator callables.
- `pass_criteria`: defaults to `ScoreThreshold()` (all scores >= 0.5).
- `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.
**`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`**
- Same as `assert_dataset_pass`, but takes explicit inputs (and optionally `Evaluable` items for expected outputs) instead of loading a dataset by name.
**`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`**
- Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`.
**`ScoreThreshold(threshold=0.5, pct=1.0)`**
- `threshold`: min score per item (default 0.5).
- `pct`: fraction of items that must meet threshold (default 1.0 = all).
- Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.
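
The pass rule described above can be written out as a few lines of plain Python. This is a sketch of the documented semantics for a single pass over the items, not pixie's code. `meets_threshold` is a hypothetical name, rejecting an empty score list is a choice made for the sketch, and how `passes > 1` is aggregated is not covered.

```python
from typing import Sequence


def meets_threshold(
    scores: Sequence[float], threshold: float = 0.5, pct: float = 1.0
) -> bool:
    """Return True when at least `pct` of the scores are >= `threshold`."""
    if not scores:
        # Sketch choice: an empty run is an error rather than a pass or a fail.
        raise ValueError("no scores to evaluate")
    passing = sum(1 for score in scores if score >= threshold)
    return passing / len(scores) >= pct


# Mirrors ScoreThreshold(0.7, pct=0.8): four of five cases score at least 0.7.
print(meets_threshold([0.9, 0.8, 0.7, 0.75, 0.2], threshold=0.7, pct=0.8))  # True
```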
**`Evaluation(score, reasoning, details={})`**
- Frozen result. `score` ranges from 0.0 to 1.0.
**`capture_traces()`** — context manager; use for in-memory trace capture without DB.
**`last_llm_call(trace)`** / **`root(trace)`** — `from_trace` helpers.
---
## Evaluators
### Heuristic (no LLM needed)
| Evaluator | Use when |
| -------------------------------- | --------------------------------------------------- |
| `ExactMatchEval(expected=...)` | Output must exactly equal the expected string |
| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) |
| `NumericDiffEval(expected=...)` | Normalised numeric difference |
| `JSONDiffEval(expected=...)` | Structural JSON comparison |
| `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) |
| `ListContainsEval(expected=...)` | Output list contains expected items |
### LLM-as-judge (requires an OpenAI key or a compatible client)
| Evaluator | Use when |
| ----------------------------------------------------- | ----------------------------------------- |
| `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs reference |
| `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison |
| `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality |
| `TranslationEval(expected=..., language=..., ...)` | Translation quality |
| `PossibleEval(model=..., client=...)` | Output is feasible / plausible |
| `SecurityEval(model=..., client=...)` | No security vulnerabilities in output |
| `ModerationEval(threshold=..., client=...)` | Content moderation |
| `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison |
### RAG / retrieval
| Evaluator | Use when |
| ------------------------------------------------- | ------------------------------------------ |
| `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to query |
| `FaithfulnessEval(client=...)` | Answer is faithful to the provided context |
| `AnswerRelevancyEval(client=...)` | Answer addresses the question |
| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference |
### Custom evaluator template
```python
from pixie import Evaluation, Evaluable


async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input: what was passed to the observed function
    # evaluable.eval_output: what the function returned
    # evaluable.expected_output: reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
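
As a slightly fuller illustration of the template, the sketch below scores an output by keyword coverage and returns a fractional score with `details`. So that it runs without pixie installed, it uses two stand-in dataclasses, `EvaluationSketch` and `EvaluableSketch`, that copy only the fields listed in this reference. Those names, `keyword_coverage`, and `REQUIRED_KEYWORDS` are hypothetical; in real use the evaluator would take `pixie.Evaluable` and return `pixie.Evaluation`.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class EvaluationSketch:
    """Stand-in for pixie.Evaluation (score, reasoning, details)."""

    score: float
    reasoning: str
    details: dict = field(default_factory=dict)


@dataclass
class EvaluableSketch:
    """Stand-in for pixie.Evaluable, limited to the fields used here."""

    eval_input: Any
    eval_output: Any
    expected_output: Any = None


REQUIRED_KEYWORDS = ("refund", "receipt")  # hypothetical domain keywords


async def keyword_coverage(evaluable: EvaluableSketch, *, trace=None) -> EvaluationSketch:
    """Score is the fraction of required keywords present in the output."""
    text = str(evaluable.eval_output).lower()
    found = [kw for kw in REQUIRED_KEYWORDS if kw in text]
    score = len(found) / len(REQUIRED_KEYWORDS)
    return EvaluationSketch(
        score=score,
        reasoning=f"found {len(found)} of {len(REQUIRED_KEYWORDS)} keywords",
        details={"found": found},
    )


item = EvaluableSketch(
    eval_input={"question": "How do I get my money back?"},
    eval_output="Refunds need proof of purchase.",
)
result = asyncio.run(keyword_coverage(item))
```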
---
## Dataset Python API
```python
from pixie import DatasetStore, Evaluable
store = DatasetStore() # reads PIXIE_DATASET_DIR
store.create("my-dataset") # create empty
store.create("my-dataset", items=[...]) # create with items
store.append("my-dataset", Evaluable(...)) # add one item
store.get("my-dataset") # returns Dataset
store.list() # list names
store.remove("my-dataset", index=2) # remove by index
store.delete("my-dataset") # delete entirely
```
**`Evaluable` fields:**
- `eval_input`: the input (what `@observe` captured as function kwargs)
- `eval_output`: the output (return value of the observed function)
- `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
- `expected_output`: reference answer for comparison (`UNSET` if not provided)
---
## ObservationStore Python API
```python
from pixie import ObservationStore
store = ObservationStore() # reads PIXIE_DB_PATH
await store.create_tables()
# Read traces
await store.list_traces(limit=10, offset=0) # → list of trace summaries
await store.get_trace(trace_id) # → list[ObservationNode] (tree)
await store.get_root(trace_id) # → root ObserveSpan
await store.get_last_llm(trace_id) # → most recent LLMSpan
await store.get_by_name(name, trace_id=None) # → list of spans
# ObservationNode
node.to_text() # pretty-print span tree
node.find(name) # find a child span by name
node.children # list of child ObservationNode
node.span # the underlying span (ObserveSpan or LLMSpan)
```