# pixie API Reference ## Configuration All settings read from environment variables at call time. By default, every artefact lives inside a single `pixie_qa` project directory: | Variable | Default | Description | | ------------------- | -------------------------- | ---------------------------------- | | `PIXIE_ROOT` | `pixie_qa` | Root directory for all artefacts | | `PIXIE_DB_PATH` | `pixie_qa/observations.db` | SQLite database file path | | `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) | | `PIXIE_DATASET_DIR` | `pixie_qa/datasets` | Directory for dataset JSON files | --- ## Instrumentation API (`pixie`) ```python from pixie import enable_storage, observe, start_observation, flush, init, add_handler ``` | Function / Decorator | Signature | Notes | | -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- | | `enable_storage()` | `() → StorageHandler` | Idempotent. Creates DB, registers handler. Call at app startup. | | `init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. | | `observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. | | `start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. | | `flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. | | `add_handler` | `(handler) → None` | Register a custom handler (must call `init()` first). | --- ## CLI Commands ```bash # Dataset management pixie dataset create pixie dataset list pixie dataset save # root span (default) pixie dataset save --select last_llm_call # last LLM call pixie dataset save --select by_name --span-name pixie dataset save --notes "some note" echo '"expected value"' | pixie dataset save --expected-output # Run eval tests pixie test [path] [-k filter_substring] [-v] ``` **`pixie dataset save` selection modes:** - `root` (default) — the outermost `@observe` or `start_observation` span - `last_llm_call` — the most recent LLM API call span in the trace - `by_name` — a span matching the `--span-name` argument (takes the last matching span) --- ## Eval Harness (`pixie`) ```python from pixie import ( assert_dataset_pass, assert_pass, run_and_evaluate, evaluate, EvalAssertionError, Evaluation, ScoreThreshold, capture_traces, MemoryTraceHandler, last_llm_call, root, ) ``` ### Key functions **`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`** - Loads dataset by name, runs `assert_pass` with all items. - `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself. - `evaluators`: list of evaluator callables. - `pass_criteria`: defaults to `ScoreThreshold()` (all scores >= 0.5). - `from_trace`: `last_llm_call` or `root` — selects which span to evaluate. **`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`** - Same, but takes explicit inputs (and optionally `Evaluable` items for expected outputs). **`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`** - Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`. **`ScoreThreshold(threshold=0.5, pct=1.0)`** - `threshold`: min score per item (default 0.5). - `pct`: fraction of items that must meet threshold (default 1.0 = all). - Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7. **`Evaluation(score, reasoning, details={})`** — frozen result. `score` is 0.0–1.0. **`capture_traces()`** — context manager; use for in-memory trace capture without DB. **`last_llm_call(trace)`** / **`root(trace)`** — `from_trace` helpers. --- ## Evaluators ### Heuristic (no LLM needed) | Evaluator | Use when | | -------------------------------- | --------------------------------------------------- | | `ExactMatchEval(expected=...)` | Output must exactly equal the expected string | | `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) | | `NumericDiffEval(expected=...)` | Normalised numeric difference | | `JSONDiffEval(expected=...)` | Structural JSON comparison | | `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) | | `ListContainsEval(expected=...)` | Output list contains expected items | ### LLM-as-judge (require OpenAI key or compatible client) | Evaluator | Use when | | ----------------------------------------------------- | ----------------------------------------- | | `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs reference | | `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison | | `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality | | `TranslationEval(expected=..., language=..., ...)` | Translation quality | | `PossibleEval(model=..., client=...)` | Output is feasible / plausible | | `SecurityEval(model=..., client=...)` | No security vulnerabilities in output | | `ModerationEval(threshold=..., client=...)` | Content moderation | | `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison | ### RAG / retrieval | Evaluator | Use when | | ------------------------------------------------- | ------------------------------------------ | | `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to query | | `FaithfulnessEval(client=...)` | Answer is faithful to the provided context | | `AnswerRelevancyEval(client=...)` | Answer addresses the question | | `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference | ### Custom evaluator template ```python from pixie import Evaluation, Evaluable async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation: # evaluable.eval_input — what was passed to the observed function # evaluable.eval_output — what the function returned # evaluable.expected_output — reference answer (UNSET if not provided) score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0 return Evaluation(score=score, reasoning="...") ``` --- ## Dataset Python API ```python from pixie import DatasetStore, Evaluable store = DatasetStore() # reads PIXIE_DATASET_DIR store.create("my-dataset") # create empty store.create("my-dataset", items=[...]) # create with items store.append("my-dataset", Evaluable(...)) # add one item store.get("my-dataset") # returns Dataset store.list() # list names store.remove("my-dataset", index=2) # remove by index store.delete("my-dataset") # delete entirely ``` **`Evaluable` fields:** - `eval_input`: the input (what `@observe` captured as function kwargs) - `eval_output`: the output (return value of the observed function) - `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id` - `expected_output`: reference answer for comparison (`UNSET` if not provided) --- ## ObservationStore Python API ```python from pixie import ObservationStore store = ObservationStore() # reads PIXIE_DB_PATH await store.create_tables() # Read traces await store.list_traces(limit=10, offset=0) # → list of trace summaries await store.get_trace(trace_id) # → list[ObservationNode] (tree) await store.get_root(trace_id) # → root ObserveSpan await store.get_last_llm(trace_id) # → most recent LLMSpan await store.get_by_name(name, trace_id=None) # → list of spans # ObservationNode node.to_text() # pretty-print span tree node.find(name) # find a child span by name node.children # list of child ObservationNode node.span # the underlying span (ObserveSpan or LLMSpan) ```