Add eval-driven development skill (#1013)

* Add eval-driven development skill

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Yiou Li
2026-03-15 16:48:08 -07:00
committed by GitHub
parent 2d863d0af0
commit 47f544a09c
3 changed files with 813 additions and 0 deletions

@@ -0,0 +1,195 @@
# pixie API Reference
## Configuration
All settings are read from environment variables at call time. By default,
every artefact lives inside a single `pixie_qa` project directory:
| Variable | Default | Description |
| ------------------- | -------------------------- | ---------------------------------- |
| `PIXIE_ROOT` | `pixie_qa` | Root directory for all artefacts |
| `PIXIE_DB_PATH` | `pixie_qa/observations.db` | SQLite database file path |
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently sqlite) |
| `PIXIE_DATASET_DIR` | `pixie_qa/datasets` | Directory for dataset JSON files |
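
The "read at call time" behaviour can be illustrated with a minimal sketch. It assumes nothing about pixie's internals: `read_pixie_settings` and `PIXIE_DEFAULTS` are hypothetical names, and only the variable names and defaults are taken from the table above. Whether pixie derives the other paths from `PIXIE_ROOT` is not stated, so each variable is resolved independently here.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the configuration table; the dict itself is hypothetical.
PIXIE_DEFAULTS: Dict[str, str] = {
    "PIXIE_ROOT": "pixie_qa",
    "PIXIE_DB_PATH": "pixie_qa/observations.db",
    "PIXIE_DB_ENGINE": "sqlite",
    "PIXIE_DATASET_DIR": "pixie_qa/datasets",
}


def read_pixie_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: look up every setting when called, not at import time."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in PIXIE_DEFAULTS.items()}


# Pass an explicit mapping so the example does not depend on your shell.
settings = read_pixie_settings({"PIXIE_DATASET_DIR": "/tmp/my_datasets"})
print(settings["PIXIE_DATASET_DIR"])  # /tmp/my_datasets
print(settings["PIXIE_DB_PATH"])      # pixie_qa/observations.db
```
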
---
## Instrumentation API (`pixie`)
```python
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
```
| Function / Decorator | Signature | Notes |
| -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- |
| `enable_storage()` | `() → StorageHandler` | Idempotent. Creates DB, registers handler. Call at app startup. |
| `init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. |
| `observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
| `start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
| `flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. |
| `add_handler` | `(handler) → None` | Register a custom handler (must call `init()` first). |
---
## CLI Commands
```bash
# Dataset management
pixie dataset create <name>
pixie dataset list
pixie dataset save <name> # root span (default)
pixie dataset save <name> --select last_llm_call # last LLM call
pixie dataset save <name> --select by_name --span-name <name>
pixie dataset save <name> --notes "some note"
echo '"expected value"' | pixie dataset save <name> --expected-output
# Run eval tests
pixie test [path] [-k filter_substring] [-v]
```
**`pixie dataset save` selection modes:**
- `root` (default) — the outermost `@observe` or `start_observation` span
- `last_llm_call` — the most recent LLM API call span in the trace
- `by_name` — a span matching the `--span-name` argument (takes the last matching span)
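
The three selection rules can be sketched over a toy trace. This is an illustration only: the span dictionaries, their field names (`id`, `kind`, `parent`), and `select_span` are all hypothetical and do not reflect how pixie stores or queries spans.

```python
from typing import Any, Dict, List, Optional

Span = Dict[str, Any]

# Hypothetical flattened trace, ordered by start time.
trace: List[Span] = [
    {"id": 1, "name": "answer_question", "kind": "observe", "parent": None},
    {"id": 2, "name": "retrieve", "kind": "observe", "parent": 1},
    {"id": 3, "name": "chat_completion", "kind": "llm", "parent": 2},
    {"id": 4, "name": "retrieve", "kind": "observe", "parent": 1},
    {"id": 5, "name": "chat_completion", "kind": "llm", "parent": 1},
]


def select_span(spans: List[Span], mode: str = "root",
                span_name: Optional[str] = None) -> Optional[Span]:
    """Pick one span following the documented selection modes."""
    if mode == "root":
        # The outermost span is the one without a parent.
        candidates = [s for s in spans if s["parent"] is None]
        return candidates[0] if candidates else None
    if mode == "last_llm_call":
        candidates = [s for s in spans if s["kind"] == "llm"]
    elif mode == "by_name":
        candidates = [s for s in spans if s["name"] == span_name]
    else:
        raise ValueError(f"unknown selection mode: {mode!r}")
    # Both remaining modes take the last match.
    return candidates[-1] if candidates else None


print(select_span(trace)["id"])                                   # 1
print(select_span(trace, "last_llm_call")["id"])                  # 5
print(select_span(trace, "by_name", span_name="retrieve")["id"])  # 4
```
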
---
## Eval Harness (`pixie`)
```python
from pixie import (
    assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
    EvalAssertionError, Evaluation, ScoreThreshold,
    capture_traces, MemoryTraceHandler,
    last_llm_call, root,
)
```
### Key functions
**`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`**
- Loads dataset by name, runs `assert_pass` with all items.
- `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
- `evaluators`: list of evaluator callables.
- `pass_criteria`: defaults to `ScoreThreshold()` (all scores >= 0.5).
- `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.
**`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`**
- Same as `assert_dataset_pass`, but takes explicit inputs instead of a dataset name (and optionally `Evaluable` items for expected outputs).
**`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`**
- Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`.
**`ScoreThreshold(threshold=0.5, pct=1.0)`**
- `threshold`: min score per item (default 0.5).
- `pct`: fraction of items that must meet threshold (default 1.0 = all).
- Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.
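
The pass rule described above can be written out as a small sketch. It is a simplified reading of the documented semantics, not pixie's implementation: `meets_score_threshold` is a hypothetical function that takes one score per item and ignores multiple evaluators and multiple passes.

```python
from typing import Sequence


def meets_score_threshold(scores: Sequence[float],
                          threshold: float = 0.5,
                          pct: float = 1.0) -> bool:
    """Return True when at least `pct` of the scores are >= `threshold`."""
    if not scores:
        raise ValueError("at least one score is required")
    passing = sum(1 for score in scores if score >= threshold)
    return passing / len(scores) >= pct


scores = [0.9, 0.8, 0.75, 0.7, 0.4]
# 4 of 5 items score >= 0.7, which meets the 80% requirement.
print(meets_score_threshold(scores, 0.7, pct=0.8))  # True
# With the defaults every item must score >= 0.5, and 0.4 does not.
print(meets_score_threshold(scores))                # False
```
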
**`Evaluation(score, reasoning, details={})`**: frozen result. `score` ranges from 0.0 to 1.0.
**`capture_traces()`** — context manager; use for in-memory trace capture without DB.
**`last_llm_call(trace)`** / **`root(trace)`** — `from_trace` helpers.
---
## Evaluators
### Heuristic (no LLM needed)
| Evaluator | Use when |
| -------------------------------- | --------------------------------------------------- |
| `ExactMatchEval(expected=...)` | Output must exactly equal the expected string |
| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) |
| `NumericDiffEval(expected=...)` | Normalised numeric difference |
| `JSONDiffEval(expected=...)` | Structural JSON comparison |
| `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) |
| `ListContainsEval(expected=...)` | Output list contains expected items |
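
To give a feel for what an edit-distance score measures, here is a self-contained sketch of a normalised Levenshtein similarity. It is a generic textbook formulation; whether `LevenshteinMatch` normalises by the longer string in the same way is an assumption, and both function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(output: str, expected: str) -> float:
    """Map the distance to a 0.0 to 1.0 score, where 1.0 means identical strings."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(output, expected) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_similarity("abc", "abc"))       # 1.0
```
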
### LLM-as-judge (requires an OpenAI key or a compatible client)
| Evaluator | Use when |
| ----------------------------------------------------- | ----------------------------------------- |
| `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs reference |
| `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison |
| `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality |
| `TranslationEval(expected=..., language=..., ...)` | Translation quality |
| `PossibleEval(model=..., client=...)` | Output is feasible / plausible |
| `SecurityEval(model=..., client=...)` | No security vulnerabilities in output |
| `ModerationEval(threshold=..., client=...)` | Content moderation |
| `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison |
### RAG / retrieval
| Evaluator | Use when |
| ------------------------------------------------- | ------------------------------------------ |
| `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to query |
| `FaithfulnessEval(client=...)` | Answer is faithful to the provided context |
| `AnswerRelevancyEval(client=...)` | Answer addresses the question |
| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference |
### Custom evaluator template
```python
from pixie import Evaluation, Evaluable


async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input: what was passed to the observed function
    # evaluable.eval_output: what the function returned
    # evaluable.expected_output: reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
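
An evaluator with this shape can be exercised on its own before it is wired into a test. The sketch below runs without pixie installed: `Evaluable`, `Evaluation`, and `UNSET` are minimal stand-ins that only mirror the fields named in this reference, and `contains_expected` is a hypothetical evaluator. With pixie available you would import the real classes instead.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict

UNSET = object()  # stand-in for pixie's UNSET sentinel


@dataclass(frozen=True)
class Evaluable:  # stand-in with the fields documented in this reference
    eval_input: Any
    eval_output: Any
    expected_output: Any = UNSET


@dataclass(frozen=True)
class Evaluation:  # stand-in for the frozen result type
    score: float
    reasoning: str
    details: Dict[str, Any] = field(default_factory=dict)


async def contains_expected(evaluable: Evaluable, *, trace=None) -> Evaluation:
    """Score 1.0 when the expected answer appears in the output, else 0.0."""
    if evaluable.expected_output is UNSET:
        return Evaluation(score=0.0, reasoning="no expected_output provided")
    found = str(evaluable.expected_output) in str(evaluable.eval_output)
    reasoning = "expected text found" if found else "expected text missing"
    return Evaluation(score=1.0 if found else 0.0, reasoning=reasoning)


item = Evaluable(
    eval_input={"question": "What is the capital of France?"},
    eval_output="The capital of France is Paris.",
    expected_output="Paris",
)
result = asyncio.run(contains_expected(item))
print(result.score, result.reasoning)  # 1.0 expected text found
```
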
---
## Dataset Python API
```python
from pixie import DatasetStore, Evaluable
store = DatasetStore() # reads PIXIE_DATASET_DIR
store.create("my-dataset") # create empty
store.create("my-dataset", items=[...]) # create with items
store.append("my-dataset", Evaluable(...)) # add one item
store.get("my-dataset") # returns Dataset
store.list() # list names
store.remove("my-dataset", index=2) # remove by index
store.delete("my-dataset") # delete entirely
```
**`Evaluable` fields:**
- `eval_input`: the input (what `@observe` captured as function kwargs)
- `eval_output`: the output (return value of the observed function)
- `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
- `expected_output`: reference answer for comparison (`UNSET` if not provided)
---
## ObservationStore Python API
```python
from pixie import ObservationStore
store = ObservationStore() # reads PIXIE_DB_PATH
await store.create_tables()
# Read traces
await store.list_traces(limit=10, offset=0) # → list of trace summaries
await store.get_trace(trace_id) # → list[ObservationNode] (tree)
await store.get_root(trace_id) # → root ObserveSpan
await store.get_last_llm(trace_id) # → most recent LLMSpan
await store.get_by_name(name, trace_id=None) # → list of spans
# ObservationNode
node.to_text() # pretty-print span tree
node.find(name) # find a child span by name
node.children # list of child ObservationNode
node.span # the underlying span (ObserveSpan or LLMSpan)
```