Add eval-driven development skill (#1013)

* Add eval-driven development skill * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-01 20:55:55 +00:00 · 2026-03-15 16:48:08 -07:00
parent 2d863d0af0
commit 47f544a09c
3 changed files with 813 additions and 0 deletions
--- a/skills/eval-driven-dev/references/pixie-api.md
+++ b/skills/eval-driven-dev/references/pixie-api.md
@@ -0,0 +1,195 @@
+# pixie API Reference
+
+## Configuration
+
+All settings read from environment variables at call time. By default,
+every artefact lives inside a single `pixie_qa` project directory:
+
+| Variable            | Default                    | Description                        |
+| ------------------- | -------------------------- | ---------------------------------- |
+| `PIXIE_ROOT`        | `pixie_qa`                 | Root directory for all artefacts   |
+| `PIXIE_DB_PATH`     | `pixie_qa/observations.db` | SQLite database file path          |
+| `PIXIE_DB_ENGINE`   | `sqlite`                   | Database engine (currently sqlite) |
+| `PIXIE_DATASET_DIR` | `pixie_qa/datasets`        | Directory for dataset JSON files   |
+
+---
+
+## Instrumentation API (`pixie`)
+
+```python
+from pixie import enable_storage, observe, start_observation, flush, init, add_handler
+```
+
+| Function / Decorator | Signature                                                    | Notes                                                                                               |
+| -------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- |
+| `enable_storage()`   | `() → StorageHandler`                                        | Idempotent. Creates DB, registers handler. Call at app startup.                                     |
+| `init()`             | `(*, capture_content=True, queue_size=1000) → None`          | Called internally by `enable_storage`. Idempotent.                                                  |
+| `observe`            | `(name=None) → decorator`                                    | Wraps a sync or async function. Captures all kwargs as `eval_input`, return value as `eval_output`. |
+| `start_observation`  | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside.                    |
+| `flush`              | `(timeout_seconds=5.0) → bool`                               | Drains the queue. Call after a run before using CLI commands.                                       |
+| `add_handler`        | `(handler) → None`                                           | Register a custom handler (must call `init()` first).                                               |
+
+---
+
+## CLI Commands
+
+```bash
+# Dataset management
+pixie dataset create <name>
+pixie dataset list
+pixie dataset save <name>                              # root span (default)
+pixie dataset save <name> --select last_llm_call       # last LLM call
+pixie dataset save <name> --select by_name --span-name <name>
+pixie dataset save <name> --notes "some note"
+echo '"expected value"' | pixie dataset save <name> --expected-output
+
+# Run eval tests
+pixie test [path] [-k filter_substring] [-v]
+```
+
+**`pixie dataset save` selection modes:**
+
+- `root` (default) — the outermost `@observe` or `start_observation` span
+- `last_llm_call` — the most recent LLM API call span in the trace
+- `by_name` — a span matching the `--span-name` argument (takes the last matching span)
+
+---
+
+## Eval Harness (`pixie`)
+
+```python
+from pixie import (
+    assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
+    EvalAssertionError, Evaluation, ScoreThreshold,
+    capture_traces, MemoryTraceHandler,
+    last_llm_call, root,
+)
+```
+
+### Key functions
+
+**`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`**
+
+- Loads dataset by name, runs `assert_pass` with all items.
+- `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
+- `evaluators`: list of evaluator callables.
+- `pass_criteria`: defaults to `ScoreThreshold()` (all scores >= 0.5).
+- `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.
+
+**`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`**
+
+- Same, but takes explicit inputs (and optionally `Evaluable` items for expected outputs).
+
+**`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`**
+
+- Runs `runnable(eval_input)`, captures traces, evaluates. Returns one `Evaluation`.
+
+**`ScoreThreshold(threshold=0.5, pct=1.0)`**
+
+- `threshold`: min score per item (default 0.5).
+- `pct`: fraction of items that must meet threshold (default 1.0 = all).
+- Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.
+
+**`Evaluation(score, reasoning, details={})`** — frozen result. `score` is 0.0–1.0.
+
+**`capture_traces()`** — context manager; use for in-memory trace capture without DB.
+
+**`last_llm_call(trace)`** / **`root(trace)`** — `from_trace` helpers.
+
+---
+
+## Evaluators
+
+### Heuristic (no LLM needed)
+
+| Evaluator                        | Use when                                            |
+| -------------------------------- | --------------------------------------------------- |
+| `ExactMatchEval(expected=...)`   | Output must exactly equal the expected string       |
+| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance)           |
+| `NumericDiffEval(expected=...)`  | Normalised numeric difference                       |
+| `JSONDiffEval(expected=...)`     | Structural JSON comparison                          |
+| `ValidJSONEval(schema=None)`     | Output is valid JSON (optionally matching a schema) |
+| `ListContainsEval(expected=...)` | Output list contains expected items                 |
+
+### LLM-as-judge (require OpenAI key or compatible client)
+
+| Evaluator                                             | Use when                                  |
+| ----------------------------------------------------- | ----------------------------------------- |
+| `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs reference |
+| `ClosedQAEval(expected=..., model=..., client=...)`   | Closed-book QA comparison                 |
+| `SummaryEval(expected=..., model=..., client=...)`    | Summarisation quality                     |
+| `TranslationEval(expected=..., language=..., ...)`    | Translation quality                       |
+| `PossibleEval(model=..., client=...)`                 | Output is feasible / plausible            |
+| `SecurityEval(model=..., client=...)`                 | No security vulnerabilities in output     |
+| `ModerationEval(threshold=..., client=...)`           | Content moderation                        |
+| `BattleEval(expected=..., model=..., client=...)`     | Head-to-head comparison                   |
+
+### RAG / retrieval
+
+| Evaluator                                         | Use when                                   |
+| ------------------------------------------------- | ------------------------------------------ |
+| `ContextRelevancyEval(expected=..., client=...)`  | Retrieved context is relevant to query     |
+| `FaithfulnessEval(client=...)`                    | Answer is faithful to the provided context |
+| `AnswerRelevancyEval(client=...)`                 | Answer addresses the question              |
+| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs reference             |
+
+### Custom evaluator template
+
+```python
+from pixie import Evaluation, Evaluable
+
+async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
+    # evaluable.eval_input  — what was passed to the observed function
+    # evaluable.eval_output — what the function returned
+    # evaluable.expected_output — reference answer (UNSET if not provided)
+    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
+    return Evaluation(score=score, reasoning="...")
+```
+
+---
+
+## Dataset Python API
+
+```python
+from pixie import DatasetStore, Evaluable
+
+store = DatasetStore()                               # reads PIXIE_DATASET_DIR
+store.create("my-dataset")                          # create empty
+store.create("my-dataset", items=[...])             # create with items
+store.append("my-dataset", Evaluable(...))          # add one item
+store.get("my-dataset")                             # returns Dataset
+store.list()                                        # list names
+store.remove("my-dataset", index=2)                 # remove by index
+store.delete("my-dataset")                          # delete entirely
+```
+
+**`Evaluable` fields:**
+
+- `eval_input`: the input (what `@observe` captured as function kwargs)
+- `eval_output`: the output (return value of the observed function)
+- `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
+- `expected_output`: reference answer for comparison (`UNSET` if not provided)
+
+---
+
+## ObservationStore Python API
+
+```python
+from pixie import ObservationStore
+
+store = ObservationStore()   # reads PIXIE_DB_PATH
+await store.create_tables()
+
+# Read traces
+await store.list_traces(limit=10, offset=0)         # → list of trace summaries
+await store.get_trace(trace_id)                     # → list[ObservationNode] (tree)
+await store.get_root(trace_id)                      # → root ObserveSpan
+await store.get_last_llm(trace_id)                  # → most recent LLMSpan
+await store.get_by_name(name, trace_id=None)        # → list of spans
+
+# ObservationNode
+node.to_text()          # pretty-print span tree
+node.find(name)         # find a child span by name
+node.children           # list of child ObservationNode
+node.span               # the underlying span (ObserveSpan or LLMSpan)
+```