# pixie API Reference

## Configuration

All settings are read from environment variables at call time. By default,
every artefact lives inside a single `pixie_qa` project directory:
| Variable | Default | Description |
|---|---|---|
| `PIXIE_ROOT` | `pixie_qa` | Root directory for all artefacts |
| `PIXIE_DB_PATH` | `pixie_qa/observations.db` | SQLite database file path |
| `PIXIE_DB_ENGINE` | `sqlite` | Database engine (currently `sqlite`) |
| `PIXIE_DATASET_DIR` | `pixie_qa/datasets` | Directory for dataset JSON files |
## Instrumentation API (pixie)

```python
from pixie import enable_storage, observe, start_observation, flush, init, add_handler
```

| Function / Decorator | Signature | Notes |
|---|---|---|
| `enable_storage()` | `() → StorageHandler` | Idempotent. Creates the DB and registers the handler. Call at app startup. |
| `init()` | `(*, capture_content=True, queue_size=1000) → None` | Called internally by `enable_storage`. Idempotent. |
| `observe` | `(name=None) → decorator` | Wraps a sync or async function. Captures all kwargs as `eval_input` and the return value as `eval_output`. |
| `start_observation` | `(*, input, name=None) → ContextManager[ObservationContext]` | Manual span. Call `obs.set_output(v)` and `obs.set_metadata(key, value)` inside. |
| `flush` | `(timeout_seconds=5.0) → bool` | Drains the queue. Call after a run before using CLI commands. |
| `add_handler` | `(handler) → None` | Registers a custom handler (must call `init()` first). |
## CLI Commands

```shell
# Dataset management
pixie dataset create <name>
pixie dataset list
pixie dataset save <name>                                # root span (default)
pixie dataset save <name> --select last_llm_call         # last LLM call
pixie dataset save <name> --select by_name --span-name <name>
pixie dataset save <name> --notes "some note"
echo '"expected value"' | pixie dataset save <name> --expected-output

# Run eval tests
pixie test [path] [-k filter_substring] [-v]
```

`pixie dataset save` selection modes:

- `root` (default) — the outermost `@observe` or `start_observation` span
- `last_llm_call` — the most recent LLM API call span in the trace
- `by_name` — a span matching the `--span-name` argument (takes the last matching span)
## Eval Harness (pixie)

```python
from pixie import (
    assert_dataset_pass, assert_pass, run_and_evaluate, evaluate,
    EvalAssertionError, Evaluation, ScoreThreshold,
    capture_traces, MemoryTraceHandler,
    last_llm_call, root,
)
```
### Key functions

`assert_dataset_pass(runnable, dataset_name, evaluators, *, dataset_dir=None, passes=1, pass_criteria=None, from_trace=None)`

- Loads the dataset by name and runs `assert_pass` with all items.
- `runnable`: callable `(eval_input) → None` (sync or async). Must instrument itself.
- `evaluators`: list of evaluator callables.
- `pass_criteria`: defaults to `ScoreThreshold()` (all scores ≥ 0.5).
- `from_trace`: `last_llm_call` or `root` — selects which span to evaluate.

`assert_pass(runnable, eval_inputs, evaluators, *, evaluables=None, passes=1, pass_criteria=None, from_trace=None)`

- Same, but takes explicit inputs (and optionally `Evaluable` items for expected outputs).

`run_and_evaluate(evaluator, runnable, eval_input, *, expected_output=..., from_trace=None)`

- Runs `runnable(eval_input)`, captures traces, and evaluates. Returns one `Evaluation`.

`ScoreThreshold(threshold=0.5, pct=1.0)`

- `threshold`: minimum score per item (default 0.5).
- `pct`: fraction of items that must meet the threshold (default 1.0 = all).
- Example: `ScoreThreshold(0.7, pct=0.8)` = 80% of cases must score ≥ 0.7.
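The threshold/pct arithmetic can be illustrated in plain Python. This is a sketch of the semantics described above, not pixie's implementation, and `meets_criteria` is a hypothetical helper name:

```python
def meets_criteria(scores, threshold=0.5, pct=1.0):
    """Return True when at least `pct` of scores are >= `threshold`."""
    passing = sum(1 for s in scores if s >= threshold)
    return passing >= pct * len(scores)

# ScoreThreshold() default: every item must score >= 0.5.
print(meets_criteria([0.6, 0.9, 0.5]))   # True
print(meets_criteria([0.6, 0.4, 0.9]))   # False (one item below 0.5)

# ScoreThreshold(0.7, pct=0.8): 80% of cases must score >= 0.7.
print(meets_criteria([0.9, 0.8, 0.75, 0.7, 0.2], threshold=0.7, pct=0.8))  # True
```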
`Evaluation(score, reasoning, details={})` — frozen result. `score` is 0.0–1.0.

`capture_traces()` — context manager; use for in-memory trace capture without the DB.

`last_llm_call(trace)` / `root(trace)` — `from_trace` helpers.
## Evaluators

### Heuristic (no LLM needed)

| Evaluator | Use when |
|---|---|
| `ExactMatchEval(expected=...)` | Output must exactly equal the expected string |
| `LevenshteinMatch(expected=...)` | Partial string similarity (edit distance) |
| `NumericDiffEval(expected=...)` | Normalised numeric difference |
| `JSONDiffEval(expected=...)` | Structural JSON comparison |
| `ValidJSONEval(schema=None)` | Output is valid JSON (optionally matching a schema) |
| `ListContainsEval(expected=...)` | Output list contains the expected items |
### LLM-as-judge (require an OpenAI key or compatible client)

| Evaluator | Use when |
|---|---|
| `FactualityEval(expected=..., model=..., client=...)` | Output is factually accurate vs the reference |
| `ClosedQAEval(expected=..., model=..., client=...)` | Closed-book QA comparison |
| `SummaryEval(expected=..., model=..., client=...)` | Summarisation quality |
| `TranslationEval(expected=..., language=..., ...)` | Translation quality |
| `PossibleEval(model=..., client=...)` | Output is feasible / plausible |
| `SecurityEval(model=..., client=...)` | No security vulnerabilities in output |
| `ModerationEval(threshold=..., client=...)` | Content moderation |
| `BattleEval(expected=..., model=..., client=...)` | Head-to-head comparison |
### RAG / retrieval

| Evaluator | Use when |
|---|---|
| `ContextRelevancyEval(expected=..., client=...)` | Retrieved context is relevant to the query |
| `FaithfulnessEval(client=...)` | Answer is faithful to the provided context |
| `AnswerRelevancyEval(client=...)` | Answer addresses the question |
| `AnswerCorrectnessEval(expected=..., client=...)` | Answer is correct vs the reference |
### Custom evaluator template

```python
from pixie import Evaluation, Evaluable

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # evaluable.eval_input — what was passed to the observed function
    # evaluable.eval_output — what the function returned
    # evaluable.expected_output — reference answer (UNSET if not provided)
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
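Because an evaluator is just an async callable, its scoring logic can be exercised on its own. The sketch below uses minimal hypothetical stand-ins for `Evaluation` and `Evaluable` (not pixie's real classes) so the template's logic runs without pixie installed:

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical minimal stand-ins, shaped after the fields documented above.
@dataclass(frozen=True)
class Evaluation:
    score: float
    reasoning: str
    details: dict = field(default_factory=dict)

@dataclass
class Evaluable:
    eval_input: object = None
    eval_output: object = None
    expected_output: object = None

async def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # Same substring check as the template above.
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="substring check")

result = asyncio.run(my_evaluator(Evaluable(eval_output="... expected pattern ...")))
print(result.score)  # 1.0
```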
## Dataset Python API

```python
from pixie import DatasetStore, Evaluable

store = DatasetStore()                       # reads PIXIE_DATASET_DIR
store.create("my-dataset")                   # create empty
store.create("my-dataset", items=[...])      # create with items
store.append("my-dataset", Evaluable(...))   # add one item
store.get("my-dataset")                      # returns Dataset
store.list()                                 # list names
store.remove("my-dataset", index=2)          # remove by index
store.delete("my-dataset")                   # delete entirely
```
`Evaluable` fields:

- `eval_input`: the input (what `@observe` captured as function kwargs)
- `eval_output`: the output (return value of the observed function)
- `eval_metadata`: dict of extra info (trace_id, span_id, provider, token counts, etc.) — always includes `trace_id` and `span_id`
- `expected_output`: reference answer for comparison (`UNSET` if not provided)
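Assuming the on-disk dataset format simply mirrors the `Evaluable` fields (a hypothetical layout, not confirmed by this reference; the question, answer, and "..." IDs are placeholders), one saved item might look like:

```json
{
  "eval_input": {"question": "What is 2 + 2?"},
  "eval_output": "4",
  "eval_metadata": {"trace_id": "...", "span_id": "..."},
  "expected_output": "4"
}
```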
## ObservationStore Python API

```python
from pixie import ObservationStore

store = ObservationStore()                     # reads PIXIE_DB_PATH
await store.create_tables()

# Read traces
await store.list_traces(limit=10, offset=0)    # → list of trace summaries
await store.get_trace(trace_id)                # → list[ObservationNode] (tree)
await store.get_root(trace_id)                 # → root ObserveSpan
await store.get_last_llm(trace_id)             # → most recent LLMSpan
await store.get_by_name(name, trace_id=None)   # → list of spans

# ObservationNode
node.to_text()    # pretty-print span tree
node.find(name)   # find a child span by name
node.children     # list of child ObservationNode
node.span         # the underlying span (ObserveSpan or LLMSpan)
```