update eval-driven-dev skill (#1434)

* update eval-driven-dev skill

* fix: update skill update command to use correct repository path

* address comments.

* update eval driven dev
Yiou Li
2026-04-27 18:27:48 -07:00
committed by GitHub
parent 9933f65e6b
commit 2860790bc9
23 changed files with 1881 additions and 700 deletions


@@ -1,7 +1,7 @@
 # Built-in Evaluators
 > Auto-generated from pixie source code docstrings.
-> Do not edit by hand — regenerate from the upstream [pixie-qa](https://github.com/yiouli/pixie-qa) source repository.
+> Do not edit by hand — run `uv run python scripts/generate_skill_docs.py`.
 Autoevals adapters — pre-made evaluators wrapping `autoevals` scorers.
@@ -30,13 +30,14 @@ Public API (all are also re-exported from `pixie.evals`):
 Choose evaluators based on the **output type** and eval criteria:
-| Output type | Evaluator category | Examples |
-| -------------------------------------------- | ----------------------------------------------------------- | ------------------------------------- |
-| Deterministic (labels, yes/no, fixed-format) | Heuristic: `ExactMatch`, `JSONDiff`, `ValidJSON` | Label classification, JSON extraction |
-| Open-ended text with a reference answer | LLM-as-judge: `Factuality`, `ClosedQA`, `AnswerCorrectness` | Chatbot responses, QA, summaries |
-| Text with expected context/grounding | RAG: `Faithfulness`, `ContextRelevancy` | RAG pipelines |
-| Text with style/format requirements | Custom via `create_llm_evaluator` | Voice-friendly responses, tone checks |
-| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |
+| Output type | Evaluator category | Examples |
+| -------------------------------------------- | ----------------------------------------------------------- | -------------------------------------- |
+| Deterministic (labels, yes/no, fixed-format) | Heuristic: `ExactMatch`, `JSONDiff`, `ValidJSON` | Label classification, JSON extraction |
+| Open-ended text with a reference answer | LLM-as-judge: `Factuality`, `ClosedQA`, `AnswerCorrectness` | Chatbot responses, QA, summaries |
+| Text with expected context/grounding | RAG: `Faithfulness`, `ContextRelevancy` | RAG pipelines |
+| Text with style/format requirements | Custom via `create_llm_evaluator` | Voice-friendly responses, tone checks |
+| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |
+| Trace-dependent quality (tool use, routing) | Agent evaluator via `create_agent_evaluator` | Tool correctness, multi-step reasoning |
 Critical rules:
@@ -529,3 +530,56 @@ An evaluator callable satisfying the `Evaluator` protocol.
 Raises:
     ValueError: If the template uses nested field access like
     `{eval_input[key]}` (only top-level placeholders are supported).
+### `create_agent_evaluator`
+```python
+create_agent_evaluator(name: 'str', criteria: 'str') -> '_AgentEvaluator'
+```
+Create an evaluator whose grading is deferred to a coding agent.
+During `pixie test`, agent evaluators are not scored automatically.
+Instead, they raise `AgentEvaluationPending` and record a
+`PendingEvaluation` with the evaluation criteria. The coding agent
+(guided by Step 6) reviews each entry's trace and output, then
+grades the pending evaluations.
+**When to use**: Quality dimensions that require holistic review of
+the LLM trace — tool call correctness, multi-step reasoning quality,
+routing decisions — where an automated LLM-as-judge prompt can't
+capture the nuance.
+**When NOT to use**: Simple text quality checks (use
+`create_llm_evaluator` instead), deterministic checks (use heuristic
+evaluators), or any criterion that can be scored from input + output
+alone without trace context.
+Args:
+    name: Display name for the evaluator (shown in scorecard as ⏳ pending).
+    criteria: What to evaluate — the grading instructions the agent
+        will follow when reviewing results. Be specific and actionable.
+Returns:
+    An evaluator callable satisfying the `Evaluator` protocol. Its
+    `__call__` raises `AgentEvaluationPending` instead of returning an
+    `Evaluation`.
+Example:
+```python
+from pixie import create_agent_evaluator
+
+ResponseQuality = create_agent_evaluator(
+    name="ResponseQuality",
+    criteria="The response directly addresses the user's question with "
+    "accurate, well-structured information. No hallucinations "
+    "or off-topic content.",
+)
+
+ToolUsageCorrectness = create_agent_evaluator(
+    name="ToolUsageCorrectness",
+    criteria="The app called the correct tools in the right order based "
+    "on the user's intent. No unnecessary or missed tool calls.",
+)
+```
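The deferred-grading contract described in the docstring can be illustrated with a small, self-contained sketch. This is a stdlib-only mock, not pixie's implementation: the names `AgentEvaluationPending`, `PendingEvaluation`, and `_AgentEvaluator` mirror the docstring, but the recording mechanism here (a per-evaluator list) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


class AgentEvaluationPending(Exception):
    """Signals that grading is deferred to the coding agent."""


@dataclass
class PendingEvaluation:
    name: str
    criteria: str


@dataclass
class _AgentEvaluator:
    name: str
    criteria: str
    pending: list = field(default_factory=list)

    def __call__(self, eval_input, output):
        # Record the criteria for later review, then raise instead of
        # returning an Evaluation — the test runner shows it as pending.
        self.pending.append(PendingEvaluation(self.name, self.criteria))
        raise AgentEvaluationPending(self.name)


def create_agent_evaluator(name: str, criteria: str) -> _AgentEvaluator:
    return _AgentEvaluator(name=name, criteria=criteria)


quality = create_agent_evaluator("ResponseQuality", "Addresses the question.")
try:
    quality({"question": "hi"}, "hello")
except AgentEvaluationPending:
    print(f"pending: {quality.pending[0].name}")
```

The point of the pattern is that the evaluator still satisfies the `Evaluator` callable protocol, so it plugs into the same scorecard pipeline; the raised exception is what distinguishes "graded later by the agent" from "scored now".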