update eval-driven-dev skill (#1434)

* update eval-driven-dev skill

* fix: update skill update command to use correct repository path

* address comments.

* update eval driven dev
Yiou Li, 2026-04-27 18:27:48 -07:00 (committed by GitHub)
parent 9933f65e6b
commit 2860790bc9
23 changed files with 1881 additions and 700 deletions


@@ -6,81 +6,70 @@
## 3a. Map criteria to evaluators
**Every eval criterion from Step 1c — including any dimensions specified by the user in the prompt — must have a corresponding evaluator.** If the user asked for "factuality, completeness, and bias," you need three evaluators (or a multi-criteria evaluator that covers all three). Do not silently drop any requested dimension. Prioritize evaluators that measure the **hard problems / failure modes** identified in `pixie_qa/00-project-analysis.md` — these are more valuable than generic quality evaluators.
For each eval criterion, choose an evaluator using this decision order:
1. **Built-in evaluator** — if a standard evaluator fits the criterion (factual correctness → `Factuality`, exact match → `ExactMatch`, RAG faithfulness → `Faithfulness`). See `evaluators.md` for the full catalog.
2. **Agent evaluator** (`create_agent_evaluator`) — **the default for all semantic, qualitative, and app-specific criteria**. Agent evaluators are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?"
3. **Manual custom evaluator** — ONLY for **mechanical, deterministic checks** where a programmatic function is definitively correct: field existence, regex pattern matching, JSON schema validation, numeric thresholds, type checking. **Never use manual custom evaluators for semantic quality** — if the check requires _judgment_ about whether content is correct, relevant, or complete, use an agent evaluator instead.
**Distinguish structural from semantic criteria**: For each criterion, ask: "Can this be checked with a simple programmatic rule that always gives the right answer?" If yes → manual custom evaluator. If no → agent evaluator. Most app-specific quality criteria are semantic, not structural.
For open-ended LLM text, **never** use `ExactMatch` — LLM outputs are non-deterministic.
`AnswerRelevancy` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance, use an agent evaluator with clear criteria.
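A minimal sketch of such a relevance check, using the `create_agent_evaluator` factory described in Step 3b below (the name and criteria wording are illustrative, not prescribed):
```python
from pixie import create_agent_evaluator

# Illustrative general-relevance evaluator; adapt the criteria to your app.
response_relevance = create_agent_evaluator(
    name="ResponseRelevance",
    criteria="The response directly addresses the user's question and satisfies the "
    "entry's expectation. Off-topic, evasive, or contradictory responses fail; "
    "partially relevant responses that miss important aspects score low.",
)
```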
## 3b. Implement custom evaluators
If any criterion requires a custom evaluator, implement it now. Place custom evaluators in `pixie_qa/evaluators.py` (or a sub-module if there are many).
### Agent evaluators (`create_agent_evaluator`) — the default
Use agent evaluators for **all semantic, qualitative, and judgment-based criteria**. These are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling.
```python
from pixie import create_agent_evaluator

extraction_accuracy = create_agent_evaluator(
    name="ExtractionAccuracy",
    criteria="The extracted data accurately reflects the source content. All fields "
    "contain correct values from the source — no hallucinated, fabricated, or "
    "placeholder values. Compare the final_answer against the fetched_content "
    "and parsed_content to verify every claimed fact.",
)

noise_handling = create_agent_evaluator(
    name="NoiseHandling",
    criteria="The app correctly ignored navigation chrome, boilerplate, ads, and other "
    "non-content elements from the source. The extracted data contains only "
    "information relevant to the user's prompt, not noise from the page structure.",
)

schema_compliance = create_agent_evaluator(
    name="SchemaCompliance",
    criteria="The output contains all fields requested in the prompt with appropriate "
    "types and non-trivial values. Missing fields, null values for required data, "
    "or fields with generic placeholder text indicate failure.",
)
```
Reference agent evaluators in the dataset via `filepath:callable_name` (e.g., `"pixie_qa/evaluators.py:extraction_accuracy"`).
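For instance, a dataset entry might carry the reference like this (a sketch only: the `evaluators` field name is an assumption here, and the actual entry schema is defined in Step 4):
```json
{
  "input": "Extract the article metadata from https://example.com/post",
  "evaluators": ["pixie_qa/evaluators.py:extraction_accuracy"]
}
```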
During `pixie test`, agent evaluators show as `⏳` in the console. They are graded in Step 5d.
**Writing effective criteria**: The `criteria` string is the grading rubric you'll follow in Step 5d. Make it specific and actionable:
- **Bad**: "Check if the output is good" — too vague to grade consistently
- **Bad**: "The response should be accurate" — doesn't say what to compare against
- **Good**: "Compare the extracted fields against the source HTML/document. Each field must have a corresponding passage in the source. Flag any field whose value cannot be traced back to the source content."
- **Good**: "The app should preserve the structural hierarchy of the source document. If the source has sections/subsections, the extraction should reflect that nesting, not flatten everything into a single level."
### Manual custom evaluator — for mechanical checks only
Use manual custom evaluators **only** for deterministic, programmatic checks where a simple function definitively gives the right answer. Examples: field existence, regex matching, JSON schema validation, numeric range checks, type verification.
**Do NOT use manual custom evaluators for semantic quality.** If the check requires _judgment_ about whether content is correct, relevant, complete, or well-written, use an agent evaluator instead. The litmus test: "Could a regex, string match, or comparison operator implement this check perfectly?" If not, it's semantic — use an agent evaluator.
Custom evaluators can be **sync or async functions**. Assign them to module-level variables in `pixie_qa/evaluators.py`:
@@ -119,9 +108,13 @@ def call_ended_check(evaluable: Evaluable, *, trace=None) -> Evaluation:
)
```
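For illustration, a minimal sketch of such a mechanical check. It mirrors the `call_ended_check` signature shown above and assumes `Evaluable` exposes the entry's output as `eval_output` and that `Evaluation` accepts a `score` and a `comment`; consult `evaluators.md` for the actual constructor before copying this.
```python
import json

from pixie import Evaluable, Evaluation  # assumed import path; see evaluators.md

REQUIRED_FIELDS = {"title", "author", "published_date"}  # illustrative field set

def required_fields_present(evaluable: Evaluable, *, trace=None) -> Evaluation:
    """Mechanical check: every required field exists and is non-empty."""
    output = evaluable.eval_output
    if isinstance(output, str):
        output = json.loads(output)  # tolerate JSON-serialized outputs
    missing = [f for f in sorted(REQUIRED_FIELDS) if not output.get(f)]
    return Evaluation(
        score=0.0 if missing else 1.0,
        comment=f"Missing/empty fields: {missing}" if missing else "All fields present.",
    )
```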
### ValidJSON and string expectations conflict
`ValidJSON` treats the dataset entry's `expectation` field as a JSON Schema when present. If your entries use **string** expectations (e.g., for `Factuality`), adding `ValidJSON` as a dataset-level default evaluator will cause failures — it cannot validate a plain string as a JSON Schema. Either apply `ValidJSON` only to entries with object/boolean expectations, or omit it when the dataset relies on string expectations.
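As a sketch (entry field names are illustrative), the first expectation below suits `Factuality` but would break a dataset-level `ValidJSON` default, while the second is a proper JSON Schema that `ValidJSON` can apply:
```json
[
  {"input": "Who wrote Dune?", "expectation": "Frank Herbert wrote Dune."},
  {"input": "Extract the book metadata", "expectation": {"type": "object", "required": ["title", "author"]}}
]
```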
## 3c. Produce the evaluator mapping artifact
Write the criterion-to-evaluator mapping to `pixie_qa/03-evaluator-mapping.md`. This artifact bridges between the eval criteria (Step 1c) and the dataset (Step 4).
**CRITICAL**: Use the exact evaluator names as they appear in the `evaluators.md` reference — built-in evaluators use their short name (e.g., `Factuality`, `ClosedQA`), and custom evaluators use `filepath:callable_name` format (e.g., `pixie_qa/evaluators.py:extraction_accuracy`).
@@ -137,15 +130,22 @@ Write the criterion-to-evaluator mapping to `pixie_qa/03-evaluator-mapping.md`.
| Factuality | Factual accuracy | All items |
| ClosedQA | Answer correctness | Items with expected_output |
## Agent evaluators
| Evaluator name | Criterion it covers | Applies to | Source file |
| ------------------------------------------ | ---------------------------- | ---------- | ---------------------- |
| pixie_qa/evaluators.py:extraction_accuracy | Content accuracy vs source | All items | pixie_qa/evaluators.py |
| pixie_qa/evaluators.py:noise_handling | Navigation/boilerplate noise | All items | pixie_qa/evaluators.py |
## Manual custom evaluators (mechanical checks only)
| Evaluator name | Criterion it covers | Applies to | Source file |
| ---------------------------------------------- | -------------------- | ---------- | ---------------------- |
| pixie_qa/evaluators.py:required_fields_present | Required field check | All items | pixie_qa/evaluators.py |
## Applicability summary
- **Dataset-level defaults** (apply to all items): Factuality, pixie_qa/evaluators.py:extraction_accuracy
- **Item-specific** (apply to subset): ClosedQA (only items with expected_output)
```
@@ -156,6 +156,6 @@ Write the criterion-to-evaluator mapping to `pixie_qa/03-evaluator-mapping.md`.
---
> **Evaluator selection guide**: See `evaluators.md` for the full built-in evaluator catalog and `create_agent_evaluator` reference.
>
> **If you hit an unexpected error** when implementing evaluators (import failures, API mismatch), read `evaluators.md` for the authoritative evaluator reference and `wrap-api.md` for API details before guessing at a fix.