chore: publish from staged

This commit is contained in:
github-actions[bot]
2026-05-04 04:22:49 +00:00
parent 252f342650
commit c135d1c5aa
536 changed files with 116819 additions and 294 deletions

# Model Selection
Error analysis first, model changes last.
## Decision Tree
```
Performance issue?
└─ Does error analysis suggest a model problem?
   ├─ NO  → Fix prompts, retrieval, tools
   └─ YES → Is it a capability gap?
            ├─ YES → Consider a model change
            └─ NO  → Fix the actual problem
```
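The tree above can be sketched as a small helper. This is an illustrative sketch, not part of the Phoenix API; `next_action` and its arguments are hypothetical names.

```python
def next_action(model_problem: bool, capability_gap: bool) -> str:
    """Walk the decision tree: error analysis first, model changes last."""
    if not model_problem:
        # Error analysis points elsewhere: the model is not the bottleneck
        return "fix prompts, retrieval, tools"
    if capability_gap:
        # The model genuinely cannot do the task (reasoning, math, code)
        return "consider model change"
    # A model-related symptom that is still fixable without swapping models
    return "fix the actual problem"
```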
## Judge Model Selection
| Principle | Action |
| --------- | ------ |
| Start capable | Use gpt-4o first |
| Optimize later | Test a cheaper model once criteria are stable |
| Same model OK | The judge performs a different task than the app |
```python
# Start with a capable model
judge = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o"),
    ...
)

# After validation, test a cheaper model
judge_cheap = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o-mini"),
    ...
)

# Compare TPR/TNR on the same test set
```
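The TPR/TNR comparison can be done with a few lines of plain Python; no Phoenix API is needed. A minimal sketch, assuming binary labels where human-annotated ground truth is the reference and each judge's verdicts are the predictions:

```python
def tpr_tnr(truth: list[int], preds: list[int]) -> tuple[float, float]:
    """True positive rate and true negative rate of judge verdicts
    against human ground-truth labels (1 = positive, 0 = negative)."""
    tp = sum(1 for y, p in zip(truth, preds) if y and p)
    fn = sum(1 for y, p in zip(truth, preds) if y and not p)
    tn = sum(1 for y, p in zip(truth, preds) if not y and not p)
    fp = sum(1 for y, p in zip(truth, preds) if not y and p)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr
```

Run both judges over the same labeled test set and keep the cheaper one only if its TPR/TNR stays close to the capable model's.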
## Don't Model Shop
```python
from phoenix.client import Client

client = Client()

# BAD: model shopping — rerun the experiment per model and pick a winner
for model in ["gpt-4o", "claude-3", "gemini-pro"]:
    results = client.experiments.run_experiment(
        dataset=dataset,
        task=lambda input, _model=model: task(input, model=_model),
        evaluators=evaluators,
    )

# GOOD: analyze failures first, then decide what actually needs fixing
failures = analyze_errors(results)
# "Ignores context" → fix the prompt
# "Can't do math"   → maybe try a stronger model
```
## When Model Change Is Warranted
- Failures persist after prompt optimization
- Capability gaps (reasoning, math, code)
- Error analysis confirms model limitation
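The checklist above can be encoded as a gate in an eval pipeline. A sketch under stated assumptions: `model_change_warranted`, its failure-mode tags, and the `prompt_optimized` flag are all hypothetical names, not Phoenix API.

```python
# Failure modes that indicate a capability gap rather than a fixable prompt issue
CAPABILITY_GAPS = {"reasoning", "math", "code"}

def model_change_warranted(failure_modes: list[str], prompt_optimized: bool) -> bool:
    """Recommend a model change only when prompts have already been
    optimized and error analysis tags the remaining failures as
    capability gaps."""
    if not prompt_optimized:
        # Failures must persist after prompt optimization first
        return False
    return any(mode in CAPABILITY_GAPS for mode in failure_modes)
```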