# Model Selection

Error analysis first, model changes last.

## Decision Tree

```
Performance Issue?
       │
       ▼
Error analysis suggests model problem?
    NO  → Fix prompts, retrieval, tools
    YES → Is it a capability gap?
          YES → Consider model change
          NO  → Fix the actual problem
```
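
Read as code, the triage fits in a few lines. A minimal sketch; the boolean inputs and returned strings are illustrative, not part of any Phoenix API:

```python
# Illustrative triage helper (not a Phoenix API).
def next_step(model_blamed: bool, capability_gap: bool) -> str:
    """Map an error-analysis finding to the recommended action."""
    if not model_blamed:
        return "fix prompts, retrieval, or tools"
    if capability_gap:
        return "consider a model change"
    return "fix the actual problem"
```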

## Judge Model Selection

| Principle | Action |
| --- | --- |
| Start capable | Use gpt-4o first |
| Optimize later | Test cheaper models once criteria are stable |
| Same model OK | The judge performs a different task than the app it grades |

```python
# Import paths assume the phoenix-evals 2.x API.
from phoenix.evals import ClassificationEvaluator
from phoenix.evals.llm import LLM

# Start with a capable model
judge = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o"),
    ...
)

# After validation, test a cheaper model
judge_cheap = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o-mini"),
    ...
)
# Compare TPR/TNR on the same test set
```
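
The comparison itself needs nothing more than the two rates against human ground truth. A minimal sketch, assuming `human`, `labels_4o`, and `labels_mini` are same-length lists of "pass"/"fail" labels over one test set (the names and the helper are hypothetical):

```python
# Hypothetical helper, not a Phoenix API: TPR/TNR of judge labels
# against human labels, treating `positive` as the class to catch.
def tpr_tnr(human: list[str], judge: list[str], positive: str = "fail") -> tuple[float, float]:
    tp = sum(h == positive and j == positive for h, j in zip(human, judge))
    fn = sum(h == positive and j != positive for h, j in zip(human, judge))
    tn = sum(h != positive and j != positive for h, j in zip(human, judge))
    fp = sum(h != positive and j == positive for h, j in zip(human, judge))
    return tp / (tp + fn), tn / (tn + fp)

print(tpr_tnr(human, labels_4o))    # e.g. (0.95, 0.92)
print(tpr_tnr(human, labels_mini))  # keep gpt-4o-mini only if both rates hold up
```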

## Don't Model Shop

```python
from phoenix.client import Client

client = Client()

# BAD: swap models blindly and hope one scores higher
for model in ["gpt-4o", "claude-3", "gemini-pro"]:
    results = client.experiments.run_experiment(
        dataset=dataset,
        task=lambda input, _model=model: task(input, model=_model),
        evaluators=evaluators,
    )

# GOOD: read the failures before touching the model
failures = analyze_errors(results)
# "Ignores context" → Fix the prompt
# "Can't do math"   → Maybe try a better model
```

## When Model Change Is Warranted

- Failures persist after prompt optimization
- Capability gaps exist (reasoning, math, code)
- Error analysis confirms a model limitation
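
When all three hold, change one variable at a time: rerun the same dataset and evaluators with only the model swapped, reusing `client`, `task`, `dataset`, and `evaluators` from the snippet above (the model strings are placeholders):

```python
# Controlled comparison: identical dataset and evaluators, only the model differs.
baseline = client.experiments.run_experiment(
    dataset=dataset,
    task=lambda input: task(input, model="gpt-4o-mini"),
    evaluators=evaluators,
)
candidate = client.experiments.run_experiment(
    dataset=dataset,
    task=lambda input: task(input, model="gpt-4o"),
    evaluators=evaluators,
)
# Compare evaluator scores run-by-run before committing to the switch.
```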