# Model Selection

Error analysis first, model changes last.

## Decision Tree

```
Performance Issue?
       │
       ▼
Error analysis suggests model problem?
    NO  → Fix prompts, retrieval, tools
    YES → Is it a capability gap?
          YES → Consider model change
          NO  → Fix the actual problem
```
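
Read as code, the triage fits in a few lines. A minimal sketch; the boolean inputs and returned strings are illustrative, not part of any Phoenix API:

```python
# Illustrative triage helper (not a Phoenix API).
def next_step(model_blamed: bool, capability_gap: bool) -> str:
    """Map an error-analysis finding to the recommended action."""
    if not model_blamed:
        return "fix prompts, retrieval, or tools"
    if capability_gap:
        return "consider a model change"
    return "fix the actual problem"
```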

## Judge Model Selection

| Principle | Action |
| --- | --- |
| Start capable | Use gpt-4o first |
| Optimize later | Test cheaper models once criteria are stable |
| Same model OK | The judge performs a different task than the app it grades |

```python
# Import paths assume the phoenix-evals 2.x API.
from phoenix.evals import ClassificationEvaluator
from phoenix.evals.llm import LLM

# Start with a capable model
judge = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o"),
    ...
)

# After validation, test a cheaper model
judge_cheap = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o-mini"),
    ...
)
# Compare TPR/TNR on the same test set
```
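
The comparison itself needs nothing more than the two rates against human ground truth. A minimal sketch, assuming `human`, `labels_4o`, and `labels_mini` are same-length lists of "pass"/"fail" labels over one test set (the names and the helper are hypothetical):

```python
# Hypothetical helper, not a Phoenix API: TPR/TNR of judge labels
# against human labels, treating `positive` as the class to catch.
def tpr_tnr(human: list[str], judge: list[str], positive: str = "fail") -> tuple[float, float]:
    tp = sum(h == positive and j == positive for h, j in zip(human, judge))
    fn = sum(h == positive and j != positive for h, j in zip(human, judge))
    tn = sum(h != positive and j != positive for h, j in zip(human, judge))
    fp = sum(h != positive and j == positive for h, j in zip(human, judge))
    return tp / (tp + fn), tn / (tn + fp)

print(tpr_tnr(human, labels_4o))    # e.g. (0.95, 0.92)
print(tpr_tnr(human, labels_mini))  # keep gpt-4o-mini only if both rates hold up
```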

## Don't Model Shop

```python
from phoenix.client import Client

client = Client()

# BAD: swap models blindly and hope one scores higher
for model in ["gpt-4o", "claude-3", "gemini-pro"]:
    results = client.experiments.run_experiment(
        dataset=dataset,
        task=lambda input, _model=model: task(input, model=_model),
        evaluators=evaluators,
    )

# GOOD: read the failures before touching the model
failures = analyze_errors(results)
# "Ignores context" → Fix the prompt
# "Can't do math"   → Maybe try a better model
```

## When Model Change Is Warranted

- Failures persist after prompt optimization
- Capability gaps exist (reasoning, math, code)
- Error analysis confirms a model limitation
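
When all three hold, change one variable at a time: rerun the same dataset and evaluators with only the model swapped, reusing `client`, `task`, `dataset`, and `evaluators` from the snippet above (the model strings are placeholders):

```python
# Controlled comparison: identical dataset and evaluators, only the model differs.
baseline = client.experiments.run_experiment(
    dataset=dataset,
    task=lambda input: task(input, model="gpt-4o-mini"),
    evaluators=evaluators,
)
candidate = client.experiments.run_experiment(
    dataset=dataset,
    task=lambda input: task(input, model="gpt-4o"),
    evaluators=evaluators,
)
# Compare evaluator scores run-by-run before committing to the switch.
```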