## Anti-Patterns

Common mistakes and fixes.

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Generic metrics | Pre-built scores don't match your failures | Build from error analysis |
| Vibe-based | No quantification | Measure with experiments |
| Ignoring humans | Uncalibrated LLM judges | Validate >80% TPR/TNR (see sketch after this table) |
| Premature automation | Evaluators for imagined problems | Let observed failures drive |
| Saturation blindness | 100% pass = no signal | Keep capability evals at 50-80% |
| Similarity metrics | BERTScore/ROUGE for generation | Use for retrieval only |
| Model switching | Hoping a model works better | Error analysis first |
| Single-run scoring | Per-run noise from LLM judges or non-deterministic tasks can drown out the signal from a prompt change on a small dataset | Set repetitions on run_experiment (or grow the dataset) when the task or judge is an LLM call (see sketch after this table) |
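For the "Ignoring humans" row, a minimal sketch of judge calibration: score a human-annotated sample with the LLM judge and compute TPR/TNR against the human labels. The labels below are toy values; in practice they come from your own annotated traces.

```python
# Toy calibration check: compare LLM-judge verdicts against human labels
# collected on the same examples before trusting the judge at scale.
human_labels = [True, True, False, True, False, False, True, False]   # ground truth
judge_labels = [True, True, False, False, False, True, True, False]   # LLM judge output

tp = sum(h and j for h, j in zip(human_labels, judge_labels))
tn = sum(not h and not j for h, j in zip(human_labels, judge_labels))
fp = sum(not h and j for h, j in zip(human_labels, judge_labels))
fn = sum(h and not j for h, j in zip(human_labels, judge_labels))

tpr = tp / (tp + fn)  # true positive rate (sensitivity)
tnr = tn / (tn + fp)  # true negative rate (specificity)
print(f"TPR={tpr:.0%} TNR={tnr:.0%}")  # aim for >80% on both before relying on the judge
```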
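For the "Single-run scoring" row, a minimal sketch assuming the Python client's run_experiment exposes a repetitions parameter (the TypeScript client's runExperiment does); if your client version lacks it, duplicating dataset examples gives the same averaging effect.

```python
from phoenix.client import Client

client = Client()

# dataset, llm_task, and evaluators come from your own setup; repetitions is
# an assumption here, so check your client version before relying on it.
experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=llm_task,          # non-deterministic LLM call
    evaluators=evaluators,  # includes an LLM judge
    repetitions=5,          # score each example several times to average out per-run noise
)
```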
### Quantify Changes

```python
from phoenix.client import Client

client = Client()

# dataset, the old_prompt/new_prompt tasks, and evaluators are defined elsewhere
baseline = client.experiments.run_experiment(dataset=dataset, task=old_prompt, evaluators=evaluators)
improved = client.experiments.run_experiment(dataset=dataset, task=new_prompt, evaluators=evaluators)

print(f"Improvement: {improved.pass_rate - baseline.pass_rate:+.1%}")
```
### Don't Use Similarity for Generation

```python
# BAD: token-overlap similarity rewards fluent paraphrase, not factual correctness
score = bertscore(output, reference)

# GOOD: check the output's claims against the source context
correct_facts = check_facts_against_source(output, context)
```
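One hypothetical shape for check_facts_against_source, sketched with an OpenAI-based LLM judge; the model name, prompt wording, and sentence-level splitting are assumptions, not part of the original snippet.

```python
from openai import OpenAI

judge = OpenAI()

def check_facts_against_source(output: str, context: str) -> float:
    """Return the fraction of output sentences supported by the source context."""
    claims = [s.strip() for s in output.split(".") if s.strip()]
    supported = 0
    for claim in claims:
        # Ask the judge whether this claim is grounded in the source
        resp = judge.chat.completions.create(
            model="gpt-4o-mini",  # assumed judge model
            messages=[{
                "role": "user",
                "content": (
                    "Answer yes or no. Is this claim supported by the source?\n"
                    f"Claim: {claim}\nSource: {context}"
                ),
            }],
        )
        if resp.choices[0].message.content.strip().lower().startswith("yes"):
            supported += 1
    return supported / len(claims) if claims else 0.0
```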
### Error Analysis Before Model Change

```python
# BAD: swapping models and hoping one scores better
for model in models:
    results = test(model)

# GOOD: categorize the observed failures first
failures = analyze_errors(results)
# Then decide if a model change is warranted
```
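A hypothetical shape for analyze_errors, assuming each result record carries a passed flag and a manually assigned error_category; both field names are illustrative, not a Phoenix schema.

```python
from collections import Counter

def analyze_errors(results):
    """Group failing results by error category so fixes target observed failure modes."""
    failures = [r for r in results if not r["passed"]]           # assumed pass/fail flag
    by_category = Counter(r.get("error_category", "unlabeled")   # assumed manual label
                          for r in failures)
    for category, count in by_category.most_common():
        print(f"{category}: {count}")
    return failures
```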