Experiments: Overview
Systematic testing of AI systems with datasets, tasks, and evaluators.
Structure
DATASET → Examples: {input, expected_output, metadata}
TASK → function(input) → output
EVALUATORS → (input, output, expected) → score
EXPERIMENT → Run task on all examples, score results
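The four pieces above fit together roughly like this. This is a minimal self-contained sketch of the structure, not Phoenix internals: the dataset, task, evaluator, and the `run_experiment` loop here are all illustrative stand-ins.

```python
# Minimal sketch of the experiment structure; every name here is
# illustrative, not part of the Phoenix API.

# DATASET: examples of {input, expected_output, metadata}
dataset = [
    {"input": "2+2", "expected_output": "4", "metadata": {"topic": "math"}},
    {"input": "capital of France", "expected_output": "Paris", "metadata": {}},
]

# TASK: the function under test (stands in for an LLM pipeline)
def task(input_text: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(input_text, "unknown")

# EVALUATOR: scores one (input, output, expected) triple
def accuracy(input_text: str, output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

# EXPERIMENT: run the task on every example, score, and aggregate
def run_experiment(dataset, task, evaluators):
    scores = {e.__name__: [] for e in evaluators}
    for example in dataset:
        output = task(example["input"])
        for e in evaluators:
            scores[e.__name__].append(
                e(example["input"], output, example["expected_output"])
            )
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}

print(run_experiment(dataset, task, [accuracy]))  # {'accuracy': 1.0}
```

Phoenix handles the loop, storage, and aggregation for you; the sketch only shows the shape of the data flowing through it.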
Basic Usage
```python
from phoenix.client import Client

client = Client()

experiment = client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)

print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
```
Workflow
- Create dataset - From traces, synthetic data, or manual curation
- Define task - The function to test (your LLM pipeline)
- Select evaluators - Code and/or LLM-based
- Run experiment - Execute and score
- Analyze & iterate - Review, modify task, re-run
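For the "select evaluators" step, a code-based evaluator is just a function that maps a result to a score. A minimal exact-match sketch, assuming the `(input, output, expected) → score` shape shown under Structure (check the Phoenix docs for the exact signature it expects):

```python
# Sketch of a code-based evaluator: exact-match accuracy.
# The (input, output, expected) -> score shape mirrors the Structure
# section above; the normalization here is an illustrative choice.

def accuracy(input_text: str, output: str, expected: str) -> float:
    """Return 1.0 on a case- and whitespace-insensitive exact match."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

print(accuracy("2+2", " 4 ", "4"))  # 1.0
print(accuracy("2+2", "5", "4"))    # 0.0
```

LLM-based evaluators follow the same pattern but call a judge model inside the function instead of comparing strings.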
Dry Runs
Test setup before full execution:
```python
experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    evaluators=evaluators,
    dry_run=3,  # run on just 3 examples
)
```
Async Usage
Use AsyncClient when your task or evaluators make network calls and you want higher throughput:
```python
from phoenix.client import AsyncClient

client = AsyncClient()

experiment = await client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_async_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)
```
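An async task is an ordinary coroutine. A minimal sketch of what `my_async_task` might look like; the name and the sleep standing in for a network call are illustrative:

```python
import asyncio

# Illustrative async task: in practice this would await an LLM or
# retrieval call; a short sleep stands in for network latency here.
async def my_async_task(input_text: str) -> str:
    await asyncio.sleep(0.01)  # simulated network call
    return f"answer for: {input_text}"

# Standalone usage, outside run_experiment:
result = asyncio.run(my_async_task("2+2"))
print(result)  # answer for: 2+2
```

Because the client awaits many such coroutines concurrently, I/O-bound tasks and evaluators finish much faster than they would run sequentially.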
Best Practices
- Name meaningfully: "improved-retrieval-v2-2024-01-15", not "test"
- Version datasets: don't modify an existing dataset; create a new version instead
- Multiple evaluators: combine perspectives (e.g., code-based and LLM-based)