mirror of
https://github.com/github/awesome-copilot.git
synced 2026-05-04 14:15:55 +00:00
73 lines
1.7 KiB
Markdown
# Experiments: Overview

Systematic testing of AI systems with datasets, tasks, and evaluators.

## Structure

```
DATASET     → Examples: {input, expected_output, metadata}
TASK        → function(input) → output
EVALUATORS  → (input, output, expected) → score
EXPERIMENT  → Run task on all examples, score results
```
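
To make the four pieces concrete, here is a minimal plain-Python sketch of the same structure. It does not use Phoenix at all; the toy arithmetic task, the `exact_match` evaluator, and this local `run_experiment` helper are illustrative assumptions, not part of any library.

```python
from statistics import mean

# DATASET: a list of examples with input, expected output, and metadata.
dataset = [
    {"input": "2 + 2", "expected_output": "4", "metadata": {"topic": "math"}},
    {"input": "3 * 3", "expected_output": "9", "metadata": {"topic": "math"}},
    {"input": "10 / 4", "expected_output": "2.5", "metadata": {"topic": "math"}},
]


def task(input: str) -> str:
    """TASK: stand-in for an LLM pipeline; it only handles + and *."""
    left, op, right = input.split()
    if op == "+":
        return str(int(left) + int(right))
    if op == "*":
        return str(int(left) * int(right))
    return "unknown"


def exact_match(input: str, output: str, expected: str) -> float:
    """EVALUATOR: 1.0 when the output equals the expected output, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_experiment(dataset, task, evaluators):
    """EXPERIMENT: run the task on every example and average each evaluator's scores."""
    scores = {evaluator.__name__: [] for evaluator in evaluators}
    for example in dataset:
        output = task(example["input"])
        for evaluator in evaluators:
            score = evaluator(example["input"], output, example["expected_output"])
            scores[evaluator.__name__].append(score)
    return {name: mean(values) for name, values in scores.items()}


aggregate = run_experiment(dataset, task, [exact_match])
print(aggregate)
```

The toy task fails the division example on purpose, so the aggregate score lands below 1.0 and shows what a per-evaluator average looks like.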

## Basic Usage

```python
from phoenix.client import Client

# my_dataset, my_task, accuracy, and faithfulness are assumed to be defined elsewhere.
client = Client()
experiment = client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)

print(experiment.aggregate_scores)
# Example output: {'accuracy': 0.85, 'faithfulness': 0.92}
```
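
The snippet above leaves `my_task` and `accuracy` undefined. One possible shape for them is sketched below, assuming the task receives each example's input and the evaluator receives the task output together with the expected value, matched by parameter name. The `question` and `answer` keys and the echo logic are hypothetical placeholders, not taken from the source.

```python
def my_task(input):
    """Hypothetical task: answer the question stored in the example's input dict."""
    question = input["question"]
    # A real task would call your retrieval + LLM pipeline here.
    return f"Echo: {question}"


def accuracy(output, expected):
    """Hypothetical code evaluator: 1.0 if the output matches the expected answer."""
    return 1.0 if output == expected["answer"] else 0.0


# Quick local check of both functions before wiring them into an experiment.
example_input = {"question": "What is Phoenix?"}
example_expected = {"answer": "Echo: What is Phoenix?"}
output = my_task(example_input)
score = accuracy(output, example_expected)
print(score)
```

Calling the functions directly like this is a cheap way to catch signature or key errors before any experiment is run.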

## Workflow

1. **Create dataset** - From traces, synthetic data, or manual curation
2. **Define task** - The function to test (your LLM pipeline)
3. **Select evaluators** - Code and/or LLM-based
4. **Run experiment** - Execute and score
5. **Analyze & iterate** - Review, modify task, re-run
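
For the final step, one simple way to review a re-run is to compare its aggregate scores against a baseline. The sketch below assumes scores are plain `{evaluator_name: score}` dictionaries; `score_deltas` is a hypothetical helper and the numbers are made up for illustration.

```python
def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for each evaluator present in both runs."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }


# Illustrative scores only, not results from a real experiment.
baseline = {"accuracy": 0.80, "faithfulness": 0.92}
candidate = {"accuracy": 0.85, "faithfulness": 0.90}

deltas = score_deltas(baseline, candidate)
for name, delta in deltas.items():
    direction = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{name}: {delta:+.2f} ({direction})")
```

A positive delta on one evaluator alongside a negative delta on another is a signal to inspect individual examples before accepting the change.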

## Dry Runs

Test your setup on a few examples before a full run:

```python
experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    evaluators=evaluators,
    dry_run=3,  # Run on just 3 examples
)
```

## Async Usage

Use `AsyncClient` when your task or evaluators make network calls and you want higher throughput:

```python
from phoenix.client import AsyncClient

client = AsyncClient()
# Top-level `await` works in notebooks; in a script, wrap this call in an
# `async def` function and run it with `asyncio.run()`.
experiment = await client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_async_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)
```
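
To illustrate why async helps with network-bound tasks, here is a standard-library sketch that runs a hypothetical `my_async_task` concurrently. It does not use Phoenix; `asyncio.sleep` stands in for a network call, and `run_all` with its `max_concurrency` parameter is an assumption made for this example.

```python
import asyncio
import time


async def my_async_task(input: str) -> str:
    """Hypothetical async task; asyncio.sleep stands in for a network call."""
    await asyncio.sleep(0.1)
    return input.upper()


async def run_all(inputs, max_concurrency: int = 5):
    """Run the task over all inputs with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:
            return await my_async_task(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in inputs))


start = time.perf_counter()
outputs = asyncio.run(run_all(["a", "b", "c", "d", "e"]))
elapsed = time.perf_counter() - start
print(outputs, f"{elapsed:.2f}s")
```

Because the five simulated calls overlap, the total time should be close to one call's latency rather than the sum of all five.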

## Best Practices

- **Name meaningfully**: `"improved-retrieval-v2-2024-01-15"`, not `"test"`
- **Version datasets**: Create a new version instead of modifying an existing dataset
- **Multiple evaluators**: Combine perspectives, such as code-based checks and LLM-based judges
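
The naming convention in the first bullet can be generated rather than typed by hand. A minimal sketch, assuming a change-version-date pattern; `experiment_name` is a hypothetical helper, not a library function.

```python
from datetime import date


def experiment_name(change: str, version: int, run_date: date) -> str:
    """Build a name like 'improved-retrieval-v2-2024-01-15'."""
    return f"{change}-v{version}-{run_date.isoformat()}"


name = experiment_name("improved-retrieval", 2, date(2024, 1, 15))
print(name)  # improved-retrieval-v2-2024-01-15
```

In practice, passing `date.today()` as `run_date` keeps names unique across days without manual edits.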