chore: publish from staged

This commit is contained in:
github-actions[bot]
2026-05-04 04:22:49 +00:00
parent 252f342650
commit c135d1c5aa
536 changed files with 116819 additions and 294 deletions

# Experiments: Overview
Systematic testing of AI systems with datasets, tasks, and evaluators.
## Structure
```
DATASET → Examples: {input, expected_output, metadata}
TASK → function(input) → output
EVALUATORS → (input, output, expected) → score
EXPERIMENT → Run task on all examples, score results
```
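The structure above can be sketched in plain Python, independent of the Phoenix client. All names and data here are illustrative, not part of the Phoenix API:

```python
# DATASET: examples with input, expected output, and metadata
dataset = [
    {"input": "2 + 2", "expected_output": "4", "metadata": {"topic": "math"}},
    {"input": "capital of France", "expected_output": "Paris", "metadata": {"topic": "geo"}},
]

def task(input: str) -> str:
    # TASK: stand-in for an LLM pipeline; here a simple lookup table
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(input, "unknown")

def accuracy(input: str, output: str, expected: str) -> float:
    # EVALUATOR: (input, output, expected) -> score
    return 1.0 if output == expected else 0.0

# EXPERIMENT: run the task on every example and score the results
scores = [
    accuracy(ex["input"], task(ex["input"]), ex["expected_output"])
    for ex in dataset
]
aggregate = sum(scores) / len(scores)
```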
## Basic Usage
```python
from phoenix.client import Client
client = Client()
experiment = client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)
print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
```
## Workflow
1. **Create dataset** - From traces, synthetic data, or manual curation
2. **Define task** - The function to test (your LLM pipeline)
3. **Select evaluators** - Code and/or LLM-based
4. **Run experiment** - Execute and score
5. **Analyze & iterate** - Review, modify task, re-run
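For step 3, code-based evaluators are plain functions with the `(input, output, expected) → score` shape; LLM-based evaluators follow the same shape but call a model internally. Two hypothetical code-based examples:

```python
def exact_match(input: str, output: str, expected: str) -> float:
    # Strict comparison after trimming surrounding whitespace
    return 1.0 if output.strip() == expected.strip() else 0.0

def token_overlap(input: str, output: str, expected: str) -> float:
    # Fraction of expected tokens that also appear in the output
    expected_tokens = set(expected.lower().split())
    if not expected_tokens:
        return 0.0
    output_tokens = set(output.lower().split())
    return len(expected_tokens & output_tokens) / len(expected_tokens)
```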
## Dry Runs
Test setup before full execution:
```python
experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    evaluators=evaluators,
    dry_run=3,  # run on just 3 examples
)
```
## Async Usage
Use `AsyncClient` when your task or evaluators make network calls and you want higher throughput:
```python
from phoenix.client import AsyncClient
client = AsyncClient()
experiment = await client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_async_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)
```
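An async task is a coroutine whose awaited calls can overlap across examples, which is where the throughput gain comes from. A minimal sketch using `asyncio`, with `fetch_answer` as a hypothetical stand-in for an LLM or retrieval call:

```python
import asyncio

async def fetch_answer(query: str) -> str:
    # Stand-in for a network call (LLM, retriever, etc.)
    await asyncio.sleep(0.01)  # simulate latency
    return query.upper()

async def my_async_task(input: str) -> str:
    # While this task awaits, other examples' tasks can run
    return await fetch_answer(input)

result = asyncio.run(my_async_task("hello"))
```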
## Best Practices
- **Name meaningfully**: `"improved-retrieval-v2-2024-01-15"` not `"test"`
- **Version datasets**: Don't modify an existing dataset; create a new version instead
- **Multiple evaluators**: Combine perspectives
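One way to get meaningful, dated names consistently is a small helper like the hypothetical one below (not part of the Phoenix API):

```python
from datetime import date

def experiment_name(base: str, version: int) -> str:
    # e.g. "improved-retrieval-v2-2024-01-15"
    return f"{base}-v{version}-{date.today().isoformat()}"
```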