chore: publish from staged

This commit is contained in:
github-actions[bot]
2026-05-04 04:22:49 +00:00
parent 252f342650
commit c135d1c5aa
536 changed files with 116819 additions and 294 deletions

# Experiments: Overview
Systematic testing of AI systems with datasets, tasks, and evaluators.
## Structure
```
DATASET → Examples: {input, expected_output, metadata}
TASK → function(input) → output
EVALUATORS → (input, output, expected) → score
EXPERIMENT → Run task on all examples, score results
```
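The structure above can be sketched in plain Python, independent of the Phoenix client. All names and data here are illustrative, not part of the Phoenix API:

```python
# DATASET: examples with input, expected output, and metadata
dataset = [
    {"input": "2 + 2", "expected_output": "4", "metadata": {"topic": "math"}},
    {"input": "capital of France", "expected_output": "Paris", "metadata": {"topic": "geo"}},
]

def task(input: str) -> str:
    # TASK: stand-in for an LLM pipeline; here a simple lookup table
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(input, "unknown")

def accuracy(input: str, output: str, expected: str) -> float:
    # EVALUATOR: (input, output, expected) -> score
    return 1.0 if output == expected else 0.0

# EXPERIMENT: run the task on every example and score the results
scores = [
    accuracy(ex["input"], task(ex["input"]), ex["expected_output"])
    for ex in dataset
]
aggregate = sum(scores) / len(scores)
```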
## Basic Usage
```python
from phoenix.client import Client
client = Client()
experiment = client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)
print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
```
## Workflow
1. **Create dataset** - From traces, synthetic data, or manual curation
2. **Define task** - The function to test (your LLM pipeline)
3. **Select evaluators** - Code and/or LLM-based
4. **Run experiment** - Execute and score
5. **Analyze & iterate** - Review, modify task, re-run
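For step 3, code-based evaluators are plain functions with the `(input, output, expected) → score` shape; LLM-based evaluators follow the same shape but call a model internally. Two hypothetical code-based examples:

```python
def exact_match(input: str, output: str, expected: str) -> float:
    # Strict comparison after trimming surrounding whitespace
    return 1.0 if output.strip() == expected.strip() else 0.0

def token_overlap(input: str, output: str, expected: str) -> float:
    # Fraction of expected tokens that also appear in the output
    expected_tokens = set(expected.lower().split())
    if not expected_tokens:
        return 0.0
    output_tokens = set(output.lower().split())
    return len(expected_tokens & output_tokens) / len(expected_tokens)
```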
## Dry Runs
Test setup before full execution:
```python
experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    evaluators=evaluators,
    dry_run=3,  # run on just 3 examples
)
```
## Async Usage
Use `AsyncClient` when your task or evaluators make network calls and you want higher throughput:
```python
from phoenix.client import AsyncClient
client = AsyncClient()
experiment = await client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_async_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)
```
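An async task is a coroutine whose awaited calls can overlap across examples, which is where the throughput gain comes from. A minimal sketch using `asyncio`, with `fetch_answer` as a hypothetical stand-in for an LLM or retrieval call:

```python
import asyncio

async def fetch_answer(query: str) -> str:
    # Stand-in for a network call (LLM, retriever, etc.)
    await asyncio.sleep(0.01)  # simulate latency
    return query.upper()

async def my_async_task(input: str) -> str:
    # While this task awaits, other examples' tasks can run
    return await fetch_answer(input)

result = asyncio.run(my_async_task("hello"))
```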
## Best Practices
- **Name meaningfully**: `"improved-retrieval-v2-2024-01-15"` not `"test"`
- **Version datasets**: Don't modify an existing dataset; create a new version instead
- **Multiple evaluators**: Combine perspectives
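One way to get meaningful, dated names consistently is a small helper like the hypothetical one below (not part of the Phoenix API):

```python
from datetime import date

def experiment_name(base: str, version: int) -> str:
    # e.g. "improved-retrieval-v2-2024-01-15"
    return f"{base}-v{version}-{date.today().isoformat()}"
```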