chore: publish from staged

2026-05-04 22:25:57 +00:00 · 2026-05-04 04:22:49 +00:00
parent 252f342650
commit c135d1c5aa
536 changed files with 116819 additions and 294 deletions
--- a/plugins/phoenix/skills/phoenix-evals/references/experiments-running-python.md
+++ b/plugins/phoenix/skills/phoenix-evals/references/experiments-running-python.md
@@ -0,0 +1,105 @@
+# Experiments: Running Experiments in Python
+
+Execute experiments with `run_experiment`.
+
+## Basic Usage
+
+```python
+from phoenix.client import Client
+from phoenix.client.experiments import run_experiment
+
+client = Client()
+dataset = client.datasets.get_dataset(name="qa-test-v1")
+
+# call_llm is a placeholder for your own model call; it is not defined here.
+def my_task(example):
+    return call_llm(example.input["question"])
+
+def exact_match(output, expected):
+    return 1.0 if output.strip().lower() == expected["answer"].strip().lower() else 0.0
+
+experiment = run_experiment(
+    dataset=dataset,
+    task=my_task,
+    evaluators=[exact_match],
+    experiment_name="qa-experiment-v1",
+)
+```
+
+## Task Functions
+
+```python
+# Basic task
+def task(example):
+    return call_llm(example.input["question"])
+
+# With context (RAG)
+def rag_task(example):
+    return call_llm(f"Context: {example.input['context']}\nQ: {example.input['question']}")
+```
+
+## Evaluator Parameters
+
+| Parameter | Value passed to the evaluator |
+| --------- | ----------------------------- |
+| `output` | Task output |
+| `expected` | Example expected output |
+| `input` | Example input |
+| `metadata` | Example metadata |
+
+## Options
+
+```python
+experiment = run_experiment(
+    dataset=dataset,
+    task=my_task,
+    evaluators=evaluators,
+    experiment_name="my-experiment",
+    dry_run=3,       # Dry run on 3 examples; results are not recorded in Phoenix
+    repetitions=3,   # Run each example 3 times
+)
+```
+
+## Results
+
+```python
+print(experiment.aggregate_scores)
+# {'accuracy': 0.85, 'faithfulness': 0.92}
+
+for run in experiment.runs:
+    print(run.output, run.scores)
+```
+
+## Stability
+
+Single-run scores are noisy when either the task or the evaluator is non-deterministic — an LLM call, tool use, streaming output, an LLM-as-judge. On a small dataset, that per-run noise can swamp the signal from a prompt change.
+
+Averaging over repetitions lets the score you report reflect the prompt rather than the sampling noise:
+
+```python
+run_experiment(
+    # ...
+    repetitions=3,
+)
+```
+
+Things to consider:
+
+- Reach for repetitions when the task or the evaluator is an LLM call and the dataset is small.
+- Prefer repetitions when per-example cost is low and you mostly want to settle the score; prefer growing the dataset when you also need to cover more behaviors.
+- Skip repetitions when both the task and the evaluator are deterministic (e.g. string comparison against a ground truth) — a single run is the answer.
+
+Signs that your scores need stabilizing with repetitions:
+
+- Repeat runs of the same experiment drift in ways that feel larger than the differences you're trying to measure.
+- A prompt change flips example labels in ways that don't track with how the outputs actually changed.
+- The judge's reasoning on the same output reads differently from one run to the next.
+
+With the default `repetitions=1`, each example runs once and that single sample is reported as its score, so don't trust a tuning decision based on a single 10-example run.
+
+## Add Evaluations Later
+
+```python
+from phoenix.client.experiments import evaluate_experiment
+
+evaluate_experiment(experiment=experiment, evaluators=[new_evaluator])
+```
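
The Stability section's claim that averaging over repetitions settles the score can be checked with a small simulation. The sketch below is a stdlib-only illustration and does not call Phoenix. The helper names (`noisy_score`, `experiment_score`, `spread`), the true score of 0.7, and the Gaussian noise of 0.2 are all assumptions made for the example. It also assumes every example shares the same true score, so the only variation comes from per-run sampling noise.

```python
import random
import statistics


def noisy_score(true_score: float, noise: float, rng: random.Random) -> float:
    """One simulated evaluator score: true score plus Gaussian noise, clipped to [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(true_score, noise)))


def experiment_score(
    n_examples: int,
    repetitions: int,
    rng: random.Random,
    true_score: float = 0.7,  # hypothetical "real" quality of the prompt
    noise: float = 0.2,       # hypothetical per-run sampling noise
) -> float:
    """Mean score over n_examples, each averaged over `repetitions` simulated runs."""
    per_example = []
    for _ in range(n_examples):
        runs = [noisy_score(true_score, noise, rng) for _ in range(repetitions)]
        per_example.append(statistics.fmean(runs))
    return statistics.fmean(per_example)


def spread(n_examples: int, repetitions: int, trials: int = 500, seed: int = 0) -> float:
    """Standard deviation of the experiment score across many simulated reruns."""
    rng = random.Random(seed)
    scores = [experiment_score(n_examples, repetitions, rng) for _ in range(trials)]
    return statistics.stdev(scores)


# Compare how much a 10-example experiment's score moves between reruns.
single = spread(n_examples=10, repetitions=1)
averaged = spread(n_examples=10, repetitions=3)
print(f"repetitions=1: rerun-to-rerun spread {single:.3f}")
print(f"repetitions=3: rerun-to-rerun spread {averaged:.3f}")
```

In this simplified model, three repetitions on 10 examples behave like one repetition on 30 examples. That is why the section suggests growing the dataset when you also need broader behavioral coverage. Real experiments also vary in per-example difficulty, which repetitions do not average away.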