Experiments: Running Experiments in Python

Execute experiments with run_experiment.

Basic Usage

from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(name="qa-test-v1")

def my_task(example):
    return call_llm(example.input["question"])

def exact_match(output, expected):
    return 1.0 if output.strip().lower() == expected["answer"].strip().lower() else 0.0

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[exact_match],
    experiment_name="qa-experiment-v1",
)

Task Functions

# Basic task
def task(example):
    return call_llm(example.input["question"])

# With context (RAG)
def rag_task(example):
    return call_llm(f"Context: {example.input['context']}\nQ: {example.input['question']}")

Evaluator Parameters

Parameter	Access
`output`	Task output
`expected`	Example expected output
`input`	Example input
`metadata`	Example metadata

Options

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=evaluators,
    experiment_name="my-experiment",
    dry_run=3,       # Test with 3 examples
    repetitions=3,   # Run each example 3 times
)

Results

print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}

for run in experiment.runs:
    print(run.output, run.scores)

Stability

Single-run scores are noisy when either the task or the evaluator is non-deterministic — an LLM call, tool use, streaming output, an LLM-as-judge. On a small dataset, that per-run noise can swamp the signal from a prompt change.

Averaging over repetitions lets the score you report reflect the prompt rather than the sampling noise:

run_experiment(
    # ...
    repetitions=3,
)

Things to consider:

Reach for repetitions when the task or the evaluator is an LLM call and the dataset is small.
Prefer repetitions when per-example cost is low and you mostly want to settle the score; prefer growing the dataset when you also need to cover more behaviors.
Skip repetitions when both the task and the evaluator are deterministic (e.g. string comparison against a ground truth) — a single run is the answer.

Consider adding stability when:

Repeat runs of the same experiment drift in ways that feel larger than the differences you're trying to measure.
A prompt change flips example labels in ways that don't track with how the outputs actually changed.
The judge's reasoning on the same output reads differently from one run to the next.

Repetitions are also what repetitions=1 (default) silently relies on — don't trust a tuning decision based on a single 10-example run.

Add Evaluations Later

from phoenix.client.experiments import evaluate_experiment

evaluate_experiment(experiment=experiment, evaluators=[new_evaluator])

2.9 KiB Raw Blame History