chore: sync Arize skills from arize-skills@597d609bfe5f07fd7d24acfdb408a082911b18fc and phoenix@746247cbb07b0dc7803b87c69dd8c77811c33f59 (#1583)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This commit is contained in:
Jim Bennett
2026-05-03 18:05:44 -07:00
committed by GitHub
parent 82b58047e0
commit c7b2aecb94
40 changed files with 1316 additions and 423 deletions


@@ -69,6 +69,33 @@ for run in experiment.runs:
print(run.output, run.scores)
```
## Stability
Single-run scores are noisy when either the task or the evaluator is non-deterministic — an LLM call, tool use, streaming output, an LLM-as-judge. On a small dataset, that per-run noise can swamp the signal from a prompt change.
Averaging over repetitions lets the score you report reflect the prompt rather than the sampling noise:
```python
run_experiment(
# ...
repetitions=3,
)
```
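To make the noise claim concrete, here is a small, self-contained simulation in plain Python (not the Arize/Phoenix API): a hypothetical pass/fail judge that passes each example 70% of the time is scored on a 10-example dataset, and the reported experiment score is the mean over those examples. Averaging 3 repetitions per example shrinks the run-to-run spread of that score by roughly √3.
```python
# Simulation only: shows how much the reported score moves between identical runs
# when each example is judged by a noisy pass/fail evaluator.
import random
import statistics

random.seed(0)

N_EXAMPLES = 10       # a small dataset, like the ones discussed above
TRUE_PASS_RATE = 0.7  # hypothetical "real" quality of the prompt under test

def experiment_score(repetitions: int) -> float:
    # Mean over the dataset, averaging each example across `repetitions` noisy judgments.
    per_example = [
        statistics.mean(float(random.random() < TRUE_PASS_RATE) for _ in range(repetitions))
        for _ in range(N_EXAMPLES)
    ]
    return statistics.mean(per_example)

for reps in (1, 3):
    scores = [experiment_score(reps) for _ in range(1_000)]
    print(f"repetitions={reps}: run-to-run stdev of the score = {statistics.stdev(scores):.3f}")
```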
Things to consider:
- Reach for repetitions when the task or the evaluator is an LLM call and the dataset is small.
- Prefer repetitions when per-example cost is low and you mostly want to settle the score; prefer growing the dataset when you also need to cover more behaviors.
- Skip repetitions when both the task and the evaluator are deterministic (e.g. string comparison against a ground truth, as sketched below); a single run already tells you everything repeated runs would.
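For concreteness, a deterministic evaluator is just a pure function: no model call anywhere, so re-running it on the same output can never change the score. The function below is a sketch; its name and its `output`/`expected` parameters are illustrative assumptions, so check how your SDK expects evaluators to be declared before passing it via `evaluators=[...]`.
```python
# A deterministic evaluator: pure string comparison, no LLM involved, so its score
# for a given output is identical on every repetition. The (output, expected)
# signature is an assumption; match it to your SDK's evaluator contract.
def exact_match(output: str, expected: str) -> float:
    return float(output.strip() == expected.strip())
```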
Consider adding repetitions for stability when:
- Repeat runs of the same experiment drift in ways that feel larger than the differences you're trying to measure (one way to quantify this is sketched after this list).
- A prompt change flips example labels in ways that don't track with how the outputs actually changed.
- The judge's reasoning on the same output reads differently from one run to the next.
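One way to quantify that drift, assuming `experiment` is the object returned by `run_experiment(...)` above: group scores by dataset example across repetitions and flag the examples where the judge disagrees with itself. The attribute names used here (`dataset_example_id`, a `"correctness"` key in `run.scores`) are illustrative assumptions, not a documented schema; substitute whatever your run objects actually expose.
```python
from collections import defaultdict
from statistics import pstdev

# `experiment` comes from run_experiment(...) above. The run attributes below are
# assumptions about the run object's shape; adapt them to your SDK.
by_example = defaultdict(list)
for run in experiment.runs:
    by_example[run.dataset_example_id].append(run.scores["correctness"])

for example_id, scores in by_example.items():
    if pstdev(scores) > 0.25:  # arbitrary threshold; tune it to your score scale
        print(f"unstable example {example_id}: scores across repetitions = {scores}")
```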
The default, `repetitions=1`, silently leaves you with exactly the single-run noise described at the top of this section; don't trust a tuning decision made from a single 10-example run.
## Add Evaluations Later
```python