chore: sync Arize skills from arize-skills@6a622b6c962907f54ca3578cb2cabff161d8aae6 and phoenix@30ccbe6b38cc83719038bf30041335f29bae45e9 (#1690)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Jim Bennett
2026-05-12 19:08:14 -07:00
committed by GitHub
parent 10b72a6aa9
commit a4d0afc7a4
15 changed files with 513 additions and 146 deletions
@@ -4,6 +4,8 @@ Creating and managing evaluation datasets.
## Creating Datasets
`create_dataset()` upserts: if a dataset with the same name already exists, it is updated in place; re-running with identical inputs is a no-op.
```python
from phoenix.client import Client
@@ -21,6 +23,19 @@ dataset = client.datasets.create_dataset(
],
)
# With stable example IDs for targeted updates across uploads
dataset = client.datasets.create_dataset(
name="qa-test-v1",
examples=[
{
"id": "q-001",  # stable ID: the server updates this row instead of inserting a new one
"input": {"question": "What is 2+2?"},
"output": {"answer": "4"},
"metadata": {"category": "math"},
},
],
)
# From DataFrame
dataset = client.datasets.create_dataset(
dataframe=df,
@@ -28,6 +43,8 @@ dataset = client.datasets.create_dataset(
input_keys=["question"],
output_keys=["answer"],
metadata_keys=["category"],
split_key="split", # single split column (use this instead of deprecated split_keys)
example_id_key="id", # column containing stable example IDs
)
```
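The stable-ID upsert behavior shown above can be pictured with a small local sketch. Note that `upsert_examples` below is a hypothetical helper for illustration only, not part of the Phoenix client; the real merge happens server-side:

```python
# Toy sketch of upsert-by-ID semantics: examples with a known "id" are
# updated in place, unseen IDs are inserted. Illustration only -- this is
# not the Phoenix server implementation.

def upsert_examples(existing, incoming):
    """Merge incoming examples into existing ones, keyed by stable "id"."""
    by_id = {ex["id"]: ex for ex in existing}
    for ex in incoming:
        by_id[ex["id"]] = ex  # same ID: update; new ID: insert
    return list(by_id.values())

current = [
    {"id": "q-001", "input": {"question": "What is 2+2?"}, "output": {"answer": "4"}},
]
update = [
    {"id": "q-001", "input": {"question": "What is 2+2?"}, "output": {"answer": "four"}},
    {"id": "q-002", "input": {"question": "What is 3+3?"}, "output": {"answer": "6"}},
]

merged = upsert_examples(current, update)
assert len(merged) == 2  # q-001 updated in place, q-002 inserted
assert merged[0]["output"]["answer"] == "four"
```

Without stable IDs, the server has no way to match rows across uploads, which is why `example_id_key` matters for the DataFrame path as well.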
@@ -58,6 +75,9 @@ df = dataset.to_dataframe()
| `input_keys` | Columns for task input |
| `output_keys` | Columns for expected output |
| `metadata_keys` | Additional context |
| `example_id_key` | Column with stable example IDs; server updates the matching row instead of inserting |
| `split_key` | Single column for split assignment (replaces deprecated `split_keys`) |
| `split_keys` | **Deprecated** — use `split_key` (singular) instead |
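For the DataFrame path, the key parameters map onto columns like the following sketch (the column names `question`, `answer`, `category`, `split`, and `id` are illustrative assumptions; requires pandas):

```python
import pandas as pd

# Each key parameter selects one or more columns from a frame shaped like this:
df = pd.DataFrame(
    {
        "id": ["q-001", "q-002"],            # example_id_key="id"
        "question": ["What is 2+2?", "Capital of France?"],  # input_keys
        "answer": ["4", "Paris"],            # output_keys
        "category": ["math", "geography"],   # metadata_keys
        "split": ["train", "test"],          # split_key="split"
    }
)
```

Any columns not named by one of the key parameters are simply ignored by the upload.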
## Using Evaluators in Experiments
@@ -128,6 +148,8 @@ See `tutorials/evals/evals-2/evals_2.0_rag_demo.ipynb` for a full worked example
## Best Practices
- **Upsert by default**: Re-upload to the same name to update in-place; use `example_id_key` so the server targets specific rows instead of treating every upload as new data
- **Versioning**: Version with tags or new names (e.g., `qa-test-v2`) when you want a clean snapshot, not just incremental edits
- **Metadata**: Track source, category, difficulty
- **Balance**: Ensure diverse coverage across categories
- **Avoid `split_keys`**: Pass `split_key` (singular) — `split_keys` is deprecated and emits a `DeprecationWarning`
@@ -4,6 +4,8 @@ Creating and managing evaluation datasets.
## Creating Datasets
`createDataset()` upserts: if a dataset with the same name already exists, it is updated to match the provided examples; re-running with identical inputs is a no-op.
```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createDataset } from "@arizeai/phoenix-client/datasets";
@@ -21,15 +23,32 @@ const { datasetId } = await createDataset({
},
],
});
// With stable example IDs for targeted updates across uploads
const { datasetId } = await createDataset({
client,
name: "qa-test-v1",
examples: [
{
id: "q-001", // stable ID: the server updates this row instead of inserting a new one
input: { question: "What is 2+2?" },
output: { answer: "4" },
metadata: { category: "math" },
},
],
});
```
## Example Structure
```typescript
interface Example {
input: Record<string, unknown>; // Task input
output?: Record<string, unknown> | null; // Expected output
metadata?: Record<string, unknown> | null; // Additional context
splits?: string | string[] | null; // Split assignment ("train", ["train", "easy"], etc.)
spanId?: string | null; // OTEL span ID to link back to source trace
id?: string | null; // Stable user-provided ID; server updates matching row
}
```
@@ -64,6 +83,7 @@ const all = await listDatasets({ client });
## Best Practices
- **Upsert by default**: Re-upload to the same name to update in-place; use `id` on examples so the server targets specific rows instead of treating every upload as new data
- **Versioning**: Version with new names (e.g., `qa-test-v2`) when you want a clean snapshot, not just incremental edits
- **Metadata**: Track source, category, provenance
- **Type safety**: Use the `Example` type from `@arizeai/phoenix-client/datasets`