chore: sync Arize skills from arize-skills@6a622b6c962907f54ca3578cb2cabff161d8aae6 and phoenix@30ccbe6b38cc83719038bf30041335f29bae45e9 (#1690)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-05-15 19:21:45 +00:00 · 2026-05-12 19:08:14 -07:00
parent 10b72a6aa9
commit a4d0afc7a4
15 changed files with 513 additions and 146 deletions
@@ -4,6 +4,8 @@ Creating and managing evaluation datasets.

 ## Creating Datasets

+`create_dataset()` upserts: if a dataset with the same name already exists it is updated in-place; re-running with identical inputs is a no-op.
+
 ```python
 from phoenix.client import Client

@@ -21,6 +23,19 @@ dataset = client.datasets.create_dataset(
    ],
 )

+# With stable example IDs for targeted updates across uploads
+dataset = client.datasets.create_dataset(
+    name="qa-test-v1",
+    examples=[
+        {
+            "id": "q-001",                      # stable ID — server updates this row, not inserts
+            "input": {"question": "What is 2+2?"},
+            "output": {"answer": "4"},
+            "metadata": {"category": "math"},
+        },
+    ],
+)
+
 # From DataFrame
 dataset = client.datasets.create_dataset(
    dataframe=df,
@@ -28,6 +43,8 @@ dataset = client.datasets.create_dataset(
    input_keys=["question"],
    output_keys=["answer"],
    metadata_keys=["category"],
+    split_key="split",        # single split column (use this instead of deprecated split_keys)
+    example_id_key="id",      # column containing stable example IDs
 )
 ```

@@ -58,6 +75,9 @@ df = dataset.to_dataframe()
 | `input_keys` | Columns for task input |
 | `output_keys` | Columns for expected output |
 | `metadata_keys` | Additional context |
+| `example_id_key` | Column with stable example IDs; server updates the matching row instead of inserting |
+| `split_key` | Single column for split assignment (replaces deprecated `split_keys`) |
+| `split_keys` | **Deprecated** — use `split_key` (singular) instead |

 ## Using Evaluators in Experiments

@@ -128,6 +148,8 @@ See `tutorials/evals/evals-2/evals_2.0_rag_demo.ipynb` for a full worked example

 ## Best Practices

- **Versioning**: Create new datasets (e.g., `qa-test-v2`), don't modify
+- **Upsert by default**: Re-upload to the same name to update in-place; use `example_id_key` so the server targets specific rows instead of treating every upload as new data
+- **Versioning**: Version with tags or new names (e.g., `qa-test-v2`) when you want a clean snapshot, not just incremental edits
 - **Metadata**: Track source, category, difficulty
 - **Balance**: Ensure diverse coverage across categories
+- **Avoid `split_keys`**: Pass `split_key` (singular) — `split_keys` is deprecated and emits a `DeprecationWarning`