mirror of
https://github.com/github/awesome-copilot.git
synced 2026-05-15 19:21:45 +00:00
chore: sync Arize skills from arize-skills@6a622b6c962907f54ca3578cb2cabff161d8aae6 and phoenix@30ccbe6b38cc83719038bf30041335f29bae45e9 (#1690)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This commit is contained in:
@@ -4,17 +4,40 @@ Group open-ended observations into structured failure taxonomies. Axial coding t
|
||||
|
||||
**Reach for this whenever** the user has observations and needs structure — e.g., "what categories of failures do we have", "what should I build evals for", "how do I prioritize fixes", "group these notes", "MECE breakdown", or any framing that asks for categories or counts grounded in real traces rather than invented top-down.
|
||||
|
||||
## Coding annotation identifier (reuse the open-coding value)
|
||||
|
||||
Reuse the **coding annotation identifier** chosen in open coding — every `annotate` call below passes `--identifier "$CODING_ANNOTATION_IDENTIFIER"` explicitly. In a fresh shell or fresh agent invocation, set `CODING_ANNOTATION_IDENTIFIER` to the same value (recoverable from the wrap-up UI URL or by listing `.px/coding/*.jsonl`); don't mint a new id. See [open-coding.md#coding-annotation-identifier-pick-this-first](open-coding.md#coding-annotation-identifier-pick-this-first) for the rationale and the sanitization rule.
|
||||
|
||||
> **Workflow term vs. server annotation name.** The skill calls this value the **coding annotation identifier**; the server annotation NAME used for the UI filter stays `coding_session_id` for data compatibility. Don't try to rename the server-side key.
|
||||
|
||||
```bash
|
||||
CODING_ANNOTATION_IDENTIFIER="coding-run:chatbot-context-loss-2026-05-06"
|
||||
SLUG=$(echo -n "$CODING_ANNOTATION_IDENTIFIER" | sed 's/[^a-zA-Z0-9_-]/-/g')
|
||||
NOTES_SIDECAR=".px/coding/${SLUG}.jsonl"
|
||||
AXIAL_SIDECAR=".px/coding/${SLUG}-axial.jsonl"
|
||||
```
|
||||
|
||||
## Choosing the unit
|
||||
|
||||
Open-coding notes are usually **trace-level** (see [open-coding.md#choosing-the-unit](open-coding.md#choosing-the-unit)) — examples below lead with `px trace` and fall back to `px span` for span-level notes. **An axial label can live at a different level than the note that informed it** — that's a feature: a trace-level note "answered shipping when asked returns" can produce a span-level annotation on the retrieval span once a pattern reveals retrieval as the consistent culprit. Re-attribution at axial coding time is what axial coding *is*. Session-level rollups go through REST `/v1/projects/{id}/session_annotations` (no CLI write path).
|
||||
Open coding's diagnostic in [open-coding.md#choosing-the-unit-of-analysis](open-coding.md#choosing-the-unit-of-analysis) commits to a unit (trace, span, or session). Axial coding inherits that unit by default — if open coding ran at the session level, axial labels will too; same for trace and span.
|
||||
|
||||
**An axial label can live at a different level than the note that informed it** — that's a feature, and it works in every direction:
|
||||
|
||||
- *Trace → span*: a trace-level note "answered shipping when asked about returns" can produce a span-level annotation on the retrieval span once a pattern reveals retrieval as the consistent culprit.
|
||||
- *Trace → session*: a batch of trace-level notes describing single-turn confusion can produce a session-level annotation once you see the pattern is "the agent doesn't track the user's stated context across turns."
|
||||
- *Session → trace*: a session-level note about cross-turn drift may, on closer reading, attribute to one specific turn where the agent dropped the thread; a trace-level annotation can name that turn.
|
||||
|
||||
Whichever level you write the axial label on, write the matching `coding_session_id` UI-filter annotation on the same entity (see [UI-filter annotation](#ui-filter-annotation) below) so the UI link picks it up.
|
||||
|
||||
## Process
|
||||
|
||||
1. **Gather** — Collect open-coding notes from the entities you reviewed (trace-level by default)
|
||||
2. **Pattern** — Group notes with common themes
|
||||
3. **Name** — Create actionable category names
|
||||
4. **Attribute** — Decide what level each category lives at; an axial label can move from the note's level to the component the pattern implicates
|
||||
5. **Quantify** — Count failures per category
|
||||
1. **Set the coding annotation identifier** — set `CODING_ANNOTATION_IDENTIFIER` to the value used in open coding and re-derive `SLUG`, `NOTES_SIDECAR`, `AXIAL_SIDECAR` (see [Coding annotation identifier](#coding-annotation-identifier-reuse-the-open-coding-value))
|
||||
2. **Gather** — read open-coding notes from `$NOTES_SIDECAR` (at the unit committed in open coding); no server round-trip
|
||||
3. **Pattern** — group notes with common themes
|
||||
4. **Name** — create actionable category names
|
||||
5. **Attribute** — decide what level each category lives at; an axial label can move up (trace → session) or down (trace → span) from the source note's level to the level the pattern actually implicates
|
||||
6. **Record** — `px {trace,span,session} annotate ... --name axial_coding_category --label <cat> --identifier "$CODING_ANNOTATION_IDENTIFIER"`, add/update one JSONL sidecar row for the label, then write the matching `coding_session_id` UI-filter annotation
|
||||
7. **Quantify** — count failures per category from `$AXIAL_SIDECAR`
|
||||
|
||||
## Example Taxonomy
|
||||
|
||||
@@ -39,97 +62,110 @@ failure_taxonomy:
|
||||
|
||||
## Reading
|
||||
|
||||
### 1. Gather — extract open-coding notes
|
||||
### 1. Gather — read this run's open-coding notes from the sidecar
|
||||
|
||||
Open-coding notes are stored as annotations with `name="note"` and are only returned when `--include-notes` is passed. Use `--include-annotations` instead and you will get structured annotations but **not** notes — the server excludes notes from the annotations array.
|
||||
Open-coding wrote one JSONL line per note to `$NOTES_SIDECAR` (`.px/coding/${SLUG}.jsonl`). Read it directly — no server round-trip is needed. Each line has `entity_kind`, `entity_id`, `note`, `identifier`, and `ts`. If the same `(entity_kind, entity_id)` appears more than once, use the newest `ts` as the current note.
|
||||
|
||||
```bash
|
||||
# Trace-level notes (default for open coding)
|
||||
px trace list --include-notes --format raw --no-progress | jq '
|
||||
[ .[] | select((.notes // []) | length > 0) ]
|
||||
| map({ trace_id: .traceId, notes: [ .notes[].result.explanation ] })
|
||||
'
|
||||
**Missing-file behavior.** An absent `$NOTES_SIDECAR` means open coding hasn't run for this coding annotation identifier in this CWD — stop and run open coding first, do not silently treat it as zero notes.
|
||||
|
||||
# Span-level notes (when open coding dropped to span for mechanical failures)
|
||||
px span list --include-notes --format raw --no-progress | jq '
|
||||
[ .[] | select((.notes // []) | length > 0) ]
|
||||
| map({ span_id: .context.span_id, notes: [ .notes[].result.explanation ] })
|
||||
'
|
||||
```
|
||||
**Malformed lines.** Each line is independently parseable JSON. If `jq` reports a parse error, fix or drop that line manually; do not edit other lines.
|
||||
|
||||
**Notes outside this run.** The sidecar only carries notes this CWD wrote. To pull notes another reviewer or earlier run wrote, fetch them via `px {trace,span,session} list --include-notes` (embeds notes into row output) — the workflow's sidecar is intentionally per-CWD-per-coding-identifier.
|
||||
|
||||
### 2. Group — synthesize categories
|
||||
|
||||
Review the note text collected above. Manually identify recurring themes and draft candidate category names. Aim for MECE coverage: each note should fit exactly one category.
|
||||
|
||||
### 3. Record — write axial-coding annotations
|
||||
### 3. Record — write axial-coding labels
|
||||
|
||||
Write one annotation per entity using `px trace annotate` or `px span annotate`. The level can differ from where the source note lives — see the **Recording** section below.
|
||||
Write one annotation per entity using `px {trace,span,session} annotate`, passing `--identifier "$CODING_ANNOTATION_IDENTIFIER"` explicitly on every call, and record one JSONL row in `$AXIAL_SIDECAR` so [Quantify](#4-quantify--count-per-category-from-the-axial-sidecar) below can count without a server round-trip. The level can differ from where the source note lives — see [Recording](#recording) below.
|
||||
|
||||
### 4. Quantify — count per category
|
||||
### 4. Quantify — count per category from the axial sidecar
|
||||
|
||||
After recording, use `--include-annotations` to count how many entities carry each label. Examples below show span-level counts; for trace-level annotations, swap `px span list` for `px trace list` (the `.annotations[]` shape is the same).
|
||||
Counts come from `$AXIAL_SIDECAR` (populated by [Record](#3-record--write-axial-coding-labels)). No server query, no project-wide history mixed in — the sidecar holds exactly the labels this run wrote. Count the current rows by `axial_label`; if an entity appears more than once, use the newest `ts`.
|
||||
|
||||
```bash
|
||||
px span list --include-annotations --format raw --no-progress | jq '
|
||||
[ .[] | .annotations[]? | select(.name == "failure_category" and .result.label != null) ]
|
||||
| group_by(.result.label)
|
||||
| map({ label: .[0].result.label, count: length })
|
||||
| sort_by(-.count)
|
||||
'
|
||||
```
|
||||
|
||||
Filter to a specific annotation name to check coverage:
|
||||
|
||||
```bash
|
||||
px span list --include-annotations --format raw --no-progress | jq '
|
||||
[ .[] | select((.annotations // []) | any(.name == "failure_category")) ]
|
||||
| length
|
||||
'
|
||||
```
|
||||
Same missing-file and malformed-line rules as `$NOTES_SIDECAR`: a missing axial sidecar means no labels have been written yet (run [Record](#3-record--write-axial-coding-labels)); malformed lines are line-local — fix or drop, don't edit neighbors.
|
||||
|
||||
## Recording
|
||||
|
||||
Use the matching annotate command for the level the **label** belongs at — which may differ from where the source note lives (see [Choosing the unit](#choosing-the-unit)):
|
||||
Use the matching annotate command for the level the **label** belongs at — which may differ from where the source note lives (see [Choosing the unit](#choosing-the-unit)). Every call carries `--identifier "$CODING_ANNOTATION_IDENTIFIER"` and `--format raw --no-progress`, and is paired with a JSONL row in `$AXIAL_SIDECAR`.
|
||||
|
||||
**Axial sidecar JSONL line shape (one per `annotate`):**
|
||||
|
||||
```json
|
||||
{"entity_kind":"trace","entity_id":"<trace-id>","annotation_name":"axial_coding_category","axial_label":"<label>","explanation":"<optional explanation>","identifier":"<original identifier value, unsanitized>","ts":"<ISO-8601 UTC>"}
|
||||
```
|
||||
|
||||
Fields:
|
||||
- `entity_kind` — `"trace"`, `"span"`, or `"session"` (matches the `annotate` subcommand)
|
||||
- `entity_id` — the entity argument passed to `annotate`
|
||||
- `annotation_name` — always `"axial_coding_category"` for axial labels (the workflow's reserved annotation name)
|
||||
- `axial_label` — the `--label` value, verbatim; this is what [Quantify](#4-quantify--count-per-category-from-the-axial-sidecar) groups on
|
||||
- `explanation` — optional, but include it when the `annotate` call used `--explanation`
|
||||
- `identifier` — the **original** `$CODING_ANNOTATION_IDENTIFIER` value, unsanitized; the sanitized form lives only in the filename
|
||||
- `ts` — ISO-8601 UTC timestamp of the local append
|
||||
|
||||
If you revise a label for the same entity under the same coding annotation identifier, either replace that row or append a newer row. When duplicate `(entity_kind, entity_id, annotation_name)` rows exist, the newest `ts` is the current label. This matches the server upsert behavior of `annotate --identifier`.
|
||||
|
||||
Minimal trace example:
|
||||
|
||||
```bash
|
||||
# Trace-level label (most common — the trace as a whole exhibits the failure)
|
||||
px trace annotate <trace-id> \
|
||||
--name failure_category \
|
||||
--name axial_coding_category \
|
||||
--label answered_off_topic \
|
||||
--explanation "asked about returns; answer covered shipping" \
|
||||
--annotator-kind HUMAN
|
||||
|
||||
# Span-level label (when the pattern implicates a specific component)
|
||||
px span annotate <span-id> \
|
||||
--name failure_category \
|
||||
--label retrieval_off_topic \
|
||||
--explanation "retrieved shipping docs for a returns query" \
|
||||
--annotator-kind HUMAN
|
||||
--annotator-kind HUMAN \
|
||||
--identifier "$CODING_ANNOTATION_IDENTIFIER" \
|
||||
--format raw --no-progress
|
||||
```
|
||||
|
||||
Accepted flags: `--name`, `--label`, `--score`, `--explanation`, `--annotator-kind` (`HUMAN`, `LLM`, `CODE`). There are no `--identifier` or `--sync` flags on these commands.
|
||||
Then add a matching JSONL row to `$AXIAL_SIDECAR` using the line shape above. For span or session labels, change `entity_kind`, `entity_id`, and the `px` subcommand accordingly.
|
||||
|
||||
### Bulk recording
|
||||
Accepted flags: `--name`, `--label`, `--score`, `--explanation`, `--annotator-kind` (`HUMAN`, `LLM`, `CODE`), `--identifier`. There is no `--sync` flag — the CLI passes `sync=true` itself.
|
||||
|
||||
Axial coding categorizes the entities you took notes on during open coding. Do **not** filter by `--status-code ERROR` — that captures only spans where Python raised, which excludes most failure modes (hallucination, wrong tone, retrieval miss). See [open-coding.md](open-coding.md#inspection) for the full reasoning.
|
||||
### UI-filter annotation
|
||||
|
||||
Write a `coding_session_id` annotation at the same level as the axial label — see [open-coding.md#ui-filter-annotation](open-coding.md#ui-filter-annotation) for why the Phoenix UI filter requires a name-based annotation rather than the bare `--identifier`. If open coding already wrote `coding_session_id` on the same entity, this call upserts (idempotent). The annotation NAME `coding_session_id` is unchanged; only the workflow's spoken term is "coding annotation identifier".
|
||||
|
||||
```bash
|
||||
# Bulk-annotate traces that already have open-coding notes
|
||||
px trace list --include-notes --format raw --no-progress \
|
||||
| jq -r '.[] | select((.notes // []) | length > 0) | .traceId' \
|
||||
| while read tid; do
|
||||
px trace annotate "$tid" \
|
||||
--name failure_category \
|
||||
--label answered_off_topic \
|
||||
--annotator-kind HUMAN
|
||||
done
|
||||
# Same level as the axial label above
|
||||
px trace annotate <trace-id> \
|
||||
--name coding_session_id \
|
||||
--label "$CODING_ANNOTATION_IDENTIFIER" \
|
||||
--identifier "$CODING_ANNOTATION_IDENTIFIER"
|
||||
# or px span annotate / px session annotate at matching levels
|
||||
```
|
||||
|
||||
The same pattern works for span-level notes — swap `px trace` for `px span` and `.traceId` for `.context.span_id`.
|
||||
### Recording discipline
|
||||
|
||||
Aside: for Node-based bulk scripts, `@arizeai/phoenix-client` exposes `addSpanAnnotation`, `addSpanNote`, and `addTraceNote`. (No `addTraceAnnotation` is exported today; use the REST endpoint or `px trace annotate` for trace-level annotations.)
|
||||
Axial coding categorizes the entities you took notes on during open coding. Use `$NOTES_SIDECAR` as the source of candidate entities and write labels only after reading the note text and surrounding trace/span/session context. Do **not** filter by `--status-code ERROR` — that captures only spans where Python raised, which excludes most failure modes (hallucination, wrong tone, retrieval miss). See [open-coding.md](open-coding.md#inspection) for the full reasoning.
|
||||
|
||||
Aside: `px api graphql` rejects mutations — it cannot write annotations.
|
||||
**Fallback paths:** REST `POST /v1/{trace,span,session}_annotations` and `@arizeai/phoenix-client`'s `addSpanAnnotation` / `addSessionAnnotation` (no `addTraceAnnotation` is exported today — use REST or `px trace annotate`). The GraphQL endpoint rejects mutations.
|
||||
|
||||
## Wrapping up
|
||||
|
||||
After axial coding finishes, share the Phoenix UI link with the user. The link points to the project's traces table filtered by the `coding_session_id` annotation — `annotations['coding_session_id'].label == '<coding-annotation-id>'`. The UI route `/projects/:projectId` expects an encoded GraphQL node ID, not a project name — resolve it via `px project get`:
|
||||
|
||||
```bash
|
||||
project_id=$(px project get "$PHOENIX_PROJECT" --format raw --no-progress | jq -r '.id')
|
||||
encoded=$(python3 -c 'import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1]))' \
|
||||
"annotations['coding_session_id'].label == '$CODING_ANNOTATION_IDENTIFIER'")
|
||||
echo "Phoenix UI: $PHOENIX_HOST/projects/$project_id/traces?filterCondition=$encoded"
|
||||
```
|
||||
|
||||
If the user wants to discard everything this run produced (open-coding notes, axial-coding labels, and `coding_session_id` annotations on the server, plus the local sidecars), three identifier-bound deletes handle the server side and one `rm` handles the local sidecars. **Confirm before running** — destructive. Each `px <entity>-annotations delete` call requires `--all` to authorize the unbounded sweep; `--identifier` only narrows. Set `PHOENIX_CLI_DANGEROUSLY_ENABLE_DELETES=true` first if not already exported:
|
||||
|
||||
```bash
|
||||
for kind in trace span session; do
|
||||
px "$kind-annotations" delete \
|
||||
--identifier "$CODING_ANNOTATION_IDENTIFIER" \
|
||||
--all -y \
|
||||
--format raw --no-progress
|
||||
done
|
||||
rm -f "$NOTES_SIDECAR" "$AXIAL_SIDECAR"
|
||||
```
|
||||
|
||||
Each `px <entity>-annotations delete` call removes notes, axial-coding labels, and `coding_session_id` annotations together because they share the underlying annotation table; the `rm` clears the local sidecars.
|
||||
|
||||
## Agent Failure Taxonomy
|
||||
|
||||
@@ -173,6 +209,10 @@ A useful category is:
|
||||
|
||||
## Principles
|
||||
|
||||
- **MECE** - Each failure fits ONE category
|
||||
- **Actionable** - Categories suggest fixes
|
||||
- **Bottom-up** - Let categories emerge from data
|
||||
- **One coding annotation identifier per run** — every `annotate` call and every sidecar line carries `$CODING_ANNOTATION_IDENTIFIER`, the same value open coding used; never mint a new id mid-run.
|
||||
- **Pass `--identifier` explicitly** — every `px` call gets `--identifier "$CODING_ANNOTATION_IDENTIFIER"`; do not rely on inherited env vars.
|
||||
- **Sidecar reads, server writes** — Gather and Quantify read `$NOTES_SIDECAR` and `$AXIAL_SIDECAR` locally; Record writes to the server and updates the sidecar. If an entity appears more than once, the newest `ts` wins.
|
||||
- **MECE** — Each failure fits ONE category.
|
||||
- **Actionable** — Categories suggest fixes.
|
||||
- **Bottom-up** — Let categories emerge from data.
|
||||
- **UI-filter annotation always paired** — never write `axial_coding_category` without writing the matching `coding_session_id` annotation; the UI link depends on it.
|
||||
|
||||
Reference in New Issue
Block a user