mirror of
https://github.com/github/awesome-copilot.git
synced 2026-05-04 14:15:55 +00:00
chore: sync Arize skills from arize-skills@597d609bfe5f07fd7d24acfdb408a082911b18fc and phoenix@746247cbb07b0dc7803b87c69dd8c77811c33f59 (#1583)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

---
name: phoenix-cli
description: Debug LLM applications using the Phoenix CLI. Fetch traces, analyze errors, structure trace review with open coding and axial coding, inspect datasets, review experiments, query annotation configs, and use the GraphQL API. Use whenever the user is analyzing traces or spans, investigating LLM/agent failures, deciding what to do after instrumenting an app, building failure taxonomies, choosing what evals to write, or asking "what's going wrong", "what kinds of mistakes", or "where do I focus" — even without naming a technique.
license: Apache-2.0
compatibility: Requires Node.js (for npx) or a global install of @arizeai/phoenix-cli. Optionally requires jq for JSON processing.
metadata:
  author: arize-ai
  version: "3.3.0"
---

# Phoenix CLI

The CLI uses singular resource commands with subcommands like `list` and `get`:

```bash
px trace list
px trace get <trace-id>
px trace annotate <trace-id>
px trace add-note <trace-id>
px span list
px span annotate <span-id>
px span add-note <span-id>
px session list
px session get <session-id>
px session annotate <session-id>
px session add-note <session-id>
px dataset list
px dataset get <name>
px project list
px annotation-config list
px auth status
```

## Setup

```bash
export PHOENIX_API_KEY=your-api-key  # if auth is enabled
```

Always use `--format raw --no-progress` when piping to `jq`.

## Quick Reference

| Task | Files |
| ---- | ----- |
| Look at sampled traces and write specific notes about what went wrong (no taxonomy yet) | [references/open-coding](references/open-coding.md) |
| Group those notes into a structured failure taxonomy and quantify what matters | [references/axial-coding](references/axial-coding.md) |

## Workflows

**"What do I do after instrumenting?" / "Where do I focus?" / "What's going wrong?"**
[open-coding](references/open-coding.md) → [axial-coding](references/axial-coding.md) → build evals for the top categories.

## Reference Categories

| Prefix | Description |
| ------ | ----------- |
| `references/open-coding` | Free-form notes against sampled traces — reach for it whenever the user wants to make sense of traces but has no failure categories yet |
| `references/axial-coding` | Inductive grouping of notes into a MECE taxonomy with counts — reach for it whenever the user has observations and needs categories or eval targets |

## Auth

```bash
px auth status                               # check connection and authentication
px auth status --endpoint http://other:6006  # check a specific endpoint
```

## Projects

```bash
px project list                                             # list all projects (table view)
px project list --format raw --no-progress | jq '.[].name'  # project names as JSON
```

## Traces

```bash
px trace list --limit 20 --format raw --no-progress | jq .
px trace list --last-n-minutes 60 --limit 20 --format raw --no-progress | jq '.[] | select(.status == "ERROR")'
px trace list --since 2025-01-15T00:00:00Z --limit 50 --format raw --no-progress | jq .
px trace list --format raw --no-progress | jq 'sort_by(-.duration) | .[0:5]'
px trace list --include-notes --format raw --no-progress | jq '.[].notes'
px trace get <trace-id> --format raw | jq .
px trace get <trace-id> --format raw | jq '.spans[] | select(.status_code != "OK")'
px trace get <trace-id> --include-notes --format raw | jq '.notes'
px trace annotate <trace-id> --name reviewer --label pass
px trace annotate <trace-id> --name reviewer --score 0.9 --format raw --no-progress
px trace add-note <trace-id> --text "needs follow-up"
```
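The jq one-liners above cover interactive use; when the same filtering happens inside a script, a small Python sketch over the `px trace list --format raw --no-progress` output does the job. Field names (`traceId`, `status`, `duration`) follow the trace JSON shape; the sample data here is purely illustrative:

```python
import json
from typing import Any

def slowest_error_traces(raw: str, n: int = 5) -> list[dict[str, Any]]:
    """Mirror of `jq 'sort_by(-.duration) | .[0:5]'` plus an ERROR filter,
    applied to the JSON array printed by `px trace list --format raw --no-progress`."""
    traces = json.loads(raw)
    errored = [t for t in traces if t.get("status") == "ERROR"]
    return sorted(errored, key=lambda t: -t.get("duration", 0))[:n]

# Illustrative sample shaped like the CLI output:
sample = json.dumps([
    {"traceId": "a", "status": "ERROR", "duration": 120},
    {"traceId": "b", "status": "OK", "duration": 900},
    {"traceId": "c", "status": "ERROR", "duration": 450},
])
print([t["traceId"] for t in slowest_error_traces(sample)])  # ['c', 'a']
```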

### Trace JSON shape

```
Trace
  traceId, status ("OK"|"ERROR"), duration (ms), startTime, endTime
  annotations[] (with --include-annotations, excludes note)
    name, result { score, label, explanation }
  notes[] (with --include-notes)
    name="note", result { explanation }
  rootSpan — top-level span (parent_id: null)
  spans[]
    name, span_kind ("LLM"|"CHAIN"|"TOOL"|"RETRIEVER"|"EMBEDDING"|"AGENT"|"RERANKER"|"GUARDRAIL"|"EVALUATOR"|"UNKNOWN")
    status_code ("OK"|"ERROR"|"UNSET"), parent_id, context.span_id
    notes[] (with --include-notes)
      name="note", result { explanation }
    attributes
      input.value, output.value — raw input/output
      llm.model_name, llm.provider
      llm.token_count.prompt/completion/total
      llm.input_messages.{N}.message.role/content
      llm.output_messages.{N}.message.role/content
      llm.invocation_parameters — JSON string (temperature, etc.)
      exception.message — set if span errored
```
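As a sketch of consuming this shape programmatically, the following pulls non-OK spans and their `exception.message` out of a `px trace get --format raw` payload, mirroring the `select(.status_code != "OK")` jq filter above (so `UNSET` spans are included too). The sample payload is illustrative:

```python
def failing_spans(trace: dict) -> list[dict]:
    """Non-OK spans from one trace payload, with span id, name,
    and the exception message when the span carried one."""
    return [
        {"span_id": s["context"]["span_id"],
         "name": s["name"],
         "error": s.get("attributes", {}).get("exception.message")}
        for s in trace.get("spans", [])
        if s.get("status_code") != "OK"
    ]

trace = {"spans": [
    {"name": "chat", "status_code": "OK", "context": {"span_id": "s1"}},
    {"name": "tool", "status_code": "ERROR", "context": {"span_id": "s2"},
     "attributes": {"exception.message": "vendor API 500"}},
]}
print(failing_spans(trace))
```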

## Spans

```bash
px span list --limit 20                                          # recent spans (table view)
px span list --last-n-minutes 60 --limit 50                      # spans from last hour
px span list --since 2025-01-15T00:00:00Z --limit 50             # spans since a timestamp
px span list --span-kind LLM --limit 10                          # only LLM spans
px span list --status-code ERROR --limit 20                      # only errored spans
px span list --name chat_completion --limit 10                   # filter by span name
px span list --trace-id <id> --format raw --no-progress | jq .   # all spans for a trace
px span list --parent-id null --limit 10                         # only root spans
px span list --parent-id <span-id> --limit 10                    # only children of a span
px span list --include-annotations --limit 10                    # include annotation scores
px span list --include-notes --limit 10                          # include span notes
px span list --attribute llm.model_name:gpt-4 --limit 10         # filter by string attribute
px span list --attribute llm.token_count.total:500 --limit 10    # filter by numeric attribute
px span list --attribute 'user.id:"12345"' --limit 10            # force string match for numeric-looking value
px span list --attribute session.id:sess:abc:123 --limit 20      # colon in value OK (split on first colon only)
px span list --attribute llm.model_name:gpt-4 --attribute session.id:abc --limit 10  # AND multiple filters
px span list output.json --limit 100                             # save to JSON file
px span list --format raw --no-progress | jq '.[] | select(.status_code == "ERROR")'
px span annotate <span-id> --name reviewer --label pass
px span annotate <span-id> --name checker --score 1 --annotator-kind CODE
px span add-note <span-id> --text "verified by agent"
```

### Span JSON shape

```
Span
  name, span_kind ("LLM"|"CHAIN"|"TOOL"|"RETRIEVER"|"EMBEDDING"|"AGENT"|"RERANKER"|"GUARDRAIL"|"EVALUATOR"|"UNKNOWN")
  status_code ("OK"|"ERROR"|"UNSET"), status_message
  context.span_id, context.trace_id, parent_id
  start_time, end_time
  attributes
    input.value, output.value — raw input/output
    llm.model_name, llm.provider
    llm.token_count.prompt/completion/total
    llm.input_messages.{N}.message.role/content
    llm.output_messages.{N}.message.role/content
    llm.invocation_parameters — JSON string (temperature, etc.)
    exception.message — set if span errored
  annotations[] (with --include-annotations, excludes note)
    name, result { score, label, explanation }
  notes[] (with --include-notes)
    name="note", result { explanation }
```
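A sketch of aggregating over this shape: summing `llm.token_count.total` per `llm.model_name` across LLM spans from `px span list --format raw --no-progress` output. The sample spans are illustrative:

```python
from collections import defaultdict

def tokens_by_model(spans: list[dict]) -> dict[str, int]:
    """Total token usage per model, counting only LLM spans."""
    totals: dict[str, int] = defaultdict(int)
    for s in spans:
        if s.get("span_kind") == "LLM":
            attrs = s.get("attributes", {})
            totals[attrs.get("llm.model_name", "unknown")] += attrs.get("llm.token_count.total", 0)
    return dict(totals)

spans = [
    {"span_kind": "LLM", "attributes": {"llm.model_name": "gpt-4", "llm.token_count.total": 300}},
    {"span_kind": "LLM", "attributes": {"llm.model_name": "gpt-4", "llm.token_count.total": 200}},
    {"span_kind": "TOOL", "attributes": {}},
]
print(tokens_by_model(spans))  # {'gpt-4': 500}
```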

## Sessions

```bash
px session list --limit 10 --format raw --no-progress | jq .
px session list --order asc --format raw --no-progress | jq '.[].session_id'
px session list --include-annotations --include-notes --format raw --no-progress | jq '.[].notes'
px session get <session-id> --format raw | jq .
px session get <session-id> --include-annotations --format raw | jq '.session.annotations'
px session get <session-id> --include-notes --format raw | jq '.session.notes'
px session annotate <session-id> --name reviewer --label pass
px session annotate <session-id> --name reviewer --score 0.9 --format raw --no-progress
px session add-note <session-id> --text "verified by agent"
```

### Session JSON shape

```
SessionData
  id, session_id, project_id
  start_time, end_time
  annotations[] (with --include-annotations, excludes note)
    name, result { score, label, explanation }
  notes[] (with --include-notes)
    name="note", result { explanation }
  traces[]
    id, trace_id, start_time, end_time

SessionAnnotation (with --include-annotations)
  id, name, annotator_kind ("LLM"|"CODE"|"HUMAN"), session_id
  result { label, score, explanation }
  metadata, identifier, source, created_at, updated_at
```

## Datasets / Experiments / Prompts

```bash
px dataset list --format raw --no-progress | jq '.[].name'
px dataset get <name> --format raw | jq '.examples[] | {input, output: .expected_output}'
px dataset get <name> --split train --format raw | jq .           # filter by split
px dataset get <name> --version <version-id> --format raw | jq .
px experiment list --dataset <name> --format raw --no-progress | jq '.[] | {id, name, failed_run_count}'
px experiment get <id> --format raw --no-progress | jq '.[] | select(.error != null) | {input, error}'
px prompt list --format raw --no-progress | jq '.[].name'
px prompt get <name> --format text --no-progress                  # plain text, ideal for piping to AI
```

## Annotation Configs

```bash
px annotation-config list                                            # list all configs (table view)
px annotation-config list --format raw --no-progress | jq '.[].name' # config names as JSON
```

## GraphQL

For ad-hoc queries not covered by the commands above. Output is `{"data": {...}}`.

---

**skills/phoenix-cli/references/axial-coding.md** (new file, 178 lines)

# Axial Coding

Group open-ended observations into structured failure taxonomies. Axial coding turns notes, trace observations, or open-coding output into named categories with counts, supporting downstream work like eval design and fix prioritization. It works well after [open coding](open-coding.md), but can start from any set of open-ended observations.

**Reach for this whenever** the user has observations and needs structure — e.g., "what categories of failures do we have", "what should I build evals for", "how do I prioritize fixes", "group these notes", "MECE breakdown", or any framing that asks for categories or counts grounded in real traces rather than invented top-down.

## Choosing the unit

Open-coding notes are usually **trace-level** (see [open-coding.md#choosing-the-unit](open-coding.md#choosing-the-unit)) — examples below lead with `px trace` and fall back to `px span` for span-level notes. **An axial label can live at a different level than the note that informed it** — that's a feature: a trace-level note "answered shipping when asked returns" can produce a span-level annotation on the retrieval span once a pattern reveals retrieval as the consistent culprit. Re-attribution at axial-coding time is what axial coding *is*. Session-level rollups go through REST `/v1/projects/{id}/session_annotations` (no CLI write path).

## Process

1. **Gather** — collect open-coding notes from the entities you reviewed (trace-level by default)
2. **Pattern** — group notes with common themes
3. **Name** — create actionable category names
4. **Attribute** — decide what level each category lives at; an axial label can move from the note's level to the component the pattern implicates
5. **Quantify** — count failures per category
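The bookkeeping around these steps can be sketched in Python. Note the keyword map here is purely hypothetical: in real axial coding the categories emerge from reading the notes (step 2), not from a fixed list; this only illustrates the gather/name/quantify mechanics:

```python
from collections import Counter

# Hypothetical keyword-to-category map for illustration only.
CATEGORY_HINTS = {
    "shipping": "retrieval_off_topic",
    "tone": "tone_mismatch",
    "invented": "hallucination",
}

def categorize(note: str) -> str:
    """Assign a candidate category to one open-coding note."""
    for kw, cat in CATEGORY_HINTS.items():
        if kw in note.lower():
            return cat
    return "uncategorized"

notes = [
    "answered shipping when asked returns",
    "invented a citation to a nonexistent paper",
    "tone too casual for enterprise ticket",
]
print(Counter(categorize(n) for n in notes))
```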

## Example Taxonomy

```yaml
failure_taxonomy:
  content_quality:
    hallucination: [invented_facts, fictional_citations]
    incompleteness: [partial_answer, missing_key_info]
    inaccuracy: [wrong_numbers, wrong_dates]

  communication:
    tone_mismatch: [too_casual, too_formal]
    clarity: [ambiguous, jargon_heavy]

  context:
    user_context: [ignored_preferences, misunderstood_intent]
    retrieved_context: [ignored_documents, wrong_context]

  safety:
    missing_disclaimers: [legal, medical, financial]
```

## Reading

### 1. Gather — extract open-coding notes

Open-coding notes are stored as annotations with `name="note"` and are only returned when `--include-notes` is passed. Use `--include-annotations` instead and you will get structured annotations but **not** notes — the server excludes notes from the annotations array.

```bash
# Trace-level notes (default for open coding)
px trace list --include-notes --format raw --no-progress | jq '
  [ .[] | select((.notes // []) | length > 0) ]
  | map({ trace_id: .traceId, notes: [ .notes[].result.explanation ] })
'

# Span-level notes (when open coding dropped to span for mechanical failures)
px span list --include-notes --format raw --no-progress | jq '
  [ .[] | select((.notes // []) | length > 0) ]
  | map({ span_id: .context.span_id, notes: [ .notes[].result.explanation ] })
'
```

### 2. Group — synthesize categories

Review the note text collected above. Manually identify recurring themes and draft candidate category names. Aim for MECE coverage: each note should fit exactly one category.

### 3. Record — write axial-coding annotations

Write one annotation per entity using `px trace annotate` or `px span annotate`. The level can differ from where the source note lives — see the **Recording** section below.

### 4. Quantify — count per category

After recording, use `--include-annotations` to count how many entities carry each label. Examples below show span-level counts; for trace-level annotations, swap `px span list` for `px trace list` (the `.annotations[]` shape is the same).

```bash
px span list --include-annotations --format raw --no-progress | jq '
  [ .[] | .annotations[]? | select(.name == "failure_category" and .result.label != null) ]
  | group_by(.result.label)
  | map({ label: .[0].result.label, count: length })
  | sort_by(-.count)
'
```

Filter to a specific annotation name to check coverage:

```bash
px span list --include-annotations --format raw --no-progress | jq '
  [ .[] | select((.annotations // []) | any(.name == "failure_category")) ]
  | length
'
```
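The same per-label tally in Python, for scripts that already hold the `px span list --format raw` output (annotation fields per the span JSON shape; sample data is illustrative):

```python
from collections import Counter

def label_counts(spans: list[dict], name: str = "failure_category") -> list[tuple[str, int]]:
    """Count spans per annotation label for one annotation name,
    most common first (Python equivalent of the jq group_by pipeline)."""
    c = Counter(
        a["result"]["label"]
        for s in spans
        for a in s.get("annotations", [])
        if a.get("name") == name and a.get("result", {}).get("label")
    )
    return c.most_common()

spans = [
    {"annotations": [{"name": "failure_category", "result": {"label": "retrieval_off_topic"}}]},
    {"annotations": [{"name": "failure_category", "result": {"label": "retrieval_off_topic"}}]},
    {"annotations": [{"name": "failure_category", "result": {"label": "hallucination"}}]},
]
print(label_counts(spans))  # [('retrieval_off_topic', 2), ('hallucination', 1)]
```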

## Recording

Use the matching annotate command for the level the **label** belongs at — which may differ from where the source note lives (see [Choosing the unit](#choosing-the-unit)):

```bash
# Trace-level label (most common — the trace as a whole exhibits the failure)
px trace annotate <trace-id> \
  --name failure_category \
  --label answered_off_topic \
  --explanation "asked about returns; answer covered shipping" \
  --annotator-kind HUMAN

# Span-level label (when the pattern implicates a specific component)
px span annotate <span-id> \
  --name failure_category \
  --label retrieval_off_topic \
  --explanation "retrieved shipping docs for a returns query" \
  --annotator-kind HUMAN
```

Accepted flags: `--name`, `--label`, `--score`, `--explanation`, `--annotator-kind` (`HUMAN`, `LLM`, `CODE`). There are no `--identifier` or `--sync` flags on these commands.

### Bulk recording

Axial coding categorizes the entities you took notes on during open coding. Do **not** filter by `--status-code ERROR` — that captures only spans where Python raised, which excludes most failure modes (hallucination, wrong tone, retrieval miss). See [open-coding.md](open-coding.md#inspection) for the full reasoning.

```bash
# Bulk-annotate traces that already have open-coding notes
px trace list --include-notes --format raw --no-progress \
  | jq -r '.[] | select((.notes // []) | length > 0) | .traceId' \
  | while read tid; do
      px trace annotate "$tid" \
        --name failure_category \
        --label answered_off_topic \
        --annotator-kind HUMAN
    done
```

The same pattern works for span-level notes — swap `px trace` for `px span` and `.traceId` for `.context.span_id`.
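When a shell loop is too blunt (dry runs, batching, per-trace labels), a Python sketch can build the same `px trace annotate` invocations without executing anything; run them with `subprocess.run` if desired. Sample input is illustrative:

```python
import json

def annotate_commands(raw: str, label: str) -> list[list[str]]:
    """Build `px trace annotate` argv lists for every trace that
    already carries open-coding notes (mirrors the shell loop above,
    but nothing is executed here)."""
    cmds = []
    for t in json.loads(raw):
        if t.get("notes"):
            cmds.append(["px", "trace", "annotate", t["traceId"],
                         "--name", "failure_category",
                         "--label", label,
                         "--annotator-kind", "HUMAN"])
    return cmds

raw = json.dumps([
    {"traceId": "t1", "notes": [{"result": {"explanation": "off topic"}}]},
    {"traceId": "t2", "notes": []},
])
print(annotate_commands(raw, "answered_off_topic"))
```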

Aside: for Node-based bulk scripts, `@arizeai/phoenix-client` exposes `addSpanAnnotation`, `addSpanNote`, and `addTraceNote`. (No `addTraceAnnotation` is exported today; use the REST endpoint or `px trace annotate` for trace-level annotations.)

Aside: `px api graphql` rejects mutations — it cannot write annotations.

## Agent Failure Taxonomy

```yaml
agent_failures:
  planning: [wrong_plan, incomplete_plan]
  tool_selection: [wrong_tool, missed_tool, unnecessary_call]
  tool_execution: [wrong_parameters, type_error]
  state_management: [lost_context, stuck_in_loop]
  error_recovery: [no_fallback, wrong_fallback]
```

### Transition Matrix — jq sketch

To find where failures occur between agent states, identify the last non-error span before each first-error span within a trace. Note: OTel leaves most spans at `status_code == "UNSET"` and only sets `"OK"` when code explicitly does so — match `!= "ERROR"` rather than `== "OK"` so the matrix works on typical OTel data.

```bash
px span list --format raw --no-progress | jq '
  group_by(.context.trace_id)
  | map(
      sort_by(.start_time)
      | { trace_id: .[0].context.trace_id,
          last_non_error: map(select(.status_code != "ERROR")) | last | .name,
          first_err: map(select(.status_code == "ERROR")) | first | .name }
    )
  | [ .[] | select(.first_err != null) ]
  | group_by([.last_non_error, .first_err])
  | map({ transition: "\(.[0].last_non_error) → \(.[0].first_err)", count: length })
  | sort_by(-.count)
'
```

Use the output to tally which state-to-state transitions are most failure-prone and add them to your taxonomy.
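The same transition tally in Python, mirroring the jq sketch (last non-ERROR span name paired with the first ERROR span name per trace, then counted); the sample spans are illustrative:

```python
from collections import Counter
from itertools import groupby

def transition_matrix(spans: list[dict]) -> list[tuple[str, int]]:
    """Tally 'last non-error span -> first error span' per trace,
    most frequent transition first (same logic as the jq pipeline)."""
    keyfn = lambda s: s["context"]["trace_id"]
    tally: Counter = Counter()
    for _, group in groupby(sorted(spans, key=keyfn), key=keyfn):
        trace = sorted(group, key=lambda s: s["start_time"])
        errors = [s for s in trace if s["status_code"] == "ERROR"]
        ok = [s for s in trace if s["status_code"] != "ERROR"]
        if errors and ok:
            tally[f"{ok[-1]['name']} -> {errors[0]['name']}"] += 1
    return tally.most_common()

spans = [
    {"context": {"trace_id": "t1"}, "name": "plan", "status_code": "UNSET", "start_time": 1},
    {"context": {"trace_id": "t1"}, "name": "call_tool", "status_code": "ERROR", "start_time": 2},
]
print(transition_matrix(spans))  # [('plan -> call_tool', 1)]
```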

## What Makes a Good Category

A useful category is:

- **Named for the cause**, not the symptom ("wrong_tool_selected", not "bad_output")
- **Tied to a fix** — if you can't name a remediation, the category is too vague
- **Grounded in data** — emerged from actual note text, not assumed upfront

## Principles

- **MECE** — each failure fits ONE category
- **Actionable** — categories suggest fixes
- **Bottom-up** — let categories emerge from data

---

**skills/phoenix-cli/references/open-coding.md** (new file, 127 lines)

# Open Coding

Free-form note-writing against sampled traces, before any taxonomy exists. After you pick a sample of traces, read each one and write a short, specific observation of what went wrong. These raw notes feed [axial coding](axial-coding.md), where they get grouped into named failure categories — and ultimately into eval targets or fix priorities.

**Reach for this whenever** the user wants to look at traces or spans without a fixed taxonomy yet — e.g., "what's going wrong with this agent", "I just instrumented my app, where do I start", "review these traces", "what kinds of mistakes is the model making", "help me make sense of these outputs", or any framing that needs grounded observations before categories.

## Choosing the unit

Open coding has two scopes that don't have to match:

- **Review scope** — the **trace**. Read input → tool calls → retrieved context → output as one story.
- **Recording scope** — **default to the trace**. The honest observation is usually trace-shaped ("asked X, got Y; the answer didn't address the question"), and forcing localization to a span at this stage commits to causal attribution you don't yet have data to support — that's axial coding's job.

Drop to a **span** only when one of the following holds:

- The span, read in isolation, is still wrong: an exception fired, a tool returned an error response, the output is malformed.
- You already know the domain well enough to attribute the failure on sight without inferring across spans.

Session-level findings are axial-coding rollup targets, not open-coding notes — Phoenix has REST `/v1/projects/{id}/session_annotations` but no session `add-note` path.

## Process

1. **Inspect** — fetch a trace from your sample
2. **Read** — look at input, output, exceptions, tool calls, retrieved context
3. **Note** — write one specific sentence describing what went wrong (or skip if correct)
4. **Record** — attach the note to the trace with `px trace add-note` (default), or to a span with `px span add-note` for in-isolation/mechanical failures
5. **Iterate** — move to the next trace; repeat until the sample is exhausted or saturation hits

## Inspection

Use `px` to read trace and span context before writing a note. Open coding reviews by **trace** — read input → tool calls → retrieved context → output as a unit. Record on the trace by default; drill to a specific span only when the failure is mechanical (exception, error response, malformed output) or you can attribute on sight (see [Choosing the unit](#choosing-the-unit)).

> **Don't filter the sample by `--status-code ERROR`.** OTel's `status_code` only flips to `ERROR` when an instrumentor catches a raised Python exception (network failure, 5xx, parse error). Hallucinations, wrong tone, retrieval misses, and bad tool selection all complete cleanly and arrive as `OK` or `UNSET`. Sampling for open coding by `--status-code ERROR` excludes the population this workflow exists to surface.

```bash
# Sample recent traces — the unit of inspection in open coding
px trace list --limit 100 --format raw --no-progress | jq '
  .[] | {trace_id: .traceId, root: .rootSpan.name, status,
         input: .rootSpan.attributes["input.value"],
         output: .rootSpan.attributes["output.value"]}
'

# Trace-level context — all spans in one trace, ordered by start_time
px trace get <trace-id> --format raw | jq '
  .spans | sort_by(.start_time) | map({span_id: .context.span_id, name, status_code,
                                       input: .attributes["input.value"],
                                       output: .attributes["output.value"]})
'

# Drill to one span (px span get does not exist; filter via span list)
px span list --trace-id <trace-id> --format raw --no-progress \
  | jq '.[] | select(.context.span_id == "<span-id>")'

# Check existing notes on traces (default) or spans you are about to review.
# Notes are stored as annotations with name="note"; use --include-notes (not --include-annotations)
px trace list --include-notes --limit 10 --format raw --no-progress | jq '
  .[] | select((.notes // []) | length > 0)
  | {trace_id: .traceId, notes: [.notes[] | .result.explanation]}
'
# Same shape on spans — swap px trace for px span and use .context.span_id
```

Always pipe through `jq` with `--format raw --no-progress` when scripting.

## Recording Notes

The default write path is `px trace add-note <trace-id> --text "..."` — most observations are trace-shaped and shouldn't pre-commit to localization. Drop to `px span add-note <span-id>` when the failure is in-isolation wrong (exception, error response, malformed output) or you already know the failure structure on sight.

```bash
# Trace-level note (default)
px trace add-note <trace-id> --text "Asked about returns; final answer covered shipping policy instead"

# Span-level note (mechanical or attributable-on-sight failures)
px span add-note <span-id> --text "Tool call returned 500 — vendor API unreachable"

# Interactive loop — walk traces, write a trace-level note per failing trace
px trace list --last-n-minutes 60 --limit 50 --format raw --no-progress \
  | jq -r '.[].traceId' \
  | while read -r tid; do
      echo "── trace $tid ──"
      px trace get "$tid" --format raw | jq '
        {input: .rootSpan.attributes["input.value"],
         output: .rootSpan.attributes["output.value"],
         spans: (.spans | sort_by(.start_time) | map({name, status_code}))}
      '
      # Read from the terminal, not the pipe — stdin is the trace-ID stream here
      read -p "Note for $tid (blank to skip): " note < /dev/tty
      [ -z "$note" ] && continue
      px trace add-note "$tid" --text "$note"
    done
```

Bulk auto-tagging by status code (e.g. `px span list --status-code ERROR | xargs ... add-note "error"`) is **not** open coding — open coding is manual, observation-grounded, and ranges over all failure modes, not just spans where Python raised. Skip the bulk-by-status-code shortcut; it produces fewer, less informative notes than walking traces.

**Fallback write paths (one-line asides):**

- `POST /v1/trace_notes` and `POST /v1/span_notes` — accept one `{data: {trace_id|span_id, note}}` per request; use for scripted writes outside the CLI.
- `@arizeai/phoenix-client` `addTraceNote` and `addSpanNote` wrap the same endpoints.
- `px api graphql` rejects mutations with `"Only queries are permitted."` — use `px trace/span add-note` or the REST endpoints instead.

## What Makes a Good Note

| Weak note | Why it's weak | Good note | Why it's strong |
| --------- | ------------- | --------- | --------------- |
| "Wrong answer" | No observable detail | "Said the store closes at 6pm but policy is 9pm" | Quotes observed vs. correct value |
| "Bad tone" | Vague judgment | "Used first-name greeting for an enterprise support ticket" | Specifies the context mismatch |
| "Hallucination" | Labels before observing | "Cited a product feature ('auto-renew') that does not exist in the schema" | Describes what was fabricated |
| "Retrieval issue" | Category, not observation | "Retrieved docs about shipping when the question was about returns" | States what was retrieved vs. needed |
| "Model confused" | Opaque | "Answered in Spanish when the user wrote in English" | Observable and reproducible |

Write what you saw, not the category you think it belongs to — categorization happens in [axial coding](axial-coding.md). Short prefixes like `TONE:` or `FACTUAL:` are a personal shorthand, not a repo convention.

## Saturation

Stop writing notes when observations stop being new. Signals:

- **Repeats** — the last 10–15 traces produced notes that describe failures you've already seen.
- **Paraphrase convergence** — you catch yourself writing minor variations of earlier notes.
- **Skips outnumber notes** — most recent traces are correct and need no note.

At saturation, move on to [axial coding](axial-coding.md) to group what you have. Continuing past saturation adds traces but not insight. You do not need to annotate every trace — annotating correct ones dilutes signal.
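The repeat signal can be roughly mechanized. This sketch approximates "nothing new" with exact-duplicate matching over a trailing window, which real review replaces with judgment:

```python
def saturated(notes: list[str], window: int = 10) -> bool:
    """True when the last `window` notes add nothing the earlier
    notes didn't already say (crude proxy: exact duplicates)."""
    if len(notes) <= window:
        return False
    seen = set(notes[:-window])
    return all(n in seen for n in notes[-window:])

notes = ["wrong hours quoted"] * 8 + ["answered in Spanish"] * 8
print(saturated(notes, window=4))  # True
```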

## Principles

- **Free-form over structured** — do not pre-commit to a taxonomy during open coding; categories emerge in axial coding.
- **Specific over general** — quote or paraphrase the observed failure; vague labels ("bad response") carry no signal.
- **Context before labeling** — inspect input, output, and retrieved context before writing any note.
- **Iterate before categorizing** — work through the full sample first; resist grouping while still collecting.
- **Skip is valid** — a correct span needs no note; annotating everything dilutes signal.