# Open Coding Free-form note-writing against sampled traces, before any taxonomy exists. After you pick a sample of traces, read each one and write a short, specific observation of what went wrong. These raw notes feed [axial coding](axial-coding.md), where they get grouped into named failure categories — and ultimately into eval targets or fix priorities. **Reach for this whenever** the user wants to look at traces or spans without a fixed taxonomy yet — e.g., "what's going wrong with this agent", "I just instrumented my app, where do I start", "review these traces", "what kinds of mistakes is the model making", "help me make sense of these outputs", or any framing that needs grounded observations before categories. ## Choosing the unit Open coding has two scopes that don't have to match: - **Review scope** — the **trace**. Read input → tool calls → retrieved context → output as one story. - **Recording scope** — **default to the trace**. The honest observation is usually trace-shaped ("asked X, got Y; the answer didn't address the question"), and forcing localization to a span at this stage commits to causal attribution you don't yet have data to support — that's axial coding's job. Drop to a **span** only when one of: - The span, read in isolation, is still wrong: an exception fired, a tool returned an error response, the output is malformed. - You already know the domain well enough to attribute the failure on sight without inferring across spans. Session-level findings are axial-coding rollup targets, not open-coding notes — Phoenix has REST `/v1/projects/{id}/session_annotations` but no session `add-note` path. ## Process 1. **Inspect** — fetch a trace from your sample 2. **Read** — look at input, output, exceptions, tool calls, retrieved context 3. **Note** — write one specific sentence describing what went wrong (or skip if correct) 4. **Record** — attach the note to the trace with `px trace add-note` (default), or to a span with `px span add-note` for in-isolation/mechanical failures 5. **Iterate** — move to the next trace; repeat until the sample is exhausted or saturation hits ## Inspection Use `px` to read trace and span context before writing a note. Open coding reviews by **trace** — read input → tool calls → retrieved context → output as a unit. Record on the trace by default; drill to a specific span only when the failure is mechanical (exception, error response, malformed output) or you can attribute on sight (see [Choosing the unit](#choosing-the-unit)). > **Don't filter the sample by `--status-code ERROR`.** OTel's `status_code` only flips to `ERROR` when an instrumentor catches a raised Python exception (network failure, 5xx, parse error). Hallucinations, wrong tone, retrieval misses, and bad tool selection all complete cleanly and arrive as `OK` or `UNSET`. Sampling for open coding by `--status-code ERROR` excludes the population this workflow exists to surface. ```bash # Sample recent traces — the unit of inspection in open coding px trace list --limit 100 --format raw --no-progress | jq ' .[] | {trace_id: .traceId, root: .rootSpan.name, status, input: .rootSpan.attributes["input.value"], output: .rootSpan.attributes["output.value"]} ' # Trace-level context — all spans in one trace, ordered by start_time px trace get --format raw | jq ' .spans | sort_by(.start_time) | map({span_id: .context.span_id, name, status_code, input: .attributes["input.value"], output: .attributes["output.value"]}) ' # Drill to one span (px span get does not exist; filter via span list) px span list --trace-id --format raw --no-progress \ | jq '.[] | select(.context.span_id == "")' # Check existing notes on traces (default) or spans you are about to review # Notes are stored as annotations with name="note"; use --include-notes (not --include-annotations) px trace list --include-notes --limit 10 --format raw --no-progress | jq ' .[] | select((.notes // []) | length > 0) | {trace_id: .traceId, notes: [.notes[] | .result.explanation]} ' # Same shape on spans — swap px trace for px span and use .context.span_id ``` Always pipe through `jq` with `--format raw --no-progress` when scripting. ## Recording Notes Default write path is `px trace add-note --text "..."` — most observations are trace-shaped and shouldn't pre-commit to localization. Drop to `px span add-note ` when the failure is in-isolation wrong (exception, error response, malformed output) or you already know the failure structure on sight. ```bash # Trace-level note (default) px trace add-note --text "Asked about returns; final answer covered shipping policy instead" # Span-level note (mechanical or attributable-on-sight failures) px span add-note --text "Tool call returned 500 — vendor API unreachable" # Interactive loop — walk traces, write a trace-level note per failing trace px trace list --last-n-minutes 60 --limit 50 --format raw --no-progress \ | jq -r '.[].traceId' \ | while read tid; do echo "── trace $tid ──" px trace get "$tid" --format raw | jq ' {input: .rootSpan.attributes["input.value"], output: .rootSpan.attributes["output.value"], spans: (.spans | sort_by(.start_time) | map({name, status_code}))} ' read -p "Note for $tid (blank to skip): " note [ -z "$note" ] && continue px trace add-note "$tid" --text "$note" done ``` Bulk auto-tagging by status code (e.g. `px span list --status-code ERROR | xargs ... add-note "error"`) is **not open coding** — open coding is manual, observation-grounded, and ranges over all failure modes, not just spans where Python raised. Skip the bulk-by-status-code shortcut; it produces fewer, less informative notes than walking traces. **Fallback write paths (one-line asides):** - `POST /v1/trace_notes` and `POST /v1/span_notes` — accept one `{data: {trace_id|span_id, note}}` per request; use for scripted writes outside the CLI. - `@arizeai/phoenix-client` `addTraceNote` and `addSpanNote` wrap the same endpoints. - `px api graphql` rejects mutations with `"Only queries are permitted."` — use `px trace/span add-note` or the REST endpoints instead. ## What Makes a Good Note | Weak note | Why it's weak | Good note | Why it's strong | | -------------------- | ------------------------- | -------------------------------------------------------------------------- | ------------------------------------------- | | "Wrong answer" | No observable detail | "Said the store closes at 6pm but policy is 9pm" | Quotes observed vs. correct value | | "Bad tone" | Vague judgment | "Used first-name greeting for an enterprise support ticket" | Specifies the context mismatch | | "Hallucination" | Labels before observing | "Cited a product feature ('auto-renew') that does not exist in the schema" | Describes what was fabricated | | "Retrieval issue" | Category, not observation | "Retrieved docs about shipping when the question was about returns" | States what was retrieved vs. needed | | "Model confused" | Opaque | "Answered in Spanish when the user wrote in English" | Observable and reproducible | Write what you saw, not the category you think it belongs to — categorization happens in [axial coding](axial-coding.md). Short prefixes like `TONE:` or `FACTUAL:` are a personal shorthand, not a repo convention. ## Saturation Stop writing notes when observations stop being new. Signals: - **Repeats** — the last 10–15 traces produced notes that describe failures you've already seen. - **Paraphrase convergence** — you catch yourself writing minor variations of earlier notes. - **Skips outnumber notes** — most recent traces are correct and need no note. At saturation, move on to [axial coding](axial-coding.md) to group what you have. Continuing past saturation adds traces but not insight. You do not need to annotate every trace — annotating correct ones dilutes signal. ## Principles - **Free-form over structured** — do not pre-commit to a taxonomy during open coding; categories emerge in axial coding. - **Specific over general** — quote or paraphrase the observed failure; vague labels ("bad response") carry no signal. - **Context before labeling** — inspect input, output, and retrieved context before writing any note. - **Iterate before categorizing** — work through the full sample first; resist grouping while still collecting. - **Skip is valid** — a correct span needs no note; annotating everything dilutes signal.