Trace-driven development
OpenCouch uses two complementary surfaces for trace-driven development:
- Opik for primary external trace-level observability and run search.
- CLI inspection commands + live status for local visibility into execution, state, and memory.
The CLI intentionally keeps the normal chat loop lightweight:
the reply renders as soon as the response is ready, a live spinner
shows node progress while post-response work finishes, and deeper
inspection is available on demand through commands like /status,
/context, /memory status, and /debug state.
Voice has a different observability shape because OpenAI Realtime owns the live speech loop. The backend still records app-owned tools, transcript persistence, inferred turn metadata, and finalization state, but there is no per-turn text-runtime graph trace for a spoken exchange.
How diagnostics flow
Each node writes its own keys into state["diagnostics"] via the
_merge_dicts reducer. Nodes return only their own keys; the
reducer merges them automatically, so no manual dict spreading is needed.
crisis_gate                  load_memory
· crisis_gate_ms             · load_memory_ms
· crisis_level               · semantic_hits / episodic_hits
· crisis_classifier_path     · retrieval_path
      │                            │
      └──────────┬─────────────────┘
                 ▼
           finalize_turn
                 │
           response ready
                 │
      AgentOutput.diagnostics
      + turn_total_ms (stamped by runtime)
Runtime stages and side-effect services use the same structured diagnostics channel, so
turn-level timings and retrieval counters land in one AgentOutput.
Observability
For text runs, Opik captures the runtime execution trace, including the top-level run plus child spans for runtime stages and SDK calls. In OpenCouch, Opik is the primary external surface for:
- inspecting runtime execution paths
- filtering runs by thread and runtime metadata
- reviewing failures from tests and manual trace runs
- comparing behavior across prompt, model, or routing changes
OpenCouch also attaches runtime metadata such as thread_id, channel, memory_mode, streaming, and user_scope to text runs to make traces easier to search.
Opik complements the local CLI inspection surfaces below; it does not replace backend tests or local debugging commands.
Enable Opik by setting:
OPIK_API_KEY=...
OPIK_WORKSPACE=...
OPIK_PROJECT_NAME=opencouch-dev
OPIK_PROJECT_NAME is optional; runs are grouped under the default
Opik project when it is unset.
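
As a rough sketch of how these variables might be read at startup, the helper below resolves them from an environment mapping. The function name resolve_opik_settings is hypothetical, and treating both OPIK_API_KEY and OPIK_WORKSPACE as required is an assumption; it does not call the Opik SDK.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_opik_settings(env: Mapping[str, str] = os.environ) -> Optional[Dict[str, Optional[str]]]:
    """Return Opik settings from the environment, or None when tracing is not configured."""
    api_key = env.get("OPIK_API_KEY")
    workspace = env.get("OPIK_WORKSPACE")
    # Assumption: tracing stays off unless both the key and the workspace are set.
    if not api_key or not workspace:
        return None
    return {
        "api_key": api_key,
        "workspace": workspace,
        # None means "use the default Opik project".
        "project_name": env.get("OPIK_PROJECT_NAME") or None,
    }


settings = resolve_opik_settings({"OPIK_API_KEY": "k", "OPIK_WORKSPACE": "w"})
```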
Voice observability
Realtime voice debugging combines browser events and backend state:
| Surface | What it shows |
|---|---|
| /voice | Product-level connection, transcript, tool activity, error, and finalization status. |
| /voice/realtime-dev | Raw Realtime server events, parsed transcript updates, tool calls, and end-session response. |
| /api/voice/realtime/tools | Backend execution result for one Realtime function call. |
| /api/voice/realtime/turn | Whether the finalized voice turn was recorded and the resulting message count. Inferred route/style metadata is written into runtime state. |
| /api/voice/realtime/end | Whether persistent session finalization produced a summary. |
Recorded voice turns stamp diagnostics.voice_runtime=openai_realtime
and diagnostics.voice_tool_calls=[...] in runtime state. Grounded
lookup tool output also merges into state.grounded_lookup when present.
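
The stamping described above might look roughly like the following. This is a hedged sketch: the helper name stamp_voice_turn, the shape of each voice_tool_calls entry (plain tool names here), and the shallow merge into grounded_lookup are assumptions. Only the two diagnostics keys and the grounded_lookup field come from the text.

```python
from typing import Any, Dict, List, Optional

State = Dict[str, Any]


def stamp_voice_turn(
    state: State,
    tool_calls: List[str],
    grounded_lookup: Optional[Dict[str, Any]] = None,
) -> State:
    """Return a new state with voice diagnostics stamped for one recorded turn."""
    diagnostics = dict(state.get("diagnostics", {}))
    diagnostics["voice_runtime"] = "openai_realtime"
    diagnostics["voice_tool_calls"] = list(tool_calls)
    new_state = {**state, "diagnostics": diagnostics}
    # Grounded lookup output is merged only when the tool produced some.
    if grounded_lookup:
        merged = dict(state.get("grounded_lookup", {}))
        merged.update(grounded_lookup)
        new_state["grounded_lookup"] = merged
    return new_state


before = {"diagnostics": {"turn_total_ms": 900}}
after = stamp_voice_turn(before, ["lookup_resources"], {"source": "example"})
```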
Use Opik for text-runtime traces. Use the Realtime dogfood route and voice API responses when debugging audio, tool-call, or finalization issues.
Safety audit ledger
Distinct from diagnostics and tracing, the safety audit ledger
(agent/audit/) is a durable, operator-facing record of crisis-response
behavior. It is deliberately not therapeutic memory, prompt context, or a
general observability bucket — audit rows are never loaded into
working_memory or used by normal response generation. The separation is the
point: safety records can be reviewed after the fact without leaking back into
the assistant's replies.
The crisis path writes in one direction only. After the crisis-response branch
completes, write_crisis_log builds a single CrisisLogRecord and the
configured CrisisLogBackend appends it. Records answer operator questions —
did the classifier fire, at what level, through which classifier path; did
resource lookup run, find resources, or fall back; did the runtime use the SDK,
the SDK tool-fallback, or a response-LLM override — without storing raw user
text. Only classification labels, classifier provenance, and structural
metadata are kept.
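
A minimal sketch of this one-way write path, assuming an append-only backend interface. The names CrisisLogRecord, CrisisLogBackend, and write_crisis_log come from the text; every field name, the append method, the keyword-only signature, and the InMemoryCrisisLog class are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Protocol


@dataclass(frozen=True)
class CrisisLogRecord:
    # Hypothetical fields: labels and structural metadata only, never raw user text.
    thread_id: str
    crisis_level: str
    classifier_path: str
    resource_lookup: str  # e.g. "found", "fallback", "skipped"
    response_path: str    # e.g. "sdk", "sdk_tool_fallback", "llm_override"
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class CrisisLogBackend(Protocol):
    def append(self, record: CrisisLogRecord) -> None: ...


class InMemoryCrisisLog:
    """Append-only store; nothing here reads records back into prompts or memory."""

    def __init__(self) -> None:
        self.records: List[CrisisLogRecord] = []

    def append(self, record: CrisisLogRecord) -> None:
        self.records.append(record)


def write_crisis_log(backend: CrisisLogBackend, **labels: str) -> CrisisLogRecord:
    """Build one record from classification labels and append it to the backend."""
    record = CrisisLogRecord(**labels)
    backend.append(record)
    return record


backend = InMemoryCrisisLog()
record = write_crisis_log(
    backend,
    thread_id="t-1",
    crisis_level="high",
    classifier_path="llm",
    resource_lookup="fallback",
    response_path="sdk",
)
```

Keeping the record frozen and the backend append-only makes it harder for later code to mutate an audit row or route it back into response generation.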
| File | Purpose |
|---|---|
| agent/audit/models.py | CrisisLogRecord, classifier-path enums, and aggregate/summary models |
| agent/audit/crisis_log.py | CrisisLogBackend protocol + in-memory / null backends; write_crisis_log helper |
| agent/audit/postgres_crisis_log.py | Primary durable Postgres backend |
| agent/audit/sqlite_crisis_log.py | SQLite fallback backend |
| agent/audit/summary.py | Daily safety-summary aggregation over stored records |
Retention is operator-driven (see Memory privacy): backends expose a
purge-before-cutoff path, and the TUI adds a manual /memory purge-crisis [days]
command. No automatic expiry ships.
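
The purge-before-cutoff idea can be illustrated as below. This is a sketch under assumptions, not the backends' actual code: the function names, the (timestamp, label) tuple standing in for a stored record, and the conversion from a day count to a cutoff are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple

Row = Tuple[datetime, str]  # (created_at, label); stand-in for a stored audit row


def cutoff_for_days(days: int, now: Optional[datetime] = None) -> datetime:
    """Translate a retention window in days into an absolute cutoff timestamp."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days)


def purge_before(rows: List[Row], cutoff: datetime) -> Tuple[List[Row], int]:
    """Drop rows older than the cutoff; return the kept rows and the purge count."""
    kept = [row for row in rows if row[0] >= cutoff]
    return kept, len(rows) - len(kept)


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
rows = [
    (datetime(2024, 1, 1, tzinfo=timezone.utc), "high"),
    (datetime(2024, 6, 29, tzinfo=timezone.utc), "low"),
]
kept, purged = purge_before(rows, cutoff_for_days(30, now=now))
```

Nothing calls these helpers on a schedule, which matches the operator-driven retention model: rows persist until someone runs a purge.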
CLI surfaces
1. Assistant Reply
╭──────────── Support Reply ─────────────╮