
Trace-driven development

OpenCouch uses two complementary surfaces for trace-driven development:

  1. Opik for primary external trace-level observability and run search.
  2. CLI inspection commands + live status for local visibility into execution, state, and memory.

The CLI intentionally keeps the normal chat loop lightweight: the reply renders as soon as the response is ready, a live spinner shows node progress while post-response work finishes, and deeper inspection is available on demand through commands such as /status, /context, /memory status, and /debug state.

Voice has a different observability shape because OpenAI Realtime owns the live speech loop. The backend still records app-owned tools, transcript persistence, inferred turn metadata, and finalization state, but there is no per-turn text-runtime graph trace for a spoken exchange.


How diagnostics flow

Each node writes its own keys into state["diagnostics"] via the _merge_dicts reducer. Nodes return only their own keys — the reducer handles merging automatically. No manual dict spreading.
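The merge behaviour described above can be sketched as follows. The body of merge_dicts here is an assumption for illustration (a shallow merge where the newer update wins); OpenCouch's actual _merge_dicts may differ, for example in how it treats nested values. The sample values are made up.

```python
from typing import Any, Optional


def merge_dicts(
    left: Optional[dict[str, Any]], right: Optional[dict[str, Any]]
) -> dict[str, Any]:
    """Hypothetical stand-in for _merge_dicts: shallow merge, right side wins."""
    merged = dict(left or {})
    merged.update(right or {})
    return merged


# Each node returns only its own keys (values here are illustrative)...
crisis_update = {"crisis_gate_ms": 12.4, "crisis_level": 0}
memory_update = {"load_memory_ms": 48.1, "semantic_hits": 3}

# ...and the reducer folds the updates into one diagnostics dict.
diagnostics = merge_dicts(merge_dicts({}, crisis_update), memory_update)
print(sorted(diagnostics))
```

Because the reducer owns the merge, a node never needs to read the existing diagnostics dict before writing to it.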

crisis_gate                      load_memory
  · crisis_gate_ms                 · load_memory_ms
  · crisis_level                   · semantic_hits / episodic_hits
  · crisis_classifier_path         · retrieval_path
        │                                │
        └────────────────┬───────────────┘
                         ▼
                   finalize_turn
                         ▼
                   response ready
                         ▼
              AgentOutput.diagnostics
              + turn_total_ms (stamped by runtime)

Runtime diagnostics

Runtime stages and side-effect services use the same structured diagnostics channel, so turn-level timings and retrieval counters land in one AgentOutput.
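As a rough illustration of how a turn-level total can land next to the node-level keys, the sketch below times a whole turn from outside the graph and stamps turn_total_ms afterwards. The names run_with_total and fake_graph are hypothetical and are not part of OpenCouch.

```python
import time
from typing import Any, Callable

State = dict[str, Any]


def run_with_total(graph_fn: Callable[[State], State], state: State) -> State:
    """Hypothetical runtime wrapper: time the full turn, then stamp the total."""
    start = time.monotonic()
    result = graph_fn(state)

    output = dict(result)
    diagnostics = dict(result.get("diagnostics", {}))
    # Stamped outside the graph, after every node has reported its own keys.
    diagnostics["turn_total_ms"] = round((time.monotonic() - start) * 1000, 2)
    output["diagnostics"] = diagnostics
    return output


def fake_graph(state: State) -> State:
    # Stand-in for the real graph; it only returns node-level diagnostics.
    return {"reply": "ok", "diagnostics": {"crisis_gate_ms": 1.0}}


output = run_with_total(fake_graph, {})
print(sorted(output["diagnostics"]))
```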


Observability

For text runs, Opik captures the runtime execution trace, including the top-level run plus child spans for runtime stages and SDK calls. In OpenCouch, Opik is the primary external surface for:

  • inspecting runtime execution paths
  • filtering runs by thread and runtime metadata
  • reviewing failures from tests and manual trace runs
  • comparing behavior across prompt, model, or routing changes

OpenCouch also attaches runtime metadata such as thread_id, channel, memory_mode, streaming, and user_scope to text runs to make traces easier to search.
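A minimal sketch of collecting those fields into one metadata mapping is shown below. The helper name build_run_metadata and the sample values are assumptions for illustration; how OpenCouch actually hands the mapping to Opik is not shown here.

```python
from typing import Any


def build_run_metadata(
    thread_id: str,
    channel: str,
    memory_mode: str,
    streaming: bool,
    user_scope: str,
) -> dict[str, Any]:
    """Hypothetical helper: gather the searchable fields for one text run."""
    return {
        "thread_id": thread_id,
        "channel": channel,
        "memory_mode": memory_mode,
        "streaming": streaming,
        "user_scope": user_scope,
    }


# Sample values are illustrative only.
metadata = build_run_metadata("thread-123", "cli", "persistent", True, "owner")
print(metadata["thread_id"])
```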

Opik complements the local CLI inspection surfaces below; it does not replace backend tests or local debugging commands.

Enable Opik by setting:

OPIK_API_KEY=...
OPIK_WORKSPACE=...
OPIK_PROJECT_NAME=opencouch-dev

OPIK_PROJECT_NAME is optional; runs are grouped under the default Opik project when it is unset.
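The sketch below shows one way a process could decide whether Opik is configured from these variables. The function opik_settings is hypothetical, and it assumes that both OPIK_API_KEY and OPIK_WORKSPACE must be present; it does not call the Opik SDK.

```python
import os
from typing import Mapping, Optional


def opik_settings(
    env: Mapping[str, str] = os.environ,
) -> Optional[dict[str, Optional[str]]]:
    """Return Opik settings, or None when a required variable is missing.

    Assumption: OPIK_API_KEY and OPIK_WORKSPACE are both required.
    """
    api_key = env.get("OPIK_API_KEY")
    workspace = env.get("OPIK_WORKSPACE")
    if not api_key or not workspace:
        return None
    return {
        "api_key": api_key,
        "workspace": workspace,
        # None means "use the default Opik project".
        "project_name": env.get("OPIK_PROJECT_NAME") or None,
    }


settings = opik_settings({"OPIK_API_KEY": "key", "OPIK_WORKSPACE": "team"})
print(settings["project_name"])
```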

Voice observability

Realtime voice debugging combines browser events and backend state:

| Surface | What it shows |
| --- | --- |
| /voice | Product-level connection, transcript, tool activity, error, and finalization status. |
| /voice/realtime-dev | Raw Realtime server events, parsed transcript updates, tool calls, and end-session response. |
| /api/voice/realtime/tools | Backend execution result for one Realtime function call. |
| /api/voice/realtime/turn | Whether the finalized voice turn was recorded and the resulting message count. Inferred route/style metadata is written into runtime state. |
| /api/voice/realtime/end | Whether persistent session finalization produced a summary. |

Recorded voice turns stamp diagnostics.voice_runtime=openai_realtime and diagnostics.voice_tool_calls=[...] in runtime state. Grounded lookup tool output also merges into state.grounded_lookup when present.

Use Opik for text-runtime traces. Use the Realtime dogfood route and voice API responses when debugging audio, tool-call, or finalization issues.


Safety audit ledger

Distinct from diagnostics and tracing, the safety audit ledger (agent/audit/) is a durable, operator-facing record of crisis-response behavior. It is deliberately not therapeutic memory, prompt context, or a general observability bucket — audit rows are never loaded into working_memory or used by normal response generation. The separation is the point: safety records can be reviewed after the fact without leaking back into the assistant's replies.

The crisis path writes in one direction only. After the crisis-response branch completes, write_crisis_log builds a single CrisisLogRecord and the configured CrisisLogBackend appends it. Records answer operator questions — did the classifier fire, at what level, through which classifier path; did resource lookup run, find resources, or fall back; did the runtime use the SDK, the SDK tool-fallback, or a response-LLM override — without storing raw user text. Only classification labels, classifier provenance, and structural metadata are kept.
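The one-way write can be sketched roughly as below. The record fields, the append method name, and the sample values are assumptions for illustration; the real definitions live in agent/audit/ and may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol


@dataclass(frozen=True)
class CrisisLogRecord:
    # Illustrative fields only: labels and structural metadata, no raw user text.
    created_at: datetime
    crisis_level: int
    classifier_path: str
    resource_lookup_status: str
    response_path: str


class CrisisLogBackend(Protocol):
    # The method name "append" is an assumption.
    def append(self, record: CrisisLogRecord) -> None: ...


class InMemoryCrisisLog:
    """Append-only in-memory backend, useful for tests."""

    def __init__(self) -> None:
        self.records: list[CrisisLogRecord] = []

    def append(self, record: CrisisLogRecord) -> None:
        self.records.append(record)


def write_crisis_log(
    backend: CrisisLogBackend,
    *,
    crisis_level: int,
    classifier_path: str,
    resource_lookup_status: str,
    response_path: str,
) -> CrisisLogRecord:
    """Build a single record and append it; nothing is read back."""
    record = CrisisLogRecord(
        created_at=datetime.now(timezone.utc),
        crisis_level=crisis_level,
        classifier_path=classifier_path,
        resource_lookup_status=resource_lookup_status,
        response_path=response_path,
    )
    backend.append(record)
    return record


log = InMemoryCrisisLog()
write_crisis_log(
    log,
    crisis_level=2,
    classifier_path="llm_primary",
    resource_lookup_status="found",
    response_path="sdk",  # illustrative label
)
print(len(log.records))
```

Keeping the record frozen and the backend append-only mirrors the stated intent: the ledger is written once per crisis branch and is never fed back into response generation.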

| File | Purpose |
| --- | --- |
| agent/audit/models.py | CrisisLogRecord, classifier-path enums, and aggregate/summary models |
| agent/audit/crisis_log.py | CrisisLogBackend protocol + in-memory / null backends; write_crisis_log helper |
| agent/audit/postgres_crisis_log.py | Primary durable Postgres backend |
| agent/audit/sqlite_crisis_log.py | SQLite fallback backend |
| agent/audit/summary.py | Daily safety-summary aggregation over stored records |

Retention is operator-driven (see Memory privacy) — backends expose a purge-before-cutoff path, and the TUI adds a manual /memory purge-crisis [days] command. No automatic expiry ships.
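A purge-before-cutoff path might look like the sketch below, which works on plain dicts rather than the real backends. The function name purge_before and the row shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def purge_before(rows: list[dict], cutoff: datetime) -> tuple[list[dict], int]:
    """Keep rows created at or after cutoff; return (kept_rows, purged_count)."""
    kept = [row for row in rows if row["created_at"] >= cutoff]
    return kept, len(rows) - len(kept)


# Fixed "now" so the example is deterministic.
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
rows = [
    {"id": 1, "created_at": now - timedelta(days=45)},
    {"id": 2, "created_at": now - timedelta(days=5)},
]

# Comparable in spirit to a manual purge with a 30-day window.
kept, purged = purge_before(rows, cutoff=now - timedelta(days=30))
print(purged, [row["id"] for row in kept])
```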


CLI surfaces

1. Assistant reply

╭──────────── Support Reply ─────────────╮
│ It sounds like something's on your │
│ mind. What's most present right now? │
╰────────────────────────────────────────╯

Green border for therapeutic, red for crisis.

2. Live execution status

run_turn_stream emits one StatusEvent per pipeline stage as the turn executes. The CLI renders a progress spinner while the runtime is still working:

  ⠋ crisis_gate → load_memory → therapeutic → finalize

The stream also emits a non-terminal response_ready event, so the CLI can render the finished reply as soon as turn finalization seals it while post-response stages continue to run.
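The sketch below illustrates how a client could consume such a stream. The StatusEvent fields, the fake_turn_stream generator, and the event ordering are assumptions for illustration; they stand in for run_turn_stream and are not its actual shape.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StatusEvent:
    # Field names are assumptions for illustration.
    stage: str
    kind: str = "stage"  # "stage" or "response_ready"
    reply: Optional[str] = None


def fake_turn_stream() -> Iterator[StatusEvent]:
    """Stand-in for run_turn_stream: stage events plus one response_ready."""
    yield StatusEvent("crisis_gate")
    yield StatusEvent("load_memory")
    yield StatusEvent("therapeutic")
    yield StatusEvent("finalize")
    # Non-terminal: the reply is sealed, but the stream keeps going.
    yield StatusEvent("finalize", kind="response_ready", reply="What's on your mind?")
    yield StatusEvent("runtime_extraction")


def consume(events: Iterable[StatusEvent]) -> list[str]:
    """Render the reply when it is ready; otherwise update the spinner text."""
    rendered = []
    for event in events:
        if event.kind == "response_ready":
            rendered.append(f"reply: {event.reply}")
        else:
            rendered.append(f"stage: {event.stage}")
    return rendered


lines = consume(fake_turn_stream())
print("\n".join(lines))
```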

3. On-demand inspection commands

| Command | What it shows |
| --- | --- |
| /status | Thread id, mode, turn count, response tier, and active response LLM |
| /history [n] | Recent transcript with mode column per assistant turn |
| /context | Structured session context snapshot, including working memory and procedural rules |
| /memory status | Owner-scoped semantic / episodic / procedural counts, recall toggle, and store totals |
| /debug state | Raw runtime state as pretty-printed JSON |

The old auto-rendered Turn Diagnostics, Stage Timings, and Session Context panels are no longer part of the default chat loop. Their underlying diagnostics still exist in runtime state and traces; the CLI just no longer prints those panels automatically after every turn.


Live streaming

Stage labels are defined in agent/models.py as STAGE_LABELS and shared between the CLI and WebSocket API so all clients display consistent text:

| Internal stage | Friendly label |
| --- | --- |
| crisis_gate | safety check |
| turn_dispatch | routing turn |
| memory_control | updating memory |
| grounded_lookup | looking up factual answer |
| crisis_resource_lookup | looking up crisis resources |
| crisis_response | generating crisis reply |
| crisis_log | writing crisis log |
| load_memory | loading memory |
| memory_profile_load | loading profile memory |
| memory_graph_load | querying graph memory |
| memory_profile_save | saving profile memory |
| memory_graph_save | writing graph memory |
| therapeutic | generating therapeutic reply |
| runtime_extraction | extracting facts and style rules after response finalization |
| finalize | finalizing turn |
| session_stage | reading context |
| response_generation | generating |

Unknown stages fall through to their raw name so future additions render without a mapping update.
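The fall-through can be sketched with a dictionary lookup that defaults to the raw stage name. The STAGE_LABELS excerpt below copies three rows from the table above; the helper name friendly_label is hypothetical.

```python
# Excerpt only; the full mapping is defined in agent/models.py.
STAGE_LABELS = {
    "crisis_gate": "safety check",
    "load_memory": "loading memory",
    "finalize": "finalizing turn",
}


def friendly_label(stage: str) -> str:
    """Return the friendly label, or the raw stage name when unmapped."""
    return STAGE_LABELS.get(stage, stage)


print(friendly_label("crisis_gate"))
print(friendly_label("some_future_stage"))
```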


Diagnostics keys reference

| Key | Node | Value |
| --- | --- | --- |
| crisis_gate_ms | crisis_gate | Assessment wall-clock time |
| crisis_classifier_path | crisis_gate | llm_primary |
| crisis_level | crisis_gate | Normalized level (0–3) |
| resource_lookup_status | crisis_resource_lookup | found / no_location / location_refused / no_verified_results / not_attempted |
| memory_control.action | turn_dispatch | Detected command kind (or empty when none) |
| grounded_lookup_ms | grounded_lookup | Grounded lookup wall-clock time |
| grounded_lookup.status | grounded_lookup | answered / no_verified_answer / not_attempted |
| load_memory_ms | load_memory | Retrieval wall-clock time |
| semantic_hits | load_memory | Semantic entries retrieved |
| semantic_store_size | load_memory | Total semantic records in store |
| episodic_hits | load_memory | Episodic entries retrieved |
| episodic_store_size | load_memory | Total episodic records in store |
| procedural_count | load_memory | Rules loaded from profile |
| proactive_recall | load_memory | Recall toggle state |
| retrieval_path | load_memory | hybrid_rrf / token_recall / token_recall_after_embed_error |
| turn_total_ms | runtime | Total turn wall-clock (stamped outside the graph) |

Adding diagnostics to a new node

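    import time


    def my_node(state: dict) -> dict:
        start = time.monotonic()

        # ... node logic ...
        write_count = 0  # placeholder: replace with the node's real write count

        # Return only this node's keys; the reducer merges them into state.
        return {
            "diagnostics": {
                "my_node_ms": round((time.monotonic() - start) * 1000, 2),
                "my_writes": write_count,
            }
        }

No spreading needed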

The diagnostics field uses a _merge_dicts reducer: return only your own keys and the reducer merges them with other nodes' diagnostics automatically. Never spread the existing dict into your return value with **state.get("diagnostics", {}).