
Eval-driven development

OpenCouch uses two complementary surfaces for eval-driven development:

  1. LangSmith for external trace-level observability, run search, and evaluation review.
  2. CLI inspection commands + live status for local visibility into execution, state, and memory.

The CLI intentionally keeps the normal chat loop lightweight now: the reply renders as soon as the response is ready, a live spinner shows node progress while post-response work finishes, and deeper inspection is available on demand through commands like /status, /context, /memory status, and /debug state.


How diagnostics flow

Each node writes its own keys into state["diagnostics"] via the _merge_dicts reducer. Nodes return only their own keys — the reducer handles merging automatically. No manual dict spreading.

crisis_gate                        load_memory
· crisis_gate_ms                   · load_memory_ms
· crisis_level                     · semantic_hits / episodic_hits
· classifier_path                  · retrieval_path
        │                                  │
        └────────────────┬─────────────────┘
                         ▼
                  finalize_turn
                         │
              ┌──────────┴──────────┐
              ▼                     ▼
      extract_facts          extract_procedural     ← parallel fan-out
      · extract_facts_ms     · extract_procedural_ms
      · semantic_writes      · procedural_writes
      · extract_facts_       · extract_procedural_
        reason                 reason
              └──────────┬──────────┘
                         ▼
             AgentOutput.diagnostics
             + turn_total_ms (stamped by runtime)

Parallel extractors, merged diagnostics

Both extractors write simultaneously after finalize. Because diagnostics uses a _merge_dicts reducer, their keys merge without racing — no node needs to know what other nodes wrote.
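A quick way to see why this cannot race — the key names below come from the diagram above, but the timing values are illustrative only:

```python
# Diagnostics written by each extractor on one turn (illustrative values).
facts = {"extract_facts_ms": 41.2, "semantic_writes": 2}
procedural = {"extract_procedural_ms": 38.7, "procedural_writes": 1}

# Because the two nodes write disjoint key sets, a shallow merge is
# commutative: the result is identical whichever extractor finishes first.
assert {**facts, **procedural} == {**procedural, **facts}
```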


Observability & evaluation

For text runs, LangSmith captures the LangGraph execution trace, including the top-level run plus child spans for graph nodes and subgraphs. In OpenCouch, LangSmith is the primary external surface for:

  • inspecting graph execution paths
  • filtering runs by thread and runtime metadata
  • reviewing failures from local eval harnesses
  • comparing behavior across prompt, model, or routing changes

OpenCouch also attaches runtime metadata such as thread_id, channel, memory_mode, streaming, and user_scope to text runs to make traces easier to search.
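In LangChain-style runnables, such metadata travels in the invocation config, where `metadata` keys become searchable fields on the LangSmith run. A minimal sketch of building that config — the `run_config` helper is hypothetical, but the field names mirror the ones listed above:

```python
def run_config(thread_id: str, channel: str, memory_mode: str,
               streaming: bool, user_scope: str) -> dict:
    # Build a RunnableConfig-shaped dict: "configurable" carries the
    # thread id for checkpointing; "metadata" keys are attached to the
    # LangSmith run so traces can be filtered on them.
    return {
        "configurable": {"thread_id": thread_id},
        "metadata": {
            "thread_id": thread_id,
            "channel": channel,
            "memory_mode": memory_mode,
            "streaming": streaming,
            "user_scope": user_scope,
        },
    }
```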

LangSmith complements the local CLI inspection surfaces below; it does not replace the project's deterministic eval runners or local debugging commands.


CLI surfaces

1. Assistant reply

╭──────────── Support Reply ─────────────╮
│ It sounds like something's on your │
│ mind. What's most present right now? │
╰────────────────────────────────────────╯

Green border for therapeutic, red for crisis.

2. Live execution status

run_turn_stream emits one StatusEvent per node via LangGraph's multi-mode streaming. The CLI renders a progress spinner while the graph is still running:

  ⠋ crisis_gate → load_memory → therapeutic → finalize
      → extract_facts + extract_procedural   ← parallel, order varies

The stream now also has a non-terminal response_ready event. That means the CLI can render the finished reply as soon as finalize_turn_node seals it, while the post-response memory tail continues in the background. The next user turn still waits for that tail before it is processed, so turn ordering and memory consistency stay intact.
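The consumer loop that behavior implies can be sketched in a few lines. This is a simplified stand-in, not the CLI's actual code: the event tuples and `fake_turn_stream` are assumptions standing in for `run_turn_stream`'s real event objects.

```python
import asyncio

async def fake_turn_stream():
    # Stand-in for run_turn_stream: per-node status events, a
    # non-terminal response_ready, then the post-response memory tail.
    for node in ("crisis_gate", "load_memory", "therapeutic", "finalize"):
        yield ("status", node)
    yield ("response_ready", "It sounds like something's on your mind.")
    for node in ("extract_facts", "extract_procedural"):
        yield ("status", node)

async def consume(stream):
    reply, nodes = None, []
    async for kind, payload in stream:
        if kind == "response_ready":
            reply = payload        # render the reply immediately
        else:
            nodes.append(payload)  # spinner keeps updating through the tail
    # Loop exit means the memory tail finished; the next turn may proceed.
    return reply, nodes

reply, nodes = asyncio.run(consume(fake_turn_stream()))
```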

3. On-demand inspection commands

Command          What it shows
/status          Thread id, mode, turn count, response tier, and active response LLM
/history [n]     Recent transcript with mode column per assistant turn
/context         Structured session context snapshot, including working memory and procedural rules
/memory status   Owner-scoped semantic / episodic / procedural counts, recall toggle, and store totals
/debug state     Raw graph state as pretty-printed JSON

The old auto-rendered Turn Diagnostics, Stage Timings, and Session Context panels are no longer part of the default chat loop. Their underlying diagnostics still exist in graph state and traces; the CLI just no longer prints those panels automatically after every turn.


Live streaming

Stage labels are mapped from internal node names:

Node name                       CLI label
crisis_gate_node                crisis_gate
load_memory_node                load_memory
therapeutic_subgraph            therapeutic
finalize_turn_node              finalize
extract_semantic_facts_node     extract_facts
extract_procedural_rules_node   extract_procedural

Unknown nodes fall through to their raw name so future additions render without a mapping update.
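The mapping plus fall-through can be sketched as a dict lookup with a default (the `NODE_LABELS` and `stage_label` names are assumptions; the pairs come from the table above):

```python
NODE_LABELS = {
    "crisis_gate_node": "crisis_gate",
    "load_memory_node": "load_memory",
    "therapeutic_subgraph": "therapeutic",
    "finalize_turn_node": "finalize",
    "extract_semantic_facts_node": "extract_facts",
    "extract_procedural_rules_node": "extract_procedural",
}

def stage_label(node_name: str) -> str:
    # Unknown nodes fall through to their raw name, so future additions
    # render without a mapping update.
    return NODE_LABELS.get(node_name, node_name)
```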


Diagnostics keys reference

Key                            Node                 Value
crisis_gate_ms                 crisis_gate          Assessment wall-clock time
crisis_classifier_path         crisis_gate          Which branch decided the result (override, deterministic path, or LLM review/fallback)
crisis_level                   crisis_gate          Normalized level (0–3)
load_memory_ms                 load_memory          Retrieval wall-clock time
semantic_hits                  load_memory          Semantic entries retrieved
semantic_store_size            load_memory          Total semantic records in store
episodic_hits                  load_memory          Episodic entries retrieved
episodic_store_size            load_memory          Total episodic records in store
procedural_count               load_memory          Rules loaded from profile
proactive_recall               load_memory          Recall toggle state
retrieval_path                 load_memory          hybrid_rrf / token_recall / token_recall_after_embed_error
extract_facts_ms               extract_facts        Extraction wall-clock time
semantic_writes                extract_facts        Immediate semantic writes that actually committed on this turn
semantic_session_end_holds     extract_facts        Semantic candidates held for session-end review
semantic_repeat_required       extract_facts        Semantic candidates blocked pending stronger repetition evidence
semantic_policy_drops          extract_facts        Semantic candidates dropped by deterministic write policy
semantic_bumps                 extract_facts        Existing facts bumped (dedup match)
extract_facts_reason           extract_facts        Skip reason or extraction outcome
extract_procedural_ms          extract_procedural   Extraction wall-clock time
procedural_writes              extract_procedural   Immediate procedural rules written
procedural_session_end_holds   extract_procedural   Procedural candidates buffered for session-end promotion
procedural_policy_drops        extract_procedural   Procedural candidates dropped by deterministic write policy
extract_procedural_reason      extract_procedural   Skip reason or extraction outcome
turn_total_ms                  runtime              Total turn wall-clock (stamped outside the graph)
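Because turn_total_ms is stamped outside the graph, it covers the whole turn, including runtime overhead around graph execution. A minimal sketch of that stamping — `run_turn` and the dict-shaped output are assumptions, not the project's actual runtime code:

```python
import time

def run_turn(invoke, payload):
    # The runtime, not a graph node, stamps turn_total_ms, so the figure
    # includes work done outside the graph itself.
    start = time.monotonic()
    output = invoke(payload)  # assumed to return a dict-like AgentOutput
    output.setdefault("diagnostics", {})["turn_total_ms"] = round(
        (time.monotonic() - start) * 1000, 2
    )
    return output
```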

Adding diagnostics to a new node

import time

def my_node(state: dict) -> dict:
    start = time.monotonic()

    # ... node logic (producing write_count) ...

    return {
        "diagnostics": {
            "my_node_ms": round((time.monotonic() - start) * 1000, 2),
            "my_writes": write_count,
        }
    }
No spreading needed

The diagnostics field uses a _merge_dicts reducer — return only your own keys and the reducer handles merging with other nodes' diagnostics automatically. Never spread **state.get("diagnostics", {}) into your return value: returning the already-merged dict defeats the reducer and can clobber concurrent writes.