
Voice (Experimental)

Experimental speech preview

The current voice path is an experimental OpenAI Realtime integration. It is intentionally narrower than the main text agent.

  • It is speech-only and non-agentic.
  • It does not expose tool calling, web search, or autonomous actions.
  • It does not currently run the full LangGraph therapeutic stack.
  • It now writes semantic, procedural, and episodic memory on disconnect via the shared session-end seam, but it still does not run the full text graph live on every spoken turn.

OpenCouch currently ships two voice clients that share the same FastAPI websocket bridge:

  • a standalone test harness at /api/voice/test
  • the Next.js /voice page in the web app
[Flow diagram] Browser (mic + speaker) → FastAPI voice bridge → OpenAI Realtime (STT + LLM + TTS). Supporting mechanisms: prompt preload (bounded memory context on connect), server VAD + truncate (server interruption with client sync), and local ducking (browser-side preemptive playback mute).

Current architecture

| Layer | Current implementation |
| --- | --- |
| Browser | Streams PCM16 microphone audio over WebSocket and plays assistant PCM audio locally |
| Backend bridge | FastAPI websocket at /api/voice/session that forwards audio and interruption events |
| Realtime model | gpt-realtime for speech-to-speech output |
| Input transcription | gpt-4o-mini-transcribe as an asynchronous input transcript stream |
| Default voice | cedar |
| Audio format | 24 kHz mono PCM16 for both input and output |
| Turn detection | server_vad with threshold: 0.3, prefix_padding_ms: 300, silence_duration_ms: 300, interrupt_response: true, create_response: true |
| Noise reduction | near_field |
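
The settings in the table above can be gathered into a single Realtime session configuration payload. The sketch below is illustrative: the field names mirror the documented values, but the exact payload shape depends on the OpenAI Realtime API version in use and is an assumption here, not a verified schema.

```python
def build_session_config(voice: str = "cedar") -> dict:
    """Hypothetical sketch of the Realtime session config implied by the table above."""
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime",
            "voice": voice,
            "input_audio_format": "pcm16",   # 24 kHz mono PCM16 in both directions
            "output_audio_format": "pcm16",
            "input_audio_transcription": {"model": "gpt-4o-mini-transcribe"},
            "input_audio_noise_reduction": {"type": "near_field"},
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.3,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 300,
                "interrupt_response": True,
                "create_response": True,
            },
        },
    }
```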

Prompt assembly

The voice session builds one bounded system prompt at connection time. It does not rebuild the prompt every turn.

| Source | What it provides |
| --- | --- |
| Base spoken policy | Short spoken replies, plain language, no markdown, no clinician framing, explicit crisis redirection |
| Procedural rules | User preferences and style rules |
| Semantic facts | Previously noted user facts |
| Episodic arcs | Short summaries from prior sessions |

The current voice prompt is deliberately small:

  • whitespace is normalized
  • each memory item is trimmed to 220 characters
  • up to 6 procedural rules are included
  • up to 6 semantic facts are included
  • up to 3 episodic arcs are included
  • the final prompt is capped at 12,000 characters

This prompt shape is implemented in apps/backend/voice/realtime.py via build_voice_system_prompt().
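
The bounding rules above can be sketched as a simplified reimplementation. The real logic lives in build_voice_system_prompt() in apps/backend/voice/realtime.py; the helper below is illustrative only, showing the documented limits rather than the actual code.

```python
def _trim(text: str, limit: int = 220) -> str:
    """Normalize whitespace and clamp a memory item to `limit` characters."""
    return " ".join(text.split())[:limit]

def build_voice_system_prompt(base_policy, procedural, semantic, episodic,
                              max_chars: int = 12_000) -> str:
    """Illustrative sketch of the documented prompt bounds, not the real code."""
    parts = [" ".join(base_policy.split())]     # whitespace-normalized base policy
    parts += [_trim(r) for r in procedural[:6]]  # up to 6 procedural rules
    parts += [_trim(f) for f in semantic[:6]]    # up to 6 semantic facts
    parts += [_trim(a) for a in episodic[:3]]    # up to 3 episodic arcs
    return "\n".join(parts)[:max_chars]          # hard cap at 12,000 characters
```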


What the browser does

Both the standalone test page and the Next.js /voice page follow the same transport pattern:

  • microphone audio is sent in 512-sample chunks
  • assistant playback is tracked by Realtime item_id and content_index
  • on interruption, the client stops playback and reports a conversation.item.truncate position back to the backend
  • local ducking lowers assistant playback immediately when the browser detects the user has started speaking, before the server-side interruption arrives

The Next.js voice page intentionally keeps transcript rendering disabled for now while the audio path is being stabilized. The standalone test harness still shows transcript events.
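
The truncate position the client reports back can be derived from how many PCM16 samples were actually played. A minimal sketch of the arithmetic, assuming the documented 24 kHz sample rate (the function name is illustrative, not taken from the codebase):

```python
SAMPLE_RATE_HZ = 24_000  # documented audio format: 24 kHz mono PCM16
CHUNK_SAMPLES = 512      # documented microphone chunk size (~21 ms per chunk)

def audio_end_ms(samples_played: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Milliseconds of assistant audio played, as reported via conversation.item.truncate."""
    return samples_played * 1000 // sample_rate_hz
```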


WebSocket contract

Client → backend

| Message | Purpose |
| --- | --- |
| start | Open a voice session for a user_id, thread_id, and optional voice |
| audio | Send base64-encoded PCM16 microphone bytes |
| truncate | Tell the backend how much of the assistant audio was actually played |

Backend → client

| Message | Purpose |
| --- | --- |
| ready | Session is configured and ready for audio |
| audio | Assistant PCM audio chunk |
| transcript | Optional user or assistant transcript event |
| interrupted | Server detected user speech while assistant audio was active |
| truncated | Server acknowledged truncation |
| error | Surfaced Realtime or bridge error |

Current limitations

| Limitation | Current state |
| --- | --- |
| Agentic capability | No tool calling, no web search, no autonomous actions |
| LangGraph integration | Voice does not currently route through the main text agent graph |
| Crisis handling | The prompt contains spoken crisis guidance, but the experimental voice path does not currently run the full graph-level crisis gate used by text mode |
| Memory writes | Voice reads memory at connect time, then on disconnect replays the transcript through semantic/procedural extraction and the shared session-end summarizer |
| Prompt refresh | The system prompt is not refreshed mid-session |
| Transcript UX | Transcript display is disabled in the Next.js voice page while the audio transport is being stabilized |
| Interruption heuristics | VAD and local ducking are intentionally aggressive and can false-trigger in noisy environments or with speaker bleed |

Running locally

```bash
# Start the voice server and open a browser
uv run python -m opencouch_cli --voice --port 8000

# Custom port
uv run python -m opencouch_cli --voice --port 9000
```

Or start the server directly:

```bash
uv run uvicorn main:app --port 8000
# Open http://localhost:8000/api/voice/test
```

The standalone harness is the fastest way to debug transport, playback, interruption, and truncation behavior. The Next.js /voice page uses the same backend API but adds application UI around it.


Text vs voice

| Concern | Text mode | Voice mode |
| --- | --- | --- |
| Response path | LangGraph therapeutic graph | Direct Realtime websocket adapter |
| Prompting | Full layered prompt assembly per turn | One bounded spoken prompt at connect time |
| Tools | Provider-backed tools supported | None exposed currently |
| Safety | Full graph-level crisis gate | Lightweight spoken crisis guidance only |
| Memory | Read + write during normal chat lifecycle | Prompt preload at connect time, then transcript replay on disconnect for semantic/procedural extraction + episodic summary |
| Interruption | N/A | Server VAD + client truncation + local ducking |
| UI maturity | Primary, stable interface | Experimental speech preview |

Environment variables

```bash
# .env.local
OPENAI_API_KEY=...
```