# Voice (Experimental)
The current voice path is an experimental OpenAI Realtime integration. It is intentionally narrower than the main text agent.
- It is speech-only and non-agentic.
- It does not expose tool calling, web search, or autonomous actions.
- It does not currently run the full LangGraph therapeutic stack.
- It now writes semantic, procedural, and episodic memory on disconnect via the shared session-end seam, but it still does not run the full text graph live on every spoken turn.
OpenCouch currently ships two voice clients that share the same FastAPI websocket bridge:

- a standalone test harness at `/api/voice/test`
- the Next.js `/voice` page in the web app
## Current architecture
| Layer | Current implementation |
|---|---|
| Browser | Streams PCM16 microphone audio over WebSocket and plays assistant PCM audio locally |
| Backend bridge | FastAPI websocket at /api/voice/session forwards audio and interruption events |
| Realtime model | gpt-realtime for speech-to-speech output |
| Input transcription | gpt-4o-mini-transcribe as an asynchronous input transcript stream |
| Default voice | cedar |
| Audio format | 24 kHz mono PCM16 for both input and output |
| Turn detection | server_vad with threshold: 0.3, prefix_padding_ms: 300, silence_duration_ms: 300, interrupt_response: true, create_response: true |
| Noise reduction | near_field |
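The table above can be summarized as a single session configuration. The sketch below shows a hypothetical `session.update` payload in the shape the OpenAI Realtime API expects, filled in with the values this document describes; it is illustrative, not a verbatim copy of the actual bridge code.

```python
# Hypothetical Realtime session config mirroring the table above.
# Field values come from this document; the payload shape follows the
# OpenAI Realtime API's session.update event.
SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime",
        "voice": "cedar",
        "input_audio_format": "pcm16",    # 24 kHz mono PCM16 in
        "output_audio_format": "pcm16",   # 24 kHz mono PCM16 out
        "input_audio_transcription": {"model": "gpt-4o-mini-transcribe"},
        "input_audio_noise_reduction": {"type": "near_field"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.3,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 300,
            "interrupt_response": True,
            "create_response": True,
        },
    },
}
```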
## Prompt assembly
The voice session builds one bounded system prompt at connection time. It does not rebuild the prompt every turn.
| Source | What it provides |
|---|---|
| Base spoken policy | Short spoken replies, plain language, no markdown, no clinician framing, explicit crisis redirection |
| Procedural rules | User preferences and style rules |
| Semantic facts | Previously noted user facts |
| Episodic arcs | Short summaries from prior sessions |
The current voice prompt is deliberately small:
- whitespace is normalized
- each memory item is trimmed to 220 characters
- up to 6 procedural rules are included
- up to 6 semantic facts are included
- up to 3 episodic arcs are included
- the final prompt is capped at 12,000 characters
This prompt shape is implemented in `apps/backend/voice/realtime.py` via `build_voice_system_prompt()`.
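The budgeting rules above can be sketched in a few lines. This is a minimal illustration of the trimming and capping behavior described in the list, not the actual implementation; the helper names are hypothetical.

```python
# Sketch of the voice prompt budgeting described above. clip() and
# build_prompt() are illustrative names, not the real helpers in
# apps/backend/voice/realtime.py.
import re

MAX_ITEM_CHARS = 220       # each memory item is trimmed to 220 chars
MAX_PROMPT_CHARS = 12_000  # final prompt hard cap

def clip(text: str) -> str:
    """Normalize whitespace, then trim one memory item."""
    return re.sub(r"\s+", " ", text).strip()[:MAX_ITEM_CHARS]

def build_prompt(base: str, rules: list, facts: list, arcs: list) -> str:
    parts = [base]
    parts += [clip(r) for r in rules[:6]]  # up to 6 procedural rules
    parts += [clip(f) for f in facts[:6]]  # up to 6 semantic facts
    parts += [clip(a) for a in arcs[:3]]   # up to 3 episodic arcs
    return "\n".join(parts)[:MAX_PROMPT_CHARS]
```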
## What the browser does
Both the standalone test page and the Next.js /voice page follow the
same transport pattern:
- microphone audio is sent in 512-sample chunks
- assistant playback is tracked by Realtime `item_id` and `content_index`
- on interruption, the client stops playback and reports a `conversation.item.truncate` position back to the backend
- local ducking lowers assistant playback immediately when the browser detects the user has started speaking, before the server-side interruption arrives
The Next.js voice page intentionally keeps transcript rendering disabled for now while the audio path is being stabilized. The standalone test harness still shows transcript events.
## WebSocket contract

### Client → backend
| Message | Purpose |
|---|---|
| `start` | Open a voice session for a `user_id`, `thread_id`, and optional `voice` |
| `audio` | Send base64-encoded PCM16 microphone bytes |
| `truncate` | Tell the backend how much of the assistant audio was actually played |
### Backend → client
| Message | Purpose |
|---|---|
| `ready` | Session is configured and ready for audio |
| `audio` | Assistant PCM audio chunk |
| `transcript` | Optional user or assistant transcript event |
| `interrupted` | Server detected user speech while assistant audio was active |
| `truncated` | Server acknowledged truncation |
| `error` | Surfaced Realtime or bridge error |
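Concretely, the client-side messages might look like the sketch below. Only the message types (`start`, `audio`, `truncate`) come from the contract above; every other field name is an assumption for illustration.

```python
# Hypothetical wire shapes for the client -> backend messages.
# The "type" values come from the contract table; the remaining
# field names are illustrative assumptions.
import base64
import json

start_msg = {
    "type": "start",
    "user_id": "u_123",      # assumed field name
    "thread_id": "t_456",    # assumed field name
    "voice": "cedar",        # optional
}

pcm16_chunk = b"\x00\x00" * 512  # one 512-sample chunk of silence
audio_msg = {
    "type": "audio",
    "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
}

truncate_msg = {
    "type": "truncate",
    "audio_end_ms": 1850,    # how much assistant audio was actually played
}

wire = json.dumps(audio_msg)  # each message is sent as a JSON text frame
```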
## Current limitations
| Limitation | Current state |
|---|---|
| Agentic capability | No tool calling, no web search, no autonomous actions |
| LangGraph integration | Voice does not currently route through the main text agent graph |
| Crisis handling | The prompt contains spoken crisis guidance, but the experimental voice path does not currently run the full graph-level crisis gate used by text mode |
| Memory writes | Voice reads memory at connect time, then on disconnect replays the transcript through semantic/procedural extraction and the shared session-end summarizer |
| Prompt refresh | The system prompt is not refreshed mid-session |
| Transcript UX | Transcript display is disabled in the Next.js voice page while the audio transport is being stabilized |
| Interruption heuristics | VAD and local ducking are intentionally aggressive and can false-trigger in noisy environments or with speaker bleed |
## Running locally
```shell
# Start voice server and open browser
uv run python -m opencouch_cli --voice --port 8000

# Custom port
uv run python -m opencouch_cli --voice --port 9000
```

Or start the server directly:

```shell
uv run uvicorn main:app --port 8000
# Open http://localhost:8000/api/voice/test
```

The standalone harness is the fastest way to debug transport, playback, interruption, and truncation behavior. The Next.js `/voice` page uses the same backend API but adds application UI around it.
## Text vs voice
| Concern | Text mode | Voice mode |
|---|---|---|
| Response path | LangGraph therapeutic graph | Direct Realtime websocket adapter |
| Prompting | Full layered prompt assembly per turn | One bounded spoken prompt at connect time |
| Tools | Provider-backed tools supported | None exposed currently |
| Safety | Full graph-level crisis gate | Lightweight spoken crisis guidance only |
| Memory | Read + write during normal chat lifecycle | Prompt preload at connect time, then transcript replay on disconnect for semantic/procedural extraction + episodic summary |
| Interruption | N/A | Server VAD + client truncation + local ducking |
| UI maturity | Primary, stable interface | Experimental speech preview |
## Environment variables

```shell
OPENAI_API_KEY=...
```