Realtime Lifecycle
One voice session spans three systems: the browser, OpenAI Realtime, and the OpenCouch backend. The browser talks directly to OpenAI over WebRTC, while backend calls handle OpenCouch-owned state and tools.
Connection setup
- The user opens
/voiceand clicks connect. - The browser calls
POST /api/voice/realtime/sessionwiththread_id, optionaluser_id,memory_mode, and optional assistant voice. - The backend loads compact voice memory context when the session is persistent.
agent.voice.realtime.build_realtime_session_config(...)builds the Realtime session configuration: model, voice, server VAD, input transcription, instructions, and tool schemas.- The backend creates an ephemeral OpenAI Realtime client secret and returns it to the browser.
- The browser opens the microphone, creates a WebRTC offer, and exchanges SDP with OpenAI Realtime.
- Once the Realtime data channel opens, the page is ready for speech.
During the call
Realtime emits transcript and function-call events over the data
channel. The browser handles them in apps/web/src/lib/realtime-voice-session.ts:
| Event family | Browser behavior |
|---|---|
| Final user transcript | Stores the pending user transcript. Realtime server VAD creates the next response automatically. |
| Final assistant transcript | Pairs it with the pending user transcript and records the turn. |
| Function call | Calls POST /api/voice/realtime/tools and sends the function-call output back to Realtime. Most tools then ask Realtime to continue; wait_for_user intentionally does not. |
| Agent speaking / ready state | Updates the UI and session store so the page can show listening, warming, and speaking states. |
The live speech loop is Realtime-native. OpenCouch does not run the text agent or a per-turn policy endpoint before voice responses. App-owned behavior is enforced through the session instructions, Realtime function tools, and persisted turn metadata.
Turn recording
When both sides of a spoken exchange are finalized, the browser calls
POST /api/voice/realtime/turn. The backend calls
runtime.voice.record_voice_turn(...) (the VoiceRuntimeFacade), which appends the
finalized user/assistant transcript entries, inferred route metadata,
inferred response-style metadata, grounded lookup result metadata, and
voice-specific diagnostics to the persisted thread state.
Incognito voice skips durable turn recording and returns
recorded=false.
Disconnect and finalization
Disconnect closes the data channel, peer connection, local microphone
tracks, and audio element. When finalize=true, the browser then calls
POST /api/voice/realtime/end.
Persistent sessions run the same end_session(...) finalizer used by
text sessions. That path may write one episodic session arc and promote
held semantic/procedural memory candidates. Incognito sessions end
without durable finalization.
Key files
| File | Purpose |
|---|---|
apps/web/src/lib/realtime-voice-session.ts | WebRTC setup, Realtime data-channel handling, function-call bridge, turn recording, and disconnect finalization. |
apps/web/src/components/realtime-voice-session-provider.tsx | App-shell provider that stores voice state, transcripts, activity, errors, and finalization status. |
apps/web/src/app/voice/page.tsx | Production voice UI. |
apps/web/src/app/voice/realtime-dev/page.tsx | Raw dogfood/debug UI for Realtime events. |
apps/backend/api/routes/voice.py | FastAPI endpoints for Realtime session creation, tool execution, turn recording, and end-session finalization. |
apps/backend/agent/voice/realtime.py | Realtime session configuration and client-secret creation. |
apps/backend/agent/voice/runtime_facade.py | VoiceRuntimeFacade (runtime.voice.*) — voice memory bootstrap, tool context, and turn recording. |
apps/backend/agent/runtime/runtime.py | Shared end_session(...) session-end path used by both voice and text. |