Voice
Voice is OpenAI Realtime-native, not the text agent wrapped in audio. The browser owns WebRTC audio and data-channel events, OpenAI Realtime owns the live speech-to-speech loop, and the FastAPI backend owns OpenCouch policy, tools, memory, transcript recording, and session finalization.
Browser WebRTC -> OpenAI Realtime -> FastAPI tools
Realtime model with app-owned policy
Realtime session
Owns the live spoken response loop. Receives compact instructions, private memory context, Realtime tool schemas, and browser-returned tool outputs.
- server VAD with interrupt support
- input transcription for turn recording
- automatic response creation after user speech
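The three session behaviors above can be sketched as a config builder. This is a minimal illustration, not OpenCouch's actual config: `build_session_config` is a hypothetical helper, the field names mirror one published shape of the OpenAI Realtime session object and may differ between API versions, and the transcription model name is an assumption.

```python
from typing import Any


def build_session_config(
    instructions: str,
    memory_context: str,
    tools: list[dict[str, Any]],
) -> dict[str, Any]:
    """Assemble a Realtime-style session config (illustrative field layout)."""
    return {
        # Compact instructions plus private memory context, joined as one prompt.
        "instructions": f"{instructions}\n\n{memory_context}".strip(),
        "tools": tools,
        # Server VAD: the Realtime side decides when the user stopped speaking.
        "turn_detection": {
            "type": "server_vad",
            "create_response": True,     # respond automatically after user speech
            "interrupt_response": True,  # allow the user to interrupt playback
        },
        # Transcribe user audio so finalized turns can be recorded later.
        "input_audio_transcription": {"model": "whisper-1"},  # assumed model name
    }
```

Keeping the builder on the backend matches the ownership split described here: the browser only receives the resulting config and client secret.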
Voice endpoints
Create the session, execute tools, infer turn metadata, record finalized turns, and close persistent sessions.
- /session builds config and client secret
- /tools executes app-owned function calls
- /turn records finalized transcripts
- /end runs shared session finalization
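As a rough sketch of how the four endpoints divide the work, the dispatch table below uses only the standard library in place of real FastAPI routes. The handler names, request fields, and in-memory transcript list are all hypothetical; only the path prefix and the four route suffixes come from this page.

```python
from typing import Any, Callable

PREFIX = "/api/voice/realtime"  # prefix from the backend contract

Body = dict[str, Any]
TRANSCRIPT: list[Body] = []  # hypothetical stand-in for durable state


def create_session(body: Body) -> Body:
    # Would build the Realtime config and mint an ephemeral client secret.
    return {"config": {"voice": body.get("voice", "default")},
            "client_secret": "<ephemeral>"}


def execute_tool(body: Body) -> Body:
    # Would run an app-owned function call and return its output.
    return {"call_id": body["call_id"], "output": f"ran {body['name']}"}


def record_turn(body: Body) -> Body:
    # Records one finalized transcript turn.
    TRANSCRIPT.append({"role": body["role"], "text": body["text"]})
    return {"recorded": len(TRANSCRIPT)}


def end_session(body: Body) -> Body:
    # Would hand off to the shared session finalization path.
    return {"finalized": True, "turns": len(TRANSCRIPT)}


ROUTES: dict[str, Callable[[Body], Body]] = {
    f"{PREFIX}/session": create_session,
    f"{PREFIX}/tools": execute_tool,
    f"{PREFIX}/turn": record_turn,
    f"{PREFIX}/end": end_session,
}


def handle(path: str, body: Body) -> Body:
    """Dispatch a request body to the handler registered for its path."""
    return ROUTES[path](body)
```

A typical call order would be one /session request, any number of /tools and /turn requests during the conversation, then a single /end request.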
Memory · lookup · exercises
Realtime schemas call the same service functions used by text SDK specialists.
- memory control and recall status
- grounded factual lookup and crisis resources
- therapeutic response skills and guided exercises
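One way to picture "the same service functions" is a single registry that both surfaces call into. The sketch below assumes a flat function-tool schema of the kind Realtime accepts; the `region` parameter, the placeholder catalog, and the `run_realtime_tool` / `run_text_tool` helpers are hypothetical, and only the name lookup_crisis_resources comes from this page.

```python
import json
from typing import Any, Callable


def lookup_crisis_resources(region: str) -> list[dict[str, str]]:
    """Shared service function; the catalog here is placeholder data."""
    catalog = {"XX": [{"name": "<resource name>", "contact": "<contact>"}]}
    return catalog.get(region, [])


# One registry of app-owned services, used by both voice and text paths.
SERVICES: dict[str, Callable[..., Any]] = {
    "lookup_crisis_resources": lookup_crisis_resources,
}

# Schema advertised to the Realtime session (illustrative shape).
REALTIME_TOOL_SCHEMA = {
    "type": "function",
    "name": "lookup_crisis_resources",
    "description": "Return vetted crisis resources for a region.",
    "parameters": {
        "type": "object",
        "properties": {"region": {"type": "string"}},
        "required": ["region"],
    },
}


def run_realtime_tool(name: str, arguments_json: str) -> str:
    """Voice path: arguments arrive as a JSON string, output returns as one."""
    args = json.loads(arguments_json)
    return json.dumps(SERVICES[name](**args))


def run_text_tool(name: str, **kwargs: Any) -> Any:
    """Text path: the same service is called with plain keyword arguments."""
    return SERVICES[name](**kwargs)
```

Because both helpers resolve through `SERVICES`, a fix to a service function reaches voice and text at the same time.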
Transcript recording plus shared session finalization
Text turns run through the OpenAI Agents SDK text runtime. Voice turns run through OpenAI Realtime and record finalized transcripts back into OpenCouch state. Both surfaces reuse the same app-owned memory, grounded lookup, crisis-resource, guided-exercise, and session-end services.
Product surfaces
| Surface | Purpose |
|---|---|
| /voice | Production web voice page. Uses the app shell, session setup, assistant voice selector, transcript display, tool activity, and end-session options. |
| /voice/realtime-dev | Lower-level dogfood route for inspecting raw Realtime events, parsed transcripts, tool calls, and finalization responses. |
| /api/voice/realtime/* | Backend contract for session creation, tool execution, turn recording, and session end. |
Ownership boundary
| Owner | Responsibilities |
|---|---|
| Browser | Microphone permission, WebRTC peer connection, audio playback, Realtime data-channel parsing, transcript UI, and disconnect/finalization UX. |
| OpenAI Realtime | Speech input/output, server VAD, interruption handling, live model response generation, and function-call events. |
| FastAPI backend | Ephemeral client secret creation, Realtime session config, private memory context, function tool execution, durable transcript recording, inferred turn metadata, and persistent session finalization. |
| Shared runtime services | Memory store, crisis log, grounded lookup, guided-exercise catalog/state, session feedback store, and PersistentAgentRuntime.end_session(...). |
What voice intentionally does not do
- It does not call run_turn(...) or run_turn_stream(...) for each spoken user turn.
- It does not expose LiveKit; the product voice path is OpenAI Realtime.
- It does not save durable memory in incognito mode.
- It does not let the model invent crisis resources; specific crisis resources must come from lookup_crisis_resources.
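The incognito rule in the list above can be pictured as a guard at finalization time. This is a minimal sketch under stated assumptions: `VoiceSession`, `finalize`, and the list-backed memory store are hypothetical names, not the actual OpenCouch types.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceSession:
    """Hypothetical session record holding finalized transcript lines."""
    incognito: bool
    transcript: list[str] = field(default_factory=list)


def finalize(session: VoiceSession, memory_store: list[str]) -> int:
    """Persist transcript lines unless incognito; return how many were saved."""
    if session.incognito:
        # Incognito sessions never write to durable memory.
        return 0
    memory_store.extend(session.transcript)
    return len(session.transcript)
```

Placing the check in the finalization step, rather than in each caller, means every surface that ends a session goes through the same guard.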
Related pages
| Topic | Page |
|---|---|
| Realtime connection flow | Realtime Lifecycle |
| Function tools and Realtime policy | Tools & Policy |
| Transcript recording and memory finalization | Voice Persistence |
| Manual verification checklist | Voice Dogfood |