# Voice (Experimental)
The current voice path is an experimental OpenAI Realtime integration. It is intentionally narrower than the main text agent.
- It is speech-only and non-agentic.
- It does not expose tool calling, web search, or autonomous actions.
- It does not currently run the full LangGraph therapeutic stack.
- It now writes semantic, procedural, and episodic memory on disconnect via the shared session-end seam, but it still does not run the full text graph live on every spoken turn.
OpenCouch currently ships two voice clients that share the same FastAPI websocket bridge:

- a standalone test harness at `/api/voice/test`
- the Next.js `/voice` page in the web app
## Current architecture
| Layer | Current implementation |
|---|---|
| Browser | Streams PCM16 microphone audio over WebSocket and plays assistant PCM audio locally |
| Backend bridge | FastAPI websocket at /api/voice/session forwards audio and interruption events |
| Realtime model | gpt-realtime for speech-to-speech output |
| Input transcription | gpt-4o-mini-transcribe as an asynchronous input transcript stream |
| Default voice | cedar |
| Audio format | 24 kHz mono PCM16 for both input and output |
| Turn detection | server_vad with threshold: 0.3, prefix_padding_ms: 300, silence_duration_ms: 300, interrupt_response: true, create_response: true |
| Noise reduction | near_field |
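The table above can be summarized as a single session configuration. The sketch below shows a hypothetical `session.update` payload in the shape the OpenAI Realtime API expects, filled in with the values this document describes; it is illustrative, not a verbatim copy of the actual bridge code.

```python
# Hypothetical Realtime session config mirroring the table above.
# Field values come from this document; the payload shape follows the
# OpenAI Realtime API's session.update event.
SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime",
        "voice": "cedar",
        "input_audio_format": "pcm16",    # 24 kHz mono PCM16 in
        "output_audio_format": "pcm16",   # 24 kHz mono PCM16 out
        "input_audio_transcription": {"model": "gpt-4o-mini-transcribe"},
        "input_audio_noise_reduction": {"type": "near_field"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.3,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 300,
            "interrupt_response": True,
            "create_response": True,
        },
    },
}
```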
## Prompt assembly
The voice session builds one bounded system prompt at connection time. It does not rebuild the prompt every turn.
| Source | What it provides |
|---|---|
| Base spoken policy | Short spoken replies, plain language, no markdown, no clinician framing, explicit crisis redirection |
| Procedural rules | User preferences and style rules |
| Semantic facts | Previously noted user facts |
| Episodic arcs | Short summaries from prior sessions |
The current voice prompt is deliberately small:
- whitespace is normalized
- each memory item is trimmed to 220 characters
- up to 6 procedural rules are included
- up to 6 semantic facts are included
- up to 3 episodic arcs are included
- the final prompt is capped at 12,000 characters
This prompt shape is implemented in `apps/backend/voice/realtime.py` via `build_voice_system_prompt()`.
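The budgeting rules above can be sketched in a few lines. This is a minimal illustration of the trimming and capping behavior described in the list, not the actual implementation; the helper names are hypothetical.

```python
# Sketch of the voice prompt budgeting described above. clip() and
# build_prompt() are illustrative names, not the real helpers in
# apps/backend/voice/realtime.py.
import re

MAX_ITEM_CHARS = 220       # each memory item is trimmed to 220 chars
MAX_PROMPT_CHARS = 12_000  # final prompt hard cap

def clip(text: str) -> str:
    """Normalize whitespace, then trim one memory item."""
    return re.sub(r"\s+", " ", text).strip()[:MAX_ITEM_CHARS]

def build_prompt(base: str, rules: list, facts: list, arcs: list) -> str:
    parts = [base]
    parts += [clip(r) for r in rules[:6]]  # up to 6 procedural rules
    parts += [clip(f) for f in facts[:6]]  # up to 6 semantic facts
    parts += [clip(a) for a in arcs[:3]]   # up to 3 episodic arcs
    return "\n".join(parts)[:MAX_PROMPT_CHARS]
```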
## What the browser does
Both the standalone test page and the Next.js /voice page follow the
same transport pattern:
- microphone audio is sent in 512-sample chunks
- assistant playback is tracked by Realtime `item_id` and `content_index`
- on interruption, the client stops playback and reports a `conversation.item.truncate` position back to the backend
- local ducking lowers assistant playback immediately when the browser detects the user has started speaking, before the server-side interruption arrives
The Next.js voice page intentionally keeps transcript rendering disabled for now while the audio path is being stabilized. The standalone test harness still shows transcript events.
## WebSocket contract

### Client → backend
| Message | Purpose |
|---|---|
| `start` | Open a voice session for a `user_id`, `thread_id`, and optional `voice` |
| `audio` | Send base64-encoded PCM16 microphone bytes |
| `truncate` | Tell the backend how much of the assistant audio was actually played |
### Backend → client
| Message | Purpose |
|---|---|
| `ready` | Session is configured and ready for audio |
| `audio` | Assistant PCM audio chunk |
| `transcript` | Optional user or assistant transcript event |
| `interrupted` | Server detected user speech while assistant audio was active |
| `truncated` | Server acknowledged truncation |
| `error` | Surfaced Realtime or bridge error |
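Concretely, the client-side messages might look like the sketch below. Only the message types (`start`, `audio`, `truncate`) come from the contract above; every other field name is an assumption for illustration.

```python
# Hypothetical wire shapes for the client -> backend messages.
# The "type" values come from the contract table; the remaining
# field names are illustrative assumptions.
import base64
import json

start_msg = {
    "type": "start",
    "user_id": "u_123",      # assumed field name
    "thread_id": "t_456",    # assumed field name
    "voice": "cedar",        # optional
}

pcm16_chunk = b"\x00\x00" * 512  # one 512-sample chunk of silence
audio_msg = {
    "type": "audio",
    "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
}

truncate_msg = {
    "type": "truncate",
    "audio_end_ms": 1850,    # how much assistant audio was actually played
}

wire = json.dumps(audio_msg)  # each message is sent as a JSON text frame
```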
## Current limitations
| Limitation | Current state |
|---|---|
| Agentic capability | No tool calling, no web search, no autonomous actions |
| LangGraph integration | Voice does not currently route through the main text agent graph |
| Crisis handling | The prompt contains spoken crisis guidance, but the experimental voice path does not currently run the full graph-level crisis gate used by text mode |
| Memory writes | Voice reads memory at connect time, then on disconnect replays the transcript through semantic/procedural extraction and the shared session-end summarizer |
| Prompt refresh | The system prompt is not refreshed mid-session |
| Transcript UX | Transcript display is disabled in the Next.js voice page while the audio transport is being stabilized |
| Interruption heuristics | VAD and local ducking are intentionally aggressive and can false-trigger in noisy environments or with speaker bleed |
## Running locally
```shell
# Start voice server and open browser
uv run python -m opencouch_cli --voice --port 8000

# Custom port
uv run python -m opencouch_cli --voice --port 9000
```

Or start the server directly:

```shell
uv run uvicorn main:app --port 8000
# Open http://localhost:8000/api/voice/test
```

The standalone harness is the fastest way to debug transport, playback, interruption, and truncation behavior. The Next.js `/voice` page uses the same backend API but adds application UI around it.
## Text vs voice
| Concern | Text mode | Voice mode |
|---|---|---|
| Response path | LangGraph therapeutic graph | Direct Realtime websocket adapter |
| Prompting | Full layered prompt assembly per turn | One bounded spoken prompt at connect time |
| Tools | Provider-backed tools supported | None exposed currently |
| Safety | Full graph-level crisis gate | Lightweight spoken crisis guidance only |
| Memory | Read + write during normal chat lifecycle | Prompt preload at connect time, then transcript replay on disconnect for semantic/procedural extraction + episodic summary |
| Interruption | N/A | Server VAD + client truncation + local ducking |
| UI maturity | Primary, stable interface | Experimental speech preview |
## Environment variables

```shell
OPENAI_API_KEY=...
```