Voice

Voice is OpenAI Realtime-native, not the text agent wrapped in audio. The browser owns WebRTC audio and data-channel events, OpenAI Realtime owns the live speech-to-speech loop, and the FastAPI backend owns OpenCouch policy, tools, memory, transcript recording, and session finalization.

Transport

Browser WebRTC -> OpenAI Realtime -> FastAPI tools

Browser: mic · speaker · data channel
OpenAI Realtime: speech loop · VAD · tools
FastAPI: /api/voice/realtime/*

Session core

Realtime model with app-owned policy

live model

Realtime session

Owns the live spoken response loop. Receives compact instructions, private memory context, Realtime tool schemas, and browser-returned tool outputs.

  • server VAD with interrupt support
  • input transcription for turn recording
  • automatic response creation after user speech
final transcript · tool output
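
The three session behaviours listed above map onto a small configuration payload. The sketch below is illustrative only: the function name, its parameters, and the transcription model name are assumptions, and the field layout follows the Realtime session shape as commonly documented, which may differ between API versions.

```python
from typing import Any


def build_realtime_session_config(
    instructions: str,
    memory_context: str,
    tools: list[dict[str, Any]],
    voice: str,
) -> dict[str, Any]:
    """Assemble a Realtime session payload (hypothetical helper)."""
    # Append the private memory context to the compact instructions.
    full_instructions = instructions
    if memory_context:
        full_instructions += "\n\nPrivate memory context:\n" + memory_context

    return {
        "instructions": full_instructions,
        "voice": voice,
        "tools": tools,
        "turn_detection": {
            "type": "server_vad",        # server VAD
            "interrupt_response": True,  # user speech can interrupt output
            "create_response": True,     # respond automatically after speech
        },
        # Input transcription feeds turn recording; model name is a placeholder.
        "input_audio_transcription": {"model": "whisper-1"},
    }
```

Keeping this in one builder means the browser never has to assemble policy-bearing fields itself; it only receives what the backend decided.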
backend policy

Voice endpoints

Create the session, execute tools, infer turn metadata, record finalized turns, and close persistent sessions.

  • /session builds config and client secret
  • /tools executes app-owned function calls
  • /turn records finalized transcripts
  • /end runs shared session finalization
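
As a rough picture of the contract, the four endpoints can be modelled as a dispatch table over an in-memory backend. This is a framework-free sketch, not the FastAPI code: the class, the request and response field names, and the status values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any

PREFIX = "/api/voice/realtime"


@dataclass
class VoiceBackend:
    """In-memory stand-in for the backend behind the voice endpoints."""

    turns: list[dict[str, Any]] = field(default_factory=list)
    finalization_status: str = "open"

    def session(self, body: dict[str, Any]) -> dict[str, Any]:
        # Builds config and a client secret (placeholder value here).
        voice = body.get("assistant_voice", "default")
        return {"client_secret": "<ephemeral-secret>", "config": {"voice": voice}}

    def tools(self, body: dict[str, Any]) -> dict[str, Any]:
        # Executes an app-owned function call and echoes its call id.
        return {"call_id": body["call_id"], "output": {"tool": body["name"]}}

    def turn(self, body: dict[str, Any]) -> dict[str, Any]:
        # Records one finalized transcript turn.
        self.turns.append({"role": body["role"], "text": body["text"]})
        return {"recorded": len(self.turns)}

    def end(self, body: dict[str, Any]) -> dict[str, Any]:
        # Runs session finalization.
        self.finalization_status = "finalized"
        return {"finalization_status": self.finalization_status,
                "turns": len(self.turns)}

    def handle(self, path: str, body: dict[str, Any]) -> dict[str, Any]:
        routes = {
            PREFIX + "/session": self.session,
            PREFIX + "/tools": self.tools,
            PREFIX + "/turn": self.turn,
            PREFIX + "/end": self.end,
        }
        handler = routes.get(path)
        if handler is None:
            raise KeyError(f"no voice endpoint at {path}")
        return handler(body)
```

A session would typically hit `/session` once, `/tools` and `/turn` many times, and `/end` once.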
shared tool services
reused by text and voice

Memory · lookup · exercises

Realtime schemas call the same service functions used by text SDK specialists.

  • memory control and recall status
  • grounded factual lookup and crisis resources
  • therapeutic response skills and guided exercises
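
One way to picture "same service functions, different schema" is a Realtime function-tool definition whose handler is an ordinary Python function that the text runtime could call too. In this sketch the `region` parameter, the stub service body, and the `execute_tool_call` helper are assumptions; only the tool name `lookup_crisis_resources` comes from this page. The returned event follows the Realtime `function_call_output` item shape.

```python
import json
from typing import Any, Callable


def lookup_crisis_resources(region: str) -> dict[str, Any]:
    """Stub for the shared service; the real one returns grounded data."""
    return {"region": region, "resources": []}


# Realtime-side schema describing the shared service to the live model.
REALTIME_TOOLS = [
    {
        "type": "function",
        "name": "lookup_crisis_resources",
        "description": "Return vetted crisis resources for a region.",
        "parameters": {
            "type": "object",
            "properties": {"region": {"type": "string"}},
            "required": ["region"],
        },
    }
]

# Name -> shared service function.
SERVICES: dict[str, Callable[..., dict[str, Any]]] = {
    "lookup_crisis_resources": lookup_crisis_resources,
}


def execute_tool_call(name: str, arguments_json: str, call_id: str) -> dict[str, Any]:
    """Run a function call and build the event that returns its output."""
    service = SERVICES.get(name)
    if service is None:
        output: dict[str, Any] = {"error": f"unknown tool: {name}"}
    else:
        output = service(**json.loads(arguments_json))
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(output),
        },
    }
```

Because the schema is only a description layered over `SERVICES`, adding a voice tool does not require a second implementation of the service.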
Persistence

Transcript recording plus shared session finalization

per-session identity

Web setup state

Voice reuses the active web thread, memory mode, optional user id, and selected assistant voice.

thread_id · user_id · memory_mode · assistant_voice · transcript · tool_activity · finalization_status
shared runtime

PersistentAgentRuntime

Voice does not run a text turn, but it writes state through the same runtime stores and ends persistent sessions through the same finalizer.

voice_session_memory_context · build_voice_tool_context · prepare_voice_turn_policy · record_voice_turn · end_session
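
To make the recording and finalization path concrete, here is a minimal sketch of per-session state and the two write operations. The field names mirror the identifiers listed on this page, but the signatures, the status values ("open", "finalized"), and the free-function form are assumptions; the real `record_voice_turn` and `end_session` live on the runtime and may look quite different.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VoiceSessionState:
    """Hypothetical container for the per-session fields."""

    thread_id: str
    memory_mode: str
    assistant_voice: str
    user_id: Optional[str] = None
    transcript: list[dict[str, str]] = field(default_factory=list)
    tool_activity: list[str] = field(default_factory=list)
    finalization_status: str = "open"


def record_voice_turn(state: VoiceSessionState, role: str, text: str) -> None:
    """Append one finalized transcript turn; reject writes after the end."""
    if state.finalization_status != "open":
        raise ValueError("session already finalized")
    state.transcript.append({"role": role, "text": text})


def end_session(state: VoiceSessionState) -> str:
    """Finalize once; repeated calls return the same status."""
    if state.finalization_status == "open":
        state.finalization_status = "finalized"
    return state.finalization_status
```

Making `end_session` idempotent is a design choice in this sketch: a browser disconnect and an explicit end-session click can then both trigger finalization safely.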
Shared services, different transport

Text turns run through the OpenAI Agents SDK text runtime. Voice turns run through OpenAI Realtime and record finalized transcripts back into OpenCouch state. Both surfaces reuse the same app-owned memory, grounded lookup, crisis-resource, guided-exercise, and session-end services.

Product surfaces

Surface | Purpose
/voice | Production web voice page. Uses the app shell, session setup, assistant voice selector, transcript display, tool activity, and end-session options.
/voice/realtime-dev | Lower-level dogfood route for inspecting raw Realtime events, parsed transcripts, tool calls, and finalization responses.
/api/voice/realtime/* | Backend contract for session creation, tool execution, turn recording, and session end.

Ownership boundary

Owner | Responsibilities
Browser | Microphone permission, WebRTC peer connection, audio playback, Realtime data-channel parsing, transcript UI, and disconnect/finalization UX.
OpenAI Realtime | Speech input/output, server VAD, interruption handling, live model response generation, and function-call events.
FastAPI backend | Ephemeral client secret creation, Realtime session config, private memory context, function tool execution, durable transcript recording, inferred turn metadata, and persistent session finalization.
Shared runtime services | Memory store, crisis log, grounded lookup, guided-exercise catalog/state, session feedback store, and PersistentAgentRuntime.end_session(...).

What voice intentionally does not do

  • It does not call run_turn(...) or run_turn_stream(...) for each spoken user turn.
  • It does not expose LiveKit; the product voice path is OpenAI Realtime.
  • It does not save durable memory in incognito mode.
  • It does not let the model invent crisis resources; specific crisis resources must come from lookup_crisis_resources.
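
Two of the constraints above lend themselves to simple backend guards. The sketch below is an assumption-laden illustration: the literal mode value "incognito", the `tool` and `resources` keys on tool outputs, and both function names are hypothetical.

```python
from typing import Any


def should_persist_memory(memory_mode: str) -> bool:
    """Durable memory is skipped entirely in incognito mode."""
    return memory_mode != "incognito"


def crisis_resources_for_reply(tool_outputs: list[dict[str, Any]]) -> list[Any]:
    """Collect only resources that came from lookup_crisis_resources."""
    resources: list[Any] = []
    for output in tool_outputs:
        if output.get("tool") == "lookup_crisis_resources":
            resources.extend(output.get("resources", []))
    return resources
```

Anything the model mentions that is not in the returned list would, under this scheme, be treated as ungrounded.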

Topic | Page
Realtime connection flow | Realtime Lifecycle
Function tools and Realtime policy | Tools & Policy
Transcript recording and memory finalization | Voice Persistence
Manual verification checklist | Voice Dogfood