SimpleVoiceGateway bridges Twilio Media Streams to the OpenAI Realtime API. Twilio sends μ-law audio over a WebSocket; the gateway pipes it to OpenAI, captures synthesized audio coming back, and sends it down the same socket so the caller hears the assistant in near real time.
## How it connects
The gateway registers an `upgrade` listener on the NestJS HTTP server and claims requests whose path starts with `/simple-voice/stream`. The path carries identity:
```
wss://api.appmint.io/simple-voice/stream/{orgId}/{assistantId}
```
This is the URL Twilio's `<Stream>` verb opens. AppEngine generates the TwiML that points at this URL when the IVR routes a call to the AI assistant.
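AppEngine's TwiML generation isn't shown here; a minimal sketch of what it might produce, using the URL shape above (`buildStreamTwiml` is a hypothetical helper, not AppEngine's actual API):

```typescript
// Hypothetical TwiML builder: opens a bidirectional media stream to the
// gateway. Only the URL shape is taken from the docs; the function itself
// is illustrative.
function buildStreamTwiml(orgId: string, assistantId: string): string {
  const url = `wss://api.appmint.io/simple-voice/stream/${orgId}/${assistantId}`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    `  <Connect><Stream url="${url}" /></Connect>`,
    '</Response>',
  ].join('\n');
}
```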
The older `RealtimeVoiceGateway` at `/voice/stream/...` is disabled in `voice.module.ts` because it conflicts with `SimpleVoiceGateway`'s upgrade handler. `SimpleVoiceGateway` is the only voice WebSocket served today.
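The routing decision inside the upgrade handler amounts to a path match. A sketch of that matcher (`parseStreamPath` is illustrative; the real handler would then hand the socket to its WebSocket server):

```typescript
// Illustrative path parser for the upgrade handler. Claims only paths under
// /simple-voice/stream and extracts the identity segments.
function parseStreamPath(url: string): { orgId: string; assistantId?: string } | null {
  const match = url.match(/^\/simple-voice\/stream\/([^/?]+)(?:\/([^/?]+))?/);
  if (!match) return null; // not ours — let other handlers see the upgrade
  return { orgId: match[1], assistantId: match[2] };
}
```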
## What the gateway does on connect
1. **Resolve OpenAI credentials.** Pulls the org's `OpenAIProvider` config from `UpstreamService`. If no key is configured the socket is closed immediately.
2. **Load the assistant.** If `assistantId` is in the path, fetch the `ai_assistant` record and compose the system prompt from `title`, `description`, `personality`, `capabilities[]`, `tools[]`, `behaviorRules`, `interactionRules`, `knowledgeSources`, and `safety` fields. Pulls the assistant's `voice` (defaults to `alloy`).
3. **Wait for `start` from Twilio.** Twilio sends `{ event: 'start', start: { streamSid, callSid, customParameters } }`. The gateway then opens a second WebSocket to `wss://api.openai.com/v1/realtime?model=gpt-realtime` with the org's API key.
4. **Configure the OpenAI session.** Sends a `session.update` with format `audio/pcmu` for both input and output (μ-law, 8 kHz, the same as Twilio), `server_vad` turn detection, the composed system prompt, the chosen voice, and any tool definitions. Then triggers an initial `response.create` so the assistant greets first.
5. **Pipe audio in both directions.**
   - Twilio `media` → `input_audio_buffer.append` to OpenAI
   - OpenAI `response.output_audio.delta` → `media` to Twilio
   - OpenAI `input_audio_buffer.speech_started` → `clear` to Twilio (so the assistant stops talking when the caller does)
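Step 4 can be sketched as a plain object builder. The exact session schema varies across Realtime API versions, so treat the field layout here as an assumption rather than the gateway's literal payload:

```typescript
// Illustrative builder for the session.update frame described in step 4.
// Field layout is an assumption; check the current Realtime API schema.
function buildSessionUpdate(systemPrompt: string, voice: string, tools: object[]) {
  return {
    type: 'session.update',
    session: {
      instructions: systemPrompt,           // composed system prompt
      voice,                                // assistant voice, e.g. 'alloy'
      tools,                                // tool definitions, if any
      audio: {
        input: {
          format: { type: 'audio/pcmu' },   // μ-law 8 kHz, same as Twilio
          turn_detection: { type: 'server_vad' },
        },
        output: { format: { type: 'audio/pcmu' } },
      },
    },
  };
}
```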
## Tool calls during a call
The OpenAI session is configured with the assistant's tools plus a built-in `transfer_call` (when the IVR context allows transfers). When OpenAI emits `response.function_call_arguments.done`:
```ts
case 'response.function_call_arguments.done': {
  const args = JSON.parse(event.arguments);
  if (event.name === 'transfer_call' && callSid) {
    await this.handleAITransfer(orgId, callSid, args);
    // Confirm to the model and let it say goodbye
    openAiWs.send(/* function_call_output */);
    return;
  }
  const tool = this.crmToolRegistry.getTool(event.name);
  const result = await tool.execute(args, { orgId, userId: 'voice-call', assistantId });
  openAiWs.send(/* function_call_output with result */);
  openAiWs.send(/* response.create to continue */);
}
```
Tool results come back from the existing CRM tool registry, so the same tools your text assistants use also work mid-call.
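The `/* function_call_output */` placeholders above can be sketched as two frames: one feeding the tool result back, one resuming the response. The frame shapes follow the Realtime API's function-calling convention, and `call_id` would come from the function-call event; treat the exact schema as an assumption:

```typescript
// Hedged sketch of the two frames sent after a tool finishes. The
// conversation.item.create / function_call_output shape follows the Realtime
// API convention but is not taken verbatim from the gateway.
function buildToolResultFrames(callId: string, result: unknown) {
  const output = {
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: callId,                    // from response.function_call_arguments.done
      output: JSON.stringify(result),     // tool result, serialized for the model
    },
  };
  const resume = { type: 'response.create' }; // let the model speak the result
  return { output, resume };
}
```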
## IVR context injection
When the call reaches the AI through an IVR flow, the orchestrator stashes context in cache before redirecting Twilio:
`ivr-context:{orgId}:{assistantId}:{caller}`

The gateway reads it on `start` and appends to the system prompt:
- The configured greeting ("Start the conversation by saying: ...")
- Business-hours state (open / closed / current period)
- The full routing JSON: menus, forwarding groups, hours
- A short cheat sheet for the `transfer_call` argument shape per menu action
This is how the assistant knows about departments, on-call groups, and what to do after-hours without each assistant being separately configured.
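The injection above can be sketched as a key builder plus a prompt appender. The cache key format is from the docs; the context field names (`greeting`, `businessHoursState`, `routing`) are illustrative, since the real shape is defined by the IVR orchestrator:

```typescript
// Context shape is an assumption; only the cache-key format is from the docs.
interface IvrContext {
  greeting?: string;
  businessHoursState?: string;
  routing?: unknown;
}

function ivrCacheKey(orgId: string, assistantId: string, caller: string): string {
  return `ivr-context:${orgId}:${assistantId}:${caller}`;
}

// Append whatever context is present to the composed system prompt.
function appendIvrContext(prompt: string, ctx: IvrContext): string {
  const parts = [prompt];
  if (ctx.greeting) parts.push(`Start the conversation by saying: ${ctx.greeting}`);
  if (ctx.businessHoursState) parts.push(`Business hours: ${ctx.businessHoursState}`);
  if (ctx.routing) parts.push(`Routing config: ${JSON.stringify(ctx.routing)}`);
  return parts.join('\n\n');
}
```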
## Recording and journey capture
The gateway buffers caller and AI audio chunks in memory for the duration of the call and captures both transcripts:
```ts
const callerAudioChunks: Buffer[] = [];
const aiAudioChunks: Buffer[] = [];
const aiTranscript: { role: 'ai' | 'caller'; text: string; timestamp: string }[] = [];
const toolsUsed: { tool: string; params: unknown; result: unknown; timestamp: string }[] = [];
```
On disconnect the gateway encodes the buffers, persists the recording, and writes the journey (transcript + tools) into the call record. See Recording and transcription for storage details.
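One way to persist the buffered μ-law chunks as a playable file is to wrap them in a WAV container (format code 7 is μ-law). This is a sketch of the encoding step only, not necessarily what the gateway does; a strict non-PCM WAV also adds a `fact` chunk, though most players accept this minimal 44-byte header:

```typescript
// Sketch: wrap raw μ-law bytes in a minimal WAV container (format 7, 8 kHz
// mono, 1 byte per sample). Assumed encoding path, not the gateway's literal one.
function muLawChunksToWav(chunks: Buffer[]): Buffer {
  const data = Buffer.concat(chunks);
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + data.length, 4); // RIFF chunk size
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);              // fmt chunk size
  header.writeUInt16LE(7, 20);               // audio format 7 = μ-law
  header.writeUInt16LE(1, 22);               // mono
  header.writeUInt32LE(8000, 24);            // 8 kHz sample rate
  header.writeUInt32LE(8000, 28);            // byte rate (1 byte per sample)
  header.writeUInt16LE(1, 32);               // block align
  header.writeUInt16LE(8, 34);               // bits per sample
  header.write('data', 36);
  header.writeUInt32LE(data.length, 40);
  return Buffer.concat([header, data]);
}
```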
## Sample raw frames
Each Twilio frame:

```json
{ "event": "media", "streamSid": "MZ...", "media": { "payload": "<base64 μ-law>" } }
```

Each frame the gateway sends to OpenAI:

```json
{ "type": "input_audio_buffer.append", "audio": "<base64 μ-law>" }
```

Each frame back to Twilio:

```json
{ "event": "media", "streamSid": "MZ...", "media": { "payload": "<base64 μ-law>" } }
```
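The relay between these frame shapes is a thin JSON translation. A minimal sketch of the inbound (caller → model) half, assuming any socket object with a `send(string)` method:

```typescript
// Minimal sketch of the inbound relay: Twilio `media` frames become
// `input_audio_buffer.append` frames for OpenAI. openAiWs is anything with
// a send(string) method, e.g. a `ws` WebSocket.
function onTwilioFrame(raw: string, openAiWs: { send(data: string): void }): void {
  const frame = JSON.parse(raw);
  if (frame.event === 'media') {
    openAiWs.send(
      JSON.stringify({ type: 'input_audio_buffer.append', audio: frame.media.payload }),
    );
  }
}
```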
## Operational notes
- One WebSocket per call. There is no shared OpenAI session — each call creates its own.
- The OpenAI key never leaves the server. Twilio talks to AppEngine, AppEngine talks to OpenAI.
- If OpenAI disconnects, the Twilio leg is closed too — there's no fallback model.
- `UpstreamService.getAIIntegrationConfig(orgId, 'OpenAIProvider')` is the only place the key is read; provision orgs through Upstream / vendor connect.