Voice streaming

Stream Twilio Media Streams audio through SimpleVoiceGateway to the OpenAI Realtime API.

SimpleVoiceGateway bridges Twilio Media Streams to the OpenAI Realtime API. Twilio sends μ-law audio over a WebSocket; the gateway pipes it to OpenAI, captures synthesized audio coming back, and sends it down the same socket so the caller hears the assistant in near real time.

How it connects

The gateway prepends an upgrade listener on the NestJS HTTP server and intercepts upgrade requests whose path starts with /simple-voice/stream. The path carries the call's identity:

wss://api.appmint.io/simple-voice/stream/{orgId}/{assistantId}

This is the URL Twilio's <Stream> verb opens. AppEngine generates the TwiML that points at this URL when the IVR routes a call to the AI assistant.
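
For reference, the TwiML AppEngine emits would look roughly like the sketch below. Only the URL shape is documented above; the builder function and its name are illustrative. Twilio requires <Connect><Stream> (rather than <Start><Stream>) for bidirectional audio, which this gateway needs in order to talk back to the caller.

// Hypothetical helper: builds the <Connect><Stream> TwiML that points a call
// at the gateway. Only the URL shape comes from this page.
function buildStreamTwiml(orgId: string, assistantId: string): string {
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://api.appmint.io/simple-voice/stream/${orgId}/${assistantId}" />
  </Connect>
</Response>`;
}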

The older RealtimeVoiceGateway at /voice/stream/... is disabled in voice.module.ts because it conflicts with SimpleVoiceGateway's upgrade handler. SimpleVoiceGateway is the only voice WebSocket served today.
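
A minimal sketch of how such an upgrade hook can be wired, assuming the ws package. Only the path prefix and the use of a prepended listener come from the description above; attachVoiceUpgrade and handleCall are illustrative names.

import { Server as HttpServer, IncomingMessage } from 'http';
import { Duplex } from 'stream';
import { WebSocketServer, WebSocket } from 'ws';

declare function handleCall(ws: WebSocket, orgId: string, assistantId: string): void;

const wss = new WebSocketServer({ noServer: true }); // upgrades handled manually

function attachVoiceUpgrade(server: HttpServer) {
  // prependListener so this handler sees the upgrade before any other handler
  server.prependListener('upgrade', (req: IncomingMessage, socket: Duplex, head: Buffer) => {
    const path = new URL(req.url ?? '/', 'http://placeholder').pathname;
    if (!path.startsWith('/simple-voice/stream')) return; // not our socket

    // Path shape: /simple-voice/stream/{orgId}/{assistantId}
    const [, , , orgId, assistantId] = path.split('/');
    wss.handleUpgrade(req, socket, head, (ws) => handleCall(ws, orgId, assistantId));
  });
}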

What the gateway does on connect

  1. Resolve OpenAI credentials

    Pulls the org's OpenAIProvider config from UpstreamService. If no key is configured, the socket is closed immediately.

  2. Load the assistant

    If assistantId is in the path, fetch the ai_assistant record and compose the system prompt from title, description, personality, capabilities[], tools[], behaviorRules, interactionRules, knowledgeSources, and safety fields. The assistant's voice is pulled as well (defaults to alloy).

  3. Wait for start from Twilio

    Twilio sends { event: 'start', start: { streamSid, callSid, customParameters } }. The gateway then opens a second WebSocket to wss://api.openai.com/v1/realtime?model=gpt-realtime with the org's API key.

  4. Configure the OpenAI session

    Sends a session.update with format audio/pcmu for both input and output (μ-law, 8 kHz — same as Twilio), server_vad turn detection, the composed system prompt, the chosen voice, and any tool definitions. Then triggers an initial response.create so the assistant greets first.

  5. Pipe audio in both directions (see the sketch after this list)

    • Twilio media → input_audio_buffer.append to OpenAI
    • OpenAI response.output_audio.delta → media to Twilio
    • OpenAI input_audio_buffer.speech_started → clear to Twilio (so the assistant stops talking when the caller does)
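
Putting steps 3–5 together, a condensed version of the per-call bridge might look like the sketch below, assuming the ws package and that credentials and the system prompt were already resolved in steps 1–2. The event names and frame shapes come from this page; bridgeCall and its wiring are illustrative, and the session.update body is elided because its exact nesting depends on the Realtime API version in use.

import { WebSocket } from 'ws';

// Hypothetical per-call bridge covering steps 3–5.
function bridgeCall(twilioWs: WebSocket, apiKey: string, instructions: string) {
  let streamSid: string | undefined;
  let openAiWs: WebSocket | undefined;

  twilioWs.on('message', (raw) => {
    const msg = JSON.parse(raw.toString());

    if (msg.event === 'start') {
      streamSid = msg.start.streamSid;
      // Step 3: open the OpenAI leg once Twilio starts streaming
      const oa = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-realtime', {
        headers: { Authorization: `Bearer ${apiKey}` },
      });
      openAiWs = oa;
      oa.on('open', () => {
        // Step 4: audio/pcmu both ways, server_vad, voice, tools; the exact
        // session.update nesting varies by API version, so it is elided here
        oa.send(JSON.stringify({ type: 'session.update', session: { instructions /* ... */ } }));
        oa.send(JSON.stringify({ type: 'response.create' })); // assistant greets first
      });
      oa.on('message', (data) => {
        const event = JSON.parse(data.toString());
        if (event.type === 'response.output_audio.delta') {
          // Step 5: assistant audio back down the same Twilio socket
          twilioWs.send(JSON.stringify({ event: 'media', streamSid, media: { payload: event.delta } }));
        } else if (event.type === 'input_audio_buffer.speech_started') {
          // Barge-in: tell Twilio to drop any queued assistant audio
          twilioWs.send(JSON.stringify({ event: 'clear', streamSid }));
        }
      });
    } else if (msg.event === 'media' && openAiWs?.readyState === WebSocket.OPEN) {
      // Step 5: caller μ-law audio passes straight through; no transcoding needed
      openAiWs.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: msg.media.payload }));
    }
  });
}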

Tool calls during a call

The OpenAI session is configured with the assistant's tools plus a built-in transfer_call (when the IVR context allows transfers). When OpenAI emits response.function_call_arguments.done:

case 'response.function_call_arguments.done': {
  const args = JSON.parse(event.arguments);

  if (event.name === 'transfer_call' && callSid) {
    await this.handleAITransfer(orgId, callSid, args);
    // Confirm to the model and let it say goodbye
    openAiWs.send(JSON.stringify({
      type: 'conversation.item.create',
      item: { type: 'function_call_output', call_id: event.call_id, output: '{"status":"transferring"}' },
    }));
    return;
  }

  const tool = this.crmToolRegistry.getTool(event.name);
  const result = await tool.execute(args, { orgId, userId: 'voice-call', assistantId });
  openAiWs.send(JSON.stringify({
    type: 'conversation.item.create',
    item: { type: 'function_call_output', call_id: event.call_id, output: JSON.stringify(result) },
  }));
  // Ask the model to keep talking now that the result is in context
  openAiWs.send(JSON.stringify({ type: 'response.create' }));
  break;
}

Tool results come back from the existing CRM tool registry, so the same tools your text assistants use also work mid-call.

IVR context injection

When the call reaches the AI through an IVR flow, the orchestrator stashes context in cache before redirecting Twilio:

ivr-context:{orgId}:{assistantId}:{caller}

The gateway reads it on start and appends to the system prompt:

  • The configured greeting ("Start the conversation by saying: ...")
  • Business-hours state (open / closed / current period)
  • The full routing JSON — menus, forwarding groups, hours
  • A short cheat sheet for transfer_call arg shape per menu action

This is how the assistant knows about departments, on-call groups, and what to do after-hours without each assistant being separately configured.
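
The read side could look roughly like this. The cache key format is documented above; cacheService and the field names inside the cached blob are assumptions.

declare const cacheService: { get(key: string): Promise<string | null> };

// Hypothetical sketch: fold cached IVR context into the system prompt.
async function withIvrContext(prompt: string, orgId: string, assistantId: string, caller: string) {
  const raw = await cacheService.get(`ivr-context:${orgId}:${assistantId}:${caller}`);
  if (!raw) return prompt; // direct call; the orchestrator stashed nothing

  const ctx = JSON.parse(raw); // greeting, hours, routing; field names assumed
  const extras = [
    ctx.greeting && `Start the conversation by saying: ${ctx.greeting}`,
    ctx.hoursState && `Business hours: ${ctx.hoursState}`,
    ctx.routing && `Routing config: ${JSON.stringify(ctx.routing)}`,
  ].filter(Boolean) as string[];

  return [prompt, ...extras].join('\n\n');
}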

Recording and journey capture

The gateway buffers caller and AI audio chunks in memory for the duration of the call and captures both transcripts:

const callerAudioChunks: Buffer[] = [];
const aiAudioChunks: Buffer[] = [];
const aiTranscript: { role: 'ai' | 'caller'; text: string; timestamp: string }[] = [];
const toolsUsed: { tool: string; params: unknown; result: unknown; timestamp: string }[] = [];

On disconnect the gateway encodes the buffers, persists the recording, and writes the journey (transcript + tools) into the call record. See Recording and transcription for storage details.
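
A sketch of that disconnect path, continuing the variables above; persistRecording and saveJourney are hypothetical stand-ins for the storage layer described in Recording and transcription.

declare function persistRecording(callSid: string, caller: Buffer, ai: Buffer): Promise<void>;
declare function saveJourney(callSid: string, journey: object): Promise<void>;

// Hypothetical close handler: flush the in-memory buffers once the call ends.
twilioWs.on('close', async () => {
  await persistRecording(callSid, Buffer.concat(callerAudioChunks), Buffer.concat(aiAudioChunks));
  await saveJourney(callSid, { transcript: aiTranscript, toolsUsed });
});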

Sample raw frames

Each Twilio frame:

{ "event": "media", "streamSid": "MZ...", "media": { "payload": "<base64 μ-law>" } }

Each frame the gateway forwards to OpenAI:

{ "type": "input_audio_buffer.append", "audio": "<base64 μ-law>" }

Each frame back to Twilio:

{ "event": "media", "streamSid": "MZ...", "media": { "payload": "<base64 μ-law>" } }

Operational notes

  • One WebSocket per call. There is no shared OpenAI session — each call creates its own.
  • The OpenAI key never leaves the server. Twilio talks to AppEngine, AppEngine talks to OpenAI.
  • If OpenAI disconnects, the Twilio leg is closed too — there's no fallback model.
  • UpstreamService.getAIIntegrationConfig(orgId, 'OpenAIProvider') is the only place the key is read; provision orgs through Upstream / vendor connect.