I Built a Voice Agent on OpenAI’s Realtime API. The Voices Sounded Robotic. Here’s the Hybrid Stack That Fixed It.

Last Updated on May 29, 2026 by Editorial Team

Author(s): Rajesh Vishnani

Originally published on Towards AI.

OpenAI for reasoning, ElevenLabs for voice, Twilio for transport — and a single config flag (`output_modalities: ["text"]`) that ties the whole thing together.

On the first live call, a borrower asked our agent about her interest rate.

The agent answered correctly. The numbers were right. The reasoning was clean. The latency was under a second.

She hung up at 11 seconds.

I went back and listened to the recording. The answer was perfect. The voice sounded like a GPS. You can ship an agent that’s technically correct and emotionally unlistenable, and most of the AI voice stack docs won’t even warn you about it.

This is the story of the architecture I ended up with after that call — a hybrid stack that uses OpenAI’s Realtime API for everything it’s brilliant at (reasoning, turn detection, tool calling) and replaces the one thing it’s not yet ready for in a high-stakes B2C sales context: the voice itself.

If you’re building real-time voice agents for anything where the human on the other end has to want to keep listening — sales, support, healthcare, anything with churn — this is the stack I’d recommend you start with.

The constraint

We were building an outbound voice agent for mortgage refinance calls. Specific constraints:

Sub-second response latency. Anything above ~1.2s “perceived latency” on a sales call and the borrower starts talking over the agent.
Barge-in support. The borrower has to be able to interrupt and the agent has to gracefully shut up mid-sentence.
PSTN-quality audio. μ-law 8 kHz. No fancy 24 kHz neural codecs — telephone carrier, not WebRTC.
Function calling during the call. The agent had to be able to text the borrower a savings summary, book a consult, or schedule a callback — mid-conversation, with no awkward pauses.
A voice that didn’t get hung up on.

OpenAI’s Realtime API was an obvious first choice for the first four. We hit a wall on the fifth.

Why we moved off OpenAI’s built-in voices

OpenAI’s Realtime voices have improved dramatically in 2026, and they’re genuinely good for low-stakes assistants. But for outbound sales, where the listener didn’t ask to be on the phone and is making a continuous, subconscious “is this worth my time?” judgment, the bar is brutal. Three problems showed up in user testing:

Prosody plateau. The voices have a recognizable “synthetic emphasis curve” that listeners pattern-match to “robot.” Once they clock it, trust drops, and the call is effectively over.
Limited voice diversity. We needed a voice that matched the demographic and regional expectations of the audience. The native voice options didn’t get us there.
No control over the warmth dial. We couldn’t tune stability vs. expressiveness for a financial-services context where “calm and trustworthy” is the entire job.

ElevenLabs solved all three. The question became: how do you keep OpenAI’s reasoning loop and just swap the voice?

The key insight: `output_modalities: ["text"]`

This is the single most important config line in the whole architecture, and it’s documented in maybe two paragraphs in OpenAI’s Realtime docs.

By default, OpenAI Realtime emits both audio and text in its response stream. The audio is what makes it “realtime” — the model is generating speech tokens directly, which is why the latency is so good.

But if you set:

output_modalities: ["text"]

…OpenAI’s Realtime model stops emitting audio entirely. It still does everything else — semantic VAD, turn detection, function calling, interrupt handling, conversation state — but the output is plain text tokens, streamed.
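A minimal sketch of what that flag looks like inside a `session.update` event. The surrounding field layout (`session.type`, `audio.input.format`, the `audio/pcmu` format name) is an assumption based on the GA Realtime session schema, and `buildTextOnlySessionUpdate` is a hypothetical helper name, so check both against the current API reference before relying on them.

```javascript
// Sketch of a session.update payload for a text-only Realtime session.
// The field layout is an assumption, not a copy of the production config.
function buildTextOnlySessionUpdate(instructions) {
  return {
    type: "session.update",
    session: {
      type: "realtime",
      instructions,
      // The key line: the model streams text instead of audio.
      output_modalities: ["text"],
      audio: {
        input: {
          // Twilio media streams carry G.711 mu-law at 8 kHz.
          format: { type: "audio/pcmu" },
        },
      },
    },
  };
}

const sessionUpdate = buildTextOnlySessionUpdate(
  "You are a calm, trustworthy mortgage refinance assistant."
);
console.log(JSON.stringify(sessionUpdate, null, 2));
```

The event would be sent as a JSON string over the Realtime WebSocket once the connection is open.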

You take that text stream and pipe it to a TTS provider of your choice.

In our case: ElevenLabs Flash v2.5, configured to emit `ulaw_8000` directly so we don't pay a transcoding tax on the outbound audio path.

That’s it. That’s the trick. You’re not replacing OpenAI Realtime. You’re using 90% of it and bypassing the 10% you don’t want.
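Piping the text stream to TTS usually means deciding how much text to send at a time. The article does not say how the text was batched, so the sketch below shows one common approach: buffer the streamed deltas and release whole sentences. `SentenceChunker` is a hypothetical helper, and the idea that deltas arrive on an event such as `response.output_text.delta` is an assumption to verify against the Realtime event reference.

```javascript
// Buffers streamed text deltas and releases complete sentences for TTS.
class SentenceChunker {
  constructor() {
    this.buffer = "";
  }

  // Add one text delta; returns any sentences that are now complete.
  push(delta) {
    this.buffer += delta;
    const sentences = [];
    // A sentence ends at ".", "!" or "?" followed by whitespace, so a
    // decimal such as "6.125" is not split.
    const boundary = /[.!?]\s+/g;
    let start = 0;
    let match;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + 1; // keep the punctuation mark
      sentences.push(this.buffer.slice(start, end).trim());
      start = boundary.lastIndex;
    }
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call when the response finishes to release any trailing text.
  flush() {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}
```

Sentence-sized chunks give the TTS model enough context for natural prosody while keeping time-to-first-audio low. Smaller chunks start sooner but tend to sound choppier.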

The full architecture

The backend is the bridge. Three responsibilities:

Shuffle μ-law frames between Twilio’s media stream and OpenAI Realtime’s input buffer.
Stream synthesized speech from ElevenLabs back into Twilio’s media stream as μ-law frames.
Handle tool calls that OpenAI emits, run them against MongoDB / Twilio SMS / scheduling, and feed results back into the conversation.

Roughly 1,800 lines of Node.js, of which about 400 are reliability/edge-case code (more on that later).
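The first two responsibilities mostly come down to translating between message shapes. A minimal sketch, assuming the Realtime session's input format is set to μ-law so Twilio's base64 payload can pass through unchanged. The function names are hypothetical. The message shapes follow Twilio's Media Streams `media` message and the Realtime `input_audio_buffer.append` event.

```javascript
// Twilio "media" event -> OpenAI Realtime input buffer append.
// Both sides carry base64 mu-law, so the payload passes through untouched.
function twilioMediaToOpenAI(twilioEvent) {
  if (twilioEvent.event !== "media") return null; // ignore start/stop/mark
  return {
    type: "input_audio_buffer.append",
    audio: twilioEvent.media.payload,
  };
}

// Synthesized mu-law bytes -> Twilio "media" message for playback.
function audioToTwilioMedia(streamSid, mulawBytes) {
  return {
    event: "media",
    streamSid,
    media: { payload: Buffer.from(mulawBytes).toString("base64") },
  };
}
```

Each returned object would be sent with `JSON.stringify` over the corresponding WebSocket.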

The latency budget

This is the conversation that matters. If you’re building anything in this space, paste this into your engineering doc

The two architectural decisions that bought us most of this budget:

ElevenLabs Flash v2.5 over Multilingual v2 or v3. Flash v2.5 hits ~100–150 ms TTFB in NA/EU regions. Multilingual v2 is ~300–400 ms. v3 is ~500 ms. For voice agents, you feel every one of those milliseconds — the model quality difference doesn’t matter if the user has already mentally checked out.
ElevenLabs emitting `ulaw_8000` directly. Most TTS providers default to 24 kHz PCM, which you then have to downsample and μ-law encode before handing to Twilio. That's CPU time you don't have. Flash v2.5 will emit μ-law 8 kHz natively. Use it.
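Both decisions show up as request parameters. The sketch below builds a request for ElevenLabs' HTTP streaming endpoint with `model_id` set to `eleven_flash_v2_5` and `output_format` set to `ulaw_8000`. The article does not say which ElevenLabs endpoint the production system uses, so treating it as the HTTP streaming route is an assumption, and `buildElevenLabsStreamRequest` is a hypothetical helper.

```javascript
// Builds (but does not send) a streaming TTS request that asks for
// 8 kHz mu-law output, so no resampling is needed before Twilio.
function buildElevenLabsStreamRequest(voiceId, text, apiKey) {
  const url = new URL(
    `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}/stream`
  );
  url.searchParams.set("output_format", "ulaw_8000");
  return {
    url: url.toString(),
    init: {
      method: "POST",
      headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({ text, model_id: "eleven_flash_v2_5" }),
    },
  };
}
```

With Node's built-in `fetch`, usage would look like `const { url, init } = buildElevenLabsStreamRequest(voiceId, sentence, apiKey); const res = await fetch(url, init);`, after which `res.body` yields the audio bytes as they arrive.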

If you’re shaving for latency, here’s where the budget goes next:

OpenAI Realtime’s `eagerness: medium` setting in semantic VAD. `low` (8s silence timeout) feels laggy. `high` triggers on natural pauses and creates jumpy, anxious agents. `medium` is the only setting that holds up on actual phone calls.
`noise_reduction: near_field` before VAD. Borrowers pick up in cars and kitchens, with TVs in the background. Skipping this turns your VAD into a coin flip.
Region-aware deployment. ElevenLabs end-to-end TTFB swings by 50–100 ms depending on where your backend lives relative to their endpoints. Worth measuring before you pick a region.
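The first two settings can be expressed as a partial `session.update`. This is a sketch, not the production config: the nesting under `audio.input` is an assumption based on the GA Realtime session schema, and `buildInputTuningUpdate` is a hypothetical helper.

```javascript
// Partial session.update that only touches input-audio tuning.
// Field placement is an assumption to check against the API reference.
function buildInputTuningUpdate(eagerness = "medium") {
  const allowed = ["low", "medium", "high", "auto"];
  if (!allowed.includes(eagerness)) {
    throw new RangeError(`eagerness must be one of ${allowed.join(", ")}`);
  }
  return {
    type: "session.update",
    session: {
      type: "realtime",
      audio: {
        input: {
          // Clean up close-to-mouth phone audio before VAD sees it.
          noise_reduction: { type: "near_field" },
          // Semantic VAD decides end-of-turn from what was said,
          // not only from how long the silence lasted.
          turn_detection: { type: "semantic_vad", eagerness },
        },
      },
    },
  };
}
```

Keeping the eagerness value a parameter makes it easy to A/B the three settings on recorded calls before committing to one.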

The five edge cases that ate two-thirds of my engineering time

The happy path of this architecture took about a week. The remaining three weeks were edge cases. These are the ones that don’t show up in any tutorial.

1. Barge-in mid-sentence

When the borrower interrupts, three things have to happen immediately. First, stop playback of the ElevenLabs audio; Twilio supports this via `{ event: "clear" }` on the media stream. Second, cancel the in-flight OpenAI response with `response.cancel`. Third, discard whatever ElevenLabs audio you have queued. If you skip step three, the borrower hears 200 ms of orphaned audio after they start talking, and it feels like the agent didn't hear them.
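A minimal sketch of a barge-in handler, meant to run when the Realtime API reports that the caller started speaking (the `input_audio_buffer.speech_started` event). The `call` object and its fields (`twilioWs`, `openaiWs`, `frameQueue`, `ttsAbort`, `responseInFlight`) are hypothetical per-call state, not names from the article.

```javascript
// Stops everything the agent is currently saying.
function handleBargeIn(call) {
  // 1. Flush audio Twilio has buffered but not yet played.
  call.twilioWs.send(
    JSON.stringify({ event: "clear", streamSid: call.streamSid })
  );

  // 2. Stop the model from generating more of the interrupted reply.
  if (call.responseInFlight) {
    call.openaiWs.send(JSON.stringify({ type: "response.cancel" }));
    call.responseInFlight = false;
  }

  // 3. Drop synthesized frames we have not forwarded yet, and abort the
  //    TTS request so no late chunks arrive for the old reply.
  call.frameQueue.length = 0;
  call.ttsAbort.abort();
  call.ttsAbort = new AbortController();
}
```

Passing `call.ttsAbort.signal` to the TTS `fetch` call is what makes the abort in step 3 cancel the in-flight stream.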

2. The silence nudge ladder

If the borrower goes quiet — distracted, processing, doing something else — you can’t just sit there in dead air. You also can’t barrel on like a chatbot. We landed on a 3-strike escalation

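The nudge timer starts on `response.done`. In text mode that event arrives when the model finishes generating, not when the audio finishes playing, so the delay is 10 s plus a TTS playback budget (we estimate playback as word_count / 3.5 words per second).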
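The delay formula as a small function. The 10-second base and the 3.5 words-per-second estimate come from the text above, while the function and constant names are illustrative.

```javascript
const BASE_SILENCE_MS = 10_000; // how long to wait after playback ends
const WORDS_PER_SECOND = 3.5;   // rough speaking rate used for the estimate

// Returns how long to wait after response.done before nudging.
function nudgeDelayMs(spokenText) {
  const words = spokenText.trim().split(/\s+/).filter(Boolean).length;
  const playbackMs = (words / WORDS_PER_SECOND) * 1000;
  return Math.round(BASE_SILENCE_MS + playbackMs);
}

// A 7-word reply adds 2 s of playback budget.
console.log(nudgeDelayMs("Does that monthly payment work for you?")); // 12000
```

The timer built on this value would be cleared whenever the borrower starts speaking, so a nudge never fires over them.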

3. The “no-goodbye” guard

The model will happily call `end_call` with no spoken farewell and just hang up mid-conversation. Borrowers experience this as the agent rage-quitting on them.

Defensive guard: if `end_call` fires without any preceding spoken text in the current turn, reject the tool call and force the model to emit a farewell turn first. Roughly 12 lines of code. Saved us from a bunch of angry voicemails.
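One way the guard could look. It returns a decision plus the Realtime events to send back when the call is rejected: a `function_call_output` item explaining why, followed by `response.create` so the model takes another turn. `guardEndCall`, the `turn.spokenText` field, and the wording of the rejection message are assumptions for illustration.

```javascript
// Rejects end_call when the agent has said nothing in the current turn.
function guardEndCall(turn, toolCall) {
  const isSilentHangup =
    toolCall.name === "end_call" && turn.spokenText.trim().length === 0;
  if (!isSilentHangup) return { allow: true, events: [] };

  return {
    allow: false,
    events: [
      {
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id: toolCall.call_id,
          output: JSON.stringify({
            status: "rejected",
            reason: "Say goodbye to the borrower before calling end_call.",
          }),
        },
      },
      // Ask the model to respond again, this time with a farewell.
      { type: "response.create" },
    ],
  };
}
```

Returning the rejection as a tool result, rather than silently dropping the call, lets the model see why the hangup failed and correct itself.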

4. The grace window

Even after `end_call` is approved, give it an 8-second grace window where a `speech_started` event from the borrower cancels the hangup. Borrowers often say "wait, one more question..." right as the agent is wrapping up. Honoring that is the difference between feeling like a sales bot and feeling like a person.
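A sketch of the grace window as a cancellable timer. The 8-second value is from the text above, while `startGraceWindow` and the `hangUp` callback are hypothetical names.

```javascript
const GRACE_MS = 8_000;

// Schedules the hangup and returns a handle that can still cancel it.
function startGraceWindow(hangUp, graceMs = GRACE_MS) {
  let timer = setTimeout(() => {
    timer = null;
    hangUp();
  }, graceMs);

  return {
    // Call when the borrower starts speaking.
    // Returns true if a pending hangup was cancelled.
    cancel() {
      if (timer === null) return false;
      clearTimeout(timer);
      timer = null;
      return true;
    },
  };
}
```

If `cancel()` returns true, the conversation simply continues and a later `end_call` starts a fresh window.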

5. Audio frame alignment

This one’s mundane and important. Twilio expects μ-law frames at exactly 20 ms / 160 bytes. ElevenLabs streams in ~8 KB chunks. You can’t just forward the chunks — you have to re-slice them into 160-byte frames and time them out at 20 ms intervals, or Twilio’s playback will glitch.

The way you find this bug is that everything works in dev, then one of your QA calls has a faint clicking sound, and you spend a day discovering it’s frame alignment.
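A minimal sketch of the re-slicing step. At 8 kHz with one byte per μ-law sample, 20 ms is 8000 × 0.02 = 160 bytes, which is where the frame size comes from. `FrameSlicer` and `startPacer` are hypothetical names. Padding the last partial frame with `0xFF` (μ-law silence) is an assumption about how to handle the tail, not something the article specifies.

```javascript
const FRAME_BYTES = 160; // 20 ms of 8 kHz mu-law, 1 byte per sample

// Turns arbitrarily sized TTS chunks into exact 160-byte frames.
class FrameSlicer {
  constructor() {
    this.remainder = Buffer.alloc(0);
  }

  // Returns every complete frame; leftover bytes wait for the next chunk.
  push(chunk) {
    const data = Buffer.concat([this.remainder, chunk]);
    const frames = [];
    let offset = 0;
    for (; offset + FRAME_BYTES <= data.length; offset += FRAME_BYTES) {
      frames.push(data.subarray(offset, offset + FRAME_BYTES));
    }
    this.remainder = data.subarray(offset);
    return frames;
  }

  // At end of speech, pad the final partial frame with silence.
  flush() {
    if (this.remainder.length === 0) return [];
    const last = Buffer.alloc(FRAME_BYTES, 0xff);
    this.remainder.copy(last);
    this.remainder = Buffer.alloc(0);
    return [last];
  }
}

// Sends one queued frame every 20 ms; returns the interval handle.
function startPacer(frameQueue, sendFrame, intervalMs = 20) {
  return setInterval(() => {
    const frame = frameQueue.shift();
    if (frame) sendFrame(frame);
  }, intervalMs);
}
```

`setInterval` can drift under load, so a production pacer may want to schedule against a monotonic clock instead. The queue that the pacer drains is the same one a barge-in handler would empty.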

What didn’t work

A few things I tried and abandoned, in case it saves you the same dead ends:

Using OpenAI for transcription + Claude for reasoning + ElevenLabs for voice. Cleaner separation of concerns, but you lose OpenAI Realtime’s tight VAD ↔ reasoning loop. Adds ~300 ms of latency. Not worth it.
Volume-based VAD on the audio stream. Worked in the office. Fell apart on actual calls with HVAC noise, car audio, and TVs. Semantic VAD is the only thing that holds up in production.
Caching common ElevenLabs phrases. Sounded great when the cache hit. Sounded weirdly too consistent when it hit twice in the same call, like the agent had a verbal tic. Pulled it.

What I’d do differently

A few notes for v2:

Build the kill switch first, not last. A keyword (/halt) that hard-stops a call regardless of what the agent is doing. We didn't have this for the first two weeks of testing and I regret it.
Log everything with the timestamps aligned. OpenAI events, ElevenLabs audio chunks, Twilio frames, MongoDB writes — same clock, correlated by call ID. Without this, debugging voice glitches is impossible. With it, it’s just tedious.
Use semantic VAD from day one. I burned a week on volume-based VAD before switching. There’s no production scenario where volume-based wins.
Don’t ship without the no-goodbye guard. Easy to forget, devastating in production.
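For the logging note, a minimal sketch of a per-call logger that stamps every record with the same call ID and a single monotonic clock, so events from different providers can be sorted on one timeline. `createCallLogger` and the record fields are illustrative, not the production schema.

```javascript
// Creates a logger whose records share one call ID and one clock.
function createCallLogger(callId, sink = console.log) {
  const startedAt = process.hrtime.bigint(); // monotonic, nanoseconds

  return function log(source, event, details = {}) {
    const elapsedMs = Number(process.hrtime.bigint() - startedAt) / 1e6;
    const record = {
      callId,
      source, // e.g. "openai", "elevenlabs", "twilio", "mongo"
      event,
      elapsedMs, // time since the call started, same clock for all sources
      wallClock: new Date().toISOString(),
      ...details,
    };
    sink(JSON.stringify(record));
    return record;
  };
}
```

Sorting one call's records by `elapsedMs` then gives a single timeline, which makes gaps such as a slow first TTS chunk after `response.done` easy to spot.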

The takeaway

If there’s one architectural pattern to take from this, it’s this:

Realtime APIs from frontier labs are pipelines, not products. The fact that they bundle reasoning, VAD, and voice synthesis into one stream is a convenience, not a constraint. You can — and for many use cases, should — replace any individual stage with a specialist provider.

For voice agents in 2026, the right answer for most production deployments is:

OpenAI Realtime for reasoning, turn detection, function calling, conversation state.
ElevenLabs Flash v2.5 for voice synthesis.
Twilio for transport.
And `output_modalities: ["text"]` as the seam that holds the whole thing together.

Don’t take it from me. Build it. The first call your agent makes that someone actually finishes — instead of hanging up at second 11 — will tell you everything.

If you’re building something similar and hit a wall on any of the edge cases above, drop a comment with the symptom and I’ll share what worked for us.

If this helped, send it to the engineer on your team who’s been told to “just use the OpenAI Realtime quickstart.” They are going to need the architectural escape hatches.

I publish one engineering deep-dive every Sunday on what it actually takes to ship AI in production — the architectures, the edge cases, the things that don’t make it into the official docs. Follow if you want the next one to find you.

Architecture diagrams and config values referenced in this article are from a production deployment serving real outbound sales calls. Some implementation details have been generalized.


Published via Towards AI
