The $0.05/Min Promise: What Your Voice AI Stack Actually Costs at Scale

• 14 min read

Ayush Parchure

Content Writing Intern, Flexprice

Imagine buying a car because the sticker said "$299/month," only to find out that number doesn't include insurance, fuel, maintenance, registration, or the financing rate that quietly doubled your total. That's roughly what happens when a voice AI pricing page says "$0.05/min." The number is real. It's also about 20% of the actual story.

Three months into scaling, the real invoice lands, and it tells a very different story. The bill comes in 3x to 5x higher than what anyone budgeted. The per-minute number on the landing page was one ingredient in a recipe with four others, and those four compound against each other in ways that nobody's pricing page will walk you through. The quality tier chosen at signup didn't just set the TTS cost. It rippled across every layer underneath.

This post breaks down what a voice AI call actually costs, where the spread between cheap and premium really lives, and the one billing dynamic that almost nobody models until it shows up on an invoice they weren't expecting.

Your voice AI call has five cost layers

When you look at a voice AI provider's pricing page, you're seeing one number that represents one layer. That's like pricing a restaurant by looking at the cost of flour. There are four more layers underneath, each with its own vendors, its own pricing tiers, and its own way of scaling that doesn't match the others. You probably only modeled the layer you could see on the landing page. The other four are where the surprises live.

Let's walk through all five.

Telephony: the only predictable line item

This one is boring, and that's a compliment. Telephony runs about $0.01/min through Twilio, Vonage, or similar carriers. It's the smallest cost in the stack and the most stable. It shows up on the bill, it matches what you expected, and nobody loses sleep over it.

I'm mentioning it here for one reason: so you understand that everything after this section is where the problems start.

STT: small numbers that add up at volume

STT pricing looks like a rounding error when you're running demos. Here's what the main providers charge per minute of audio:

  • Deepgram Nova-3: $0.0077/min

  • AssemblyAI Universal: $0.0025/min

  • OpenAI Whisper: $0.006/min

  • Google Enhanced: $0.036/min

At 10,000 minutes a month, you're looking at the difference between $25 and $360. Nobody's losing their job over that spread. But run those same numbers at 500,000 minutes a month, and the picture changes fast. AssemblyAI comes in around $1,250/month. Google Enhanced lands at $18,000/month. That's a $16,750/month gap on a component you probably never even negotiated.
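The spread is easy to reproduce yourself. A minimal sketch, using the per-minute rates listed above (the monthly volumes are the illustrative ones from this section):

```python
# Per-minute STT rates from the list above (USD/min).
STT_RATES = {
    "AssemblyAI Universal": 0.0025,
    "OpenAI Whisper": 0.006,
    "Deepgram Nova-3": 0.0077,
    "Google Enhanced": 0.036,
}

def monthly_stt_cost(minutes: int) -> dict[str, float]:
    """Monthly STT spend per provider at a given call volume."""
    return {name: rate * minutes for name, rate in STT_RATES.items()}

for volume in (10_000, 500_000):
    costs = monthly_stt_cost(volume)
    spread = max(costs.values()) - min(costs.values())
    print(f"{volume:>7,} min/mo -> cheapest ${min(costs.values()):,.0f}, "
          f"priciest ${max(costs.values()):,.0f}, spread ${spread:,.0f}")
```

Re-running this with your own negotiated rates is the fastest way to see whether the STT line item is worth a procurement conversation yet.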

TTS: the emotional decision that sets your cost floor

This is the line item where you spend the most time choosing and the least time modeling. And honestly, it makes sense. The voice is your product's personality. It's the first thing a user reacts to. So you listen to a 15-second demo clip, hear something that sounds warm and natural, and pick it. The pricing conversation happens later, if it happens at all.

Here's what that decision actually costs, measured in price per million characters:

  • Neets.ai: ~$1/million characters

  • Amazon Polly Standard: $4/million

  • Google Cloud TTS Standard: $4/million

  • Deepgram Aura-2: $27 to $30/million

  • Azure Neural HD: $30/million

  • ElevenLabs Flash v2.5: $60/million

  • ElevenLabs Multilingual v2: $120/million

Read that range again. From bottom to top, that is a 200x spread on the same component doing the same job: turning text into spoken words.
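To compare these figures against the per-minute numbers elsewhere in the stack, you have to convert characters into talk time. A rough sketch, assuming about 1,000 characters of text per minute of synthesized speech (a rule of thumb, not a vendor figure; your scripts may run shorter or longer):

```python
# Convert per-million-character TTS pricing into an approximate
# per-minute cost. ASSUMPTION: ~1,000 characters of text per minute
# of synthesized speech (a rough rule of thumb).
CHARS_PER_MIN = 1_000

TTS_RATES_PER_MILLION = {   # USD per million characters, from the list above
    "Neets.ai": 1,
    "Amazon Polly Standard": 4,
    "Deepgram Aura-2": 30,
    "ElevenLabs Flash v2.5": 60,
    "ElevenLabs Multilingual v2": 120,
}

def tts_cost_per_minute(rate_per_million: float) -> float:
    """Approximate per-minute TTS cost for a given character rate."""
    return rate_per_million / 1_000_000 * CHARS_PER_MIN

for name, rate in TTS_RATES_PER_MILLION.items():
    print(f"{name}: ~${tts_cost_per_minute(rate):.4f}/min")
```

At the top of the range, the voice alone costs more per minute than an entire budget stack's telephony, STT, and LLM layers combined.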

LLM inference: the wildcard that grows with every turn

This is the most variable and most misunderstood layer in the stack. The per-minute cost ranges from about $0.02 to north of $0.20, depending on which model you're running. GPT-4o-mini versus GPT-4o alone represents roughly a 10x cost difference on this single layer.

But the model choice isn't actually the scariest part. The scariest part is how costs behave inside a conversation as it gets longer. Every turn adds to the context window, and the context window is what you're paying for. I'm going to come back to this in its own section because it deserves a full explanation. For now, just know that LLM costs don't scale linearly with conversation length. They accelerate. And your spreadsheet almost certainly doesn't account for that.

If you're not sure how to track this at the token level, here's a practical guide on metering LLM token usage for billing.

Platform and orchestration: the biggest line item nobody negotiates

This is the layer that handles routing, state management, failover, and the glue between your STT, LLM, and TTS. If you're using Vapi, Retell, Bland, or something similar, this is what you're paying them for. If you built it yourself, this is your engineering time plus infrastructure.

The per-minute range:

  • Retell AI: ~$0.07/min

  • Vapi: ~$0.05/min listed, but the real cost depends on which providers they're wrapping underneath

  • Custom build: Engineering time plus infra, typically $0.03 to $0.10/min at scale, depending on how lean you run it

Here's what's wild about this layer: it's often the single largest line item in your entire stack, and it's probably the one you evaluated last. You might have spent weeks A/B testing TTS voices and then picked your orchestration platform based on whoever had the best docs or the fastest integration. That's a $25,000 to $75,000/month decision at 500K minutes being made on convenience.

We compared the leading options in our breakdown of consumption-based billing platforms for voice AI.

The honest total

Add it all up and the range runs from roughly $0.09/min on a budget stack to over $0.44/min on a premium stack. That is approximately a 5x spread for the same call, the same use case, with different tier choices across the layers.

The "$0.05/min" number that shows up in marketing materials? It's accurate for a very specific, very stripped-down configuration that almost nobody actually ships to production. It's the starting price on the menu, not the meal.

What you're actually buying when you pick a voice quality tier

Choosing a voice quality tier feels like an audio preference. It's not. It's a decision about latency, emotional range, user trust, and the cost floor you're locking yourself into. Let's talk about what actually separates the tiers, because it's not what most people think.

Prosody is the real differentiator, not phonemes

Budget TTS gets the sounds right. It pronounces words correctly, hits the consonants, and nails the vowels. If you're reading a transcript, you'd never know the difference.

But premium TTS gets the rhythm right. Prosody is the technical term for pitch, stress, pace, and intonation. It's the music underneath the words. And it is the hardest thing to synthesize well, and the very first thing a listener detects when it's off.

When someone describes a voice as robotic, they're usually not hearing mispronunciation. They're hearing flat prosody. The words are correct, but the sentence has no shape to it. There's no rise before a question, no emphasis on the word that carries the meaning, no subtle slowdown before an important point. The content is right. The delivery is dead.

This is why a seemingly small difference in MOS (Mean Opinion Score) between tiers, say 3.7 versus 4.2, translates to a 15 to 30% difference in user satisfaction scores. Research on TTS evaluation consistently shows this pattern. A half-point gap on a 5-point scale looks like a rounding error on paper. In a live conversation, it's the difference between a user who stays engaged and one who starts looking for the "talk to a human" button. This quality gap is why designing tiered pricing models isn't just a billing exercise. It's a product experience decision.

Latency: the conversion killer you cannot hear in a demo

Demos are dangerous because they hide latency. You click play, the voice responds instantly from a pre-rendered buffer, and everything feels snappy. In production, with real-time synthesis, the numbers tell a different story.

The threshold that matters is roughly 300 milliseconds. Below that, conversation feels natural. The user talks, the agent responds, and the rhythm of the exchange flows the way a phone call should.

Above 300ms, something shifts. The person on the other end interprets the pause as confusion. They think the agent didn't understand them. They start talking over the system, repeating themselves, getting frustrated. And this isn't a one-time reaction. It compounds across a 5-minute conversation. By minute three, the user has mentally checked out, not because the agent gave bad answers, but because the timing felt wrong in a way they can't quite articulate.

Here's the part that catches you off guard: some of the most expensive voices are also the slowest. Premium does not always mean better for real-time conversation. The voice that won your internal demo, the one everyone on your team agreed sounded the most human, might be the one that kills your completion rates in production because it can't keep up with the pace of actual dialogue.

You're probably on the wrong tier

I'll say this directly: if you're running voice AI at any meaningful scale, there's a good chance you have a tier mismatch somewhere in your stack. Your money is concentrated in a component that isn't the bottleneck, and the actual bottleneck is starved of budget.

Your agent sounds like a premium product and thinks like a budget one. Your users notice. They just can't tell you exactly why the experience feels hollow, because the voice is so good it raises expectations the LLM can't meet.

The under-investment trap runs the other way: budget TTS at $1 to $4 per million characters behind GPT-4o. Your agent reasons beautifully. It handles edge cases, catches nuance, and asks clarifying questions at exactly the right moment. But it does all of this in a voice that sounds like a GPS from 2014. Your user satisfaction sits at mediocre, not because the intelligence is lacking, but because the first impression kills trust before the intelligence gets a chance to show up.


The billing surprise nobody warns you about: context compounding

This is the section you won't find on any provider's pricing page. It's the billing dynamic that turns a reasonable cost model into a budget overrun, and it typically doesn't show up until month three or four when your volume is real, and your architecture is already locked in.

How LLM costs grow inside a single conversation

LLM inference is priced by tokens. You pay for input tokens (what you send to the model) and output tokens (what the model sends back). Simple enough. But here's what most cost models miss entirely: the input token count grows with every single conversational turn, because you're sending the full conversation history each time.

Let me walk you through this with real numbers.

  • Turn 1: You send the system prompt plus one user utterance. Maybe 500 input tokens. The model responds. You pay for 500 in, maybe 150 out.

  • Turn 5: You send the system prompt, plus four prior exchanges (both user and agent messages for each), plus the new user utterance. Now you're at maybe 2,500 input tokens. You still pay for the output, but the input cost has grown 5x since turn 1.

  • Turn 10: System prompt, nine prior exchanges, new utterance. You're looking at 5,000+ input tokens. For a single inference call.

  • Turn 15: North of 7,500 input tokens. Every turn from here gets more expensive than the last, because you're paying to re-read the entire conversation every time the user says something new.

A 10-minute call does not cost 10x what a 1-minute call costs. It costs materially more because the input context at turn 15 might be 6x the input context at turn 2, and you've been paying that escalating cost at every turn along the way.
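You can sketch this growth directly. The token sizes below are assumptions chosen to match the rough numbers above (350-token system prompt, 150-token utterance, ~500 tokens per completed exchange), and the per-million-token prices are illustrative, not any vendor's quote:

```python
# Context compounding: each turn's input includes the full history.
# ASSUMED sizes, picked to match the rough turn-by-turn numbers above.
SYSTEM_TOKENS = 350
UTTERANCE_TOKENS = 150
EXCHANGE_TOKENS = 500   # one user message + one agent reply
OUTPUT_TOKENS = 150

def input_tokens(turn: int) -> int:
    """Input tokens sent to the model on a given turn (1-indexed)."""
    return SYSTEM_TOKENS + (turn - 1) * EXCHANGE_TOKENS + UTTERANCE_TOKENS

def conversation_cost(turns: int, in_price: float, out_price: float) -> float:
    """Total LLM cost of a call; prices are USD per 1M tokens."""
    total_in = sum(input_tokens(t) for t in range(1, turns + 1))
    total_out = turns * OUTPUT_TOKENS
    return total_in / 1e6 * in_price + total_out / 1e6 * out_price

# Illustrative prices: $2.50/M input, $10/M output.
short = conversation_cost(2, 2.50, 10.0)
long = conversation_cost(15, 2.50, 10.0)
print(f"2-turn call: ${short:.4f}, 15-turn call: ${long:.4f}, "
      f"ratio: {long / short:.1f}x for 7.5x the turns")
```

The punchline is in the ratio: 7.5x the turns costs far more than 7.5x the money, because you pay to re-read the growing history at every turn.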

The practical impact: if you budgeted your LLM costs by multiplying average call length by a fixed per-minute rate, you're underestimating your actual LLM spend by 30 to 60% at scale. This isn't a theoretical risk. It's the single most common billing surprise in voice AI, and it typically surfaces right around the time your volume hits the point where the numbers actually matter.

This is closely tied to a broader problem: how to handle unpredictable usage spikes in AI billing before they destroy your margins.

Why most platforms don't fix this for you

The standard mitigation is called context windowing: summarizing or truncating earlier turns to keep the input token count manageable. In theory, it's straightforward. In practice, it's where things get messy.

Most off-the-shelf voice platforms handle this in one of three ways. Some don't do it at all. The full conversation history gets passed to the LLM on every turn, and costs grow exactly the way I just described. Some do it automatically but poorly, aggressively summarizing or dropping early turns in ways that lose critical context. Your customer mentioned their account number in turn 2? Gone by turn 8. They explained their problem in detail at the start of the call. Compressed into a sentence fragment that strips the nuance.

And some platforms do offer controls, but they're buried in advanced settings you never touch because you don't know the problem exists until the bill arrives.

If you're building on top of a platform like Vapi or Retell, here's a question worth asking before you scale: how does the platform handle context growth on long calls? If the answer is that the full conversation history gets passed every turn, you now know where your month-4 billing surprise is coming from. And if the answer is "we handle it automatically," the follow-up question is: how, and what context gets lost?
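If you control the orchestration layer yourself, a basic context-windowing pass is not much code. This is a hedged sketch, not a production implementation: `summarize` is a placeholder for whatever summarization you actually run, and the message shape mirrors the common chat-completions format:

```python
# Minimal context windowing: keep the system prompt, a running summary
# of dropped turns, and only the most recent `keep_last` messages.

def summarize(turns: list[dict]) -> str:
    # Placeholder: in practice this would be an LLM call or a rules-based
    # extractor that preserves identifiers (account numbers, names, etc.).
    return " / ".join(t["content"][:60] for t in turns)

def window_context(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """messages: [{'role': ..., 'content': ...}]; entry 0 is the system prompt."""
    system, rest = messages[0], messages[1:]
    if len(rest) <= keep_last:
        return messages                       # short call: nothing to trim
    dropped, kept = rest[:-keep_last], rest[-keep_last:]
    summary = {"role": "system",
               "content": "Earlier in the call: " + summarize(dropped)}
    return [system, summary] + kept
```

The failure mode described above lives entirely inside `summarize`: if it drops the account number from turn 2, windowing saved you tokens and cost you the call. Extracting and pinning identifiers before truncation is the usual fix.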

What this actually looks like at 500K minutes/month

Let's make this concrete with a single worked example. Say you're running 500,000 minutes a month on a cost-conscious stack. You've chosen Deepgram for both STT and TTS, GPT-4o-mini for the LLM, and Retell for orchestration. Your monthly bill lands around $60K. That breaks down roughly as:

  • STT: ~$2,100

  • TTS: ~$18,000

  • LLM: ~$10,000

  • Platform: ~$25,000

  • Telephony: ~$5,000

  • Total: ~$60,000/month

Now swap in the premium configuration. ElevenLabs for TTS, GPT-4o for the LLM, enterprise-grade orchestration with all the bells and whistles. Same 500,000 minutes. Same use case. Same conversations.

  • STT: ~$12,000

  • TTS: ~$154,000

  • LLM: ~$100,000

  • Platform: ~$75,000

  • Telephony: ~$5,000

  • Total: ~$346,000/month

The calls sound better. The agent is smarter. The orchestration is more resilient. Whether that improvement is worth an extra $286,000 every month is the question you need to answer by design, not by accident. If you picked tiers in isolation, layer by layer, you'll discover the total only when finance asks why the bill is six figures higher than the forecast.
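Putting both worked examples into code makes the "by design, not by accident" exercise concrete: swap one tier at a time and watch the total move. The figures are the approximate per-layer numbers from the two breakdowns above:

```python
# Per-layer monthly spend (USD) at 500K minutes, from the two
# worked examples above.
BUDGET = {"STT": 2_100, "TTS": 18_000, "LLM": 10_000,
          "Platform": 25_000, "Telephony": 5_000}
PREMIUM = {"STT": 12_000, "TTS": 154_000, "LLM": 100_000,
           "Platform": 75_000, "Telephony": 5_000}

def stack_total(stack: dict[str, int]) -> int:
    """Sum the five cost layers for one configuration."""
    return sum(stack.values())

delta = stack_total(PREMIUM) - stack_total(BUDGET)
print(f"Budget: ${stack_total(BUDGET):,}/mo  "
      f"Premium: ${stack_total(PREMIUM):,}/mo  Delta: ${delta:,}/mo")

# The useful exercise is mixing tiers deliberately, e.g. a premium LLM
# on an otherwise budget stack:
mixed = {**BUDGET, "LLM": PREMIUM["LLM"]}
print(f"Budget stack + premium LLM: ${stack_total(mixed):,}/mo")
```

A few minutes of swapping layers in and out like this, before signup, is the whole difference between choosing the ~5x spread and discovering it.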

How to make this decision before it makes itself

You don't need a six-week analysis to get this right. You need honest answers to three questions, and you can work through all of them this week.

What is your cost per handled call, and what is that call worth?

If a handled call generates $4 of value, whether that's a booking, a resolution that prevents churn, or a qualified lead, and it costs you $0.40 to handle, you've got a 10x return. There's real room to invest in quality. Push the voice up a tier. Use a stronger LLM. The economics support it.

If a handled call generates $0.60 of value and costs $0.40, your cost structure needs to be a hard constraint, not an afterthought. Every tier decision should start with, can we afford this at target volume, rather than does this sound good in the demo.

You probably know your per-minute cost. But do you know your per-call value? You need both numbers on the same spreadsheet before any tier decision makes sense. If the gap between those two numbers is shrinking and you're not sure why, you may have a revenue leakage problem in your usage-based pricing.
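The "both numbers on the same spreadsheet" check fits in a few lines. The $0.10/min stack cost and 4-minute average call below are illustrative assumptions; substitute your own:

```python
# Per-call economics: cost per handled call vs. value per handled call.

def call_margin(value_per_call: float, cost_per_min: float,
                avg_minutes: float) -> tuple[float, float]:
    """Returns (cost per handled call, value-to-cost multiple)."""
    cost = cost_per_min * avg_minutes
    return cost, value_per_call / cost

# The two scenarios from the text, assuming a 4-minute average call
# at $0.10/min total stack cost (illustrative figures).
for value in (4.00, 0.60):
    cost, multiple = call_margin(value, 0.10, 4.0)
    print(f"value ${value:.2f} vs cost ${cost:.2f} -> {multiple:.1f}x return")
```

A 10x multiple says quality upgrades pay for themselves; a 1.5x multiple says every tier decision is a cost decision first.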

Is your quality ceiling set by the right component?

Pull up your stack. Look at each layer. Now ask: which component would a customer notice first if it were bad?

If you're running premium TTS with a budget LLM, you're paying for a beautiful voice that gives mediocre answers. The quality ceiling is set by the LLM, but the money is parked in the TTS.

If you're running a powerful LLM behind a flat, robotic voice, you have an agent that's brilliant and unpleasant to talk to. The quality ceiling is set by the TTS, but the investment is in the LLM.

In both cases, you're paying for quality that can't express itself. The fix isn't to spend more overall. It's to rebalance spend toward whichever component is currently the bottleneck. Move money from where it's wasted to where it's needed.

Have you modeled context growth, or are you assuming linear costs?

This one takes five minutes, and it might save you six figures.

Pull the LLM line item from last month's invoice. Divide by total call minutes to get your actual per-minute LLM cost. Now do the same for the month before. And the month before that.

If the per-minute LLM cost is growing faster than your volume, you have a context compounding problem. Your average conversation length is increasing, or your prompts are getting longer, or both, and the cost curve is bending upward in a way that your linear forecast doesn't capture. This will get worse before it gets better, and it's much cheaper to address now than after you've scaled another 3x. If you don't have visibility into this yet, start with a system that can track API usage for billing in real time.
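Here is the five-minute check as a script. The invoice figures are placeholders; substitute your own monthly LLM spend and call minutes:

```python
# Per-minute LLM cost trend from the last few invoices.
# PLACEHOLDER numbers: (month, llm_spend_usd, call_minutes).
invoices = [
    ("Apr", 6_000, 300_000),
    ("May", 8_800, 380_000),
    ("Jun", 13_500, 450_000),
]

rates = [(month, spend / minutes) for month, spend, minutes in invoices]
for month, rate in rates:
    print(f"{month}: ${rate:.4f}/min")

# If the per-minute rate climbs even as volume climbs, the cost curve
# is bending: context compounding, not just growth.
compounding = all(b > a for (_, a), (_, b) in zip(rates, rates[1:]))
print("context compounding suspected:", compounding)
```

A flat per-minute rate across months means your linear forecast is holding; a rising one means the month-4 surprise is already in motion.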

The "$0.05/min" number is not a lie. It's an incomplete sentence. The rest of it reads: at a specific quality level, for a specific stack configuration, before context growth, platform overhead, and the LLM tier you actually need.

Voice quality tiers are real. The spread is real, up to 5x on total stack costs depending on your choices. What you might find out too late is that the tier decision isn't really about the voice. It's about the cost structure you're building into your product for the next 18 months.

That is worth modeling before you scale, not after. Start with these 7 pricing metrics that actually capture AI product value.

If you're scaling and want to pressure-test your stack economics, we're working through this with a handful of teams at Flexprice right now.

Ayush Parchure

Ayush is part of the content team at Flexprice, with a strong interest in AI, SaaS, and pricing. He loves breaking down complex systems and spends his free time gaming and experimenting with new cooking lessons.
