breaking · operator analysis

OpenAI Shipped Your Voice Stack at $0.25/Min. Vapi Went Enterprise. The Infra Layer Abandoned Agencies in Eleven Days.

By Alfredo Romero, CEO, Hermes · May 18, 2026 · 8 min read

On May 7, OpenAI made gpt-realtime-2 generally available in the Realtime API at roughly $0.25 to $0.35 per minute of conversation, with GPT-5-class reasoning, 128K context, native translation, and 70+ languages. Coverage: TechCrunch, OpenAI's announcement, and a clean teardown at DataCamp. Five days later, on May 12, Vapi closed a $50M Series B at a $500M valuation and publicly positioned itself as the enterprise infrastructure layer, with Amazon Ring running 100% of inbound through them. Two days after that, Synthflow's enterprise page started leading with two deal proof points: a $230M multinational BPO operator running 40+ branded agents at 600K calls per month, and a top U.S.-based CRM platform white-labeling Synthflow at 500K calls per month, both inside 60 days. The full receipts are on Synthflow's enterprise comparison page.

Eleven days. Three events. One outcome. The middle of the AI voice stack collapsed, and the agency owner with five clients is the one customer left in the room that nobody is engineering for.

Why this matters for AI voice agencies

For the last 18 months, the canonical AI voice agency stack was a wrapper or infra platform (Retell, Vapi, Voicerr, Synthflow) plus GoHighLevel plus Zapier plus Stripe plus Twilio plus a custom dashboard. The wrapper layer existed because the orchestration was hard. Stitching STT, an LLM, TTS, telephony, turn-taking, barge-in, latency tuning, and call quality together was a four-vendor problem with 200ms of jitter to manage. You paid the wrapper for the integration, not the model.

On May 7, the orchestration disappeared. OpenAI now handles speech-to-speech reasoning inside one API call, in one model, with caching that drops the dominant cost line by 80x. The wrapper's reason to exist was the stitching. The stitching just got eaten by the model layer. eWeek summarized the strategic shift cleanly:

"OpenAI's bet is that doing the audio reasoning inside one model is more defensible than stitching three vendors together. Whether ElevenLabs, Deepgram and the rest can hold their wedge depends on how quickly they push their own integrated stacks."

At the same time, the wrappers and infra companies that could not be commoditized from below decided to escape upward. Vapi's TechCrunch story is the public version of this. The internal version is the customer list from the gpt-realtime-2 launch that same week: Zillow, Glean, Genspark, Bluejay, Intercom, Priceline, Foundation Health, Deutsche Telekom. Read it aloud. Not one agency. Synthflow's customer list is the same shape: a $230M BPO, a national CRM platform, 600K calls per month, 500K calls per month, dedicated success engineers, procurement-driven roadmaps.

The implication for the agency layer is concrete, not vibes. Your platform's product roadmap is now being written for a customer who pays a six-figure annual contract and signs a master service agreement. Your support ticket about a missing white-label string in a webhook payload sits behind their compliance review. Your feature request for a campaign retry rule sits behind their procurement cycle. That is what "enterprise focus" actually means in the day-to-day. It is not a marketing line. It is the prioritization queue.

Meanwhile the wrapper era's pricing power evaporated in the other direction. Voicerr already moved from $28 per month to $199 to $299 per month, a 7x to 10x hike documented at Trillet, because the upstream cost moved underneath them and they had no other lever. They lost both directions in the same quarter. Pricing power from above, model commoditization from below. A wrapper sitting on top of an infra layer that just became one OpenAI call has nowhere to go.

Three paths are now visible for the agency operator with three to fifteen clients:

1. DIY on Pipecat plus OpenAI plus Twilio plus your own glue. Pipecat hit v1.0.0 on April 14, 2026, so the orchestration scaffolding is production-grade open source. Honest cost: $40K to $60K in dev time before you close your sixth retainer, plus a permanent on-call rotation. Real path for a builder with a co-founder engineer and four months. Not the path for an agency that wants to ship a new agent this week.
2. Stay on a wrapper that is losing its reason to exist. Pay 7x to 10x the old sticker price. Inherit every upstream pricing shock. Wait for the next "subscription update" email.
3. Move to an application-layer platform built for agencies. Get the CRM, the campaign engine, the white-label portal, the workspace structure, the billing surface, the A2P submission flow, the recording and transcript pipeline, and the prompt versioning in one place. Let the platform handle the cache strategy, the model routing, and the upstream commercial relationship. Keep the spread.

What we're doing at Hermes about it

Hermes is the operating platform for AI voice agencies. It is the third path above. We are not a wrapper on Vapi. We are not a thin UI on Retell. We are the application layer the infra companies just walked away from.

The pricing stays where it has been since we wrote it down. Starter is $149 per month with 300 included minutes and 3 workspaces. Business is $399 per month with 1,000 included minutes and 7 workspaces. Agency is $699 per month with 1,650 included minutes and 10 workspaces. Overage is $0.21 per minute against a $0.18 landed cost on the new gpt-realtime-2 economics, and we run a multi-model routing strategy on top so the cache hit rate is actually load-bearing instead of theoretical. The $0.03 spread, about 17% over landed cost, is locked because we run the upstream relationship. Wrappers cannot promise this. They do not control the cost structure and they have already proved it by raising prices 7x to 10x.
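To make the tier arithmetic concrete, here is a minimal sketch of a monthly bill under the plan figures quoted in the paragraph above. The `Plan` type, the `PLANS` table, and `monthly_bill` are illustrative names made up for this example, not part of any Hermes API, and the sketch assumes overage is billed per whole minute beyond the included allotment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical container for the plan figures quoted in the text."""
    name: str
    monthly_fee: float       # USD per month
    included_minutes: int
    workspaces: int


OVERAGE_RATE = 0.21  # USD per minute beyond the included minutes (from the text)

PLANS = {
    "starter": Plan("Starter", 149.0, 300, 3),
    "business": Plan("Business", 399.0, 1000, 7),
    "agency": Plan("Agency", 699.0, 1650, 10),
}


def monthly_bill(plan: Plan, minutes_used: int) -> float:
    """Base fee plus overage for any minutes past the included allotment."""
    if minutes_used < 0:
        raise ValueError("minutes_used must be non-negative")
    overage_minutes = max(0, minutes_used - plan.included_minutes)
    return round(plan.monthly_fee + overage_minutes * OVERAGE_RATE, 2)
```

For example, `monthly_bill(PLANS["starter"], 500)` bills the $149 base plus 200 overage minutes at $0.21 each.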

The application layer is the work product. White-label is native, not bolted on. Every workspace has its own subdomain, custom branding, end-client portal, and per-workspace billing. The CRM is built into the same database as the campaign engine, so a call outcome writes a lead status without a Zapier hop. The campaign engine knows how to retry, schedule callbacks, hand off to a human, and reconcile A2P 10DLC submissions ($30 pass-through, $0 margin). The billing surface knows the difference between an included minute and an overage minute. None of this is a feature roadmap. It is the live platform, and it is the layer OpenAI's API does not ship.
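The "no Zapier hop" point is easier to see in code. The sketch below is not Hermes's implementation. It only illustrates the general idea of writing a call outcome and the matching lead status in one database transaction, using Python's built-in `sqlite3`. The table names, the `OUTCOME_TO_STATUS` mapping, and both function names are assumptions invented for this example.

```python
import sqlite3

# Hypothetical mapping from a call outcome to a CRM lead status.
OUTCOME_TO_STATUS = {
    "booked": "appointment_set",
    "no_answer": "retry_scheduled",
    "not_interested": "closed_lost",
}


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one lead, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leads (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE calls ("
        "id INTEGER PRIMARY KEY, "
        "lead_id INTEGER NOT NULL REFERENCES leads(id), "
        "outcome TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO leads (id, status) VALUES (1, 'new')")
    conn.commit()
    return conn


def record_call_outcome(conn: sqlite3.Connection, lead_id: int, outcome: str) -> str:
    """Log the call and update the lead status in a single transaction."""
    status = OUTCOME_TO_STATUS[outcome]  # raises KeyError on an unknown outcome
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO calls (lead_id, outcome) VALUES (?, ?)",
            (lead_id, outcome),
        )
        conn.execute(
            "UPDATE leads SET status = ? WHERE id = ?",
            (status, lead_id),
        )
    return status
```

Because both writes share one transaction, the call log and the lead status cannot drift apart the way they can when a webhook relay drops an event.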

The side-by-sides if you want to do the comparison yourself: Hermes vs the Vapi plus GHL stack, Hermes vs Synthflow, Hermes vs Voicerr, and Hermes vs the DIY stack.

Action steps for agencies affected (this week, not next quarter)

1. Recompute your real per-minute cost on the new gpt-realtime-2 economics. Pull your last 30 days of upstream invoices (Vapi, Retell, Voicerr, whoever). Divide by minutes used. If your landed cost per minute is north of $0.20 and you are charging clients $0.30 to $0.45 per minute, your spread is closing while the model layer's spread is widening. That is the leak.
2. Ask your current platform, in writing, where the agency tier sits in the product roadmap. Phrase it specifically: "What new features shipped for sub-100-call-per-day operators in Q1 2026?" If the answer is enterprise SSO, audit logs, dedicated success engineers, or a partner program for BPOs, you have your answer.
3. Stress-test your client-facing surface, not the model. Ten simultaneous calls. Thirty. Watch the latency curve. Watch which dashboard your client logs into. Watch whether the email confirmations carry your domain or a "Powered by" footer. The model is not your problem anymore. The application layer is.
4. Tell your clients before they hear it from a YouTube tutorial. Liam Ottley, Caleb Casas, Brendan Jowett, and Daniel Walter have not yet published "what gpt-realtime-2 means for your AI voice agency stack" videos. They will this week. Send the operator-level note to your clients first.
5. Move the parts of the stack you do not control. You cannot control OpenAI's pricing or Vapi's customer focus or Synthflow's enterprise pivot. You can choose whether you build the application layer yourself, rent it from a wrapper that just lost its wedge, or run on a platform built for agency operators who need new agents live in 72 hours.
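The first action step above is plain arithmetic, and a few lines of Python keep it honest. This is a rough sketch: the function names are invented for illustration, the $0.20 threshold comes from the text, and the invoice totals in the usage note are made-up sample numbers, not real data.

```python
from typing import Iterable

LEAK_THRESHOLD = 0.20  # USD per minute, the cutoff named in the text


def landed_cost_per_minute(invoice_totals_usd: Iterable[float], minutes_used: float) -> float:
    """Total upstream spend over the period divided by minutes actually used."""
    if minutes_used <= 0:
        raise ValueError("minutes_used must be positive")
    return sum(invoice_totals_usd) / minutes_used


def spread_per_minute(client_rate: float, landed_cost: float) -> float:
    """What you keep per minute after the upstream bill."""
    return client_rate - landed_cost


def is_leaking(landed_cost: float, threshold: float = LEAK_THRESHOLD) -> bool:
    """True when the landed cost sits above the threshold from the text."""
    return landed_cost > threshold
```

With sample invoices of $1,200 and $300 against 6,000 minutes, `landed_cost_per_minute([1200.0, 300.0], 6000)` comes out to $0.25 per minute, which `is_leaking` flags.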

Frequently asked questions

If gpt-realtime-2 is $0.25 to $0.35 per minute and Hermes overage is $0.21 per minute, how does the math even work?

Hermes runs a multi-model routing strategy with cache configuration, system prompt reuse, and per-call telemetry tuned at the platform level. Cached audio input on gpt-realtime-2 drops from $32 per 1M tokens to $0.40 per 1M tokens, which is an 80x reduction on the dominant cost line in a real outbound conversation. Combine that with selective model routing for non-reasoning segments (greeting, clarifying questions, scheduled callbacks, voicemail) and the landed cost lands around $0.18 per minute. We charge $0.21. That is a $0.03 spread per minute, about 17% over landed cost, locked, on top of three included-minute tiers ($149 / 300 min, $399 / 1,000 min, $699 / 2,000 min). An agency owner does not need to engineer cache hits or pick which model handles which conversation segment. The platform does it.
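The cache effect can be sketched as a weighted average of the two audio-input rates quoted above. The function names are illustrative, and the cache hit rate is an input you would have to measure yourself. The text does not state a hit rate or a tokens-per-minute figure, so this sketch stops at cost per million input tokens rather than guessing a per-minute number.

```python
UNCACHED_USD_PER_M_TOKENS = 32.00  # audio input rate quoted in the text
CACHED_USD_PER_M_TOKENS = 0.40     # cached audio input rate quoted in the text


def blended_input_cost_per_million(cache_hit_rate: float) -> float:
    """Weighted average audio-input cost per 1M tokens for a given hit rate."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return (
        cache_hit_rate * CACHED_USD_PER_M_TOKENS
        + (1.0 - cache_hit_rate) * UNCACHED_USD_PER_M_TOKENS
    )


def cache_reduction_factor() -> float:
    """How many times cheaper a fully cached token is than an uncached one."""
    return UNCACHED_USD_PER_M_TOKENS / CACHED_USD_PER_M_TOKENS
```

The design point is that the 80x figure applies only to tokens that actually hit the cache. At a 50% hit rate the blended rate is still about $16.20 per 1M tokens, so the hit rate, not the headline discount, drives the landed cost.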

If OpenAI now ships everything in one API call, why does an agency need a platform at all?

Because the platform is no longer the voice. The platform is the CRM, the campaign engine, the workspace structure, the white-label portal, the billing reconciliation, the A2P 10DLC submission, the lead status writebacks, the call attribution, the recording storage, the transcript pipeline, the knowledge base management, the prompt versioning, the retry logic, and the client-facing dashboard. None of that ships in the gpt-realtime-2 endpoint. The voice infrastructure collapsed into one API call. The application layer did not. That is the wedge an agency-built platform fills.

Should I just build this myself on Pipecat plus OpenAI now that the infra is cheap?

If you have a dedicated engineer and four months to burn, yes, the DIY path got cleaner with Pipecat hitting v1.0.0 on April 14. If you have five clients and need new agents live this week, no. The honest math is that a self-built stack on Pipecat plus OpenAI plus Twilio plus a CRM plus a billing surface plus a white-label portal costs $40K to $60K in developer time before you close your sixth retainer, and the maintenance does not stop. The reason platforms still exist is that the application layer is wide, not deep. We built it once. You charge clients $1,500 to $3,000 a month on top.
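For a rough sense of scale on the DIY figure, the sketch below counts how many months of retainer revenue it takes to cover the up-front developer cost. It is a deliberately crude, back-of-envelope model: it assumes every retainer dollar goes toward the build and ignores maintenance, upstream minutes, and everything else. The function name and that simplification are assumptions of this example, not claims from the article.

```python
import math


def months_to_recover(dev_cost_usd: float, clients: int, retainer_usd_per_month: float) -> int:
    """Whole months of gross retainer revenue needed to cover the build cost."""
    monthly_revenue = clients * retainer_usd_per_month
    if monthly_revenue <= 0:
        raise ValueError("monthly revenue must be positive")
    return math.ceil(dev_cost_usd / monthly_revenue)
```

With five clients at $1,500 a month, a $40K build takes `months_to_recover(40_000, 5, 1_500)` months of gross revenue to cover, before any maintenance cost is counted.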

Where this leaves you

The infrastructure layer did not pivot to enterprise out of malice. It pivoted because the model layer commoditized it from below and the only revenue ramp left is the six-figure logo. Vapi did the right thing for its cap table. Synthflow did the right thing for its cap table. OpenAI did the right thing for its cap table. None of them owe the AI voice agency owner a different roadmap.

The agency layer does the right thing for itself by noticing this eleven-day window early. The wedge that opens when an infra layer graduates is the wedge an application-layer platform built specifically for agencies should occupy. By builders, for builders. One platform. Your brand. Your margins. From $149 per month. First agent live in 72 hours.

next step

The infra layer is moving. Move with it.

Founders' Beta: 60 days free for the first 100 operators. Every beta member gets 50% off their chosen plan for 2 months via coupon. The upstream pricing risk is our problem, not yours.

Apply for the Founders' Beta · Hermes vs the Vapi + GHL stack

Alfredo Romero is CEO of Hermes, the operating platform for AI voice agencies. By builders, for builders. Connect on LinkedIn.

written by

Alfredo Romero

CEO and Co-Founder, Hermes

Alfredo runs sales, operations, and strategy at Hermes. Before founding Hermes he ran agencies for nine years and spent the last three building the AI voice operations side. He writes the operator playbook from real builds, not theory.
