Retell Outages: 31 in 12 Months, and What Fallback Architecture Should Look Like
According to StatusGator's monitoring data, more than 31 outages have affected Retell AI users in the past 12 months. In the five months since January 2026 alone, IsDown has tracked 50 separate incidents. The most recently acknowledged outage on Retell's official status page was April 13, 2026. These numbers do not mean Retell is a bad product. They mean it is an infrastructure dependency, and every infrastructure dependency fails. The question is whether your agency has a plan for when it does.
Most agencies running voice AI on Retell do not have a fallback plan. They have a single platform, a set of client agents pointing at it, and no answer for what happens to a live client call when the platform goes down mid-day. That is the operational gap this post addresses. By builders, for builders: here is what a real fallback architecture looks like, why the 99.99% uptime claim does not tell the full story, and what agencies need to demand from any platform they build their revenue on.
How many Retell outages have there actually been?
The raw numbers from third-party monitoring services are harder to ignore than marketing copy. StatusGator has tracked Retell AI since May 2025 and reports more than 31 outages affecting Retell AI users over the trailing 12 months. IsDown, tracking since January 2026, has logged 50 incidents in approximately five months. That is a pace of roughly 10 incidents per month against a platform that serves as the voice infrastructure backbone for hundreds of agencies.
The incidents break into three categories based on the documented history. First, TTS (text-to-speech) provider dependency failures, most visibly the March 14, 2025 incident titled "TTS provider openai is down," where Retell's dependency on OpenAI's TTS service propagated directly into Retell agent failures. Second, cloud infrastructure incidents, including the October 20, 2025 event where an AWS outage caused Retell login issues and analytics failures for four hours and 49 minutes. Third, platform component degradations: dashboard, web call, and end-to-end calling each show separate incident histories in monitoring services, meaning a given incident may not take the whole platform offline but may take out the component your clients are actively using.
"Over the past 12 months, more than 31 outages have affected Retell AI users." [StatusGator, May 2026]
Why does a "99.99% uptime" platform have 31 incidents in a year?
The math does not reconcile on first read. Retell claims a 99.99% uptime guarantee in its materials. Mathematically, 99.99% uptime (four nines) allows for approximately 52 minutes of downtime per year, which is 0.01% of the 525,600 minutes in a year. If you have 31 incidents in 12 months, how is that consistent with a 99.99% claim?
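The arithmetic behind that figure is easy to check. The sketch below assumes a 365-day year and treats every incident as counting against the budget, which is the naive reading rather than how any vendor necessarily computes its SLA; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Minutes of full downtime per year that an uptime percentage permits."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def average_minutes_per_incident(uptime_percent: float, incidents: int) -> float:
    """Average incident length if every incident counted against the budget."""
    return downtime_budget_minutes(uptime_percent) / incidents


budget = downtime_budget_minutes(99.99)
per_incident = average_minutes_per_incident(99.99, 31)
print(f"{budget:.1f} minutes/year, {per_incident:.1f} minutes per incident")
# prints "52.6 minutes/year, 1.7 minutes per incident"
```

Under that naive reading, 31 incidents would each have to average under two minutes. The October 20, 2025 event alone ran four hours and 49 minutes (289 minutes), more than five times the annual budget, so the two numbers can only coexist if they are measuring different things.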
The answer is definitional. Platform vendors measure uptime against full service unavailability: the entire platform is down, no calls go through, nothing works. Third-party monitoring services like StatusGator and IsDown flag any detectable degradation, including partial component failures, elevated error rates, increased latency, and third-party dependency failures that cause degraded but not zero call completion. An event where 15% of Retell calls fail due to an upstream TTS provider issue is a monitoring incident but may not trigger Retell's uptime SLA calculation, because a significant portion of calls are still completing.
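To make the definitional gap concrete, here is a minimal sketch that scores the same day of traffic both ways. The call counts are invented for illustration, and the 5% degradation threshold is an assumption, not a figure published by Retell, StatusGator, or IsDown.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """One monitoring interval; all figures here are illustrative."""
    minutes: int
    calls_attempted: int
    calls_failed: int


def vendor_style_uptime(intervals: list[Interval]) -> float:
    """Counts an interval as down only when every call in it failed."""
    total = sum(i.minutes for i in intervals)
    down = sum(i.minutes for i in intervals
               if i.calls_attempted > 0 and i.calls_failed == i.calls_attempted)
    return 100 * (1 - down / total)


def degraded_minutes(intervals: list[Interval], threshold: float = 0.05) -> int:
    """Minutes in which the failure rate exceeded the threshold (monitor-style)."""
    return sum(i.minutes for i in intervals
               if i.calls_attempted > 0
               and i.calls_failed / i.calls_attempted > threshold)


day = [
    Interval(minutes=600, calls_attempted=1000, calls_failed=2),   # healthy
    Interval(minutes=120, calls_attempted=400, calls_failed=60),   # 15% failing
    Interval(minutes=720, calls_attempted=1200, calls_failed=3),   # healthy
]
print(vendor_style_uptime(day), degraded_minutes(day))
```

On this sample day the full-outage measure reports 100% uptime while the degradation measure reports a two-hour incident. Both are accurate; they answer different questions.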
For an individual agency, that distinction is largely irrelevant. If 15% of your client's calls fail on a Thursday afternoon during an outbound campaign run, you have a problem regardless of whether Retell posts a status page incident. The client does not see Retell's uptime SLA. They see calls not connecting and ask you what happened. That is the operational reality the 99.99% number does not capture.
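Because the client-visible signal is your own call failure rate rather than the vendor's status page, one option is to track that rate yourself. The following is a minimal sketch of a rolling-window monitor; the window size, the 10% threshold, and the minimum sample size are assumptions to tune, and how you learn that a call failed depends on your own stack.

```python
from collections import deque


class FailureRateMonitor:
    """Tracks the failure rate over the most recent call outcomes."""

    def __init__(self, window: int = 50, threshold: float = 0.10, min_calls: int = 20):
        self.outcomes = deque(maxlen=window)  # True means the call failed
        self.threshold = threshold
        self.min_calls = min_calls

    def record(self, failed: bool) -> bool:
        """Record one call outcome; return True if an alert should fire."""
        self.outcomes.append(failed)
        if len(self.outcomes) < self.min_calls:
            return False  # too few calls to judge
        return sum(self.outcomes) / len(self.outcomes) >= self.threshold


monitor = FailureRateMonitor(window=20, threshold=0.10, min_calls=10)
for call_failed in [False] * 9 + [True, True]:
    if monitor.record(call_failed):
        print("Failure rate above threshold: start the incident playbook")
```

The `min_calls` guard keeps a single failed call on a quiet morning from paging anyone, and the bounded window lets the alert clear on its own once calls start completing again.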
"If applications depend on a single AI provider, they face one API failure from downtime — making multi-model fallback strategies important for any production SLA policy." [Universal.cloud, 2026]
What breaks for agencies when Retell goes down?
The failure cascade depends on how the agency has built its stack. For a typical Retell-based agency, here are the four things that break during an incident, in order of client-impact severity.
Outbound campaigns stop mid-run. If an agency is running an outbound call campaign and Retell's end-to-end calling component degrades, calls that are mid-queue do not go out. Depending on when the retry logic fires and whether the campaign engine retries at all, some contacts may receive no call during their designated window. For time-sensitive campaigns (appointment reminders, same-day follow-ups), a two-hour outage in the middle of a campaign window can mean that a batch of contacts simply never gets called that day. The agency either calls them manually or explains to the client why the morning run missed 40% of the list.
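Whether a missed contact still gets called depends on retry logic that respects the contact's window. A minimal sketch, assuming a campaign engine you control; the `Contact` fields, the three-attempt cap, and the ten-minute base delay are all hypothetical choices, not Retell behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Contact:
    phone: str
    window_end: datetime   # last moment a call is still useful
    attempts: int = 0      # failed attempts so far


def plan_retry(contact: Contact, now: datetime,
               max_attempts: int = 3,
               base_delay: timedelta = timedelta(minutes=10)) -> Optional[datetime]:
    """Return the next attempt time, or None to escalate to a manual call."""
    if contact.attempts >= max_attempts:
        return None
    # Back off exponentially: 10, 20, then 40 minutes as failures accumulate.
    next_try = now + base_delay * (2 ** contact.attempts)
    if next_try > contact.window_end:
        return None  # the retry would land outside the contact's window
    return next_try


now = datetime(2026, 5, 5, 9, 0)
reminder = Contact("+15550100", window_end=datetime(2026, 5, 5, 12, 0), attempts=1)
print(plan_retry(reminder, now))
```

Returning `None` rather than silently dropping the contact is the point of the design: every contact either gets a scheduled retry or lands on a list a human can work through.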
Inbound agents stop answering. If a client is running a receptionist agent on Retell and the platform degrades, inbound calls may ring through to an unanswered line, go to voicemail, or connect to a dead line with no audio depending on how the SIP routing is configured. For clients who have replaced their front desk with a Retell agent, this is not a degraded experience. This is a completely failed business function. A dental office with no one answering phones because the AI receptionist platform had an incident is a churn-level event for the agency.
Dashboard access fails independently. StatusGator tracks Retell's dashboard as a separate component with its own incident history. During the October 2025 AWS event, login and analytics were affected even when calling may have continued. This means agencies may not be able to see call logs, check campaign status, or even log in during an incident, making it harder to diagnose and respond to what is happening.
TTS dependency failures change the agent voice mid-campaign. Retell's built-in fallback system reroutes TTS and LLM functions to alternate providers when the primary is down, per their own documentation. This is a real engineering control. The side effect is that if an agency's agent is configured with a specific ElevenLabs or OpenAI voice and the fallback activates a different TTS provider, the agent's voice changes during a live call or between calls in the same campaign batch. For agencies that have trained clients to recognize their branded agent's voice, that inconsistency creates a support ticket.
What does a proper fallback architecture look like?
A real fallback architecture for a voice agency has three distinct layers. Most agencies have none of them. Some have one. The ones that survive incidents without losing clients have all three.
Layer 1: Own your phone numbers independently of the platform. This is the most important and most overlooked layer. If your client phone numbers live inside Retell as Retell-provisioned numbers, those numbers are not truly yours during an incident. Porting them out takes one to two weeks. If Retell is down, you cannot reroute them to a backup in real time.
The correct configuration is to provision all client numbers directly through a primary telephony carrier, either Twilio or Telnyx, and point them at Retell via SIP. Per Twilio's own failover documentation, you can configure one or more fallback SIP URIs at the Twilio layer so that if the primary SIP endpoint (your AI platform) fails to answer, calls automatically reroute to a secondary destination. That secondary destination can be a backup AI platform, a simple voicemail, or a forwarding number to a human. You set this once per number. It costs nothing extra. And it means that if Retell has an incident, your clients' phones do not go silent.
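One way to express this pattern is in TwiML rather than in trunk settings: dial the AI platform's SIP endpoint with a timeout, then let the `<Dial>` action callback decide what to do based on the `DialCallStatus` value Twilio sends. This is a sketch of that alternative, not the exact mechanism the Twilio failover documentation describes. The SIP URI, the fallback number, and the `/fallback` path are hypothetical, and the code only builds the XML; serving it from a webhook is left to your framework.

```python
import xml.etree.ElementTree as ET

# Hypothetical values for illustration only.
PRIMARY_SIP_URI = "sip:agent@ai-platform.example.com"
FALLBACK_NUMBER = "+15550100"


def primary_twiml(action_url: str, timeout_seconds: int = 10) -> str:
    """TwiML that dials the AI platform and reports the result to action_url."""
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial",
                         timeout=str(timeout_seconds), action=action_url)
    ET.SubElement(dial, "Sip").text = PRIMARY_SIP_URI
    return ET.tostring(response, encoding="unicode")


def fallback_twiml(dial_call_status: str) -> str:
    """TwiML for the action callback: forward only if the SIP leg did not complete."""
    response = ET.Element("Response")
    if dial_call_status == "completed":
        ET.SubElement(response, "Hangup")
    else:  # busy, no-answer, failed, or canceled
        dial = ET.SubElement(response, "Dial")
        ET.SubElement(dial, "Number").text = FALLBACK_NUMBER
    return ET.tostring(response, encoding="unicode")


print(primary_twiml("/fallback"))
print(fallback_twiml("no-answer"))
```

The short dial timeout matters as much as the fallback target: with a long timeout, callers sit in silence before the reroute happens.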
Layer 2: Multi-carrier redundancy at the SIP layer. Using two SIP trunk providers is a documented best practice for production AI calling deployments, per the 2026 SIP trunk buyer checklist. A Twilio-primary, Telnyx-secondary configuration gives you carrier-level redundancy that is independent of what happens at the AI platform layer. Modern SIP failover can detect a failed call leg and reroute in under two seconds, with health checks running every five seconds across the infrastructure. This is not exotic engineering. It is the same pattern that mid-market PBX deployments have used for years.
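The selection logic behind health-checked failover is small. The sketch below is a simplified model, not any carrier's implementation: routes are tried in priority order, and a route is skipped once it has failed a configurable number of consecutive health checks. The route names and the limit of two failures are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    name: str
    priority: int                  # lower value is tried first
    consecutive_failures: int = 0


class RouteSelector:
    """Chooses the highest-priority route whose health checks are passing."""

    def __init__(self, routes: list[Route], failure_limit: int = 2):
        self.routes = sorted(routes, key=lambda r: r.priority)
        self.failure_limit = failure_limit

    def report_health(self, name: str, healthy: bool) -> None:
        """Record one health-check result for the named route."""
        for route in self.routes:
            if route.name == name:
                route.consecutive_failures = (
                    0 if healthy else route.consecutive_failures + 1
                )

    def select(self) -> Optional[Route]:
        """Return the first healthy route in priority order, or None if all are down."""
        for route in self.routes:
            if route.consecutive_failures < self.failure_limit:
                return route
        return None


selector = RouteSelector([Route("telnyx-secondary", 2), Route("twilio-primary", 1)])
selector.report_health("twilio-primary", healthy=False)
selector.report_health("twilio-primary", healthy=False)
print(selector.select().name)
```

With checks every five seconds and a limit of two, this model marks a route down roughly ten seconds after it starts failing. Rerouting an individual failed call leg is a separate, faster mechanism that runs per call.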
Layer 3: Platform-level fallback for AI calls. At the AI platform layer, the fallback options are more limited if you are using Retell as an API-only service. You can configure a second AI platform (Bland, a hosted LLM endpoint, or a simpler IVR flow) as the SIP destination on your fallback leg. For agencies that run on a full-stack agency platform rather than the Retell API directly, this layer is handled inside the platform itself: redundant TTS/LLM routing, multi-region infrastructure, and a platform that is not itself dependent on a single cloud provider for its calling functionality.
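Redundant TTS/LLM routing usually reduces to an ordered provider chain. Here is a minimal sketch of that idea with stub providers; the provider functions are placeholders, not real SDK calls, and a production version would catch narrower errors and apply timeouts.

```python
from typing import Callable


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def synthesize_with_fallback(
    text: str,
    providers: list[tuple[str, Callable[[str], bytes]]],
) -> tuple[str, bytes]:
    """Try each TTS provider in order; return (provider_name, audio)."""
    errors = {}
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # a real system would catch narrower errors
            errors[name] = repr(exc)
    raise AllProvidersFailed(errors)


# Stub providers standing in for real TTS clients.
def primary_down(text: str) -> bytes:
    raise ConnectionError("primary TTS unavailable")


def backup_ok(text: str) -> bytes:
    return b"audio-bytes"


used, audio = synthesize_with_fallback(
    "Thanks for calling.", [("primary", primary_down), ("backup", backup_ok)]
)
print(used)
```

Returning the name of the provider that actually served the request lets you log every fallback activation, which is how you explain a changed agent voice to a client after the fact.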
How do you configure a practical fallback without rebuilding everything?
The minimum viable fallback configuration for an existing Retell-based agency stack takes roughly two to four hours to implement per client. Here is the sequence in order of implementation priority.
- Audit which phone numbers are Retell-provisioned vs. carrier-direct. Log into your Retell dashboard and pull the full phone number list. Any number showing as Retell-provisioned is exposed during a platform outage. Carrier-direct numbers (your own Twilio or Telnyx numbers pointed at Retell via SIP) are not. Flag all Retell-provisioned numbers for migration to your own carrier account. Prioritize your highest-volume inbound clients first. Stagger the ports over two to four weeks to avoid disruption.
- Set a fallback SIP URI on every Twilio or Telnyx number. For each number you own at the carrier layer, configure a secondary SIP destination or a forwarding number as a failover target. The simplest fallback is a mobile or landline number that goes to a voicemail with a branded message: "You have reached [Client Name]. We are experiencing a brief technical issue. Please leave a message and we will call you within [X minutes]." Clients hear a professional branded message. You hear the voicemail and know an incident is active. This is not ideal. It is a floor, not a ceiling. But it is infinitely better than a dead line.
- Configure offline mode or emergency forwarding for every agent. Many voice platforms, including Retell, allow you to set an offline or fallback routing rule per agent. If the agent fails to initiate a response within a configured timeout, the call is forwarded to a fallback destination. Use this in combination with your Twilio-layer fallback URI. The two controls operate independently and catch different failure modes: the Twilio fallback catches cases where Retell's SIP endpoint does not answer at all; the agent-level offline mode catches cases where Retell answers the SIP leg but the AI agent fails to respond.
- Set up a status monitor on Retell's status page. Subscribe to status.retellai.com for email or SMS alerts when Retell posts an incident. Set up a parallel alert on IsDown or StatusGator for Retell's end-to-end calling and web call components. These third-party monitors often detect degradation before the official status page is updated. Early detection gives you time to notify clients proactively before they notice and contact you.
- Create a client communication template for incidents. Write a two-paragraph email template you can send within five minutes of detecting an incident. Something like: "We are aware of a brief service disruption affecting voice calls. Our team is monitoring the situation and your calls are being forwarded to [fallback]. We will update you when full service is restored, estimated [X]." A client who receives this email within five minutes of an incident is not a churn risk. A client who notices on their own that calls went unanswered and gets no communication is.
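The communication template in the last step can be stored as code so nobody is composing prose mid-incident. A minimal sketch using the sample wording above; the function names and the five-minute target check are illustrative, and sending the email is left to whatever mail tool you already use.

```python
from datetime import datetime, timedelta
from string import Template

INCIDENT_EMAIL = Template(
    "We are aware of a brief service disruption affecting voice calls. "
    "Our team is monitoring the situation and your calls are being "
    "forwarded to $fallback.\n\n"
    "We will update you when full service is restored, estimated $eta."
)


def render_incident_email(fallback: str, eta: str) -> str:
    """Fill in the template; substitute() raises KeyError if a field is missing."""
    return INCIDENT_EMAIL.substitute(fallback=fallback, eta=eta)


def sent_within_target(detected_at: datetime, sent_at: datetime,
                       target: timedelta = timedelta(minutes=5)) -> bool:
    """True if the client email went out within the target after detection."""
    return sent_at - detected_at <= target


body = render_incident_email("our voicemail line", "3:00 pm")
print(body)
```

Using `substitute()` rather than `safe_substitute()` is deliberate: a missing field fails loudly instead of sending a client an email with a raw placeholder in it.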
How does the agency stack compare on reliability across platforms?
Reliability is not a single number. It is a combination of uptime history, dependency architecture, published SLAs, and what actually breaks when an incident hits. Here is how the major agency voice stacks compare on the dimensions that matter for incident response.
| Stack / Platform | Tracked incidents (12 mo) | TTS dependency | Number ownership | Built-in TTS/LLM fallback |
|---|---|---|---|---|
| Retell AI (API layer) | 31+ (StatusGator, 12 mo) | OpenAI/ElevenLabs dependency | Platform-provisioned by default | Yes (TTS/LLM only) |
| Vapi (API layer) | No public tracking data cited | Multi-provider, but upstream exposed | Platform-provisioned by default | Partial |
| Synthflow | Separate incident history | Own telephony stack (SBCs) | Platform-provisioned by default | Partial |
| Hermes (agency platform) | Pre-launch, incident history pending | Multi-provider routing | Carrier-direct recommended, supported | Yes, platform-managed |
The "number ownership" column is the most operationally significant for incident response. If a platform provisions your numbers, you cannot reroute them during an incident without the platform's cooperation. If your numbers live at the carrier layer, you control the routing at all times. That single architectural decision is the difference between a five-minute incident response and a two-hour client crisis.
What should agencies demand from any platform's SLA?
The question to ask any voice AI platform is not "what is your uptime?" It is "what is my agency's recoverable state if you go down for two hours at 2 pm on a Tuesday?" A vendor who answers the first question with a percentage and cannot answer the second has not built for agency operations.
Here are the five things that should be in writing, not in a marketing one-pager, before you build client revenue on any voice platform.
- Number portability on demand. Any platform that cannot release your phone numbers within 48 hours of a written request is not vendor-neutral infrastructure. It is lock-in with AI features on top. Get this in writing.
- Published incident response SLA. Not "we prioritize incidents." An actual time-to-acknowledgment commitment: for example, "we will post a status update within 15 minutes of a P1 incident affecting end-to-end calling." Without that commitment, you have no basis for setting client expectations during an incident.
- Explicit TTS and LLM dependency disclosure. Ask: which TTS and LLM providers does your infrastructure depend on, and what is the fallback if one of them goes down? Retell has been transparent that it uses OpenAI TTS as a primary provider. That is knowable information. Get the same disclosure from any platform you evaluate.
- Agent-level offline mode or fallback routing. The platform should support per-agent configuration of what happens when the AI fails to respond. Forwarding to voicemail, forwarding to a human number, playing a custom message, or retrying after a delay are all acceptable. No-answer is not.
- Historical incident data access. If a platform does not have a public status page with incident history going back at least 90 days, you cannot make an informed reliability decision. Status pages that show only "operational" without historical incident data are not transparency; they are the absence of it. Use StatusGator or IsDown to independently verify before committing.
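The five criteria above can be kept as a simple record per vendor so gaps are visible at a glance. This is a sketch of one possible structure; the field names are shorthand for the five items and are not drawn from any vendor's contract language.

```python
from dataclasses import dataclass, fields


@dataclass
class SlaAnswers:
    """True only when the vendor has confirmed the item in writing."""
    number_portability_48h: bool
    incident_ack_commitment: bool
    tts_llm_dependencies_disclosed: bool
    agent_level_fallback_routing: bool
    public_incident_history_90d: bool


def missing_commitments(answers: SlaAnswers) -> list[str]:
    """Names of the criteria the vendor has not committed to in writing."""
    return [f.name for f in fields(answers) if not getattr(answers, f.name)]


vendor = SlaAnswers(
    number_portability_48h=True,
    incident_ack_commitment=False,
    tts_llm_dependencies_disclosed=True,
    agent_level_fallback_routing=True,
    public_incident_history_90d=False,
)
print(missing_commitments(vendor))
```

An empty list is the bar for building client revenue on the platform; anything else is a gap to negotiate or to price into your risk model.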
"Organizations should negotiate SLAs early and ensure uptime, latency targets, and support response times are written down, not just promised. If applications depend on a single AI provider, they face one API failure from downtime." [Universal.cloud, AI Uptime SLA: The Forgotten Risk, 2026]
Is Retell the right tool for an agency, with or without fallback?
Retell is a strong API infrastructure layer. The per-minute rates are competitive, the agent builder is capable, and the conversation flow engine is one of the more sophisticated in the market. If you need raw API access to build a custom voice product, Retell is a reasonable choice.
The structural problem for agencies is that Retell is not designed for multi-client agency operations out of the box. There is no native white-label, no per-client workspace isolation, no campaign orchestration, no integrated billing, and no CRM. Agencies using Retell are adding GoHighLevel, Zapier, Stripe, and a custom dashboard on top to get to a client-deliverable product. That is the five-invoice problem: not just in cost, but in incident response. When Retell has an outage, you are managing an incident across Retell, GHL, Zapier, Twilio, and potentially ElevenLabs simultaneously. Each tool has a different status page and a different support queue.
A full-stack agency platform consolidates that. When there is an incident, you have one status page, one support contact, and one incident response protocol to manage. That consolidation is not just a convenience feature. In an incident at 2 pm on a Tuesday with five clients' phones going unanswered, it is the difference between a 15-minute resolution and a two-hour crisis.
If you are evaluating whether the multi-tool Retell stack is the right long-term infrastructure for your agency, the Vapi vs. Retell comparison for agencies covers that decision in detail, including the table-stakes features both platforms leave to you to build. And if you are already running the duct-tape stack and want to understand the actual margin impact of doing so, the 50-client margin math post puts a dollar figure on it.
Action steps: the fallback checklist
If you are running voice AI agents on Retell today, here is the minimum fallback configuration to put in place before the next incident. Estimated implementation time is two to four hours per client, not counting the lead time for number porting.
- Pull your Retell phone number list. Identify every Retell-provisioned number. Start porting the highest-volume inbound lines to your own Twilio or Telnyx account.
- For every number you now own at the carrier layer, configure a fallback SIP URI or forwarding number in your Twilio or Telnyx dashboard. Route to voicemail with a branded message as the minimum viable fallback.
- Enable offline mode or fallback routing on every Retell agent that handles inbound calls for a client. Set a call-answer timeout and a forwarding destination.
- Subscribe to status.retellai.com for email alerts. Add Retell's end-to-end calling component on IsDown for a second alert layer.
- Write and save a client-incident email template. Time from detection to client communication should be under five minutes.
- Review your platform choice against the five SLA criteria above. If you cannot get written answers on number portability timelines, incident response commitments, and TTS dependency disclosure, factor that uncertainty into your agency risk model.
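The first checklist item, the number audit, can be sketched as a short script once the number list is exported. The record fields, the `"platform"` and `"carrier"` labels, and the sample inventory are assumptions; map them to whatever your export actually contains.

```python
from dataclasses import dataclass


@dataclass
class PhoneNumber:
    number: str
    client: str
    provisioned_by: str          # "platform" or "carrier"; labels are assumptions
    monthly_inbound_calls: int


def porting_queue(numbers: list[PhoneNumber]) -> list[PhoneNumber]:
    """Platform-provisioned numbers, busiest inbound lines first."""
    exposed = [n for n in numbers if n.provisioned_by == "platform"]
    return sorted(exposed, key=lambda n: n.monthly_inbound_calls, reverse=True)


inventory = [
    PhoneNumber("+15550100", "Dental office", "platform", 1200),
    PhoneNumber("+15550101", "Law firm", "carrier", 900),
    PhoneNumber("+15550102", "HVAC company", "platform", 2100),
]
for entry in porting_queue(inventory):
    print(entry.number, entry.client, entry.monthly_inbound_calls)
```

Sorting by inbound volume puts the lines where an unanswered call hurts most at the front of the porting schedule.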
Frequently asked questions
How many Retell AI outages have there been in 2025-2026?
According to StatusGator, which has been monitoring Retell AI since May 2025, more than 31 outages have affected Retell AI users in the past 12 months. IsDown has tracked 50 incidents since January 2026 alone, in approximately five months of monitoring. The most recently acknowledged outage was April 13, 2026. A separate major incident in October 2025 caused login issues and analytics failures for nearly five hours due to an AWS dependency.
Does Retell AI have a 99.99% uptime guarantee?
Retell AI claims a 99.99% uptime guarantee in its marketing materials. Mathematically, 99.99% uptime allows for approximately 52 minutes of downtime per year. The 31+ incidents tracked by third-party monitoring services over 12 months suggest that the platform experiences more frequent degradation events than that figure implies. The discrepancy is largely because platform vendors measure uptime against full outages, while monitoring services flag any degradation including partial service disruptions, elevated error rates, and third-party dependency failures like TTS or LLM provider outages.
What types of incidents affect Retell AI most often?
Third-party monitoring data shows three recurring incident categories for Retell AI: TTS (text-to-speech) provider dependency failures, most notably the March 2025 OpenAI TTS outage; infrastructure-layer incidents tied to cloud providers, including the October 2025 AWS-related login and analytics disruption that lasted nearly five hours; and web call and dashboard degradation events. The TTS dependency category is significant because it means Retell's uptime is partially a function of OpenAI's uptime, which agencies cannot independently control.
What is the right fallback architecture for a voice AI agency?
A proper fallback architecture for a voice AI agency has three layers. First, at the telephony layer: own your numbers through a primary carrier (Twilio or Telnyx) and have a secondary SIP trunk with automatic failover configured. This protects you regardless of what happens at the AI platform layer. Second, at the platform layer: use a platform that has its own built-in TTS and LLM failover rather than being a single-provider dependent stack. Third, at the client layer: configure an offline mode or emergency forwarding rule for every client agent so that if the AI layer fails, calls route to a human fallback rather than going unanswered.
Should I stop using Retell AI because of the outage history?
Not necessarily. The 31-outage figure includes partial degradations; not every incident was a full platform failure. Retell is a capable API infrastructure layer with low published per-minute rates. The right question is whether your agency has a fallback plan in place for when it goes down, not whether it ever goes down. Every platform goes down. Retell's incidents are documented because the monitoring services are doing their job. The agencies that are most exposed are the ones that built their client delivery on a single platform with no telephony fallback and no human escalation path.
How quickly can voice AI failover happen if configured correctly?
Modern SIP failover configurations can detect a failed leg and reroute in under two seconds. Health checks at the infrastructure layer run every five seconds in well-configured deployments. For the AI model layer, transparent provider failover from one LLM or TTS provider to another can happen in the same call window if the platform has built-in fallback routing. The key variable is whether the fallback is pre-configured or whether it requires manual intervention. Manual failover for a multi-client agency during an incident is not a real fallback plan.
What is the difference between Retell AI and a full agency platform like Hermes?
Retell AI is API infrastructure: a powerful call engine and agent builder, but without native white-label, CRM, campaign orchestration, or per-client billing isolation. Agencies using Retell typically add GoHighLevel, Stripe, Zapier, and a custom dashboard on top. A full agency platform like Hermes bundles the call infrastructure with native white-label workspaces, CRM, campaigns, and billing in a single stack. The fallback architecture question matters for both, but an agency platform has more leverage to build infrastructure-level redundancy into the product rather than leaving it to each agency to engineer.
The bottom line
Thirty-one outages in 12 months, 50 incidents in five months: Retell is not uniquely unreliable. Every voice infrastructure platform has incidents. What varies is how prepared agencies are when they happen. Own your phone numbers at the carrier layer. Configure fallback routing before you need it. Subscribe to incident alerts so you hear about a problem before a client does. And evaluate whether a single-API stack or a full-stack agency platform better matches how you want to operate at client five, at client ten, and at client twenty.
By builders, for builders. The architecture decisions that protect your clients are not complicated. They are just the ones most agencies have not made yet.
next step
Build your agency on infrastructure that was designed for agencies
Hermes is the operating platform for AI voice agencies. One platform, your brand, your margins. From $149/month. First agent live in 72 hours.
Alfredo Romero is CEO of Hermes, the voice infrastructure platform for AI agencies.
written by
Alfredo Romero
CEO and Co-Founder, Hermes
Alfredo runs sales, operations, and strategy at Hermes. Before founding Hermes he ran agencies for nine years and spent the last three building the AI voice operations side. He writes the operator playbook from real builds, not theory.
