There's a particular flavour of anxiety that comes from knowing your system just sent (or maybe didn't send) a critical message to 200,000 people. Did the appointment reminders go out? Did the password reset emails land? Is the SMS provider actually delivering, or is it cheerfully accepting requests and dropping them into a void?
If you've built communication infrastructure at any meaningful scale, you've felt this. And if you haven't felt it yet, well, give it time.
This article is about what we've learned processing millions of messages: the architectural decisions, the operational patterns, and the hard-won lessons that don't show up in vendor documentation.
What "reliable" actually means (it's not just uptime)
When most teams talk about reliability, they reach for uptime percentages. "We're at 99.9% availability." Great. That still means nearly nine hours of downtime per year (0.1% of 8,760 hours ≈ 8.8 hours), and if those nine hours happen to coincide with your busiest sending window, you've got a real problem.
But uptime is only one dimension of reliability. For communication systems specifically, reliability is a composite of several properties:
- Delivery assurance: did the message actually reach the recipient, not just leave your system?
- Ordering guarantees: if you send "Your order is confirmed" followed by "Your order has shipped," they'd better arrive in that sequence.
- Exactly-once semantics: sending a message zero times is bad. Sending it five times might be worse.
- Timeliness: a two-factor auth code that arrives 45 minutes late is functionally identical to one that never arrives.
- Observability: you need to know, in near-real-time, when any of the above properties are violated.
Twilio's 2023 State of Customer Engagement Report found that the vast majority of businesses considered messaging reliability a top-three priority, yet fewer than a third had monitoring in place to detect delivery failures within an hour of occurrence.[1] That gap between intention and implementation is where most of the interesting engineering lives.
Sent ≠ Delivered ≠ Read
Here's a truth that takes some teams embarrassingly long to internalise: your system reporting "message sent" means almost nothing. All it confirms is that you successfully handed a payload to a downstream provider. Congratulations, you've completed step one of a multi-step process and declared victory.
The journey of a message looks something like this:
1. Accepted: your system generated the message and placed it in an outbound queue.
2. Dispatched: the message was sent to the channel provider (Twilio, SendGrid, an RCS gateway, etc.).
3. Delivered: the provider confirmed the message reached the recipient's device or inbox.
4. Read/Engaged: the recipient actually opened or interacted with the message.
Most systems track step 2 and call it a day. The better ones track step 3. Very few do step 4 well, partly because it's technically harder and partly because not all channels support read receipts (looking at you, SMS).
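To make those distinctions concrete, here's a minimal sketch of how the lifecycle might be modelled in code. The state names mirror the steps above; everything else (the transition table, the helper) is illustrative rather than any particular system's API.

```python
from enum import Enum

class MessageStatus(Enum):
    """Lifecycle states for an outbound message, mirroring the steps above."""
    ACCEPTED = "accepted"      # 1. queued in our own system
    DISPATCHED = "dispatched"  # 2. handed to the channel provider
    DELIVERED = "delivered"    # 3. provider confirmed receipt
    READ = "read"              # 4. recipient engaged (channel permitting)

# Only forward transitions are legal; anything else is a bug or an
# out-of-order webhook that needs reconciling, not silently applying.
VALID_TRANSITIONS = {
    MessageStatus.ACCEPTED: {MessageStatus.DISPATCHED},
    MessageStatus.DISPATCHED: {MessageStatus.DELIVERED},
    MessageStatus.DELIVERED: {MessageStatus.READ},
    MessageStatus.READ: set(),
}

def advance(current: MessageStatus, new: MessageStatus) -> MessageStatus:
    if new not in VALID_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```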
Industry delivery analytics data suggests that roughly 15-20% of SMS messages "accepted" by carrier networks never actually reach the end device. The reasons range from carrier filtering to handset incompatibilities to the recipient simply being in a tunnel. If you're only tracking "sent," you're blind to a significant failure rate.
The practical implication: you need delivery receipt webhooks, and you need to reconcile them against your send records. Ideally in near-real-time. At minimum, in a nightly batch process that flags anomalies.
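As a rough sketch of what that reconciliation could look like, assuming a generic webhook payload (no specific provider's format), with in-memory stand-ins for the send-record store and the alerting hook:

```python
from datetime import datetime, timedelta, timezone

# Stand-ins: in production, send_records is a database table keyed by
# message ID, and alert() is your paging/alerting hook.
send_records: dict[str, dict] = {}

def alert(msg: str) -> None:
    print("ALERT:", msg)

def handle_delivery_receipt(webhook: dict) -> None:
    """Reconcile a provider delivery-receipt webhook against send records."""
    record = send_records.get(webhook["message_id"])
    if record is None:
        # A receipt for a message we have no record of is itself an anomaly.
        alert(f"unknown message_id in receipt: {webhook['message_id']}")
        return
    record["status"] = webhook["status"]      # e.g. "delivered", "failed"
    record["delivered_at"] = webhook["timestamp"]

def nightly_reconciliation(max_age_hours: int = 24) -> list[dict]:
    """Flag messages dispatched long ago with no receipt: the blind spot."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    return [
        r for r in send_records.values()
        if r["status"] == "dispatched" and r["dispatched_at"] < cutoff
    ]
```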
Idempotency, or: why your customer got 47 shipping notifications
Picture this: a customer receives their first "Your package is arriving today!" notification. Delightful. Then they get a second one. Okay, maybe a glitch. By the seventh, they're mildly annoyed. By the forty-seventh, they're writing a tweet about your company that's about to go mildly viral for all the wrong reasons.
This isn't a hypothetical. Duplicate message sends are one of the most common failure modes in communication systems, and they're almost always caused by the same thing: a retry mechanism that doesn't respect idempotency.
Here's the pattern that causes it. Your service sends a message to the provider. The provider processes it and sends back an acknowledgement. The network blips and the acknowledgement never arrives. So your service retries. The provider, having no concept of "I already handled this," processes it again. Repeat until someone notices or a rate limit kicks in, whichever comes first.
The fix is straightforward (in theory)
Every outbound message gets a unique idempotency key, typically a UUID generated at creation time, before the message enters the queue. This key travels with the message through every retry attempt. The downstream provider (if it supports idempotency keys, and most good ones do) will deduplicate on its end. And your own system should maintain a short-lived deduplication cache to catch retries before they even leave your infrastructure.
In practice, the tricky part is choosing the right scope for your idempotency key. Too narrow (per-attempt) and you get no deduplication. Too broad (per-recipient-per-day) and you accidentally suppress legitimate messages. The sweet spot is usually per-logical-event: one appointment reminder, one delivery notification, one password reset, regardless of how many times your retry logic fires.
A useful rule of thumb: if a customer receiving the message twice would cause confusion, frustration, or (worst case) a duplicate financial transaction, you need idempotency at that boundary. No exceptions.
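Here's a minimal sketch of per-logical-event idempotency. One way to get a stable key across retries is a name-based UUID derived from the event rather than a random one per attempt; the provider_send client below is hypothetical.

```python
import uuid

# Short-lived dedup cache. In production this would be Redis or similar,
# with a TTL matching the message type's useful lifetime.
_dedup_cache: set[str] = set()

def provider_send(payload: dict, idempotency_key: str) -> None:
    ...  # hypothetical provider client; pass the key through so the
         # provider can deduplicate on its end as well

def idempotency_key(event_type: str, event_id: str) -> str:
    """Deterministic key scoped per logical event, not per attempt.

    A name-based UUID means every retry of the same event (appointment
    reminder #1234, say) yields the same key, so dedup works end to end.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{event_type}:{event_id}"))

def send_message(event_type: str, event_id: str, payload: dict) -> None:
    key = idempotency_key(event_type, event_id)
    if key in _dedup_cache:
        return  # already attempted; let the in-flight attempt resolve
    _dedup_cache.add(key)
    provider_send(payload, idempotency_key=key)
```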
Circuit breakers: for when your provider goes dark at 2am on a Friday
It's always a Friday. It's always 2am. And it's always the night before a major send campaign.
Your SMS provider's API starts returning 500 errors. Your system, being diligent, retries. And retries. And retries. Within minutes, you've exhausted your connection pool, backed up your message queue, and the cascading failure is now affecting your email sends too, even though the email provider is perfectly healthy.
This is the problem the circuit breaker pattern solves, and it's borrowed directly from electrical engineering. When a downstream dependency starts failing, the circuit breaker "trips": it stops sending requests to the failing service, returns a fast failure to the caller, and periodically tests whether the service has recovered.
Michael Nygard's Release It! describes the pattern in detail,[2] and it remains one of the most important resilience patterns in distributed systems. For communication infrastructure specifically, there are a few nuances worth noting:
- Per-provider circuit breakers. If your SMS provider goes down, your email and RCS circuits should remain closed (healthy). This sounds obvious but requires intentional separation in your HTTP client configuration.
- Graceful degradation, not silent failure. When a circuit opens, messages should be redirected to a fallback channel or held in a retry queue, not dropped. The customer doesn't care about your provider's outage; they care about getting their message.
- Half-open state with real traffic. When testing if a provider has recovered, use actual messages from the queue (low-priority ones, ideally) rather than synthetic health checks. Providers can pass health checks while still failing on real traffic. We've seen it happen more than once.
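Here's a minimal sketch of the pattern itself, with one breaker instance per provider as described above. The threshold, timeout, and simple failure counter are illustrative; production-grade breakers (as described in Release It!) typically track failures over a rolling window.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds before a half-open probe
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let one real request probe recovery.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip (or re-trip) the circuit
            raise
        else:
            self.failures = 0
            self.opened_at = None  # probe succeeded; close the circuit
            return result

# One breaker per provider, so an SMS outage can't trip the email path.
breakers = {
    "sms": CircuitBreaker(),
    "email": CircuitBreaker(),
    "rcs": CircuitBreaker(),
}
```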
Observability: the log that nobody reads
If a message fails in a forest of microservices and nobody sees the log line, did it really fail? Philosophically, maybe not. Practically, your customer is still waiting for their appointment reminder, so yes. Yes it did.
Observability in communication systems isn't just about logging. It's about building the right abstractions so that failures surface themselves before customers report them. Google's Site Reliability Engineering handbook (O'Reilly, 2016)[3] calls this the difference between "detecting" and "debugging." You want your monitoring to tell you something is wrong before you need to figure out what is wrong.
The metrics that actually matter
- Delivery rate by channel, tracked as a time series. A sudden drop from 97% to 82% on SMS is an immediate red flag, even if the absolute volume looks fine.
- End-to-end latency (p50, p95, p99). The median doesn't tell you much; it's the tail latencies that reveal queue backlogs and provider slowdowns.
- Queue depth over time. A monotonically increasing queue depth means you're consuming slower than you're producing. This is fine during a burst; it's a crisis if it persists.
- Dead letter queue volume. Messages that exhaust all retries and land here are your system's SOS signal. If this number is anything other than tiny, investigate immediately.
- Error rate by error type. A spike in "invalid phone number" errors suggests a data quality issue upstream. A spike in "provider timeout" errors suggests an infrastructure issue downstream. The response is completely different, so don't aggregate them.
We've found that a simple dashboard with these five metrics, combined with alerts on rate-of-change (not just thresholds), catches about 90% of production issues before they become customer-visible. The remaining 10% are the weird ones, the kind that make for good post-mortem stories but terrible on-call shifts.
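As an illustration of what a rate-of-change alert might look like, here's a toy delivery-rate monitor. In practice you'd express this in your monitoring system's query language rather than application code, and the window and threshold here are made-up numbers:

```python
from collections import deque

class DeliveryRateMonitor:
    """Alert on a sharp drop versus the recent average, not just a floor."""

    def __init__(self, window: int = 12, max_drop: float = 0.10):
        self.samples: deque[float] = deque(maxlen=window)  # e.g. one per 5 min
        self.max_drop = max_drop  # alert on a 10-point drop vs window average

    def record(self, delivered: int, dispatched: int) -> bool:
        """Record one sampling interval; return True if we should alert."""
        rate = delivered / dispatched if dispatched else 1.0
        alerting = bool(self.samples) and (
            sum(self.samples) / len(self.samples) - rate > self.max_drop
        )
        self.samples.append(rate)
        return alerting

monitor = DeliveryRateMonitor()
assert monitor.record(970, 1000) is False  # 97% establishes the baseline
assert monitor.record(820, 1000) is True   # 82%: a 15-point drop, alert
```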
Queue design: the unglamorous backbone of everything
Nobody writes blog posts about their message queue. It's not exciting. It doesn't demo well. You can't put "implemented a robust dead letter queue strategy" on a landing page and expect anyone to care.
But get it wrong, and nothing else matters.
Retry strategies: exponential backoff with jitter
If your retry strategy is "wait 5 seconds and try again," you're going to have a bad time. When a provider recovers from an outage, the last thing it needs is every one of your failed messages hitting it simultaneously. This is the thundering herd problem, and it's solved by exponential backoff with jitter.
AWS's 2015 Architecture Blog post, "Exponential Backoff and Jitter,"[4] remains the definitive reference on this topic. The short version: each retry waits exponentially longer (1s, 2s, 4s, 8s...), with a random jitter component that spreads retries across time. "Full jitter" (where the delay is random between 0 and the exponential cap) tends to produce the best results for high-concurrency systems.
But there's a subtlety specific to messaging: not all messages should be retried with the same aggression. A two-factor auth code needs fast retries (it's useless after a few minutes). A marketing newsletter can wait hours. Your retry policy should be configurable per message priority, not a one-size-fits-all constant.
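A sketch of full jitter with per-priority retry policies might look like this; the priority tiers and numbers are illustrative, not recommendations:

```python
import random

# Per-priority policy: (base delay seconds, max delay seconds, max attempts).
# An auth code retries fast and gives up quickly; a newsletter can wait.
RETRY_POLICY = {
    "auth_code": (0.5, 8.0, 4),
    "transactional": (1.0, 60.0, 8),
    "marketing": (5.0, 3600.0, 10),
}

def backoff_delay(priority: str, attempt: int) -> float:
    """Full jitter: uniform between 0 and the exponential cap (per [4])."""
    base, cap, _ = RETRY_POLICY[priority]
    exponential = min(cap, base * (2 ** attempt))
    return random.uniform(0, exponential)

# Attempts 0, 1, 2, 3 draw delays from [0, 0.5], [0, 1], [0, 2], [0, 4].
print([round(backoff_delay("auth_code", n), 2) for n in range(4)])
```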
Dead letter queues: your safety net's safety net
After a message has exhausted its retry budget, it needs to go somewhere. That somewhere is a dead letter queue (DLQ). Messages in the DLQ are not abandoned; they're parked. They represent known failures that need human or automated investigation.
The key design decisions for DLQs:
- Preserve the full context. The DLQ message should include the original payload, all retry attempts with timestamps, the final error, and enough metadata to replay the message if needed.
- Set retention policies. DLQ messages shouldn't live forever. A password reset from last month is irrelevant. Set a TTL that matches the message type's useful lifetime.
- Build a replay mechanism. When the root cause is fixed, you need a way to reprocess DLQ messages selectively, not all-or-nothing. "Replay all DLQ messages from the last 2 hours tagged as SMS provider timeout" is the level of granularity you want.
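Putting those decisions together, a DLQ entry and a selective replay filter might look something like this sketch (the field names and tags are illustrative):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DeadLetter:
    """A parked message with full context, per the design decisions above."""
    message_id: str
    channel: str                          # "sms", "email", ...
    payload: dict                         # original message, for replay
    attempts: list[tuple[datetime, str]]  # (timestamp, error) per retry
    final_error: str
    expires_at: datetime                  # TTL matched to the message type
    tags: set[str] = field(default_factory=set)

def replay(dlq: list[DeadLetter], since: datetime, tag: str) -> list[DeadLetter]:
    """Selective replay: e.g. everything from the last 2 hours tagged
    'sms-provider-timeout', rather than the whole queue at once."""
    return [
        m for m in dlq
        if tag in m.tags and m.attempts and m.attempts[-1][0] >= since
    ]
```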
When things go wrong: real-world war stories
In October 2021, Facebook (now Meta) experienced a six-hour global outage caused by a BGP configuration change.[5] WhatsApp, used by over 2 billion people for personal and business messaging, went completely dark. Businesses relying solely on WhatsApp for customer communication (and there were a surprising number, especially in South America and Southeast Asia) had no fallback. Six hours of silence.
In 2023, a major SMS aggregator experienced a partial outage that was harder to detect: messages were being accepted but silently dropped for certain carrier routes. Their API returned 200 OK for every request. Their status page showed all green. The only signal was a gradual decline in delivery receipt callbacks, subtle enough that it took some customers over four hours to notice.
These incidents reinforce the same lesson: single points of failure in communication infrastructure are unacceptable. Not because failures are rare, but because when they happen, your customers' customers are the ones affected.
Putting it together: a reliability checklist
If you're building or evaluating communication infrastructure, here's the short version of everything above, distilled into questions worth asking:
- Do you track delivery status, or just send status?
- Are your message sends idempotent? What happens if a retry fires?
- Do you have circuit breakers per provider, with fallback routing?
- Can you detect a 10% drop in delivery rate within 15 minutes?
- Are your retry strategies appropriate for each message priority?
- Do you have a dead letter queue with replay capability?
- If your primary SMS provider went down right now, what would happen?
- When was the last time you tested your fallback paths?
If you answered "I'm not sure" to more than two of these, you have some engineering work ahead of you. The good news is that none of this is theoretically hard. It's operationally hard. It requires discipline, good defaults, and a culture that treats message delivery as a first-class concern rather than a fire-and-forget API call.
The best communication infrastructure is invisible. Your customers never think about it because it just works. Getting to that point takes a surprising amount of deliberate engineering, the kind that doesn't make for flashy demos but makes for systems you can sleep through the night trusting.
We built Conductor with these principles at its core, not because they're novel, but because we kept seeing systems that ignored them. If you're wrestling with any of these challenges, we'd genuinely love to compare notes.
Sources
- Twilio, "State of Customer Engagement Report" (2023). https://www.twilio.com/en-us/state-of-customer-engagement
- Michael T. Nygard, Release It! Design and Deploy Production-Ready Software, 2nd Edition (Pragmatic Bookshelf, 2018). https://pragprog.com/titles/mnee2/release-it-second-edition/
- Betsy Beyer et al., Site Reliability Engineering: How Google Runs Production Systems (O'Reilly, 2016). https://sre.google/sre-book/table-of-contents/
- Marc Brooker, "Exponential Backoff and Jitter," AWS Architecture Blog (2015). https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- Santosh Janardhan, "More Details About the October 4 Outage," Meta Engineering Blog (2021). https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/