
Fault Tolerance & Redundancy

Everything fails. Design for it.

In distributed systems, failure is not exceptional — it’s the norm. Networks partition, servers crash, disks fail, and processes get killed. The question isn’t if things will fail, but when.


Understanding failure modes helps you design appropriate responses:

Failure Type | What Happens | How Common | Solution
Crash | Component stops completely | Very common | Redundancy, auto-restart
Omission | Messages lost or not delivered | Common | Retries, acknowledgments
Timing | Response slower than acceptable | Common | Timeouts, SLOs
Byzantine | Component behaves incorrectly | Rare | Consensus protocols

Redundancy means eliminating every single point of failure by having a backup for everything:

Level | Description | Example
Active-Passive | Backup sits idle until needed | Secondary database that takes over on primary failure
Active-Active | All copies handle traffic | Multiple servers behind a load balancer
N+1 | N nodes needed, run N+1 | Need 3 servers? Run 4
N+2 | Extra buffer for maintenance plus failure | Need 3 servers? Run 5
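At the code level, active-passive redundancy can be as simple as a caller that falls back to the standby replica when the primary fails. Here is a minimal sketch in Python, where primary and secondary are hypothetical callables standing in for the two replicas:

```python
def call_with_failover(primary, secondary, request):
    """Active-passive at the call site: use the primary, fall back to the standby.

    `primary` and `secondary` are placeholder callables that raise
    ConnectionError when their replica is unreachable.
    """
    try:
        return primary(request)
    except ConnectionError:
        # The primary is down; the passive copy takes over for this call.
        return secondary(request)
```

Real failover usually lives in a load balancer or database driver rather than application code, but the principle is the same: every request has somewhere else to go.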

When operations fail, retry — but do it smartly. Immediate retries can overwhelm a struggling service.

Strategy | Delay Pattern | Notes
No Retry | (none) | One failure = complete failure
Immediate Retry | 0s, 0s, 0s… | Hammers the failing service and makes things worse
Constant Delay | 1s, 1s, 1s… | May still overwhelm a recovering service
Exponential Backoff | 1s, 2s, 4s, 8s… | ✅ Gives the service time to recover
Exponential + Jitter | 1.2s, 2.7s, 4.1s… | ✅ Prevents the thundering herd

When a service goes down and comes back up, thousands of clients retry at the exact same moment — overwhelming the service again. Jitter adds randomness to retry times, spreading out the load.
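Here is a minimal retry helper combining exponential backoff with full jitter; `operation` is any callable that raises on a transient failure, and the default delay values are illustrative:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry `operation` with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # Exponential backoff: 1s, 2s, 4s, 8s... capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter: sleep a random amount up to the computed delay so
            # recovering clients don't all retry at the same instant.
            time.sleep(random.uniform(0, delay))
```

In practice you would retry only on errors known to be transient (timeouts, 503s), never on permanent failures such as validation errors.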


The bulkhead pattern is inspired by the watertight compartments that keep a single leak from sinking an entire ship: isolate resources so that one failing component can't consume them all.


The Problem: You have a shared connection pool of 100 connections. Your payment service gets slow (a third-party issue). All 100 connections get stuck waiting for payment responses. Now your entire application is frozen — including checkout, browsing, and search.

The Solution: Separate pools per dependency:

  • Payment: 20 connections max
  • Inventory: 30 connections max
  • Search: 50 connections max

If payment slows down, only 20 connections are affected. Everything else keeps working.
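One way to sketch a bulkhead is a semaphore per dependency, so a slow dependency can exhaust only its own slots. The class and limits below are illustrative, not a specific library API:

```python
import threading

class Bulkhead:
    """Caps concurrent calls to a single dependency with a semaphore."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: if this dependency's pool is exhausted,
        # reject immediately instead of tying up yet another caller.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()

# One isolated pool per dependency, mirroring the limits above.
payment_bulkhead = Bulkhead("payment", 20)
inventory_bulkhead = Bulkhead("inventory", 30)
search_bulkhead = Bulkhead("search", 50)
```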


Never wait indefinitely. Every external call should have a timeout. Hanging requests consume resources and cause cascading failures.

Operation Type | Recommended Timeout | Why
In-memory cache | 10-50 ms | Should be near-instant
Database query | 1-5 s | Anything longer usually means a bad query or a lock
Internal service | 1-3 s | Should be fast
External API | 5-30 s | Less control, may be slow
File upload | 30-120 s | Depends on file size
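A minimal way to enforce a budget on any blocking call is to run it on a worker pool and bound the wait. In this sketch, `fetch_exchange_rates` in the usage comment is a hypothetical client function:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# A shared pool bounds how many in-flight calls we can be waiting on at once.
_executor = ThreadPoolExecutor(max_workers=10)

def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Wait at most `timeout_seconds` for `func(*args, **kwargs)` to return."""
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        future.cancel()  # Best effort: only cancels if the call hasn't started yet.
        raise

# Example: give an external API a 10-second budget, per the table above.
# rates = call_with_timeout(fetch_exchange_rates, 10)
```

Many clients (HTTP libraries, database drivers) accept a timeout directly, which is preferable to wrapping; the point is that no call ever gets an unlimited wait.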

When you know something won’t work, fail immediately instead of wasting time and resources.


Before doing expensive work, check the following (see the sketch after the list):

  1. Input Validation — Reject bad requests immediately
  2. Capacity Check — Are we at limit? Reject with 503
  3. Dependency Health — Is the service we need alive?
  4. Feature Flags — Is this feature enabled?
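Here is a sketch of those four checks as guard clauses at the top of a handler; every argument is a hypothetical stand-in:

```python
def place_order(order, inventory_service, feature_flags, current_load, max_load):
    """Fail fast: run cheap guard clauses before any expensive work."""
    # 1. Input validation: reject malformed requests immediately.
    if not order.get("items"):
        raise ValueError("order has no items")

    # 2. Capacity check: shed load instead of queueing past our limit.
    if current_load >= max_load:
        raise RuntimeError("at capacity, respond with 503")

    # 3. Dependency health: don't start work we already know cannot finish.
    if not inventory_service.is_healthy():
        raise RuntimeError("inventory service unavailable")

    # 4. Feature flag: is this code path even enabled?
    if not feature_flags.get("checkout_enabled", False):
        raise RuntimeError("checkout is disabled")

    # Only now do the expensive part (reserve stock, charge payment, ...).
    return inventory_service.reserve(order)
```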

Pattern | What It Does | When to Use
Redundancy | Multiple copies of components | Always, for critical paths
Retry + Backoff | Retry failed operations with increasing delays | Transient failures
Bulkhead | Isolate resources per component | Multi-dependency systems
Timeout | Set a maximum wait time for operations | Every external call
Fail Fast | Reject early when we know it will fail | Expensive operations

Fault Tolerance Concept | LLD Implementation
Redundancy | Multiple service instances, object pools
Retry Logic | Strategy pattern for retry policies, decorator pattern
Bulkhead | Thread pools with limits, semaphores, connection pools
Timeout | Every external call wrapped with a timeout config
Fail Fast | Guard clauses, precondition checks at method start
Circuit Breaker | State machine pattern (covered in the Resiliency section)


We’ve covered how to handle failures. Now let’s learn how to monitor system health proactively:

Next up: Health Checks & Heartbeats — Learn to detect failures before they impact users.