
Fault Tolerance & Redundancy

Everything fails. Design for it.

In distributed systems, failure is not exceptional — it’s the norm. Networks partition, servers crash, disks fail, and processes get killed. The question isn’t if things will fail, but when.


Understanding failure modes helps you design appropriate responses:

Failure Type | What Happens | How Common | Solution
Crash | Component stops completely | Very common | Redundancy, auto-restart
Omission | Messages lost or not delivered | Common | Retries, acknowledgments
Timing | Response slower than acceptable | Common | Timeouts, SLOs
Byzantine | Component behaves incorrectly | Rare | Consensus protocols

Redundancy means eliminating every single point of failure by having a backup for everything:

Level | Description | Example
Active-Passive | Backup sits idle until needed | Secondary database that takes over on primary failure
Active-Active | All copies handle traffic | Multiple servers behind a load balancer
N+1 | N nodes needed, run N+1 | Need 3 servers? Run 4
N+2 | Extra buffer for maintenance plus failure | Need 3 servers? Run 5
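At the code level, active-passive redundancy can be as simple as a caller that falls back to the standby replica when the primary fails. Here is a minimal sketch in Python, where primary and secondary are hypothetical callables standing in for the two replicas:

```python
def call_with_failover(primary, secondary, request):
    """Active-passive at the call site: use the primary, fall back to the standby.

    `primary` and `secondary` are placeholder callables that raise
    ConnectionError when their replica is unreachable.
    """
    try:
        return primary(request)
    except ConnectionError:
        # The primary is down; the passive copy takes over for this call.
        return secondary(request)
```

Real failover usually lives in a load balancer or database driver rather than application code, but the principle is the same: every request has somewhere else to go.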

When operations fail, retry — but do it smartly. Immediate retries can overwhelm a struggling service.

Strategy | Delay Pattern | Notes
No Retry | (none) | One failure = complete failure
Immediate Retry | 0s, 0s, 0s… | Hammers the failing service and makes things worse
Constant Delay | 1s, 1s, 1s… | May still overwhelm a recovering service
Exponential Backoff | 1s, 2s, 4s, 8s… | ✅ Gives the service time to recover
Exponential + Jitter | 1.2s, 2.7s, 4.1s… | ✅ Prevents the thundering herd

When a service goes down and comes back up, thousands of clients retry at the exact same moment — overwhelming the service again. Jitter adds randomness to retry times, spreading out the load.
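Here is a minimal retry helper combining exponential backoff with full jitter; `operation` is any callable that raises on a transient failure, and the default delay values are illustrative:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry `operation` with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # Exponential backoff: 1s, 2s, 4s, 8s... capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter: sleep a random amount up to the computed delay so
            # recovering clients don't all retry at the same instant.
            time.sleep(random.uniform(0, delay))
```

In practice you would retry only on errors known to be transient (timeouts, 503s), never on permanent failures such as validation errors.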


The bulkhead pattern is inspired by the watertight compartments that keep a single leak from sinking an entire ship: isolate resources so that one failing component can't consume them all.


The Problem: You have a shared connection pool of 100 connections. Your payment service gets slow (a third-party issue). All 100 connections get stuck waiting for payment responses. Now your entire application is frozen — including checkout, browsing, and search.

The Solution: Separate pools per dependency:

  • Payment: 20 connections max
  • Inventory: 30 connections max
  • Search: 50 connections max

If payment slows down, only 20 connections are affected. Everything else keeps working.
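One way to sketch a bulkhead is a semaphore per dependency, so a slow dependency can exhaust only its own slots. The class and limits below are illustrative, not a specific library API:

```python
import threading

class Bulkhead:
    """Caps concurrent calls to a single dependency with a semaphore."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: if this dependency's pool is exhausted,
        # reject immediately instead of tying up yet another caller.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()

# One isolated pool per dependency, mirroring the limits above.
payment_bulkhead = Bulkhead("payment", 20)
inventory_bulkhead = Bulkhead("inventory", 30)
search_bulkhead = Bulkhead("search", 50)
```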


Never wait indefinitely. Every external call should have a timeout. Hanging requests consume resources and cause cascading failures.

Operation Type | Recommended Timeout | Why
In-memory cache | 10-50 ms | Should be near-instant
Database query | 1-5 s | Anything longer usually means a bad query or a lock
Internal service | 1-3 s | Should be fast
External API | 5-30 s | Less control, may be slow
File upload | 30-120 s | Depends on file size
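A minimal way to enforce a budget on any blocking call is to run it on a worker pool and bound the wait. In this sketch, `fetch_exchange_rates` in the usage comment is a hypothetical client function:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# A shared pool bounds how many in-flight calls we can be waiting on at once.
_executor = ThreadPoolExecutor(max_workers=10)

def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Wait at most `timeout_seconds` for `func(*args, **kwargs)` to return."""
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        future.cancel()  # Best effort: only cancels if the call hasn't started yet.
        raise

# Example: give an external API a 10-second budget, per the table above.
# rates = call_with_timeout(fetch_exchange_rates, 10)
```

Many clients (HTTP libraries, database drivers) accept a timeout directly, which is preferable to wrapping; the point is that no call ever gets an unlimited wait.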

When you know something won’t work, fail immediately instead of wasting time and resources.


Before doing expensive work, check the following (see the sketch after the list):

  1. Input Validation — Reject bad requests immediately
  2. Capacity Check — Are we at limit? Reject with 503
  3. Dependency Health — Is the service we need alive?
  4. Feature Flags — Is this feature enabled?
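Here is a sketch of those four checks as guard clauses at the top of a handler; every argument is a hypothetical stand-in:

```python
def place_order(order, inventory_service, feature_flags, current_load, max_load):
    """Fail fast: run cheap guard clauses before any expensive work."""
    # 1. Input validation: reject malformed requests immediately.
    if not order.get("items"):
        raise ValueError("order has no items")

    # 2. Capacity check: shed load instead of queueing past our limit.
    if current_load >= max_load:
        raise RuntimeError("at capacity, respond with 503")

    # 3. Dependency health: don't start work we already know cannot finish.
    if not inventory_service.is_healthy():
        raise RuntimeError("inventory service unavailable")

    # 4. Feature flag: is this code path even enabled?
    if not feature_flags.get("checkout_enabled", False):
        raise RuntimeError("checkout is disabled")

    # Only now do the expensive part (reserve stock, charge payment, ...).
    return inventory_service.reserve(order)
```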

Pattern | What It Does | When to Use
Redundancy | Multiple copies of components | Always, for critical paths
Retry + Backoff | Retry failed operations with increasing delays | Transient failures
Bulkhead | Isolate resources per component | Multi-dependency systems
Timeout | Set a maximum wait time for operations | Every external call
Fail Fast | Reject early when we know it will fail | Expensive operations

Fault Tolerance Concept | LLD Implementation
Redundancy | Multiple service instances, object pools
Retry Logic | Strategy pattern for retry policies, decorator pattern
Bulkhead | Thread pools with limits, semaphores, connection pools
Timeout | Every external call wrapped with a timeout config
Fail Fast | Guard clauses, precondition checks at method start
Circuit Breaker | State machine pattern (covered in the Resiliency section)


We’ve covered how to handle failures. Now let’s learn how to monitor system health proactively:

Next up: Health Checks & Heartbeats — Learn to detect failures before they impact users.