
Availability Patterns

Designing systems that never sleep

Availability measures how often your system is up and working. It’s the percentage of time users can successfully use your service.


The industry measures availability in “nines” — each additional nine dramatically reduces allowed downtime:

| Availability | Downtime/Year | Downtime/Month | Downtime/Week | Use Case |
| --- | --- | --- | --- | --- |
| 99% | 3.65 days | 7.2 hours | 1.68 hours | Development/Test |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min | Standard apps |
| 99.95% | 4.38 hours | 21.9 min | 5 min | E-commerce |
| 99.99% | 52.6 min | 4.38 min | 1 min | Financial services |
| 99.999% | 5.26 min | 26.3 sec | 6 sec | Life-critical systems |
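
These figures follow directly from the arithmetic: allowed downtime = (1 − availability) × length of the period. A minimal sketch that reproduces the table's values (the class name and the 30.44-day average month are illustrative assumptions):

```java
// Illustrative sketch: derive the downtime budget from an availability target.
// Assumes a 365-day year and a 30.44-day average month; table values are rounded.
public class DowntimeBudget {
    public static void main(String[] args) {
        double[] targets = {99.0, 99.9, 99.95, 99.99, 99.999};
        for (double availability : targets) {
            double downFraction = 1.0 - availability / 100.0;
            double minutesPerYear = downFraction * 365 * 24 * 60;
            double minutesPerMonth = downFraction * 30.44 * 24 * 60;
            System.out.printf("%.3f%% -> %.2f min/year, %.2f min/month%n",
                    availability, minutesPerYear, minutesPerMonth);
        }
    }
}
```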

SLI, SLO, and SLA: The Availability Triangle


Understanding these three terms is crucial for any engineer:

| Term | Definition | Owner | Consequence of Miss |
| --- | --- | --- | --- |
| SLI | The metric you measure | Engineering | Investigation triggered |
| SLO | Internal target (stricter than the SLA) | Engineering + Product | Team prioritizes fixes |
| SLA | External promise to customers | Business | Financial penalties, lost trust |

Common SLI categories and what they measure:

| Category | SLI | What It Measures |
| --- | --- | --- |
| Availability | Success rate | % of requests that succeed |
| Latency | P50, P95, P99 response time | How fast responses are |
| Throughput | Requests per second | System capacity |
| Error Rate | 5xx errors / total requests | Failure frequency |
| Saturation | CPU, memory, queue depth | How “full” the system is |
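
To make the availability SLI concrete, here is a minimal in-process sketch of the success-rate measurement; the class and method names are assumptions for illustration, not any particular metrics library's API.

```java
import java.util.concurrent.atomic.LongAdder;

// Minimal availability SLI: successful requests / total requests.
public class AvailabilitySli {
    private final LongAdder total = new LongAdder();
    private final LongAdder failures = new LongAdder();

    // Call once per request, e.g. from a request filter or interceptor.
    public void record(boolean success) {
        total.increment();
        if (!success) {
            failures.increment();
        }
    }

    // Availability as a fraction in [0, 1]; reports 1.0 before any traffic.
    public double availability() {
        long requests = total.sum();
        return requests == 0 ? 1.0 : 1.0 - (double) failures.sum() / requests;
    }
}
```

An SLO check then compares this value against the internal target (for example, at least 0.999 over the evaluation window), while the SLA is the looser promise published to customers.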

Pattern 1: Redundancy (Eliminate Single Points of Failure)


A Single Point of Failure (SPOF) is any component whose failure brings down the entire system. High-availability (HA) systems eliminate SPOFs through redundancy at every layer.

| Level | Description | Example |
| --- | --- | --- |
| Active-Passive | Backup sits idle until needed | Secondary DB that takes over on primary failure |
| Active-Active | All copies handle traffic | Multiple servers behind a load balancer |
| N+1 | Run one extra node for safety | Need 3 servers? Run 4 |
| N+2 | Extra buffer for maintenance + failure | Need 3 servers? Run 5 |
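
As a client-side illustration of the active-active level, the sketch below spreads calls across every replica in round-robin order (the `ReplicaPool` name and shape are assumptions, not from this article):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Active-active redundancy from the caller's point of view:
// every replica serves traffic, selected round-robin.
public class ReplicaPool<R> {
    private final List<R> replicas;
    private final AtomicInteger next = new AtomicInteger();

    public ReplicaPool(List<R> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("at least one replica required");
        }
        this.replicas = List.copyOf(replicas);
    }

    // floorMod keeps the index valid even after the counter overflows.
    public R pick() {
        return replicas.get(Math.floorMod(next.getAndIncrement(), replicas.size()));
    }
}
```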

Pattern 2: Failover

When the primary fails, traffic automatically switches to the backup.


Key Failover Metrics:

| Metric | Description | Typical Target |
| --- | --- | --- |
| Detection Time | How quickly we notice the failure | < 10 seconds |
| Failover Time | How long to switch to the backup | < 30 seconds |
| Recovery Time | Total time until service is restored | < 1 minute |
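
In code, failover usually boils down to "try the primary, detect the failure, route to the backup". A simplified active-passive sketch under that assumption (real clients add timeouts, retries, and periodic health probes to drive the detection and recovery times above):

```java
import java.util.function.Supplier;

// Simplified active-passive failover: the backup only receives traffic
// after a failure of the primary has been detected.
public class FailoverClient<T> {
    private final Supplier<T> primary;
    private final Supplier<T> backup;
    private volatile boolean primaryHealthy = true;

    public FailoverClient(Supplier<T> primary, Supplier<T> backup) {
        this.primary = primary;
        this.backup = backup;
    }

    public T call() {
        if (primaryHealthy) {
            try {
                return primary.get();      // normal path
            } catch (RuntimeException e) {
                primaryHealthy = false;    // detection: mark the primary as down
            }
        }
        return backup.get();               // failover path
    }

    // A health checker calls this once the primary is confirmed healthy again.
    public void primaryRecovered() {
        primaryHealthy = true;
    }
}
```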

Pattern 3: Graceful Degradation

When parts of your system fail, continue serving users with reduced functionality rather than failing completely.


The Principle: Identify which features are core vs nice-to-have, and ensure core features work even when nice-to-haves fail.

E-commerce example:

| Feature | Category | On Failure |
| --- | --- | --- |
| Product info | Core | Must work — show an error page if down |
| Recommendations | Nice-to-have | Hide the section, show it empty |
| Reviews | Nice-to-have | Hide the section, show cached reviews |
| Real-time inventory | Nice-to-have | Show cached “In Stock” status |
| Checkout | Core | Must work — queue orders if payment is down |
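
A facade with fallbacks is one way to express this split in class design. The sketch below is a hypothetical product-page facade where the nice-to-have recommendations call degrades to an empty list instead of propagating the failure:

```java
import java.util.List;

// Graceful degradation via a facade: core calls propagate errors,
// nice-to-have calls fall back to a harmless default.
public class ProductPageFacade {

    // Placeholder for the real recommendations client.
    public interface RecommendationService {
        List<String> fetch(String productId);
    }

    private final RecommendationService recommendations;

    public ProductPageFacade(RecommendationService recommendations) {
        this.recommendations = recommendations;
    }

    // Nice-to-have: swallow the failure and return an empty list so the
    // page still renders, just without the recommendations section.
    public List<String> recommendationsFor(String productId) {
        try {
            return recommendations.fetch(productId);
        } catch (RuntimeException e) {
            return List.of();
        }
    }
}
```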

Here is how these availability concepts map to your class design:

| Availability Concept | LLD Implementation |
| --- | --- |
| Redundancy | Multiple service client instances, failover logic |
| Failover | Strategy pattern for switching between providers |
| Graceful Degradation | Facade pattern with fallback methods, Optional returns |
| Health Checks | Implementing health check interfaces on classes |
| SLI Tracking | Decorator pattern for measuring method performance |
| Timeouts | Configurable timeouts in service clients |
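
For example, SLI tracking with the Decorator pattern can be as small as wrapping a call, timing it, and recording whether it succeeded. A sketch (the `Supplier`-shaped interface and the names are assumptions; a real system would publish to a metrics library rather than print):

```java
import java.util.function.Supplier;

// Decorator for SLI tracking: measures latency and success of the wrapped call.
public class SliTrackingDecorator<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final String operationName;

    public SliTrackingDecorator(Supplier<T> delegate, String operationName) {
        this.delegate = delegate;
        this.operationName = operationName;
    }

    @Override
    public T get() {
        long start = System.nanoTime();
        boolean success = false;
        try {
            T result = delegate.get();   // delegate to the real service
            success = true;
            return result;
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%s success=%b latency=%dms%n",
                    operationName, success, elapsedMs);
        }
    }
}
```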


Now that you understand availability concepts, let’s dive into how systems stay consistent across replicas:

Next up: Replication Strategies — Learn how data is replicated for both availability and performance.