Health Checks & Heartbeats

Know you're sick before your users tell you

In distributed systems, things fail silently. A service might be running but:

  • Database connection is broken
  • Memory is exhausted
  • Thread pool is full
  • External API is unreachable

Health checks detect these issues before users do.

Diagram

Liveness, readiness, and startup checks serve different purposes and should be implemented separately:

Diagram
| Check | Question | Failure Response | Frequency |
|---|---|---|---|
| Liveness | “Is it alive?” | Restart container | Every 10-30s |
| Readiness | “Can it serve traffic?” | Remove from load balancer | Every 5-10s |
| Startup | “Is it initialized?” | Wait for completion | During startup |
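
The separation between the three checks can be sketched as a small state holder. This is a minimal illustration, assuming a hypothetical `ProbeState` class and assuming each check reports an HTTP-style code (200 or 503); none of these names come from a specific framework or orchestrator.

```python
import threading


class ProbeState:
    """Tracks the flags behind three independent probes (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._started = False  # set once initialization has finished
        self._ready = False    # toggled as dependencies become usable or not

    def mark_started(self) -> None:
        with self._lock:
            self._started = True

    def set_ready(self, ready: bool) -> None:
        with self._lock:
            self._ready = ready

    def liveness(self) -> int:
        # Shallow check: if this code runs at all, the process is alive.
        return 200

    def startup(self) -> int:
        with self._lock:
            return 200 if self._started else 503

    def readiness(self) -> int:
        # Ready only after startup completed and dependencies are usable.
        with self._lock:
            return 200 if (self._started and self._ready) else 503
```

Keeping liveness free of dependency checks means a broken database takes the instance out of rotation (readiness fails) without triggering a restart that would not fix the problem.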

Diagram
| Check Type | What to Test | Example |
|---|---|---|
| Shallow (Liveness) | Process alive, not deadlocked | Return 200 immediately |
| Database | Can connect and query | SELECT 1 completes < 100ms |
| Cache | Can read and write | Write key, read back |
| Queue | Can connect to message broker | Check connection status |
| External Service | Dependency is reachable | Hit their /health endpoint |
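
As one possible shape for the database row, the sketch below runs `SELECT 1` and measures latency. It uses an in-process `sqlite3` connection purely as a stand-in for a real database; the function name `check_database`, the `max_latency_ms` parameter, and the choice to report a slow query as "degraded" rather than "unhealthy" are assumptions made for illustration.

```python
import sqlite3
import time


def check_database(conn: sqlite3.Connection, max_latency_ms: float = 100.0) -> dict:
    """Run a trivial query and report status plus measured latency."""
    start = time.perf_counter()
    try:
        conn.execute("SELECT 1").fetchone()
    except sqlite3.Error as exc:
        # Connection closed or query failed: the dependency is unusable.
        return {"status": "unhealthy", "message": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    # Assumption: a slow but successful query counts as degraded.
    status = "healthy" if latency_ms < max_latency_ms else "degraded"
    return {"status": status, "latency_ms": round(latency_ms, 2)}
```

Calling `check_database(sqlite3.connect(":memory:"))` returns a dictionary with a `status` and a `latency_ms` field that can be placed directly under a component name in a deep health response.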

A well-designed health check system has clear separation:

Diagram

A good deep health check response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "1.2.3",
  "components": {
    "database": {
      "status": "healthy",
      "latency_ms": 5.2
    },
    "cache": {
      "status": "healthy",
      "latency_ms": 1.1
    },
    "payment-service": {
      "status": "degraded",
      "latency_ms": 450,
      "message": "High latency detected"
    }
  }
}

| Status | Meaning | HTTP Code |
|---|---|---|
| healthy | All systems go | 200 |
| degraded | Working but impaired | 200 (with warning) |
| unhealthy | Cannot serve traffic | 503 |
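
A response in this shape could be assembled roughly as follows. This is a sketch under stated assumptions: the function name `build_health_response` is hypothetical, and the overall status is taken to be the worst component status. That is only one possible policy; a service may instead ignore non-critical dependencies when computing its top-level status.

```python
import json
from datetime import datetime, timezone

# Higher number means worse; the overall status is the worst component status.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}
# Status-to-HTTP mapping from the table: degraded still answers 200.
HTTP_CODE = {"healthy": 200, "degraded": 200, "unhealthy": 503}


def build_health_response(components: dict, version: str) -> tuple:
    """Return (http_code, json_body) for a deep health check."""
    overall = max(
        (component["status"] for component in components.values()),
        key=SEVERITY.__getitem__,
        default="healthy",  # assumption: no components means nothing is wrong
    )
    body = {
        "status": overall,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "version": version,
        "components": components,
    }
    return HTTP_CODE[overall], json.dumps(body, indent=2)
```

Returning the HTTP code alongside the body lets load balancers act on the code alone, while operators and dashboards read the per-component detail.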

For distributed systems with multiple nodes, components need to know if their peers are alive. Heartbeats are periodic signals saying “I’m still here.”

Diagram

When tracking heartbeats, nodes go through states:

Diagram
| Parameter | Typical Value | Purpose |
|---|---|---|
| Interval | 5 seconds | How often to send |
| Suspect Timeout | 10 seconds | When to start worrying |
| Dead Timeout | 15-30 seconds | When to declare dead |
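
The timeouts above can drive a simple state classification on the receiving side. The sketch below is illustrative only: the class name `HeartbeatMonitor`, the state labels "alive", "suspect", and "dead", and the extra "unknown" state for peers that have never reported are assumptions, and the clock is injectable so the logic can be tested without waiting.

```python
import time


class HeartbeatMonitor:
    """Classifies peers from the time elapsed since their last heartbeat."""

    def __init__(self, suspect_after: float = 10.0, dead_after: float = 30.0,
                 clock=time.monotonic) -> None:
        self.suspect_after = suspect_after
        self.dead_after = dead_after
        self._clock = clock      # injectable so tests can control time
        self._last_seen = {}     # node id -> time of last heartbeat

    def record_heartbeat(self, node_id: str) -> None:
        self._last_seen[node_id] = self._clock()

    def state(self, node_id: str) -> str:
        last = self._last_seen.get(node_id)
        if last is None:
            return "unknown"     # assumption: never heard from this peer
        silence = self._clock() - last
        if silence >= self.dead_after:
            return "dead"
        if silence >= self.suspect_after:
            return "suspect"
        return "alive"
```

With a 5 second interval, a 10 second suspect timeout tolerates one missed heartbeat before a peer is flagged, and a fresh heartbeat returns a suspect or dead peer to "alive".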

Diagram
| Probe | Recommended Timeout | Why |
|---|---|---|
| Liveness | 1-5 seconds | Should be instant |
| Readiness | 5-10 seconds | Dependencies may be slow |
| Startup | 30-300 seconds | Initial load can take time |

| Scenario | HTTP Code | Meaning |
|---|---|---|
| Everything healthy | 200 | Keep serving traffic |
| Degraded but working | 200 | Serve but alert operators |
| Cannot serve requests | 503 | Remove from load balancer |
| Check timed out | 503 | Treat as unhealthy |
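
The "check timed out" row can be enforced by running each check under a deadline. The following is a rough sketch using a thread pool from the standard library; `run_with_timeout` and `slow_check` are hypothetical names, and the pool size of 4 is an arbitrary choice for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool so a hung check does not block the request thread.
_executor = ThreadPoolExecutor(max_workers=4)


def run_with_timeout(check, timeout_s: float) -> int:
    """Return 200 if check() returns a truthy value in time, else 503."""
    future = _executor.submit(check)
    try:
        ok = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The check keeps running in its worker thread; only the wait stops.
        return 503
    except Exception:
        return 503  # the check itself raised
    return 200 if ok else 503


def slow_check() -> bool:
    """Example check that takes longer than a short probe timeout."""
    time.sleep(0.3)
    return True
```

For example, `run_with_timeout(slow_check, 0.05)` reports 503 because the deadline passes before the check finishes. A timed-out check still occupies a worker until it returns, so checks should also set timeouts on their own network calls.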

| Health Check Concept | LLD Implementation |
|---|---|
| Health Checker Interface | Strategy pattern: different checks for different components |
| Aggregated Health | Composite pattern: combining multiple checkers |
| State Changes | Observer pattern: notify when health changes |
| Heartbeat Thread | Daemon thread pattern: background periodic task |
| Component Checks | Dependency injection: pass dependencies for testing |
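
Several of these mappings can be combined in a few lines. The sketch below is one possible arrangement, not a canonical design: `HealthCheck`, `CallableCheck`, and `CompositeHealthCheck` are hypothetical names, checks are simplified to return a boolean, and observers are plain callables that receive the new state.

```python
from typing import Callable, List, Optional, Protocol


class HealthCheck(Protocol):
    """Strategy interface: each component supplies its own check."""
    name: str

    def is_healthy(self) -> bool: ...


class CallableCheck:
    """Dependency injection: the probe is passed in, so tests can use fakes."""

    def __init__(self, name: str, probe: Callable[[], bool]) -> None:
        self.name = name
        self._probe = probe

    def is_healthy(self) -> bool:
        try:
            return bool(self._probe())
        except Exception:
            return False  # a raising probe counts as unhealthy


class CompositeHealthCheck:
    """Composite: healthy only if every child is; observers hear about changes."""

    def __init__(self, name: str, children: List[HealthCheck]) -> None:
        self.name = name
        self._children = children
        self._observers: List[Callable[[bool], None]] = []
        self._last: Optional[bool] = None  # None until the first evaluation

    def subscribe(self, observer: Callable[[bool], None]) -> None:
        self._observers.append(observer)

    def is_healthy(self) -> bool:
        healthy = all(child.is_healthy() for child in self._children)
        if healthy != self._last:
            # The first evaluation also counts as a change (from unknown).
            self._last = healthy
            for notify in self._observers:
                notify(healthy)
        return healthy
```

Because the composite satisfies the same interface as a leaf check, composites can be nested, for example one per subsystem under a single service-level check.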


You’ve completed the Reliability & Availability section! You now understand:

  • ✅ Availability patterns and SLAs
  • ✅ Replication strategies
  • ✅ Fault tolerance techniques
  • ✅ Health monitoring

Continue exploring: Check out the next section on Consistency & Distributed Transactions to learn how data stays consistent across distributed systems.