LowLevelDesign Mastery

Understanding Bottlenecks

Find the weakest link before it breaks

A bottleneck is the component that limits your system’s overall performance. No matter how fast other parts are, the system can only go as fast as its slowest component.


CPU bottlenecks. Symptoms: High CPU usage, slow computations

cpu_bottleneck.py
# ❌ CPU-bound operation blocking the event loop (exponential-time recursion)
def calculate_fibonacci(n: int) -> int:
    if n <= 1:
        return n
    return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)


# ✅ Solution: Use an efficient algorithm (memoization shown here) or offload to a worker
from functools import lru_cache

@lru_cache(maxsize=1000)
def calculate_fibonacci_cached(n: int) -> int:
    if n <= 1:
        return n
    return calculate_fibonacci_cached(n-1) + calculate_fibonacci_cached(n-2)
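The cached version still recurses, so a large first call can exceed Python's default recursion limit. As a minimal sketch of the "efficient algorithm" option, an iterative variant runs in linear time with constant extra space. The name `calculate_fibonacci_iterative` is an assumption for illustration, not part of the original listing.

```python
def calculate_fibonacci_iterative(n: int) -> int:
    """Return the n-th Fibonacci number (F(0) = 0, F(1) = 1) without recursion."""
    if n < 0:
        raise ValueError("n must be non-negative")
    previous, current = 0, 1
    for _ in range(n):
        # Slide the window forward: (F(k), F(k+1)) -> (F(k+1), F(k+2))
        previous, current = current, previous + current
    return previous


print(calculate_fibonacci_iterative(10))  # 55
```

Unlike the cached version, this one keeps no state between calls, which may or may not suit a workload that asks for the same values repeatedly.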

Memory bottlenecks. Symptoms: High memory usage, OOM errors, GC pauses

memory_bottleneck.py
# ❌ Loading everything into memory
# (process() is a placeholder for your per-line logic)
def process_large_file(filename: str) -> list:
    with open(filename) as f:
        data = f.readlines()  # Loads entire file into memory!
        return [process(line) for line in data]


# ✅ Solution: Stream processing
def process_large_file_streaming(filename: str):
    with open(filename) as f:
        for line in f:  # Reads one line at a time
            yield process(line)
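Streaming only helps if the caller also consumes results lazily; wrapping the generator in `list()` would bring the whole file back into memory. The sketch below consumes a generator with a running aggregate. Here `process` is a hypothetical stand-in for the per-line logic, and `io.StringIO` stands in for an open file so the example runs without touching disk.

```python
import io
from typing import Iterable, Iterator


def process(line: str) -> int:
    """Hypothetical stand-in for process(): length of the stripped line."""
    return len(line.strip())


def process_lines_streaming(lines: Iterable[str]) -> Iterator[int]:
    """Yield one processed result at a time instead of building a list."""
    for line in lines:
        yield process(line)


# io.StringIO stands in for an open file; both yield lines lazily when iterated.
fake_file = io.StringIO("alpha\nbeta\ngamma\n")

# sum() pulls results one by one, so no list of results is ever built.
total_length = sum(process_lines_streaming(fake_file))
print(total_length)  # 14
```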

Database bottlenecks. Symptoms: Slow queries, connection pool exhaustion, high DB CPU

database_bottleneck.py
# ❌ N+1 query problem
# (db is a placeholder for your database client)
def get_orders_with_items(user_id: str) -> list:
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        # This runs a query for EACH order!
        order.items = db.query("SELECT * FROM items WHERE order_id = ?", order.id)
    return orders


# ✅ Solution: Use JOIN or batch query
# Note: the JOIN returns one flat row per (order, item) pair
def get_orders_with_items_optimized(user_id: str) -> list:
    return db.query("""
        SELECT o.*, i.*
        FROM orders o
        LEFT JOIN items i ON o.id = i.order_id
        WHERE o.user_id = ?
    """, user_id)

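A JOIN returns one flat row per order and item pair, so the caller usually has to regroup rows into orders. The following is a sketch of that regrouping using the standard-library `sqlite3` module with an in-memory database. The table layout, column names, and sample rows are assumptions for illustration only.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 'u1'), (2, 'u1'), (3, 'u2');
    INSERT INTO items VALUES (10, 1, 'pen'), (11, 1, 'ink'), (12, 3, 'cup');
""")


def get_orders_with_items(user_id: str) -> dict[int, list[str]]:
    """Fetch a user's orders and their item names with a single query."""
    rows = conn.execute(
        """
        SELECT o.id, i.name
        FROM orders o
        LEFT JOIN items i ON o.id = i.order_id
        WHERE o.user_id = ?
        ORDER BY o.id, i.id
        """,
        (user_id,),
    ).fetchall()

    items_by_order: dict[int, list[str]] = defaultdict(list)
    for order_id, item_name in rows:
        order_items = items_by_order[order_id]  # creates the entry even with no items
        if item_name is not None:  # LEFT JOIN yields NULL for orders without items
            order_items.append(item_name)
    return dict(items_by_order)


print(get_orders_with_items("u1"))  # {1: ['pen', 'ink'], 2: []}
```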
Network bottlenecks. Symptoms: High network latency, waiting on external services
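When a request waits on several external services, two common mitigations are calling them concurrently and bounding each wait with a timeout. The sketch below simulates this with `asyncio`. The service names and delays are hypothetical, and `asyncio.sleep` stands in for real network calls.

```python
import asyncio


async def call_service(name: str, delay_seconds: float) -> str:
    """Hypothetical external call; asyncio.sleep simulates network latency."""
    await asyncio.sleep(delay_seconds)
    return f"{name}: ok"


async def call_with_timeout(name: str, delay_seconds: float, timeout_seconds: float) -> str:
    """Bound the wait so one slow dependency cannot stall the whole request."""
    try:
        return await asyncio.wait_for(call_service(name, delay_seconds), timeout_seconds)
    except asyncio.TimeoutError:
        return f"{name}: timed out"


async def fetch_all() -> list[str]:
    # The three calls run concurrently, so the total wait is roughly that of
    # the slowest call (capped by the timeout), not the sum of all three.
    return await asyncio.gather(
        call_with_timeout("pricing", 0.01, timeout_seconds=0.2),
        call_with_timeout("inventory", 0.02, timeout_seconds=0.2),
        call_with_timeout("recommendations", 2.0, timeout_seconds=0.2),
    )


results = asyncio.run(fetch_all())
print(results)  # ['pricing: ok', 'inventory: ok', 'recommendations: timed out']
```

Returning a fallback string on timeout is one design choice among several; a real service might instead serve cached data or omit the slow section of the page.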

| Resource | Tool | Warning Signs |
|----------|------|---------------|
| CPU | top, htop, metrics | >80% sustained |
| Memory | free, vmstat | >90%, frequent GC |
| Disk | iostat, iotop | High wait times |
| Network | netstat, ss | Packet loss, high latency |
profiling.py
import cProfile
import functools
import pstats

def profile_function(func):
    """Decorator to profile a function"""
    @functools.wraps(func)  # Preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        profiler.enable()
        try:
            result = func(*args, **kwargs)
        finally:
            profiler.disable()  # Stop profiling even if func raises
        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(10)  # Top 10 entries by cumulative time
        return result
    return wrapper

@profile_function
def my_slow_function():
    # Your code here
    pass
| Bottleneck Type | Solutions |
|-----------------|-----------|
| CPU | Optimize algorithms, caching, horizontal scaling |
| Memory | Streaming, pagination, efficient data structures |
| Database | Indexing, query optimization, caching, read replicas |
| Network | Caching, compression, connection pooling |
| External APIs | Caching, async calls, circuit breakers, timeouts |
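Of the solutions listed for external APIs, the circuit breaker is probably the least self-explanatory: after repeated failures it stops calling the dependency for a cool-down period and fails fast instead. The following is a minimal single-threaded sketch, not a production implementation. The class name, thresholds, and injectable `clock` parameter are assumptions for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open state).
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # a success closes the circuit fully
        return result


def flaky() -> str:
    """Hypothetical dependency that always fails."""
    raise ConnectionError("upstream unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=60.0)
for attempt in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError as exc:
        print(f"attempt {attempt}: rejected ({exc})")
    except ConnectionError as exc:
        print(f"attempt {attempt}: failed ({exc})")
```

The first two attempts reach the failing dependency; the third is rejected immediately without calling it, which is what protects both the caller's latency and the struggling service.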

Advanced: Latency vs Throughput Bottlenecks


Understanding the difference is crucial for senior engineers:

| Type | Symptom | Diagnosis | Solution |
|------|---------|-----------|----------|
| Latency | High response times | Profile shows slow operations | Optimize the slow code |
| Throughput | Requests queue up | Resources saturated | Add capacity or optimize resource usage |
| Both | Slow AND queueing | Everything is red | Triage: fix biggest impact first |
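Latency and throughput are linked. By Little's Law, the average number of requests in flight equals throughput multiplied by latency, so a fixed pool of workers that each handle one request at a time caps throughput at workers divided by latency. The sketch below works through that arithmetic; the worker counts and latencies are illustrative assumptions.

```python
def max_throughput_rps(workers: int, latency_seconds: float) -> float:
    """Upper bound on requests/second when each worker handles one request at a time."""
    if workers <= 0 or latency_seconds <= 0:
        raise ValueError("workers and latency_seconds must be positive")
    return workers / latency_seconds


# 8 workers at 200 ms per request can sustain at most 40 requests/second.
print(max_throughput_rps(8, 0.2))   # 40.0
# Halving latency doubles the ceiling without adding capacity.
print(max_throughput_rps(8, 0.1))   # 80.0
# Doubling workers at the original latency has the same effect.
print(max_throughput_rps(16, 0.2))  # 80.0
```

This is why the table offers two routes for throughput problems: adding capacity raises the worker count, while optimizing resource usage lowers the time each request holds a worker.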

Deep Dive: Production Bottleneck Investigation


Here’s how senior engineers approach bottleneck investigation in production:

Before optimizing, you need to know what “normal” looks like. Track these key metrics:

| Category | Metrics to Track |
|----------|------------------|
| Request Latency | P50, P95, P99 response times |
| Database | Query times by operation, connection pool usage |
| External APIs | Call durations by service, error rates |
| Resources | CPU, memory, disk I/O, network |
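Percentiles matter because a single slow request barely moves the median yet dominates the tail. The sketch below computes P50, P95, and P99 with the nearest-rank method, which is one of several percentile definitions; monitoring tools may interpolate differently. The sample latencies are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 950]
print(percentile(latencies_ms, 50))  # 14
print(percentile(latencies_ms, 95))  # 950
print(percentile(latencies_ms, 99))  # 950
```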

Tools:

  • APM (Application Performance Monitoring): Datadog, New Relic, Dynatrace
  • Metrics: Prometheus + Grafana
  • Distributed Tracing: Jaeger, Zipkin, AWS X-Ray

The hot path is the code that runs most frequently or consumes the most resources.
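One rough way to locate the hot path is to rank operations by total time consumed, that is, call frequency multiplied by mean latency, rather than by per-call latency alone. The sketch below does this for a few endpoints. The endpoint names and numbers are hypothetical, and real systems would pull them from an APM or metrics tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointStats:
    name: str
    calls_per_minute: int
    mean_latency_ms: float

    @property
    def total_ms_per_minute(self) -> float:
        """Total time spent in this endpoint per minute of traffic."""
        return self.calls_per_minute * self.mean_latency_ms


# Hypothetical numbers for illustration only.
endpoints = [
    EndpointStats("GET /report", calls_per_minute=2, mean_latency_ms=5000),
    EndpointStats("POST /checkout", calls_per_minute=50, mean_latency_ms=800),
    EndpointStats("GET /products", calls_per_minute=6000, mean_latency_ms=20),
]

hot_path_first = sorted(endpoints, key=lambda e: e.total_ms_per_minute, reverse=True)
for endpoint in hot_path_first:
    print(f"{endpoint.name}: {endpoint.total_ms_per_minute:.0f} ms/min")
```

In this made-up data the fastest endpoint per call, `GET /products`, ranks first because of its volume, while the slowest per call, `GET /report`, ranks last.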


Real-World Case Study: E-Commerce Checkout Bottleneck


Situation: Checkout page takes 8 seconds to load during sales events.

Step 1: Add timing instrumentation to each component
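Timing instrumentation of this kind can be sketched with a small context manager that records elapsed time per component. The component names and the `time.sleep` calls standing in for real work are assumptions; the case study does not show the actual instrumentation used.

```python
import time
from contextlib import contextmanager
from typing import Iterator

timings_ms: dict[str, float] = {}


@contextmanager
def timed(component: str) -> Iterator[None]:
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings_ms[component] = timings_ms.get(component, 0.0) + elapsed_ms


def load_checkout_page() -> None:
    # time.sleep stands in for real work; component names are hypothetical.
    with timed("cart"):
        time.sleep(0.01)
    with timed("inventory"):
        time.sleep(0.05)
    with timed("pricing"):
        time.sleep(0.01)


load_checkout_page()
slowest = max(timings_ms, key=timings_ms.get)
for component, elapsed in sorted(timings_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{component}: {elapsed:.1f} ms")
print(f"slowest component: {slowest}")
```

In production the timings would typically be emitted to a metrics or tracing system rather than kept in a module-level dictionary.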


Step 2: Diagnose the root cause

The inventory service was making one database query per item:

  • Cart with 100 items = 100 database queries
  • Each query ~60ms = 6 seconds total

Step 3: Fix with batch query

-- BEFORE: N+1 queries (100 queries for 100 items)
SELECT * FROM inventory WHERE sku = 'SKU001';
SELECT * FROM inventory WHERE sku = 'SKU002';
-- ... 98 more queries
-- AFTER: Single batch query
SELECT * FROM inventory WHERE sku IN ('SKU001', 'SKU002', ...);
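| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| P50 Latency | 6.2s | 0.4s | 93% reduction |
| P99 Latency | 12s | 0.8s | 93% reduction |
| DB Queries | 103 | 5 | 95% reduction |
| Conversion Rate | 2.1% | 3.8% | 81% increase |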
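From application code, the batch query in Step 3 is usually issued with bound parameters rather than by pasting SKUs into the SQL string. The sketch below shows one way to do that with the standard-library `sqlite3` module. The `inventory` columns and sample rows are assumptions, and the case study does not say which database or driver was actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("SKU001", 4), ("SKU002", 0), ("SKU003", 12)],
)


def get_stock(skus: list[str]) -> dict[str, int]:
    """Look up stock for all SKUs with one parameterized IN query."""
    if not skus:
        return {}  # an empty IN () list is rejected by many databases, so skip the query
    # One "?" per SKU; only placeholders are interpolated, never the values themselves.
    placeholders = ", ".join("?" for _ in skus)
    rows = conn.execute(
        f"SELECT sku, quantity FROM inventory WHERE sku IN ({placeholders}) ORDER BY sku",
        skus,
    ).fetchall()
    return dict(rows)


print(get_stock(["SKU001", "SKU002"]))  # {'SKU001': 4, 'SKU002': 0}
```

Databases limit how many parameters one statement may bind, so very large carts may need the SKU list split into chunks.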


You’ve completed the Foundations section! You now understand:

  • Why system design matters for LLD
  • Scalability fundamentals
  • Latency and throughput metrics
  • How to find and fix bottlenecks

Continue your journey: Explore other HLD Concepts sections to deepen your understanding of distributed systems.