Understanding Bottlenecks
What is a Bottleneck?
A bottleneck is the component that limits your system’s overall performance. No matter how fast the other parts are, the system can only go as fast as its slowest component.
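To make this concrete, here is a tiny back-of-the-envelope sketch (the stages and numbers are illustrative): whatever the fastest stages can handle, the pipeline is capped by the slowest one.

```python
# Hypothetical pipeline stages and their maximum throughput in requests/second.
stage_capacity = {
    "load_balancer": 50_000,
    "app_server": 8_000,
    "database": 2_000,    # slowest stage
    "cache": 100_000,
}

# End-to-end throughput is capped by the slowest stage -- the bottleneck.
bottleneck = min(stage_capacity, key=stage_capacity.get)
print(f"Bottleneck: {bottleneck}, system limited to ~{stage_capacity[bottleneck]} req/s")
```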
Types of Bottlenecks
1. CPU Bottleneck
Symptoms: High CPU usage, slow computations
```python
# ❌ CPU-bound operation blocking the event loop
def calculate_fibonacci(n: int) -> int:
    if n <= 1:
        return n
    return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)
```
```python
# ✅ Solution: Use efficient algorithm or offload to worker
from functools import lru_cache

@lru_cache(maxsize=1000)
def calculate_fibonacci_cached(n: int) -> int:
    if n <= 1:
        return n
    return calculate_fibonacci_cached(n-1) + calculate_fibonacci_cached(n-2)
```

```java
// ❌ CPU-bound operation
public long calculateFibonacci(int n) {
    if (n <= 1) return n;
    return calculateFibonacci(n-1) + calculateFibonacci(n-2);
}
```
```java
// ✅ Solution: Use efficient algorithm with memoization
// (note: computeIfAbsent must not recurse into the same map, so check the
// cache explicitly before recursing)
private final Map<Integer, Long> cache = new ConcurrentHashMap<>();

public long calculateFibonacciCached(int n) {
    if (n <= 1) return n;
    Long cached = cache.get(n);
    if (cached != null) return cached;
    long result = calculateFibonacciCached(n - 1) + calculateFibonacciCached(n - 2);
    cache.put(n, result);
    return result;
}
```

2. Memory Bottleneck
Symptoms: High memory usage, OOM errors, GC pauses
```python
# ❌ Loading everything into memory
def process_large_file(filename: str) -> list:
    with open(filename) as f:
        data = f.readlines()  # Loads entire file into memory!
    return [process(line) for line in data]
```
```python
# ✅ Solution: Stream processing
def process_large_file_streaming(filename: str):
    with open(filename) as f:
        for line in f:  # Reads one line at a time
            yield process(line)
```

```java
// ❌ Loading everything into memory
public List<String> processLargeFile(String filename) throws IOException {
    List<String> lines = Files.readAllLines(Path.of(filename));  // Loads all!
    return lines.stream().map(this::process).collect(Collectors.toList());
}
```

```java
// ✅ Solution: Stream processing
public Stream<String> processLargeFileStreaming(String filename) throws IOException {
    return Files.lines(Path.of(filename))  // Streams line by line
        .map(this::process);
}
```

3. Database Bottleneck
Symptoms: Slow queries, connection pool exhaustion, high DB CPU
```python
# ❌ N+1 query problem
def get_orders_with_items(user_id: str) -> list:
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        # This runs a query for EACH order!
        order.items = db.query("SELECT * FROM items WHERE order_id = ?", order.id)
    return orders
```
```python
# ✅ Solution: Use JOIN or batch query
def get_orders_with_items_optimized(user_id: str) -> list:
    return db.query("""
        SELECT o.*, i.*
        FROM orders o
        LEFT JOIN items i ON o.id = i.order_id
        WHERE o.user_id = ?
    """, user_id)
```

```java
// ❌ N+1 query problem
public List<Order> getOrdersWithItems(String userId) {
    List<Order> orders = db.query("SELECT * FROM orders WHERE user_id = ?", userId);
    for (Order order : orders) {
        // This runs a query for EACH order!
        order.setItems(db.query("SELECT * FROM items WHERE order_id = ?", order.getId()));
    }
    return orders;
}
```

```java
// ✅ Solution: Use JOIN or batch query
public List<Order> getOrdersWithItemsOptimized(String userId) {
    return db.query("""
        SELECT o.*, i.*
        FROM orders o
        LEFT JOIN items i ON o.id = i.order_id
        WHERE o.user_id = ?
        """, userId);
}
```

4. Network/I/O Bottleneck
Symptoms: High network latency, waiting on external services
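As a minimal sketch (standard library only, placeholder URLs): calling slow external services one after another adds up every call’s latency, while issuing the calls concurrently bounds the wait by the slowest single call.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URLS = ["https://example.com/api/a", "https://example.com/api/b"]  # placeholder endpoints

def fetch(url: str) -> bytes:
    return urllib.request.urlopen(url, timeout=5).read()

# ❌ Sequential calls: total wait is the SUM of each call's latency
def fetch_all_sequential(urls: list[str]) -> list[bytes]:
    return [fetch(url) for url in urls]

# ✅ Concurrent calls: total wait is roughly the SLOWEST single call
def fetch_all_concurrent(urls: list[str]) -> list[bytes]:
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(fetch, urls))
```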
Finding Bottlenecks
Step 1: Monitor Resource Utilization
| Resource | Tool | Warning Signs |
|---|---|---|
| CPU | top, htop, metrics | >80% sustained |
| Memory | free, vmstat | >90%, frequent GC |
| Disk | iostat, iotop | High wait times |
| Network | netstat, ss | Packet loss, high latency |
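As a rough sketch of turning those warning signs into an automated check (uses the third-party psutil library; the thresholds simply mirror the table above):

```python
import psutil  # third-party: pip install psutil

def check_resources() -> list[str]:
    """Return warnings when utilization crosses the thresholds from the table above."""
    warnings = []
    cpu = psutil.cpu_percent(interval=1)   # % CPU averaged over a 1-second sample
    mem = psutil.virtual_memory().percent  # % of physical memory in use
    if cpu > 80:
        warnings.append(f"CPU at {cpu:.0f}% (sustained >80% suggests a CPU bottleneck)")
    if mem > 90:
        warnings.append(f"Memory at {mem:.0f}% (>90% risks OOM errors and GC pressure)")
    return warnings

print(check_resources())
```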
Step 2: Profile Your Code
```python
import cProfile
import pstats

def profile_function(func):
    """Decorator to profile a function"""
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        profiler.enable()
        result = func(*args, **kwargs)
        profiler.disable()

        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(10)  # Top 10 slowest

        return result
    return wrapper

@profile_function
def my_slow_function():
    # Your code here
    pass
```

```java
// Use JVM profilers: JProfiler, YourKit, or async-profiler
// Or simple timing:
public class SimpleProfiler {
    public static <T> T profile(String name, java.util.function.Supplier<T> operation) {
        long start = System.nanoTime();
        T result = operation.get();
        long duration = System.nanoTime() - start;
        System.out.printf("%s took %.2fms%n", name, duration / 1_000_000.0);
        return result;
    }
}

// Usage
User user = SimpleProfiler.profile("getUser", () -> userService.get(id));
```

Step 3: Trace Requests End-to-End
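A single request often crosses several services, so profiling one process is not enough; a distributed trace shows where the time goes end to end. A minimal sketch using the OpenTelemetry Python API (assumes an exporter such as Jaeger or Zipkin is configured elsewhere; the span names and helper functions are illustrative):

```python
from opentelemetry import trace  # third-party: pip install opentelemetry-api

tracer = trace.get_tracer(__name__)

def checkout(cart_id: str) -> None:
    # Each span appears in the trace with its own duration,
    # so the slow hop is visible at a glance.
    with tracer.start_as_current_span("checkout"):
        with tracer.start_as_current_span("load_cart"):
            cart = load_cart(cart_id)        # hypothetical helper
        with tracer.start_as_current_span("reserve_inventory"):
            reserve_inventory(cart)          # hypothetical helper
        with tracer.start_as_current_span("charge_payment"):
            charge_payment(cart)             # hypothetical helper
```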
Common Solutions
| Bottleneck Type | Solutions |
|---|---|
| CPU | Optimize algorithms, caching, horizontal scaling |
| Memory | Streaming, pagination, efficient data structures |
| Database | Indexing, query optimization, caching, read replicas |
| Network | Caching, compression, connection pooling |
| External APIs | Caching, async calls, circuit breakers, timeouts |
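Circuit breakers are the least self-explanatory entry in that table, so here is a minimal sketch of the idea (illustrative, not production-ready): after repeated failures, fail fast instead of letting a slow dependency hold your threads and connections.

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast instead of waiting")
            self.opened_at = None  # cooldown elapsed: allow one trial call (half-open)
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```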
Advanced: Latency vs Throughput Bottlenecks
Understanding the difference is crucial for senior engineers:
| Type | Symptom | Diagnosis | Solution |
|---|---|---|---|
| Latency | High response times | Profile shows slow operations | Optimize the slow code |
| Throughput | Requests queue up | Resources saturated | Add capacity or optimize resource usage |
| Both | Slow AND queueing | Everything is red | Triage: fix biggest impact first |
Deep Dive: Production Bottleneck Investigation
Here’s how senior engineers approach bottleneck investigation in production:
Step 1: Establish Baseline Metrics
Before optimizing, you need to know what “normal” looks like. Track these key metrics:
| Category | Metrics to Track |
|---|---|
| Request Latency | P50, P95, P99 response times |
| Database | Query times by operation, connection pool usage |
| External APIs | Call durations by service, error rates |
| Resources | CPU, memory, disk I/O, network |
Tools:
- APM (Application Performance Monitoring): Datadog, New Relic, Dynatrace
- Metrics: Prometheus + Grafana
- Distributed Tracing: Jaeger, Zipkin, AWS X-Ray
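As a minimal sketch of how the P50/P95/P99 figures in the table above fall out of raw latency samples (the data here is made up; in practice your metrics system computes this for you):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: good enough for a quick look at raw samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[index]

latencies_ms = [12, 15, 14, 13, 250, 16, 14, 13, 15, 900]  # a couple of slow outliers
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```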
Step 2: Identify the Hot Path
The hot path is the code that runs most frequently or consumes the most resources:
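A minimal sketch building on the cProfile example above (handle_requests and process_request stand in for your real workload): sorting by tottime, the time spent inside each function body, is what usually surfaces the hot path.

```python
import cProfile
import pstats

def handle_requests() -> None:
    for _ in range(1_000):
        process_request()  # hypothetical workload

cProfile.run("handle_requests()", "hotpath.prof")
stats = pstats.Stats("hotpath.prof")
stats.sort_stats("tottime").print_stats(10)  # the top entries are your hot path
```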
Real-World Case Study: E-Commerce Checkout Bottleneck
Situation: Checkout page takes 8 seconds to load during sales events.
Investigation Process
Step 1: Add timing instrumentation to each component
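A minimal sketch of what that instrumentation can look like (component names and helper functions are illustrative):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(component: str):
    """Record how long one component of the request takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[component] = (time.perf_counter() - start) * 1000

def checkout(cart):
    with timed("inventory_check"):
        check_inventory(cart)        # hypothetical helper
    with timed("price_calculation"):
        calculate_prices(cart)       # hypothetical helper
    with timed("payment_authorization"):
        authorize_payment(cart)      # hypothetical helper
    # Print the slowest components first -- this is where the 8 seconds went.
    print(sorted(timings.items(), key=lambda kv: kv[1], reverse=True))
```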
Step 2: Diagnose the root cause
The inventory service was making one database query per item:
- Cart with 100 items = 100 database queries
- Each query ~60ms = 6 seconds total
Step 3: Fix with batch query
```sql
-- BEFORE: N+1 queries (100 queries for 100 items)
SELECT * FROM inventory WHERE sku = 'SKU001';
SELECT * FROM inventory WHERE sku = 'SKU002';
-- ... 98 more queries
```
```sql
-- AFTER: Single batch query
SELECT * FROM inventory WHERE sku IN ('SKU001', 'SKU002', ...);
```

Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| P50 Latency | 6.2s | 0.4s | 93% reduction |
| P99 Latency | 12s | 0.8s | 93% reduction |
| DB Queries | 103 | 5 | 95% reduction |
| Conversion Rate | 2.1% | 3.8% | 81% increase |
Key Takeaways
What’s Next?
You’ve completed the Foundations section! You now understand:
- Why system design matters for LLD
- Scalability fundamentals
- Latency and throughput metrics
- How to find and fix bottlenecks
Continue your journey: Explore other HLD Concepts sections to deepen your understanding of distributed systems.