
Microservices Architecture

Breaking the monolith - when complexity becomes a feature, not a bug

Remember our LEGO castle from the monolith lesson? A microservices architecture is like building your castle from many small, independent pieces that work together but can be changed, moved, or replaced individually.

Each piece (microservice) has one specific job:

  • The drawbridge service only handles opening and closing the gate
  • The watchtower service only handles lookout duties
  • The kitchen service only handles food preparation

If the drawbridge breaks, the kitchen still works!

Companies like Netflix, Uber, Amazon, and Spotify run on microservices architectures with hundreds or thousands of services. But they didn’t start that way - they evolved to microservices when their monoliths couldn’t scale anymore.



Each microservice should do one thing well.

Bad Example:

OrdersAndPaymentsAndShippingService
├── Process orders
├── Handle payments
├── Manage shipping
└── Send notifications

Good Example:

OrderService → Manages orders only
PaymentService → Handles payments only
ShippingService → Manages shipping only
NotificationService → Sends notifications only

Deploy one service without touching others.

# Deploy only the payment service
kubectl apply -f payment-service-v2.yaml
# Order service keeps running with no downtime
# User service keeps running with no downtime

Each service has its own database. No shared databases!
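As an illustration (the service classes and table layouts here are hypothetical), each service owns its database and never touches the other's tables directly:

```python
import sqlite3

# Hypothetical sketch: each service owns a separate database and exposes
# data only through its API, never through the other service's tables.

class OrderService:
    def __init__(self, db_path: str = "orders.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, user_id TEXT)"
        )

    def create_order(self, order_id: str, user_id: str) -> None:
        self.db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, user_id))
        self.db.commit()

class PaymentService:
    def __init__(self, db_path: str = "payments.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS payments (order_id TEXT, amount REAL)"
        )

    def record_payment(self, order_id: str, amount: float) -> None:
        # PaymentService never reads or writes the orders table directly
        self.db.execute("INSERT INTO payments VALUES (?, ?)", (order_id, amount))
        self.db.commit()

# Demo with in-memory databases
orders = OrderService(":memory:")
payments = PaymentService(":memory:")
orders.create_order("o-1", "u-42")
payments.record_payment("o-1", 99.0)
```

Because each schema is private, either team can change its tables without coordinating a migration with the other.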


Each team owns their service end-to-end:

  • Choose their own technology stack
  • Choose their own database
  • Make their own architectural decisions
  • Set their own deployment schedule

Example:

  • Order Service: Python + PostgreSQL
  • Payment Service: Java + MySQL
  • Notification Service: Node.js + MongoDB
  • Search Service: Go + Elasticsearch

Scale only what needs scaling!


Use the right tool for the job!

Real-World Example: Netflix

  • Java Spring Boot: Core business services (catalog, user management)
  • Node.js: API gateway (fast I/O, great for proxying)
  • Python: Machine learning recommendations
  • Go: High-performance streaming services

Each team chooses what works best for their problem domain.

One service failure doesn’t bring down the entire system.


With graceful degradation:

  • Orders still work (view, create, update)
  • Payments queued for later processing
  • Notifications still sent
  • System remains partially functional

Teams can work independently without stepping on each other’s toes.

Aspect        Monolith                           Microservices
Deployment    Coordinate with everyone           Deploy independently
Technology    Everyone uses same stack           Choose your own stack
Database      Shared, coordinate schema changes  Own your schema
Testing       Wait for full integration tests    Test your service in isolation
Ownership     Blurred boundaries                 Clear ownership

Instead of understanding a 1M line monolith, understand a 10K line service.

Cognitive Load:

  • Monolith: 1,000,000 lines, 200 classes, 50 modules
  • Microservice: 10,000 lines, 20 classes, 5 modules

New engineers can become productive faster on individual services.


Every in-process call becomes a network call that can fail:

# In a monolith - this ALWAYS works or throws an exception
payment_result = payment_service.process(order)

# In microservices - this can fail in many ways:
try:
    response = http.post('http://payment-service/process',
                         json=order,
                         timeout=5)
    # What if:
    # - Network is down?
    # - Payment service is down?
    # - Request times out?
    # - Response is corrupted?
    # - Payment processed but response lost?
except RequestTimeout:
    # Did the payment process or not? 🤷
    # This is the "Two Generals Problem"
    pass

The Eight Fallacies of Distributed Computing

  1. The network is reliable ❌ (it’s not)
  2. Latency is zero ❌ (10-50ms per service hop)
  3. Bandwidth is infinite ❌ (serialization overhead)
  4. The network is secure ❌ (need authentication everywhere)
  5. Topology doesn’t change ❌ (services come and go)
  6. There is one administrator ❌ (multiple teams)
  7. Transport cost is zero ❌ (serialization, monitoring)
  8. The network is homogeneous ❌ (different tech stacks)

No more ACID transactions across services!

# One database transaction - ACID guaranteed
@transactional
def create_order(user_id, items, payment_info):
    # All succeed or all fail atomically
    order = order_repo.save(Order(user_id, items))
    payment = payment_repo.save(Payment(order.id, payment_info))
    inventory_repo.reserve(items)
    return order

Solution: Saga Pattern, eventual consistency (covered in async patterns section)
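A saga replaces the single ACID transaction with a sequence of local steps, each paired with a compensating action that undoes it if a later step fails. A minimal sketch (the step and event names are illustrative):

```python
class Saga:
    """Run steps in order; on failure, run compensations in reverse."""

    def __init__(self):
        self.steps = []  # list of (action, compensation) pairs

    def add_step(self, action, compensation):
        self.steps.append((action, compensation))

    def run(self) -> bool:
        completed = []
        try:
            for action, compensation in self.steps:
                action()
                completed.append(compensation)
        except Exception:
            # Undo everything that already succeeded, newest first
            for compensation in reversed(completed):
                compensation()
            return False
        return True

# Demo: order creation succeeds, payment fails, order is compensated
log = []

def create_order():
    log.append("order created")

def cancel_order():
    log.append("order cancelled")

def charge_payment():
    raise RuntimeError("payment service unavailable")

def refund_payment():
    log.append("payment refunded")

saga = Saga()
saga.add_step(create_order, cancel_order)
saga.add_step(charge_payment, refund_payment)
ok = saga.run()
# ok is False; log shows the order was created, then cancelled
```

Note the trade-off: between the failure and the compensation, other services can observe the half-finished state. That is the eventual consistency the pattern accepts.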


Challenges:

  • Need to spin up multiple services for integration tests
  • End-to-end tests require entire ecosystem
  • Mocking service dependencies is complex
  • Test data management across services

Task                   Monolith        10 Microservices  100 Microservices
Deployment pipelines   1               10                100
Monitoring dashboards  1               10                100
Log aggregation        1 source        10 sources        100 sources
Databases to manage    1               10                100
Security patches       1 app           10 apps           100 apps
Incident response      1 service down  Which of 10?      Which of 100?

Required Infrastructure:

  • Service discovery (Consul, Eureka)
  • API Gateway (Kong, NGINX)
  • Message broker (Kafka, RabbitMQ)
  • Distributed tracing (Jaeger, Zipkin)
  • Centralized logging (ELK, Splunk)
  • Service mesh (Istio, Linkerd)
  • Container orchestration (Kubernetes)

Scenario: User reports “checkout is slow”

In a Monolith:

Check logs → Find slow database query → Fix
Time: 10 minutes

In Microservices:

Check which service is slow → API Gateway logs
→ Order Service logs → Payment Service logs
→ Inventory Service logs → User Service logs
→ Find slow service → Check its dependencies
→ Distributed trace shows: Payment Service calling external API slowly
Time: 2 hours (and a lot of frustration)

Performance Comparison: every network hop adds latency (often tens of milliseconds), so a request that fans out across several services is inherently slower than the same logic running in-process in a monolith.

Microservices make sense when:
  1. Large, Mature Organization

    • 50+ engineers
    • Multiple teams working independently
    • Clear organizational boundaries
  2. Proven Product with Clear Boundaries

    • You understand your domain well
    • Stable business capabilities
    • Clear service boundaries identified
  3. Different Scaling Requirements

    • Search: 1000 requests/sec → 100 instances
    • Admin: 10 requests/sec → 2 instances
    • Significant cost savings from independent scaling
  4. Team Autonomy is Critical

    • Teams can’t wait for other teams
    • Need independent deployment schedules
    • Want to experiment with new technologies
  5. You Have the Infrastructure and Expertise

    • DevOps team in place
    • Monitoring and observability ready
    • Experience with distributed systems
Avoid microservices when:

  1. Starting a New Product

    • Don’t know what will succeed
    • Boundaries will change frequently
    • Premature optimization
  2. Small Team (< 20 engineers)

    • Overhead outweighs benefits
    • Everyone can understand the monolith
    • Deployment coordination is easy
  3. Strong Consistency Requirements

    • Financial transactions requiring ACID
    • Can’t tolerate eventual consistency
    • Complex cross-entity workflows
  4. Limited DevOps Capabilities

    • No Kubernetes/Docker experience
    • No monitoring infrastructure
    • Small ops team

Designing Microservices: Finding Service Boundaries


Group by business capabilities, not technical layers.

Bad (Technical Boundaries):

UserService
ProductService
DatabaseService
NotificationService

Good (Business Boundaries):

OrderManagement → Everything about orders
InventoryManagement → Everything about inventory
CustomerManagement → Everything about customers
PaymentProcessing → Everything about payments

Each service represents a bounded context with its own:

  • Domain model
  • Ubiquitous language
  • Business rules

Key Insight: The same entity can have different representations in different contexts!
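For instance (the fields below are hypothetical), a "customer" in the ordering context carries different attributes than the same customer in the shipping context:

```python
from dataclasses import dataclass

# Hypothetical representations: each bounded context models only the
# attributes of "customer" that its own domain actually needs.

@dataclass
class OrderingCustomer:          # OrderManagement context
    customer_id: str
    default_payment_method: str

@dataclass
class ShippingCustomer:          # ShippingManagement context
    customer_id: str
    delivery_address: str

# The shared identity links the two views; everything else is context-specific.
buyer = OrderingCustomer("c-7", "visa-4242")
recipient = ShippingCustomer("c-7", "221B Baker Street")
```

Neither context needs to agree on a single universal Customer class; they only need to agree on the identifier.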


LLD Connection: Inter-Service Communication Patterns

1. Synchronous Communication (HTTP/REST)

order_service.py

import httpx

class OrderService:
    def __init__(self, order_repo: 'OrderRepository', payment_client: 'PaymentClient'):
        self._order_repo = order_repo
        self._payment_client = payment_client

    async def create_order(self, user_id: str, items: list) -> Order:
        # Create order
        order = Order(user_id=user_id, items=items)
        await self._order_repo.save(order)

        # Synchronous call to payment service
        try:
            payment_result = await self._payment_client.process_payment(
                order_id=order.id,
                amount=order.total,
                timeout=5.0  # Always set timeouts!
            )
            if payment_result.status == 'success':
                order.mark_as_paid()
            else:
                order.mark_as_failed()
        except httpx.TimeoutException:
            # Handle timeout - what should we do?
            order.mark_as_pending_payment()
            # Maybe retry later via background job
        return order

class PaymentClient:
    """Client for calling the Payment Service."""

    def __init__(self, base_url: str):
        self._base_url = base_url
        self._client = httpx.AsyncClient()

    async def process_payment(
        self,
        order_id: str,
        amount: float,
        timeout: float
    ) -> PaymentResult:
        response = await self._client.post(
            f"{self._base_url}/payments",
            json={"order_id": order_id, "amount": amount},
            timeout=timeout
        )
        response.raise_for_status()
        return PaymentResult(**response.json())

Pros:

  • Simple to understand
  • Immediate response
  • Easy to debug

Cons:

  • Tight coupling (caller waits for response)
  • Cascading failures (if payment service is down, order service fails)
  • Timeout management complexity
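Timeout handling usually pairs with retries and exponential backoff. A minimal sketch (retry counts and delays are illustrative; real code should also cap total elapsed time and only retry idempotent calls):

```python
import asyncio
import random

async def call_with_retries(func, retries: int = 3, base_delay: float = 0.1):
    """Retry an async call with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return await func()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the failure to the caller
            # 0.1s, 0.2s, 0.4s, ... plus jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            await asyncio.sleep(delay)

# Demo: a flaky dependency that fails twice, then succeeds
attempts = {"count": 0}

async def flaky_payment_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("payment-service unreachable")
    return {"status": "success"}

result = asyncio.run(call_with_retries(flaky_payment_call))
```

The jitter matters: if every caller retries on the same schedule, a recovering service gets hit by synchronized waves of traffic.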

2. Asynchronous Communication (Message Queues)

async_order_service.py

from dataclasses import dataclass
from datetime import datetime

@dataclass
class OrderCreatedEvent:
    order_id: str
    user_id: str
    total: float
    timestamp: datetime

class OrderService:
    def __init__(self, order_repo: 'OrderRepository', message_broker: 'MessageBroker'):
        self._order_repo = order_repo
        self._broker = message_broker

    async def create_order(self, user_id: str, items: list) -> Order:
        # Create order
        order = Order(user_id=user_id, items=items)
        await self._order_repo.save(order)

        # Publish event asynchronously
        await self._broker.publish(
            topic='order.created',
            event=OrderCreatedEvent(
                order_id=order.id,
                user_id=user_id,
                total=order.total,
                timestamp=datetime.now()
            )
        )
        # Don't wait for payment processing!
        # The payment service will consume the event and process it asynchronously.
        return order

class PaymentService:
    """Separate service that consumes events."""

    def __init__(self, message_broker: 'MessageBroker'):
        self._broker = message_broker
        # Subscribe to order events
        self._broker.subscribe('order.created', self.handle_order_created)

    async def handle_order_created(self, event: OrderCreatedEvent):
        """Process payment asynchronously."""
        try:
            result = await self._process_payment(
                event.order_id,
                event.total
            )
            # Publish result event
            await self._broker.publish(
                topic='payment.processed',
                event=PaymentProcessedEvent(
                    order_id=event.order_id,
                    status=result.status
                )
            )
        except Exception as e:
            # Handle error - maybe retry or dead-letter
            await self._broker.publish(
                topic='payment.failed',
                event=PaymentFailedEvent(
                    order_id=event.order_id,
                    error=str(e)
                )
            )

Pros:

  • Loose coupling (services don’t wait for each other)
  • Fault tolerance (messages stored until consumed)
  • Natural retry mechanism
  • Better scalability

Cons:

  • Eventual consistency
  • Harder to debug
  • Message ordering challenges
  • Complexity in error handling
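Because brokers typically guarantee at-least-once delivery, consumers should be idempotent: remember which event IDs were already handled and skip duplicates. A sketch (an in-memory set stands in for what would be a durable store in production):

```python
from dataclasses import dataclass

@dataclass
class PaymentEvent:
    event_id: str
    order_id: str
    amount: float

class IdempotentConsumer:
    """Skips events it has already processed (at-least-once delivery)."""

    def __init__(self):
        self._seen: set[str] = set()   # a durable store in real life
        self.processed: list[str] = []

    def handle(self, event: PaymentEvent) -> bool:
        if event.event_id in self._seen:
            return False               # duplicate delivery: ignore
        self._seen.add(event.event_id)
        self.processed.append(event.order_id)
        return True

# Demo: the broker redelivers the same event
consumer = IdempotentConsumer()
event = PaymentEvent("evt-1", "o-1", 25.0)
consumer.handle(event)   # processed
consumer.handle(event)   # redelivery: skipped
```

This requires every event to carry a stable unique ID, which is why the event schema above includes `event_id` rather than relying on payload contents.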

Netflix

Scale:

  • 200+ million subscribers
  • 1000+ microservices
  • 100,000+ AWS instances

Evolution:

  • Started as a monolith (DVD rental business)
  • Migrated to microservices (2008-2009)
  • Reason: Database scalability issues

Key Practices:

  • Chaos Engineering (intentionally breaking services)
  • API-first design
  • Automated deployment pipelines
  • Strong observability culture

Uber

Architecture:

  • 2,000+ microservices
  • 50,000+ production servers
  • Polyglot: Go, Java, Python, Node.js

Challenges They Faced:

  1. Service discovery at scale

    • Solution: Built internal service mesh
  2. Cascading failures

    • Solution: Circuit breakers everywhere
  3. Debugging distributed traces

    • Solution: Built Jaeger (now open source)

Amazon: The Original Microservices Company


The Mandate (2002):

“All teams will henceforth expose their data and functionality through service interfaces. Teams must communicate with each other through these interfaces. There will be no other form of interprocess communication allowed. Anyone who doesn’t do this will be fired.” - Jeff Bezos

Result:

  • Forced SOA (Service-Oriented Architecture)
  • Led to AWS (Amazon Web Services)
  • Enabled massive scale and innovation

1. API Versioning

api_versioning.py

from enum import Enum
from pydantic import BaseModel

class ApiVersion(Enum):
    V1 = "v1"
    V2 = "v2"

# V1 API - original
class OrderV1(BaseModel):
    id: str
    user_id: str
    total: float

# V2 API - added items and tax fields
class OrderV2(BaseModel):
    id: str
    user_id: str
    items: list[Item]
    total: float
    tax: float  # Breaking change!

# Support both versions side by side
@app.get("/api/v1/orders/{order_id}")
async def get_order_v1(order_id: str) -> OrderV1:
    order = await order_repo.get(order_id)
    return OrderV1(
        id=order.id,
        user_id=order.user_id,
        total=order.total
    )

@app.get("/api/v2/orders/{order_id}")
async def get_order_v2(order_id: str) -> OrderV2:
    order = await order_repo.get(order_id)
    return OrderV2(
        id=order.id,
        user_id=order.user_id,
        items=order.items,
        total=order.total,
        tax=order.tax
    )

2. Health Checks

Every service should expose health endpoints:

@app.get("/health/live")
async def liveness():
    """Is the service running?"""
    return {"status": "ok"}

@app.get("/health/ready")
async def readiness():
    """Is the service ready to accept traffic?"""
    # Check database connection
    if not await database.is_connected():
        raise HTTPException(status_code=503, detail="Database not ready")
    # Check dependent services
    if not await payment_service.is_available():
        raise HTTPException(status_code=503, detail="Payment service unavailable")
    return {"status": "ready"}

3. Observability: Logging, Metrics, Tracing


Structured Logging:

import structlog

logger = structlog.get_logger()

async def create_order(order_id: str, user_id: str):
    logger.info(
        "order.create.started",
        order_id=order_id,
        user_id=user_id,
        service="order-service"
    )
    # ... business logic ...
    logger.info(
        "order.create.completed",
        order_id=order_id,
        duration_ms=duration,
        service="order-service"
    )

Distributed Tracing:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def create_order(order_id: str):
    with tracer.start_as_current_span("order.create") as span:
        span.set_attribute("order.id", order_id)
        # Call payment service - the trace continues across the network!
        with tracer.start_as_current_span("payment.process"):
            await payment_client.process(order_id)

4. Circuit Breakers

Prevent cascading failures:

circuit_breaker.py

import time
from enum import Enum

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected."""

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if recovered

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, timeout: int = 60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    async def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            # Check if the cool-down period has expired
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit breaker is OPEN")
        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage
payment_breaker = CircuitBreaker()

async def call_payment_service(order_id):
    try:
        return await payment_breaker.call(
            payment_client.process,
            order_id
        )
    except CircuitOpenError:
        # Fallback: queue for later processing
        await queue.enqueue(order_id)
        return {"status": "queued"}

Start with Monolith

Don’t start with microservices. Prove your product first, then evolve architecture based on real constraints.

Trade-offs Matter

Microservices trade code complexity for operational complexity. Be prepared for distributed system challenges.

Team Size Matters

Microservices work best with large teams (50+). Small teams benefit more from well-designed monoliths.

Boundaries are Hard

Finding the right service boundaries is hard. Use Domain-Driven Design principles to guide decomposition.



  • “Building Microservices” by Sam Newman
  • “Microservices Patterns” by Chris Richardson
  • “Production-Ready Microservices” by Susan J. Fowler
  • Netflix Tech Blog - Real-world microservices at scale
  • Martin Fowler’s Microservices Guide - Comprehensive resource