
Design a Task Scheduler

Design a high-performance, DAG-aware task orchestration engine with retry logic.

Design a task scheduler system that manages task execution based on priority, handles job dependencies using a Directed Acyclic Graph (DAG), implements retry mechanisms with exponential backoff, and utilizes a thread pool for concurrent execution. The system should efficiently schedule tasks, manage dependencies, handle failures gracefully, and provide real-time status updates.

In this problem, you’ll design an engine that ensures tasks only run when their dependencies are met, handles failures with retries, and scales across multiple CPU cores.


Design an orchestration system that executes tasks based on their importance and prerequisites, while maintaining a transparent audit log.

Functional Requirements:

  • Task Submission: Add tasks with unique IDs and priority levels.
  • Dependency Management: Support tasks that depend on others (DAG).
  • Execution Engine: Run tasks concurrently using a thread pool.
  • Retry Logic: Automatically retry failed tasks with backoff.
  • Monitoring: Provide status updates (Queued, Running, Completed, Failed).
  • Graph Validation: Detect and reject circular dependencies during submission.

Non-Functional Requirements:

  • Reliability: Ensure a task never executes before its dependencies are done.
  • Throughput: Maximize CPU usage without overloading the system.
  • Consistency: Ensure thread-safe updates to task statuses.
  • Extensibility: Easy to add new scheduling strategies (e.g., Fair Scheduling).
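The Monitoring and Consistency requirements above can be illustrated with a small status tracker. This is a minimal sketch, not part of the original design: the names `TaskStatus`, `ALLOWED`, and `StatusBoard` are hypothetical, and the transition table (including re-queuing a failed task for a retry) is an assumption about the task lifecycle.

```python
import threading
from enum import Enum


class TaskStatus(Enum):
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Assumed lifecycle: a failed task may be re-queued for a retry.
ALLOWED = {
    TaskStatus.QUEUED: {TaskStatus.RUNNING},
    TaskStatus.RUNNING: {TaskStatus.COMPLETED, TaskStatus.FAILED},
    TaskStatus.FAILED: {TaskStatus.QUEUED},
    TaskStatus.COMPLETED: set(),
}


class StatusBoard:
    """Holds task statuses behind one lock so worker threads see consistent state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def register(self, task_id):
        with self._lock:
            self._status[task_id] = TaskStatus.QUEUED

    def transition(self, task_id, new_status):
        # The check and the write happen under the same lock, so two workers
        # cannot both move the same task out of the same state.
        with self._lock:
            current = self._status[task_id]
            if new_status not in ALLOWED[current]:
                raise ValueError(
                    f"{task_id}: {current.value} -> {new_status.value} not allowed"
                )
            self._status[task_id] = new_status

    def get(self, task_id):
        with self._lock:
            return self._status[task_id]
```

Rejecting illegal transitions keeps a task from being reported as Completed twice or from running again after it has finished.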

The system coordinates between the Scheduler, the DependencyGraph, and the WorkerPool.

Diagram
classDiagram
    class TaskScheduler {
        -DependencyGraph graph
        -PriorityQueue readyQueue
        -ThreadPool pool
        +submitTask(task)
        +onTaskComplete(taskId)
    }
    
    class Task {
        -String id
        -Priority priority
        -List~String~ dependencies
        -RetryPolicy retryPolicy
        +execute()*
    }
    
    class DependencyGraph {
        -Map~String, List~String~~ adjList
        -Map~String, Integer~ inDegree
        +isAcyclic() bool
        +getReadyTasks() List
    }

    class Worker {
        -Thread thread
        +runTask(task)
    }

    TaskScheduler --> DependencyGraph
    TaskScheduler --> Task
    TaskScheduler o-- Worker
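
The `readyQueue` field in the diagram could be backed by a binary heap guarded by a lock. The sketch below is one possible shape under two assumptions that the diagram does not state: a lower number means a higher priority, and tasks with equal priority run in submission order. The class name `ReadyQueue` and its methods are illustrative.

```python
import heapq
import itertools
import threading


class ReadyQueue:
    """Thread-safe priority queue of runnable task ids (lower number runs first)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority level
        self._lock = threading.Lock()

    def push(self, task_id, priority):
        with self._lock:
            heapq.heappush(self._heap, (priority, next(self._counter), task_id))

    def pop(self):
        """Return the highest-priority task id, or None if nothing is ready."""
        with self._lock:
            if not self._heap:
                return None
            return heapq.heappop(self._heap)[2]
```

Push and pop each cost $O(\log N)$. The counter in the middle of the tuple also prevents Python from comparing task ids when two priorities are equal.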


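If you have 10,000 tasks, rescanning every task's dependency list on each scheduling pass costs up to $O(N^2)$ per pass, since each of the $N$ tasks can depend on up to $N - 1$ others, and most passes find that nothing has changed.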

Solution: Use In-Degree Tracking. Maintain a map where the key is the TaskId and the value is the number of pending dependencies (in-degree). When a task completes, find its children in the DAG and decrement their in-degree. If an in-degree reaches $0$, push that child into the ReadyQueue.
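
A minimal sketch of in-degree tracking follows. It assumes every task is registered before any task completes; the method names `add_task`, `initially_ready`, and `mark_complete` are illustrative and do not come from the class diagram.

```python
class DependencyGraph:
    """Tracks pending-dependency counts so readiness is a counter check."""

    def __init__(self):
        self.children = {}   # task_id -> ids of tasks that depend on it
        self.in_degree = {}  # task_id -> number of unfinished dependencies

    def add_task(self, task_id, dependencies=()):
        deps = set(dependencies)  # ignore duplicate dependency entries
        self.in_degree[task_id] = len(deps)
        for dep in deps:
            self.children.setdefault(dep, []).append(task_id)

    def initially_ready(self):
        """Tasks with no dependencies can be queued immediately."""
        return [t for t, degree in self.in_degree.items() if degree == 0]

    def mark_complete(self, task_id):
        """Decrement each child's in-degree; return children that just became ready."""
        newly_ready = []
        for child in self.children.get(task_id, []):
            self.in_degree[child] -= 1
            if self.in_degree[child] == 0:
                newly_ready.append(child)
        return newly_ready
```

Completing a task now touches only its direct children, so the total work over a whole run is proportional to the number of tasks plus the number of dependency edges. In a multi-threaded scheduler, `mark_complete` would need to run under a lock.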

If Task A depends on B, and B depends on A, neither task can ever become ready, and the system will deadlock.

Solution: Use Kahn’s Algorithm or DFS-based Cycle Detection. Before accepting a set of tasks, perform a topological sort simulation. If you can’t visit all nodes, a cycle exists, and the scheduler should reject the submission.
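
The Kahn's Algorithm check could look roughly like this. The function name `topological_order` and the input shape (a dict mapping each task id to the ids it depends on) are assumptions made for this sketch.

```python
from collections import deque


def topological_order(dependencies):
    """Return a valid execution order, or None if the graph contains a cycle.

    `dependencies` maps each task id to the ids it depends on.
    """
    # Collect every node, including ones that appear only as a dependency.
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)

    in_degree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for task, deps in dependencies.items():
        for dep in set(deps):
            children[dep].append(task)
            in_degree[task] += 1

    # Start from tasks with no prerequisites (sorted for a stable order).
    queue = deque(sorted(n for n in nodes if in_degree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)

    # Nodes on a cycle never reach in-degree 0, so they are never visited.
    return order if len(order) == len(nodes) else None
```

A scheduler could call this on the combined graph before accepting a submission and reject the batch when the result is `None`.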

If 100 tasks fail at the same time and retry immediately, they might crash the already-struggling provider.

Solution: Use Exponential Backoff with Jitter. Double the wait time after each failure, so the delay before retry $n$ grows in proportion to $2^n$, and add a small random “jitter” to ensure the 100 tasks don’t all retry at the exact same millisecond.
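
As a rough sketch of backoff with jitter: the parameters `base`, `cap`, and `jitter`, their default values, and the helper `run_with_retry` are all assumptions chosen for illustration, not values prescribed by the design.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    """Seconds to wait before retry number `attempt` (0 for the first retry)."""
    exponential = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to the cap
    return exponential + random.uniform(0, jitter)  # spread simultaneous retries apart


def run_with_retry(task, max_attempts=5, sleep=time.sleep):
    """Call `task` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(backoff_delay(attempt))
```

Sleeping inside `run_with_retry` keeps a worker thread occupied for the whole delay. A scheduler that cares about throughput might instead record the next eligible run time and put the task back on the queue.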


By solving this problem, you’ll master:

  • Graph Theory - Applying DAGs and topological sorts to real-world workflows.
  • Concurrency Orchestration - Managing shared state in a multi-threaded pool.
  • Fault Tolerance - Building robust retry and error-handling pipelines.
  • Performance Tuning - Using in-degree tracking for $O(1)$ readiness checks and priority queues for $O(\log N)$ ordered dispatch.
