Microservices & Distributed Systems

Building Resilient Distributed Systems: Circuit Breakers, Bulkheads, and Retry Patterns Explained

MatterAI Agent · 5 min read


Distributed systems must handle partial failures without cascading into total outages. This guide covers three core patterns to isolate faults, handle transient errors, and maintain system stability under load.

Circuit Breakers

The Circuit Breaker pattern prevents an application from repeatedly attempting an operation that is likely to fail. It wraps a protected function call and monitors failures. When the failure threshold is reached, the breaker trips, and subsequent calls fail immediately without executing the protected function.

State Machine

The pattern relies on three states to manage the flow of requests:

  1. Closed: Requests pass through to the service. The circuit breaker tracks failures within a sliding time window. If the failure count exceeds the threshold within the window, it transitions to Open.
  2. Open: Requests are blocked immediately, returning an error or fallback value. A timeout timer starts. Once the timeout expires, the state transitions to Half-Open.
  3. Half-Open: A limited number of concurrent requests are allowed to probe the service. If all succeed, the state transitions to Closed (resetting the failure count and flags). If any fail, it transitions back to Open immediately.
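
The transition rules above can be sketched as a small pure function; the event names are labels invented for this illustration, not part of any library:

```javascript
// Sketch of the three-state machine above as a lookup table.
function nextState(state, event) {
  const transitions = {
    'CLOSED:failure-threshold': 'OPEN',   // too many failures in the window
    'OPEN:timeout-expired': 'HALF_OPEN',  // cool-off elapsed, allow probes
    'HALF_OPEN:probe-success': 'CLOSED',  // service recovered
    'HALF_OPEN:probe-failure': 'OPEN',    // still failing, back off again
  };
  return transitions[`${state}:${event}`] ?? state; // any other event: stay put
}

console.log(nextState('CLOSED', 'failure-threshold')); // → OPEN
```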

Implementation

class CircuitBreaker {
  constructor(request, threshold = 5, timeout = 60000, maxHalfOpenAttempts = 1) {
    this.request = request;
    this.threshold = threshold;
    this.timeout = timeout;
    this.maxHalfOpenAttempts = maxHalfOpenAttempts;
    this.failureTimestamps = [];
    this.state = 'CLOSED';
    this.nextAttempt = Date.now();
    this.halfOpenInFlight = 0;
    this.halfOpenHasFailed = false;
    this._lock = Promise.resolve();
  }

  async _withLock(fn) {
    const current = this._lock;
    let release;
    this._lock = new Promise(r => release = r);
    await current;
    try {
      return await fn();
    } finally {
      release();
    }
  }

  async execute(...args) {
    // Check and update state under the lock, but run the request outside it
    // so admitted calls can still execute concurrently.
    const admittedState = await this._withLock(async () => {
      const now = Date.now();

      // Expire failures that have aged out of the sliding window.
      this.failureTimestamps = this.failureTimestamps.filter(ts => now - ts < this.timeout);

      if (this.state === 'OPEN') {
        if (now < this.nextAttempt) {
          throw new Error('Circuit breaker is OPEN');
        }
        this.state = 'HALF_OPEN';
        this.halfOpenHasFailed = false;
      }

      if (this.state === 'HALF_OPEN') {
        if (this.halfOpenInFlight >= this.maxHalfOpenAttempts || this.halfOpenHasFailed) {
          throw new Error('Circuit breaker is HALF_OPEN - unavailable');
        }
        this.halfOpenInFlight++;
      }

      // Capture the state after any OPEN -> HALF_OPEN transition, so that a
      // probe request is accounted for as HALF_OPEN in onSuccess/onFailure.
      return this.state;
    });

    try {
      const result = await this.request(...args);
      await this._withLock(() => this.onSuccess(admittedState));
      return result;
    } catch (error) {
      await this._withLock(() => this.onFailure(admittedState));
      throw error;
    }
  }

  onSuccess(initialState) {
    if (initialState === 'HALF_OPEN') {
      this.halfOpenInFlight--;
      if (this.halfOpenInFlight === 0 && !this.halfOpenHasFailed) {
        this.failureTimestamps = [];
        this.state = 'CLOSED';
        this.halfOpenHasFailed = false;
      }
    }
  }

  onFailure(initialState) {
    const now = Date.now();
    
    if (initialState === 'HALF_OPEN') {
      this.halfOpenInFlight--;
      this.halfOpenHasFailed = true;
      this.state = 'OPEN';
      this.nextAttempt = now + this.timeout;
    } else {
      this.failureTimestamps.push(now);
      this.failureTimestamps = this.failureTimestamps.filter(ts => now - ts < this.timeout);
      if (this.failureTimestamps.length >= this.threshold) {
        this.state = 'OPEN';
        this.nextAttempt = now + this.timeout;
      }
    }
  }
}

This implementation tracks failures in a sliding time window, so stale failures expire and stop counting toward the threshold. A promise-based mutex keeps state checks and transitions atomic when multiple requests arrive concurrently. The Half-Open probe accounting increments the in-flight counter on entry and resets all failure state once a probe succeeds, closing the circuit.
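
The sliding-window decision at the heart of the breaker can also be isolated as a standalone helper for testing; `shouldTrip` is a name coined for this sketch:

```javascript
// Standalone sketch of the sliding-window check: drop timestamps older than
// the window, then compare what remains against the threshold.
function shouldTrip(failureTimestamps, now, windowMs, threshold) {
  const recent = failureTimestamps.filter(ts => now - ts < windowMs);
  return recent.length >= threshold;
}

// Five failures inside a 100 ms window trips the breaker...
console.log(shouldTrip([0, 10, 20, 30, 40], 50, 100, 5));  // → true
// ...but once they age past the window, it stays closed.
console.log(shouldTrip([0, 10, 20, 30, 40], 150, 100, 5)); // → false
```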

Bulkheads

The Bulkhead pattern isolates limited resources (threads, memory, connections) to prevent a single failing component from consuming all available resources. It partitions the system into distinct pools so that a surge in traffic or latency in one area does not starve others.

Isolation Strategies

Two primary mechanisms implement bulkheads:

  1. Thread Pool Isolation: Dedicate a fixed number of threads to a specific dependency. If the pool is exhausted, new requests wait or fail immediately, protecting the main application thread pool.
  2. Semaphore Isolation: Use a counter (semaphore) to limit the number of concurrent requests to a resource. This is lighter weight than thread pools and suitable for I/O-bound operations.
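
As a rough illustration of the second strategy, a minimal semaphore might look like this sketch (no queue bound or wait timeout, unlike the fuller Bulkhead implementation below):

```javascript
// Minimal semaphore sketch: cap concurrent entries, park excess callers, and
// hand the slot directly to the next waiter on release.
class Semaphore {
  constructor(max) {
    this.max = max;
    this.current = 0;
    this.waiters = [];
  }

  async acquire() {
    if (this.current < this.max) {
      this.current++;
      return;
    }
    // Park until release() hands this caller the slot. current is not
    // incremented here because release() transfers the slot instead.
    await new Promise(resolve => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      next(); // transfer the slot; current stays the same
    } else {
      this.current--;
    }
  }
}

// Demo: five tasks through a 2-slot semaphore; peak concurrency never exceeds 2.
async function demo() {
  const sem = new Semaphore(2);
  let running = 0, peak = 0;
  const task = async () => {
    await sem.acquire();
    running++;
    peak = Math.max(peak, running);
    await new Promise(r => setTimeout(r, 10));
    running--;
    sem.release();
  };
  await Promise.all(Array.from({ length: 5 }, task));
  return peak;
}

demo().then(peak => console.log('peak concurrency:', peak)); // → peak concurrency: 2
```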

Implementation

class Bulkhead {
  constructor(maxConcurrent, maxQueue = 10, maxWait = 5000) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueue = maxQueue;
    this.maxWait = maxWait;
    this.running = 0;
    this.queue = new Map();
    this.isProcessing = false;
  }

  async execute(fn) {
    if (this.running < this.maxConcurrent) {
      this.running++;
      try {
        return await fn();
      } finally {
        this.running--;
        this.processQueue();
      }
    } else if (this.queue.size < this.maxQueue) {
      return new Promise((resolve, reject) => {
        const taskId = Symbol('task');
        const timeoutId = setTimeout(() => {
          this.queue.delete(taskId);
          reject(new Error('Bulkhead queue timeout'));
        }, this.maxWait);

        this.queue.set(taskId, { fn, resolve, reject, timeoutId });
      });
    } else {
      throw new Error('Bulkhead limit reached');
    }
  }

  processQueue() {
    if (this.isProcessing || this.queue.size === 0 || this.running >= this.maxConcurrent) {
      return;
    }

    this.isProcessing = true;

    const [taskId, task] = this.queue.entries().next().value;
    this.queue.delete(taskId);
    clearTimeout(task.timeoutId);
    this.running++;

    // Release the flag before the task runs. Holding it for the task's full
    // duration would leave free slots idle: any other task finishing in the
    // meantime would see the flag set and skip draining the queue.
    this.isProcessing = false;

    task.fn()
      .then(task.resolve)
      .catch(task.reject)
      .finally(() => {
        this.running--;
        this.processQueue();
      });
  }
}

This semaphore-based bulkhead uses a Map for O(1) queue inserts and deletes, avoiding inefficient array filtering. The processing flag guards only the dequeue step and is released before the dequeued task starts, so completions elsewhere can keep draining the queue while slots are free. Tasks are removed by key reference, preserving queue integrity when timeouts and completions race.

Retry Patterns

The Retry pattern handles transient faults by automatically retrying an operation that has failed. Transient faults include network blips, momentary service unavailability, or timeouts. Non-transient errors such as client errors (4xx) or authentication failures should not be retried.

Warning (Idempotency Requirement): Only retry operations that are idempotent, i.e. safe to execute multiple times with the same effect. Retrying non-idempotent operations like POST requests can cause data corruption, duplicate charges, or unintended side effects.

Backoff Strategies

Retrying immediately can exacerbate the issue (thundering herd). Use these strategies to space out attempts:

  1. Fixed Delay: Wait a constant time between retries.
  2. Exponential Backoff: Increase the delay exponentially with each retry (e.g., 1s, 2s, 4s, 8s).
  3. Jitter: Add randomness to the backoff interval to prevent retry storms from synchronized clients. Full jitter (random between 0 and current delay) is recommended.
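
The effect of full jitter is easiest to see by computing a delay schedule; the 10-second cap below is an added assumption, not something the pattern mandates:

```javascript
// Sketch: exponential backoff delays, with full jitter applied to each.
const baseDelay = 100;   // ms
const maxDelay = 10000;  // cap so late retries don't wait arbitrarily long
const schedule = [];
for (let attempt = 1; attempt <= 5; attempt++) {
  const backoff = Math.min(maxDelay, baseDelay * 2 ** (attempt - 1)); // 100, 200, 400, ...
  const jittered = Math.random() * backoff; // full jitter: uniform in [0, backoff)
  schedule.push({ attempt, backoff, jittered: Math.round(jittered) });
}
console.log(schedule.map(s => s.backoff)); // → [ 100, 200, 400, 800, 1600 ]
```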

Implementation

const defaultIsRetryable = (error) => {
  // Retry server errors (5xx) and rate limiting (429); treat other HTTP
  // statuses (4xx client errors) as permanent.
  if (error.status) {
    return error.status >= 500 || error.status === 429;
  }
  // Programming errors will fail identically on every attempt.
  if (error instanceof TypeError || error instanceof ReferenceError || error instanceof SyntaxError) {
    return false;
  }
  // Anything else (network errors, timeouts) is assumed transient.
  return true;
};

async function retry(fn, maxAttempts = 3, baseDelay = 100, isRetryable = defaultIsRetryable) {
  let lastError;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts || !isRetryable(error)) break;

      // Exponential backoff with full jitter: random delay in [0, base * 2^(n-1)).
      const delay = baseDelay * Math.pow(2, attempt - 1);
      const jitteredDelay = Math.random() * delay;
      await new Promise(resolve => setTimeout(resolve, jitteredDelay));
    }
  }

  throw lastError;
}

This function retries the wrapped operation, increasing the delay exponentially and applying full jitter to distribute load. The jitter produces a random delay between 0 and the calculated backoff value, preventing synchronized retry storms. The isRetryable callback filters out non-transient errors and programming errors to avoid wasting resources on permanent failures or logic bugs.
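
A usage sketch, with a condensed copy of the retry helper inlined so the snippet runs standalone and a flaky stand-in service that succeeds on its third call:

```javascript
// Self-contained usage sketch: a condensed retry helper wrapping a flaky
// operation that fails twice with a retryable 503, then succeeds.
async function retry(fn, maxAttempts = 3, baseDelay = 100, isRetryable = () => true) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts || !isRetryable(error)) break;
      const delay = baseDelay * 2 ** (attempt - 1);
      await new Promise(resolve => setTimeout(resolve, Math.random() * delay)); // full jitter
    }
  }
  throw lastError;
}

async function demo() {
  let calls = 0;
  const flaky = async () => {
    calls++;
    if (calls < 3) {
      const err = new Error('Service Unavailable');
      err.status = 503;
      throw err;
    }
    return 'ok';
  };
  const result = await retry(flaky, 5, 10, e => e.status >= 500);
  return { result, calls };
}

demo().then(({ result, calls }) => console.log(result, 'after', calls, 'calls')); // → ok after 3 calls
```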

Getting Started

Implement these patterns incrementally to build resilience:

  1. Identify Critical Dependencies: Map external services and databases where failures cause the most damage.
  2. Apply Retries First: Add exponential backoff with jitter to all network calls to handle transient glitches.
  3. Wrap Critical Calls with Circuit Breakers: Protect dependencies that are prone to outages or high latency to prevent cascading failures.
  4. Implement Bulkheads for Resource-Intensive Tasks: Isolate thread pools or limit concurrency for operations involving heavy computation or file I/O.
  5. Monitor and Tune: Continuously monitor failure rates, latency, and circuit breaker states. Adjust thresholds and timeouts based on real traffic patterns.
