Building Resilient Distributed Systems: Circuit Breakers, Bulkheads, and Retry Patterns Explained
Distributed systems must handle partial failures without cascading into total outages. This guide covers three core patterns to isolate faults, handle transient errors, and maintain system stability under load.
Circuit Breakers
The Circuit Breaker pattern prevents an application from repeatedly attempting an operation that is likely to fail. It wraps a protected function call and monitors failures. When the failure threshold is reached, the breaker trips, and subsequent calls fail immediately without executing the protected function.
State Machine
The pattern relies on three states to manage the flow of requests:
- Closed: Requests pass through to the service. The circuit breaker tracks failures within a sliding time window. If the failure count exceeds the threshold within the window, it transitions to Open.
- Open: Requests are blocked immediately, returning an error or fallback value. A timeout timer starts. Once the timeout expires, the state transitions to Half-Open.
- Half-Open: A limited number of concurrent requests are allowed to probe the service. If all succeed, the state transitions to Closed (resetting the failure count and flags). If any fail, it transitions back to Open immediately.
Implementation
```javascript
class CircuitBreaker {
  constructor(request, threshold = 5, timeout = 60000, maxHalfOpenAttempts = 1) {
    this.request = request;
    this.threshold = threshold;
    this.timeout = timeout;
    this.maxHalfOpenAttempts = maxHalfOpenAttempts;
    this.failureTimestamps = [];
    this.state = 'CLOSED';
    this.nextAttempt = Date.now();
    this.halfOpenInFlight = 0;
    this.halfOpenHasFailed = false;
    this._lock = Promise.resolve();
  }

  // Serialize state reads and writes. Only bookkeeping runs under the
  // lock; the protected request itself executes outside it.
  async _withLock(fn) {
    const current = this._lock;
    let release;
    this._lock = new Promise(r => (release = r));
    await current;
    try {
      return await fn();
    } finally {
      release();
    }
  }

  async execute(...args) {
    // Admit or reject the call under the lock, and capture the state the
    // call was admitted in (after any OPEN -> HALF_OPEN transition).
    const admittedState = await this._withLock(() => {
      const now = Date.now();
      this.failureTimestamps = this.failureTimestamps.filter(ts => now - ts < this.timeout);
      if (this.state === 'OPEN') {
        if (now < this.nextAttempt) {
          throw new Error('Circuit breaker is OPEN');
        }
        this.state = 'HALF_OPEN';
        this.halfOpenHasFailed = false;
      }
      if (this.state === 'HALF_OPEN') {
        if (this.halfOpenInFlight >= this.maxHalfOpenAttempts || this.halfOpenHasFailed) {
          throw new Error('Circuit breaker is HALF_OPEN - unavailable');
        }
        this.halfOpenInFlight++;
      }
      return this.state;
    });
    try {
      const result = await this.request(...args);
      await this._withLock(() => this.onSuccess(admittedState));
      return result;
    } catch (error) {
      await this._withLock(() => this.onFailure(admittedState));
      throw error;
    }
  }

  onSuccess(admittedState) {
    if (admittedState === 'HALF_OPEN') {
      this.halfOpenInFlight--;
      if (this.halfOpenInFlight === 0 && !this.halfOpenHasFailed) {
        // All probes succeeded: close the circuit and clear failure state.
        this.failureTimestamps = [];
        this.state = 'CLOSED';
        this.halfOpenHasFailed = false;
      }
    }
  }

  onFailure(admittedState) {
    const now = Date.now();
    if (admittedState === 'HALF_OPEN') {
      // A failed probe reopens the circuit immediately.
      this.halfOpenInFlight--;
      this.halfOpenHasFailed = true;
      this.state = 'OPEN';
      this.nextAttempt = now + this.timeout;
    } else {
      this.failureTimestamps.push(now);
      this.failureTimestamps = this.failureTimestamps.filter(ts => now - ts < this.timeout);
      if (this.failureTimestamps.length >= this.threshold) {
        this.state = 'OPEN';
        this.nextAttempt = now + this.timeout;
      }
    }
  }
}
```
This implementation tracks failures with a sliding time window, so old failures expire and no longer count toward the threshold. A promise-based mutex serializes state transitions while the protected call runs outside the lock, so requests still execute concurrently. The state a call was admitted in is captured after any OPEN to HALF_OPEN transition, so probe results are attributed correctly: the in-flight counter caps concurrent probes, and all failure state is reset once every probe succeeds.
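The Open state's "error or fallback value" behavior can be layered on top of the breaker by the caller. Below is a small illustrative helper (not part of the class above) that converts a rejected call into a fallback value:

```javascript
// Convert a rejected async operation into a fallback value. `fallback`
// may be a constant or a function of the error.
async function withFallback(operation, fallback) {
  try {
    return await operation();
  } catch (err) {
    // When the operation (or the breaker guarding it) rejects, serve the
    // fallback instead of propagating the error to the caller.
    return typeof fallback === 'function' ? fallback(err) : fallback;
  }
}
```

A caller would wrap `breaker.execute(...)` in `withFallback` so that, while the circuit is open, users see cached or default data rather than errors.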
Bulkheads
The Bulkhead pattern isolates limited resources (threads, memory, connections) to prevent a single failing component from consuming all available resources. It partitions the system into distinct pools so that a surge in traffic or latency in one area does not starve others.
Isolation Strategies
Two primary mechanisms implement bulkheads:
- Thread Pool Isolation: Dedicate a fixed number of threads to a specific dependency. If the pool is exhausted, new requests wait or fail immediately, protecting the main application thread pool.
- Semaphore Isolation: Use a counter (semaphore) to limit the number of concurrent requests to a resource. This is lighter weight than thread pools and suitable for I/O-bound operations.
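Semaphore isolation in its simplest form is just a counter. The sketch below (class name and error message are illustrative) rejects immediately when the limit is reached, with no queue:

```javascript
// Minimal counting semaphore: admit up to `max` concurrent operations,
// reject the rest immediately.
class Semaphore {
  constructor(max) {
    this.max = max;
    this.active = 0;
  }

  async run(fn) {
    if (this.active >= this.max) {
      throw new Error('Semaphore limit reached');
    }
    this.active++;
    try {
      return await fn();
    } finally {
      // Release the slot whether fn resolved or rejected.
      this.active--;
    }
  }
}
```

The full implementation below extends this idea with a bounded wait queue so bursts are absorbed rather than rejected outright.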
Implementation
```javascript
class Bulkhead {
  constructor(maxConcurrent, maxQueue = 10, maxWait = 5000) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueue = maxQueue;
    this.maxWait = maxWait;
    this.running = 0;
    this.queue = new Map();
  }

  async execute(fn) {
    if (this.running < this.maxConcurrent) {
      this.running++;
      try {
        return await fn();
      } finally {
        this.running--;
        this.processQueue();
      }
    } else if (this.queue.size < this.maxQueue) {
      // Park the task; it is rejected if it waits longer than maxWait.
      return new Promise((resolve, reject) => {
        const taskId = Symbol('task');
        const timeoutId = setTimeout(() => {
          this.queue.delete(taskId);
          reject(new Error('Bulkhead queue timeout'));
        }, this.maxWait);
        this.queue.set(taskId, { fn, resolve, reject, timeoutId });
      });
    } else {
      throw new Error('Bulkhead limit reached');
    }
  }

  processQueue() {
    // Dispatch as many queued tasks as free slots allow. The dequeue loop
    // itself is synchronous, so completing tasks cannot race on the queue.
    while (this.queue.size > 0 && this.running < this.maxConcurrent) {
      const [taskId, task] = this.queue.entries().next().value;
      this.queue.delete(taskId);
      clearTimeout(task.timeoutId);
      this.running++;
      task.fn()
        .then(task.resolve, task.reject)
        .finally(() => {
          this.running--;
          this.processQueue();
        });
    }
  }
}
```
This semaphore-based bulkhead uses a Map for O(1) queue operations, avoiding inefficient array filtering. Because the dequeue loop runs synchronously on a single thread, freed slots are refilled immediately, even when several tasks finish at once. Tasks are removed by key reference, preserving queue integrity when timeouts and completions interleave.
Retry Patterns
The Retry pattern handles transient faults by automatically retrying a failed operation. Transient faults include network blips, momentary service unavailability, and timeouts. Non-transient errors, such as most client errors (4xx, excluding 429 Too Many Requests) and authentication failures, should not be retried.
Warning (Idempotency Requirement): Only retry operations that are idempotent, i.e. safe to execute multiple times with the same effect. Retrying non-idempotent operations, such as POST requests, can cause data corruption, duplicate charges, or other unintended side effects.
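One common way to make a non-idempotent operation retry-safe is an idempotency key: the client attaches a unique key to the request, and the server returns the original result for any repeat of that key. The sketch below is illustrative only; the in-memory map stands in for real server-side deduplication.

```javascript
// Server-side store of results already produced, keyed by idempotency key.
const processed = new Map();

function chargeCard(idempotencyKey, amount) {
  // A retry with a key we have already seen returns the original result
  // instead of creating a second charge.
  if (processed.has(idempotencyKey)) {
    return processed.get(idempotencyKey);
  }
  const result = { chargeId: `ch_${processed.size + 1}`, amount };
  processed.set(idempotencyKey, result);
  return result;
}
```

With this in place, a client that times out and retries the same logical charge (same key) can never double-bill the customer.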
Backoff Strategies
Retrying immediately can exacerbate the issue (thundering herd). Use these strategies to space out attempts:
- Fixed Delay: Wait a constant time between retries.
- Exponential Backoff: Increase the delay exponentially with each retry (e.g., 1s, 2s, 4s, 8s).
- Jitter: Add randomness to the backoff interval to prevent retry storms from synchronized clients. Full jitter (random between 0 and current delay) is recommended.
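The three strategies above reduce to small delay calculators. The function names and the cap parameter below are illustrative, not from a library:

```javascript
// Fixed delay: the same wait before every retry.
function fixedDelay(baseMs) {
  return baseMs;
}

// Exponential backoff: double the wait on each attempt (1-based),
// capped so delays never grow unbounded.
function exponentialBackoff(baseMs, attempt, capMs = 30000) {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

// Full jitter: a uniformly random delay in [0, exponential backoff),
// which spreads out retries from synchronized clients.
function fullJitter(baseMs, attempt, capMs = 30000) {
  return Math.random() * exponentialBackoff(baseMs, attempt, capMs);
}
```

For a base of 1s, exponential backoff yields 1s, 2s, 4s, 8s across attempts 1 to 4; full jitter draws a random value below each of those bounds.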
Implementation
```javascript
async function retry(
  fn,
  maxAttempts = 3,
  baseDelay = 100,
  isRetryable = (error) => {
    if (error.status) {
      // Retry server errors (5xx) and rate limiting (429).
      return error.status >= 500 || error.status === 429;
    }
    if (error instanceof TypeError || error instanceof ReferenceError || error instanceof SyntaxError) {
      // Programming errors will fail identically on every attempt.
      return false;
    }
    return true;
  }
) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts || !isRetryable(error)) break;
      // Exponential backoff with full jitter: random delay in [0, base * 2^(attempt-1)).
      const delay = baseDelay * Math.pow(2, attempt - 1);
      const jitteredDelay = Math.random() * delay;
      await new Promise(resolve => setTimeout(resolve, jitteredDelay));
    }
  }
  throw lastError;
}
```
This function retries the wrapped operation, increasing the delay exponentially and applying full jitter to distribute load. The jitter produces a random delay between 0 and the calculated backoff value, preventing synchronized retry storms. The isRetryable callback filters out non-transient errors and programming errors to avoid wasting resources on permanent failures or logic bugs.
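The default classifier's behavior is easiest to see in isolation. Below it is extracted into a standalone function (the name `isTransient` is illustrative) so each branch can be exercised directly:

```javascript
// Decide whether an error is worth retrying.
function isTransient(error) {
  if (error.status) {
    // HTTP-style errors: retry server errors and rate limiting only.
    return error.status >= 500 || error.status === 429;
  }
  if (error instanceof TypeError || error instanceof ReferenceError || error instanceof SyntaxError) {
    // Logic bugs fail the same way every time; retrying wastes resources.
    return false;
  }
  // Unknown errors (e.g. dropped connections) are treated as transient.
  return true;
}
```

Note that 429 is the one client error treated as transient: it signals rate limiting, which backoff is specifically designed to relieve.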
Getting Started
Implement these patterns incrementally to build resilience:
- Identify Critical Dependencies: Map external services and databases where failures cause the most damage.
- Apply Retries First: Add exponential backoff with jitter to all network calls to handle transient glitches.
- Wrap Critical Calls with Circuit Breakers: Protect dependencies that are prone to outages or high latency to prevent cascading failures.
- Implement Bulkheads for Resource-Intensive Tasks: Isolate thread pools or limit concurrency for operations involving heavy computation or file I/O.
- Monitor and Tune: Continuously monitor failure rates, latency, and circuit breaker states. Adjust thresholds and timeouts based on real traffic patterns.
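The monitoring step can start very small: a sampler that snapshots breaker health into a metrics sink on an interval. The breaker fields read here match the class earlier in this guide; `metrics` and the record shape are assumptions for illustration.

```javascript
// Snapshot a breaker's state and recent failure count into a metrics sink.
function sampleBreaker(name, breaker, metrics) {
  metrics.push({
    name,
    state: breaker.state,
    recentFailures: breaker.failureTimestamps.length,
    sampledAt: Date.now(),
  });
}

// e.g. setInterval(() => sampleBreaker('payments', paymentsBreaker, metrics), 10000);
```

Trends in these samples (how often a breaker opens, how many failures accumulate before it trips) are exactly the signals needed to tune thresholds and timeouts against real traffic.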