Building a Circuit Breaker from Scratch
Stop cascading failures before they take down your entire system — with Redis-backed state and atomic transitions
By SysAdmin · Published 2026-05-27
When a downstream service goes down, the worst thing you can do is keep hammering it. A circuit breaker detects persistent failures, short-circuits requests, and gives the failing service time to recover.
1. The Problem
Imagine your payment service calls a fraud-detection API. The API starts timing out — 5 seconds per call. Without protection, every checkout request now takes 5+ seconds, your thread pool fills up, and your entire application grinds to a halt. This is a cascading failure.
A circuit breaker sits between your code and the downstream call. After a configurable number of consecutive failures, it "trips" — immediately rejecting all calls without even attempting the request. After a recovery timeout, it cautiously allows a few test requests through. If they succeed, the circuit closes and normal traffic resumes.
Our challenge: implement the three-state circuit breaker pattern (CLOSED → OPEN → HALF_OPEN) with:
- Per-service failure tracking
- Configurable thresholds and recovery timeouts
- Redis-backed state for multi-instance consistency
- Atomic state transitions under concurrent load
2. The Naïve Approach (and Why It Fails)
private int failureCount = 0;
private boolean isOpen = false;

public boolean execute(Callable<Boolean> action) {
    if (isOpen) return false;
    try {
        return action.call();
    } catch (Exception e) {
        failureCount++;
        if (failureCount >= threshold) isOpen = true;
        return false;
    }
}
This has three critical flaws:
- No recovery path: once open, it stays open forever. There's no HALF_OPEN state to test if the service has recovered.
- Not thread-safe: failureCount++ is not atomic. Under concurrent load, threads can overwrite each other's increments, so some failures are never counted.
- Single-instance only: if you have 3 app servers, each has its own count. Server A sees 2 failures, Server B sees 2 failures, but neither trips the breaker even though there have been 4 failures total.
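The thread-safety flaw on its own can be addressed in-process with java.util.concurrent.atomic. The following is a minimal sketch; LocalBreakerCounter and its method names are hypothetical, not part of the challenge code. It fixes only the lost-update problem: there is still no recovery path, and nothing is shared across app servers.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-process counter: fixes the lost-update race only.
final class LocalBreakerCounter {
    private final int threshold;
    private final AtomicInteger failureCount = new AtomicInteger();
    private final AtomicBoolean open = new AtomicBoolean(false);

    LocalBreakerCounter(int threshold) {
        this.threshold = threshold;
    }

    // incrementAndGet is one atomic read-modify-write, unlike failureCount++.
    int recordFailure() {
        int count = failureCount.incrementAndGet();
        if (count >= threshold) {
            open.set(true);
        }
        return count;
    }

    // A success resets the consecutive-failure count.
    void recordSuccess() {
        failureCount.set(0);
    }

    boolean isOpen() {
        return open.get();
    }

    int failures() {
        return failureCount.get();
    }
}
```

Because every caller of incrementAndGet receives a distinct value, exactly one thread observes the count reaching the threshold, no matter how the threads interleave.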
3. The Right Model
Circuit breaker state lives in Redis, shared across all instances:
Redis Hash: cb:{serviceId}
state → CLOSED | OPEN | HALF_OPEN
failureCount → integer
lastFailureTime → epoch-ms
openedAt → epoch-ms (when the circuit tripped)
halfOpenAttempts → integer (test calls allowed in HALF_OPEN)
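On the Java side, the hash can be read into a small value object. The sketch below is illustrative only: BreakerSnapshot and fromHash are hypothetical names, the braces in cb:{serviceId} are assumed to be a placeholder rather than literal characters, and missing fields are assumed to mean CLOSED with zero counts.

```java
import java.util.Map;

// Hypothetical read-only view of the cb:{serviceId} hash.
final class BreakerSnapshot {
    final String state;
    final int failureCount;
    final long lastFailureTime;
    final long openedAt;
    final int halfOpenAttempts;

    private BreakerSnapshot(String state, int failureCount, long lastFailureTime,
                            long openedAt, int halfOpenAttempts) {
        this.state = state;
        this.failureCount = failureCount;
        this.lastFailureTime = lastFailureTime;
        this.openedAt = openedAt;
        this.halfOpenAttempts = halfOpenAttempts;
    }

    // Redis key for a service (assumes the braces are a placeholder).
    static String key(String serviceId) {
        return "cb:" + serviceId;
    }

    // Builds a snapshot from a field map such as an HGETALL result.
    // Absent fields fall back to CLOSED / 0 (an assumption for unseen services).
    static BreakerSnapshot fromHash(Map<String, String> hash) {
        return new BreakerSnapshot(
                hash.getOrDefault("state", "CLOSED"),
                Integer.parseInt(hash.getOrDefault("failureCount", "0")),
                Long.parseLong(hash.getOrDefault("lastFailureTime", "0")),
                Long.parseLong(hash.getOrDefault("openedAt", "0")),
                Integer.parseInt(hash.getOrDefault("halfOpenAttempts", "0")));
    }
}
```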
State Machine
┌─────────┐  failure threshold  ┌──────┐  recovery timeout  ┌───────────┐
│ CLOSED  │ ──────────────────► │ OPEN │ ─────────────────► │ HALF_OPEN │
└─────────┘                     └──────┘                    └───────────┘
     ▲                                                         │     │
     │                         success                         │     │
     └─────────────────────────────────────────────────────────┘     │
                                                          failure    │
                                              ┌──────┐◄──────────────┘
                                              │ OPEN │
                                              └──────┘
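The same transitions can be written down as a small enum, which makes the legal moves explicit. This is a sketch under assumed names: BreakerState, Event, and on are hypothetical and do not appear in the challenge interface.

```java
// Hypothetical encoding of the state diagram; event names are illustrative.
enum BreakerState {
    CLOSED, OPEN, HALF_OPEN;

    enum Event { THRESHOLD_REACHED, RECOVERY_TIMEOUT_ELAPSED, PROBE_SUCCEEDED, PROBE_FAILED }

    // Returns the next state, or the current state when the event does not apply.
    BreakerState on(Event event) {
        switch (this) {
            case CLOSED:
                return event == Event.THRESHOLD_REACHED ? OPEN : this;
            case OPEN:
                return event == Event.RECOVERY_TIMEOUT_ELAPSED ? HALF_OPEN : this;
            case HALF_OPEN:
                if (event == Event.PROBE_SUCCEEDED) return CLOSED;
                if (event == Event.PROBE_FAILED) return OPEN;
                return this;
            default:
                throw new AssertionError("unknown state " + this);
        }
    }
}
```

Note that there is no direct edge from OPEN to CLOSED: the only way back to normal traffic is through a successful probe in HALF_OPEN.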
4. The Implementation, Walked Through
The Interface
public interface CircuitBreakerContract {
    void configure(int failureThreshold, long recoveryTimeoutMs, int halfOpenMaxAttempts);
    boolean execute(String serviceId, Callable<Boolean> action);
    String getState(String serviceId);
    int getFailureCount(String serviceId);
    long getLastFailureTimestamp(String serviceId);
    void reset(String serviceId);
}
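From the caller's side, execute returns false both when the breaker rejects the call and when the action itself fails, so the caller only needs one fallback path. The sketch below shows one way a caller might use the interface; FraudCheckClient, the "fraud-api" service id, and the CHECKED / MANUAL_REVIEW outcomes are all hypothetical. The interface is repeated so the sketch compiles on its own.

```java
import java.util.concurrent.Callable;

// Repeated from the article so this sketch is self-contained.
interface CircuitBreakerContract {
    void configure(int failureThreshold, long recoveryTimeoutMs, int halfOpenMaxAttempts);
    boolean execute(String serviceId, Callable<Boolean> action);
    String getState(String serviceId);
    int getFailureCount(String serviceId);
    long getLastFailureTimestamp(String serviceId);
    void reset(String serviceId);
}

// Hypothetical caller that guards a fraud-detection call with the breaker.
final class FraudCheckClient {
    private final CircuitBreakerContract breaker;
    private final Callable<Boolean> fraudApiCall;

    FraudCheckClient(CircuitBreakerContract breaker, Callable<Boolean> fraudApiCall) {
        this.breaker = breaker;
        this.fraudApiCall = fraudApiCall;
    }

    // Falls back to manual review whenever the guarded call does not succeed,
    // whether it was rejected by the breaker or failed on its own.
    String check() {
        boolean ok = breaker.execute("fraud-api", fraudApiCall);
        return ok ? "CHECKED" : "MANUAL_REVIEW";
    }
}
```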
Atomic Failure Recording with Lua
The most critical operation is recording a failure. We need to atomically increment the count AND check if we've hit the threshold AND transition to OPEN if so. A Lua script makes this a single atomic operation:
local key = KEYS[1]
local threshold = tonumber(ARGV[1])
local now = ARGV[2]

local count = redis.call('HINCRBY', key, 'failureCount', 1)
redis.call('HSET', key, 'lastFailureTime', now)

if count >= threshold then
    redis.call('HSET', key, 'state', 'OPEN')
    redis.call('HSET', key, 'openedAt', now)
end

return count
💡 Tip: Without Lua, you'd need to do a read-modify-write cycle with WATCH/MULTI/EXEC. The Lua approach is simpler and faster — Redis executes the script atomically.
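To see what the atomicity guarantee buys without running a Redis server, the script's logic can be mirrored in memory. The following is a simulation, not Redis client code: SimulatedBreakerHash and its methods are hypothetical, and the synchronized keyword stands in for Redis running a script to completion before serving any other command.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for one Redis hash (simulation only, no Redis involved).
final class SimulatedBreakerHash {
    private final Map<String, String> fields = new HashMap<>();

    // Mirrors the Lua script: increment, stamp, and trip in one indivisible step.
    synchronized long recordFailure(int threshold, long nowMs) {
        long count = Long.parseLong(fields.getOrDefault("failureCount", "0")) + 1;
        fields.put("failureCount", Long.toString(count));
        fields.put("lastFailureTime", Long.toString(nowMs));
        if (count >= threshold) {
            fields.put("state", "OPEN");
            fields.put("openedAt", Long.toString(nowMs));
        }
        return count;
    }

    synchronized String get(String field) {
        return fields.get(field);
    }

    // Demonstration helper: `threads` workers each record `perThread` failures.
    static SimulatedBreakerHash runConcurrently(int threshold, int threads, int perThread) {
        SimulatedBreakerHash hash = new SimulatedBreakerHash();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    hash.recordFailure(threshold, System.currentTimeMillis());
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return hash;
    }
}
```

With 4 workers recording 10 failures each and a threshold of 40, the hash always ends with failureCount at 40 and state OPEN. Removing synchronized from recordFailure makes both outcomes unreliable, which is the in-process analogue of doing HINCRBY and the threshold check as separate round trips.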
The OPEN → HALF_OPEN Transition
This is time-based, not event-based. When we check the state and see OPEN, we also check whether recoveryTimeoutMs has elapsed since openedAt:
private String resolveState(String serviceId) {
    String state = redis.hget(key(serviceId), "state");
    // A service with no hash yet has never failed, so treat it as CLOSED.
    if (state == null) return "CLOSED";

    if ("OPEN".equals(state)) {
        long openedAt = Long.parseLong(redis.hget(key(serviceId), "openedAt"));
        // Lazy transition: the first reader after the timeout moves the circuit to HALF_OPEN.
        if (System.currentTimeMillis() - openedAt >= recoveryTimeoutMs) {
            redis.hset(key(serviceId), "state", "HALF_OPEN");
            redis.hset(key(serviceId), "halfOpenAttempts", "0");
            return "HALF_OPEN";
        }
    }
    return state;
}
⚠️ Trap: The grader's testHalfOpenAfterTimeout configures a 200ms recovery timeout and sleeps 250ms before checking. Your time math must use >=, not >, otherwise you'll fail intermittently.
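Pulling the comparison into a pure function makes the boundary testable without sleeping. This is a minimal sketch; RecoveryTimeout and hasElapsed are hypothetical names, and the caller is assumed to pass in a timestamp it captured at decision time.

```java
// Hypothetical helper: a pure function, so the boundary can be tested without sleeping.
final class RecoveryTimeout {
    private RecoveryTimeout() {}

    // True once at least recoveryTimeoutMs has passed since the circuit opened.
    // Uses >= so a check landing exactly on the boundary counts as elapsed.
    static boolean hasElapsed(long openedAtMs, long nowMs, long recoveryTimeoutMs) {
        return nowMs - openedAtMs >= recoveryTimeoutMs;
    }
}
```

For a circuit opened at t = 1000 with a 200ms timeout, hasElapsed is false at t = 1199 and true at t = 1200 and later.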
HALF_OPEN: The Cautious Test
In HALF_OPEN, we allow at most halfOpenMaxAttempts test calls. The first success closes the circuit; the first failure reopens it immediately:
if ("HALF_OPEN".equals(state)) {
    String key = key(serviceId);

    // Atomically claim a test slot
    if (!claimHalfOpenSlot(serviceId)) {
        return false; // Budget exhausted
    }

    boolean success = invokeAction(action);
    if (success) {
        // Reset everything → CLOSED
        redis.hset(key, "state", "CLOSED");
        redis.hset(key, "failureCount", "0");
        return true;
    } else {
        // Back to OPEN with fresh timestamp
        redis.hset(key, "state", "OPEN");
        redis.hset(key, "openedAt", String.valueOf(System.currentTimeMillis()));
        return false;
    }
}
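The snippet above calls claimHalfOpenSlot without showing it. One plausible way to implement it is to increment the halfOpenAttempts field and compare the returned value against the budget. The sketch below illustrates that idea in memory only: HalfOpenBudget is a hypothetical class, the AtomicInteger stands in for the Redis field, and incrementAndGet stands in for HINCRBY, which likewise returns the new value.

```java
import java.util.concurrent.atomic.AtomicInteger;

// In-memory stand-in for the halfOpenAttempts field (illustration, not Redis code).
final class HalfOpenBudget {
    private final int maxAttempts;
    private final AtomicInteger attempts = new AtomicInteger();

    HalfOpenBudget(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Each caller receives a distinct counter value, so at most maxAttempts
    // callers ever see true, regardless of thread interleaving.
    boolean claimSlot() {
        return attempts.incrementAndGet() <= maxAttempts;
    }

    // Called when the circuit re-enters HALF_OPEN and the budget starts over.
    void reset() {
        attempts.set(0);
    }
}
```

Incrementing first and comparing afterwards avoids the check-then-act race: a read followed by a separate write would let two callers both observe a free slot and both proceed.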
5. Performance + Concurrency
Atomic Operations Are Essential
The grader runs a testConcurrentAccess test that fires 4 threads × 10 failures each against a threshold of 50. Without atomic failure counting (Lua scripts or transactions), the count will be corrupted under concurrent access.
Fail-Fast Latency
The whole point of a circuit breaker is that OPEN-state rejections are fast. The grader benchmarks this: 2000+ ops/sec with p99 < 5ms in OPEN state. Since we just check a Redis hash field and return false, this is trivially achievable.
CLOSED-State Throughput
In normal operation (CLOSED), each successful call costs one state read plus one Redis HSET (to reset the failure count). The grader expects 500+ ops/sec with p99 < 50ms, which is well within Redis's capabilities.
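If you want to check these targets locally, throughput and p99 can be computed from recorded per-call latencies. The sketch below uses the nearest-rank definition of a percentile, which is an assumption: the grader's exact measurement method is not described here, and LatencyStats is a hypothetical helper.

```java
import java.util.Arrays;

// Hypothetical helper for local benchmarking; not the grader's implementation.
final class LatencyStats {
    private LatencyStats() {}

    // Nearest-rank percentile: the smallest sample with at least p percent
    // of all samples less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Operations per second for a run measured with System.nanoTime().
    static double opsPerSecond(int operations, long elapsedNanos) {
        return operations * 1_000_000_000.0 / elapsedNanos;
    }
}
```

Record one latency per execute call, then compare percentile(samples, 99) and opsPerSecond against the thresholds for the state you are benchmarking.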
6. What the Grader Checks
| Test | What It Verifies |
|---|---|
| testClosedStateAllowsExecution | Successful calls return true, state stays CLOSED |
| testFailureCountIncrementsOnException | Each exception increments the counter |
| testSuccessResetsFailureCount | A success zeros the consecutive failure count |
| testTransitionToOpen | Exactly at threshold, state flips to OPEN |
| testOpenStateRejectsRequests | OPEN state returns false WITHOUT invoking the action |
| testHalfOpenAfterTimeout | After recovery timeout, state becomes HALF_OPEN |
| testHalfOpenSuccessCloses | Successful test call → CLOSED, count reset |
| testHalfOpenFailureReopens | Failed test call → back to OPEN |
| testIndependentServices | Tripping svc-a doesn't affect svc-b |
| testConcurrentAccess | 4 threads × 10 failures → breaker trips correctly |
7. Takeaways
- Lua scripts are your friend for atomicity. Any time you need read-check-write in Redis, wrap it in a Lua script. It's cleaner than WATCH/MULTI/EXEC and impossible to have a race condition within the script.
- Time-based transitions need careful comparison. Use >= not > for timeout checks, and always base time comparisons on System.currentTimeMillis() captured at decision time, not a cached value.
- The HALF_OPEN state is what separates a circuit breaker from a kill switch. Without it, you need manual intervention to restore traffic. With it, recovery is automatic.
👉 Try it yourself: Circuit Breaker on Cruscible