Building a Circuit Breaker from Scratch

Stop cascading failures before they take down your entire system — with Redis-backed state and atomic transitions

By SysAdmin · Published 2026-05-27

When a downstream service goes down, the worst thing you can do is keep hammering it. A circuit breaker detects persistent failures, short-circuits requests, and gives the failing service time to recover.

1. The Problem

Imagine your payment service calls a fraud-detection API. The API starts timing out — 5 seconds per call. Without protection, every checkout request now takes 5+ seconds, your thread pool fills up, and your entire application grinds to a halt. This is a cascading failure.

A circuit breaker sits between your code and the downstream call. After a configurable number of consecutive failures, it "trips" — immediately rejecting all calls without even attempting the request. After a recovery timeout, it cautiously allows a few test requests through. If they succeed, the circuit closes and normal traffic resumes.

Our challenge: implement the three-state circuit breaker pattern (CLOSED → OPEN → HALF_OPEN) with Redis-backed state and atomic transitions.

2. The Naïve Approach (and Why It Fails)

private int failureCount = 0;
private boolean isOpen = false;

public boolean execute(Callable<Boolean> action) {
    if (isOpen) return false;
    try {
        return action.call();
    } catch (Exception e) {
        failureCount++;
        if (failureCount >= threshold) isOpen = true;
        return false;
    }
}

This has three critical flaws:

  1. No recovery path: once open, it stays open forever. There's no HALF_OPEN state to test whether the service has recovered.
  2. Not thread-safe: failureCount++ is not atomic. Under concurrent load, simultaneous increments can overwrite each other, so failures go uncounted.
  3. Single-instance only: if you have 3 app servers, each keeps its own count. With a threshold of 3, Server A sees 2 failures and Server B sees 2 failures, but neither trips the breaker even though there have been 4 failures in total.
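
Flaw 2 on its own can be addressed in-process with an atomic counter. The following is a minimal sketch, assuming a hypothetical FailureCounterDemo class that is not part of the original code. It fixes only the lost-update problem; the missing recovery path and the per-server counts remain.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the article's implementation.
class FailureCounterDemo {
    // Starts `threads` workers that each record `perThread` failures,
    // waits for them, and returns the final counter value.
    static int countAtomically(int threads, int perThread) {
        AtomicInteger failureCount = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    failureCount.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return failureCount.get();
    }
}
```

With a plain int and failureCount++, the same workload can finish below threads × perThread because two threads may read the same value before either writes it back.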

3. The Right Model

Circuit breaker state lives in Redis, shared across all instances:

Redis Hash:  cb:{serviceId}
  state          → CLOSED | OPEN | HALF_OPEN
  failureCount   → integer
  lastFailureTime → epoch-ms
  openedAt       → epoch-ms (when the circuit tripped)
  halfOpenAttempts → integer (test calls allowed in HALF_OPEN)
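
To make the layout concrete, here is a small sketch that turns the field map of one such hash into a typed value. BreakerSnapshot and its methods are hypothetical names, and the sketch assumes the braces in cb:{serviceId} are a placeholder rather than literal characters in the key.

```java
import java.util.Map;

// Hypothetical value type mirroring the cb:{serviceId} hash; names are illustrative.
final class BreakerSnapshot {
    final String state;
    final int failureCount;
    final long lastFailureTime;
    final long openedAt;
    final int halfOpenAttempts;

    private BreakerSnapshot(String state, int failureCount, long lastFailureTime,
                            long openedAt, int halfOpenAttempts) {
        this.state = state;
        this.failureCount = failureCount;
        this.lastFailureTime = lastFailureTime;
        this.openedAt = openedAt;
        this.halfOpenAttempts = halfOpenAttempts;
    }

    // Assumes the braces in the layout above are a placeholder, so svc-a maps to cb:svc-a.
    static String key(String serviceId) {
        return "cb:" + serviceId;
    }

    // Builds a snapshot from the field map that reading the whole hash would return.
    // Missing fields fall back to CLOSED / zero, so an absent hash reads as a fresh breaker.
    static BreakerSnapshot fromHash(Map<String, String> fields) {
        return new BreakerSnapshot(
            fields.getOrDefault("state", "CLOSED"),
            Integer.parseInt(fields.getOrDefault("failureCount", "0")),
            Long.parseLong(fields.getOrDefault("lastFailureTime", "0")),
            Long.parseLong(fields.getOrDefault("openedAt", "0")),
            Integer.parseInt(fields.getOrDefault("halfOpenAttempts", "0")));
    }
}
```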

State Machine

  ┌─────────┐  failure threshold  ┌──────┐  recovery timeout  ┌───────────┐
  │ CLOSED  │ ────────────────► │ OPEN │ ──────────────────► │ HALF_OPEN │
  └─────────┘                    └──────┘                     └───────────┘
       ▲                                                          │    │
       │              success                                     │    │
       └──────────────────────────────────────────────────────────┘    │
                                                     failure          │
                                              ┌──────┐◄──────────────┘
                                              │ OPEN │
                                              └──────┘
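
The transitions in the diagram can be sketched as pure functions, which makes them easy to unit-test before any Redis code exists. The enum and method names below are hypothetical, and the sketch deliberately leaves out storage and concurrency.

```java
// Hypothetical names; a pure-Java sketch of the diagram's transitions with no Redis involved.
enum BreakerState {
    CLOSED, OPEN, HALF_OPEN;

    // A failure trips CLOSED once the updated count reaches the threshold,
    // and always sends HALF_OPEN back to OPEN.
    BreakerState onFailure(int failureCountAfter, int threshold) {
        if (this == CLOSED) {
            return failureCountAfter >= threshold ? OPEN : CLOSED;
        }
        return OPEN;
    }

    // A success closes HALF_OPEN and keeps CLOSED closed.
    // OPEN never runs the action, so it has no success transition.
    BreakerState onSuccess() {
        return this == OPEN ? OPEN : CLOSED;
    }

    // OPEN becomes HALF_OPEN once the recovery timeout has elapsed.
    BreakerState onTimeCheck(long nowMs, long openedAtMs, long recoveryTimeoutMs) {
        if (this == OPEN && nowMs - openedAtMs >= recoveryTimeoutMs) {
            return HALF_OPEN;
        }
        return this;
    }
}
```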

4. The Implementation, Walked Through

The Interface

public interface CircuitBreakerContract {
    void configure(int failureThreshold, long recoveryTimeoutMs, int halfOpenMaxAttempts);
    boolean execute(String serviceId, Callable<Boolean> action);
    String getState(String serviceId);
    int getFailureCount(String serviceId);
    long getLastFailureTimestamp(String serviceId);
    void reset(String serviceId);
}

Atomic Failure Recording with Lua

The most critical operation is recording a failure. We need to atomically increment the count AND check if we've hit the threshold AND transition to OPEN if so. A Lua script makes this a single atomic operation:

local key = KEYS[1]
local threshold = tonumber(ARGV[1])
local now = ARGV[2]
local count = redis.call('HINCRBY', key, 'failureCount', 1)
redis.call('HSET', key, 'lastFailureTime', now)
if count >= threshold then
    redis.call('HSET', key, 'state', 'OPEN')
    redis.call('HSET', key, 'openedAt', now)
end
return count

💡 Tip: Without Lua, you'd need to do a read-modify-write cycle with WATCH/MULTI/EXEC. The Lua approach is simpler and faster — Redis executes the script atomically.
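
On the Java side, calling the script might look like the sketch below. ScriptRunner is a hypothetical one-method abstraction over whatever script-execution (EVAL) call your Redis client provides, and FailureRecorder is an illustrative name. The sketch also assumes the client returns the script's integer reply as a Number, which is common but worth checking for your library.

```java
import java.util.List;

// Hypothetical abstraction over a Redis client's EVAL command.
interface ScriptRunner {
    Object eval(String script, List<String> keys, List<String> args);
}

class FailureRecorder {
    // The Lua script from the article, kept verbatim as a Java string.
    static final String RECORD_FAILURE = String.join("\n",
        "local key = KEYS[1]",
        "local threshold = tonumber(ARGV[1])",
        "local now = ARGV[2]",
        "local count = redis.call('HINCRBY', key, 'failureCount', 1)",
        "redis.call('HSET', key, 'lastFailureTime', now)",
        "if count >= threshold then",
        "    redis.call('HSET', key, 'state', 'OPEN')",
        "    redis.call('HSET', key, 'openedAt', now)",
        "end",
        "return count");

    private final ScriptRunner redis;
    private final int failureThreshold;

    FailureRecorder(ScriptRunner redis, int failureThreshold) {
        this.redis = redis;
        this.failureThreshold = failureThreshold;
    }

    // Records one failure and returns the failure count after the increment.
    // KEYS[1] is the breaker hash; ARGV carries the threshold and the current time.
    long recordFailure(String serviceId, long nowMs) {
        Object reply = redis.eval(
            RECORD_FAILURE,
            List.of("cb:" + serviceId),
            List.of(String.valueOf(failureThreshold), String.valueOf(nowMs)));
        return ((Number) reply).longValue(); // assumption: integer replies arrive as a Number
    }
}
```

Passing the timestamp in as an argument keeps the script deterministic and lets tests control time.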

The OPEN → HALF_OPEN Transition

This is time-based, not event-based. When we check the state and see OPEN, we also check whether recoveryTimeoutMs has elapsed since openedAt:

private String resolveState(String serviceId) {
    String key = key(serviceId);
    String state = redis.hget(key, "state");
    if (state == null) return "CLOSED"; // No hash yet: treat as a fresh breaker

    if ("OPEN".equals(state)) {
        long openedAt = Long.parseLong(redis.hget(key, "openedAt"));
        // >= so the circuit becomes eligible exactly at the timeout boundary
        if (System.currentTimeMillis() - openedAt >= recoveryTimeoutMs) {
            redis.hset(key, "state", "HALF_OPEN");
            redis.hset(key, "halfOpenAttempts", "0"); // Fresh budget of test calls
            return "HALF_OPEN";
        }
    }
    return state;
}

⚠️ Trap: The grader's testHalfOpenAfterTimeout configures a 200ms recovery timeout and sleeps 250ms before checking. Your time math must use >=, not >, otherwise you'll fail intermittently.
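
One way to test the boundary without sleeping is to inject the clock. This is a sketch under the assumption that you are free to restructure the time check; RecoveryTimer is a hypothetical helper, not part of the original implementation.

```java
import java.util.function.LongSupplier;

// Hypothetical helper; it isolates the time comparison behind an injectable clock.
class RecoveryTimer {
    private final LongSupplier clock;
    private final long recoveryTimeoutMs;

    RecoveryTimer(LongSupplier clock, long recoveryTimeoutMs) {
        this.clock = clock;
        this.recoveryTimeoutMs = recoveryTimeoutMs;
    }

    // True once at least recoveryTimeoutMs has passed since openedAtMs.
    // The clock is read on every call, so the decision never uses a cached time.
    boolean recoveryElapsed(long openedAtMs) {
        return clock.getAsLong() - openedAtMs >= recoveryTimeoutMs;
    }
}
```

In production the clock would be System::currentTimeMillis; in a unit test, a lambda over a mutable value lets you step time to exactly openedAt + recoveryTimeoutMs and confirm the comparison is inclusive.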

HALF_OPEN: The Cautious Test

In HALF_OPEN, we allow at most halfOpenMaxAttempts test calls. If a test call succeeds, we close the circuit. If one fails, we reopen immediately:

if ("HALF_OPEN".equals(state)) {
    // Atomically claim a test slot
    if (!claimHalfOpenSlot(serviceId)) {
        return false; // Budget exhausted
    }
    boolean success = invokeAction(action);
    if (success) {
        // Reset everything → CLOSED
        redis.hset(key, "state", "CLOSED");
        redis.hset(key, "failureCount", "0");
        return true;
    } else {
        // Back to OPEN with fresh timestamp
        redis.hset(key, "state", "OPEN");
        redis.hset(key, "openedAt", String.valueOf(System.currentTimeMillis()));
        return false;
    }
}
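
The snippet above calls claimHalfOpenSlot without showing it. One possible shape, offered as an assumption rather than the original implementation, relies on HINCRBY being atomic and returning the value after the increment. HashCounter is a hypothetical one-method view of a Redis client, and InMemoryHashCounter is a stand-in for local experiments only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical one-method view of a Redis client; mirrors HINCRBY's return value.
interface HashCounter {
    long hincrBy(String key, String field, long delta);
}

class HalfOpenGate {
    private final HashCounter redis;
    private final int halfOpenMaxAttempts;

    HalfOpenGate(HashCounter redis, int halfOpenMaxAttempts) {
        this.redis = redis;
        this.halfOpenMaxAttempts = halfOpenMaxAttempts;
    }

    // Claims one test slot. Each caller gets a distinct post-increment value,
    // so only the first halfOpenMaxAttempts callers are admitted.
    // Assumes halfOpenAttempts was reset to 0 when the circuit entered HALF_OPEN.
    boolean claimHalfOpenSlot(String serviceId) {
        long attempt = redis.hincrBy("cb:" + serviceId, "halfOpenAttempts", 1);
        return attempt <= halfOpenMaxAttempts;
    }
}

// In-memory stand-in for local experiments; not a Redis client.
class InMemoryHashCounter implements HashCounter {
    private final Map<String, AtomicLong> values = new ConcurrentHashMap<>();

    @Override
    public long hincrBy(String key, String field, long delta) {
        return values.computeIfAbsent(key + "#" + field, k -> new AtomicLong())
                     .addAndGet(delta);
    }
}
```

The counter keeps growing past the budget when extra callers arrive, which is harmless here because only the comparison against halfOpenMaxAttempts matters.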

5. Performance + Concurrency

Atomic Operations Are Essential

The grader runs a testConcurrentAccess test that fires 4 threads × 10 failures each against a threshold of 50. Without atomic failure counting (Lua scripts or transactions), the count will be corrupted under concurrent access.

Fail-Fast Latency

The whole point of a circuit breaker is that OPEN-state rejections are fast. The grader benchmarks this: 2000+ ops/sec with p99 < 5ms in OPEN state. Since we just check a Redis hash field and return false, this is trivially achievable.

CLOSED-State Throughput

In normal operation (CLOSED), the breaker adds a state read plus one Redis HSET on success (to reset the failure count). The grader expects 500+ ops/sec with p99 < 50ms, well within Redis's capabilities.

6. What the Grader Checks

Test                                   What It Verifies
testClosedStateAllowsExecution         Successful calls return true, state stays CLOSED
testFailureCountIncrementsOnException  Each exception increments the counter
testSuccessResetsFailureCount          A success zeros the consecutive failure count
testTransitionToOpen                   Exactly at threshold, state flips to OPEN
testOpenStateRejectsRequests           OPEN state returns false WITHOUT invoking the action
testHalfOpenAfterTimeout               After recovery timeout, state becomes HALF_OPEN
testHalfOpenSuccessCloses              Successful test call → CLOSED, count reset
testHalfOpenFailureReopens             Failed test call → back to OPEN
testIndependentServices                Tripping svc-a doesn't affect svc-b
testConcurrentAccess                   4 threads × 10 failures → breaker trips correctly

7. Takeaways

  1. Lua scripts are your friend for atomicity. Any time you need read-check-write in Redis, wrap it in a Lua script. It's cleaner than WATCH/MULTI/EXEC, and because Redis runs the script atomically, no other command can interleave with it.
  2. Time-based transitions need careful comparison. Use >= not > for timeout checks, and always base time comparisons on System.currentTimeMillis() captured at decision time, not a cached value.
  3. The HALF_OPEN state is what separates a circuit breaker from a kill switch. Without it, you need manual intervention to restore traffic. With it, recovery is automatic.

👉 Try it yourself: Circuit Breaker on Cruscible