Building a Circuit Breaker from Scratch

Stop cascading failures before they take down your entire system — with Redis-backed state and atomic transitions

By SysAdmin · Published 2026-05-27

When a downstream service goes down, the worst thing you can do is keep hammering it. A circuit breaker detects persistent failures, short-circuits requests, and gives the failing service time to recover.

1. The Problem

Imagine your payment service calls a fraud-detection API. The API starts timing out — 5 seconds per call. Without protection, every checkout request now takes 5+ seconds, your thread pool fills up, and your entire application grinds to a halt. This is a cascading failure.

A circuit breaker sits between your code and the downstream call. After a configurable number of consecutive failures, it "trips" — immediately rejecting all calls without even attempting the request. After a recovery timeout, it cautiously allows a few test requests through. If they succeed, the circuit closes and normal traffic resumes.

Our challenge: implement the three-state circuit breaker pattern (CLOSED → OPEN → HALF_OPEN) with:

Per-service failure tracking
Configurable thresholds and recovery timeouts
Redis-backed state for multi-instance consistency
Atomic state transitions under concurrent load

2. The Naïve Approach (and Why It Fails)

private int failureCount = 0;
private boolean isOpen = false;

public boolean execute(Callable<Boolean> action) {
    if (isOpen) return false;
    try {
        return action.call();
    } catch (Exception e) {
        failureCount++;
        if (failureCount >= threshold) isOpen = true;
        return false;
    }
}

This has three critical flaws:

No recovery path: once open, it stays open forever. There's no HALF_OPEN state to test whether the service has recovered.
Not thread-safe: failureCount++ is a read-modify-write, not an atomic operation. Under concurrent load, two threads can read the same value and one increment is lost. The plain isOpen field also gives no guarantee that other threads see the update promptly.
Single-instance only: if you have 3 app servers, each has its own count. With a threshold of 3, Server A sees 2 failures and Server B sees 2 failures, but neither trips the breaker even though there have been 4 failures in total.
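
The single-instance flaw can be made concrete with a small simulation. The sketch below assumes a threshold of 3 and uses an in-memory counter as a stand-in for shared state; the class and method names are hypothetical and exist only for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: compares per-instance counters with one shared counter. */
final class CounterScopeDemo {
    static final int THRESHOLD = 3; // assumed threshold for this example

    /** A breaker whose failure counter may or may not be shared with other instances. */
    static final class Breaker {
        private final AtomicInteger failures;
        Breaker(AtomicInteger failures) { this.failures = failures; }
        void recordFailure() { failures.incrementAndGet(); }
        boolean isOpen() { return failures.get() >= THRESHOLD; }
    }

    /** Two servers, each with its own counter, each seeing 2 failures. */
    static boolean anyTrippedWithLocalCounters() {
        Breaker a = new Breaker(new AtomicInteger());
        Breaker b = new Breaker(new AtomicInteger());
        a.recordFailure(); a.recordFailure();
        b.recordFailure(); b.recordFailure();
        return a.isOpen() || b.isOpen();
    }

    /** Same traffic, but both servers increment one shared counter (standing in for Redis). */
    static boolean anyTrippedWithSharedCounter() {
        AtomicInteger shared = new AtomicInteger();
        Breaker a = new Breaker(shared);
        Breaker b = new Breaker(shared);
        a.recordFailure(); a.recordFailure();
        b.recordFailure(); b.recordFailure();
        return a.isOpen() || b.isOpen();
    }

    public static void main(String[] args) {
        System.out.println("local counters tripped: " + anyTrippedWithLocalCounters());
        System.out.println("shared counter tripped: " + anyTrippedWithSharedCounter());
    }
}
```

With local counters neither breaker reaches the threshold; with the shared counter the fourth failure pushes the total past it.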

3. The Right Model

Circuit breaker state lives in Redis, shared across all instances:

Redis Hash:  cb:{serviceId}
  state          → CLOSED | OPEN | HALF_OPEN
  failureCount   → integer
  lastFailureTime → epoch-ms
  openedAt       → epoch-ms (when the circuit tripped)
  halfOpenAttempts → integer (test calls claimed so far in HALF_OPEN)

State Machine

  ┌─────────┐  failure threshold  ┌──────┐  recovery timeout  ┌───────────┐
  │ CLOSED  │ ────────────────► │ OPEN │ ──────────────────► │ HALF_OPEN │
  └─────────┘                    └──────┘                     └───────────┘
       ▲                                                          │    │
       │              success                                     │    │
       └──────────────────────────────────────────────────────────┘    │
                                                     failure          │
                                              ┌──────┐◄──────────────┘
                                              │ OPEN │
                                              └──────┘
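
The diagram can also be read as a transition function. The following is a minimal sketch of that reading; the enum and event names are hypothetical, chosen for this example rather than taken from a reference implementation.

```java
/** Illustrative sketch of the three-state machine; all names are hypothetical. */
final class BreakerStateMachine {
    enum State { CLOSED, OPEN, HALF_OPEN }

    enum Event { THRESHOLD_REACHED, RECOVERY_TIMEOUT_ELAPSED, TEST_CALL_SUCCEEDED, TEST_CALL_FAILED }

    /** Returns the next state, or the current state if the event does not apply to it. */
    static State next(State current, Event event) {
        switch (current) {
            case CLOSED:
                return event == Event.THRESHOLD_REACHED ? State.OPEN : current;
            case OPEN:
                return event == Event.RECOVERY_TIMEOUT_ELAPSED ? State.HALF_OPEN : current;
            case HALF_OPEN:
                if (event == Event.TEST_CALL_SUCCEEDED) return State.CLOSED;
                if (event == Event.TEST_CALL_FAILED) return State.OPEN;
                return current;
            default:
                throw new IllegalStateException("unknown state: " + current);
        }
    }
}
```

Keeping the transitions in one pure function makes it easy to check that no edge exists other than the four drawn above.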

4. The Implementation, Walked Through

The Interface

public interface CircuitBreakerContract {
    void configure(int failureThreshold, long recoveryTimeoutMs, int halfOpenMaxAttempts);
    boolean execute(String serviceId, Callable<Boolean> action);
    String getState(String serviceId);
    int getFailureCount(String serviceId);
    long getLastFailureTimestamp(String serviceId);
    void reset(String serviceId);
}

Atomic Failure Recording with Lua

The most critical operation is recording a failure. We need to atomically increment the count AND check if we've hit the threshold AND transition to OPEN if so. A Lua script makes this a single atomic operation:

local key = KEYS[1]
local threshold = tonumber(ARGV[1])
local now = ARGV[2]
local count = redis.call('HINCRBY', key, 'failureCount', 1)
redis.call('HSET', key, 'lastFailureTime', now)
if count >= threshold then
    redis.call('HSET', key, 'state', 'OPEN')
    redis.call('HSET', key, 'openedAt', now)
end
return count

💡 Tip: Without Lua, you'd need to do a read-modify-write cycle with WATCH/MULTI/EXEC. The Lua approach is simpler and faster — Redis executes the script atomically.
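
For unit tests that should not depend on a running Redis, the script's logic can be mirrored in plain Java. The sketch below is an in-memory stand-in, not the Redis integration itself: FailureRecorderModel is a hypothetical class, and a synchronized method plays the role of Redis's atomic script execution.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the Lua script, intended for unit tests only. */
final class FailureRecorderModel {
    // Models the fields of the cb:{serviceId} hash for a single service.
    private final Map<String, String> hash = new HashMap<>();

    /** Mirrors the script: increment, stamp, and trip in one indivisible step. */
    synchronized long recordFailure(int threshold, long nowMs) {
        long count = Long.parseLong(hash.getOrDefault("failureCount", "0")) + 1;
        hash.put("failureCount", Long.toString(count));
        hash.put("lastFailureTime", Long.toString(nowMs));
        if (count >= threshold) {
            hash.put("state", "OPEN");
            hash.put("openedAt", Long.toString(nowMs));
        }
        return count;
    }

    /** Reads one field, or null if it has never been written. */
    synchronized String get(String field) {
        return hash.get(field);
    }
}
```

Like the script, the model leaves state unset until the threshold is reached, so a missing state field still reads as CLOSED.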

The OPEN → HALF_OPEN Transition

This is time-based, not event-based. When we check the state and see OPEN, we also check whether recoveryTimeoutMs has elapsed since openedAt:

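private String resolveState(String serviceId) {
    String k = key(serviceId);
    String state = redis.hget(k, "state");
    if (state == null) return "CLOSED"; // no hash yet: nothing has failed for this service

    if ("OPEN".equals(state)) {
        long openedAt = Long.parseLong(redis.hget(k, "openedAt"));
        if (System.currentTimeMillis() - openedAt >= recoveryTimeoutMs) {
            // Caveat: these writes are not atomic with the reads above. Two instances
            // can both perform the transition, and the later one resets halfOpenAttempts
            // after the earlier one has claimed a slot. Move this check-and-set into a
            // Lua script if the half-open budget must be strict.
            redis.hset(k, "state", "HALF_OPEN");
            redis.hset(k, "halfOpenAttempts", "0");
            return "HALF_OPEN";
        }
    }
    return state;
}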

⚠️ Trap: The grader's testHalfOpenAfterTimeout configures a 200ms recovery timeout and sleeps 250ms before checking. Compare elapsed time with >=, not >, so that a check landing exactly on the timeout boundary still transitions; otherwise you risk intermittent failures.
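
The boundary behaviour is easier to verify when the comparison is isolated from the clock. The sketch below assumes a hypothetical helper, hasRecoveryElapsed, that takes the current time as a parameter so tests can exercise the exact boundary without sleeping.

```java
/** Illustrative helper; the class and method names are hypothetical. */
final class RecoveryCheck {
    /** True once at least recoveryTimeoutMs has passed since the circuit opened. */
    static boolean hasRecoveryElapsed(long openedAtMs, long nowMs, long recoveryTimeoutMs) {
        return nowMs - openedAtMs >= recoveryTimeoutMs;
    }
}
```

With a 200ms timeout, hasRecoveryElapsed(1000, 1199, 200) is false while hasRecoveryElapsed(1000, 1200, 200) is true; a strict > would reject the second case.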

HALF_OPEN: The Cautious Test

In HALF_OPEN, we allow at most halfOpenMaxAttempts test calls. If any succeeds, we close the circuit. If one fails, we reopen immediately:

if ("HALF_OPEN".equals(state)) {
    String k = key(serviceId); // same hash key that resolveState reads
    // Atomically claim a test slot
    if (!claimHalfOpenSlot(serviceId)) {
        return false; // Budget exhausted
    }
    boolean success = invokeAction(action);
    if (success) {
        // Reset everything → CLOSED
        redis.hset(k, "state", "CLOSED");
        redis.hset(k, "failureCount", "0");
        return true;
    } else {
        // Back to OPEN with fresh timestamp
        redis.hset(k, "state", "OPEN");
        redis.hset(k, "openedAt", String.valueOf(System.currentTimeMillis()));
        return false;
    }
}
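
The snippet above calls claimHalfOpenSlot without showing it. One plausible shape is increment-then-compare: bump the attempt counter atomically and grant the slot only if the new value is within budget. The sketch below models that with an in-memory AtomicInteger standing in for an atomic increment of halfOpenAttempts in Redis; the HalfOpenSlots class and resetSlots method are assumptions made for this example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory sketch of a half-open slot budget; names are hypothetical. */
final class HalfOpenSlots {
    private final int maxAttempts;
    private final ConcurrentHashMap<String, AtomicInteger> attempts = new ConcurrentHashMap<>();

    HalfOpenSlots(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Increment first, then compare: the atomic increment decides who gets a slot. */
    boolean claimHalfOpenSlot(String serviceId) {
        int claimed = attempts
                .computeIfAbsent(serviceId, id -> new AtomicInteger())
                .incrementAndGet();
        return claimed <= maxAttempts;
    }

    /** Called when the circuit re-enters HALF_OPEN so the budget starts fresh. */
    void resetSlots(String serviceId) {
        attempts.put(serviceId, new AtomicInteger());
    }
}
```

Incrementing before checking avoids the check-then-act race in which two callers both see one slot left and both proceed.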

5. Performance + Concurrency

Atomic Operations Are Essential

The grader runs a testConcurrentAccess test that fires 4 threads × 10 failures each against a threshold of 50. Without atomic failure counting (Lua scripts or transactions), the count will be corrupted under concurrent access.
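
A local harness in the same spirit can catch lost updates before the grader does. The sketch below is not the grader's test: ConcurrentFailureHarness is a hypothetical class, and an AtomicInteger stands in for the atomic increment on the shared Redis hash.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative concurrency harness; names are hypothetical. */
final class ConcurrentFailureHarness {
    /** Runs `threads` workers that each record `failuresPerThread` failures; returns the final count. */
    static int run(int threads, int failuresPerThread) {
        AtomicInteger failureCount = new AtomicInteger(); // stand-in for the shared counter
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all workers at once to maximise contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < failuresPerThread; i++) {
                    failureCount.incrementAndGet();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return failureCount.get();
    }
}
```

With atomic increments, run(4, 10) always returns 40. Swapping the AtomicInteger for a plain int field is a quick way to see how an unsynchronised counter can come up short.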

Fail-Fast Latency

The whole point of a circuit breaker is that OPEN-state rejections are fast. The grader benchmarks this: 2000+ ops/sec with p99 < 5ms in OPEN state. Since an OPEN-state call only reads the state and openedAt hash fields and returns false without invoking the action, this is trivially achievable.

CLOSED-State Throughput

In normal operation (CLOSED), each successful call costs one state read plus one Redis HSET to reset the failure count. The grader expects 500+ ops/sec with p99 < 50ms, well within Redis's capabilities.

6. What the Grader Checks

Test	What It Verifies
`testClosedStateAllowsExecution`	Successful calls return true, state stays CLOSED
`testFailureCountIncrementsOnException`	Each exception increments the counter
`testSuccessResetsFailureCount`	A success zeros the consecutive failure count
`testTransitionToOpen`	Exactly at threshold, state flips to OPEN
`testOpenStateRejectsRequests`	OPEN state returns false WITHOUT invoking the action
`testHalfOpenAfterTimeout`	After recovery timeout, state becomes HALF_OPEN
`testHalfOpenSuccessCloses`	Successful test call → CLOSED, count reset
`testHalfOpenFailureReopens`	Failed test call → back to OPEN
`testIndependentServices`	Tripping svc-a doesn't affect svc-b
`testConcurrentAccess`	4 threads × 10 failures → breaker trips correctly

7. Takeaways

Lua scripts are your friend for atomicity. Any time you need read-check-write in Redis, wrap it in a Lua script. It's cleaner than WATCH/MULTI/EXEC, and because Redis runs each script atomically, no other command can interleave with it.

Time-based transitions need careful comparison. Use >= not > for timeout checks, and always base time comparisons on System.currentTimeMillis() captured at decision time — not a cached value.

The HALF_OPEN state is what separates a circuit breaker from a kill switch. Without it, you need manual intervention to restore traffic. With it, recovery is automatic.

👉 Try it yourself: Circuit Breaker on Cruscible