Building a Retry Engine with Exponential Backoff

Handle transient failures gracefully with configurable retries, jitter, and per-operation state tracking

By SysAdmin · Published 2026-05-27

Network calls fail. Databases go down temporarily. External APIs hit rate limits. A well-designed retry engine handles these gracefully — and building one from scratch teaches you exactly how.

1. The Problem

In production systems, transient failures are the norm. A database connection drops for 200ms, an HTTP call times out, a rate limiter rejects your request. Libraries like Resilience4j and Spring Retry implement the retry-with-backoff pattern — but do you actually understand what's happening under the hood?

Our challenge: build a configurable retry engine that:

Retries failed operations with exponential backoff
Adds jitter (±10% randomization) to prevent thundering herd
Tracks per-operation state (attempt count, status, timestamps, errors)
Supports cancellation and reset
Is thread-safe for concurrent operations

2. The Naïve Approach (and Why It Fails)

The simplest retry looks like this:

for (int i = 0; i < maxRetries; i++) {
    try {
        if (action.call()) return true;
    } catch (Exception e) { /* ignore */ }
    Thread.sleep(1000); // fixed delay
}

This breaks in three ways:

Fixed delay — if 100 clients retry at the same time with the same 1-second delay, they all slam the server simultaneously again. This is the thundering herd problem.
No state tracking — you can't answer "how many attempts did operation X make?" or "what was the last error?"
No cancellation — once started, it runs to completion. If the user navigates away or the downstream service is confirmed dead, you're wasting resources.

3. The Right Model

We need per-operation state tracked independently:

private static class OperationState {
    volatile String status = "PENDING";     // PENDING → RETRYING → SUCCEEDED/FAILED/CANCELLED
    volatile int attemptCount = 0;
    volatile String lastError = null;
    volatile boolean cancelled = false;
    final List<Long> timestamps = Collections.synchronizedList(new ArrayList<>());
    // Snapshot of config at execution time
    volatile int configMaxRetries;
    volatile long configInitialDelayMs;
    volatile double configBackoffMultiplier;
    volatile long configMaxDelayMs;
}

The key insight: snapshot configuration at execution time. If someone calls configure() while an operation is mid-retry, the running operation keeps its original parameters. Only new execute() calls pick up the new config.

State is stored in a ConcurrentHashMap<String, OperationState> — no external dependencies needed.
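One way to make the snapshot harder to get wrong is to bundle the four parameters into a single immutable object, so that one volatile read captures all of them together. The sketch below only illustrates that idea. The names RetryConfig and SnapshotDemo are hypothetical and are not part of the engine described in this article.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical immutable holder for the four configure() parameters.
final class RetryConfig {
    final int maxRetries;
    final long initialDelayMs;
    final double backoffMultiplier;
    final long maxDelayMs;

    RetryConfig(int maxRetries, long initialDelayMs, double backoffMultiplier, long maxDelayMs) {
        this.maxRetries = maxRetries;
        this.initialDelayMs = initialDelayMs;
        this.backoffMultiplier = backoffMultiplier;
        this.maxDelayMs = maxDelayMs;
    }
}

// Hypothetical engine fragment: snapshot logic only, no retry loop.
final class SnapshotDemo {
    private volatile RetryConfig current = new RetryConfig(3, 100, 2.0, 5000);
    private final ConcurrentMap<String, RetryConfig> snapshots = new ConcurrentHashMap<>();

    // Replaces the whole config with a single volatile write.
    void configure(int maxRetries, long initialDelayMs, double backoffMultiplier, long maxDelayMs) {
        current = new RetryConfig(maxRetries, initialDelayMs, backoffMultiplier, maxDelayMs);
    }

    // Would be called at the start of execute(): one read captures all four values.
    RetryConfig beginExecution(String operationId) {
        RetryConfig snapshot = current;
        snapshots.put(operationId, snapshot);
        return snapshot;
    }

    RetryConfig snapshotFor(String operationId) {
        return snapshots.get(operationId);
    }
}
```

Because RetryConfig is never mutated, an operation that captured it before a configure() call keeps seeing its original values, and it can never observe a mix of old and new parameters.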

4. The Implementation, Walked Through

The Interface

public interface RetryEngineContract {
    void configure(int maxRetries, long initialDelayMs, double backoffMultiplier, long maxDelayMs);
    boolean execute(String operationId, Callable<Boolean> action);
    int getAttemptCount(String operationId);
    String getStatus(String operationId);
    String getLastError(String operationId);
    long getNextRetryDelayMs(String operationId);
    List<Long> getRetryTimestamps(String operationId);
    void cancel(String operationId);
    void reset(String operationId);
}

The Execute Loop

The core retry loop follows this pattern:

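public boolean execute(String operationId, Callable<Boolean> action) {
    OperationState state = operations.computeIfAbsent(operationId, k -> new OperationState());

    // Snapshot the full config so a later configure() cannot affect this run
    // (engine field names are assumed to mirror the configure() parameters)
    state.configMaxRetries = this.maxRetries;
    state.configInitialDelayMs = this.initialDelayMs;
    state.configBackoffMultiplier = this.backoffMultiplier;
    state.configMaxDelayMs = this.maxDelayMs;
    state.status = "RETRYING";
    state.attemptCount = 0;

    // One initial try plus the configured number of retries
    int totalAttempts = 1 + state.configMaxRetries;

    for (int attempt = 1; attempt <= totalAttempts; attempt++) {
        if (state.cancelled) {
            state.status = "CANCELLED";
            return false;
        }

        state.timestamps.add(System.currentTimeMillis());
        state.attemptCount = attempt;

        try {
            if (action.call()) {
                state.status = "SUCCEEDED";
                return true;
            }
        } catch (Exception e) {
            // A thrown exception counts as a failed attempt
            state.lastError = e.getMessage();
        }

        // Backoff sleep (not after last attempt)
        if (attempt < totalAttempts && !state.cancelled) {
            long delay = computeDelayWithJitter(attempt, state);
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ie) {
                // Thread.sleep throws a checked exception: restore the flag and stop retrying
                Thread.currentThread().interrupt();
                state.status = "CANCELLED";
                return false;
            }
        }
    }

    state.status = "FAILED";
    return false;
}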

⚠️ Trap: The totalAttempts calculation is 1 + maxRetries, not maxRetries. With maxRetries=3, you get 4 total attempts: 1 initial + 3 retries. The grader's testExhaustRetries test checks this explicitly.
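The off-by-one is easy to check in isolation. The sketch below is a stripped-down loop with no sleeping and no state map, under the hypothetical name AttemptCountDemo. It only counts how often an always-failing action is invoked.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class: isolates the attempt-counting logic only.
final class AttemptCountDemo {
    // Runs the action up to 1 + maxRetries times; returns true on the first success.
    static boolean runWithRetries(int maxRetries, Callable<Boolean> action) {
        int totalAttempts = 1 + maxRetries; // initial try + retries
        for (int attempt = 1; attempt <= totalAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(action.call())) {
                    return true;
                }
            } catch (Exception e) {
                // A thrown exception counts as a failed attempt.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        boolean ok = runWithRetries(3, () -> {
            calls.incrementAndGet();
            return false; // always fail
        });
        System.out.println(ok + " after " + calls.get() + " attempts"); // false after 4 attempts
    }
}
```

With maxRetries=0 the same loop runs exactly once, which matches the zero-retries case in the grader table.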

Exponential Backoff with Jitter

The backoff formula is:

delay(n) = min(initialDelayMs × multiplier^(n-1), maxDelayMs)
actualDelay = delay(n) × (1 + jitter)    where jitter ∈ [-0.10, +0.10]

In code:

private long computeDelay(int retryNumber, OperationState state) {
    double raw = state.configInitialDelayMs * Math.pow(state.configBackoffMultiplier, retryNumber - 1);
    return Math.min((long) raw, state.configMaxDelayMs);
}

With configure(3, 100, 2.0, 5000), the delays are:

Retry 1: 100 × 2^0 = 100ms
Retry 2: 100 × 2^1 = 200ms
Retry 3: 100 × 2^2 = 400ms

The jitter adds ±10% randomization, so retry 1 is actually 100 × (1 + random(-0.10, 0.10)) = somewhere between 90ms and 110ms.
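The execute loop calls computeDelayWithJitter, which is not shown above. A minimal sketch of how it might look follows, assuming the jitter is applied after the cap and drawn uniformly from [-0.10, +0.10). The class name BackoffDemo and the applyJitter helper are hypothetical, and the parameters are passed explicitly instead of through OperationState to keep the example self-contained.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical standalone version of the delay calculation.
final class BackoffDemo {
    // Capped exponential delay: min(initial * multiplier^(n-1), max).
    static long computeDelay(int retryNumber, long initialDelayMs, double multiplier, long maxDelayMs) {
        double raw = initialDelayMs * Math.pow(multiplier, retryNumber - 1);
        return Math.min((long) raw, maxDelayMs);
    }

    // Scales the base delay by (1 + jitterFraction); never returns a negative delay.
    static long applyJitter(long baseDelayMs, double jitterFraction) {
        return Math.max(0L, Math.round(baseDelayMs * (1.0 + jitterFraction)));
    }

    // Draws a random jitter fraction in [-0.10, +0.10) and applies it to the capped delay.
    static long computeDelayWithJitter(int retryNumber, long initialDelayMs, double multiplier, long maxDelayMs) {
        long base = computeDelay(retryNumber, initialDelayMs, multiplier, maxDelayMs);
        double jitter = ThreadLocalRandom.current().nextDouble(-0.10, 0.10);
        return applyJitter(base, jitter);
    }
}
```

Because the jitter is applied after the cap in this sketch, an individual delay can exceed maxDelayMs by up to 10%. Applying the cap last instead would keep every delay at or below the maximum, at the cost of removing upward jitter once the cap is reached.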

💡 Tip: The maxDelayMs cap prevents delays from growing unbounded. With multiplier=10.0 and initialDelay=100, retry 3 would be 100 × 10^2 = 10,000ms without the cap. The grader's testMaxDelayCap verifies no delay exceeds the configured maximum.

Cancellation

Cancellation is checked at the top of each attempt, so a cancel() issued during a backoff sleep takes effect once that sleep ends:

if (state.cancelled) {
    state.status = "CANCELLED";
    return false;
}

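⚠️ Trap: Cancel must be observable from another thread. The cancelled field must be volatile. Without it, the Java Memory Model gives no guarantee that the retry thread ever sees the write, and the JIT compiler may hoist the read out of the loop so the flag is never re-checked.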
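A volatile flag combined with Thread.sleep only notices a cancel after the current sleep finishes. If faster cancellation matters, one alternative is to wait on a CountDownLatch instead of sleeping. This is a different technique from the flag-based approach above, sketched here under the hypothetical name CancellableWait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: a backoff wait that another thread can cut short.
final class CancellableWait {
    private final CountDownLatch cancelSignal = new CountDownLatch(1);

    // Safe to call from any thread; wakes a waiter immediately.
    void cancel() {
        cancelSignal.countDown();
    }

    boolean isCancelled() {
        return cancelSignal.getCount() == 0;
    }

    // Waits up to delayMs. Returns true if cancelled before or during the wait,
    // false if the full delay elapsed without a cancel.
    boolean sleepUnlessCancelled(long delayMs) {
        try {
            return cancelSignal.await(delayMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return true; // treat interruption as a request to stop
        }
    }
}
```

In a retry loop, a true return value from sleepUnlessCancelled would map to setting the status to CANCELLED and returning false, just like the flag check.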

5. Performance + Concurrency

Thread Safety

ConcurrentHashMap handles the map-level concurrency, but individual OperationState fields need volatile to ensure visibility across threads. The timestamps list uses Collections.synchronizedList() because it's mutated during execution and read during queries.
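One detail worth spelling out: Collections.synchronizedList() makes individual calls such as add() thread-safe, but callers that iterate the list must hold its lock themselves. A query method like getRetryTimestamps can avoid the problem by handing out a copy. The sketch below uses a hypothetical TimestampLog class to show the pattern. It is an assumption about how the query could be written, not the article's actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical wrapper around the timestamps list.
final class TimestampLog {
    private final List<Long> timestamps = Collections.synchronizedList(new ArrayList<>());

    // Called by the retry loop on every attempt.
    void record(long epochMillis) {
        timestamps.add(epochMillis);
    }

    // Returns an independent, read-only copy so callers can iterate
    // without racing against concurrent record() calls.
    List<Long> snapshot() {
        synchronized (timestamps) { // same lock the synchronized wrapper uses
            return Collections.unmodifiableList(new ArrayList<>(timestamps));
        }
    }
}
```

Returning the live list instead would let a caller's for-each loop collide with an add() from the retry thread and fail with a ConcurrentModificationException.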

Performance Targets

The grader benchmarks two scenarios:

Successful execution overhead: 500+ ops/sec with p99 < 25ms — basically, the bookkeeping around a fast action shouldn't add significant latency
Retry throughput: 200+ ops/sec with p99 < 75ms — even with one retry + backoff sleep, throughput should remain reasonable

Since each operation gets its own OperationState, different operations do not contend on shared state beyond the brief computeIfAbsent lookup in the ConcurrentHashMap. The dominant cost is the Thread.sleep() for backoff, which is by design.

6. What the Grader Checks

| Test | What It Verifies |
| --- | --- |
| `testSuccessOnFirstAttempt` | Returns `true`, attempt count = 1, status = SUCCEEDED |
| `testSuccessAfterRetries` | Fails twice, succeeds on attempt 3 |
| `testExhaustRetries` | 4 total attempts with maxRetries=3, status = FAILED |
| `testExponentialBackoff` | Timestamp gaps grow exponentially (within jitter tolerance) |
| `testMaxDelayCap` | No delay gap exceeds maxDelayMs + jitter |
| `testJitterApplied` | 20 operations with fixed multiplier show varying delays |
| `testCancel` | Mid-execution cancel stops retries, status = CANCELLED |
| `testReset` | Clears all state back to PENDING |
| `testConcurrentExecutions` | 5 parallel operations complete independently |
| `testZeroRetries` | maxRetries=0 means exactly 1 attempt |
| `testGetLastError` | Exception message captured from thrown exception |

7. Takeaways

Snapshot config at execution time. If configuration can change mid-flight, capture the values when execution starts. The idea is similar in spirit to snapshot isolation in databases, where a transaction keeps reading the data as it was when the transaction began.

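Jitter prevents thundering herd. Even ±10% randomization breaks the synchronization between competing clients. In production, AWS's guidance on backoff describes "full jitter", where each delay is drawn from random(0, cappedDelay) and cappedDelay is the capped exponential delay for that retry. This spreads retries across the whole window for even better distribution.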
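For comparison with the ±10% scheme used in this article, here is a minimal sketch of full jitter. It assumes the same capped exponential formula as computeDelay and picks a uniform delay in [0, ceiling). The class and method names (FullJitterDemo, cappedBackoff, fullJitterDelay) are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of the "full jitter" strategy.
final class FullJitterDemo {
    // Same capped exponential formula as the article's computeDelay.
    static long cappedBackoff(int retryNumber, long initialDelayMs, double multiplier, long maxDelayMs) {
        double raw = initialDelayMs * Math.pow(multiplier, retryNumber - 1);
        return Math.min((long) raw, maxDelayMs);
    }

    // Picks a uniform delay in [0, ceiling), where ceiling is the capped exponential delay.
    static long fullJitterDelay(int retryNumber, long initialDelayMs, double multiplier, long maxDelayMs) {
        long ceiling = cappedBackoff(retryNumber, initialDelayMs, multiplier, maxDelayMs);
        if (ceiling <= 0) {
            return 0L; // nextLong(bound) requires a positive bound
        }
        return ThreadLocalRandom.current().nextLong(ceiling);
    }
}
```

The trade-off is that some retries fire almost immediately, so the average wait is roughly half the exponential delay. In exchange, clients that failed at the same moment are spread over the entire interval instead of a narrow ±10% band.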

Total attempts = 1 + maxRetries. This off-by-one catches more engineers than you'd expect. The first attempt is not a "retry" — it's the initial try.

👉 Try it yourself: Retry Engine on Cruscible