Building a Retry Engine with Exponential Backoff

Handle transient failures gracefully with configurable retries, jitter, and per-operation state tracking

By SysAdmin · Published 2026-05-27

Network calls fail. Databases go down temporarily. External APIs hit rate limits. A well-designed retry engine handles these gracefully, and building one from scratch teaches you exactly how.

1. The Problem

In production systems, transient failures are the norm. A database connection drops for 200ms, an HTTP call times out, a rate limiter rejects your request. Libraries like Resilience4j and Spring Retry implement the retry-with-backoff pattern, but do you actually understand what's happening under the hood?

Our challenge: build a configurable retry engine that applies exponential backoff with jitter, tracks state per operation, and supports cancellation.

2. The Naïve Approach (and Why It Fails)

The simplest retry looks like this:

for (int i = 0; i < maxRetries; i++) {
    try {
        if (action.call()) return true;
    } catch (Exception e) { /* ignore */ }
    Thread.sleep(1000); // fixed delay
}

This breaks in three ways:

  1. Fixed delay: if 100 clients retry at the same time with the same 1-second delay, they all slam the server simultaneously again. This is the thundering herd problem.
  2. No state tracking: you can't answer "how many attempts did operation X make?" or "what was the last error?"
  3. No cancellation: once started, it runs to completion. If the user navigates away or the downstream service is confirmed dead, you're wasting resources.
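
The thundering herd effect in point 1 can be seen in a small simulation. This is only a sketch: ThunderingHerdDemo and distinctRetryTimes are hypothetical names, and it assumes every client failed at the same instant. It counts how many distinct millisecond instants the clients pick for their next attempt, with and without jitter.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class ThunderingHerdDemo {
    // Counts the distinct retry instants (in ms) chosen by clients that all failed at t=0.
    // jitterFraction = 0.0 models a fixed delay; 0.10 models ±10% jitter.
    static int distinctRetryTimes(int clients, long baseDelayMs, double jitterFraction, Random rng) {
        Set<Long> instants = new HashSet<>();
        for (int i = 0; i < clients; i++) {
            double jitter = (rng.nextDouble() * 2 - 1) * jitterFraction; // uniform in [-f, +f)
            instants.add(Math.round(baseDelayMs * (1 + jitter)));
        }
        return instants.size();
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // seeded so runs are repeatable
        System.out.println("fixed delay:    " + distinctRetryTimes(100, 1000, 0.0, rng));
        System.out.println("with 10% jitter: " + distinctRetryTimes(100, 1000, 0.10, rng));
    }
}
```

With a fixed delay every client lands on the same instant, so the first count is 1. With jitter the retries spread over roughly 900ms to 1100ms.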

3. The Right Model

We need per-operation state tracked independently:

private static class OperationState {
    volatile String status = "PENDING";     // PENDING → RETRYING → SUCCEEDED/FAILED/CANCELLED
    volatile int attemptCount = 0;
    volatile String lastError = null;
    volatile boolean cancelled = false;
    final List<Long> timestamps = Collections.synchronizedList(new ArrayList<>());
    // Snapshot of config at execution time
    volatile int configMaxRetries;
    volatile long configInitialDelayMs;
    volatile double configBackoffMultiplier;
    volatile long configMaxDelayMs;
}

The key insight: snapshot configuration at execution time. If someone calls configure() while an operation is mid-retry, the running operation keeps its original parameters. Only new execute() calls pick up the new config.

State is stored in a ConcurrentHashMap<String, OperationState>, so no external dependencies are needed.
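
To make the snapshot idea concrete, here is a minimal sketch under stated assumptions rather than the engine's actual code. SnapshotDemo, the nested RetryConfig record, begin, and configFor are hypothetical names introduced for illustration, and records require Java 16 or later. Each operation copies the engine's current config reference when it starts, so a later configure() call only affects operations that begin afterwards.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class SnapshotDemo {
    // Hypothetical immutable config holder (not part of the original engine).
    record RetryConfig(int maxRetries, long initialDelayMs, double backoffMultiplier, long maxDelayMs) {}

    // Engine-wide config; volatile so a replaced reference is visible to all threads.
    private volatile RetryConfig current = new RetryConfig(3, 100, 2.0, 5000);

    // Per-operation snapshots, keyed by operation id.
    private final Map<String, RetryConfig> snapshots = new ConcurrentHashMap<>();

    void configure(int maxRetries, long initialDelayMs, double backoffMultiplier, long maxDelayMs) {
        current = new RetryConfig(maxRetries, initialDelayMs, backoffMultiplier, maxDelayMs);
    }

    // Would be called at the start of execute(): freeze the config this run will use.
    RetryConfig begin(String operationId) {
        RetryConfig snapshot = current;
        snapshots.put(operationId, snapshot);
        return snapshot;
    }

    RetryConfig configFor(String operationId) {
        return snapshots.get(operationId);
    }

    public static void main(String[] args) {
        SnapshotDemo engine = new SnapshotDemo();
        engine.begin("op-A");                // op-A starts under maxRetries=3
        engine.configure(5, 50, 3.0, 1000);  // config changes while op-A is "running"
        engine.begin("op-B");                // op-B picks up the new config
        System.out.println("op-A maxRetries = " + engine.configFor("op-A").maxRetries());
        System.out.println("op-B maxRetries = " + engine.configFor("op-B").maxRetries());
    }
}
```

Holding the four values in one immutable object also means a running operation can never observe a half-updated configuration, which separate volatile fields do not guarantee.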

4. The Implementation, Walked Through

The Interface

public interface RetryEngineContract {
    void configure(int maxRetries, long initialDelayMs, double backoffMultiplier, long maxDelayMs);
    boolean execute(String operationId, Callable<Boolean> action);
    int getAttemptCount(String operationId);
    String getStatus(String operationId);
    String getLastError(String operationId);
    long getNextRetryDelayMs(String operationId);
    List<Long> getRetryTimestamps(String operationId);
    void cancel(String operationId);
    void reset(String operationId);
}

The Execute Loop

The core retry loop follows this pattern:

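public boolean execute(String operationId, Callable<Boolean> action) {
    OperationState state = operations.computeIfAbsent(operationId, k -> new OperationState());

    // Snapshot the current config so a later configure() call cannot affect this run.
    // (Engine field names are assumed to mirror the configure() parameters.)
    state.configMaxRetries = this.maxRetries;
    state.configInitialDelayMs = this.initialDelayMs;
    state.configBackoffMultiplier = this.backoffMultiplier;
    state.configMaxDelayMs = this.maxDelayMs;
    state.status = "RETRYING";
    state.attemptCount = 0;

    int totalAttempts = 1 + state.configMaxRetries; // initial try + retries

    for (int attempt = 1; attempt <= totalAttempts; attempt++) {
        // Check for cancellation before every attempt
        if (state.cancelled) {
            state.status = "CANCELLED";
            return false;
        }

        state.timestamps.add(System.currentTimeMillis());
        state.attemptCount = attempt;

        try {
            if (action.call()) {
                state.status = "SUCCEEDED";
                return true;
            }
        } catch (Exception e) {
            state.lastError = e.getMessage();
        }

        // Backoff sleep (not after last attempt)
        if (attempt < totalAttempts && !state.cancelled) {
            long delay = computeDelayWithJitter(attempt, state);
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ie) {
                // Thread.sleep throws a checked exception. Restore the interrupt flag
                // and stop retrying; treating this as a cancellation is a design choice.
                Thread.currentThread().interrupt();
                state.status = "CANCELLED";
                return false;
            }
        }
    }

    state.status = "FAILED";
    return false;
}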

⚠️ Trap: The totalAttempts calculation is 1 + maxRetries, not maxRetries. With maxRetries=3, you get 4 total attempts: 1 initial + 3 retries. The grader's testExhaustRetries test checks this explicitly.
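
A quick way to convince yourself of the arithmetic is to count calls against an action that always fails. The sketch below is a stripped-down loop with no backoff sleep and no state tracking; AttemptCountDemo and countAttempts are hypothetical names, not part of the engine.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

final class AttemptCountDemo {
    // Runs the action up to 1 + maxRetries times and reports how many calls were made.
    static int countAttempts(int maxRetries, Callable<Boolean> action) {
        int totalAttempts = 1 + maxRetries; // initial try + retries
        int made = 0;
        for (int attempt = 1; attempt <= totalAttempts; attempt++) {
            made++;
            try {
                if (action.call()) {
                    break; // success: stop retrying
                }
            } catch (Exception e) {
                // a thrown exception counts as a failed attempt
            }
        }
        return made;
    }

    public static void main(String[] args) {
        // Always fails: 1 initial attempt + 3 retries
        System.out.println(countAttempts(3, () -> false)); // prints 4
        // maxRetries = 0 still makes the initial attempt
        System.out.println(countAttempts(0, () -> false)); // prints 1
        // Fails twice, then succeeds on the third call
        AtomicInteger calls = new AtomicInteger();
        System.out.println(countAttempts(3, () -> calls.incrementAndGet() == 3)); // prints 3
    }
}
```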

Exponential Backoff with Jitter

The backoff formula is:

delay(n) = min(initialDelayMs × multiplier^(n-1), maxDelayMs)
actualDelay = delay(n) × (1 + jitter)    where jitter ∈ [-0.10, +0.10]

In code:

private long computeDelay(int retryNumber, OperationState state) {
    double raw = state.configInitialDelayMs * Math.pow(state.configBackoffMultiplier, retryNumber - 1);
    return Math.min((long) raw, state.configMaxDelayMs);
}

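With configure(3, 100, 2.0, 5000), the base delays (before jitter) are:

Retry | Formula   | Delay
1     | 100 × 2^0 | 100ms
2     | 100 × 2^1 | 200ms
3     | 100 × 2^2 | 400ms

None of these reaches the 5000ms cap.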

The jitter adds ±10% randomization, so retry 1 is actually 100 × (1 + random(-0.10, 0.10)), somewhere between 90ms and 110ms.
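
The execute loop calls computeDelayWithJitter, which is not shown in this article. One way it might look is sketched below, assuming the jitter is drawn uniformly with ThreadLocalRandom and applied after the cap. The class name JitterDemo and the flat parameter list (instead of an OperationState argument) are illustrative choices, not the engine's actual signature.

```java
import java.util.concurrent.ThreadLocalRandom;

final class JitterDemo {
    static final double JITTER_FRACTION = 0.10; // ±10%, matching the formula above

    // Capped exponential delay before jitter.
    static long computeDelay(int retryNumber, long initialDelayMs, double multiplier, long maxDelayMs) {
        double raw = initialDelayMs * Math.pow(multiplier, retryNumber - 1);
        return Math.min((long) raw, maxDelayMs);
    }

    // Scales the capped delay by a random factor in [0.90, 1.10).
    static long computeDelayWithJitter(int retryNumber, long initialDelayMs, double multiplier, long maxDelayMs) {
        long base = computeDelay(retryNumber, initialDelayMs, multiplier, maxDelayMs);
        double jitter = ThreadLocalRandom.current().nextDouble(-JITTER_FRACTION, JITTER_FRACTION);
        return Math.round(base * (1.0 + jitter));
    }

    public static void main(String[] args) {
        for (int retry = 1; retry <= 3; retry++) {
            System.out.println("retry " + retry
                    + ": base=" + computeDelay(retry, 100, 2.0, 5000) + "ms"
                    + ", jittered=" + computeDelayWithJitter(retry, 100, 2.0, 5000) + "ms");
        }
    }
}
```

Because jitter is applied after the cap in this sketch, a jittered delay can exceed maxDelayMs by up to 10%. If a hard ceiling is required, apply Math.min once more after the jitter.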

💡 Tip: The maxDelayMs cap prevents delays from growing unbounded. With multiplier=10.0 and initialDelay=100, retry 3 would be 100 × 10^2 = 10,000ms without the cap. The grader's testMaxDelayCap verifies no delay exceeds the configured maximum.

Cancellation

Cancellation is checked between retries:

if (state.cancelled) {
    state.status = "CANCELLED";
    return false;
}

⚠️ Trap: Cancel must be observable from another thread. The cancelled field must be volatile; otherwise the Java Memory Model gives no guarantee that the retry loop ever sees the write, and the JIT compiler is free to hoist the read out of the loop.
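
The cross-thread visibility requirement can be exercised in isolation. The following is a sketch, not the engine itself: CancelDemo, run, and cancelFromAnotherThread are hypothetical names, and the "retry loop" simply pauses between attempts that are assumed to fail. A worker thread runs the loop while the main thread sets the volatile flag.

```java
final class CancelDemo {
    // volatile: a write in cancel() is guaranteed to become visible to the worker's reads
    private volatile boolean cancelled = false;
    private volatile String status = "PENDING";

    void cancel() { cancelled = true; }

    String status() { return status; }

    // Simulated retry loop: every attempt "fails", with a short pause in between.
    void run(int totalAttempts, long pauseMs) {
        status = "RETRYING";
        for (int attempt = 1; attempt <= totalAttempts; attempt++) {
            if (cancelled) {
                status = "CANCELLED";
                return;
            }
            try {
                Thread.sleep(pauseMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                status = "CANCELLED";
                return;
            }
        }
        status = "FAILED";
    }

    // Starts the loop on a worker thread, cancels it shortly after, and returns the final status.
    static String cancelFromAnotherThread() {
        CancelDemo op = new CancelDemo();
        Thread worker = new Thread(() -> op.run(1_000, 10)); // about 10s if never cancelled
        worker.start();
        try {
            Thread.sleep(50);
            op.cancel();
            worker.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return op.status();
    }

    public static void main(String[] args) {
        System.out.println(cancelFromAnotherThread());
    }
}
```

The flag is only read between pauses, so a cancel issued during a long backoff sleep takes effect when that sleep ends. Interrupting the worker thread from cancel() would be one way to cut the wait short.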

5. Performance + Concurrency

Thread Safety

ConcurrentHashMap handles the map-level concurrency, but individual OperationState fields need volatile to ensure visibility across threads. The timestamps list uses Collections.synchronizedList() because it's mutated during execution and read during queries.
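
One detail worth sketching: a synchronized list makes individual calls such as add() thread-safe, but the Collections.synchronizedList documentation requires callers to hold the list's lock while iterating. A query method like getRetryTimestamps could therefore return a copy taken under that lock. TimestampsDemo, record, and snapshot are hypothetical names, and List.copyOf assumes Java 10 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TimestampsDemo {
    private final List<Long> timestamps = Collections.synchronizedList(new ArrayList<>());

    // Single calls such as add() are synchronized by the wrapper itself.
    void record(long timestampMs) {
        timestamps.add(timestampMs);
    }

    // Copy under the list's own lock so a concurrent add() cannot interleave with the copy.
    // The returned list is unmodifiable and detached from later writes.
    List<Long> snapshot() {
        synchronized (timestamps) {
            return List.copyOf(timestamps);
        }
    }

    public static void main(String[] args) {
        TimestampsDemo op = new TimestampsDemo();
        op.record(1_000L);
        op.record(1_100L);
        List<Long> view = op.snapshot();
        op.record(1_300L); // does not affect the earlier snapshot
        System.out.println(view.size() + " then " + op.snapshot().size());
    }
}
```

Returning a copy also keeps callers from mutating the engine's internal list.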

Performance Targets

The grader benchmarks two scenarios:

Since each operation gets its own OperationState in a ConcurrentHashMap, there's zero contention between different operations. The only "cost" is the Thread.sleep() for backoff, which is by design.

6. What the Grader Checks

Test                      | What It Verifies
testSuccessOnFirstAttempt | Returns true, attempt count = 1, status = SUCCEEDED
testSuccessAfterRetries   | Fails twice, succeeds on attempt 3
testExhaustRetries        | 4 total attempts with maxRetries=3, status = FAILED
testExponentialBackoff    | Timestamp gaps grow exponentially (within jitter tolerance)
testMaxDelayCap           | No delay gap exceeds maxDelayMs + jitter
testJitterApplied         | 20 operations with fixed multiplier show varying delays
testCancel                | Mid-execution cancel stops retries, status = CANCELLED
testReset                 | Clears all state back to PENDING
testConcurrentExecutions  | 5 parallel operations complete independently
testZeroRetries           | maxRetries=0 means exactly 1 attempt
testGetLastError          | Exception message captured from thrown exception

7. Takeaways

  1. Snapshot config at execution time. If configuration can change mid-flight, capture the values when execution starts. Snapshot isolation in databases rests on the same idea: a transaction works against the state as it was when the transaction began.
  2. Jitter prevents thundering herd. Even ±10% randomization breaks the synchronization between competing clients. In production, AWS recommends "full jitter", random(0, baseDelay), for even better distribution.
  3. Total attempts = 1 + maxRetries. This off-by-one catches more engineers than you'd expect. The first attempt is not a "retry"; it's the initial try.
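
For comparison with the ±10% scheme, here is a sketch of the "full jitter" variant mentioned in the takeaways, where the delay is drawn uniformly between 0 and the capped exponential delay. FullJitterDemo, cappedDelay, and fullJitterDelay are hypothetical names, and the sketch assumes maxDelayMs is non-negative and smaller than Long.MAX_VALUE.

```java
import java.util.concurrent.ThreadLocalRandom;

final class FullJitterDemo {
    // Capped exponential delay, same formula as the engine's computeDelay.
    static long cappedDelay(int retryNumber, long initialDelayMs, double multiplier, long maxDelayMs) {
        double raw = initialDelayMs * Math.pow(multiplier, retryNumber - 1);
        return Math.min((long) raw, maxDelayMs);
    }

    // Full jitter: uniform in [0, cappedDelay], both ends inclusive.
    static long fullJitterDelay(int retryNumber, long initialDelayMs, double multiplier, long maxDelayMs) {
        long ceiling = cappedDelay(retryNumber, initialDelayMs, multiplier, maxDelayMs);
        // nextLong's bound is exclusive, hence the + 1 (assumes ceiling < Long.MAX_VALUE)
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        for (int retry = 1; retry <= 5; retry++) {
            System.out.println("retry " + retry
                    + ": ceiling=" + cappedDelay(retry, 100, 2.0, 1000) + "ms"
                    + ", fullJitter=" + fullJitterDelay(retry, 100, 2.0, 1000) + "ms");
        }
    }
}
```

The trade-off is that some retries fire almost immediately, in exchange for a much wider spread across clients than a ±10% band gives.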

👉 Try it yourself: Retry Engine on Cruscible