How to Implement Webhook Retry Logic with Exponential Backoff

Jun 1, 2026

If you’ve ever shipped a webhook integration to production, you already know the truth: the network is hostile, downstream servers crash, and HTTP 500s happen at the worst possible moment. Without solid webhook retry logic, a single hiccup can mean lost orders, missed payments, or out-of-sync data between your platform and your customers.

In this practical guide, we’ll walk through how to design a resilient webhook delivery pipeline using exponential backoff, jitter, and dead letter queues (DLQs). We’ll share code patterns you can drop into your stack today and the common pitfalls that bite teams in production.

Why Webhook Retry Logic Matters

Webhooks are fire-and-forget by design. The producer pushes an event, and the consumer is expected to process it. But consumers fail for many reasons:

  • Temporary network partitions or DNS issues
  • Receiver deployments and restarts
  • Rate limits hit on the consumer side
  • Database locks or downstream timeouts
  • Partial outages from cloud providers

Without a retry strategy, you lose data. With a naive retry strategy (think: a tight while loop), you turn a small outage into a self-inflicted DDoS on your own customers. The goal is reliable delivery without overwhelming the receiver.

The Building Blocks of a Resilient Webhook System

Any production-grade webhook delivery pipeline needs four pieces working together:

  1. Idempotency on both ends so retries are safe
  2. Retry classification to decide which failures should be retried
  3. Exponential backoff with jitter for spacing out attempts
  4. A dead letter queue for events that exhaust retries

1. Make Webhooks Idempotent

Before you retry anything, every webhook must carry a stable, unique identifier. The classic approach is an X-Event-Id or Idempotency-Key header. The receiver stores processed IDs (typically with a TTL) and silently acknowledges duplicates.

Without idempotency, retries can charge a card twice, double-create a user, or fire the same notification multiple times.
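
As a rough illustration of the receiver side, here is a minimal in-memory sketch of storing processed IDs with a TTL. The names ProcessedEventStore and handle_webhook are hypothetical, and the dictionary stands in for whatever shared store you actually run; a real deployment would want an atomic check-and-set in that store rather than this two-step lookup.

```python
import time


class ProcessedEventStore:
    """In-memory record of handled event IDs; each entry expires after ttl_seconds."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # event_id -> expiry timestamp

    def seen(self, event_id):
        expiry = self._expires_at.get(event_id)
        if expiry is None:
            return False
        if expiry <= self._clock():
            # Entry outlived its TTL: forget it and treat the event as new.
            del self._expires_at[event_id]
            return False
        return True

    def remember(self, event_id):
        self._expires_at[event_id] = self._clock() + self._ttl


def handle_webhook(store, event_id, payload, process):
    """Return the HTTP status to send back; duplicates are acknowledged, not reprocessed."""
    if store.seen(event_id):
        return 200
    process(payload)
    # Remember only after processing succeeds, so a crash mid-processing
    # still lets the sender's retry go through.
    store.remember(event_id)
    return 200
```

Calling handle_webhook twice with the same event_id runs process only once, which is what makes the sender's retries safe.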

2. Classify Failures Correctly

Not every failure deserves a retry. Retrying a 400 Bad Request a hundred times will not magically make the payload valid.

Status / Condition                     Retry?  Reason
Network timeout / connection refused   Yes     Transient by definition
408, 425, 429                          Yes     Respect Retry-After if present
500, 502, 503, 504                     Yes     Server-side, likely temporary
400, 401, 403, 404, 410, 422           No      Client error, retry will not fix it
2xx                                    No      Success
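
The table above can be folded into one small function. This is a sketch, not a complete policy: the name classify and the three string labels are made up for illustration, None is used to mean "no HTTP response at all" (timeout or connection refused), and any status the table does not list is treated as retryable, mirroring the delivery code later in this guide.

```python
RETRYABLE_4XX = {408, 425, 429}


def classify(status_code):
    """Map a delivery outcome to 'success', 'retry', or 'permanent'.

    status_code is the HTTP status, or None when no response arrived
    (network timeout, connection refused).
    """
    if status_code is None:
        return "retry"
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500 and status_code not in RETRYABLE_4XX:
        return "permanent"
    # 408/425/429, all 5xx, and anything unlisted fall through to retry.
    return "retry"
```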

3. Exponential Backoff with Jitter

Exponential backoff doubles the delay between attempts. The formula is simple:

delay = base * (2 ^ attempt)

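Counting attempt from 0 with base = 1s, the delays are 1s, 2s, 4s, 8s, 16s, 32s, and so on. Most production systems cap retries between 5 and 10 attempts spread over 24 to 72 hours. Pure doubling from 1s cannot cover that window on its own (the first seven delays add up to only 127 seconds), so long schedules use a larger base or stepped delays like the sample below.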

The catch: if thousands of webhooks fail at the same instant (because the receiver was down), they will all retry at exactly the same moment. This is called the thundering herd. The fix is jitter: randomness added to each delay so retries spread out naturally.
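
To make the effect concrete, here is a small sketch of capped exponential backoff with full jitter, where each delay is drawn uniformly between zero and the exponential ceiling. The function name backoff_delay and its default base and cap are illustrative choices, not fixed recommendations.

```python
import random


def backoff_delay(attempt, base=1.0, cap=12 * 3600, rng=random):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds.

    attempt counts from 0, so the first retry waits at most `base` seconds.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


# 1000 events that all failed on attempt 3 no longer retry together at 8s;
# their delays are scattered across the whole 0 to 8 second window.
delays = [backoff_delay(3) for _ in range(1000)]
```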

Sample Schedule with Full Jitter

Attempt  Base Delay  With Jitter (range)
1        1s          0 to 1s
2        2s          0 to 2s
3        4s          0 to 4s
4        30s         0 to 30s
5        5min        0 to 5min
6        30min       0 to 30min
7        2h          0 to 2h
8        12h         0 to 12h
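
Because this schedule is stepped rather than pure doubling, one simple way to implement it is a lookup table. The sketch below copies the base delays from the table; the names SCHEDULE and scheduled_delay are hypothetical.

```python
import random

# Base delays in seconds, copied from the sample schedule above.
SCHEDULE = [1, 2, 4, 30, 5 * 60, 30 * 60, 2 * 3600, 12 * 3600]


def scheduled_delay(attempt_number, rng=random):
    """Full-jitter delay for a 1-based attempt number (the Attempt column)."""
    if not 1 <= attempt_number <= len(SCHEDULE):
        raise ValueError("attempt_number outside the schedule")
    return rng.uniform(0, SCHEDULE[attempt_number - 1])
```

The worst case for this table is sum(SCHEDULE) = 52,537 seconds, roughly 14.6 hours; with full jitter the expected total is about half of that, so stretch the later steps if you need a longer delivery window.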

Code Patterns You Can Use Today

Node.js: Retry with Exponential Backoff and Jitter

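// scheduleRetry and sendToDeadLetterQueue are assumed to be provided by
// your queueing layer; they are not defined here.
async function deliverWebhook(url, payload, eventId, attempt = 0) {
  const MAX_ATTEMPTS = 8;                    // total tries, including the first
  const BASE_MS = 1000;
  const MAX_DELAY_MS = 12 * 60 * 60 * 1000;  // cap any single delay at 12 hours

  try {
    const res = await fetch(url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Idempotency-Key': eventId,
        'X-Webhook-Attempt': String(attempt + 1)
      },
      body: JSON.stringify(payload),
      signal: AbortSignal.timeout(10000)     // give up after 10 seconds
    });

    if (res.ok) return { status: 'delivered' };

    // 4xx other than 408/425/429 is a permanent failure: do not retry.
    if (res.status >= 400 && res.status < 500 && ![408, 425, 429].includes(res.status)) {
      return { status: 'permanent_failure', code: res.status };
    }

    throw new Error(`Retryable status ${res.status}`);
  } catch (err) {
    // Reached for network errors, timeouts, and retryable statuses.
    if (attempt + 1 >= MAX_ATTEMPTS) {
      await sendToDeadLetterQueue({ url, payload, eventId, error: err.message });
      return { status: 'dlq' };
    }

    // Full jitter: random delay between 0 and the capped exponential ceiling.
    // With BASE_MS = 1000 and 8 attempts the largest ceiling is 64s; raise
    // BASE_MS or use a stepped schedule to stretch retries over hours.
    const exp = Math.min(BASE_MS * Math.pow(2, attempt), MAX_DELAY_MS);
    const delay = Math.floor(Math.random() * exp);

    await scheduleRetry({ url, payload, eventId, attempt: attempt + 1, runAt: Date.now() + delay });
    return { status: 'scheduled', nextAttemptIn: delay };
  }
}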

Python: Same Pattern with a Persistent Queue

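import random
import time

import httpx

MAX_ATTEMPTS = 8   # total tries, including the first
BASE = 1.0         # seconds
CAP = 12 * 3600    # never wait longer than 12 hours

def next_delay(attempt: int) -> float:
    # Full jitter: uniform in [0, min(CAP, BASE * 2^attempt)]
    exp = min(BASE * (2 ** attempt), CAP)
    return random.uniform(0, exp)

# mark_delivered, mark_permanent_failure, send_to_dlq, and schedule are
# assumed to be backed by your persistent queue; they are not defined here.
def deliver(event):
    try:
        r = httpx.post(event['url'], json=event['payload'],
                       headers={'Idempotency-Key': event['id']}, timeout=10)
        if 200 <= r.status_code < 300:
            mark_delivered(event['id'])
            return
        if 400 <= r.status_code < 500 and r.status_code not in (408, 425, 429):
            mark_permanent_failure(event['id'], r.status_code)
            return
        raise RuntimeError(f'retryable {r.status_code}')
    except Exception as e:
        attempt = event['attempt'] + 1  # attempts made so far
        if attempt >= MAX_ATTEMPTS:
            send_to_dlq(event, str(e))
            return
        # Use the zero-based index of the attempt that just failed, so the
        # first retry waits up to BASE, matching the Node.js version.
        delay = next_delay(event['attempt'])
        schedule(event['id'], run_at=time.time() + delay, attempt=attempt)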

Don’t Skip the Dead Letter Queue

A dead letter queue is where webhooks go after exhausting all retry attempts. Without a DLQ, failed events disappear into logs and you only find out about them when a customer complains.

A solid DLQ implementation should provide:

  • Persistence: store the full payload, headers, target URL, and last error
  • Visibility: a UI or API for support and engineering to inspect failures
  • Replay: a one-click or single-API-call way to re-queue an event
  • Alerting: trigger notifications when DLQ size crosses a threshold
  • Retention: keep events for at least 30 days for forensic analysis

Common backends for DLQs include AWS SQS (with its native DLQ feature), Redis Streams, RabbitMQ dead letter exchanges, or a plain Postgres table with a status column. The technology matters less than the discipline of actually using it.
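
For the "plain table with a status column" option, the following sketch uses Python's built-in sqlite3 as a stand-in for Postgres. The table name dead_letters, its columns, and the helpers open_dlq, send_to_dlq, dlq_size, and replay are all assumptions made for illustration.

```python
import json
import sqlite3
import time


def open_dlq(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS dead_letters (
               event_id   TEXT PRIMARY KEY,
               url        TEXT NOT NULL,
               payload    TEXT NOT NULL,
               headers    TEXT NOT NULL,
               last_error TEXT NOT NULL,
               failed_at  REAL NOT NULL,
               status     TEXT NOT NULL DEFAULT 'dead'
           )"""
    )
    return db


def send_to_dlq(db, event, error):
    """Persist the full context of an event that exhausted its retries."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO dead_letters "
            "(event_id, url, payload, headers, last_error, failed_at, status) "
            "VALUES (?, ?, ?, ?, ?, ?, 'dead')",
            (event["id"], event["url"], json.dumps(event["payload"]),
             json.dumps(event.get("headers", {})), error, time.time()),
        )


def dlq_size(db):
    """Number of events still waiting in the DLQ (useful for alert thresholds)."""
    row = db.execute("SELECT COUNT(*) FROM dead_letters WHERE status = 'dead'").fetchone()
    return row[0]


def replay(db, event_id, enqueue):
    """Re-queue one dead event with a fresh attempt counter; False if not found."""
    row = db.execute(
        "SELECT url, payload FROM dead_letters WHERE event_id = ? AND status = 'dead'",
        (event_id,),
    ).fetchone()
    if row is None:
        return False
    enqueue({"id": event_id, "url": row[0], "payload": json.loads(row[1]), "attempt": 0})
    with db:
        db.execute("UPDATE dead_letters SET status = 'replayed' WHERE event_id = ?", (event_id,))
    return True
```

Keeping replayed rows with a changed status, instead of deleting them, preserves the history needed for the retention and forensic goals listed above.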

Common Pitfalls to Avoid

Synchronous Retries Inside the HTTP Request

Never retry a webhook inside the same request that triggered the event. You will block your producer, exhaust connection pools, and turn small outages into full incidents. Always queue the delivery and process it from a worker.
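
A minimal sketch of that separation, using the standard library's queue and threading modules: the request handler only enqueues, and a background worker does the delivery. The names handle_business_request and worker are hypothetical, and an in-memory queue loses events on a crash, so treat it as a placeholder for a durable queue.

```python
import queue
import threading

delivery_queue = queue.Queue()


def handle_business_request(order):
    """Request handler: enqueue the webhook and return immediately."""
    delivery_queue.put({"id": f"evt_{order['id']}", "payload": order, "attempt": 0})
    return {"status": "accepted"}


def worker(deliver, stop):
    """Background loop: pull events and hand them to the delivery function."""
    while not stop.is_set():
        try:
            event = delivery_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        try:
            deliver(event)  # retries are scheduled from here, never in the handler
        finally:
            delivery_queue.task_done()
```

The handler's latency now depends only on the enqueue, no matter how slow or broken the receiver is.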

Forgetting Timeouts

A receiver that accepts the connection but never responds will hold your worker hostage. Set both connection and read timeouts (5 to 15 seconds is reasonable for most cases).
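
As one way to set the two timeouts separately, here is a sketch using only the standard library's http.client rather than the httpx client used earlier. The function name post_json is hypothetical, it speaks plain HTTP for brevity, and the read timeout bounds each socket operation rather than the whole request.

```python
import http.client
import json


def post_json(host, port, path, payload, connect_timeout=5.0, read_timeout=10.0):
    """POST JSON with separate connect and read timeouts; returns the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=connect_timeout)
    try:
        conn.connect()                      # bounded by connect_timeout
        conn.sock.settimeout(read_timeout)  # bounds each later send/recv
        conn.request(
            "POST",
            path,
            body=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        return conn.getresponse().status
    finally:
        conn.close()
```

A receiver that accepts the connection and then goes silent makes this raise a timeout error after read_timeout seconds, which the delivery worker can treat as a retryable failure.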

Ignoring Retry-After

When a receiver returns 429 with a Retry-After header, respect it. Overriding it with your own backoff is the fastest way to get rate limited harder.
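
Retry-After can carry either a number of seconds or an HTTP date, so honoring it takes a little parsing. The sketch below is one possible approach; the helper names are hypothetical, and choosing the larger of the hint and your own backoff is a design assumption, not a rule from any specification.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After value; return seconds to wait, or None if unusable."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(backoff_delay, retry_after_header, cap=12 * 3600):
    """Wait at least as long as the receiver asked, but never beyond cap."""
    hinted = retry_after_seconds(retry_after_header)
    if hinted is None:
        return backoff_delay
    return min(cap, max(backoff_delay, hinted))
```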

Unbounded Retries

Retrying forever sounds robust, but it just hides bugs. Cap your attempts (5 to 10 is typical) and let the DLQ catch the rest.

Same Backoff for Every Receiver

If one customer endpoint is consistently slow, isolate it. Per-destination queues or circuit breakers prevent one bad receiver from poisoning delivery for everyone else.
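
A per-destination circuit breaker can be quite small. The sketch below is a simplified version with hypothetical names and thresholds: after a run of consecutive failures it blocks deliveries to that URL for a cooldown period, then lets a trial request through.

```python
import time


class CircuitBreaker:
    """Opens after failure_threshold consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        if self._opened_at is None:
            return True
        # Once the cooldown has passed, let a trial delivery through.
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)start the cooldown


_breakers = {}


def breaker_for(url):
    """One breaker per destination URL, created on first use."""
    if url not in _breakers:
        _breakers[url] = CircuitBreaker()
    return _breakers[url]
```

A worker would check breaker_for(url).allow() before each delivery and reschedule the event without sending when it returns False.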

No Observability

You cannot fix what you cannot see. Track at minimum:

  • Delivery success rate per endpoint
  • Average attempts to success
  • P95 delivery latency end-to-end
  • DLQ size and age of oldest item
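
The first three of these metrics can be sketched with an in-memory recorder like the one below. The class DeliveryMetrics is hypothetical and only illustrates the calculations; in practice you would emit these values to whatever metrics system you already run.

```python
import statistics
from collections import defaultdict


class DeliveryMetrics:
    """Tracks per-endpoint success rate, attempts to success, and P95 latency."""

    def __init__(self):
        self._outcomes = defaultdict(lambda: {"delivered": 0, "failed": 0})
        self._attempts_to_success = []
        self._latencies = []

    def record(self, endpoint, delivered, attempts, latency_seconds):
        """Record the final outcome of one event (after its last attempt)."""
        key = "delivered" if delivered else "failed"
        self._outcomes[endpoint][key] += 1
        if delivered:
            self._attempts_to_success.append(attempts)
            self._latencies.append(latency_seconds)

    def success_rate(self, endpoint):
        counts = self._outcomes[endpoint]
        total = counts["delivered"] + counts["failed"]
        return counts["delivered"] / total if total else None

    def average_attempts(self):
        return statistics.mean(self._attempts_to_success)

    def p95_latency(self):
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        # Needs at least two recorded latencies.
        return statistics.quantiles(self._latencies, n=20)[-1]
```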

A Reference Architecture

Putting it all together, here is what a production-grade webhook pipeline typically looks like:

  1. Event producer writes the event to an outbox table in the same DB transaction as the business change
  2. Dispatcher reads from the outbox and pushes to a delivery queue
  3. Delivery workers consume the queue, sign the payload (HMAC), and POST to the receiver
  4. Failed deliveries are rescheduled with exponential backoff and jitter
  5. After max attempts, events land in the DLQ with full context
  6. Operators can replay DLQ events once the receiver is healthy again
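
Steps 1 to 3 can be sketched as follows, again with sqlite3 standing in for the production database. The table layout and the function names create_order, dispatch_pending, sign, and verify are assumptions for illustration; only the hmac and hashlib calls are standard library APIs.

```python
import hashlib
import hmac
import json
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL)")
    db.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, payload TEXT NOT NULL, "
        "dispatched INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def create_order(db, order_id, total_cents):
    """Step 1: business change and outbox row commit or roll back together."""
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents}, sort_keys=True)
    with db:  # single transaction
        db.execute("INSERT INTO orders (id, total_cents) VALUES (?, ?)", (order_id, total_cents))
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                   ("order.created", payload))


def dispatch_pending(db, enqueue):
    """Step 2: move undispatched outbox rows onto the delivery queue."""
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE dispatched = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        enqueue({"id": f"evt_{row_id}", "type": event_type, "body": payload})
        with db:
            db.execute("UPDATE outbox SET dispatched = 1 WHERE id = ?", (row_id,))
    return len(rows)


def sign(secret, body):
    """Step 3: HMAC-SHA256 over the exact bytes that will be sent."""
    return hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(secret, body, signature):
    """Receiver side: constant-time comparison of the expected signature."""
    return hmac.compare_digest(sign(secret, body), signature)
```

Marking a row as dispatched only after enqueueing means a crash between the two can enqueue the same event twice, which is one more reason the idempotency key matters.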

Final Thoughts

Good webhook retry logic is the difference between an integration your customers trust and one they file tickets about. Start with idempotency, classify failures correctly, apply exponential backoff with jitter, and never ship without a dead letter queue. The patterns in this guide scale from a side project to millions of events per day, and they will save you from the 3 AM page when a downstream provider goes dark.

FAQ

How many times should I retry a webhook?

Most production systems cap between 5 and 10 attempts spread over 24 to 72 hours. Fewer attempts miss recovery from longer outages; more attempts mostly waste resources without improving delivery rates.

What is the difference between exponential backoff and jitter?

Exponential backoff increases the delay between retries (1s, 2s, 4s, 8s…). Jitter adds randomness to those delays so a wave of failed events does not retry simultaneously and overwhelm the receiver. They work together, not as alternatives.

Should I retry on 4xx errors?

Generally no. A 4xx status means the request itself is wrong, so retrying will not help. The exceptions are 408 (timeout), 425 (too early), and 429 (rate limited), all of which are worth retrying.

What’s the point of a dead letter queue if I have unlimited retries?

Unlimited retries are an anti-pattern. They mask bugs, fill queues, and consume resources indefinitely. A DLQ gives you a clean cutoff and a place to investigate or replay events once the underlying issue is fixed.

How do I make webhook delivery idempotent?

Send a stable unique ID with every webhook (commonly Idempotency-Key or X-Event-Id). On the receiver side, store processed IDs with a TTL and acknowledge duplicates with a 200 instead of reprocessing.

Can I just use a managed service instead of building this myself?

Absolutely. Services like Hookdeck, Svix, or Inngest handle retries, backoff, and DLQs out of the box. Building it yourself makes sense when you need deep customization or you already operate the infrastructure. Either way, the principles in this guide still apply.