Stop CI Email Flakes With Bounded Waits and Dedupe

CI email flakes rarely come from one mysterious broken test. They usually come from two predictable design mistakes: waiting without a clear deadline and processing the same email more than once.

Email is asynchronous by nature. Your app queues a message, a mail service accepts it, an inbound system receives it, your test runner or agent reads it, and only then can the workflow continue. If your CI job uses a fixed sleep, an unbounded loop, or a broad inbox search, it is guessing. Under parallel load, retries, delayed delivery, and duplicate webhook attempts, guesses become flakes.

The fix is not to make the sleep longer. The fix is to treat email as an event stream with two reliability rules: bounded waits and dedupe.

What bounded waits actually mean

A bounded wait is an explicit contract for waiting on an email. It says what inbox to watch, which message counts as a match, how long the test is allowed to wait, how retrieval should happen, and what diagnostic information should be returned on failure.

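That is very different from sleep(30000), which always waits the full 30 seconds, checks nothing while it waits, and reports nothing useful when the email never arrives.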

A good bounded wait includes:

A specific inbox_id, not a shared mailbox query.
A deadline, not an infinite loop.
A narrow matcher, such as expected sender, recipient, subject pattern, correlation ID, or artifact type.
A retrieval plan, usually webhook-first with polling fallback.
A structured failure result with enough context to debug CI.
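One way to make that contract concrete is to validate it before any waiting starts. The sketch below is illustrative only: createWaitSpec and its field names (inboxId, deadlineMs, match, retrieval) are assumptions for this example, not part of any provider's API.

```javascript
// Hypothetical wait contract: every field name here is illustrative.
// The point is that a wait cannot start without an inbox, a finite
// deadline, and a matcher.
function createWaitSpec({ inboxId, deadlineMs, match, retrieval = "webhook-first" }) {
  if (typeof inboxId !== "string" || inboxId.length === 0) {
    throw new TypeError("inboxId is required: a wait must target one specific inbox");
  }
  if (!Number.isFinite(deadlineMs) || deadlineMs <= 0) {
    throw new RangeError("deadlineMs must be a positive, finite number");
  }
  if (typeof match !== "function") {
    throw new TypeError("match must be a function that receives a message");
  }
  if (!["webhook-first", "polling-only"].includes(retrieval)) {
    throw new RangeError("retrieval must be 'webhook-first' or 'polling-only'");
  }
  // Freeze the spec so nothing can quietly extend the deadline mid-wait.
  return Object.freeze({ inboxId, deadlineMs, match, retrieval });
}
```

Rejecting an infinite or missing deadline at construction time means an unbounded wait fails in the harness rather than hanging a CI job.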

The goal is not just to reduce waiting time. The goal is to make failure deterministic. If the message does not arrive before the deadline, the test should fail once, clearly, with evidence. It should not hang, pass by accidentally reading an old email, or fail three steps later because the wrong token was used.

| Brittle CI email wait | Bounded CI email wait |
| --- | --- |
| Sleeps for a fixed number of seconds | Waits until a deadline and exits early on success |
| Searches a shared inbox | Watches one disposable inbox for one attempt |
| Matches on broad subject text | Uses sender, recipient, correlation, and artifact rules |
| Fails with “element not found” later | Fails with “email not received” diagnostics |
| Reprocesses duplicates | Uses dedupe keys at delivery, message, and artifact layers |

Why dedupe is just as important as waiting

Even if your wait logic is perfect, duplicate processing can still break CI. Email providers, webhook systems, test retries, resend buttons, and your own queue workers can all produce repeated delivery events. A robust automation pipeline should assume that inbound email delivery is effectively at-least-once.

That does not mean every email will be duplicated. It means your code should remain correct if it is.

Without dedupe, these failures are common:

A webhook retry inserts the same message twice.
A resend produces two valid OTP emails and the test consumes the older one.
A polling fallback sees a message already handled by the webhook path.
A retried CI attempt reads a message from a previous attempt.
An LLM agent submits the same magic link twice because it saw the same artifact twice.

Dedupe turns “maybe repeated” events into stable records. It also gives you a clean audit trail for debugging. Instead of asking “which of these five emails did the test use?”, you can ask “which delivery, message, and artifact were accepted for this attempt?”

The flake-resistant CI email flow

For email-dependent CI steps, use a short-lived inbox per attempt. This keeps the search space small, prevents stale messages from previous jobs, and gives every wait a natural boundary.

A reliable flow looks like this:

Create a disposable inbox through an API.
Store the returned email address and inbox identifier with the CI run ID.
Trigger the product action, such as signup, password reset, OTP login, or invite flow.
Wait for a matching email using webhooks first, with polling as a fallback.
Receive the email as structured JSON, then extract only the needed artifact.
Mark the artifact as consumed once, then continue the test.
Expire or stop using the inbox after the attempt.
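The steps above can be sketched as a single harness function. Everything injected here (emailApi.createInbox, emailApi.closeInbox, app.requestSignup, waitForMessage, consumeOnce) is a hypothetical collaborator chosen for illustration, and the 6-digit OTP format is an assumption about the product under test. None of these names is a real SDK call.

```javascript
// Hypothetical helper: pull a 6-digit code out of a plain-text body.
function extractOtp(text) {
  const found = /\b\d{6}\b/.exec(text ?? "");
  return found ? found[0] : null;
}

// Sketch of one email-dependent CI step. All collaborators are injected
// and hypothetical, so the flow can be exercised with fakes.
async function runVerificationStep({ runId, attemptId, emailApi, app, waitForMessage, consumeOnce }) {
  // Steps 1 and 2: create a disposable inbox and tie it to this run and attempt.
  const inbox = await emailApi.createInbox();
  const context = { runId, attemptId, inboxId: inbox.id, address: inbox.address };

  try {
    // Step 3: trigger the product action.
    await app.requestSignup(inbox.address);

    // Step 4: bounded wait inside this inbox only.
    const result = await waitForMessage({
      inboxId: inbox.id,
      deadlineMs: 60_000, // illustrative deadline
      match: (msg) => msg.to === inbox.address
    });
    if (result.status !== "found") return { ...context, status: result.status };

    // Step 5: extract only the artifact the test needs.
    const otp = extractOtp(result.message.text);
    if (otp === null) return { ...context, status: "invalid" };

    // Step 6: consume once per attempt and step.
    if (!consumeOnce(attemptId, "verify-email")) return { ...context, status: "duplicate" };
    return { ...context, status: "found", otp };
  } finally {
    // Step 7: stop using the inbox whether the step passed or failed.
    await emailApi.closeInbox(inbox.id);
  }
}
```

Putting the cleanup in a finally block keeps a failed attempt from leaving a live inbox that a later attempt could read by mistake.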

Mailhook is built around this pattern: disposable inbox creation via API, structured JSON email output, RESTful access, real-time webhook notifications, polling for fallback retrieval, signed payloads, shared domains, custom domain support, and batch email processing. For exact integration details and machine-readable guidance, start with the Mailhook llms.txt contract.

How to design a bounded wait

A bounded wait should have one owner: the test harness, not the test body and not the LLM agent. The test should call a helper such as waitForVerificationEmail() and receive either a matched message or a typed timeout error.

Here is provider-neutral pseudocode for a polling fallback. In production, this often runs alongside a webhook event buffer so the webhook path can resolve the wait as soon as the message arrives.

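const { createHash } = require("node:crypto");

// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hash a list of fields into a stable fallback key.
const stableHash = (parts) =>
  createHash("sha256").update(JSON.stringify(parts)).digest("hex");

// Poll roughly every 1 to 1.5 seconds (illustrative values), never past the deadline.
const jitteredDelay = (remainingMs) =>
  Math.max(0, Math.min(remainingMs, 1000 + Math.random() * 500));

async function waitForMessage({ inboxId, deadlineMs, match, emailApi }) {
  const startedAt = Date.now();
  const deadlineAt = startedAt + deadlineMs;
  const seenMessages = new Set();
  let pollCount = 0;

  while (Date.now() < deadlineAt) {
    pollCount += 1;

    // `emailApi.listMessages` stands in for your provider's list call.
    const messages = await emailApi.listMessages({
      inboxId,
      limit: 50
    });

    for (const msg of messages) {
      // Prefer the provider's message ID; fall back to a content hash.
      const messageKey = msg.message_id ?? stableHash([
        msg.inbox_id,
        msg.from,
        msg.subject,
        msg.received_at,
        msg.text
      ]);

      // Evaluate each message once, even if later polls return it again.
      if (seenMessages.has(messageKey)) continue;
      seenMessages.add(messageKey);

      if (match(msg)) {
        return {
          status: "found",
          message: msg,
          pollCount,
          waitMs: Date.now() - startedAt
        };
      }
    }

    await sleep(jitteredDelay(deadlineAt - Date.now()));
  }

  // Typed timeout result: the caller gets counters, not a hang.
  return {
    status: "timeout",
    inboxId,
    deadlineMs,
    waitMs: Date.now() - startedAt,
    seenCount: seenMessages.size,
    pollCount
  };
}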

The important part is not the exact code. The important part is that the function has a deadline, tracks what it has seen, returns structured results, and does not leak duplicate messages into the rest of the test.

For webhook-first delivery, the same principles apply. The webhook handler should verify the signed payload, dedupe the delivery, persist or enqueue the message, and notify any waiters for that inbox. Polling then becomes a reconciliation mechanism, not the primary source of truth. If you are choosing between delivery approaches, this guide on webhooks or polling for inbound email automation covers the trade-offs in more depth.
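A minimal sketch of the "notify any waiters" step, assuming a single-process test harness: the registry below keeps pending waits in memory, lets a webhook handler resolve them early through notify, and still bounds each wait with its own timer. createWaiterRegistry, waitFor, and notify are hypothetical names. The sketch does not buffer messages that arrive before a wait is registered, which is one reason the polling fallback remains useful.

```javascript
// In-memory waiter registry (illustrative, single process only).
function createWaiterRegistry() {
  const waitersByInbox = new Map(); // inboxId -> Set of pending waiters

  // Register a bounded wait. Resolves with "found" or "timeout", never rejects.
  function waitFor({ inboxId, deadlineMs, match }) {
    return new Promise((resolve) => {
      const waiters = waitersByInbox.get(inboxId) ?? new Set();
      waitersByInbox.set(inboxId, waiters);

      const waiter = { match, settle };
      const timer = setTimeout(
        () => settle({ status: "timeout", inboxId, deadlineMs }),
        deadlineMs
      );

      // Settle exactly once: clear the timer and unregister the waiter.
      function settle(result) {
        clearTimeout(timer);
        waiters.delete(waiter);
        if (waiters.size === 0 && waitersByInbox.get(inboxId) === waiters) {
          waitersByInbox.delete(inboxId);
        }
        resolve(result);
      }

      waiters.add(waiter);
    });
  }

  // Called by the webhook path after verification and dedupe.
  // Returns how many pending waits the message resolved.
  function notify(inboxId, message) {
    const waiters = waitersByInbox.get(inboxId);
    if (!waiters) return 0;
    let resolved = 0;
    for (const waiter of [...waiters]) {
      if (waiter.match(message)) {
        waiter.settle({ status: "found", message });
        resolved += 1;
      }
    }
    return resolved;
  }

  return { waitFor, notify };
}
```

Because a settled waiter is removed immediately, a duplicate notification for the same message resolves nothing the second time.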

Match narrowly, not hopefully

The most dangerous matcher in CI is “first email with a subject containing verify.” It works until your suite runs in parallel, the product sends a reminder, a resend arrives late, or a previous run left a matching message behind.

A safer matcher checks multiple signals inside a specific inbox. Structured JSON makes this much easier because your code can compare fields directly instead of scraping rendered HTML.

| Matching signal | Why it helps | Common mistake |
| --- | --- | --- |
| `inbox_id` | Isolates the attempt | Searching across a shared mailbox |
| Recipient address | Confirms the app sent to the generated address | Trusting only the subject |
| Sender or domain | Filters unrelated system emails | Accepting any sender with a matching keyword |
| Correlation token | Ties the message to a run or attempt | Reusing the same token across retries |
| Received timestamp | Avoids stale messages | Selecting the oldest matching email |
| Artifact type | Ensures the email contains an OTP, link, or expected payload | Parsing the whole HTML body blindly |

If your product can include a correlation ID in the email subject, body, link parameter, or custom header, use it. If not, the per-attempt inbox still gives you strong isolation. The matcher should be narrow enough that receiving the wrong email is harder than timing out.
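As a rough illustration, a matcher that checks several of these signals at once might look like the sketch below. The message field names (inbox_id, to, from, subject, text, received_at) are assumptions about the JSON shape, and the sketch assumes from and to hold bare addresses rather than display-name forms. Check the payload your provider actually returns.

```javascript
// Build a matcher that explains its decision instead of returning a bare boolean.
// All message field names are assumed for illustration.
function buildMatcher({ inboxId, recipient, senderDomain, correlationId, notBefore }) {
  return function check(msg) {
    const reasons = [];

    if (msg.inbox_id !== inboxId) reasons.push("inbox mismatch");

    if ((msg.to ?? "").toLowerCase() !== recipient.toLowerCase()) {
      reasons.push("recipient mismatch");
    }

    // Compare the domain including "@" so "x@notmyapp.example" cannot pass.
    const from = (msg.from ?? "").toLowerCase();
    if (!from.endsWith("@" + senderDomain.toLowerCase())) {
      reasons.push("sender mismatch");
    }

    const haystack = `${msg.subject ?? ""}\n${msg.text ?? ""}`;
    if (correlationId && !haystack.includes(correlationId)) {
      reasons.push("correlation token missing");
    }

    // A missing or unparseable timestamp is treated as stale.
    const receivedAt = Date.parse(msg.received_at);
    if (!(receivedAt >= notBefore)) reasons.push("stale or missing timestamp");

    return { ok: reasons.length === 0, reasons };
  };
}
```

A wait helper that expects a boolean matcher can wrap it as (msg) => check(msg).ok, while the reasons array feeds timeout diagnostics.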

Dedupe in layers

One dedupe key is not enough because duplicates happen at different levels. A webhook retry is not the same as a resend. Two messages can contain the same OTP artifact. A retried CI attempt may intentionally generate a new message but must not reuse an old artifact.

Use layered dedupe so each layer answers a different question.

| Layer | Question it answers | Example dedupe key |
| --- | --- | --- |
| Delivery | Have we processed this webhook or polling delivery before? | Provider delivery ID, or hash of raw payload plus timestamp bucket |
| Message | Have we stored this email message before? | `inbox_id` plus `message_id`, with content hash fallback |
| Artifact | Have we already extracted this OTP or magic link from this message? | `message_id` plus artifact type plus normalized artifact hash |
| Attempt | Has this CI attempt already consumed a verification artifact? | `attempt_id` plus workflow step |

The attempt layer is the one many teams miss. If a test step is retried, the second execution should not blindly reuse a previously consumed token unless your test harness explicitly allows it. For OTPs and magic links, consume-once semantics are usually safer: once an artifact is accepted for an attempt, later duplicates become no-ops.
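Consume-once semantics can be sketched with a small ledger keyed by attempt and step. This in-memory version is only an illustration with hypothetical names (createAttemptLedger, consumeOnce). A harness that spans several processes would need a shared store with an atomic insert instead.

```javascript
// In-memory consume-once ledger (illustrative; not safe across processes).
function createAttemptLedger() {
  const consumed = new Map(); // "attemptId:step" -> accepted artifact key

  return {
    // The first call for an attempt and step wins; later calls are no-ops
    // that report which artifact key was accepted.
    consumeOnce(attemptId, step, artifactKey) {
      const key = `${attemptId}:${step}`;
      if (consumed.has(key)) {
        return { status: "duplicate", acceptedArtifactKey: consumed.get(key) };
      }
      consumed.set(key, artifactKey);
      return { status: "accepted", acceptedArtifactKey: artifactKey };
    }
  };
}
```

Returning the previously accepted key on a duplicate gives the audit trail a direct answer to which artifact the attempt actually used.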

Be careful with secrets in dedupe keys. Do not log raw OTPs or full magic links. Hash sensitive artifacts before storage or logging, and store only what you need to debug behavior.
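The sketch below shows one possible way to build keys for the delivery, message, and artifact layers without storing raw secrets. The field names (delivery_id, message_id, inbox_id) are assumed. The artifact key uses a keyed hash (HMAC) rather than a plain hash, because a short OTP has so few possible values that an unkeyed hash could be reversed by trying them all. That choice is a suggestion of this example, not something the layers above require.

```javascript
const { createHash, createHmac } = require("node:crypto");

// Plain SHA-256 for non-secret content such as raw payloads.
function sha256Hex(value) {
  return createHash("sha256").update(String(value), "utf8").digest("hex");
}

// Delivery layer: provider delivery ID, else a hash of the raw payload.
function deliveryKey(event, rawBody) {
  return event.delivery_id ?? `raw:${sha256Hex(rawBody)}`;
}

// Message layer: inbox_id plus message_id, with a content-hash fallback.
function messageKey(msg) {
  const id = msg.message_id ??
    `content:${sha256Hex(JSON.stringify([msg.from, msg.subject, msg.received_at, msg.text]))}`;
  return `${msg.inbox_id}:${id}`;
}

// Artifact layer: message ID, artifact type, and a keyed hash of the
// normalized artifact. `secret` is a harness-only key (an assumption here).
function artifactKey(messageId, type, rawArtifact, secret) {
  const normalized = String(rawArtifact).trim();
  const digest = createHmac("sha256", secret).update(normalized, "utf8").digest("hex");
  return `${messageId}:${type}:${digest}`;
}
```

These keys are safe to log because the OTP or link itself never appears in them, yet two deliveries of the same artifact still produce the same key.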

Make webhook handlers idempotent

Webhook handlers should be fast, secure, and repeat-safe. They should also verify authenticity before parsing or trusting the payload. Mailhook supports signed payloads, so your receiver can reject spoofed or tampered webhook requests before they reach test logic.

A robust handler follows this shape:

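async function handleEmailWebhook(request) {
  const rawBody = request.rawBody;

  // Verify authenticity against the raw body before parsing anything.
  if (!verifyProviderSignature(rawBody, request.headers)) {
    return response(401, "invalid signature");
  }

  // A signed but malformed body is a client error, not a crash.
  let event;
  try {
    event = JSON.parse(rawBody);
  } catch {
    return response(400, "malformed payload");
  }

  const deliveryKey = event.delivery_id ?? stableHash(rawBody);

  if (await deliveries.exists(deliveryKey)) {
    return response(200, "duplicate ignored");
  }

  // Upsert is repeat-safe, so a retry after a partial failure cannot
  // create a second copy of the message.
  await messages.upsert({
    inboxId: event.inbox_id,
    messageId: event.message_id,
    payload: event
  });

  // Consumers should still dedupe by message key: concurrent retries
  // can enqueue the same message twice.
  await queue.enqueue({
    type: "email.received",
    inboxId: event.inbox_id,
    messageId: event.message_id
  });

  // Record the delivery last. If an earlier step throws, the provider's
  // retry is processed again instead of being dropped as a duplicate.
  await deliveries.insert({
    deliveryKey,
    inboxId: event.inbox_id,
    receivedAt: event.received_at
  });

  return response(200, "accepted");
}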

This structure keeps verification separate from processing. It also means a webhook retry returns success without re-running extraction or advancing the CI state twice. For a deeper security checklist, see Signed Webhooks for Email: What to Verify First.
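For illustration, a signature check could look like the following sketch. It assumes an HMAC-SHA256 scheme with the hex digest sent in an x-signature header and a shared secret passed in explicitly. Both the header name and the scheme are assumptions made for this example, so consult your provider's documentation for the real format.

```javascript
const { createHmac, timingSafeEqual } = require("node:crypto");

// ASSUMPTION: the provider signs the raw body with HMAC-SHA256 and sends
// the hex digest in an `x-signature` header. Real schemes vary by provider.
function verifyProviderSignature(rawBody, headers, secret) {
  const received = headers["x-signature"];
  if (typeof received !== "string") return false;

  // Sign the exact bytes that were received, not a re-serialized object.
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const receivedBytes = Buffer.from(received, "hex");

  // timingSafeEqual throws on length mismatch, so check length first.
  if (receivedBytes.length !== expected.length) return false;

  // Constant-time comparison avoids leaking how many bytes matched.
  return timingSafeEqual(receivedBytes, expected);
}
```

Verifying against the raw body matters because parsing and re-serializing JSON can change whitespace or key order and invalidate an otherwise correct signature.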

Give LLM agents bounded outcomes, not inbox access

LLM agents should not browse raw inboxes or decide how long to wait. They should call a small tool with deterministic behavior. The tool can use Mailhook or another email API behind the scenes, but the model should only see a minimal result.

| Tool result | Meaning | What the agent should do |
| --- | --- | --- |
| `found` | A matching artifact was extracted | Continue with the OTP or approved URL |
| `timeout` | No matching message arrived before the deadline | Retry according to policy or fail the task |
| `duplicate` | The artifact was already consumed | Do not submit it again |
| `invalid` | A message arrived but failed policy checks | Request a resend or escalate |
| `security_rejected` | Link, sender, signature, or content failed validation | Stop and report safely |

This keeps the model out of security enforcement. Code verifies signatures, validates links, dedupes artifacts, and applies deadlines. The agent receives only the outcome needed for the next step. If you are designing agent-facing email payloads, the Email to JSON schema guide is a useful companion.
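A tool wrapper along these lines could map internal results to the outcomes in the table. The names getVerificationCodeTool, waitForArtifact, and ledger.consumeOnce are hypothetical, as is the shape of the internal result (status, artifact, artifactKey). This is a sketch of the boundary, not a prescribed interface.

```javascript
// Translate internal wait and validation results into the minimal
// outcome an agent is allowed to see. All collaborators are injected.
async function getVerificationCodeTool({ waitForArtifact, ledger, attemptId }) {
  const result = await waitForArtifact();

  if (result.status === "timeout") return { outcome: "timeout" };
  if (result.status === "security_rejected") return { outcome: "security_rejected" };

  // Anything else that did not produce an artifact is a policy failure.
  if (result.status !== "found" || !result.artifact) return { outcome: "invalid" };

  // Consume once per attempt so a repeated tool call cannot resubmit.
  const claim = ledger.consumeOnce(attemptId, "verify-email", result.artifactKey);
  if (claim.status === "duplicate") return { outcome: "duplicate" };

  // Only the artifact crosses the boundary: no headers, body, or HTML.
  return { outcome: "found", artifact: result.artifact };
}
```

The agent never receives inbox identifiers, message bodies, or rejection details, so a prompt-injected email has nothing to steer beyond the single validated artifact.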

What to log when a bounded wait times out

A timeout should be actionable. “Verification email not found” is better than a browser assertion failure, but it is still not enough for fast CI triage.

Log the identifiers and counters that let an engineer reconstruct the path without exposing secrets. Useful fields include run ID, attempt ID, inbox ID, generated email address, expected sender, expected artifact type, deadline, poll count, webhook delivery count, number of messages seen, number of messages rejected by the matcher, and the last received timestamp.

Also log why candidates were rejected. For example, “sender mismatch,” “no OTP candidate,” “correlation token missing,” or “artifact already consumed.” These reasons turn flaky email failures into ordinary engineering failures that can be fixed at the right layer.

Avoid logging full email bodies, raw OTPs, full magic links, or untrusted HTML. CI logs are often more widely accessible than production data stores.
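One possible shape for such a timeout record is sketched below. The function name, the context object, and the rejected entries (each with reasons and receivedAt) are assumptions for this example. The function accepts only identifiers, counters, reasons, and timestamps, so message bodies and tokens cannot reach the log through it.

```javascript
// Build a loggable timeout record from identifiers and counters only.
// `rejected` holds one entry per message the matcher turned down:
// { reasons: string[], receivedAt: string }. No bodies, OTPs, or links.
function buildTimeoutDiagnostics({ context, deadlineMs, pollCount, webhookDeliveryCount, rejected }) {
  // Count how often each rejection reason occurred.
  const rejectionCounts = {};
  for (const { reasons } of rejected) {
    for (const reason of reasons) {
      rejectionCounts[reason] = (rejectionCounts[reason] ?? 0) + 1;
    }
  }

  // Latest arrival time among the rejected messages, if any parsed cleanly.
  const receivedTimes = rejected
    .map((entry) => Date.parse(entry.receivedAt))
    .filter(Number.isFinite);

  return {
    error: "email_wait_timeout",
    ...context, // e.g. runId, attemptId, inboxId, expectedSender
    deadlineMs,
    pollCount,
    webhookDeliveryCount,
    messagesSeen: rejected.length,
    rejectionCounts,
    lastReceivedAt: receivedTimes.length > 0
      ? new Date(Math.max(...receivedTimes)).toISOString()
      : null
  };
}
```

A record with messagesSeen at zero points toward sending or delivery, while a non-zero count with a dominant rejection reason points toward the matcher or the product's email template.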

CI checklist for bounded waits and dedupe

Before you call an email-dependent test stable, check the full harness, not just the assertion.

Create a new disposable inbox for each attempt or each isolated test run.
Store inbox_id, email address, run ID, and attempt ID together.
Use webhook-first delivery for low latency, with polling fallback for recovery.
Give every wait a deadline and return a typed timeout result.
Match inside the inbox using structured JSON fields, not rendered HTML.
Dedupe delivery events, stored messages, extracted artifacts, and consumed attempts.
Verify signed webhook payloads before parsing or processing.
Extract only the minimum artifact needed, such as an OTP or approved magic link.
Redact tokens and full links from logs.
Emit wait duration, duplicate count, timeout count, and matcher rejection metrics.

Mailhook provides the primitives needed for this checklist: programmable temp inboxes, structured JSON emails, real-time webhooks, polling API access, signed payloads, shared domains for quick starts, custom domains for controlled environments, and batch processing for larger agent runs. You can review the canonical integration surface in Mailhook’s llms.txt.

Frequently Asked Questions

What is a bounded wait in CI email testing?

A bounded wait is an explicit wait for a specific email event with a deadline, matcher, retrieval strategy, and structured failure result. It replaces fixed sleeps and unbounded polling loops.

Why do I need dedupe if my app sends only one email?

Your app may intend to send one email, but webhook retries, polling fallback, resend buttons, queue retries, and CI retries can still cause duplicate deliveries or duplicate artifacts. Dedupe makes those repeats harmless.

Should CI use webhooks or polling to wait for email?

Use webhooks first when possible because they are low latency and event-driven. Keep polling as a fallback for reconciliation, local development, and recovery from missed webhook events.

How long should an email wait timeout be?

Set the deadline based on your system’s normal delivery behavior and CI tolerance. The key is that the timeout is explicit, measured, and produces diagnostics. Do not hide uncertainty with very long sleeps.

Can LLM agents safely wait for verification emails?

Yes, if the agent calls a constrained tool. The tool should create an isolated inbox, verify delivery, apply bounded waits, dedupe artifacts, validate links or OTPs in code, and return only the minimal result to the model.

How does Mailhook help stop CI email flakes?

Mailhook lets you create disposable inboxes via API, receive emails as structured JSON, use real-time webhooks or polling, verify signed payloads, and isolate each CI or agent attempt with its own inbox.

Build email waits that fail clearly, not randomly

CI email flakes are not inevitable. Most disappear when you stop waiting hopefully and start waiting contractually: one inbox per attempt, bounded waits, narrow matchers, webhook-first delivery, polling fallback, and layered dedupe.

If your tests, QA automation, or LLM agents need reliable temporary inboxes, try Mailhook. You can create disposable inboxes through an API, receive structured JSON emails, and wire bounded email waits into your workflows without managing mail servers. For implementation details, use the Mailhook llms.txt reference as the source of truth.