
Stop CI Email Flakes With Bounded Waits and Dedupe


CI email flakes rarely come from one mysterious broken test. They usually come from two predictable design mistakes: waiting without a clear deadline and processing the same email more than once.

Email is asynchronous by nature. Your app queues a message, a mail service accepts it, an inbound system receives it, your test runner or agent reads it, and only then can the workflow continue. If your CI job uses a fixed sleep, an unbounded loop, or a broad inbox search, it is guessing. Under parallel load, retries, delayed delivery, and duplicate webhook attempts, guesses become flakes.

The fix is not to make the sleep longer. The fix is to treat email as an event stream with two reliability rules: bounded waits and dedupe.

What bounded waits actually mean

A bounded wait is an explicit contract for waiting on an email. It says what inbox to watch, which message counts as a match, how long the test is allowed to wait, how retrieval should happen, and what diagnostic information should be returned on failure.

That is very different from sleep(30000).

A good bounded wait includes:

  • A specific inbox_id, not a shared mailbox query.
  • A deadline, not an infinite loop.
  • A narrow matcher, such as expected sender, recipient, subject pattern, correlation ID, or artifact type.
  • A retrieval plan, usually webhook-first with polling fallback.
  • A structured failure result with enough context to debug CI.
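
One way to make that contract concrete is to validate it as a small spec object before any waiting starts. This is a minimal sketch: `createWaitSpec`, `timeoutResult`, and their field names are hypothetical and are not part of any provider's API.

```javascript
// Hypothetical helper: validates a bounded-wait contract before waiting begins.
function createWaitSpec({ inboxId, deadlineMs, match, retrieval = "webhook-first" }) {
  if (typeof inboxId !== "string" || inboxId.length === 0) {
    throw new TypeError("inboxId is required: watch one inbox, not a shared mailbox");
  }
  if (!Number.isFinite(deadlineMs) || deadlineMs <= 0) {
    throw new RangeError("deadlineMs must be a positive, finite number");
  }
  if (typeof match !== "function") {
    throw new TypeError("match must be a function that accepts a message");
  }
  // Frozen so the contract cannot drift while the wait is in progress.
  return Object.freeze({ inboxId, deadlineMs, match, retrieval });
}

// Hypothetical helper: the structured failure result returned on timeout.
function timeoutResult(spec, counters) {
  return {
    status: "timeout",
    inboxId: spec.inboxId,
    deadlineMs: spec.deadlineMs,
    retrieval: spec.retrieval,
    ...counters
  };
}
```

Rejecting an infinite or missing deadline at construction time means an unbounded wait fails in the harness, not halfway through a CI run.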

The goal is not just to reduce waiting time. The goal is to make failure deterministic. If the message does not arrive before the deadline, the test should fail once, clearly, with evidence. It should not hang, pass by accidentally reading an old email, or fail three steps later because the wrong token was used.

| Brittle CI email wait | Bounded CI email wait |
| --- | --- |
| Sleeps for a fixed number of seconds | Waits until a deadline and exits early on success |
| Searches a shared inbox | Watches one disposable inbox for one attempt |
| Matches on broad subject text | Uses sender, recipient, correlation, and artifact rules |
| Fails with “element not found” later | Fails with “email not received” diagnostics |
| Reprocesses duplicates | Uses dedupe keys at delivery, message, and artifact layers |

Why dedupe is just as important as waiting

Even if your wait logic is perfect, duplicate processing can still break CI. Email providers, webhook systems, test retries, resend buttons, and your own queue workers can all produce repeated delivery events. A robust automation pipeline should assume that inbound email delivery is effectively at-least-once.

That does not mean every email will be duplicated. It means your code should remain correct if it is.

Without dedupe, these failures are common:

  • A webhook retry inserts the same message twice.
  • A resend produces two valid OTP emails and the test consumes the older one.
  • A polling fallback sees a message already handled by the webhook path.
  • A retried CI attempt reads a message from a previous attempt.
  • An LLM agent submits the same magic link twice because it saw the same artifact twice.

Dedupe turns “maybe repeated” events into stable records. It also gives you a clean audit trail for debugging. Instead of asking “which of these five emails did the test use?”, you can ask “which delivery, message, and artifact were accepted for this attempt?”

The flake-resistant CI email flow

For email-dependent CI steps, use a short-lived inbox per attempt. This keeps the search space small, prevents stale messages from previous jobs, and gives every wait a natural boundary.

A reliable flow looks like this:

  1. Create a disposable inbox through an API.
  2. Store the returned email address and inbox identifier with the CI run ID.
  3. Trigger the product action, such as signup, password reset, OTP login, or invite flow.
  4. Wait for a matching email using webhooks first, with polling as a fallback.
  5. Receive the email as structured JSON, then extract only the needed artifact.
  6. Mark the artifact as consumed once, then continue the test.
  7. Expire or stop using the inbox after the attempt.
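
The steps above can be sketched as a single harness function. Everything injected here is a placeholder: `emailApi.createInbox`, `emailApi.expireInbox`, `triggerAction`, `waitForMessage`, and `extractArtifact` are hypothetical names standing in for your provider client and test helpers, and `waitForMessage` is assumed to be a wrapper that already knows the matcher.

```javascript
// Sketch of the per-attempt flow with injected, hypothetical dependencies.
async function runVerificationAttempt(deps) {
  const { runId, attemptId, emailApi, triggerAction, waitForMessage, extractArtifact } = deps;

  // Steps 1-2: one disposable inbox per attempt, stored with the run identifiers.
  const inbox = await emailApi.createInbox();
  const context = { runId, attemptId, inboxId: inbox.id, address: inbox.address };

  try {
    // Step 3: trigger the product action against the generated address.
    await triggerAction(inbox.address);

    // Step 4: bounded wait; any non-"found" outcome is returned as-is.
    const waited = await waitForMessage({ inboxId: inbox.id, deadlineMs: 30000 });
    if (waited.status !== "found") return { ...context, status: waited.status };

    // Step 5: extract only the artifact the test needs.
    const artifact = extractArtifact(waited.message);
    if (!artifact) return { ...context, status: "invalid" };

    // Step 6 (marking the artifact consumed) is left to the caller's dedupe store.
    return { ...context, status: "found", artifact };
  } finally {
    // Step 7: stop using the inbox whether the attempt passed or failed.
    await emailApi.expireInbox(inbox.id);
  }
}
```

Returning the run, attempt, and inbox identifiers with every outcome keeps a timeout traceable to the exact inbox that was watched.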

Mailhook is built around this pattern: disposable inbox creation via API, structured JSON email output, RESTful access, real-time webhook notifications, polling for fallback retrieval, signed payloads, shared domains, custom domain support, and batch email processing. For exact integration details and machine-readable guidance, start with the Mailhook llms.txt contract.

How to design a bounded wait

A bounded wait should have one owner: the test harness, not the test body and not the LLM agent. The test should call a helper such as waitForVerificationEmail() and receive either a matched message or a typed timeout error.

Here is provider-neutral pseudocode for a polling fallback. In production, this often runs alongside a webhook event buffer so the webhook path can resolve the wait as soon as the message arrives.

async function waitForMessage({ inboxId, deadlineMs, match, emailApi }) {
  const startedAt = Date.now();
  const deadlineAt = startedAt + deadlineMs;
  // Keys of messages already evaluated, so repeats across polls are checked once.
  const seenMessages = new Set();
  let pollCount = 0;

  while (Date.now() < deadlineAt) {
    pollCount += 1;

    const messages = await emailApi.listMessages({
      inboxId,
      limit: 50
    });

    for (const msg of messages) {
      // Prefer the provider's message ID; fall back to a content hash.
      const messageKey = msg.message_id ?? stableHash([
        msg.inbox_id,
        msg.from,
        msg.subject,
        msg.received_at,
        msg.text
      ]);

      if (seenMessages.has(messageKey)) continue;
      seenMessages.add(messageKey);

      if (match(msg)) {
        return {
          status: "found",
          message: msg,
          pollCount,
          waitMs: Date.now() - startedAt
        };
      }
    }

    // jitteredDelay receives the time remaining so it can cap the sleep at the deadline.
    await sleep(jitteredDelay(deadlineAt - Date.now()));
  }

  // Deadline reached: fail once, with counters for triage.
  return {
    status: "timeout",
    inboxId,
    deadlineMs,
    seenCount: seenMessages.size,
    pollCount
  };
}

The important part is not the exact code. The important part is that the function has a deadline, tracks what it has seen, returns structured results, and does not leak duplicate messages into the rest of the test.
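
The pseudocode leaves `sleep`, `jitteredDelay`, and `stableHash` undefined. One possible set of definitions for Node.js is sketched below; the 1-second base delay and the 50% to 150% jitter range are assumptions to tune, not recommendations from any provider.

```javascript
const { createHash } = require("node:crypto");

// Resolves after the given number of milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Returns a randomized delay around baseMs, never longer than the time remaining.
function jitteredDelay(remainingMs, baseMs = 1000) {
  const jittered = baseMs * (0.5 + Math.random()); // 50% to 150% of baseMs
  return Math.max(0, Math.min(jittered, remainingMs));
}

// Hashes a value (or list of fields) into a stable hex key.
// JSON.stringify keeps field boundaries unambiguous: ["a", "b"] differs from ["ab"].
function stableHash(parts) {
  return createHash("sha256").update(JSON.stringify(parts)).digest("hex");
}
```

Capping the delay at the remaining time keeps the final poll from sleeping past the deadline, and the jitter spreads out polls from parallel CI jobs.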

For webhook-first delivery, the same principles apply. The webhook handler should verify the signed payload, dedupe the delivery, persist or enqueue the message, and notify any waiters for that inbox. Polling then becomes a reconciliation mechanism, not the primary source of truth. If you are choosing between delivery approaches, this guide on webhooks or polling for inbound email automation covers the trade-offs in more depth.
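
A webhook-first wait needs a place where the handler can hand messages to waiting tests. The sketch below is a hypothetical in-memory version, `InboxWaiters`, that assumes a single harness process; a multi-process setup would need a shared store or queue instead.

```javascript
// Hypothetical in-memory buffer: the webhook path calls notify(), tests call waitFor().
class InboxWaiters {
  constructor() {
    this.messages = new Map(); // inboxId -> messages received so far
    this.waiters = new Map();  // inboxId -> Set of pending waiters
  }

  // Call this after the handler has verified and deduped a message.
  notify(inboxId, message) {
    if (!this.messages.has(inboxId)) this.messages.set(inboxId, []);
    this.messages.get(inboxId).push(message);
    for (const waiter of this.waiters.get(inboxId) ?? []) {
      if (waiter.match(message)) waiter.settle({ status: "found", message });
    }
  }

  // Resolves with "found" as soon as a match arrives, or "timeout" at the deadline.
  waitFor({ inboxId, deadlineMs, match }) {
    // A message may have arrived before the test started waiting.
    const early = (this.messages.get(inboxId) ?? []).find(match);
    if (early) return Promise.resolve({ status: "found", message: early });

    return new Promise((resolve) => {
      if (!this.waiters.has(inboxId)) this.waiters.set(inboxId, new Set());
      const pending = this.waiters.get(inboxId);
      const waiter = {
        match,
        settle: (result) => {
          clearTimeout(timer);
          pending.delete(waiter);
          resolve(result);
        }
      };
      const timer = setTimeout(
        () => waiter.settle({ status: "timeout", inboxId, deadlineMs }),
        deadlineMs
      );
      pending.add(waiter);
    });
  }
}
```

The promise never rejects: both outcomes are ordinary results, so the caller handles a timeout with the same code path as a match.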

Match narrowly, not hopefully

The most dangerous matcher in CI is “first email with a subject containing verify.” It works until your suite runs in parallel, the product sends a reminder, a resend arrives late, or a previous run left a matching message behind.

A safer matcher checks multiple signals inside a specific inbox. Structured JSON makes this much easier because your code can compare fields directly instead of scraping rendered HTML.

| Matching signal | Why it helps | Common mistake |
| --- | --- | --- |
| inbox_id | Isolates the attempt | Searching across a shared mailbox |
| Recipient address | Confirms the app sent to the generated address | Trusting only the subject |
| Sender or domain | Filters unrelated system emails | Accepting any sender with a matching keyword |
| Correlation token | Ties the message to a run or attempt | Reusing the same token across retries |
| Received timestamp | Avoids stale messages | Selecting the oldest matching email |
| Artifact type | Ensures the email contains an OTP, link, or expected payload | Parsing the whole HTML body blindly |

If your product can include a correlation ID in the email subject, body, link parameter, or custom header, use it. If not, the per-attempt inbox still gives you strong isolation. The matcher should be narrow enough that receiving the wrong email is harder than timing out.
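
A matcher that combines these signals might look like the sketch below. The message field names (`inbox_id`, `to`, `from`, `subject`, `text`, `received_at`) are assumed for illustration, and `from` is assumed to be a bare address; check your provider's actual schema before relying on them.

```javascript
// Hypothetical matcher factory combining several signals inside one inbox.
function createMatcher({ inboxId, recipient, senderDomain, correlationId, notBefore }) {
  // Returns a rejection reason, or null when every signal matches.
  function rejectionReason(msg) {
    if (msg.inbox_id !== inboxId) return "inbox mismatch";
    if (String(msg.to).toLowerCase() !== recipient.toLowerCase()) return "recipient mismatch";
    if (!String(msg.from).toLowerCase().endsWith("@" + senderDomain.toLowerCase())) {
      return "sender mismatch";
    }
    // Written as a negated >= so an unparseable timestamp (NaN) is rejected too.
    if (!(Date.parse(msg.received_at) >= notBefore)) return "stale message";
    const haystack = `${msg.subject ?? ""}\n${msg.text ?? ""}`;
    if (!haystack.includes(correlationId)) return "correlation token missing";
    return null;
  }

  const match = (msg) => rejectionReason(msg) === null;
  return { match, rejectionReason };
}
```

Exposing the rejection reason alongside the boolean lets the harness record why each candidate was skipped without changing how the wait loop calls `match`.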

Dedupe in layers

One dedupe key is not enough because duplicates happen at different levels. A webhook retry is not the same as a resend. Two messages can contain the same OTP artifact. A retried CI attempt may intentionally generate a new message but must not reuse an old artifact.

Use layered dedupe so each layer answers a different question.

| Layer | Question it answers | Example dedupe key |
| --- | --- | --- |
| Delivery | Have we processed this webhook or polling delivery before? | Provider delivery ID, or hash of raw payload plus timestamp bucket |
| Message | Have we stored this email message before? | inbox_id plus message_id, with content hash fallback |
| Artifact | Have we already extracted this OTP or magic link from this message? | message_id plus artifact type plus normalized artifact hash |
| Attempt | Has this CI attempt already consumed a verification artifact? | attempt_id plus workflow step |

The attempt layer is the one many teams miss. If a test step is retried, the second execution should not blindly reuse a previously consumed token unless your test harness explicitly allows it. For OTPs and magic links, consume-once semantics are usually safer: once an artifact is accepted for an attempt, later duplicates become no-ops.
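
Consume-once semantics can be sketched as a small ledger keyed by attempt and step. `AttemptLedger` is a hypothetical in-memory example; a real harness would likely back it with a store that offers an atomic insert-if-absent so concurrent workers cannot both win.

```javascript
// Hypothetical in-memory ledger: the first artifact accepted for a step wins.
class AttemptLedger {
  constructor() {
    this.consumed = new Map(); // "[attemptId, step]" -> accepted artifact key
  }

  // Returns "accepted" the first time, "duplicate" for every later call.
  consume({ attemptId, step, artifactKey }) {
    const key = JSON.stringify([attemptId, step]);
    if (this.consumed.has(key)) {
      return { status: "duplicate", acceptedArtifactKey: this.consumed.get(key) };
    }
    this.consumed.set(key, artifactKey);
    return { status: "accepted", acceptedArtifactKey: artifactKey };
  }
}
```

Reporting which artifact was originally accepted gives a retried step enough information to log the conflict instead of silently submitting a second token.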

Be careful with secrets in dedupe keys. Do not log raw OTPs or full magic links. Hash sensitive artifacts before storage or logging, and store only what you need to debug behavior.
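
As a rough illustration, the four layers could derive their keys as shown below. The function names and event fields are hypothetical. The artifact key uses a keyed HMAC instead of a plain hash because a short OTP has so few possible values that an unkeyed SHA-256 can be reversed by trying them all; the `secret` is assumed to be a per-environment value kept out of logs.

```javascript
const { createHash, createHmac } = require("node:crypto");

const sha256 = (value) => createHash("sha256").update(value).digest("hex");

// Delivery layer: provider delivery ID, or a hash of the raw payload.
function deliveryKey(event, rawBody) {
  return `delivery:${event.delivery_id ?? sha256(rawBody)}`;
}

// Message layer: inbox plus message ID, with a content hash fallback.
function messageKey(msg) {
  const id = msg.message_id ??
    sha256(JSON.stringify([msg.from, msg.subject, msg.received_at, msg.text]));
  return `message:${msg.inbox_id}:${id}`;
}

// Artifact layer: the OTP or link is normalized, then replaced by a keyed digest.
function artifactKey({ messageId, type, value, secret }) {
  const digest = createHmac("sha256", secret).update(value.trim()).digest("hex");
  return `artifact:${messageId}:${type}:${digest}`;
}

// Attempt layer: one consumed artifact per attempt and workflow step.
function attemptKey({ attemptId, step }) {
  return `attempt:${attemptId}:${step}`;
}
```

Because the raw artifact never appears in any key, the keys can be stored and logged for debugging.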

Make webhook handlers idempotent

Webhook handlers should be fast, secure, and repeat-safe. They should also verify authenticity before parsing or trusting the payload. Mailhook supports signed payloads, so your receiver can reject spoofed or tampered webhook requests before they reach test logic.

A robust handler follows this shape:

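async function handleEmailWebhook(request) {
  // Verify against the raw bytes; re-serialized JSON may not match the signed input.
  const rawBody = request.rawBody;

  if (!verifyProviderSignature(rawBody, request.headers)) {
    return response(401, "invalid signature");
  }

  const event = JSON.parse(rawBody);
  const deliveryKey = event.delivery_id ?? stableHash(rawBody);

  // Delivery-layer dedupe: a retry of a fully processed delivery is a no-op.
  if (await deliveries.exists(deliveryKey)) {
    return response(200, "duplicate ignored");
  }

  // Message-layer dedupe: the upsert is safe to repeat.
  await messages.upsert({
    inboxId: event.inbox_id,
    messageId: event.message_id,
    payload: event
  });

  await queue.enqueue({
    type: "email.received",
    inboxId: event.inbox_id,
    messageId: event.message_id
  });

  // Record the delivery last. If a step above fails, the provider's retry is
  // processed again instead of being ignored as a duplicate, so the message is
  // not lost. The queue consumer must tolerate a repeated "email.received" event.
  await deliveries.insert({
    deliveryKey,
    inboxId: event.inbox_id,
    receivedAt: event.received_at
  });

  return response(200, "accepted");
}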

This structure keeps verification separate from processing. It also means a webhook retry returns success without re-running extraction or advancing the CI state twice. For a deeper security checklist, see Signed Webhooks for Email: What to Verify First.
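
For illustration, a signature check of this kind often looks like the sketch below. The scheme is an assumption, not Mailhook's documented format: it supposes a hex-encoded HMAC-SHA256 of the raw body in a hypothetical `x-signature` header, and it takes the shared secret as a third argument. Use the header name, encoding, and algorithm that your provider actually documents.

```javascript
const { createHmac, timingSafeEqual } = require("node:crypto");

// Assumed scheme: hex HMAC-SHA256 of the raw body in an "x-signature" header.
function verifyProviderSignature(rawBody, headers, secret) {
  const received = headers["x-signature"];
  if (typeof received !== "string") return false;

  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const receivedBytes = Buffer.from(received, "hex");

  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (receivedBytes.length !== expected.length) return false;

  // Constant-time comparison avoids leaking how many leading bytes matched.
  return timingSafeEqual(receivedBytes, expected);
}
```

The check runs on the raw request body, before `JSON.parse`, so a tampered payload is rejected without being interpreted.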

Give LLM agents bounded outcomes, not inbox access

LLM agents should not browse raw inboxes or decide how long to wait. They should call a small tool with deterministic behavior. The tool can use Mailhook or another email API behind the scenes, but the model should only see a minimal result.

| Tool result | Meaning | What the agent should do |
| --- | --- | --- |
| found | A matching artifact was extracted | Continue with the OTP or approved URL |
| timeout | No matching message arrived before the deadline | Retry according to policy or fail the task |
| duplicate | The artifact was already consumed | Do not submit it again |
| invalid | A message arrived but failed policy checks | Request a resend or escalate |
| security_rejected | Link, sender, signature, or content failed validation | Stop and report safely |

This keeps the model out of security enforcement. Code verifies signatures, validates links, dedupes artifacts, and applies deadlines. The agent receives only the outcome needed for the next step. If you are designing agent-facing email payloads, the Email to JSON schema guide is a useful companion.
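
One way to enforce that boundary is a final mapping step that strips everything except the outcome. In this sketch, `toAgentResult` and the `artifactType` and `artifact` fields are hypothetical names for whatever your harness produces internally.

```javascript
// The only statuses the agent is ever allowed to see.
const AGENT_STATUSES = new Set(["found", "timeout", "duplicate", "invalid", "security_rejected"]);

// Hypothetical mapper: reduces an internal wait result to the minimal agent-facing shape.
function toAgentResult(internal) {
  if (!AGENT_STATUSES.has(internal.status)) {
    throw new Error(`unexpected status: ${internal.status}`);
  }
  if (internal.status === "found") {
    // Only the extracted artifact is exposed, never the message body or headers.
    return {
      status: "found",
      artifactType: internal.artifactType,
      artifact: internal.artifact
    };
  }
  // Diagnostics stay in harness logs; the agent only needs the outcome.
  return { status: internal.status };
}
```

Because the mapper builds a new object from named fields, an internal result that later gains extra fields cannot leak them to the model by accident.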

What to log when a bounded wait times out

A timeout should be actionable. “Verification email not found” is better than a browser assertion failure, but it is still not enough for fast CI triage.

Log the identifiers and counters that let an engineer reconstruct the path without exposing secrets. Useful fields include run ID, attempt ID, inbox ID, generated email address, expected sender, expected artifact type, deadline, poll count, webhook delivery count, number of messages seen, number of messages rejected by the matcher, and the last received timestamp.

Also log why candidates were rejected. For example, “sender mismatch,” “no OTP candidate,” “correlation token missing,” or “artifact already consumed.” These reasons turn flaky email failures into ordinary engineering failures that can be fixed at the right layer.

Avoid logging full email bodies, raw OTPs, full magic links, or untrusted HTML. CI logs are often more widely accessible than production data stores.
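
An allowlist is a simple way to apply both rules at once. The sketch below copies only named diagnostic fields into the log entry and summarizes rejection reasons as counts; the field names and the `email_wait_timeout` event name are illustrative assumptions.

```javascript
// Only these diagnostic fields are ever copied into CI logs.
const LOGGABLE_FIELDS = [
  "runId", "attemptId", "inboxId", "address", "expectedSender",
  "expectedArtifactType", "deadlineMs", "pollCount",
  "webhookDeliveryCount", "seenCount", "lastReceivedAt"
];

// Hypothetical helper: builds a timeout log entry from an allowlist of fields.
function buildTimeoutLog(diagnostics, rejectionReasons = []) {
  const entry = { event: "email_wait_timeout" };
  for (const field of LOGGABLE_FIELDS) {
    if (diagnostics[field] !== undefined) entry[field] = diagnostics[field];
  }
  // Count matcher rejections by reason, for example { "sender mismatch": 2 }.
  entry.rejections = {};
  for (const reason of rejectionReasons) {
    entry.rejections[reason] = (entry.rejections[reason] ?? 0) + 1;
  }
  return entry;
}
```

An allowlist fails safe: a new field such as a raw body or token stays out of the logs until someone deliberately adds it.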

CI checklist for bounded waits and dedupe

Before you call an email-dependent test stable, check the full harness, not just the assertion.

  • Create a new disposable inbox for each attempt or each isolated test run.
  • Store inbox_id, email address, run ID, and attempt ID together.
  • Use webhook-first delivery for low latency, with polling fallback for recovery.
  • Give every wait a deadline and return a typed timeout result.
  • Match inside the inbox using structured JSON fields, not rendered HTML.
  • Dedupe delivery events, stored messages, extracted artifacts, and consumed attempts.
  • Verify signed webhook payloads before parsing or processing.
  • Extract only the minimum artifact needed, such as an OTP or approved magic link.
  • Redact tokens and full links from logs.
  • Emit wait duration, duplicate count, timeout count, and matcher rejection metrics.

Mailhook provides the primitives needed for this checklist: programmable temp inboxes, structured JSON emails, real-time webhooks, polling API access, signed payloads, shared domains for quick starts, custom domains for controlled environments, and batch processing for larger agent runs. You can review the canonical integration surface in Mailhook’s llms.txt.

Frequently Asked Questions

What is a bounded wait in CI email testing? A bounded wait is an explicit wait for a specific email event with a deadline, matcher, retrieval strategy, and structured failure result. It replaces fixed sleeps and unbounded polling loops.

Why do I need dedupe if my app sends only one email? Your app may intend to send one email, but webhook retries, polling fallback, resend buttons, queue retries, and CI retries can still cause duplicate deliveries or duplicate artifacts. Dedupe makes those repeats harmless.

Should CI use webhooks or polling to wait for email? Use webhooks first when possible because they are low latency and event-driven. Keep polling as a fallback for reconciliation, local development, and recovery from missed webhook events.

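How long should an email wait timeout be? Set the deadline from measured delivery behavior, not a guess: for example, start from a high percentile of the delivery latency you observe in CI, add headroom, and cap the result at what your pipeline can tolerate. The key is that the timeout is explicit, measured, and produces diagnostics. Do not hide uncertainty with very long sleeps.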

Can LLM agents safely wait for verification emails? Yes, if the agent calls a constrained tool. The tool should create an isolated inbox, verify delivery, apply bounded waits, dedupe artifacts, validate links or OTPs in code, and return only the minimal result to the model.

How does Mailhook help stop CI email flakes? Mailhook lets you create disposable inboxes via API, receive emails as structured JSON, use real-time webhooks or polling, verify signed payloads, and isolate each CI or agent attempt with its own inbox.

Build email waits that fail clearly, not randomly

CI email flakes are not inevitable. Most disappear when you stop waiting hopefully and start waiting contractually: one inbox per attempt, bounded waits, narrow matchers, webhook-first delivery, polling fallback, and layered dedupe.

If your tests, QA automation, or LLM agents need reliable temporary inboxes, try Mailhook. You can create disposable inboxes through an API, receive structured JSON emails, and wire bounded email waits into your workflows without managing mail servers. For implementation details, use the Mailhook llms.txt reference as the source of truth.
