Build Retry-Safe Email Verification Flows in CI

Email verification looks simple until CI starts retrying. A test creates a user, waits for an OTP or magic link, submits it, and passes. Then parallel jobs, webhook retries, SMTP delays, and test-run reruns arrive, and suddenly the suite verifies the wrong account, consumes an old code, or loops forever waiting for a message that already arrived.

A retry-safe email verification flow treats email as an eventually delivered event stream, not as a synchronous mailbox lookup. The goal is not just to “receive the email.” The goal is to make every retry harmless, observable, and tied to one specific CI attempt.

What “retry-safe” means in CI email verification

A retry-safe flow can be rerun without changing the meaning of the test. If the CI provider retries a failed job, if your test runner retries one spec, if a webhook is delivered twice, or if your app resends a verification email, the harness should still select the correct message and consume the verification artifact only once.

In practice, that means the flow must satisfy five properties:

Isolation: Each verification attempt gets its own inbox or inbox handle, not a shared mailbox.
Deterministic waiting: The harness waits until a deadline using webhooks or polling, never with fixed sleeps alone.
Strong correlation: Every message is matched to an attempt using inbox identity, recipient, subject, sender, timestamps, and ideally a correlation token.
Idempotent consumption: Duplicate deliveries and repeated polling results cannot cause the same OTP or magic link to be submitted twice.
Safe extraction: CI and LLM agents receive only the verification artifact they need, not arbitrary HTML or untrusted email content.

Email is built on delivery systems that can retry, defer, and duplicate work. SMTP itself is a store-and-forward protocol, as described in RFC 5321. Your test harness should assume that messages can arrive late, arrive more than once through different delivery paths, or be observed multiple times by polling.

Why normal verification tests break under retries

Most flaky CI email tests are not broken because the email provider is unreliable. They are broken because the test flow has no retry contract.

A typical fragile flow looks like this: create a user with a fixed test address, wait 10 seconds, open a shared mailbox, pick the newest email with “Verify your email” in the subject, extract the first link, and click it. That may work locally. It fails in CI because several independent retry systems are active at the same time.

Retry source	What can happen	Retry-safe response
CI job retry	The same test runs again after partial success	Create a new inbox per attempt and persist attempt IDs
Test runner retry	One spec reruns while previous email is still in flight	Match by inbox and attempt, not by global “newest message”
Application resend	Multiple OTPs or links are generated for one user	Consume only the latest valid artifact for the attempt
Webhook retry	The same inbound email event is delivered more than once	Verify signatures and dedupe by delivery ID and message ID
Polling restart	The poller sees messages it already processed	Use cursors or seen IDs and artifact-level idempotency
SMTP delay	The first attempt’s email arrives during the second attempt	Isolate inboxes and close or ignore expired attempts

The fix is architectural. Instead of making one shared mailbox smarter, make the unit of work smaller and more explicit.

The core invariant: one inbox per attempt

The most important design choice is to treat a CI verification attempt as disposable. Every attempt gets a fresh programmable inbox, and the inbox descriptor is stored with the attempt metadata.

Do not store only an email string. Store a descriptor that lets your harness route, wait, debug, and clean up deterministically.

{
  "run_id": "ci-983421",
  "attempt_id": "ci-983421:test-signup:2",
  "inbox_id": "inb_123",
  "email": "[email protected]",
  "created_at": "2026-06-01T21:00:00Z",
  "expires_at": "2026-06-01T21:15:00Z",
  "status": "active"
}

The exact fields depend on your provider and internal harness, but the principle is stable: the inbox is a first-class resource. When a retry starts, it should create or load the attempt descriptor intentionally. It should never scan a shared mailbox and guess.
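As a rough illustration, a descriptor like the one above could be assembled by a small helper. This is a minimal sketch, not a provider API: `buildAttemptDescriptor` is a hypothetical name, and the `inbox.id` and `inbox.email` fields are assumed shapes for whatever your inbox provider returns.

```javascript
// Hypothetical helper: builds the descriptor persisted for one verification attempt.
// ASSUMPTION: `inbox` has `id` and `email` fields; adapt to your provider's response.
function buildAttemptDescriptor({ runId, testName, attemptNo, inbox, now = new Date(), ttlMinutes = 15 }) {
  const createdAt = new Date(now.getTime());
  const expiresAt = new Date(createdAt.getTime() + ttlMinutes * 60_000);

  return {
    run_id: runId,
    // The attempt ID encodes run, test, and retry number so retries never collide.
    attempt_id: `${runId}:${testName}:${attemptNo}`,
    inbox_id: inbox.id,
    email: inbox.email,
    created_at: createdAt.toISOString(),
    expires_at: expiresAt.toISOString(),
    status: "active",
  };
}
```

Passing `now` explicitly keeps the helper deterministic in unit tests. In a real harness it would default to the current time.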

Mailhook is designed around this model: disposable inbox creation via API, structured JSON email output, REST access, webhooks, polling fallback, shared domains, custom domain support, signed payloads, and batch email processing. For the exact machine-readable integration contract, use the Mailhook llms.txt reference.

A retry-safe verification state machine

A good CI harness has explicit states. This makes retries predictable because every step can decide whether to continue, retry, or return an existing result.

State	Meaning	Safe retry behavior
`provisioned`	Inbox exists, verification has not been triggered	Reuse the attempt descriptor or create a new attempt
`triggered`	App was asked to send the verification email	Do not assume delivery, wait until deadline
`message_received`	A candidate message was stored	Re-run matchers and dedupe before extracting
`artifact_extracted`	OTP or magic link was extracted	Mark artifact hash before submitting it
`artifact_consumed`	Verification was submitted to the app	Return the saved result if the same artifact appears again
`closed`	Attempt is complete or expired	Ignore late messages or store them only for debugging

The important detail is that “message received” is not the same as “verification complete.” CI should not act on raw email arrival. It should act only after matching, extraction, and idempotency checks pass.
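One way to make these states enforceable is a small transition guard. The sketch below is illustrative only: the `TRANSITIONS` map and the `advance` function are hypothetical names, and the allowed transitions are one reasonable reading of the table above, not a fixed specification.

```javascript
// Allowed transitions, derived from the state table. Every state may move to "closed".
const TRANSITIONS = new Map([
  ["provisioned", ["triggered", "closed"]],
  ["triggered", ["message_received", "closed"]],
  ["message_received", ["artifact_extracted", "closed"]],
  ["artifact_extracted", ["artifact_consumed", "closed"]],
  ["artifact_consumed", ["closed"]],
  ["closed", []],
]);

// Returns the attempt in its next state, or throws if the move is not allowed.
function advance(attempt, nextStatus) {
  // A retry that repeats the current step is a harmless no-op.
  if (attempt.status === nextStatus) return attempt;

  const allowed = TRANSITIONS.get(attempt.status);
  if (!allowed) throw new Error(`Unknown status: ${attempt.status}`);
  if (!allowed.includes(nextStatus)) {
    throw new Error(`Illegal transition ${attempt.status} -> ${nextStatus}`);
  }
  return { ...attempt, status: nextStatus };
}
```

Because a repeated step returns the attempt unchanged, a retried worker can call `advance` again without special-casing "already done".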

Build the flow step by step

Provision the inbox before triggering the email

Create the disposable inbox first, then pass its email address into your signup, login, password reset, or verification flow. This ordering avoids a common race where the application sends an email before the receiver is ready.

Persist the inbox descriptor immediately. If the CI process crashes after provisioning but before triggering the app, the next retry can decide whether to continue or start a clean attempt. In most test suites, a clean new attempt is simpler and safer.

Trigger verification with attempt-level correlation

If your application supports metadata, include a correlation token in the verification request or test user profile. For example, you might include a run ID in a test-only header, a hidden metadata field, or a deterministic user attribute such as ci_983421_attempt_2.

You should still rely on inbox isolation as the primary matcher. Correlation tokens are extra protection, not a substitute for a dedicated inbox.

Wait with webhooks first, polling as fallback

Webhooks are ideal for CI because they turn email arrival into an event. They reduce latency and avoid aggressive polling across many parallel jobs. However, CI networks, tunnels, and test environments are not always perfect, so a polling fallback makes the flow resilient.

A robust waiter uses a deadline, not a fixed sleep. It listens for verified webhook events and, within the same deadline, polls the inbox for matching messages that may have arrived while the webhook listener was unavailable.

The flow should be bounded. If the email does not arrive by the deadline, fail with useful diagnostics: attempt ID, inbox ID, recipient, expected sender, expected subject pattern, and the last poll cursor. Avoid logging full email bodies unless your retention and privacy rules allow it.
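The polling half of such a waiter might look like the sketch below. It is a simplified illustration under stated assumptions: `fetchMessages(inboxId, cursor)` is a hypothetical adapter that returns `{ messages, cursor }`, the webhook path is omitted, and `now`, `sleep`, and `random` are injectable only so the function can be tested without real time passing.

```javascript
// Bounded polling waiter with exponential backoff and jitter.
// ASSUMPTION: fetchMessages(inboxId, cursor) resolves to { messages: [...], cursor }.
async function waitForMessage({
  inboxId,
  matcher,
  fetchMessages,
  deadlineMs = 90_000,
  baseDelayMs = 500,
  maxDelayMs = 5_000,
  now = Date.now,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  random = Math.random,
}) {
  const deadline = now() + deadlineMs;
  let cursor = null;
  let polls = 0;

  while (true) {
    const page = await fetchMessages(inboxId, cursor);
    polls += 1;
    cursor = page.cursor ?? cursor;

    const match = page.messages.find(matcher);
    if (match) return match;

    const remaining = deadline - now();
    if (remaining <= 0) {
      // Fail with identifiers only; no message bodies are attached.
      const err = new Error(`No matching message for inbox ${inboxId} within ${deadlineMs} ms`);
      err.diagnostics = { inboxId, lastCursor: cursor, polls };
      throw err;
    }

    // Jittered exponential backoff, never sleeping past the deadline.
    const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (polls - 1));
    const delay = Math.min(remaining, Math.floor(random() * ceiling) + 1);
    await sleep(delay);
  }
}
```

The final poll happens at the deadline itself, so a message that arrived during the last sleep is still found before the waiter gives up.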

Match the message before extracting the artifact

Do not extract the first six digits from the first email you see. Message matching should be layered.

Good matchers include the inbox ID, recipient address, sender domain, subject pattern, received timestamp after the attempt started, and expected content markers. If your app can include a correlation token in the email body or link parameters, use it.

A layered matcher is more robust than a single regex because it can explain why a message was rejected. This is especially useful when debugging CI flakes.
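A layered matcher that reports its rejections could be sketched as follows. The message fields (`inboxId`, `to`, `from`, `subject`, `receivedAt`, `text`) are assumed names for a normalized JSON email, not a documented payload, and `matchVerificationMessage` is a hypothetical helper.

```javascript
// Runs every check and reports which ones failed, instead of returning a bare boolean.
// ASSUMPTION: msg.to is an array of addresses and msg.from is a bare address string.
// Note: pass subjectPattern without the "g" flag, since RegExp.test is stateful with it.
function matchVerificationMessage(msg, expected) {
  const checks = [
    ["inbox", msg.inboxId === expected.inboxId],
    ["recipient", msg.to.includes(expected.email)],
    ["sender", msg.from.toLowerCase().endsWith(`@${expected.senderDomain}`)],
    ["subject", expected.subjectPattern.test(msg.subject)],
    ["received_after_start", Date.parse(msg.receivedAt) >= Date.parse(expected.startedAt)],
  ];

  // The correlation token is optional extra protection on top of inbox isolation.
  if (expected.correlationToken) {
    checks.push(["correlation_token", (msg.text || "").includes(expected.correlationToken)]);
  }

  const rejected = checks.filter(([, ok]) => !ok).map(([name]) => name);
  return { matched: rejected.length === 0, rejected };
}
```

Logging the `rejected` list for each candidate message gives a CI failure report that says "sender rejected" rather than just "timed out".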

Extract the smallest useful artifact

The CI harness usually needs one thing: an OTP code or a verification URL. Extract that from structured JSON fields, preferably from text/plain content or provider-normalized artifacts when available. Avoid rendering arbitrary HTML in CI, and do not expose full HTML to an LLM agent.
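For OTPs, extraction from the plain-text body can be deliberately strict. The sketch below assumes a numeric code of known length and refuses to guess when the text is ambiguous. `extractOtp` is a hypothetical helper, and the six-digit default is an assumption about your application.

```javascript
// Extracts a single numeric OTP from plain text, or returns null if it cannot be sure.
// ASSUMPTION: the code is a standalone run of exactly `digits` digits.
function extractOtp(text, { digits = 6 } = {}) {
  // Lookarounds reject longer digit runs such as order numbers or phone numbers.
  const pattern = new RegExp(`(?<!\\d)\\d{${digits}}(?!\\d)`, "g");
  const candidates = [...new Set(text.match(pattern) || [])];

  // Zero candidates or several distinct candidates: do not pick one arbitrarily.
  if (candidates.length !== 1) return null;
  return { type: "otp", value: candidates[0] };
}
```

Returning `null` on ambiguity turns a silent wrong-code submission into an explicit extraction failure that can be logged and debugged.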

For magic links, validate the URL before using it. Check the host, scheme, path pattern, and expected query parameters. This protects against open redirects, malicious links in forwarded mail, and prompt-injection scenarios when agents are involved.
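The checks described above can be expressed with the standard `URL` class. This is a minimal sketch: `validateVerificationUrl` is a hypothetical helper, and the allowed hosts, path pattern, and required parameters are values you would supply for your own application.

```javascript
// Validates a candidate magic link before the harness is allowed to request it.
function validateVerificationUrl(raw, { allowedHosts, pathPattern, requiredParams }) {
  let url;
  try {
    url = new URL(raw);
  } catch {
    return { ok: false, reason: "unparseable" };
  }

  if (url.protocol !== "https:") return { ok: false, reason: "scheme" };
  // Reject "https://trusted-host@evil.example/..." style links outright.
  if (url.username || url.password) return { ok: false, reason: "credentials" };
  // Exact host match only; suffix matching would accept look-alike domains.
  if (!allowedHosts.includes(url.hostname)) return { ok: false, reason: "host" };
  if (!pathPattern.test(url.pathname)) return { ok: false, reason: "path" };

  for (const name of requiredParams) {
    if (!url.searchParams.get(name)) return { ok: false, reason: `missing_param:${name}` };
  }
  return { ok: true, url };
}
```

This validates only the link found in the email. If the verification endpoint itself redirects, the HTTP client following it would need its own redirect policy.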

Consume the artifact once

Before submitting an OTP or clicking a magic link, compute an idempotency key. A practical key can include the attempt ID, artifact type, normalized artifact value, and target host.

Store that key with a unique constraint. If a retry sees the same artifact again, it should return the existing verification result instead of submitting it again.
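To make the idea concrete, here is a sketch of the key computation with an in-memory stand-in for the unique constraint. The names `artifactKey`, `normalizeUrl`, and `ConsumedArtifacts` are hypothetical, and the `Map` only imitates what a database unique index would enforce durably.

```javascript
const crypto = require("node:crypto");

// Normalizes a URL so trivially different spellings hash to the same key.
function normalizeUrl(raw) {
  const u = new URL(raw);
  u.hash = ""; // fragments never reach the server
  return u.toString();
}

// Idempotency key over attempt ID, artifact type, normalized value, and target host.
function artifactKey({ attemptId, type, value, targetHost }) {
  const normalized = type === "otp" ? value.trim() : normalizeUrl(value);
  return crypto
    .createHash("sha256")
    .update([attemptId, type, normalized, targetHost].join("\n"))
    .digest("hex");
}

// In-memory stand-in for a table with a UNIQUE constraint on the artifact key.
// A real harness needs durable storage so the record survives a CI restart.
class ConsumedArtifacts {
  constructor() {
    this.rows = new Map();
  }
  // Returns true only for the first caller that claims this key.
  markConsumedOnce(key, meta) {
    if (this.rows.has(key)) return false;
    this.rows.set(key, { ...meta, result: null });
    return true;
  }
  saveResult(key, result) {
    this.rows.get(key).result = result;
  }
  getSavedResult(key) {
    return this.rows.get(key)?.result ?? null;
  }
}
```

One case the sketch leaves open is a crash between claiming the key and saving the result. A retry would then see a claimed key with a `null` result, and the harness has to decide whether to fail the attempt or start a new one.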

Dedupe at the right layer

Dedupe is not one key. Email verification has several layers, and each layer answers a different question.

Layer	Example key	Prevents
Delivery	`delivery_id` from a webhook event	Processing the same webhook delivery twice
Message	Provider message ID or RFC `Message-ID` plus inbox ID	Storing the same email as multiple messages
Artifact	Hash of OTP or normalized verification URL plus attempt ID	Submitting the same code or link twice
Attempt	`run_id`, test name, attempt number, inbox ID	Mixing old and new test attempts

Artifact-level idempotency is the most important layer for verification. Duplicate messages are annoying, but duplicate verification submissions can change application state, invalidate a code, or create false failures.
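The four layers can be kept apart by giving each its own key namespace. The sketch below is illustrative: `dedupeKeys` is a hypothetical helper, and the event and attempt field names are assumptions modeled on the examples in this article.

```javascript
// Builds one dedupe key per layer so a hit at one layer never masks another.
// ASSUMPTION: field names follow this article's examples, not a provider schema.
function dedupeKeys({ event, attempt, artifactHash }) {
  return {
    // Same webhook delivery seen twice.
    delivery: `delivery:${event.delivery_id}`,
    // Same email stored twice, scoped to its inbox.
    message: `message:${event.inbox_id}:${event.message_id}`,
    // Same OTP or link submitted twice within one attempt.
    artifact: `artifact:${attempt.attempt_id}:${artifactHash}`,
    // Old and new attempts of the same test.
    attempt: `attempt:${attempt.run_id}:${attempt.test_name}:${attempt.attempt_no}:${attempt.inbox_id}`,
  };
}
```

Two separate deliveries of the same email produce different `delivery` keys but the same `message` key, which is why deduping at the delivery layer alone is not enough.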

Provider-neutral pseudocode

The following example is intentionally provider-neutral. Adapt the function names to your test framework and confirm Mailhook-specific request and payload details in the llms.txt reference.

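// Provider-neutral sketch. inboxProvider, attempts, artifacts, app, waitForMessage,
// extractVerificationArtifact, normalizeArtifact, and sha256 are supplied by your harness.
async function verifyEmailInCi(input) {
  const attemptId = `${input.runId}:${input.testName}:${input.attemptNo}`;

  // Provision a dedicated inbox before anything can send mail.
  const inbox = await inboxProvider.createInbox({
    metadata: { runId: input.runId, attemptId },
    ttl: "15m"
  });

  // Capture the start time once so the stored attempt and the matcher agree.
  const startedAt = new Date().toISOString();

  await attempts.insert({
    attemptId,
    inboxId: inbox.id,
    email: inbox.email,
    status: "provisioned",
    startedAt
  });

  await app.startSignup({
    email: inbox.email,
    correlationToken: attemptId
  });

  await attempts.update(attemptId, { status: "triggered" });

  // Bounded wait: rejects if no matching message arrives before the deadline.
  const message = await waitForMessage({
    inboxId: inbox.id,
    deadlineMs: 90_000,
    matcher: (msg) =>
      msg.to.includes(inbox.email) &&
      msg.subject.includes("Verify") &&
      // Compare as instants; assumes receivedAt is an ISO 8601 timestamp.
      Date.parse(msg.receivedAt) >= Date.parse(startedAt)
  });

  const artifact = extractVerificationArtifact(message);
  const artifactKey = sha256(
    `${attemptId}:${artifact.type}:${normalizeArtifact(artifact.value)}`
  );

  // Claim the artifact before submitting it; only the first caller proceeds.
  const firstConsumer = await artifacts.markConsumedOnce({
    artifactKey,
    attemptId,
    messageId: message.id
  });

  if (!firstConsumer) {
    // A retry saw the same artifact: return the saved result, do not resubmit.
    return artifacts.getSavedResult(artifactKey);
  }

  const result = await app.submitVerification(artifact);

  await artifacts.saveResult(artifactKey, result);
  await attempts.update(attemptId, { status: "closed" });

  return result;
}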

The key pattern is markConsumedOnce. It should be backed by durable storage, not just process memory. CI processes restart, parallelize, and fail halfway through work.

Webhook handling for retry safety

Webhook consumers should verify first and process later. If your email provider signs webhook payloads, validate the signature against the raw request body before parsing JSON. Also enforce timestamp freshness and replay detection.

A safe webhook handler has a small synchronous surface:

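async function handleEmailWebhook(request) {
  // Read the raw bytes first; the signature covers the body exactly as sent.
  const rawBody = await request.rawBody();

  // Throws on a bad signature, so unverified payloads are never parsed.
  verifySignatureOrThrow({
    rawBody,
    headers: request.headers
  });

  const event = JSON.parse(rawBody.toString("utf8"));

  // Delivery-level dedupe: insertOnce returns false if this delivery was already recorded.
  const inserted = await deliveries.insertOnce({
    deliveryId: event.delivery_id,
    inboxId: event.inbox_id,
    receivedAt: event.received_at
  });

  if (!inserted) {
    // Duplicate delivery: acknowledge so the sender stops retrying, but do no work.
    return { status: 200 };
  }

  // Matching and extraction happen in a worker, not in the request path.
  await queue.enqueue({
    inboxId: event.inbox_id,
    deliveryId: event.delivery_id,
    messageId: event.message_id
  });

  return { status: 200 };
}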

The handler acknowledges fast, stores a dedupe record, and lets a worker perform matching and extraction. This prevents webhook retries from causing duplicate verification work.
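A signature check with timestamp freshness could be sketched as below. Everything about the scheme is an assumption made for illustration: the header names, the `timestamp.body` signing string, and hex-encoded HMAC-SHA256 are common conventions, not Mailhook's documented format, so confirm the real contract before relying on it. Unlike the two-argument call in the handler above, this version takes the shared `secret` explicitly.

```javascript
const crypto = require("node:crypto");

// ASSUMPTION: the provider sends "x-webhook-timestamp" (Unix seconds) and
// "x-webhook-signature" (hex HMAC-SHA256 over `${timestamp}.${rawBody}`).
function verifySignatureOrThrow({ rawBody, headers, secret, toleranceSec = 300, nowMs = Date.now() }) {
  const timestamp = headers["x-webhook-timestamp"];
  const signature = headers["x-webhook-signature"];
  if (!timestamp || !signature) throw new Error("Missing signature headers");

  // Freshness check: reject events far outside the allowed clock window.
  const ageSec = Math.abs(nowMs / 1000 - Number(timestamp));
  if (!Number.isFinite(ageSec) || ageSec > toleranceSec) {
    throw new Error("Stale or invalid timestamp");
  }

  const expected = crypto
    .createHmac("sha256", secret)
    .update(`${timestamp}.`)
    .update(rawBody)
    .digest();
  const provided = Buffer.from(signature, "hex");

  // timingSafeEqual requires equal lengths, so check that first.
  if (provided.length !== expected.length || !crypto.timingSafeEqual(provided, expected)) {
    throw new Error("Signature mismatch");
  }
}
```

The freshness window limits how long a captured request stays replayable. Replays inside the window are still caught by the delivery-level dedupe record.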

Mailhook supports signed payloads for security, real-time webhook notifications for low-latency delivery, and a polling API for fallback retrieval. That combination is what you want in CI: push when everything is healthy, pull when the environment is flaky.

Timeouts, resend budgets, and drain windows

A retry-safe flow still needs time budgets. The mistake is using one global sleep(30s) and hoping the email arrives. It is better to define per-stage deadlines and explicit resend rules.

Concern	Practical starting point	Why it helps
Inbox TTL	Long enough for the test plus cleanup	Prevents late messages from polluting future attempts
Wait deadline	Short enough to fail CI quickly, long enough for normal delivery	Avoids infinite waits and false early failures
Poll interval	Backoff with jitter	Reduces load and avoids synchronized CI polling
Resend budget	Usually zero or one resend per attempt	Prevents bot loops and duplicate code storms
Drain window	Brief period after timeout or close	Captures late arrivals for debugging without acting on them

For many teams, the best resend policy in CI is simple: do not resend inside the same attempt unless the product behavior specifically requires it. If you do resend, keep the same inbox and attempt ID, then select the latest valid artifact after the resend timestamp.
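Selecting the latest valid artifact after a resend can be sketched as a filter and sort over the attempt's messages. `selectLatestArtifact` is a hypothetical helper, `extract` is whatever extraction function your harness uses (assumed to return `null` when no artifact is found), and `receivedAt` is assumed to be an ISO 8601 timestamp.

```javascript
// Picks the newest message that arrived at or after the resend and yields an artifact.
// ASSUMPTION: extract(message) returns an artifact object or null.
function selectLatestArtifact(messages, { resendAt = null, extract }) {
  const cutoff = resendAt ? Date.parse(resendAt) : -Infinity;

  const candidates = messages
    // Ignore anything that predates the resend; those codes may be invalidated.
    .filter((m) => Date.parse(m.receivedAt) >= cutoff)
    .map((m) => ({ message: m, artifact: extract(m) }))
    .filter((c) => c.artifact !== null)
    // Newest first.
    .sort((a, b) => Date.parse(b.message.receivedAt) - Date.parse(a.message.receivedAt));

  return candidates[0] ?? null;
}
```

Returning `null` when nothing qualifies lets the caller keep waiting until the deadline instead of falling back to an older code.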

If a full test retry happens, create a new attempt and a new inbox. That keeps the old message stream isolated and makes failures easier to reason about.

CI observability: log identifiers, not secrets

When a verification step fails, engineers need enough information to debug without exposing OTPs, magic links, or user data in logs.

Useful CI artifacts include:

Attempt ID, run ID, test name, and retry number
Inbox ID and recipient address
Message IDs and delivery IDs observed
Matcher decisions, such as “subject matched” or “sender rejected”
Poll cursor or last poll timestamp
Redacted artifact metadata, such as artifact type and hash prefix

Avoid logging full links with tokens. For debugging, store normalized JSON as a protected CI artifact only if your team’s retention policy allows it. If an LLM agent participates in the flow, give it a minimized view that contains only the fields it needs to proceed.
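Redaction helpers for CI logs might look like the sketch below. One caveat motivates the design: a plain hash of a six-digit OTP can be reversed by trying all one million values, so this sketch uses a keyed hash instead. The `logKey` parameter is an assumed per-run secret, and `redactArtifact` and `redactUrl` are hypothetical names.

```javascript
const crypto = require("node:crypto");

// Returns loggable metadata about an artifact without the artifact itself.
// ASSUMPTION: logKey is a secret kept out of the logs (for example, one per CI run).
function redactArtifact(artifact, logKey) {
  const tag = crypto.createHmac("sha256", logKey).update(artifact.value).digest("hex");
  return {
    type: artifact.type,
    // Enough to tell "same artifact seen twice" from "different artifact".
    hashPrefix: tag.slice(0, 8),
    length: artifact.value.length,
  };
}

// Keeps host, path, and parameter names, but replaces every parameter value.
function redactUrl(raw) {
  const u = new URL(raw);
  for (const name of [...u.searchParams.keys()]) {
    u.searchParams.set(name, "REDACTED");
  }
  return u.toString();
}
```

With these in place, a log line can show that two retries saw the same artifact, or that a link pointed at the expected path, without exposing a usable code or token.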

Guardrails for LLM agents

LLM agents are good at orchestrating workflows, but they should not be asked to interpret raw email. A retry-safe agent flow should expose narrow tools with deterministic outputs.

A safe tool surface might include:

create_inbox, returning an email address and inbox ID
wait_for_verification_message, returning a matched message ID and metadata
extract_verification_artifact, returning only an OTP or validated URL
expire_inbox, closing the attempt after success or failure

The agent should not decide whether to trust an arbitrary sender, render HTML, follow unknown links, or repeatedly click “resend.” Those decisions belong in code. The model can orchestrate the happy path, while the harness enforces budgets, matchers, link validation, and idempotency.
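Two of these guardrails are easy to show in code: a minimized view of a matched message, and a resend tool whose budget is enforced by the harness rather than by the model. Both `minimizeForAgent` and `makeResendTool` are hypothetical names, and the input shapes are assumptions for illustration.

```javascript
// Exposes only the fields an agent needs; bodies, HTML, and headers are never included.
// ASSUMPTION: `match` has the shape { message: { id, receivedAt }, artifact: { type, value } }.
function minimizeForAgent(match) {
  return {
    messageId: match.message.id,
    receivedAt: match.message.receivedAt,
    artifactType: match.artifact.type,
    artifactValue: match.artifact.value,
  };
}

// Wraps a resend action in a hard budget that the agent cannot exceed.
function makeResendTool(resend, budget = 1) {
  let used = 0;
  return async function resendVerification() {
    if (used >= budget) {
      // Deterministic refusal the agent can report, instead of looping.
      return { ok: false, reason: "resend_budget_exhausted" };
    }
    used += 1;
    await resend();
    return { ok: true, remaining: budget - used };
  };
}
```

The agent sees a tool that either succeeds or returns a clear refusal, so a model that keeps asking for a resend cannot trigger a storm of duplicate codes.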

This pattern also matters outside pure test automation. When email verification is part of customer support, onboarding, or AI-assisted operations, reliability affects the customer experience directly. Teams that are connecting help desk, CRM, workflow automation, and AI systems may benefit from working with customer experience experts who understand the operational side of these flows.

How Mailhook fits into the CI verification harness

Mailhook provides the primitives needed to build this pattern without maintaining your own mailbox infrastructure:

Create disposable inboxes through an API for each CI attempt.
Receive inbound emails as structured JSON instead of scraping a human mailbox.
Use real-time webhooks for fast delivery and a polling API as a fallback.
Verify signed webhook payloads before processing inbound email events.
Start quickly with shared domains or configure custom domains when you need allowlisting and environment control.
Process batches when your CI suite runs many verification flows in parallel.

The recommended implementation path is straightforward. Put Mailhook behind a small EmailVerificationHarness in your test code. The harness should create the inbox, trigger your app, wait for JSON email, match and extract the artifact, consume it once, and close the attempt. Your tests should call the harness, not manipulate inboxes directly.

For exact API behavior, payload fields, and integration details, keep the Mailhook llms.txt file in your agent and developer references.

Frequently Asked Questions

Should CI reuse the same inbox when a test retries? No. Reusing an inbox makes it too easy to select stale messages from an earlier attempt. Create a new disposable inbox per attempt and store the inbox ID with the test metadata.

Do I need both webhooks and polling? For resilient CI, yes. Webhooks are faster and cheaper for normal operation, while polling is a safety net when the CI environment misses a callback or restarts a worker.

How do I prevent duplicate OTP submissions? Use artifact-level idempotency. Hash the normalized OTP or verification URL with the attempt ID, then store it with a unique constraint before submitting it to the application.

What if the app sends multiple verification emails for one attempt? Match messages within the attempt’s inbox and select the latest valid artifact according to your app’s rules. If resends are allowed, record the resend timestamp and ignore older artifacts.

Are shared domains safe for CI verification? Shared domains are useful for fast setup and ephemeral tests. Use a custom domain or subdomain when you need allowlisting, stricter governance, or environment separation.

Can LLM agents run email verification safely? Yes, if the agent is given narrow tools and minimized data. The harness should handle webhook verification, matching, URL validation, retry budgets, and idempotency outside the model.

Build CI email verification that can be retried safely

Retry-safe email verification is not about adding longer sleeps. It is about isolating each attempt, waiting with explicit deadlines, matching narrowly, deduping at the right layers, and consuming OTPs or magic links exactly once.

Mailhook gives developers and agent builders programmable temp inboxes, JSON email output, webhooks, polling, signed payloads, and domain options for these flows. If your CI suite still depends on a shared mailbox or brittle HTML scraping, replace it with an inbox-per-attempt harness and make retries boring.

Start by creating a disposable inbox for the next verification test, wire the received JSON into your harness, and use the Mailhook integration reference to align your implementation with the current API contract.