Email Verification: A Deterministic Workflow for CI and Agents

Email verification sounds simple until you try to automate it: trigger an email, wait, click a link or submit an OTP, then assert the user is verified.

In CI and agent-driven flows, that “wait and click” step is where determinism dies. Emails arrive late, arrive twice, arrive out of order, or get matched to the wrong run. The fix is not longer sleeps; it is a deterministic workflow with explicit contracts.

This post lays out a practical, provider-agnostic workflow you can implement in CI and in LLM toolchains, with concrete semantics for isolation, waiting, correlation, idempotency, and cleanup.

If you’re implementing this with Mailhook, keep the canonical integration contract handy: mailhook.co/llms.txt.

What “email verification” really means in automation

In product terms, email verification proves a user can receive messages at a given address. In automation terms, it is a distributed workflow across at least four systems:

Your app (creates token, sends email, verifies token)
Your email delivery layer (ESP or SMTP relay)
The recipient inbox (where the message actually lands)
Your automation runtime (CI runner or agent) that must observe, extract, and act

The workflow becomes flaky when your automation runtime lacks a deterministic handle for the inbox and a deterministic rule for when to stop waiting.

A key mental model shift is to treat verification emails as an event stream that must be consumed safely and idempotently, not as a UI artifact to be scraped.

The five invariants of a deterministic verification workflow

A reliable email verification harness usually enforces five invariants. These hold whether you are running Playwright in CI or an agent that provisions inboxes as tools.

Invariant	What you enforce	What it prevents in CI and agents
Isolation	One inbox per attempt (or at least per run)	Cross-test collisions, parallel suite interference
Deterministic waiting	Webhook-first, polling fallback, deadline-based	“Sleep 10s” flakiness, indefinite hangs
Strong correlation	Narrow matchers tied to run_id/attempt_id	Wrong email selected, stale email reuse
Idempotent consumption	Dedup at delivery/message/artifact layers	Double-clicking links, double-submitting OTPs
Minimal extraction	Extract only OTP or verification URL (not full HTML)	Template drift, prompt injection, unsafe link handling

Notice what is not on the list: “parse HTML perfectly.” If your workflow depends on scraping rendered HTML, you have already lost determinism.

Reference architecture: inbox-per-attempt + artifact extraction

A deterministic workflow can be expressed as a small state machine:

Provision an isolated inbox and record its inbox_id plus the recipient email address.
Trigger the verification email (sign-up, change email, magic-link sign-in, whatever you are testing).
Wait for the message with explicit deadlines and matchers.
Extract one artifact (OTP or verification URL) and treat it like a one-time capability.
Redeem the artifact once, with idempotency on your side.
Clean up the inbox (expire it, close it, or let its TTL elapse).
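The six steps above can be modeled as an explicit state machine, so that a step taken out of order or repeated fails loudly instead of silently proceeding. This is a minimal sketch: the state and event names (StepState, StepEvent, advance) are illustrative assumptions, not part of any provider's API.

```typescript
// Hypothetical state and event names, chosen to mirror the six steps.
type StepState =
  | "idle"
  | "provisioned"
  | "triggered"
  | "received"
  | "extracted"
  | "redeemed"
  | "cleaned_up";

type StepEvent = "provision" | "trigger" | "receive" | "extract" | "redeem" | "cleanup";

// Each event is legal from exactly one state, except cleanup, which is
// allowed from any state after provisioning so that a failed attempt can
// still release its inbox.
const transitions: Record<StepEvent, { from: StepState[]; to: StepState }> = {
  provision: { from: ["idle"], to: "provisioned" },
  trigger: { from: ["provisioned"], to: "triggered" },
  receive: { from: ["triggered"], to: "received" },
  extract: { from: ["received"], to: "extracted" },
  redeem: { from: ["extracted"], to: "redeemed" },
  cleanup: {
    from: ["provisioned", "triggered", "received", "extracted", "redeemed"],
    to: "cleaned_up",
  },
};

function advance(state: StepState, event: StepEvent): StepState {
  const rule = transitions[event];
  if (rule.from.indexOf(state) === -1) {
    throw new Error(`Illegal transition: ${event} from ${state}`);
  }
  return rule.to;
}
```

Because redeem is only legal from the extracted state, a second redemption attempt throws rather than clicking the link twice.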

A simple workflow diagram showing six boxes connected left to right: Create Inbox (inbox_id, email) -> Trigger Verification Email -> Receive (Webhook first, Polling fallback) -> Select Matching Message (correlation) -> Extract Artifact (OTP or URL) -> Redeem + Cleanup.

Why inbox-per-attempt matters more than it sounds

Most “email verification” failures in CI are selection failures, not delivery failures. A shared mailbox accumulates old verification emails, retries, and duplicates. When tests run in parallel, selecting “the latest email” becomes a race.

Inbox-per-attempt makes selection trivial: the correct message is the one that arrives in the isolated inbox created for that attempt.

This pattern also makes cleanup and retention policies realistic, because you can safely expire the inbox after the attempt completes.

Deterministic waiting: deadlines beat sleeps

Waiting is deterministic when:

You have an explicit overall deadline (for example, 60 seconds)
You handle at-least-once delivery semantics (duplicates are expected)
Your wait returns a machine-readable message payload, not a UI

A pragmatic approach is webhook-first for low latency and cost efficiency, with polling fallback for resilience when webhooks are temporarily unavailable.

Recommended waiting semantics

Define your wait as a function with a contract:

Input: inbox_id, matchers (sender/subject/correlation token), overall deadline
Output: a single selected message record (or a timeout)
Side effects: none (selection is not consumption)

Here is a provider-agnostic sketch:

type WaitParams = {
  inboxId: string;
  deadlineMs: number;
  match: {
    fromContains?: string;
    subjectContains?: string;
    correlationToken?: string;
  };
};

async function waitForVerificationEmail(p: WaitParams): Promise<EmailMessage> {
  const started = Date.now();
  const seen = new Set<string>(); // message_id or delivery_id

  // Webhook-first: your webhook handler stores messages and notifies your runner.
  // Fallback: poll storage or provider API until deadline.
  while (Date.now() - started < p.deadlineMs) {
    const batch = await listMessages(p.inboxId); // cursor-based in real code

    for (const msg of batch) {
      if (seen.has(msg.message_id)) continue;
      seen.add(msg.message_id);

      if (!matches(msg, p.match)) continue;
      return msg;
    }

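    // Clamp the sleep so the loop never overshoots the overall deadline.
    const elapsed = Date.now() - started;
    await sleep(Math.min(backoffMs(elapsed), Math.max(0, p.deadlineMs - elapsed)));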
  }

  throw new Error("Timed out waiting for verification email");
}

Two details are doing most of the reliability work:

Deadline-based loop: you stop deterministically.
Seen set / dedupe: you expect duplicates and handle them calmly.

If you implement webhooks, keep handlers fast: acknowledge quickly and process asynchronously. If you implement polling, avoid hammering the API: use exponential backoff and prefer cursor-based pagination.
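As an illustration of the polling advice, here is one possible shape for an exponential backoff helper that never sleeps past the deadline. The name nextPollDelayMs and the base and cap values are assumptions made for this sketch; tune them to your provider's rate limits.

```typescript
// Illustrative defaults, not provider recommendations.
const BASE_DELAY_MS = 250;
const MAX_DELAY_MS = 5000;

// Returns how long to sleep before poll number `attempt` (0-based).
// The delay doubles each attempt, is capped at MAX_DELAY_MS, and is
// clamped to the time remaining before the overall deadline.
function nextPollDelayMs(attempt: number, remainingMs: number): number {
  const exponential = BASE_DELAY_MS * Math.pow(2, attempt);
  const capped = Math.min(exponential, MAX_DELAY_MS);
  return Math.max(0, Math.min(capped, remainingMs));
}
```

Random jitter is a common refinement when many runners poll at once. It is left out here so the delays stay predictable.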

Correlation: choose matchers you can defend

Correlation is how you prove “this email belongs to this run.” In CI and agent workflows, correlation should be narrow and self-generated.

Good correlation strategies include:

A run-scoped token injected into the workflow (for example, appended to a username) and later expected in the email content
A custom header you control (when you control the sending service)
An inbox handle that is unique per attempt (best option)

Avoid correlation strategies that are ambiguous:

Subject-only matching (subjects are reused)
“Newest email wins” in a shared inbox
Matching on timestamps alone
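To make the difference concrete, here is a sketch of a matcher that requires a run-scoped token rather than trusting the subject alone. The token format, the MessageSummary shape, and the helper names are assumptions for illustration. Note the boundary check: without it, the token for attempt 1 would also match an email meant for attempt 10.

```typescript
// Hypothetical minimal message shape for matching.
type MessageSummary = { from: string; subject: string; text: string };

type Matcher = {
  fromContains?: string;
  subjectContains?: string;
  correlationToken?: string;
};

// Hypothetical token format: any value unique to the attempt works.
function makeCorrelationToken(runId: string, attemptId: string): string {
  return `run-${runId}-attempt-${attemptId}`;
}

// True if `token` occurs in `text` and is not immediately followed by a
// letter or digit, so "attempt-1" does not match inside "attempt-10".
function containsToken(text: string, token: string): boolean {
  let at = text.indexOf(token);
  while (at !== -1) {
    const next = text.charAt(at + token.length);
    if (!/[A-Za-z0-9]/.test(next)) return true;
    at = text.indexOf(token, at + 1);
  }
  return false;
}

function matches(msg: MessageSummary, m: Matcher): boolean {
  // An empty matcher would accept any message, so it matches nothing.
  if (!m.fromContains && !m.subjectContains && !m.correlationToken) return false;

  const includesIgnoreCase = (haystack: string, needle: string) =>
    haystack.toLowerCase().indexOf(needle.toLowerCase()) !== -1;

  if (m.fromContains && !includesIgnoreCase(msg.from, m.fromContains)) return false;
  if (m.subjectContains && !includesIgnoreCase(msg.subject, m.subjectContains)) return false;
  // The token is compared case-sensitively against the plain-text body.
  if (m.correlationToken && !containsToken(msg.text, m.correlationToken)) return false;
  return true;
}
```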

Also keep in mind that email routing relies on SMTP envelope data, not only on what the To: header says. If you build your own routing logic, the distinction between envelope recipient and header recipient matters (see RFC 5321 for the SMTP envelope and RFC 5322 for the message format).

Minimal extraction: treat email content as untrusted input

Once you have the correct message, extract only what you need:

OTP: a short code
Verification URL: a single URL

Everything else is noise for automation, and for agents it can be dangerous.

Why this matters for LLM agents

Inbound email can contain:

Prompt injection instructions
Unexpected links (open redirects, tracking links)
HTML payloads that should never be rendered

So the extraction layer should produce an agent-safe view, such as:

artifact_type: otp or url
artifact_value: the OTP digits or a validated URL
source_message_id: stable ID for traceability
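A minimal extraction layer producing that view might look like the sketch below. It assumes a 6-digit numeric OTP and a plain-text body, and the function names and the expectedPrefix parameter are illustrative. When a message contains no candidate or more than one distinct candidate, it returns null instead of guessing. The prefix is only a selection filter; a URL returned here still needs validation before any network action.

```typescript
type InboundMessage = { message_id: string; text: string };

type Artifact = {
  artifact_type: "otp" | "url";
  artifact_value: string;
  source_message_id: string;
};

function distinct(values: string[]): string[] {
  return values.filter((value, index) => values.indexOf(value) === index);
}

// Assumes a 6-digit numeric OTP; adjust the pattern to your template.
function extractOtp(msg: InboundMessage): Artifact | null {
  const codes = distinct(msg.text.match(/\b\d{6}\b/g) || []);
  if (codes.length !== 1) return null; // none or ambiguous: refuse to guess
  return {
    artifact_type: "otp",
    artifact_value: codes[0],
    source_message_id: msg.message_id,
  };
}

// Only https URLs starting with `expectedPrefix` are considered, which
// ignores unsubscribe and tracking links elsewhere in the message.
function extractVerificationUrl(msg: InboundMessage, expectedPrefix: string): Artifact | null {
  const candidates = (msg.text.match(/https:\/\/[^\s<>"')]+/g) || [])
    .map((u) => u.replace(/[.,;]+$/, "")) // drop trailing sentence punctuation
    .filter((u) => u.indexOf(expectedPrefix) === 0);
  const urls = distinct(candidates);
  if (urls.length !== 1) return null;
  return {
    artifact_type: "url",
    artifact_value: urls[0],
    source_message_id: msg.message_id,
  };
}
```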

If you do extract a URL, validate it before any network action:

Enforce allowed hostnames (your app domain)
Reject non-HTTPS
Reject IP-literals and internal hostnames to reduce SSRF risk (see the OWASP SSRF Prevention Cheat Sheet)
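These checks can be sketched with the standard URL parser. The function name, the result shape, and the example hostnames used with it are assumptions for illustration. The allowlist already excludes unknown hosts, so the explicit IP-literal and credentials checks are defense in depth.

```typescript
type UrlCheck = { ok: true; url: string } | { ok: false; reason: string };

function validateVerificationUrl(raw: string, allowedHosts: string[]): UrlCheck {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch (err) {
    return { ok: false, reason: "not a parseable absolute URL" };
  }

  if (parsed.protocol !== "https:") {
    return { ok: false, reason: "non-HTTPS scheme" };
  }
  // "https://app.example.com@evil.test/" parses with evil.test as the host.
  if (parsed.username !== "" || parsed.password !== "") {
    return { ok: false, reason: "embedded credentials" };
  }

  const host = parsed.hostname; // already lowercased by the parser
  const isIpLiteral = /^\d{1,3}(\.\d{1,3}){3}$/.test(host) || host.charAt(0) === "[";
  if (isIpLiteral) {
    return { ok: false, reason: "IP literal host" };
  }

  const allowed = allowedHosts.map((h) => h.toLowerCase());
  if (allowed.indexOf(host) === -1) {
    return { ok: false, reason: "host not in allowlist" };
  }
  return { ok: true, url: parsed.toString() };
}
```

Matching the hostname exactly, rather than with a prefix or suffix test, is what rejects look-alike hosts such as app.example.com.evil.test.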

Idempotency and dedupe: assume at-least-once delivery

A deterministic workflow assumes duplicates can happen at multiple layers:

SMTP retries
ESP retries
Webhook retries
Polling that re-reads the same message

Design dedupe keys explicitly:

Layer	What you dedupe on	Why
Delivery	`delivery_id` (or equivalent)	Stops webhook retry loops from double-processing
Message	`message_id`	Prevents processing the same email twice
Artifact	hash of OTP or URL token	Prevents redeeming the same capability twice
Attempt	`attempt_id`	Prevents resends from being treated as new work
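One way to make those keys explicit is a small guard that records each (layer, key) pair the first time it is seen. This sketch keeps state in memory, and the class and function names are assumptions. A harness that spans processes would need a shared store, and the artifact key would normally be a hash of the OTP or URL token, as the table suggests, rather than the raw value.

```typescript
type DedupeLayer = "delivery" | "message" | "artifact" | "attempt";

class DedupeGuard {
  private seen = new Set<string>();

  // Returns true the first time a (layer, key) pair is seen, false afterwards.
  firstTime(layer: DedupeLayer, key: string): boolean {
    const composite = `${layer}:${key}`;
    if (this.seen.has(composite)) return false;
    this.seen.add(composite);
    return true;
  }
}

// Hypothetical delivery shape: the IDs come from your provider, and
// artifact_key is whatever stable key you derive from the artifact.
type Delivery = { delivery_id: string; message_id: string; artifact_key: string };

// Redeem only if this delivery, this message, and this artifact are all new.
function shouldRedeem(guard: DedupeGuard, d: Delivery): boolean {
  return (
    guard.firstTime("delivery", d.delivery_id) &&
    guard.firstTime("message", d.message_id) &&
    guard.firstTime("artifact", d.artifact_key)
  );
}
```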

In other words, idempotency is not one thing; it is a set of constraints that let you retry safely.

CI-specific guidance: make failures debuggable, not mysterious

A deterministic harness should leave behind enough breadcrumbs to debug failures without leaking sensitive content.

Log and store:

run_id, attempt_id
inbox_id
message_id and received timestamp
Extracted artifact type (but not necessarily the artifact value)
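A log record along those lines could be built as in the sketch below. The field names follow the list above; the received_at name and the builder function are assumptions. The fields are copied one by one so that the artifact value cannot end up in the output by accident.

```typescript
type AttemptLog = {
  run_id: string;
  attempt_id: string;
  inbox_id: string;
  message_id: string;
  received_at: string; // ISO 8601 timestamp
  artifact_type: "otp" | "url";
};

// Accepts the full attempt context, including the secret artifact value,
// and returns a JSON line containing only the fields that are safe to log.
function buildAttemptLog(input: AttemptLog & { artifact_value: string }): string {
  const record: AttemptLog = {
    run_id: input.run_id,
    attempt_id: input.attempt_id,
    inbox_id: input.inbox_id,
    message_id: input.message_id,
    received_at: input.received_at,
    artifact_type: input.artifact_type,
  };
  return JSON.stringify(record);
}
```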

If a run fails, attach the structured message JSON as a CI artifact with appropriate redaction. This is one of the main benefits of receiving emails as structured data instead of raw mailbox access.

A useful operational rule is to allocate a time budget per stage:

Stage	Typical budget	Notes
Inbox provisioning	1 to 2 seconds	Should be immediate, otherwise fail fast
Email arrival	30 to 90 seconds	Depends on ESP and environment
Artifact extraction	< 1 second	Pure CPU, should not be slow
Artifact redemption	10 to 30 seconds	Your app and DB latency
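The budgets can live in code as plain data, so every stage derives its deadline from the same table. The sketch below uses the upper bounds from the table above; the stage keys and helper names are assumptions.

```typescript
type Stage = "provision" | "arrival" | "extract" | "redeem";

// Upper bounds from the budget table, in milliseconds.
const STAGE_BUDGET_MS: Record<Stage, number> = {
  provision: 2000,
  arrival: 90000,
  extract: 1000,
  redeem: 30000,
};

// Absolute deadline for a stage, given when that stage started.
function stageDeadline(stage: Stage, stageStartedMs: number): number {
  return stageStartedMs + STAGE_BUDGET_MS[stage];
}

function isOverBudget(stage: Stage, stageStartedMs: number, nowMs: number): boolean {
  return nowMs > stageDeadline(stage, stageStartedMs);
}

// Worst-case duration of one attempt, useful for sizing the CI job timeout.
function totalBudgetMs(): number {
  const stages: Stage[] = ["provision", "arrival", "extract", "redeem"];
  return stages.reduce((sum, stage) => sum + STAGE_BUDGET_MS[stage], 0);
}
```

With these numbers, a single attempt can take at most 123 seconds, which gives a concrete lower bound for the surrounding job timeout.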

Budgets make your workflow deterministic even when dependencies are slow.

Agent-friendly tooling: expose 3 to 4 small tools, not “read my email”

If you are building LLM agents that must complete sign-up verification, the safest pattern is to expose a few narrow tools:

create_inbox() returns { inbox_id, email, expires_at }
wait_for_message(inbox_id, matcher, deadline) returns a minimized message record
extract_verification_artifact(message) returns { type, value } after validation
expire_inbox(inbox_id) (or allow TTL to expire)
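Written as a TypeScript interface, the four tools might look like the sketch below. The tool names and return fields come from the list above. The matcher fields, the MinimizedMessage and RawMessage shapes, and the text cap are assumptions about what a minimized record could contain.

```typescript
type MinimizedMessage = {
  message_id: string;
  from: string;
  subject: string;
  received_at: string;
  text: string; // plain text only, truncated
};

type VerificationArtifact = { type: "otp" | "url"; value: string };

interface VerificationTools {
  create_inbox(): Promise<{ inbox_id: string; email: string; expires_at: string }>;
  wait_for_message(
    inbox_id: string,
    matcher: { subjectContains?: string; correlationToken?: string },
    deadlineMs: number
  ): Promise<MinimizedMessage>;
  extract_verification_artifact(message: MinimizedMessage): Promise<VerificationArtifact>;
  expire_inbox(inbox_id: string): Promise<void>;
}

// Hypothetical raw shape; real providers return more fields.
type RawMessage = MinimizedMessage & {
  html?: string;
  headers?: Record<string, string>;
};

const MAX_TEXT_CHARS = 2000; // illustrative cap on what the model may see

// Copies only the allowed fields, so HTML and headers never reach the agent.
function minimize(raw: RawMessage): MinimizedMessage {
  return {
    message_id: raw.message_id,
    from: raw.from,
    subject: raw.subject,
    received_at: raw.received_at,
    text: raw.text.slice(0, MAX_TEXT_CHARS),
  };
}
```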

This tool decomposition limits model exposure and makes runs replayable.

It also prevents the common anti-pattern where an agent is given raw HTML and asked to “figure it out,” which increases both flakiness and security risk.

Where Mailhook fits (without changing the workflow)

Mailhook implements the primitives that this deterministic workflow needs:

Create disposable inboxes via API
Receive emails as structured JSON
Use real-time webhook notifications, with polling available as a fallback
Verify webhook authenticity with signed payloads
Support shared domains for fast starts and custom domains when you need allowlisting and tighter control
Handle batch email processing for higher-throughput pipelines

The point is not that your system must look like Mailhook internally. The point is that your verification harness should be written against these primitives so it remains deterministic under retries and parallelism.

For the exact endpoints, payload fields, and recommended semantics, use the canonical reference: Mailhook llms.txt.

A CI runner console view showing a verification test with logged fields: run_id, attempt_id, inbox_id, message_id, and a final status line “verified=true”. No email body is displayed.

A compact checklist for code review

When reviewing an email verification implementation for CI or agents, you should be able to answer “yes” to these questions:

Does every attempt get an isolated inbox (or an equivalent isolation boundary)?
Is waiting deadline-based (no unbounded waits, no fixed sleeps as the primary mechanism)?
Are matchers narrow and defensible (not “latest email in shared inbox”)?
Is processing idempotent across delivery, message, artifact, and attempt layers?
Do we extract only an OTP or a validated verification URL?
Do we treat inbound email as untrusted input (especially for agents)?
Do we have explicit cleanup and retention rules?

If any answer is “no,” you can usually predict the future flake you will see in CI.

Getting to deterministic email verification quickly

If you already have a CI suite or agent workflow that “sometimes” verifies email, do not rewrite everything at once. The fastest path is usually:

Replace shared mailbox access with inbox-per-attempt.
Switch from sleeps to deadline-based waits.
Stop parsing HTML and extract minimal artifacts instead.
Add idempotency and dedupe keys.

Once those are in place, email verification becomes a deterministic step you can trust and rerun.

To implement the inbox and message primitives with Mailhook, start from the canonical contract: mailhook.co/llms.txt, then explore the platform at Mailhook.