Receive Email Test Flows That Don’t Flake in CI

Email receipt is one of the most common sources of flaky end-to-end tests, because it is the point where your deterministic test runner meets the messy reality of queues, retries, greylisting, template changes, and asynchronous delivery.

A “receive email test” that passes locally and fails in CI usually does not have broken logic; it has a broken contract. The fix is to turn “wait for an email” from a best-effort sleep into a small, explicit, retry-safe harness with isolation, correlation, and observability.

This guide shows how to build receive email test flows that do not flake in CI, with patterns that also work well for LLM agents.

For Mailhook’s canonical integration contract and up-to-date API semantics, start with llms.txt.

Why email receipt flakes specifically in CI

CI makes email tests harder than local runs because it amplifies every race condition:

Parallelism: multiple jobs trigger the same email template at the same time.
Retries: CI reruns failed tests, your app resends messages, and providers retry delivery.
Cold starts and variable latency: a queue consumer that normally processes in 500 ms might take 8 seconds when a container spins up.
Non-hermetic infrastructure: external mail delivery is not part of your transaction.
Debuggability gap: most teams cannot “open the mailbox” in CI, so failures become guesswork.

The goal is not “make email fast.” The goal is to make email receipt deterministic enough that your test runner fails for real reasons, not timing.

The CI safe contract for a receive email test

A robust receive email test is built on a few invariants. If you bake these into a shared helper (fixture, library, agent tool), email stops being special.

Invariant 1: Isolation, one inbox per attempt

Do not reuse a mailbox across tests, jobs, or retries.

Your harness should provision an inbox that is:

unique for the attempt
short lived
readable via API

This prevents the classic failure where a test accidentally matches an email that belongs to a different run.

Invariant 2: Deterministic waiting, deadline based, not sleep based

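Fixed sleeps are one of the most common causes of flakes: a sleep that is too short fails intermittently, and one that is too long slows every run without guaranteeing arrival.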

Instead of sleep(10):

set an overall deadline (for example, 60 seconds)
use webhooks when available for immediate arrival
use polling as a deterministic fallback when webhooks are not reachable from CI

Invariant 3: Strong correlation matchers

“Inbox” is your first correlation boundary. It is not always enough.

Add one more correlation signal you control, for example:

a per attempt token stored in your app DB and reflected in the email (subject, headers, or body)
a run id that your test harness generates and passes into the flow

Then match narrowly and explicitly.

Invariant 4: Idempotent consumption and dedupe

Treat inbound delivery as at least once.

Your harness should behave correctly if:

the same message is delivered twice
your webhook endpoint receives the same payload twice
a polling loop sees the same message again

Invariant 5: Observability as a first class output

When the test fails, you should have a structured artifact you can inspect.

At minimum, log and attach:

inbox identifier
message identifiers (message id, delivery id if your provider has one)
received timestamp
the extracted artifact (OTP or verification URL), if safe

A good receive email test produces evidence, not vibes.
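As a minimal sketch of such an evidence record, the fields listed above can be collected into one structured log line. The class name `ReceiptEvidence` and its field names are illustrative assumptions, not a provider schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ReceiptEvidence:
    """Hypothetical evidence record attached to every email wait."""
    inbox_id: str
    message_id: Optional[str]
    delivery_id: Optional[str]       # only if your provider exposes one
    received_at: Optional[str]       # timestamp string as reported by the provider
    artifact_type: Optional[str]     # "otp" or "url"
    artifact_value: Optional[str] = None  # leave as None when logging it is not safe


def evidence_line(evidence: ReceiptEvidence) -> str:
    """Serialize the evidence as a single JSON line suitable for CI logs."""
    return json.dumps(asdict(evidence), sort_keys=True)
```

Emitting this line on both success and failure means a failed run still shows which inbox was used and whether any message arrived at all.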

A reference architecture: the “Email Receipt Harness”

Treat inbound email as an event stream and build a small harness with three responsibilities:

Provision an isolated inbox
Wait for the right message (webhook first, polling fallback)
Extract the minimal artifact you need to complete the flow

Figure: Email Receipt Harness flow. Create inbox -> Trigger app email -> Wait for message (webhook first, polling fallback) -> Extract minimal artifact (OTP or verification link).

Minimal interfaces (provider agnostic)

Even if you use a hosted provider, you want your test code to depend on stable interfaces:

createInbox() returns a descriptor { inbox_id, email_address, expires_at }
waitForMessage(inbox_id, matcher, deadline) returns a message JSON payload
extractArtifact(message) returns { type: "otp" | "url", value }

This is also an excellent shape for an LLM tool, because it is narrow and deterministic.
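One way to pin these interfaces down in test code is with a typing `Protocol`. The sketch below mirrors the three functions above using Python naming; the class names and the type of `expires_at` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Literal, Mapping, Protocol

Message = Mapping[str, Any]            # parsed message JSON; fields depend on the provider
Matcher = Callable[[Message], bool]


@dataclass(frozen=True)
class InboxDescriptor:
    inbox_id: str
    email_address: str
    expires_at: str                    # assumed to be a timestamp string


@dataclass(frozen=True)
class Artifact:
    type: Literal["otp", "url"]
    value: str


class EmailReceiptHarness(Protocol):
    """Provider-agnostic surface that tests (or an LLM tool) depend on."""

    def create_inbox(self) -> InboxDescriptor: ...

    def wait_for_message(
        self, inbox_id: str, matcher: Matcher, deadline: float
    ) -> Message: ...

    def extract_artifact(self, message: Message) -> Artifact: ...
```

Because tests depend only on the protocol, a hosted provider can be swapped for an in-memory fake in unit tests without touching the test bodies.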

Common failure modes and the deterministic fixes

The table below is a practical cheat sheet for debugging flaky “receive email test” failures.

Failure mode	What you see in CI	Deterministic fix
Mailbox collision	Test reads a message from another run	One inbox per attempt, do not reuse addresses
Sleep too short	“No email received” intermittently	Deadline based wait with polling or webhook
Over broad matcher	Wrong email matched (welcome vs verification)	Narrow matchers, correlate by attempt token
Duplicate deliveries	Verification step runs twice	Dedupe by stable ids, consume once semantics
HTML scraping drift	Regex fails after template change	Prefer structured JSON fields and `text/plain`
Webhook spoofing or replay	Unexpected message triggers automation	Verify signed payloads, timestamp tolerance, replay detection

Designing the wait: webhook first, polling fallback

Your wait primitive should be designed for the environment it runs in.

When webhooks work great

Webhooks are ideal when your CI environment can expose an endpoint (or you have a relay) because they offer:

low latency
fewer API calls than polling
clearer “event arrived” semantics

If you use webhooks, keep the handler boring:

verify authenticity
enqueue the payload
acknowledge fast
dedupe before processing
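The handler steps above can be sketched as two small functions. This assumes a hex-encoded HMAC-SHA256 signature over the raw request body and a `message_id` field in the payload; both are illustrative assumptions, so check your provider's documentation for the real signing scheme. Timestamp tolerance and replay windows are left out for brevity.

```python
import hashlib
import hmac
import json
import queue
from typing import List, Set


def handle_webhook(raw_body: bytes, signature_hex: str, secret: bytes,
                   work_queue: "queue.Queue") -> int:
    """Verify, enqueue, and acknowledge. Returns an HTTP status code."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison of the expected and received signatures.
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return 401
    try:
        payload = json.loads(raw_body)
    except ValueError:
        return 400
    work_queue.put(payload)   # no processing here, so the acknowledgement stays fast
    return 200


def drain(work_queue: "queue.Queue", seen_ids: Set[str]) -> List[dict]:
    """Worker side: dedupe by message id before any processing happens."""
    fresh = []
    while True:
        try:
            payload = work_queue.get_nowait()
        except queue.Empty:
            return fresh
        message_id = payload["message_id"]   # assumed field name
        if message_id in seen_ids:
            continue
        seen_ids.add(message_id)
        fresh.append(payload)
```

Keeping verification in the handler and dedupe in the worker means a retried webhook costs one extra queue entry and nothing more.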

When polling is the right default for CI

Many CI jobs cannot accept inbound traffic safely. In that case, polling is simpler and more predictable.

A CI friendly polling loop looks like this:

a fixed overall deadline
short per request timeouts
exponential backoff with a cap
a “seen set” of message ids to avoid reprocessing

If your provider supports batch retrieval, you can poll less frequently and fetch in batches, but keep your consumption idempotent.
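A polling loop with those four properties might look like the sketch below. The `fetch` callable is a hypothetical stand-in for whatever "list messages in this inbox" call your provider offers, and `message_id` is an assumed field name.

```python
import time
from typing import Any, Callable, Iterable, Mapping, Set

Message = Mapping[str, Any]


def poll_for_message(
    fetch: Callable[[float], Iterable[Message]],  # hypothetical: fetch(request_timeout)
    matcher: Callable[[Message], bool],
    overall_timeout: float = 60.0,
    request_timeout: float = 5.0,
    initial_delay: float = 0.5,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Message:
    deadline = time.monotonic() + overall_timeout  # one fixed overall deadline
    seen_ids: Set[str] = set()                     # "seen set" to skip repeats
    delay = initial_delay
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                f"no matching message within {overall_timeout}s; "
                f"seen ids: {sorted(seen_ids)}"
            )
        # Never let a single request outlive the overall deadline.
        for message in fetch(min(request_timeout, remaining)):
            message_id = message["message_id"]     # assumed field name
            if message_id in seen_ids:
                continue
            seen_ids.add(message_id)
            if matcher(message):
                return message
        # Exponential backoff with a cap, clipped to the time that is left.
        sleep(max(0.0, min(delay, deadline - time.monotonic())))
        delay = min(delay * 2, max_delay)
```

Injecting `sleep` keeps the loop unit-testable without real waiting, and the timeout error carries the ids that were seen so a failure is not just "no email received".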

For Mailhook specifically, you can combine webhooks and polling. Mailhook supports real-time webhook notifications and a polling API, and delivers emails as structured JSON; see llms.txt for the canonical details.

Matching the correct email without brittle rules

A deterministic matcher is layered. Start strict, relax only if necessary.

Recommended matcher layers

Inbox boundary: only messages in the attempt’s inbox.
Intent discriminator: subject prefix, template marker, or a sender identity you expect.
Correlation token: a per attempt token you generated.
Freshness: message received after the attempt started.

Avoid matchers that rely on:

“latest message in the mailbox” when the mailbox can contain multiple attempts
HTML layout and CSS
unstable header ordering (header order and formatting can vary between senders and relays)
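The recommended layers can be composed into one predicate. In this sketch the message field names (`inbox_id`, `subject`, `text`, `received_at`) are assumptions about the JSON shape, and `received_at` is assumed to be an ISO 8601 string with an explicit UTC offset.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Mapping

Message = Mapping[str, Any]


def build_matcher(inbox_id: str, subject_prefix: str, token: str,
                  started_at: datetime) -> Callable[[Message], bool]:
    """Build a strict, layered matcher. started_at must be timezone-aware."""

    def matches(message: Message) -> bool:
        # Layer 1: inbox boundary.
        if message.get("inbox_id") != inbox_id:
            return False
        # Layer 2: intent discriminator (here, a subject prefix).
        subject = message.get("subject") or ""
        if not subject.startswith(subject_prefix):
            return False
        # Layer 3: per-attempt correlation token in subject or plain-text body.
        if token not in subject and token not in (message.get("text") or ""):
            return False
        # Layer 4: freshness, the message must not predate the attempt.
        received_at = datetime.fromisoformat(message["received_at"])
        return received_at >= started_at

    return matches
```

A matcher built this way can be passed straight to a wait helper, and each rejected layer is a natural place to log why a candidate message was skipped.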

If you need background on how email structure works, the RFCs are the canonical reference: see RFC 5322 (message format) and RFC 5321 (SMTP).

Extract only the minimal artifact

Most email tests do not need the full email. They need one thing:

an OTP
a verification link
a password reset link

Make your harness return that artifact, not the whole message.
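A minimal extraction sketch, assuming the message JSON exposes the plain-text body under a `text` field, that OTPs are six digits, and that links are `https` only. All three are assumptions to adapt to your own templates.

```python
import re
from typing import Any, Dict, Mapping

OTP_PATTERN = re.compile(r"\b\d{6}\b")             # assumes 6-digit numeric codes
URL_PATTERN = re.compile(r"https://[^\s<>\"']+")   # https links only


def extract_artifact(message: Mapping[str, Any], kind: str) -> Dict[str, str]:
    """Return {"type": kind, "value": ...} or raise if the result is ambiguous."""
    if kind not in ("otp", "url"):
        raise ValueError(f"unsupported artifact kind: {kind!r}")
    text = message.get("text") or ""   # assumed field holding the text/plain body
    pattern = OTP_PATTERN if kind == "otp" else URL_PATTERN
    # Strip sentence punctuation that commonly trails a code or link.
    candidates = {match.rstrip(".,;)") for match in pattern.findall(text)}
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one {kind}, found {len(candidates)}")
    return {"type": kind, "value": candidates.pop()}
```

Failing on zero or multiple candidates is deliberate: guessing between two codes is exactly the kind of silent choice that later shows up as a flake.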

Why minimal extraction makes tests more stable

assertions become robust (you assert on the artifact)
you avoid brittle HTML parsing
you reduce accidental logging of sensitive content
you make the tool safe for LLM agents (smaller prompt surface)

URL validation matters (especially for agents)

If your email contains a link and your automation follows it, validate it before use:

enforce an allowlist of hostnames
block private IP ranges to reduce SSRF risk
disallow unexpected schemes

OWASP’s SSRF guidance is a good baseline, see the OWASP SSRF Prevention Cheat Sheet.
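The three checks above can be sketched with the standard library. The function name `is_safe_url` and the example hostnames are illustrative. Note that a hostname allowlist does not by itself stop an allowed name from resolving to a private address, so treat this as a first filter rather than complete SSRF protection.

```python
import ipaddress
from typing import Collection
from urllib.parse import urlsplit


def is_safe_url(url: str, allowed_hosts: Collection[str]) -> bool:
    """Return True only for https URLs whose host is explicitly allowlisted."""
    try:
        parts = urlsplit(url)
        host = parts.hostname          # lowercased, without port or userinfo
    except ValueError:
        return False                   # malformed URL
    # Disallow unexpected schemes.
    if parts.scheme != "https" or not host:
        return False
    # Block literal IPs in private, loopback, or otherwise non-global ranges.
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None                      # not a literal IP address
    if ip is not None and not ip.is_global:
        return False
    # Enforce the hostname allowlist.
    return host in allowed_hosts
```

For example, `is_safe_url("https://app.example.test/verify?t=1", {"app.example.test"})` passes, while the same path over `http`, on another host, or on `10.0.0.5` is rejected.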

Retry safety: treat email as at least once

Even “simple” email flows often produce duplicates:

user clicks resend
your job queue retries
SMTP delivery retries
webhook retries
polling loops see the same message again

Your harness should define what “consume once” means.

Practical dedupe keys

Use the most stable identifier your provider gives you, and then add an artifact level guard:

message level: message_id (or provider stable message id)
delivery level: delivery_id (if present)
artifact level: hash of extracted OTP or verification URL

Then, persist the “consumed” record for the duration of the attempt, and do not execute the same artifact twice.
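As a sketch of “consume once”, the guard below combines the three key levels in an in-memory set scoped to one attempt. The class name `ConsumeOnce` and the `message_id` / `delivery_id` field names are assumptions; a shared store would be needed if several processes handle the same attempt.

```python
import hashlib
from typing import Any, Mapping, Set


class ConsumeOnce:
    """Tracks consumed messages and artifacts for the duration of one attempt."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def try_consume(self, message: Mapping[str, Any], artifact_value: str) -> bool:
        """Return True the first time, False for any duplicate."""
        digest = hashlib.sha256(artifact_value.encode("utf-8")).hexdigest()
        keys = {
            f"message:{message['message_id']}",   # message level
            f"artifact:{digest}",                 # artifact level (hash, not raw value)
        }
        if message.get("delivery_id"):            # delivery level, if present
            keys.add(f"delivery:{message['delivery_id']}")
        duplicate = bool(keys & self._seen)
        self._seen |= keys                        # remember every key either way
        return not duplicate
```

Hashing the artifact keeps the OTP or link itself out of the consumed record, and the artifact-level key catches a resend that arrives under a new message id with the same code.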

CI integration: make failures debuggable, not mysterious

Use time budgets, not random timeouts

A good pattern is to define an explicit budget for email receipt tests, by flow type.

Flow	Typical CI budget	Notes
Sign up verification	30 to 90 seconds	Queue spikes are common on cold start
Password reset	30 to 90 seconds	Similar to sign up, but may be rate limited
Magic link sign in	30 to 120 seconds	Link must remain valid long enough

Pick budgets that match your system’s worst case, then fail with evidence.

Attach the message JSON as a CI artifact

When a test fails, you want to answer:

did the message arrive
what did we match
what did we extract

Because Mailhook returns received emails as JSON, you can store that JSON (or a redacted subset) as a build artifact for a short retention period.
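A minimal sketch of writing a redacted subset to disk, for your CI system to upload as an artifact. The `SAFE_FIELDS` allowlist and the field names are assumptions; adjust them to the JSON your provider actually returns.

```python
import json
import re
from pathlib import Path
from typing import Any, Mapping

# Assumed field names; everything not listed here (including bodies) is dropped.
SAFE_FIELDS = ("inbox_id", "message_id", "received_at", "from", "subject")


def write_message_artifact(message: Mapping[str, Any], directory: Path) -> Path:
    """Write an allowlisted subset of the message JSON and return the file path."""
    redacted = {key: message[key] for key in SAFE_FIELDS if key in message}
    directory.mkdir(parents=True, exist_ok=True)
    # Keep the file name safe even if the id contains path separators.
    safe_id = re.sub(r"[^A-Za-z0-9_.-]", "_", str(redacted.get("message_id", "unknown")))
    path = directory / f"email-{safe_id}.json"
    path.write_text(json.dumps(redacted, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

Using an allowlist rather than a blocklist means new fields added by the provider stay out of CI artifacts until someone decides they are safe to keep.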

Make the inbox descriptor part of the test output

Log the inbox descriptor returned by your inbox provider. For Mailhook, that means logging the inbox id plus the disposable address it generated.

This single change makes most flakes actionable.

Figure: example CI log snippet with labeled fields (run_id, attempt_id, inbox_id, email_address, message_id, received_at, extracted_artifact_type). No real emails or secrets shown.

Shared domains vs custom domains in CI

Domain choice affects deliverability and allowlisting, but it should not change your harness.

A reliable setup keeps domain strategy as configuration:

start with shared domains for speed
move to a custom domain or subdomain when you need allowlisting or tighter control

Mailhook supports instant shared domains and custom domain support, so you can keep the same harness while changing the domain layer later. Refer to llms.txt for the current setup and integration details.

A practical checklist for “receive email test” stability

Use this as a code review checklist for any test that waits for an email:

The test provisions one inbox per attempt.
The code waits with an overall deadline, no fixed sleeps.
Matching rules include inbox isolation plus a correlation token.
Webhook handlers (if used) verify authenticity and are idempotent.
Polling loops (if used) have backoff, per request timeouts, and dedupe.
The harness extracts only the minimal artifact (OTP or URL).
The test attaches the email JSON (or redacted subset) as a CI artifact.
Inboxes have an explicit lifecycle (expire or cleanup policy).

Frequently Asked Questions

How do I write a receive email test that works in parallel CI?
Use one disposable inbox per attempt, correlate with an attempt token, and wait with a deadline (webhook first, polling fallback).

Should I use webhooks or polling for email tests in CI?
Use webhooks when CI can safely receive inbound requests. Otherwise use polling with a deadline, backoff, and dedupe. Many teams use a hybrid: webhooks for speed, polling as a safety net.

How long should my test wait for an email before failing?
Set an overall deadline based on your worst case queue and provider latency. Common budgets are 30 to 120 seconds for verification style flows.

Why did my test process the same verification email twice?
Because delivery is often at least once. Add dedupe and idempotent “consume once” semantics using stable ids (and an artifact hash guard).

Is it safe to let an LLM read the full email body?
Often no. Prefer extracting a minimal artifact in deterministic code, and give the model only the smallest safe representation it needs.

Make your CI email receipt deterministic with Mailhook

If you are tired of “email not received” flakes, stop treating email like a shared mailbox and start treating it like an event stream tied to isolated inbox resources.

Mailhook lets you create disposable inboxes via API, receive emails as structured JSON, and integrate via webhooks or polling, with signed payloads for security. To integrate accurately, use the canonical reference: Mailhook llms.txt.

Explore Mailhook at mailhook.co and build a receive email test harness that survives parallel CI, retries, and agent driven automation.