Email receipt is one of the most common sources of flaky end-to-end tests, because it is the point where your deterministic test runner meets the messy reality of queues, retries, greylisting, template changes, and asynchronous delivery.
A “receive email test” that passes locally and fails in CI is usually not broken logic; it is a broken contract. The fix is to turn “wait for an email” from a best-effort sleep into a small, explicit, retry-safe harness with isolation, correlation, and observability.
This guide shows how to build receive email test flows that do not flake in CI, with patterns that also work well for LLM agents.
For Mailhook’s canonical integration contract and up to date API semantics, start with llms.txt.
Why email receipt flakes specifically in CI
CI makes email tests harder than local runs because it amplifies every race condition:
- Parallelism: multiple jobs trigger the same email template at the same time.
- Retries: CI reruns failed tests, your app resends messages, and providers retry delivery.
- Cold starts and variable latency: a queue consumer that normally processes in 500 ms might take 8 seconds when a container spins up.
- Non-hermetic infrastructure: external mail delivery is not part of your transaction.
- Debuggability gap: most teams cannot “open the mailbox” in CI, so failures become guesswork.
The goal is not “make email fast.” The goal is to make email receipt deterministic enough that your test runner fails for real reasons, not timing.
The CI safe contract for a receive email test
A robust receive email test is built on a few invariants. If you bake these into a shared helper (fixture, library, agent tool), email stops being special.
Invariant 1: Isolation, one inbox per attempt
Do not reuse a mailbox across tests, jobs, or retries.
Your harness should provision an inbox that is:
- unique for the attempt
- short lived
- readable via API
This prevents the classic failure where a test accidentally matches an email that belongs to a different run.
Invariant 2: Deterministic waiting, deadline based, not sleep based
Fixed sleeps are a leading cause of flakes: too short and the test fails intermittently, too long and every run pays the full delay.
Instead of sleep(10):
- set an overall deadline (for example, 60 seconds)
- use webhooks when available for immediate arrival
- use polling as a deterministic fallback when webhooks are not reachable from CI
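A deadline-based wait can be sketched as below. This is a minimal illustration, not a provider API: `wait_until`, its parameters, and the injectable `clock` and `sleep` hooks are hypothetical names chosen for this example. The injected clock lets you unit test the wait itself without real delays.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    check: Callable[[], Optional[T]],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call check() until it returns a non-None value or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"no result within {timeout_s:.0f}s")
        # Never sleep past the overall deadline.
        sleep(min(interval_s, remaining))
```

Unlike `sleep(10)`, this returns as soon as the message is visible and fails with a clear error once the budget is spent.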
Invariant 3: Strong correlation matchers
“Inbox” is your first correlation boundary. It is not always enough.
Add one more correlation signal you control, for example:
- a per attempt token stored in your app DB and reflected in the email (subject, headers, or body)
- a run id that your test harness generates and passes into the flow
Then match narrowly and explicitly.
Invariant 4: Idempotent consumption and dedupe
Treat inbound delivery as at least once.
Your harness should behave correctly if:
- the same message is delivered twice
- your webhook endpoint receives the same payload twice
- a polling loop sees the same message again
Invariant 5: Observability as a first class output
When the test fails, you should have a structured artifact you can inspect.
At minimum, log and attach:
- inbox identifier
- message identifiers (message id, delivery id if your provider has one)
- received timestamp
- the extracted artifact (OTP or verification URL), if safe
A good receive email test produces evidence, not vibes.
A reference architecture: the “Email Receipt Harness”
Treat inbound email as an event stream and build a small harness with three responsibilities:
- Provision an isolated inbox
- Wait for the right message (webhook first, polling fallback)
- Extract the minimal artifact you need to complete the flow

Minimal interfaces (provider agnostic)
Even if you use a hosted provider, you want your test code to depend on stable interfaces:
- `createInbox()` returns a descriptor `{ inbox_id, email_address, expires_at }`
- `waitForMessage(inbox_id, matcher, deadline)` returns a message JSON payload
- `extractArtifact(message)` returns `{ type: "otp" | "url", value }`
This is also an excellent shape for an LLM tool, because it is narrow and deterministic.
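One way to pin these interfaces down in Python is with typing constructs, as in the sketch below. The snake_case method names mirror the three operations above. The `Message` and `Matcher` aliases and the assumption that `expires_at` is an ISO 8601 string are illustrative, not part of any provider's contract.

```python
from typing import Any, Callable, Dict, Literal, Protocol, TypedDict

Message = Dict[str, Any]             # provider message JSON; shape is provider specific
Matcher = Callable[[Message], bool]  # returns True for the message you want


class InboxDescriptor(TypedDict):
    inbox_id: str
    email_address: str
    expires_at: str  # assumed here to be an ISO 8601 timestamp


class Artifact(TypedDict):
    type: Literal["otp", "url"]
    value: str


class EmailReceiptHarness(Protocol):
    """What test code depends on; a provider adapter implements it."""

    def create_inbox(self) -> InboxDescriptor: ...

    def wait_for_message(
        self, inbox_id: str, matcher: Matcher, deadline: float
    ) -> Message: ...

    def extract_artifact(self, message: Message) -> Artifact: ...
```

Because `EmailReceiptHarness` is a structural protocol, a hosted-provider adapter and an in-memory fake can both satisfy it without sharing a base class.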
Common failure modes and the deterministic fixes
The table below is a practical cheat sheet for debugging flaky “receive email test” failures.
| Failure mode | What you see in CI | Deterministic fix |
|---|---|---|
| Mailbox collision | Test reads a message from another run | One inbox per attempt, do not reuse addresses |
| Sleep too short | “No email received” intermittently | Deadline based wait with polling or webhook |
| Over broad matcher | Wrong email matched (welcome vs verification) | Narrow matchers, correlate by attempt token |
| Duplicate deliveries | Verification step runs twice | Dedupe by stable ids, consume once semantics |
| HTML scraping drift | Regex fails after template change | Prefer structured JSON fields and text/plain |
| Webhook spoofing or replay | Unexpected message triggers automation | Verify signed payloads, timestamp tolerance, replay detection |
Designing the wait: webhook first, polling fallback
Your wait primitive should be designed for the environment it runs in.
When webhooks work great
Webhooks are ideal when your CI environment can expose an endpoint (or you have a relay) because:
- low latency
- fewer API calls than polling
- clearer “event arrived” semantics
If you use webhooks, keep the handler boring:
- verify authenticity
- enqueue the payload
- acknowledge fast
- dedupe before processing
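The four handler steps can be sketched without a web framework. Everything about the signature scheme here is an assumption made for illustration: HMAC-SHA256 over `timestamp + "." + body`, a hex-encoded signature, and a 300 second tolerance. Check your provider's documentation (for Mailhook, llms.txt) for the real header names and signing rules. The `delivery_id` and `message_id` payload fields are likewise hypothetical.

```python
import hashlib
import hmac
import json
import queue
import time
from typing import Optional, Set

TOLERANCE_S = 300  # assumed replay window, in seconds


def verify_signature(secret: bytes, timestamp: str, body: bytes,
                     signature_hex: str, now: float) -> bool:
    """Check freshness and an HMAC-SHA256 signature (assumed scheme)."""
    try:
        sent_at = float(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > TOLERANCE_S:
        return False
    expected = hmac.new(secret, timestamp.encode() + b"." + body,
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


class WebhookReceiver:
    """Verify, dedupe, enqueue, and return a status code quickly."""

    def __init__(self, secret: bytes) -> None:
        self.secret = secret
        self.pending: "queue.Queue[dict]" = queue.Queue()
        self._seen: Set[str] = set()

    def handle(self, body: bytes, timestamp: str, signature_hex: str,
               now: Optional[float] = None) -> int:
        now = time.time() if now is None else now
        if not verify_signature(self.secret, timestamp, body, signature_hex, now):
            return 401
        payload = json.loads(body)
        # Hypothetical field names; use your provider's stable ids.
        key = str(payload.get("delivery_id") or payload.get("message_id"))
        if key in self._seen:
            return 200  # duplicate: acknowledge, but do not enqueue again
        self._seen.add(key)
        self.pending.put(payload)
        return 200
```

The test process then reads from `pending` under its own deadline, so slow downstream work never delays the acknowledgement.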
When polling is the right default for CI
Many CI jobs cannot accept inbound traffic safely. In that case, polling is simpler and more predictable.
A CI friendly polling loop looks like this:
- a fixed overall deadline
- short per request timeouts
- exponential backoff with a cap
- a “seen set” of message ids to avoid reprocessing
If your provider supports batch retrieval, you can poll less frequently and fetch in batches, but keep your consumption idempotent.
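A polling loop with those four properties might look like the following sketch. The `fetch` callable stands in for whatever list-messages call your provider exposes, and the `id` field on each message is an assumed name. The clock and sleep functions are injectable so the loop can be tested without waiting.

```python
import time
from typing import Any, Callable, Dict, Iterable, Set

Message = Dict[str, Any]


def poll_for_message(
    fetch: Callable[[float], Iterable[Message]],  # fetch(request_timeout_s)
    matcher: Callable[[Message], bool],
    timeout_s: float = 60.0,
    first_delay_s: float = 0.5,
    max_delay_s: float = 5.0,
    request_timeout_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Message:
    deadline = clock() + timeout_s
    seen: Set[str] = set()  # ids already evaluated in this attempt
    delay = first_delay_s
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(
                f"no matching message within {timeout_s:.0f}s; "
                f"ids seen: {sorted(seen)}"
            )
        # Per-request timeout never exceeds what is left of the budget.
        for message in fetch(min(request_timeout_s, remaining)):
            message_id = str(message["id"])
            if message_id in seen:
                continue
            seen.add(message_id)
            if matcher(message):
                return message
        # Exponential backoff with a cap, clipped to the deadline.
        sleep(min(delay, max(0.0, deadline - clock())))
        delay = min(delay * 2, max_delay_s)
```

Including the ids that were seen in the timeout error makes it easier to tell "nothing arrived" apart from "something arrived but did not match".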
For Mailhook specifically, you can combine webhooks and polling. Mailhook supports real time webhook notifications and a polling API, and delivers emails as structured JSON, see the canonical details in llms.txt.
Matching the correct email without brittle rules
A deterministic matcher is layered. Start strict, relax only if necessary.
Recommended matcher layers
- Inbox boundary: only messages in the attempt’s inbox.
- Intent discriminator: subject prefix, template marker, or a sender identity you expect.
- Correlation token: a per attempt token you generated.
- Freshness: message received after the attempt started.
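The four layers can be expressed as a single matcher object, sketched below. The message field names (`inbox_id`, `subject`, `text`, `received_at`) and the use of epoch seconds for timestamps are assumptions for illustration; adapt them to the JSON your provider returns.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class AttemptMatcher:
    inbox_id: str
    subject_prefix: str
    token: str
    started_at: float  # epoch seconds when the attempt began (assumed unit)

    def failed_layer(self, message: Dict[str, Any]) -> Optional[str]:
        """Return the first layer that rejects the message, or None if all pass."""
        if message.get("inbox_id") != self.inbox_id:
            return "inbox"
        if not str(message.get("subject", "")).startswith(self.subject_prefix):
            return "intent"
        if self.token not in str(message.get("text", "")):
            return "token"
        if float(message.get("received_at", 0)) < self.started_at:
            return "freshness"
        return None

    def __call__(self, message: Dict[str, Any]) -> bool:
        return self.failed_layer(message) is None
```

Reporting which layer rejected a message turns a silent mismatch into a log line you can act on.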
Avoid matchers that rely on:
- “latest message in the mailbox” when the mailbox can contain multiple attempts
- HTML layout and CSS
- non stable header ordering (email headers are tricky and can vary)
If you need background on how email structure works, RFCs are the canonical reference, see RFC 5322 (message format) and RFC 5321 (SMTP).
Extract only the minimal artifact
Most email tests do not need the full email. They need one thing:
- an OTP
- a verification link
- a password reset link
Make your harness return that artifact, not the whole message.
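A minimal extractor, working from the text/plain body only, could look like this. It assumes a `text` field on the message, six-digit OTPs, and HTTPS links; the extra `kind` parameter is an addition for this sketch so that a footer link is never mistaken for the artifact in an OTP email.

```python
import re
from typing import Any, Dict

OTP_RE = re.compile(r"(?<!\d)(\d{6})(?!\d)")   # assumes 6-digit codes
URL_RE = re.compile(r"https://[^\s<>\"']+")    # HTTPS links only


def extract_artifact(message: Dict[str, Any], kind: str) -> Dict[str, str]:
    """Return {"type": kind, "value": ...} from the plain-text body."""
    text = message.get("text") or ""
    if kind == "otp":
        match = OTP_RE.search(text)
        if match:
            return {"type": "otp", "value": match.group(1)}
    elif kind == "url":
        match = URL_RE.search(text)
        if match:
            # Drop sentence punctuation that often trails a link in prose.
            return {"type": "url", "value": match.group(0).rstrip(".,)")}
    else:
        raise ValueError(f"unknown artifact kind: {kind!r}")
    raise ValueError(f"no {kind} found in text/plain body")
```

If your provider already exposes the OTP or link as a structured field, prefer that over any regular expression.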
Why minimal extraction makes tests more stable
- assertions become robust (you assert on the artifact)
- you avoid brittle HTML parsing
- you reduce accidental logging of sensitive content
- you make the tool safe for LLM agents (smaller prompt surface)
URL validation matters (especially for agents)
If your email contains a link and your automation follows it, validate it before use:
- enforce an allowlist of hostnames
- block private IP ranges to reduce SSRF risk
- disallow unexpected schemes
OWASP’s SSRF guidance is a good baseline, see the OWASP SSRF Prevention Cheat Sheet.
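The three checks can be combined into one validation function, as in the sketch below. The allowlist entry `app.example.test` is a placeholder, and the injectable `resolve` parameter is an assumption of this example so the DNS step can be replaced in tests. A hostname on the allowlist is still rejected if it resolves to a non-public address.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"app.example.test"}  # placeholder; use your app's hostnames


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname to IP strings using the system resolver."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return [info[4][0].split("%")[0] for info in infos]


def validate_link(
    url: str,
    allowed_hosts: Iterable[str] = ALLOWED_HOSTS,
    resolve: Callable[[str], List[str]] = resolve_host,
) -> str:
    """Return the URL unchanged if it passes all checks, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if parts.username or parts.password:
        raise ValueError("credentials in URL are not allowed")
    host = (parts.hostname or "").lower()
    if host not in set(allowed_hosts):
        raise ValueError(f"host not on allowlist: {host!r}")
    for address in resolve(host):
        # is_global is False for private, loopback, and link-local ranges.
        if not ipaddress.ip_address(address).is_global:
            raise ValueError(f"{host} resolves to non-public address {address}")
    return url
```

This check reduces risk but does not remove it: the HTTP client that follows the link performs its own DNS lookup, so pinning the resolved address or disabling redirects may also be needed.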
Retry safety: treat email as at least once
Even “simple” email flows often produce duplicates:
- user clicks resend
- your job queue retries
- SMTP delivery retries
- webhook retries
- polling loops see the same message again
Your harness should define what “consume once” means.
Practical dedupe keys
Use the most stable identifier your provider gives you, and then add an artifact level guard:
- message level: `message_id` (or provider stable message id)
- delivery level: `delivery_id` (if present)
- artifact level: hash of extracted OTP or verification URL
Then, persist the “consumed” record for the duration of the attempt, and do not execute the same artifact twice.
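A small in-memory guard is enough for a single attempt. The sketch below treats a message as a duplicate if any of its three keys has been consumed before. The class name and the `message_id` and `delivery_id` field names are illustrative assumptions.

```python
import hashlib
from typing import Any, Dict, List, Set


class ConsumeOnce:
    """Tracks consumed keys for the lifetime of one test attempt."""

    def __init__(self) -> None:
        self._consumed: Set[str] = set()

    @staticmethod
    def dedupe_keys(message: Dict[str, Any], artifact_value: str) -> List[str]:
        keys = [f"message:{message['message_id']}"]
        if message.get("delivery_id"):
            keys.append(f"delivery:{message['delivery_id']}")
        # Hash the artifact so the raw OTP or URL is never stored or logged.
        digest = hashlib.sha256(artifact_value.encode("utf-8")).hexdigest()
        keys.append(f"artifact:{digest}")
        return keys

    def try_consume(self, message: Dict[str, Any], artifact_value: str) -> bool:
        """Return True the first time, False for any kind of duplicate."""
        keys = self.dedupe_keys(message, artifact_value)
        if any(key in self._consumed for key in keys):
            return False
        self._consumed.update(keys)
        return True
```

The artifact-level key is what catches a resent email: it arrives with a new message id but carries the same OTP or link.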
CI integration: make failures debuggable, not mysterious
Use time budgets, not random timeouts
A good pattern is to define an explicit budget for email receipt tests, by flow type.
| Flow | Typical CI budget | Notes |
|---|---|---|
| Sign up verification | 30 to 90 seconds | Queue spikes are common on cold start |
| Password reset | 30 to 90 seconds | Similar to sign up, but may be rate limited |
| Magic link sign in | 30 to 120 seconds | Link must remain valid long enough |
Pick budgets that match your system’s worst case, then fail with evidence.
Attach the message JSON as a CI artifact
When a test fails, you want to answer:
- did the message arrive
- what did we match
- what did we extract
Because Mailhook returns received emails as JSON, you can store that JSON (or a redacted subset) as a build artifact for a short retention period.
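Writing that evidence file can be a few lines in a test teardown hook. In this sketch the body field names (`text`, `html`) that get redacted are assumptions, and the artifact is recorded only as a type, length, and short hash so the file answers the three questions without leaking a usable OTP or link.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict, Optional

REDACTED_FIELDS = ("text", "html")  # assumed names of the body fields


def write_email_evidence(
    path: str,
    inbox: Dict[str, Any],
    message: Optional[Dict[str, Any]] = None,
    artifact: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Write a redacted JSON summary of the attempt and return it."""
    evidence: Dict[str, Any] = {"inbox": inbox, "message": None, "artifact": None}
    if message is not None:  # None means nothing arrived or nothing matched
        evidence["message"] = {
            key: "[redacted]" if key in REDACTED_FIELDS else value
            for key, value in message.items()
        }
    if artifact is not None:
        value = artifact["value"]
        evidence["artifact"] = {
            "type": artifact["type"],
            "length": len(value),
            "sha256_prefix": hashlib.sha256(value.encode("utf-8")).hexdigest()[:12],
        }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(evidence, indent=2, sort_keys=True, default=str),
                   encoding="utf-8")
    return evidence
```

Point your CI system's artifact upload step at the directory you write to, and keep the retention period short.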
Make the inbox descriptor part of the test output
Log the inbox descriptor returned by your inbox provider. For Mailhook, that means logging the inbox id plus the disposable address it generated.
This single change makes most flakes actionable.

Shared domains vs custom domains in CI
Domain choice affects deliverability and allowlisting, but it should not change your harness.
A reliable setup keeps domain strategy as configuration:
- start with shared domains for speed
- move to a custom domain or subdomain when you need allowlisting or tighter control
Mailhook supports instant shared domains and custom domain support, so you can keep the same harness while changing the domain layer later. Refer to llms.txt for the current setup and integration details.
A practical checklist for “receive email test” stability
Use this as a code review checklist for any test that waits for an email:
- The test provisions one inbox per attempt.
- The code waits with an overall deadline, no fixed sleeps.
- Matching rules include inbox isolation plus a correlation token.
- Webhook handlers (if used) verify authenticity and are idempotent.
- Polling loops (if used) have backoff, per request timeouts, and dedupe.
- The harness extracts only the minimal artifact (OTP or URL).
- The test attaches the email JSON (or redacted subset) as a CI artifact.
- Inboxes have an explicit lifecycle (expire or cleanup policy).
Frequently Asked Questions
How do I write a receive email test that works in parallel CI? Use one disposable inbox per attempt, correlate with an attempt token, and wait with a deadline (webhook first, polling fallback).
Should I use webhooks or polling for email tests in CI? Use webhooks when CI can safely receive inbound requests. Otherwise use polling with a deadline, backoff, and dedupe. Many teams use a hybrid: webhooks for speed, polling as a safety net.
How long should my test wait for an email before failing? Set an overall deadline based on your worst case queue and provider latency. Common budgets are 30 to 120 seconds for verification style flows.
Why did my test process the same verification email twice? Because delivery is often at least once. Add dedupe and idempotent “consume once” semantics using stable ids (and an artifact hash guard).
Is it safe to let an LLM read the full email body? Often no. Prefer extracting a minimal artifact in deterministic code, and give the model only the smallest safe representation it needs.
Make your CI email receipt deterministic with Mailhook
If you are tired of “email not received” flakes, stop treating email like a shared mailbox and start treating it like an event stream tied to isolated inbox resources.
Mailhook lets you create disposable inboxes via API, receive emails as structured JSON, and integrate via webhooks or polling, with signed payloads for security. To integrate accurately, use the canonical reference: Mailhook llms.txt.
Explore Mailhook at mailhook.co and build a receive email test harness that survives parallel CI, retries, and agent driven automation.