Create Email On Demand for End-to-End Test Suites

End-to-end (E2E) test suites tend to fail in the least helpful way when email is involved. A signup test passes 98 times, then flakes once because the inbox had an extra message, the email arrived late, or your code parsed a slightly different template.

The fix is not “increase the sleep.” The fix is to create email on demand, meaning every test run (or every test case) provisions its own routable email address and inbox, then consumes messages deterministically through an API.

This guide shows a practical, CI-friendly pattern for E2E suites (Playwright, Cypress, Selenium, or agent-driven tests): generate an inbox at runtime, wait for a message via webhook or polling, extract an OTP or magic link from structured JSON, and keep the whole flow isolated and debuggable.

For Mailhook-specific integration details, always refer to the canonical contract in llms.txt.

What “create email on demand” means in an E2E suite

In a typical E2E flow, your system under test sends an email (verification, magic link, OTP, invite). Your test must then read that email and continue.

“Create email on demand” replaces the usual shared mailbox approach with a short-lived, per-run inbox that is:

Provisioned programmatically (your test asks for an inbox and receives an address)
Isolated (no mailbox search, no cross-test collisions)
Machine-readable (emails are delivered as structured JSON instead of you scraping HTML)
Deterministic to wait on (webhook-first, with polling fallback and explicit timeouts)

In practice, you stop thinking in terms of “log into an email account” and instead treat email as a test dependency with a clean API boundary.

Why email makes E2E tests flaky

Email is an asynchronous, multi-hop system. Between your app and your test, you have queues, retries, templating changes, provider rate limits, spam filters, and variable latency.

Most flaky E2E suites share a few recurring root causes:

Shared inbox state

If multiple tests reuse the same mailbox, your test can read the wrong message, read an older message, or fail because it cannot reliably identify “the email for this run.” Parallel CI makes this worse.

Fixed sleeps instead of explicit waits

A sleep(10s) is neither fast nor reliable. When delivery takes 12 seconds, you fail. When it takes 1 second, you waste 9 seconds per test. The correct model is “wait until condition, with timeout.”
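The "wait until condition, with timeout" model can be sketched as a small generic helper. This is a minimal illustration, not part of any library: the name `waitUntil` and its parameters are assumptions made for this example.

```typescript
// Illustrative helper: re-check a condition until it yields a value or the deadline passes.
async function waitUntil<T>(
  check: () => Promise<T | undefined>,
  timeoutMs: number,
  intervalMs = 500,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await check();
    if (result !== undefined) return result; // condition met: return as soon as possible
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails with an explicit error once the budget is spent.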

Brittle HTML parsing

Email HTML changes constantly. Minor layout tweaks can break regex-based extraction. Robust tests assert on stable artifacts (OTP, link URL, token) and prefer text/plain when possible.
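As a rough sketch of asserting on a stable artifact, the function below pulls a numeric code out of a text/plain body. The message shape (`text`) and the 6-digit default are assumptions for illustration; adapt them to whatever your provider and app actually send.

```typescript
// Hypothetical message shape: only a text/plain body is assumed here.
interface PlainTextMessage {
  text: string;
}

// Returns the first standalone run of exactly `digits` digits, or throws.
function extractOtp(message: PlainTextMessage, digits = 6): string {
  // Lookarounds reject longer digit runs such as order numbers.
  const pattern = new RegExp(`(?<!\\d)\\d{${digits}}(?!\\d)`);
  const match = pattern.exec(message.text);
  if (!match) {
    throw new Error(`No ${digits}-digit code found in text/plain body`);
  }
  return match[0];
}
```

The pattern is deliberately narrow and runs against plain text only, so a layout change in the HTML part does not affect it.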

Non-actionable failures

When an email-dependent test fails, teams often lack the right evidence: message timestamps, headers, delivery events, or the exact payload that arrived. That turns debugging into guesswork.
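One way to make such failures actionable is to have the harness attach a small evidence record to every timeout. The sketch below assumes a hypothetical evidence shape; none of the field names come from a provider schema.

```typescript
// Hypothetical evidence shape; field names are illustrative only.
interface SeenMessage {
  id: string;
  subject: string;
  receivedAt: string;
}

interface EmailWaitEvidence {
  runId: string;
  inboxId: string;
  address: string;
  waitedMs: number;
  seen: SeenMessage[];
}

// Builds a multi-line report suitable for a test failure message or CI log.
function formatEmailTimeout(evidence: EmailWaitEvidence): string {
  const lines = [
    `Email wait timed out after ${evidence.waitedMs} ms`,
    `  run_id: ${evidence.runId}`,
    `  inbox: ${evidence.inboxId} <${evidence.address}>`,
    `  messages seen: ${evidence.seen.length}`,
  ];
  for (const m of evidence.seen) {
    lines.push(`    - ${m.receivedAt} ${m.id} "${m.subject}"`);
  }
  return lines.join("\n");
}
```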

A reliability-first design for email in E2E

Before implementation details, align on a small set of invariants. Your email layer should provide:

Isolation: one inbox per test run (or per test case in high-concurrency suites)
Correlation: each run has a run_id (or similar) that your system under test can echo in the email subject or headers when possible
Deterministic waits: webhook-first consumption with polling fallback and explicit timeouts
Stable parsing: extract the minimum verification artifact, not the whole template
Security controls: treat email as untrusted input, validate URLs and domains, and verify webhook signatures

If you design for these invariants, the most common sources of email flakiness (shared state, fixed sleeps, brittle parsing) are removed by construction, and the failures that remain become explainable.
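For the webhook signature invariant, a verification helper might look like the sketch below. It assumes an HMAC-SHA256 signature over the raw request body, hex-encoded; that scheme is an assumption for illustration, and the actual Mailhook signing details should come from llms.txt.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed scheme (not confirmed by the provider contract):
// signature = hex(HMAC-SHA256(secret, rawBody)).
function verifyWebhookSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Verify against the raw body bytes as received, before any JSON parsing, because re-serialized JSON may not match the signed payload.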

Reference workflow: inbox per run, artifact per email

A minimal “email-on-demand” workflow looks like this:

Create an inbox at the start of the test run, store inbox_id and email_address.
Drive the browser to trigger an email (signup, sign-in, password reset).
Wait for a message to arrive for that inbox (webhook event or polling loop).
Extract the artifact you need (OTP or magic link) from the structured payload.
Continue the E2E flow in the browser using that artifact.
Expire or discard the inbox (or let retention handle it), and keep logs for the run.

The important part is that your test suite never “searches a mailbox.” It consumes messages scoped to an inbox handle.
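The six steps above can be sketched as one orchestration function with the provider and browser actions injected. Every name here (`EmailSteps`, `runEmailFlow`, and the step functions) is hypothetical; the point is the ordering and the guaranteed cleanup.

```typescript
interface FlowInbox {
  inboxId: string;
  address: string;
}

// Hypothetical step functions; each one wraps your provider client or browser driver.
interface EmailSteps<M> {
  createInbox(): Promise<FlowInbox>;
  triggerEmail(address: string): Promise<void>; // e.g. submit the signup form
  waitForMessage(inboxId: string, timeoutMs: number): Promise<M>;
  extractArtifact(message: M): string; // OTP or magic link
  completeFlow(artifact: string): Promise<void>; // continue in the browser
  discardInbox(inboxId: string): Promise<void>;
}

async function runEmailFlow<M>(steps: EmailSteps<M>, timeoutMs = 30_000): Promise<string> {
  const inbox = await steps.createInbox();
  try {
    await steps.triggerEmail(inbox.address);
    const message = await steps.waitForMessage(inbox.inboxId, timeoutMs);
    const artifact = steps.extractArtifact(message);
    await steps.completeFlow(artifact);
    return artifact;
  } finally {
    // Runs on success and on failure, so a failed test does not leak an inbox.
    await steps.discardInbox(inbox.inboxId);
  }
}
```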

Figure: architecture diagram. The CI runner creates a disposable inbox via an API, the app sends an email to that address, and the inbox service normalizes it to JSON and delivers it to the test via webhook or polling.

How to implement it in Playwright (pattern you can copy)

You do not need to hard-code Mailhook endpoints in this article. Keep your E2E code structured around three primitives that any programmable inbox provider (including Mailhook) can satisfy:

createInbox() returns { inboxId, address }
waitForMessage(inboxId, criteria, timeoutMs) returns a message payload
extractVerificationArtifact(message) returns { otp } or { url }
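To keep the harness testable without network access, the first two primitives can be written as an interface with an in-memory stand-in. This is a sketch under assumptions: the message fields, the `subjectIncludes` criterion, and the `FakeInboxProvider` class are all invented for illustration and do not describe the Mailhook API.

```typescript
interface InboxHandle {
  inboxId: string;
  address: string;
}

// Hypothetical message shape for the fake only.
interface FakeMessage {
  id: string;
  subject: string;
  text: string;
}

interface InboxProvider {
  createInbox(): Promise<InboxHandle>;
  waitForMessage(
    inboxId: string,
    criteria: { subjectIncludes?: string },
    timeoutMs: number,
  ): Promise<FakeMessage>;
}

class FakeInboxProvider implements InboxProvider {
  private readonly messages = new Map<string, FakeMessage[]>();
  private counter = 0;

  async createInbox(): Promise<InboxHandle> {
    const inboxId = `inbox-${++this.counter}`;
    this.messages.set(inboxId, []);
    return { inboxId, address: `${inboxId}@example.test` };
  }

  // Test-only hook that simulates an email arriving in one inbox.
  deliver(inboxId: string, message: FakeMessage): void {
    const list = this.messages.get(inboxId);
    if (!list) throw new Error(`Unknown inbox ${inboxId}`);
    list.push(message);
  }

  async waitForMessage(
    inboxId: string,
    criteria: { subjectIncludes?: string },
    timeoutMs: number,
  ): Promise<FakeMessage> {
    const deadline = Date.now() + timeoutMs;
    for (;;) {
      const wanted = criteria.subjectIncludes;
      const hit = (this.messages.get(inboxId) ?? []).find(
        (m) => wanted === undefined || m.subject.includes(wanted),
      );
      if (hit) return hit;
      if (Date.now() >= deadline) throw new Error(`No message in ${inboxId} within ${timeoutMs} ms`);
      await new Promise<void>((resolve) => setTimeout(resolve, 10));
    }
  }
}
```

A real implementation would satisfy the same interface with calls to your inbox provider, so tests of the harness itself never depend on email delivery.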

Here is a Playwright-style fixture pattern in TypeScript-like pseudocode:

// emailFixture.ts
// createInbox and waitForMessage are thin wrappers you write around the
// Mailhook API (contract reference: https://mailhook.co/llms.txt).
// Both module paths below are placeholders for your own files.
import { createInbox, waitForMessage } from './inboxProvider';
import { findFirstAllowedUrl, assertAllowedHost } from './linkPolicy';

export interface Inbox {
  inboxId: string;
  address: string;
}

export async function provisionEmailInbox(): Promise<Inbox> {
  return createInbox();
}

export async function waitForLatestEmail(inboxId: string, timeoutMs = 30_000) {
  // Prefer webhook-driven consumption, but polling can be a fallback.
  return waitForMessage(inboxId, { newest: true }, timeoutMs);
}

export function extractMagicLink(message: unknown): string {
  // Use structured fields if available, otherwise parse text/plain.
  // Avoid regex scraping HTML.
  const url = findFirstAllowedUrl(message);
  assertAllowedHost(url); // throws if the host is not on your allowlist
  return url;
}

And in your test:

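// signup.spec.ts
import { test, expect } from '@playwright/test';
import { provisionEmailInbox, waitForLatestEmail, extractMagicLink } from './emailFixture';

test('signup via magic link', async ({ page }) => {
  const { inboxId, address } = await provisionEmailInbox();

  // Relative URLs assume baseURL is set in your Playwright config.
  await page.goto('/signup');
  await page.fill('[name=email]', address);
  await page.click('button[type=submit]');

  const message = await waitForLatestEmail(inboxId, 45_000);
  const link = extractMagicLink(message);

  await page.goto(link);
  await expect(page.locator('text=Welcome')).toBeVisible();
});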

This pattern scales because every run is isolated, and your wait condition is explicit.

Webhooks vs polling in CI: choose “webhook-first, polling fallback”

Polling is easy to start with, but webhook-driven delivery is usually more deterministic under load because your system reacts to arrival rather than checking repeatedly.

A pragmatic approach:

Local dev: polling is often fine
CI and parallel runs: webhook-first (push) with polling as a safety net

Mailhook supports both webhook notifications and a polling API, so you can implement a hybrid consumer that is reliable in CI but still simple for local runs.
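A hybrid consumer along these lines might race a webhook subscription against a polling loop under one deadline. In this sketch, `onWebhook` and `poll` are hypothetical callbacks you would back with your own webhook receiver and your provider's polling call; neither is a Mailhook function.

```typescript
interface ArrivedMessage {
  id: string;
}

async function waitWebhookFirst<M extends ArrivedMessage>(
  // Subscribes to webhook deliveries for one inbox; returns an unsubscribe function.
  onWebhook: (deliver: (message: M) => void) => () => void,
  poll: () => Promise<M | undefined>,
  timeoutMs: number,
  pollIntervalMs = 1000,
): Promise<M> {
  return new Promise<M>((resolve, reject) => {
    let settled = false;
    let unsubscribe: () => void = () => {};
    const timers: ReturnType<typeof setTimeout>[] = [];

    // Settles exactly once and releases timers and the subscription.
    const finish = (settle: () => void): void => {
      if (settled) return;
      settled = true;
      timers.forEach((t) => clearTimeout(t));
      unsubscribe();
      settle();
    };

    unsubscribe = onWebhook((message) => finish(() => resolve(message)));
    if (settled) {
      unsubscribe(); // the webhook delivered synchronously during subscription
      return;
    }

    timers.push(
      setTimeout(() => finish(() => reject(new Error(`No email within ${timeoutMs} ms`))), timeoutMs),
    );

    // Safety net: the first poll happens only after one interval, so the webhook gets the first chance.
    const pollOnce = async (): Promise<void> => {
      if (settled) return;
      try {
        const message = await poll();
        if (message !== undefined) {
          finish(() => resolve(message));
          return;
        }
      } catch {
        // A failed poll is not fatal; the webhook or a later poll may still succeed.
      }
      if (!settled) timers.push(setTimeout(pollOnce, pollIntervalMs));
    };
    timers.push(setTimeout(pollOnce, pollIntervalMs));
  });
}
```

For local runs, passing an `onWebhook` that never delivers reduces this to plain polling with a timeout.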

A simple timeout budget that stays fast

Email delivery time varies, so tune timeouts with intention instead of using one giant value everywhere.

Step	Recommended default	Why
Wait for first verification email	30 to 60 seconds	Covers typical provider latency without stalling the suite
Poll interval (if polling)	0.5 to 2 seconds	Fast feedback, avoids overloading your API
Overall test timeout for email flows	1.5 to 3 times your email wait	Prevents cascading failures

If your suite is regularly hitting 60 seconds, you likely have a deliverability or environment issue, and you want the test to surface that clearly.
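The budget in the table can be kept in one place and derived rather than repeated per test. The helper below is a hypothetical sketch; its defaults (45 seconds, 1 second, multiplier 2) are simply values picked from inside the recommended ranges.

```typescript
interface TimeoutBudget {
  emailWaitMs: number;
  pollIntervalMs: number;
  testTimeoutMs: number; // overall test timeout for email flows
  maxPolls: number; // upper bound on polling calls within one email wait
}

function buildTimeoutBudget(
  emailWaitMs = 45_000,
  pollIntervalMs = 1_000,
  multiplier = 2,
): TimeoutBudget {
  if (multiplier < 1.5 || multiplier > 3) {
    throw new RangeError("multiplier should stay between 1.5 and 3");
  }
  if (pollIntervalMs <= 0 || pollIntervalMs >= emailWaitMs) {
    throw new RangeError("poll interval must be positive and shorter than the email wait");
  }
  return {
    emailWaitMs,
    pollIntervalMs,
    testTimeoutMs: Math.round(emailWaitMs * multiplier),
    maxPolls: Math.floor(emailWaitMs / pollIntervalMs),
  };
}
```

With the defaults, a 45-second email wait gives a 90-second test timeout and at most 45 polling calls per wait.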

Extracting OTPs and magic links safely (and robustly)

From a testing standpoint, your job is rarely “assert the full email body.” Your job is “extract the token and prove the flow works end-to-end.”

Two practical rules:

Prefer structured fields and text/plain over HTML.
Treat email content as untrusted input, even in testing, because it is easy for unsafe parsing to leak into shared utilities or agent tooling.

For URL extraction, validate:

Scheme is https (or your expected scheme in test environments)
Host matches your allowlist (your staging domain, your app domain)
You follow redirects in a controlled way
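A minimal sketch of the scheme and host checks, using the standard `URL` parser. The function name and the example hosts are assumptions; redirect handling is not shown, but the same check could be applied to each redirect target before following it.

```typescript
// Parses and validates a link taken from an email before the test navigates to it.
function assertSafeLink(
  rawUrl: string,
  allowedHosts: readonly string[],
  allowedSchemes: readonly string[] = ["https:"],
): URL {
  const url = new URL(rawUrl); // throws on malformed input
  if (!allowedSchemes.includes(url.protocol)) {
    throw new Error(`Unexpected scheme: ${url.protocol}`);
  }
  // Exact hostname match: "staging.example.test.evil.example" does not pass.
  if (!allowedHosts.includes(url.hostname)) {
    throw new Error(`Host not in allowlist: ${url.hostname}`);
  }
  return url;
}
```

Matching on the parsed hostname rather than on a substring of the raw string avoids lookalike hosts and credentials embedded before an `@`.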

If you are building agent-driven tests (LLM agents that read emails), this is even more important. OWASP’s general guidance on validating and handling untrusted input is a good baseline mindset, even though email feels “internal” in many teams. See the OWASP Input Validation cheat sheet for principles that map well to email parsing utilities.

Scaling to parallel E2E suites without inbox collisions

Once you run 10 to 200 specs in parallel, most email approaches fall apart unless they are explicitly designed for concurrency.

Use these operational patterns:

One inbox per test case (or per worker)

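If your suite sends multiple emails per test (invite plus verification plus reset), choose one inbox per test case. If each test sends at most one email, one inbox per worker can be sufficient, provided tests on that worker run sequentially and each wait only accepts messages that arrived after the test started.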
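The choice between the two scopes can be isolated in a small registry, sketched here with hypothetical names. The `workerIndex` argument is whatever your runner exposes for the current worker (in Playwright, for example, `testInfo.workerIndex`).

```typescript
type InboxScope = "per-test" | "per-worker";

interface ScopedInbox {
  inboxId: string;
  address: string;
}

class InboxRegistry {
  private readonly scope: InboxScope;
  private readonly create: () => Promise<ScopedInbox>;
  private readonly byKey = new Map<string, Promise<ScopedInbox>>();

  constructor(scope: InboxScope, create: () => Promise<ScopedInbox>) {
    this.scope = scope;
    this.create = create;
  }

  // Returns the inbox for this scope key, creating it on first use.
  get(workerIndex: number, testId: string): Promise<ScopedInbox> {
    const key =
      this.scope === "per-worker" ? `worker:${workerIndex}` : `test:${workerIndex}:${testId}`;
    let inbox = this.byKey.get(key);
    if (!inbox) {
      // Cache the promise, not the result, so concurrent callers share one creation.
      inbox = this.create();
      this.byKey.set(key, inbox);
    }
    return inbox;
  }
}
```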

Run identifiers and metadata

Even with isolated inboxes, it helps to attach run_id, suite, test_name, or commit_sha as metadata in your email fixture logs. If your app can include a correlation identifier in a header or subject, that is even better.
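If your app does echo a correlation identifier, a matcher like the following sketch can confirm that a message belongs to the current run. The header name `x-run-id` and the `[run_id]` subject convention are assumptions for illustration; use whatever your app actually emits.

```typescript
// Hypothetical message shape with headers as a flat name-to-value map.
interface CorrelatableMessage {
  subject: string;
  headers: Record<string, string>;
}

function matchesRun(
  message: CorrelatableMessage,
  runId: string,
  headerName = "x-run-id",
): boolean {
  const wanted = headerName.toLowerCase();
  // Header names are case-insensitive, so compare in lower case.
  for (const name of Object.keys(message.headers)) {
    if (name.toLowerCase() === wanted) {
      return message.headers[name].trim() === runId;
    }
  }
  // Fall back to a bracketed tag in the subject when no header is present.
  return message.subject.includes(`[${runId}]`);
}
```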

Batch processing for high-volume suites

If your application emits multiple emails in a single run (for example, notifications, receipts, invites), batching can simplify your harness: pull a set of messages, then assert them as a group. Mailhook supports batch email processing, which is useful when you want to treat “emails sent” as a single test artifact.
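Asserting a batch as a group could look like the sketch below, which reports which expected emails are missing from a set of messages. The message shape and function name are hypothetical, and how you fetch the batch depends on your provider's API.

```typescript
interface BatchMessage {
  id: string;
  subject: string;
}

// Returns the expected subject fragments that no message in the batch satisfied.
function findMissingSubjects(
  messages: readonly BatchMessage[],
  expectedSubjects: readonly string[],
): string[] {
  const remaining = messages.map((m) => m.subject);
  const missing: string[] = [];
  for (const expected of expectedSubjects) {
    const index = remaining.findIndex((subject) => subject.includes(expected));
    if (index === -1) {
      missing.push(expected);
    } else {
      remaining.splice(index, 1); // each message satisfies at most one expectation
    }
  }
  return missing;
}
```

An empty result means every expected email arrived; a non-empty result names exactly what was missing, which makes the failure message self-explanatory.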

Shared domains vs custom domains in test environments

Most teams want the fastest path to green tests, then they harden deliverability.

A common progression:

Shared domains: fast setup, good for early automation and internal staging
Custom domains (Enterprise tier only): better alignment with your product domain strategy, useful when you need consistent routing and deliverability characteristics

Mailhook supports instant shared domains and Enterprise tier custom domain support, so you can start quickly and upgrade when your needs evolve.

Where Mailhook fits (without guessing at implementation)

Mailhook is built for exactly this “email as a programmable dependency” workflow:

Disposable inbox creation via API
Receive emails as structured JSON
Real-time webhook notifications
Polling API for retrieval
Signed payloads for security
Batch email processing
Shared domains and custom domain support (Enterprise tier only)
No credit card required to get started with 50 requests/day

To implement the concrete API calls and payload verification correctly, use the authoritative reference: https://mailhook.co/llms.txt.

If you are designing tools for LLM agents, treating Mailhook as a tool boundary (create inbox, wait for message, extract artifact) keeps agent prompts small, reduces prompt injection surface, and makes runs reproducible.

Common failure modes (and the fix your harness should apply)

Failure mode	What it looks like	Harness-level fix
Wrong email consumed	Test grabs a message from another run	Isolate inbox per run, never search a shared mailbox
Slow delivery	Tests time out intermittently	Explicit wait with timeout, webhook-first, actionable logging
Duplicate emails	Your app retries, test asserts the wrong one	Always pick newest matching message, add idempotency in extraction
Parsing breaks	Template changed, regex fails	Extract minimal artifact from structured payload or text/plain
Security footguns	Test follows a malicious link in content	Allowlist hosts, validate scheme, verify webhook signatures

When these are handled at the harness level, individual tests become simpler and more stable.
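As one example of a harness-level fix, the duplicate-email row can be handled by a selector that skips redelivered messages and returns the newest match. The message shape below (an `id` plus an ISO 8601 `receivedAt`) is an assumption for this sketch.

```typescript
// Hypothetical message shape; receivedAt is assumed to be an ISO 8601 timestamp.
interface ReceivedMessage {
  id: string;
  subject: string;
  receivedAt: string;
}

function pickNewestMatching(
  messages: readonly ReceivedMessage[],
  subjectIncludes: string,
): ReceivedMessage | undefined {
  const seenIds = new Set<string>();
  let newest: ReceivedMessage | undefined;
  for (const message of messages) {
    if (seenIds.has(message.id)) continue; // same message delivered twice: count it once
    seenIds.add(message.id);
    if (!message.subject.includes(subjectIncludes)) continue;
    if (!newest || Date.parse(message.receivedAt) > Date.parse(newest.receivedAt)) {
      newest = message;
    }
  }
  return newest;
}
```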

Frequently Asked Questions

How do I create email on demand for end-to-end test suites? You create a disposable inbox via API at test start, use its address in the UI flow, then wait for messages via webhook or polling and extract an OTP or magic link from structured JSON.

Is polling good enough for email E2E tests? Polling can work, especially locally, but webhook-first with polling fallback is usually more deterministic in parallel CI and reduces unnecessary API calls.

Should I reuse the same inbox across tests to speed things up? Reusing inboxes is a common source of flakiness because state leaks between tests. Prefer one inbox per test run (or per test case in parallel suites).

What should I assert on in email tests, the full template or the token/link? Prefer asserting on the minimum artifact that proves behavior (OTP, magic link URL, recipient, subject intent). Full-template assertions are brittle and often unrelated to the user outcome.

Where are the exact Mailhook API details? Use the canonical integration reference in llms.txt.

Build a deterministic email layer for your E2E suite

If your test suite currently relies on shared mailboxes, fixed sleeps, or HTML scraping, switching to an inbox-per-run model is the fastest way to remove email-related flakes.

Mailhook provides programmable disposable inboxes that deliver emails as JSON, with webhook notifications, polling, signed payloads, and batch processing designed for CI and LLM agents.

Get started at Mailhook and keep the implementation aligned with the official contract in llms.txt.