Parallel CI makes hidden shared state obvious. A signup test that passes locally can fail when eight workers run at once, not because the product is broken, but because every worker is fighting over the same mailbox. One worker reads another worker’s OTP, a retry consumes a stale magic link, or cleanup removes a message that another test has not processed yet.
The clean fix is to treat email as a per-run resource. Instead of sharing one QA inbox, create a disposable inbox for each parallel test run or, more safely, for each test attempt that expects email. Mailhook is built for this pattern: create disposable inboxes via API, receive messages as structured JSON, consume them through real-time webhooks or polling, and keep the email step deterministic for automation and LLM agents. For exact API semantics, use Mailhook’s llms.txt integration reference.
Why shared inboxes break under parallel test runs
Modern test runners are designed to parallelize. Playwright, for example, runs tests in worker processes, and CI systems routinely shard suites across machines. That is great for speed, but it exposes any resource that was quietly global.
Email is one of the worst global resources in a test suite because it is asynchronous, retried by infrastructure, sometimes delayed, and often parsed with loose matchers. A shared mailbox has no built-in ownership model. It can tell you that an email arrived, but not which worker owns it unless you add a reliable routing layer.
Common failure patterns include stale selection, duplicate delivery, wrong-recipient reads, retry collisions, and destructive cleanup. If an automated agent is involved, a shared mailbox also increases the chance that the agent sees unrelated, untrusted email content.

The isolation rule: one inbox for the smallest unit of ownership
A disposable inbox gives each parallel execution its own email address and inbox identifier. The practical rule is simple: if two executions can run at the same time or be retried independently, they should not share an inbox.
For critical email flows such as signup verification, password reset, OTP login, and magic-link sign-in, the safest unit is usually one inbox per test attempt. A retry is not the same attempt, because the first attempt may still receive late messages. Reusing the same inbox across retries reintroduces stale message selection.
| Inbox scope | Good for | Main risk | Recommendation |
|---|---|---|---|
| Shared QA mailbox | Manual debugging only | Maximum collisions and stale reads | Avoid for parallel CI |
| Per CI job | Coarse smoke tests | Workers inside the job can still collide | Use only for non-email assertions |
| Per worker | Sequential worker-owned flows | Test order can leak state | Acceptable for controlled fixtures |
| Per test run | Independent tests without retries | Retry can read a prior attempt’s email | Good when retries are disabled |
| Per test attempt | OTPs, magic links, signup, agents | More inboxes to manage | Best default for reliability |
This changes your mental model from “find the right message in a mailbox” to “read the message from the right inbox.” That distinction is what makes parallel test runs predictable.
Pass an inbox descriptor, not just an email string
The email address is only one part of the contract. Your harness should carry an inbox descriptor through the flow so every wait, webhook, log, and assertion can reference the same resource.
| Field | Why it matters in parallel CI |
|---|---|
| email | The address used by the app under test |
| inbox_id | The stable handle used to retrieve messages from the correct inbox |
| run_id | Links the inbox to the CI pipeline, shard, or test file |
| attempt_id | Separates retries from earlier attempts |
| created_at | Helps debug timing and late arrivals |
| state | Lets cleanup distinguish active, draining, and closed inboxes |
| correlation_token | Optional extra matcher when your app can echo a test token |
Store this descriptor in the test context, not in a global variable. When a test fails, log identifiers like run_id, attempt_id, and inbox_id rather than dumping raw email bodies into CI logs.
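As a minimal sketch, the descriptor can be a plain immutable object built once per attempt. The helper names (`buildInboxDescriptor`, `withState`, `describeForLogs`) and the `{ id, email }` shape of the `inbox` argument are assumptions for illustration, not part of any provider's API.

```javascript
// Hypothetical helper: builds an immutable inbox descriptor for one test attempt.
// `inbox` is assumed to look like { id, email } as returned by your provider client.
function buildInboxDescriptor(inbox, { runId, attemptId, correlationToken = null }) {
  return Object.freeze({
    email: inbox.email,
    inbox_id: inbox.id,
    run_id: runId,
    attempt_id: attemptId,
    created_at: new Date().toISOString(),
    state: "active",
    correlation_token: correlationToken,
  });
}

// State changes produce a new descriptor instead of mutating shared data.
function withState(descriptor, state) {
  return Object.freeze({ ...descriptor, state });
}

// Log identifiers only, never raw email bodies.
function describeForLogs(descriptor) {
  const { run_id, attempt_id, inbox_id } = descriptor;
  return JSON.stringify({ run_id, attempt_id, inbox_id });
}
```

Keeping the descriptor frozen makes it safe to pass between fixtures, webhook handlers, and log helpers without one of them silently changing it.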
A parallel-safe workflow for disposable inboxes
A reliable workflow has five phases. The important detail is that inbox creation happens before the application sends the email, and every later operation is scoped to that inbox.
- Provision the inbox: Create a disposable inbox through your email API and store the returned email address and inbox identifier in the test context.
- Trigger the application event: Use the unique email address in the signup, login, password reset, or invite flow under test.
- Wait deterministically: Prefer a webhook signal for fast arrival, with a bounded polling fallback for resilience.
- Assert on structured JSON: Match by inbox_id first, then sender, subject intent, text content, and extracted artifact.
- Consume and clean up: Use the OTP or verification URL once, record the artifact as consumed, and expire or stop using the inbox.
Avoid fixed sleeps. A 10-second sleep can be too short on a slow CI day and too long on a fast one. A deadline-based wait with webhooks or polling is both faster and more reliable.
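A deadline-based polling wait might look like the sketch below. It assumes a hypothetical `listMessages(inboxId)` function that resolves to the array of messages for one inbox; substitute your provider's real call. The backoff values are illustrative defaults.

```javascript
// Polls one inbox until `match` accepts a message or the deadline passes.
// `listMessages` is a hypothetical provider call, injected so the helper
// stays provider-neutral.
async function pollUntilMessage({
  listMessages,
  inboxId,
  match,
  deadlineMs = 60000,
  initialDelayMs = 250,
  maxDelayMs = 5000,
}) {
  const deadline = Date.now() + deadlineMs;
  let delay = initialDelayMs;

  while (true) {
    const messages = await listMessages(inboxId);
    const found = messages.find(match);
    if (found) return found;

    const remaining = deadline - Date.now();
    if (remaining <= 0) {
      throw new Error(`No matching email in inbox ${inboxId} within ${deadlineMs} ms`);
    }

    // Sleep for the backoff delay, but never past the deadline.
    await new Promise((resolve) => setTimeout(resolve, Math.min(delay, remaining)));
    delay = Math.min(delay * 2, maxDelayMs);
  }
}
```

Unlike a fixed sleep, this returns as soon as the message is visible and fails with a clear error once the deadline is exhausted.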
Provider-neutral implementation sketch
The exact API calls depend on your provider, so treat this as a harness shape rather than copy-paste code. For Mailhook endpoint details and payload fields, refer to the canonical llms.txt file.
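```javascript
// Harness sketch: `mail`, `app`, `browser`, and the helper functions are
// placeholders for your own provider client and test utilities.
async function runEmailVerificationFlow(testInfo) {
  // Build an attempt-scoped label so retries never share an inbox.
  const attemptId = [
    process.env.CI_PIPELINE_ID,
    testInfo.file,
    testInfo.title,
    `worker-${testInfo.workerIndex}`,
    `retry-${testInfo.retry}`
  ].join(':')

  // Provision the inbox before the app sends anything.
  const inbox = await mail.createInbox({
    label: attemptId
  })

  try {
    await app.signUp({
      email: inbox.email,
      name: `qa-${testInfo.workerIndex}`
    })

    // Every read is scoped to this attempt's inbox and bounded by a deadline.
    const message = await waitForInboxMessage({
      inboxId: inbox.id,
      deadlineMs: 60000,
      match: (m) =>
        hasExpectedSender(m) &&
        mentionsVerificationIntent(m) &&
        containsVerificationArtifact(m)
    })

    // Extract from the structured text body, then validate before navigating.
    const verificationUrl = extractVerificationUrl(message.text)
    await browser.goto(allowlisted(verificationUrl))
    await expectAccountVerified()
  } finally {
    // Always release the inbox, even when an assertion fails.
    await stopUsingInbox(inbox.id)
  }
}
```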
Two details matter more than the syntax. First, waitForInboxMessage reads from only one inbox. Second, extraction uses the structured message payload, preferably the text body and derived artifacts, rather than scraping a rendered mailbox UI.
Webhook-first, polling fallback in parallel suites
Webhooks are the best default for parallel test runs because they avoid a thundering herd of workers polling every few seconds. A webhook can receive a JSON email event, verify it, route it by inbox_id, and wake the waiting test.
Polling is still useful as a fallback. Network issues, webhook endpoint outages, or local test environments may make push delivery unavailable. The reliable pattern is not “webhook or polling” but webhook first, with deadline-bounded polling as the safety net.
| Delivery mode | Best role | Parallelism concern | Guardrail |
|---|---|---|---|
| Webhook | Primary low-latency signal | Spoofing, replay, duplicate delivery | Verify signed payloads, dedupe, route by inbox_id |
| Polling | Fallback and local development | Excess load and repeated reads | Use per-inbox cursors, backoff, and an overall deadline |
| Batch processing | High-throughput suites | Overmatching across many messages | Partition by inbox_id and process idempotently |
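One way to combine the two delivery modes is to race a webhook signal against a poller under a single deadline. The sketch below is an assumption-laden illustration: the in-memory `waiters` registry, `notifyInbox`, and the `pollFallback(inboxId, signal)` contract are all hypothetical names, and a multi-process suite would need shared storage instead of a `Map`.

```javascript
// Hypothetical in-memory registry: inbox_id -> resolver for the waiting test.
const waiters = new Map();

// Call this from the webhook handler once the payload has been verified.
function notifyInbox(inboxId, message) {
  const resolve = waiters.get(inboxId);
  if (!resolve) return false; // no test is waiting on this inbox
  waiters.delete(inboxId);
  resolve(message);
  return true;
}

// Resolves with whichever arrives first: the webhook signal or the poller.
// `pollFallback(inboxId, signal)` is assumed to return a promise for the
// message and to stop polling when `signal` is aborted.
function waitWebhookFirst(inboxId, pollFallback, { deadlineMs = 60000 } = {}) {
  const controller = new AbortController();
  const pending = new Promise(() => {}); // never settles
  let timer;

  const viaWebhook = new Promise((resolve) => waiters.set(inboxId, resolve));
  // A failing poller must not end the wait early; only the deadline does that.
  const viaPolling = Promise.resolve()
    .then(() => pollFallback(inboxId, controller.signal))
    .catch(() => pending);
  const viaDeadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out waiting on inbox ${inboxId}`)),
      deadlineMs
    );
  });

  return Promise.race([viaWebhook, viaPolling, viaDeadline]).finally(() => {
    clearTimeout(timer);
    controller.abort(); // tell the poller to stop
    waiters.delete(inboxId);
  });
}
```

Because both paths can deliver the same email, whatever consumes the result should still dedupe by message identifier.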
With Mailhook, you can receive email through real-time webhooks or a polling API, and signed payloads help verify webhook authenticity before processing. Your handler should acknowledge quickly, store the normalized JSON, and process assertions asynchronously when possible.
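The verify-then-route step can be sketched as follows. The signing scheme here (HMAC-SHA256 over the raw body, hex-encoded) and the `inbox_id` payload field are assumptions for illustration only; the real header name, algorithm, and payload shape come from your provider's documentation. The `crypto.createHmac` and `crypto.timingSafeEqual` calls are standard Node.js APIs.

```javascript
const nodeCrypto = require("node:crypto");

// ASSUMPTION: the provider signs the raw request body with HMAC-SHA256 and
// sends the hex digest alongside it. Check your provider's docs for the real scheme.
function isValidSignature(rawBody, signatureHex, secret) {
  const expected = nodeCrypto.createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(String(signatureHex), "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length &&
    nodeCrypto.timingSafeEqual(received, expected);
}

// Verify first, then parse and route by inbox_id.
// `onMessage` is a hypothetical callback that wakes the waiting test.
function handleWebhook({ rawBody, signature, secret, onMessage }) {
  if (!isValidSignature(rawBody, signature, secret)) {
    return { status: 401 };
  }
  const event = JSON.parse(rawBody);
  onMessage(event.inbox_id, event);
  return { status: 200 };
}
```

Verification runs against the raw body, before JSON parsing, so that an unsigned or tampered payload is rejected without being interpreted.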
Dedupe is still required, even with isolated inboxes
Disposable inboxes remove cross-worker collisions, but they do not remove the need for idempotency. Email infrastructure and webhooks are commonly designed around at-least-once delivery, which means your code should tolerate duplicates.
Separate dedupe into three layers. At the delivery layer, dedupe webhook attempts so the same HTTP delivery is not processed twice. At the message layer, dedupe the same email if it is fetched by both webhook and polling. At the artifact layer, dedupe the same OTP or verification URL so it is consumed once. An attempt-level key adds a fourth guard, so an artifact from an earlier retry is never treated as current.
| Layer | Example key | What it prevents |
|---|---|---|
| Delivery | delivery_id or webhook event id | Reprocessing the same webhook attempt |
| Message | inbox_id plus message_id | Processing the same email from webhook and polling |
| Artifact | inbox_id plus artifact hash | Submitting the same OTP or link twice |
| Attempt | attempt_id plus artifact type | Retry confusion and stale verification |
The goal is not to make duplicates impossible. The goal is to make duplicates harmless.
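A small idempotency helper illustrates the layered keys. This is a sketch under assumptions: the event fields (`delivery_id`, `inbox_id`, `message_id`, `artifact`) are hypothetical names for a normalized payload, and the in-memory `Set` only works inside a single process.

```javascript
const { createHash } = require("node:crypto");

// In-memory idempotency store for one process (a sketch; use shared storage
// if several processes handle webhooks or polling).
const seenKeys = new Set();

// Returns true the first time a key is claimed and false for every duplicate.
function claimOnce(layer, ...parts) {
  const key = [layer, ...parts].join("|");
  if (seenKeys.has(key)) return false;
  seenKeys.add(key);
  return true;
}

// Hash the artifact so the dedupe key never stores the OTP or link itself.
function artifactHash(artifact) {
  return createHash("sha256").update(String(artifact)).digest("hex");
}

// Applies delivery, message, and artifact dedupe in order, then consumes once.
function processDelivery(event, consume) {
  if (!claimOnce("delivery", event.delivery_id)) return "duplicate-delivery";
  if (!claimOnce("message", event.inbox_id, event.message_id)) return "duplicate-message";
  if (!claimOnce("artifact", event.inbox_id, artifactHash(event.artifact))) {
    return "duplicate-artifact";
  }
  consume(event.artifact);
  return "consumed";
}
```

Each layer returns early with a distinct status, which makes it easy to see in logs which kind of duplicate was absorbed.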
Match narrowly, but let isolation do most of the work
In a shared mailbox, teams often build complicated filters to identify the right email. They match subject lines, timestamps, sender addresses, body text, and sometimes brittle HTML selectors. This complexity is a symptom of missing isolation.
With a disposable inbox per attempt, the first and strongest filter is the inbox_id. After that, use narrow intent checks: expected sender, expected flow type, and the presence of a valid artifact. If your application can include a correlation token in the email body, subject, or metadata, use it as an additional guard, but do not depend on it as the only isolation mechanism.
For OTP extraction, prefer text/plain content when available. For magic links, validate the destination host against an allowlist before navigation. An email is untrusted input, even in test automation.
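The extraction and validation steps might be sketched like this. The six-digit OTP format and the `app.example.test` allowlist entry are assumptions; adjust both to what your application actually sends. `URL` is the standard WHATWG URL class available globally in Node.js.

```javascript
// ASSUMPTION: the app under test lives on this host.
const ALLOWED_HOSTS = new Set(["app.example.test"]);

// ASSUMPTION: OTPs are exactly six digits in the text/plain body.
function extractOtp(text) {
  const match = /\b(\d{6})\b/.exec(text ?? "");
  return match ? match[1] : null;
}

// Returns the first HTTPS link whose host is allowlisted, or null.
function extractAllowlistedUrl(text, allowedHosts = ALLOWED_HOSTS) {
  const candidates = (text ?? "").match(/https?:\/\/[^\s<>"')]+/g) ?? [];
  for (const candidate of candidates) {
    // Drop sentence punctuation that trails the link in plain-text emails.
    const trimmed = candidate.replace(/[.,;!?]+$/, "");
    let url;
    try {
      url = new URL(trimmed);
    } catch {
      continue; // not a parseable URL, so skip it
    }
    if (url.protocol === "https:" && allowedHosts.has(url.hostname)) {
      return url.href;
    }
  }
  return null;
}
```

Comparing the parsed `hostname` rather than searching the raw string avoids being fooled by links such as `https://app.example.test.evil.example/`.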
Make failures easy to debug in CI
Isolated inboxes also improve observability. When a parallel run fails, you should be able to answer four questions quickly: which inbox was used, whether the app sent the email, whether the email arrived, and which artifact was extracted.
A useful CI artifact is a redacted JSON record containing the inbox_id, attempt_id, received_at timestamp, sender, subject, message identifiers, and extraction result. Avoid storing full raw HTML or secrets unless your retention and access controls are designed for that data.
| Symptom | Likely cause | What to log |
|---|---|---|
| Timeout waiting for email | App did not send, routing failed, or provider delay | inbox_id, email, attempt_id, deadline, send timestamp |
| Wrong OTP submitted | Reused inbox or stale artifact | attempt_id, message_id, artifact hash, consumed_at |
| Duplicate webhook processing | Missing idempotency key | delivery id, message id, handler status |
| Works locally, fails in CI | Parallelism or fixed sleeps | worker id, retry index, timing between trigger and receive |
| Agent took unsafe action | Raw email exposed to model | minimized view, selected artifact, URL validation result |
These logs turn email failures from vague flakes into actionable defects.
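A redacted record of that kind could be assembled as in the sketch below. The field names read from `message` (`id`, `from`, `subject`, `received_at`) are assumptions about your normalized JSON rather than a documented provider schema.

```javascript
const { createHash: sha } = require("node:crypto");

// Short fingerprint for correlating artifacts across logs. It is not a
// security boundary for low-entropy values such as six-digit OTPs.
function fingerprint(value) {
  return sha("sha256").update(String(value)).digest("hex").slice(0, 12);
}

// Builds a CI-safe record: identifiers and metadata only, no bodies,
// no raw OTPs, no raw links.
function buildRedactedRecord(descriptor, message, artifact) {
  return {
    inbox_id: descriptor.inbox_id,
    attempt_id: descriptor.attempt_id,
    message_id: message.id,
    received_at: message.received_at,
    sender: message.from,
    subject: message.subject,
    extraction: artifact
      ? { found: true, artifact_hash: fingerprint(artifact) }
      : { found: false },
  };
}
```

Writing this record as a CI artifact gives reviewers enough to trace a failure without exposing message content.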
LLM agents need inbox isolation even more than tests do
For LLM agents, shared inboxes are not just flaky. They expand the agent’s input surface. If an agent can read unrelated emails, it may see stale instructions, malicious content, or secrets that were never meant for the current task.
Use disposable inboxes as a tool boundary. A safe agent-facing interface can be small: create an inbox, wait for a message in that inbox, extract a typed artifact, and close the inbox. The agent should receive the minimum useful result, such as an OTP or allowlisted verification URL, not an entire mailbox.
Mailhook’s structured JSON email output fits this model because your orchestration layer can filter and minimize the message before exposing anything to the model. Signed webhooks and polling fallback help keep the ingestion layer reliable without asking the agent to reason about mailbox state.
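As an illustration of that boundary, an agent-facing tool can wrap the provider client and return only typed results. Everything here is hypothetical: `provider` stands in for your own client with `createInbox()`, `waitForMessage(inboxId)`, and `closeInbox(inboxId)` methods, and the six-digit OTP format is an assumption.

```javascript
// Wraps a provider client so the agent only ever sees minimized results.
// `provider` is a hypothetical client, not a specific vendor SDK.
function createAgentEmailTool(provider) {
  return {
    // Expose only the address and the handle, not provider internals.
    async createInbox() {
      const inbox = await provider.createInbox();
      return { inbox_id: inbox.id, email: inbox.email };
    },

    // Return a typed artifact; the raw body never reaches the model.
    async waitForOtp(inboxId) {
      const message = await provider.waitForMessage(inboxId);
      const match = /\b\d{6}\b/.exec(message.text ?? "");
      return match ? { type: "otp", value: match[0] } : { type: "none" };
    },

    async closeInbox(inboxId) {
      await provider.closeInbox(inboxId);
      return { closed: true };
    },
  };
}
```

Because the tool returns `{ type, value }` instead of the message, instructions embedded in an email body cannot reach the model through this path.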
Where Mailhook fits
Mailhook provides programmable, disposable email inboxes through a RESTful API. Received emails can be delivered as structured JSON, pushed through real-time webhooks, or retrieved through polling. The platform also supports instant shared domains, custom domain support, signed payloads for security, and batch email processing.
For parallel test runs, those primitives map directly to the isolation pattern:
- Create a disposable inbox for each test attempt that needs email.
- Use the returned address in the app under test and keep the inbox_id in your test context.
- Receive email as JSON through webhooks, with polling as a fallback.
- Verify signed webhook payloads before processing.
- Dedupe messages and extracted artifacts so retries are safe.
- Use shared domains for quick setup or custom domains when you need allowlisting, governance, or environment separation.
If you are designing an agent or test harness, keep the integration provider-neutral at the interface level, then use Mailhook’s llms.txt as the implementation reference for Mailhook-specific details.
Frequently Asked Questions
Is one disposable inbox per CI worker enough? Sometimes, but it is not the safest default. If a worker runs multiple email-dependent tests or retries a failed test, messages can still collide. One inbox per test attempt gives the strongest isolation.
Should I use disposable inboxes for every test? No. Use reserved non-routable domains for validation-only unit tests that should never receive mail. Use disposable inboxes for end-to-end flows where the application must send and your test must receive a real email.
Are webhooks better than polling for parallel test runs? Webhooks are usually better as the primary path because they reduce latency and avoid many workers polling at once. Polling is still valuable as a fallback when webhook delivery is unavailable or delayed.
How do custom domains help with isolated test runs? Custom domains can help when your application or identity provider requires allowlisted domains, when you need environment-specific routing, or when you want stronger governance. For fast setup, shared domains are often enough.
Can LLM agents safely use disposable inboxes? Yes, if the inbox is isolated per task, webhook payloads are verified, message content is treated as untrusted, and the agent only receives a minimized artifact such as an OTP or validated link.
Make parallel email tests boring
Email-dependent tests should not require a human mailbox, global cleanup scripts, or lucky timing. By creating disposable inboxes for each parallel test attempt, you isolate state, simplify matching, and make CI failures easier to explain.
Start with Mailhook to create disposable inboxes via API and receive emails as JSON, with webhooks, polling, signed payloads, shared domains, and custom domain support available for automation workflows. For implementation details, keep the Mailhook llms.txt reference close to your test harness.