One Inbox per Attempt Is the Rule That Stops Flakes

Email-related flakes usually look like timing problems. A test waits 20 seconds and still misses the OTP. A signup retry clicks an expired link. A parallel CI job consumes the wrong verification email. An LLM agent sees two matching messages and chooses the stale one.

The fix is often simpler than adding longer sleeps or smarter regexes: create one inbox per attempt.

Not one shared mailbox. Not one address per environment. Not necessarily even one inbox per test file. One inbox for the exact attempt that will trigger, wait for, and consume the email artifact. When that attempt is retried, create a new inbox.

That rule turns email from a messy shared resource into a scoped event stream. It gives your QA harness, CI pipeline, or agent workflow a clean boundary: every message in this inbox belongs to this attempt, and nothing from a previous attempt can be selected by accident.

What “one inbox per attempt” actually means

An attempt is a single end-to-end try at completing an email-dependent operation. It starts before the system sends an email and ends when the artifact from that email is consumed, times out, or the flow is abandoned.

For example, in a signup verification test, the attempt begins when your test provisions the temporary inbox and submits the signup form. It ends when the OTP or magic link has been extracted and used, or when the wait deadline expires.

Workflow	Attempt boundary	New inbox when…
Signup verification	Create account, receive verification email, submit code or link	The signup is retried from the beginning
Password reset	Request reset, receive reset email, use token	The reset request is restarted
Magic-link login	Submit email, receive login link, open link	The login flow is restarted
LLM agent signup task	Agent requests inbox, uses address, waits for artifact	The agent retries the task or starts a new session
Parallel CI test	Test case provisions inbox, triggers mail, asserts result	The test case is retried by the runner

The important detail is that retries are not just repeats of the same logic. They are new attempts with their own state. If a retry reuses the same inbox, it inherits old messages, late deliveries, duplicate webhooks, and consumed artifacts from the previous try.

That inherited state is where flakes come from.

Why shared inboxes create flakes

Email is asynchronous by design. SMTP delivery can be delayed, providers may retry, and webhook delivery is often at-least-once, so consumers can see the same event more than once. The SMTP specification (RFC 5321) describes a store-and-forward delivery model, which is reliable for human mail but awkward for deterministic automation.

A human can look at a mailbox and infer which message is newest, relevant, or already used. A test runner or agent needs explicit rules. Shared inboxes make those rules fragile.

Stale message selection

The classic failure is selecting an old email that still matches the parser. For example, your regex finds a six-digit OTP, but it is from the previous CI retry. The code is valid-looking, but expired or tied to the wrong account.

Adding “select latest” helps only until two messages arrive close together, clocks differ, or the mailbox contains duplicate deliveries. The stronger fix is to remove stale candidates altogether by isolating the attempt.
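To make the difference concrete, here is a minimal sketch that compares "select latest" on a shared mailbox with selection scoped to one attempt's inbox. The message shape (`inbox_id`, `received_at`, `text`) and both function names are assumptions made for illustration, not any provider's API.

```javascript
// Hypothetical message shape: { inbox_id, received_at, text }.
const OTP_PATTERN = /\b(\d{6})\b/;

// Fragile: picks the newest OTP-looking message from whatever it is given.
function latestOtpFromSharedMailbox(messages) {
  const candidates = messages
    .filter((m) => OTP_PATTERN.test(m.text))
    .sort((a, b) => Date.parse(b.received_at) - Date.parse(a.received_at));
  return candidates.length ? candidates[0].text.match(OTP_PATTERN)[1] : null;
}

// Scoped: only messages delivered to this attempt's inbox are candidates.
function otpFromAttemptInbox(messages, inboxId) {
  const scoped = messages.filter((m) => m.inbox_id === inboxId);
  return latestOtpFromSharedMailbox(scoped);
}

// Attempt 1 timed out; its email arrived late, after attempt 2's email.
const mailbox = [
  { inbox_id: "inb_attempt_2", received_at: "2026-05-07T21:10:05Z", text: "Your code is 222222" },
  { inbox_id: "inb_attempt_1", received_at: "2026-05-07T21:10:07Z", text: "Your code is 111111" },
];

console.log(latestOtpFromSharedMailbox(mailbox)); // 111111 (stale code from attempt 1)
console.log(otpFromAttemptInbox(mailbox, "inb_attempt_2")); // 222222
```

The selection logic is identical in both calls. Only the candidate set changes, which is the point of isolating the attempt.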

Parallel races

When two CI workers share one mailbox, both see the same message list. Even if each worker uses a different email address, mailbox-level retrieval can still race unless every read filters strictly by recipient.

Per-attempt inboxes make the query space smaller. Worker A can only read inbox A. Worker B can only read inbox B. The harness no longer has to prove that its matcher is perfect across unrelated traffic.

Duplicate deliveries and retries

Webhook providers commonly retry deliveries if your endpoint times out or returns an error. Polling loops can also process the same message twice if cursors or seen IDs are mishandled.

With a shared inbox, duplicate handling must separate duplicates from many different attempts. With one inbox per attempt, idempotency is still required, but the scope is clear: dedupe within this attempt, not across an unbounded mailbox.

Late arrivals

A failed attempt may time out at 30 seconds, while the email arrives at 35 seconds. If the retry reuses the same inbox, that late email can be mistaken for the retry’s email.

This is one of the most common causes of “passed locally, failed in CI” behavior. CI runners add load, network variance, and parallelism. A late message from attempt 1 becomes a false positive or false negative for attempt 2.

Agent confusion

LLM agents are especially sensitive to ambiguous inputs. If an agent sees multiple emails with similar content, it may choose the wrong one or follow instructions embedded in email content. For agent workflows, the safest pattern is a narrow tool surface: create an inbox, wait for a specific artifact, return only the minimal extracted result.

Per-attempt inboxes reduce ambiguity before the model sees anything.

Figure: an automation pipeline in which each retry attempt has its own disposable inbox, so stale emails, duplicate messages, and parallel CI races cannot cross attempt boundaries.

The deterministic contract: attempt, inbox, message, artifact

The rule becomes powerful when you model it explicitly. Instead of passing around a bare email address string, create an attempt descriptor that includes the inbox identity and lifecycle.

A useful provider-agnostic descriptor looks like this:

{
  "attempt_id": "signup-8427-attempt-2",
  "inbox_id": "inb_abc123",
  "email": "[email protected]",
  "created_at": "2026-05-07T21:10:00Z",
  "expires_at": "2026-05-07T21:20:00Z",
  "purpose": "signup_verification"
}

The email address is what you give to the application under test. The inbox_id is what your automation uses to read messages. The attempt_id is what you log, attach to CI artifacts, and use for idempotency.

Then each received email is processed into a message and, ideally, a smaller artifact:

Layer	Example ID	Purpose
Attempt	`attempt_id`	Scopes the whole retryable operation
Inbox	`inbox_id`	Isolates inbound messages for that attempt
Delivery	`delivery_id`	Dedupes webhook or polling delivery events
Message	`message_id`	Dedupes the normalized email itself
Artifact	`artifact_hash` or token ID	Ensures the OTP or magic link is consumed once

This model gives you a clean rule for retries: a new attempt gets a new inbox. It also gives you a clean rule for processing: artifacts are consumed once within the attempt that created them.
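One way to apply the layers in the table is a small per-attempt ledger that remembers which delivery, message, and artifact IDs it has already seen. The sketch below is in-memory only, and the class name, layer names, and event shape are hypothetical. A real harness would probably persist this state somewhere its workers can share.

```javascript
// In-memory idempotency ledger for a single attempt.
class AttemptLedger {
  constructor(attemptId) {
    this.attemptId = attemptId;
    this.seen = { delivery: new Set(), message: new Set(), artifact: new Set() };
  }

  // Returns true the first time an ID is seen at a layer, false afterwards.
  markOnce(layer, id) {
    const ids = this.seen[layer];
    if (!ids) throw new Error(`unknown layer: ${layer}`);
    if (ids.has(id)) return false;
    ids.add(id);
    return true;
  }
}

// Classify one inbound event: { delivery_id, message_id, artifact_hash }.
function classifyEvent(ledger, event) {
  if (!ledger.markOnce("delivery", event.delivery_id)) return "duplicate_delivery";
  if (!ledger.markOnce("message", event.message_id)) return "duplicate_message";
  if (!ledger.markOnce("artifact", event.artifact_hash)) return "duplicate_artifact";
  return "process";
}

const ledger = new AttemptLedger("signup-8427-attempt-2");
const event = { delivery_id: "dlv_1", message_id: "msg_1", artifact_hash: "a1" };

console.log(classifyEvent(ledger, event)); // process
console.log(classifyEvent(ledger, event)); // duplicate_delivery
console.log(classifyEvent(ledger, { ...event, delivery_id: "dlv_2" })); // duplicate_message
```

Because the ledger belongs to one attempt, its sets stay small and can be discarded when the inbox is closed.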

Reference workflow for retry-safe email automation

A retry-safe flow has five phases. The exact API calls depend on your provider, but the semantics should stay the same.

1. Create the inbox before triggering the email

Provision the disposable inbox first. Do not submit the signup form and then try to create or discover an address after the fact.

This ordering matters because the inbox descriptor is part of the test input. Your application should send to an address that is already routable and observable.

2. Trigger exactly one side effect for the attempt

Use the generated email address in the operation under test. For example, submit the signup form, request a password reset, or ask the third-party integration to send a verification code.

Log the attempt ID, inbox ID, test name, CI run ID, and any application-level correlation ID. Avoid logging full message bodies unless you have a clear retention and privacy policy.

3. Wait with a deadline, not a fixed sleep

Fixed sleeps are a flake multiplier. If you sleep for 5 seconds, the test fails when delivery takes 6 seconds and wastes time when delivery takes 500 milliseconds.

Use a bounded wait. Prefer webhooks for low-latency delivery, with polling as a fallback when the webhook path is unavailable or the test runner cannot receive inbound HTTP. Mailhook supports both real-time webhook notifications and polling API access, so you can use the pattern that fits your environment.

For a deeper treatment of this receive pattern, see Mailhook’s guide to webhook-first delivery with polling fallback.
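As a rough illustration of a bounded wait, the polling sketch below checks one inbox until a match appears or the deadline passes. `listMessages` is a hypothetical function you would back with your provider's client, and `matches` is a plain predicate. The parameter names are assumptions, not a real API.

```javascript
// Poll until a matching message appears or the deadline passes.
async function waitForMessage({ listMessages, inbox_id, matches, deadline_ms, interval_ms = 500 }) {
  const deadline = Date.now() + deadline_ms;
  for (;;) {
    const messages = await listMessages(inbox_id);
    const found = messages.find(matches);
    if (found) return found;

    const remaining = deadline - Date.now();
    if (remaining <= 0) {
      throw new Error(`no matching message in inbox ${inbox_id} within ${deadline_ms} ms`);
    }
    // Never sleep past the deadline.
    await new Promise((resolve) => setTimeout(resolve, Math.min(interval_ms, remaining)));
  }
}

// Usage with a fake inbox whose email "arrives" after about 100 ms.
const arrivesAt = Date.now() + 100;
const fakeListMessages = async () =>
  Date.now() >= arrivesAt ? [{ subject: "Verify your email", text: "Code 123456" }] : [];

waitForMessage({
  listMessages: fakeListMessages,
  inbox_id: "inb_abc123",
  matches: (m) => m.subject.includes("Verify"),
  deadline_ms: 2000,
  interval_ms: 25,
}).then((m) => console.log(m.subject));
```

A fast delivery returns as soon as the message is visible, and a slow one fails with an error that names the inbox instead of silently reading an empty mailbox.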

4. Extract the minimal artifact

The automation usually does not need the whole email. It needs a typed artifact: an OTP, a magic link, a verification URL, or a confirmation token.

Prefer structured JSON and text content over scraping rendered HTML. Validate links before opening them. For LLM agents, return only the artifact and trusted metadata, not arbitrary email HTML.
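A deterministic extractor might look something like the sketch below. It assumes the message exposes a plain-text `text` field, that OTPs are six digits, and that links may only point at an allowlisted host. The `expectedType` parameter and the `app.example.com` host are illustrative assumptions.

```javascript
// Hosts the harness is willing to open; an assumption for this sketch.
const ALLOWED_LINK_HOSTS = new Set(["app.example.com"]);
const URL_PATTERN = /https:\/\/[^\s<>"')]+/g;

// Extract a typed artifact from the plain-text part of a message.
function extractVerificationArtifact(message, expectedType) {
  const text = message.text || "";
  const urls = text.match(URL_PATTERN) || [];

  if (expectedType === "otp") {
    // Remove URLs first so digits inside a link are not mistaken for a code.
    const withoutUrls = urls.reduce((acc, u) => acc.split(u).join(" "), text);
    const otp = withoutUrls.match(/\b\d{6}\b/);
    if (otp) return { type: "otp", code: otp[0] };
  }

  if (expectedType === "magic_link") {
    for (const candidate of urls) {
      let url;
      try {
        url = new URL(candidate);
      } catch {
        continue; // not a parseable URL, skip it
      }
      if (ALLOWED_LINK_HOSTS.has(url.hostname)) {
        return { type: "magic_link", url: url.href };
      }
    }
  }

  throw new Error(`no ${expectedType} artifact found`);
}

console.log(extractVerificationArtifact({ text: "Your code is 482913." }, "otp"));
// { type: 'otp', code: '482913' }
```

Asking for an explicit type keeps the extractor from guessing, and a link to an unexpected host fails loudly instead of being opened.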

5. Expire or close the inbox after the attempt

When the attempt completes, the inbox should leave the active path. Depending on your lifecycle policy, you may keep a short drain window for late-arriving messages and debugging, then close or expire the inbox.

This keeps later attempts from reading old traffic and limits the retention of potentially sensitive verification data.
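The lifecycle can be expressed as a tiny state function. In this sketch, `ended_at` is a hypothetical field the harness sets when the attempt completes or times out, and the 60-second drain window is an arbitrary policy choice.

```javascript
// Classify an attempt's inbox as active, draining, or closed at a given time.
function inboxPhase(attempt, now = new Date(), drainMs = 60_000) {
  if (!attempt.ended_at) return "active"; // the attempt is still running
  const endedAt = Date.parse(attempt.ended_at);
  if (Number.isNaN(endedAt)) throw new Error("ended_at is not a valid timestamp");
  // During the drain window, late mail is recorded for debugging only.
  return now.getTime() < endedAt + drainMs ? "draining" : "closed";
}

// Only an active inbox may satisfy a wait or yield an artifact.
function mayConsume(attempt, now = new Date()) {
  return inboxPhase(attempt, now) === "active";
}

const finished = { ended_at: "2026-05-07T21:11:00Z" };
console.log(inboxPhase({ ended_at: null }, new Date("2026-05-07T21:10:30Z"))); // active
console.log(inboxPhase(finished, new Date("2026-05-07T21:11:30Z"))); // draining
console.log(inboxPhase(finished, new Date("2026-05-07T21:12:00Z"))); // closed
```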

Pseudocode: one inbox per attempt

The following sketch is intentionally provider-agnostic. It shows the control flow you want your test harness or agent tool to enforce.

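// Provider-agnostic sketch: emailProvider, app, and the helper functions
// (createAttemptId, waitForMessage, extractVerificationArtifact,
// consumeOnce, hashArtifact) stand in for your own harness code.
async function runEmailVerificationAttempt(input) {
  const attemptId = createAttemptId(input.testName, input.retryIndex);

  // 1. Create the inbox before triggering any email.
  const inbox = await emailProvider.createInbox({
    purpose: "signup_verification",
    attempt_id: attemptId,
    ttl_seconds: 600
  });

  try {
    // 2. Trigger exactly one side effect for this attempt.
    await app.signup({
      email: inbox.email,
      username: input.username
    });

    // 3. Wait with a deadline, reading only this attempt's inbox.
    const message = await waitForMessage({
      inbox_id: inbox.id,
      deadline_ms: 60_000,
      matcher: {
        expected_sender: "[email protected]",
        subject_contains: "Verify",
        purpose: "signup_verification"
      }
    });

    // 4. Extract the minimal artifact and consume it at most once.
    const artifact = extractVerificationArtifact(message);

    await consumeOnce({
      attempt_id: attemptId,
      artifact_hash: hashArtifact(artifact),
      action: () => app.submitVerification(artifact)
    });

    return { attemptId, inboxId: inbox.id, status: "verified" };
  } finally {
    // 5. Expire the inbox even when the attempt fails or times out.
    await emailProvider.expireInbox(inbox.id);
  }
}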

The key is not the syntax. The key is that the inbox is created inside the attempt, used only by that attempt, and expired after the attempt.
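The `consumeOnce` and `hashArtifact` helpers used in the sketch are left undefined there. One possible in-memory version, assuming artifacts shaped like `{ type, code }` or `{ type, url }`, could look like this. Releasing the claim when the action fails is a policy choice, not a requirement.

```javascript
const { createHash } = require("node:crypto");

// Stable fingerprint for an artifact such as { type: "otp", code: "123456" }.
function hashArtifact(artifact) {
  const canonical = JSON.stringify([artifact.type, artifact.code ?? artifact.url ?? ""]);
  return createHash("sha256").update(canonical).digest("hex");
}

// In-memory record of consumed artifacts, keyed by attempt and hash.
const consumed = new Set();

// Run `action` at most once per (attempt_id, artifact_hash) pair.
async function consumeOnce({ attempt_id, artifact_hash, action }) {
  const key = `${attempt_id}:${artifact_hash}`;
  if (consumed.has(key)) return { consumed: false, reason: "already_consumed" };
  consumed.add(key); // claim before running so a concurrent duplicate is skipped
  try {
    await action();
    return { consumed: true };
  } catch (err) {
    consumed.delete(key); // release the claim so this attempt can try again
    throw err;
  }
}
```

Hashing the artifact instead of storing it keeps OTPs and tokens out of the dedupe store and out of logs.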

What about resend buttons?

Resend flows need an explicit policy. A resend can either be part of the same attempt or start a new attempt. Both can work, but ambiguity causes flakes.

Resend policy	When to use it	Selection rule
Same attempt, same inbox	Testing the app’s resend button inside one user session	Accept only the newest matching artifact, dedupe old artifacts
New attempt, new inbox	Retrying the whole test or agent task	Ignore old inbox completely
Same inbox with resend counter	Testing rate limits or resend budgets	Match by attempt plus resend index or timestamp window

For most CI retries, use a new attempt and a new inbox. For a test specifically about the resend feature, keep the inbox but make the resend index part of your matcher and logs.
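For the same-inbox resend case, a timestamp-window matcher is one simple option. The sketch below assumes the harness records when it clicked resend and that messages carry a `received_at` timestamp. Both are hypothetical, and the comparison is only as good as the agreement between the two clocks.

```javascript
// Within one attempt inbox, accept only messages received at or after the
// most recent resend trigger, and prefer the newest of those.
function selectAfterResend(messages, resendTriggeredAt) {
  const cutoff = Date.parse(resendTriggeredAt);
  const candidates = messages
    .filter((m) => Date.parse(m.received_at) >= cutoff)
    .sort((a, b) => Date.parse(b.received_at) - Date.parse(a.received_at));
  return candidates[0] ?? null;
}

const inboxMessages = [
  { message_id: "msg_1", received_at: "2026-05-07T21:10:05Z", text: "Your code is 111111" },
  { message_id: "msg_2", received_at: "2026-05-07T21:10:40Z", text: "Your code is 222222" },
];

const picked = selectAfterResend(inboxMessages, "2026-05-07T21:10:30Z");
console.log(picked.message_id); // msg_2
```

A delayed copy of the first email can still land inside the window, so if the application exposes a resend counter or token ID, matching on that is the stronger rule.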

Make the rule enforceable

Teams often agree with “one inbox per attempt” but still violate it under deadline pressure. The solution is to move the rule into shared tooling.

A good harness makes it difficult to do the wrong thing:

The test helper returns an inbox descriptor, not just an email string.
The wait function requires an inbox_id and deadline.
The retry wrapper creates a new inbox automatically.
The parser returns a typed artifact instead of raw HTML.
The cleanup step runs even when the test fails.

This is especially important for LLM agents. Do not ask the model to decide whether to reuse an inbox. Give the agent a deterministic tool contract: create_inbox, wait_for_message, extract_artifact, and expire_inbox. Keep the policy in code.
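The retry-wrapper rule from the list above can be sketched as follows. `provider.createInbox` and `provider.expireInbox` are hypothetical method names, and the attempt ID format is only an example.

```javascript
// Run `attemptFn` up to `maxAttempts` times. Every try provisions its own
// inbox and expires it afterwards, even when the try throws.
async function withFreshInboxPerAttempt(provider, testName, maxAttempts, attemptFn) {
  if (maxAttempts < 1) throw new RangeError("maxAttempts must be at least 1");
  let lastError;
  for (let retryIndex = 0; retryIndex < maxAttempts; retryIndex += 1) {
    const attemptId = `${testName}-attempt-${retryIndex + 1}`;
    const inbox = await provider.createInbox({ attempt_id: attemptId });
    try {
      return await attemptFn({ attemptId, inbox });
    } catch (err) {
      lastError = err; // remember the failure and move on to a new inbox
    } finally {
      await provider.expireInbox(inbox.id);
    }
  }
  throw lastError;
}
```

Test code never sees a way to reuse an inbox: the only address it receives is the one created for the current try.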

Mailhook is designed around these primitives: disposable inbox creation via API, structured JSON email output, webhook notifications, polling access, signed payloads, shared domains, custom domain support, and batch email processing. For exact integration details, use the canonical Mailhook reference at https://mailhook.co/llms.txt.

Anti-patterns that keep email tests flaky

If your email workflow still flakes after adding retries, look for these patterns.

Anti-pattern	Why it flakes	Better pattern
One shared QA mailbox	Old and parallel messages mix together	Disposable inbox per attempt
Fixed sleep before checking mail	Too short under load, too slow when delivery is fast	Deadline-based wait with webhook or polling
Selecting by subject only	Many messages can share the same subject	Match by inbox, sender, purpose, and artifact
Reusing inbox on CI retry	Late mail from failed attempt contaminates retry	New retry means new inbox
Returning raw HTML to an agent	Prompt injection and brittle parsing	Return minimal extracted artifact
No webhook verification	Spoofed or replayed events can enter the pipeline	Verify signed payloads and dedupe delivery IDs

None of these require a complex architecture to fix. They require a stronger boundary around each attempt.

Observability: the logs that make flakes debuggable

Per-attempt inboxes reduce flakes, but good observability helps you prove why a failure happened.

Log stable identifiers, not just free-form text. At minimum, capture the attempt ID, inbox ID, email address, CI run ID, message ID, delivery ID, artifact hash, wait deadline, and final status.

When a test times out, attach the normalized JSON message list for that inbox as a CI artifact if your privacy policy allows it. Because the inbox is scoped to one attempt, the artifact is easier to review and less likely to contain unrelated messages.

Useful counters include:

Time from trigger to first matching message
Number of messages received in the attempt inbox
Number of duplicate deliveries ignored
Number of artifacts extracted
Timeout rate by test name and domain strategy
Late arrivals after the attempt deadline

If you see late arrivals regularly, adjust the application’s email sending path or the test deadline. Do not solve it by reusing the same inbox for retries.
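Several of these counters can be derived from a flat list of harness events. The event shape below (`kind` plus an `at` time in milliseconds, listed in chronological order) is an assumption made for this sketch.

```javascript
// Summarize one attempt from its recorded events.
// kind is one of "trigger", "message", "match", "duplicate".
function summarizeAttempt(events, deadlineAt) {
  const trigger = events.find((e) => e.kind === "trigger");
  const firstMatch = events.find((e) => e.kind === "match");
  const received = events.filter((e) => e.kind === "message");
  return {
    ms_to_first_match: trigger && firstMatch ? firstMatch.at - trigger.at : null,
    messages_received: received.length,
    duplicates_ignored: events.filter((e) => e.kind === "duplicate").length,
    late_arrivals: received.filter((e) => e.at > deadlineAt).length,
  };
}

const events = [
  { kind: "trigger", at: 0 },
  { kind: "message", at: 1200 },
  { kind: "match", at: 1200 },
  { kind: "duplicate", at: 1500 },
  { kind: "message", at: 31000 }, // arrived after the 30 s deadline
];

const summary = summarizeAttempt(events, 30000);
console.log(summary.ms_to_first_match, summary.late_arrivals); // 1200 1
```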

Security considerations for LLM agents

Email is untrusted input. That is true even when the email is sent by your own application, because templates can include user-controlled fields, forwarded content, links, or unexpected MIME structures.

For LLM-driven workflows, combine per-attempt inboxes with strict safety boundaries. Verify webhook signatures before processing. Deduplicate delivery events. Normalize email into structured JSON. Extract the OTP or verification link with deterministic code. Validate URLs against an allowlist before opening them. Give the model the smallest possible view, such as { type: "otp", code: "123456" }, instead of the full body.

This design reduces both reliability issues and prompt-injection risk. The model no longer has to read a mailbox, choose a message, parse HTML, and decide whether a link is safe. Those are deterministic software responsibilities.
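Signature verification is usually a few lines of deterministic code. The sketch below assumes an HMAC-SHA256 signature over the raw request body, hex-encoded, with a shared secret. That scheme is an illustration only. The actual header name and signing format must come from your provider's documentation.

```javascript
const { createHmac, timingSafeEqual } = require("node:crypto");

// Verify an HMAC-SHA256 signature over the raw request body.
function verifySignature(rawBody, signatureHex, secret) {
  if (typeof signatureHex !== "string") return false;
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

const secret = "test-secret";
const body = JSON.stringify({ delivery_id: "dlv_1", inbox_id: "inb_abc123" });
const goodSignature = createHmac("sha256", secret).update(body).digest("hex");

console.log(verifySignature(body, goodSignature, secret)); // true
console.log(verifySignature(body + " ", goodSignature, secret)); // false
```

Verify against the raw bytes before parsing JSON, and reject the event before any of its content reaches the extractor or the model.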

How to migrate from shared mailboxes without a rewrite

You do not need to rebuild your entire test suite at once. Start by wrapping email access behind one helper.

First, create an EmailAttempt helper that provisions an inbox and returns the address plus inbox ID. Replace hardcoded test addresses with that helper. Next, update your wait logic to read only from the attempt inbox. Then change retry behavior so each retry calls the helper again.

After that, move parsing into a shared extractor that emits typed artifacts. Finally, add cleanup and retention rules so completed attempts do not stay active indefinitely.
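A first version of the EmailAttempt helper can be quite small. In this sketch, `provider` is any object with `createInbox`, `listMessages`, and `expireInbox` methods. Those names are hypothetical, and `expires_at` is computed locally here, whereas a real provider would normally report it.

```javascript
// Thin helper that hides inbox provisioning behind one call.
class EmailAttempt {
  static async start(provider, { testName, retryIndex, purpose, ttlSeconds = 600 }) {
    const attemptId = `${testName}-attempt-${retryIndex + 1}`;
    const inbox = await provider.createInbox({ attempt_id: attemptId, ttl_seconds: ttlSeconds });
    const createdAt = new Date();
    return new EmailAttempt(provider, {
      attempt_id: attemptId,
      inbox_id: inbox.id,
      email: inbox.email,
      created_at: createdAt.toISOString(),
      expires_at: new Date(createdAt.getTime() + ttlSeconds * 1000).toISOString(),
      purpose,
    });
  }

  constructor(provider, descriptor) {
    this.provider = provider;
    this.descriptor = Object.freeze(descriptor);
  }

  // Reads are always scoped to this attempt's inbox.
  messages() {
    return this.provider.listMessages(this.descriptor.inbox_id);
  }

  // Call from a finally block or test teardown so cleanup always runs.
  close() {
    return this.provider.expireInbox(this.descriptor.inbox_id);
  }
}
```

Replacing a hardcoded address then becomes a change from a string constant to `(await EmailAttempt.start(...)).descriptor.email`, and a retry simply calls `start` again with the next `retryIndex`.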

If you are already using Mailhook, this maps naturally to its programmable temporary inbox model: create disposable inboxes via API, receive emails as structured JSON, consume via webhooks or polling, verify signed payloads, and choose shared or custom domains depending on your test environment.

For related implementation details, see Mailhook’s articles on email testing in parallel CI and managing inbox lifecycle with TTLs and drain windows.

Frequently Asked Questions

Is one inbox per test run enough?

Sometimes, but one inbox per attempt is safer. A test run can contain retries, resends, and multiple email-dependent steps. If a retry can happen, the retry should get a fresh inbox unless you are explicitly testing resend behavior inside the same attempt.

Does this replace message matching and deduplication?

No. Inbox isolation reduces the candidate set, but you should still match by sender, purpose, subject, and artifact type. You should also dedupe webhook deliveries, messages, and consumed artifacts.

Is this too expensive for large CI suites?

Usually the reliability trade-off is worth it because flakes waste developer time and CI minutes. If volume is high, use batch creation and clear lifecycle policies. The key is to make inbox creation cheap and automated, not manual.

Should LLM agents ever see the full email?

Prefer not to. Agents should receive a minimized, typed artifact whenever possible. If raw content is needed for debugging, keep it outside the model-visible path and treat it as untrusted input.

Can I use custom domains with this pattern?

Yes. The rule is about inbox isolation, not a specific domain strategy. Shared domains are useful for quick setup, while custom domains help with allowlisting, environment separation, and governance. Mailhook supports both shared and custom domains.

Stop treating email as a shared mailbox

Flaky email automation is rarely fixed by one more sleep, one more regex, or one more retry. It is fixed by scoping state correctly.

One inbox per attempt gives every signup, login, password reset, QA test, and agent workflow a clean boundary. The inbox is created for the attempt, receives only that attempt’s messages, produces structured data, and expires when the attempt is done.

Mailhook provides the primitives to implement that rule: programmable disposable inboxes via API, structured JSON emails, real-time webhooks, polling fallback, signed payloads, batch processing, and shared or custom domain options. Start with the integration contract at https://mailhook.co/llms.txt, or visit Mailhook to create disposable inboxes for your next automation flow.