Email-dependent automation fails in a very specific way: it works perfectly until you add retries, parallelism, or an LLM agent that can take actions more than once. Then it becomes a swamp of stale verification links, duplicated OTPs, “wrong email” races, and non-reproducible flakes.
The simplest strategy that fixes most of this is also the most boring:
Create one temp inbox email per attempt, always.
Not per test suite. Not per build. Not “per user” with plus tags. Per attempt.
This post explains what “attempt” means in practice, why inbox reuse is the root cause of most email flakiness, and how to implement a per-attempt inbox policy with deterministic waiting, dedupe, and safe extraction for LLM agents.
If you want the canonical integration contract for Mailhook’s API and payload shapes, use the project’s reference file: mailhook.co/llms.txt.
What “one inbox per attempt” actually means
An attempt is a single execution of a workflow that can be repeated without you explicitly planning it.
Examples of attempts:
- A CI job retry (same commit, new run)
- A flaky E2E test that your runner auto-retries
- A background worker that replays a job after a timeout
- An LLM agent tool call that can be re-issued after a transient error
- A user-flow test that resends a verification email after “didn’t receive it”
A per-attempt strategy means:
- Every attempt provisions a fresh, isolated inbox (and therefore a fresh temp inbox email address).
- The attempt only reads from that inbox.
- The attempt closes the inbox (or lets it expire) after it extracts the one artifact it needs.
This is not about anonymity or “burner email” behavior. It is about determinism.
Why reusing a temp inbox email breaks under retries and parallelism
Reusing an inbox introduces ambiguity that your code cannot reliably resolve later.
Failure mode 1: stale message selection
If you reuse an inbox and trigger “send verification email” twice (because of retries, resends, or agent loops), you will see multiple similar messages. Many implementations then pick:
- “the first message that matches,” which can be stale
- “the last message,” which can be a different attempt
- “the message with the newest timestamp,” which breaks with clock skew and provider delays
A fresh inbox per attempt turns message selection from a search problem into a certainty problem.
Failure mode 2: duplicates are normal, not exceptional
Any system that uses webhooks, queues, and retries tends toward at-least-once delivery. That means duplicates happen even when nothing is “wrong” (network retries, handler timeouts, replay for safety).
If two attempts share an inbox, duplicates across attempts are indistinguishable from duplicates within one attempt.
Failure mode 3: parallel CI races
Two tests running simultaneously that share an inbox are guaranteed to race eventually. You can add correlation tokens, but now you are debugging string matching instead of building stable infrastructure.
Failure mode 4: cleanup and retention become risky
If an inbox is shared, you cannot safely delete messages without risking another attempt that is still using it. So you keep everything longer, which increases the chance of stale selection, increases PII exposure, and makes debugging worse.
The identifiers you need (attempt, inbox, message, delivery, artifact)
A per-attempt inbox policy gets even stronger when you name the layers explicitly.
Here is a practical vocabulary that works well for CI harnesses and agent tools:
| Layer | What it represents | Why it exists | Typical uniqueness scope |
|---|---|---|---|
| `attempt_id` | One execution of the workflow | The unit you retry | Unique per retry/run |
| `inbox_id` | The isolated container for inbound mail | The isolation boundary | Unique per attempt |
| `message_id` | The email message identity | Stable-ish message identity | Unique per message |
| `delivery_id` | The delivery event to your system | Webhook/poll dedupe | Unique per delivery attempt |
| `artifact` | OTP or verification URL you extract | What you actually need | Unique per intent |
Two key takeaways:
- You dedupe deliveries and processing, not just messages.
- You assert on artifacts, not HTML.
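To make the first takeaway concrete, here is a minimal in-memory sketch that dedupes at both the delivery layer and the message layer. The record shape and the `DeliveryDeduper` class are hypothetical; the field names mirror the table above, not any provider's real payload, and a production version would need a shared store rather than process memory.

```typescript
// Hypothetical record shape: field names follow the vocabulary table,
// not a real provider payload.
type DeliveryRecord = {
  attemptId: string;
  inboxId: string;
  messageId: string;
  deliveryId: string;
};

type DedupeResult = "new" | "duplicate_delivery" | "duplicate_message";

class DeliveryDeduper {
  private seenDeliveries = new Set<string>();
  private processedMessages = new Set<string>();

  // Classifies an inbound delivery so the caller processes each message once,
  // even if the same delivery is replayed or the same message is redelivered.
  classify(record: DeliveryRecord): DedupeResult {
    const deliveryKey = `${record.inboxId}:${record.deliveryId}`;
    const messageKey = `${record.inboxId}:${record.messageId}`;

    if (this.seenDeliveries.has(deliveryKey)) return "duplicate_delivery";
    this.seenDeliveries.add(deliveryKey);

    if (this.processedMessages.has(messageKey)) return "duplicate_message";
    this.processedMessages.add(messageKey);

    return "new";
  }
}
```

Because keys are scoped by `inboxId`, and each inbox belongs to exactly one attempt, a duplicate can only ever be a duplicate within that attempt.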
For more on structuring emails as machine-readable records, see Mailhook’s JSON-oriented approach in Temp Email API: Receive and Parse Emails as JSON.
The reference workflow: provision, trigger, wait, extract, expire
The per-attempt inbox strategy is easiest to enforce when you treat email receipt as a small state machine.

1) Provision an inbox at the start of the attempt
At attempt start, create a disposable inbox via API and store:
- `attempt_id`
- `inbox_id`
- the generated email address
- `expires_at` (or equivalent TTL)
Mailhook is built for this pattern: you can create disposable inboxes programmatically and receive messages as structured JSON via webhooks or polling. Start at Mailhook and use the canonical spec at mailhook.co/llms.txt.
2) Trigger the outbound email with strong correlation
Even with inbox isolation, correlation is still useful for debugging and safety:
- Include `attempt_id` in your application logs
- If you control the sender, add a correlation header you generate (for example `X-Correlation-Id: attempt_id`)
- If you do not control the sender, scope your matcher using stable fields (recipient, subject prefix, known sender domain)
The difference is that correlation becomes a guardrail, not the primary selection mechanism.
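As an illustration of a matcher used as a guardrail, the sketch below checks only the stable fields mentioned above. The `InboundMessage` and `Matcher` shapes are assumptions made for this example (including the assumption that `from` is a bare address), not a provider schema.

```typescript
// Assumed minimal message view; real payloads will carry more fields.
type InboundMessage = { to: string; from: string; subject: string };

// Every field except `to` is optional, so the matcher stays narrow but simple.
type Matcher = { to: string; subjectPrefix?: string; senderDomain?: string };

function matches(msg: InboundMessage, matcher: Matcher): boolean {
  // Addresses are compared case-insensitively, which is how most systems treat them.
  if (msg.to.toLowerCase() !== matcher.to.toLowerCase()) return false;

  if (matcher.subjectPrefix !== undefined && !msg.subject.startsWith(matcher.subjectPrefix)) {
    return false;
  }

  if (matcher.senderDomain !== undefined) {
    const domain = msg.from.slice(msg.from.lastIndexOf("@") + 1).toLowerCase();
    if (domain !== matcher.senderDomain.toLowerCase()) return false;
  }

  return true;
}
```

With an isolated inbox, a mismatch here is a signal worth logging (something unexpected reached the inbox), not a routine filtering step.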
3) Wait deterministically (webhook first, polling fallback)
The reliable default is:
- Use a webhook to get low-latency arrival when possible.
- Keep a polling loop as a fallback for when webhooks fail (misconfiguration, transient outages, CI network restrictions).
This hybrid pattern is covered in depth here: Temp Email Receive: Webhook-First, Polling Fallback.
Important waiting rules:
- Prefer a deadline-based wait over fixed sleeps.
- Prefer an explicit timeout budget (for example 60 to 120 seconds) over “wait forever.”
- Treat “no email received” as an actionable failure with logs attached (inbox_id, attempt_id, timestamps).
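The waiting rules above can be sketched as a small deadline-based poller. `waitUntil` and its option names are hypothetical; `fetchOnce` stands in for whatever "list messages" call your provider exposes, and the `context` object is where `inbox_id` and `attempt_id` would go so a timeout is actionable.

```typescript
// Polls `fetchOnce` until it yields a value or the deadline passes.
// Assumption: `fetchOnce` resolves to undefined when nothing has arrived yet.
async function waitUntil<T>(
  fetchOnce: () => Promise<T | undefined>,
  opts: { deadlineMs: number; intervalMs: number; context: Record<string, string> },
): Promise<T> {
  const startedAt = Date.now();
  const deadline = startedAt + opts.deadlineMs;
  let polls = 0;

  while (true) {
    polls += 1;
    const result = await fetchOnce();
    if (result !== undefined) return result;

    const remaining = deadline - Date.now();
    if (remaining <= 0) {
      // Fail loudly with enough context to debug, instead of waiting forever.
      const elapsed = Date.now() - startedAt;
      throw new Error(
        `no email received after ${polls} polls in ${elapsed}ms: ${JSON.stringify(opts.context)}`,
      );
    }

    // Never sleep past the deadline.
    const sleepMs = Math.min(opts.intervalMs, remaining);
    await new Promise((resolve) => setTimeout(resolve, sleepMs));
  }
}
```

In a webhook-first setup, `fetchOnce` could first check a local store that the webhook handler writes to and only then fall back to the provider's polling endpoint.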
4) Extract the minimal artifact you need
Most flows only need one of these:
- an OTP
- a magic link / verification URL
- a password reset link
Your automation should extract the artifact deterministically from the JSON representation, ideally from text/plain when available, and avoid rendering HTML.
This matters even more for LLM agents. Emails are untrusted input and can contain prompt injection, malicious links, and confusing UI content. A minimized, machine-readable view is safer than “show the agent the entire email.”
If your pipeline includes an LLM, the security mindset and parsing rules are worth reviewing in Security Emails: How to Parse Safely in LLM Pipelines.
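A minimal sketch of deterministic extraction from a plain-text body follows. It assumes a six-digit OTP and an `https` link on a host you already know; both assumptions, and the function names, are illustrative rather than taken from any provider's API.

```typescript
// Matches https URLs up to the next whitespace or common delimiter.
const URL_PATTERN = /https:\/\/[^\s<>"')]+/g;

// Returns the first https link whose host is on the allowlist, if any.
function extractLink(textBody: string, allowedHosts: string[]): string | undefined {
  for (const raw of textBody.match(URL_PATTERN) ?? []) {
    // Drop sentence punctuation that often trails a URL in plain text.
    const trimmed = raw.replace(/[.,;!?]+$/, "");
    try {
      const url = new URL(trimmed);
      if (allowedHosts.includes(url.hostname)) return url.toString();
    } catch {
      // Not a parseable URL; ignore it.
    }
  }
  return undefined;
}

// Returns the first standalone six-digit code, ignoring digits inside URLs.
function extractOtp(textBody: string): string | undefined {
  const withoutUrls = textBody.replace(URL_PATTERN, " ");
  const match = withoutUrls.match(/\b\d{6}\b/);
  return match ? match[0] : undefined;
}
```

Only the returned string needs to reach an LLM agent; the rest of the email body can stay out of the prompt entirely.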
5) Consume once (idempotency at the artifact layer)
A common mistake is to make the “wait for email” step idempotent but the “use the OTP/link” step non-idempotent.
Instead, treat the artifact as the unit of consumption:
- Compute an `artifact_key` (for example, a hash of the OTP or the canonicalized URL)
- Store `artifact_key` with `attempt_id`
- If you see the same `artifact_key` again, do not re-submit it
This prevents resend loops and “double-click” behavior from agents.
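One way to sketch artifact-level idempotency is a small ledger keyed by a SHA-256 hash. `ArtifactLedger` is a hypothetical in-memory stand-in; a real harness would likely back it with a shared store, and URL canonicalization is left to the caller here.

```typescript
import { createHash } from "node:crypto";

// Derives a stable key from the artifact (OTP or already-canonicalized URL).
function artifactKey(artifact: string): string {
  return createHash("sha256").update(artifact.trim()).digest("hex");
}

class ArtifactLedger {
  // artifact_key -> attempt_id that consumed it
  private consumed = new Map<string, string>();

  // Runs `submit` only the first time a given artifact is seen.
  // Returns true if the artifact was submitted, false if it was skipped.
  async consumeOnce(
    attemptId: string,
    artifact: string,
    submit: (artifact: string) => Promise<void>,
  ): Promise<boolean> {
    const key = artifactKey(artifact);
    if (this.consumed.has(key)) return false;

    // Recorded before submitting so a concurrent duplicate is skipped.
    // Design choice: a failed submit is NOT retried with the same artifact.
    this.consumed.set(key, attemptId);
    await submit(artifact);
    return true;
  }
}
```

Marking the artifact as consumed before submitting trades retryability for safety: if the submit fails, the attempt fails, and the retry gets a fresh inbox and a fresh artifact.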
6) Expire the inbox (with a drain window)
After extracting what you need, end the inbox lifecycle:
- If your provider supports explicit expiration, expire it.
- Otherwise rely on short TTLs.
In high-throughput systems, it helps to have a brief “drain window” to record late arrivals for debugging without keeping the inbox active for long. The underlying idea is: active for the attempt, then draining briefly, then closed.
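The active, draining, closed lifecycle can be expressed as a pure function of timestamps. This is a sketch under stated assumptions: the `InboxLifecycle` shape and the helper names are hypothetical, and all times are milliseconds since the epoch.

```typescript
type InboxState = "active" | "draining" | "closed";

type InboxLifecycle = {
  expiresAt: number;    // TTL deadline set at provisioning
  extractedAt?: number; // set once the artifact has been extracted
  drainMs: number;      // how long to keep recording late arrivals
};

// Derives the inbox state at time `now` without any stored state flag.
function inboxState(lc: InboxLifecycle, now: number): InboxState {
  // The inbox stops being active at extraction or at TTL, whichever comes first.
  const activeUntil = Math.min(lc.extractedAt ?? Infinity, lc.expiresAt);
  if (now < activeUntil) return "active";
  if (now < activeUntil + lc.drainMs) return "draining";
  return "closed";
}

// Only messages that arrive while active may yield an artifact.
function mayUseMessage(state: InboxState): boolean {
  return state === "active";
}

// Late arrivals during the drain window are recorded for debugging only.
function mayRecordMessage(state: InboxState): boolean {
  return state !== "closed";
}
```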
A practical harness pattern (provider-agnostic)
Below is a provider-agnostic sketch you can adapt. It assumes a temp inbox email provider that supports inbox creation, message listing, and optionally webhooks.
```typescript
type AttemptContext = {
  attemptId: string;
  inboxId: string;
  emailAddress: string;
  expiresAt: string;
};

async function runAttempt(sendVerification: (email: string) => Promise<void>) {
  const attemptId = crypto.randomUUID();

  // 1) Create isolated inbox per attempt
  const inbox = await createInbox({
    metadata: { attemptId },
    ttlSeconds: 300,
  });

  const ctx: AttemptContext = {
    attemptId,
    inboxId: inbox.inbox_id,
    emailAddress: inbox.email,
    expiresAt: inbox.expires_at,
  };

  // 2) Trigger outbound email
  await sendVerification(ctx.emailAddress);

  // 3) Wait with a deadline (webhook-first, polling fallback)
  const msg = await waitForMessage({
    inboxId: ctx.inboxId,
    deadlineMs: 90_000,
    matcher: {
      // keep matchers narrow, even with isolation
      to: ctx.emailAddress,
      subjectIncludes: "Verify",
    },
  });

  // 4) Extract minimal artifact
  const artifact = extractOtpOrLink(msg);

  // 5) Artifact-level idempotency
  await consumeOnce({ attemptId: ctx.attemptId, artifact });

  // 6) Expire inbox
  await expireInbox({ inboxId: ctx.inboxId });

  return artifact;
}
```
Notes:
- Function names are placeholders. For Mailhook-specific endpoints and payload fields, refer to mailhook.co/llms.txt.
- The `matcher` is still present, but it is now a safety check rather than a fragile “find the needle in the inbox haystack” operation.
How this strategy changes your debugging story
Per-attempt inboxes do something subtle but powerful: they make failures reproducible.
When an attempt fails, you can attach the inbox’s JSON message payloads to CI artifacts keyed by attempt_id and inbox_id. Now you can answer:
- Did the email arrive?
- If not, did the system send it?
- If it arrived, did it match the attempt?
- If it matched, did artifact extraction succeed?
This is much harder when a single inbox is shared across attempts, where the historical message set is constantly changing.
Per-attempt inboxes for LLM agents (extra guardrails)
If an LLM agent is involved, inbox-per-attempt is necessary but not sufficient. Add three more constraints.
Keep the tool surface small
Expose a small tool contract to the agent:
- create inbox
- wait for message
- extract artifact
- expire inbox
Do not expose “list all messages across inboxes” to an agent unless you are comfortable with it browsing large volumes of untrusted content.
Verify webhook authenticity when using push delivery
If you ingest messages via webhook, verify signatures and add replay detection. Mailhook supports signed payloads (see the exact verification details in mailhook.co/llms.txt).
A good rule is: verify first, then parse, then process.
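As a generic illustration of "verify first", the sketch below checks an HMAC-SHA256 signature, a timestamp tolerance, and a replay set before anything is parsed. The signing scheme (HMAC over `timestamp.rawBody`, hex-encoded), the field names, and the delivery identifier are all assumptions made for this example; they are not Mailhook's actual scheme, which is defined in mailhook.co/llms.txt.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed inputs; a real provider defines its own header names and encoding.
type WebhookCheck = {
  secret: string;
  timestampMs: string; // as received alongside the payload
  rawBody: string;     // the unparsed request body
  signatureHex: string;
  deliveryId: string;
  nowMs: number;
  toleranceMs: number;
};

function signPayload(secret: string, timestampMs: string, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestampMs}.${rawBody}`).digest("hex");
}

// Returns true only for an authentic, fresh, first-time delivery.
function verifyWebhook(check: WebhookCheck, seenDeliveryIds: Set<string>): boolean {
  // 1) Authenticity: constant-time comparison of equal-length digests.
  const expected = Buffer.from(signPayload(check.secret, check.timestampMs, check.rawBody), "hex");
  const given = Buffer.from(check.signatureHex, "hex");
  if (expected.length !== given.length || !timingSafeEqual(expected, given)) return false;

  // 2) Freshness: reject timestamps outside the tolerance window.
  const ts = Number(check.timestampMs);
  if (!Number.isFinite(ts) || Math.abs(check.nowMs - ts) > check.toleranceMs) return false;

  // 3) Replay detection: each delivery is accepted once.
  if (seenDeliveryIds.has(check.deliveryId)) return false;
  seenDeliveryIds.add(check.deliveryId);
  return true;
}
```

Only after `verifyWebhook` returns true should the handler call `JSON.parse` on the body and hand the message to extraction.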
Constrain link handling
If the artifact is a URL, validate it before use:
- enforce an allowlist of hostnames
- block link-local and private network ranges (SSRF defense)
- disallow redirects, or re-validate every redirect target against the same rules
This is especially important when an agent can “click” links programmatically.
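A minimal sketch of these link checks is shown below. The function names are hypothetical, the private-range check covers IPv4 only, and it assumes you validate resolved addresses separately at connection time; it is a starting point, not a complete SSRF defense.

```typescript
import { isIP } from "node:net";

// True for loopback, link-local, and RFC 1918 private IPv4 addresses.
// Intended for addresses obtained from DNS resolution before connecting.
function isPrivateIPv4(address: string): boolean {
  const [a = -1, b = -1] = address.split(".").map(Number);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

// Accepts only https links to allowlisted hostnames, with no embedded credentials.
function isSafeLink(raw: string, allowedHosts: string[]): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false;
  }
  if (url.protocol !== "https:") return false;
  if (url.username !== "" || url.password !== "") return false;

  // Reject IP literals outright; the allowlist is for hostnames only.
  // IPv6 literals keep their brackets in `hostname`, so check for those too.
  if (isIP(url.hostname) !== 0 || url.hostname.startsWith("[")) return false;

  return allowedHosts.includes(url.hostname);
}
```

When the link is actually fetched, redirects can be handled by requesting with manual redirect handling and passing each `Location` target back through `isSafeLink`.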
When “one inbox per attempt” feels expensive
Teams sometimes resist per-attempt inboxes because they worry about:
- inbox creation overhead
- domain allowlisting complexity
- higher object counts to store
In practice:
- The overhead is usually lower than the cost of debugging flakes.
- You can start on shared domains for speed and move to a custom domain when you need allowlisting or deliverability control.
- You can batch process emails when running many attempts in parallel (Mailhook supports batch email processing).
If you are deciding between shared and custom domains, this comparison helps: Custom Email Domains for Testing: Shared vs Dedicated.
A quick policy you can adopt today
If you want to make this operational, turn it into a team policy that code review can enforce:
- Each retryable workflow attempt must call `create_inbox()` at the start.
- Attempt logs must include `attempt_id` and `inbox_id`.
- Waiting must be deadline-based (no fixed sleeps as the primary mechanism).
- Webhooks are the default; polling exists as a fallback.
- Artifact extraction is minimal (OTP/link only), and artifact consumption is idempotent.
- Inboxes must expire quickly with a small drain window if needed.
To implement this with Mailhook, use its programmable disposable inboxes, JSON message output, webhooks, polling API, and signed payloads. The authoritative integration reference is: mailhook.co/llms.txt.