Polling an email API looks simple: request messages, sleep, repeat. Duplicates are where it becomes an engineering problem.
A test runner may process the same verification code twice. An AI agent may click the same magic link after a retry. A background worker may re-read the same page because a cursor was not saved before a crash. The result is flaky CI, confusing logs, and actions that are hard to prove safe.
The goal is not to pretend duplicates will never happen. Email delivery, HTTP retries, webhooks, and polling loops are all much easier to make reliable when you assume duplicate reads are normal. The best polling design makes every message, artifact, and downstream action idempotent.
What polling without duplicates really means
In production, duplicate-free usually means duplicate-safe. Your email API or polling client may legitimately return the same message more than once, especially during retries, pagination, fallback recovery, or worker restarts.
A robust system guarantees three outcomes:
- Each logical email message is stored once.
- Each extracted artifact, such as an OTP or verification link, is consumed once.
- Each downstream action, such as completing signup or resuming an agent workflow, runs once per intended attempt.
This distinction matters for QA automation and LLM agents. A human can often notice that two messages are the same. A program needs stable identifiers, durable state, and explicit idempotency rules.
Why duplicate emails happen when polling APIs
Duplicate processing usually comes from the consumer, not from the email itself. The same email can be fetched multiple times for good reasons, and your code needs to treat that as expected behavior.
| Duplicate source | Common symptom | Best prevention |
|---|---|---|
| Overlapping pollers | Two workers process the same OTP | Use a lease or lock per inbox |
| Cursor replay after crash | The same page is fetched again | Insert idempotently before advancing the cursor |
| Timestamp-based windows | Boundary messages appear twice or are missed | Prefer provider cursors when available |
| Sender retry or resend | Two emails contain the same code | Dedupe at the artifact layer |
| Webhook plus polling fallback | Push and pull both deliver the same message | Share one dedupe store across both paths |
| Header reuse or missing headers | Message-ID is not unique enough | Use provider IDs scoped to inbox, with a hash fallback |

Start with an isolated inbox per attempt
The easiest duplicate to fix is the one you never create. Polling a shared mailbox forces your code to distinguish old messages, unrelated messages, retried messages, and parallel test messages. That is where most flaky email automation begins.
For test suites, verification flows, and agent workflows, create or allocate a dedicated inbox for each attempt. Treat the inbox as a resource with its own identifier, not just a string email address.
| Field | Purpose |
|---|---|
| inbox_id | Stable handle used for polling and correlation |
| email | Address passed to the system under test |
| attempt_id | Test run, signup attempt, or agent task identifier |
| created_at | Start of the valid matching window |
| active_until | Deadline for receiving expected messages |
| cursor | Last durable polling position, if supported |
| state | Active, draining, closed, or expired |
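As a rough illustration, the inbox record above could be modeled as a small typed structure. This is a sketch, not a Mailhook schema: the class names, the `accepts` helper, the default ten-minute lifetime, and the choice to let a draining inbox still accept messages are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class InboxState(Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    CLOSED = "closed"
    EXPIRED = "expired"


@dataclass
class InboxRecord:
    inbox_id: str                    # stable handle for polling and correlation
    email: str                       # address given to the system under test
    attempt_id: str                  # test run, signup attempt, or agent task
    created_at: datetime             # start of the valid matching window
    active_until: datetime           # deadline for expected messages
    cursor: Optional[str] = None     # last durable polling position, if any
    state: InboxState = InboxState.ACTIVE

    def accepts(self, received_at: datetime) -> bool:
        """True if a message received at this time belongs to this attempt.

        Assumption: a draining inbox still accepts messages inside its window.
        """
        open_states = (InboxState.ACTIVE, InboxState.DRAINING)
        in_window = self.created_at <= received_at <= self.active_until
        return self.state in open_states and in_window


def new_inbox_record(inbox_id, email, attempt_id, ttl=timedelta(minutes=10)):
    """Build a record whose matching window starts now (hypothetical helper)."""
    now = datetime.now(timezone.utc)
    return InboxRecord(inbox_id, email, attempt_id, now, now + ttl)
```

Keeping the window on the record means the poller can reject stale or late messages without consulting anything else.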
Mailhook is built around this inbox-first model: you can create disposable inboxes via API, receive emails as structured JSON, and consume them through webhooks or a polling API. For the exact integration contract, use the canonical Mailhook llms.txt reference.
Prefer cursors over timestamp polling
A polling loop needs a way to remember what it has already inspected. The safest option is an opaque server cursor when the API provides one. A cursor usually represents a position in the provider’s ordered message stream and avoids many edge cases around clock skew, equal timestamps, and inclusive time boundaries.
If you are working with a polling API, follow these cursor rules:
- Store the cursor durably, not only in process memory.
- Scope the cursor to one inbox and one query shape.
- Advance the cursor only after processing the full page idempotently.
- Treat replayed cursors as normal, not as errors.
- Keep a dedupe store even if the cursor seems reliable.
If a provider only supports a "since" timestamp filter, use a small overlap window and dedupe aggressively. For example, poll from slightly before the last observed timestamp, then ignore messages already stored. This is safer than using a strict timestamp boundary, which can miss messages created at the same instant as the last one you saw.
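The overlap-window idea can be sketched in a few lines. Everything here is illustrative: `fetch_since` stands in for whatever timestamp-filtered call your provider offers, the message dicts with `id` and `received_at` keys are an assumed shape, and the five-second overlap is an arbitrary starting value.

```python
from datetime import timedelta

OVERLAP = timedelta(seconds=5)  # assumed value; tune to the provider's timestamp resolution


def poll_since(fetch_since, seen_ids, last_seen):
    """Fetch from slightly before last_seen and drop messages already stored.

    fetch_since(ts) is a hypothetical callable that returns every message
    received at or after ts as a dict with "id" and "received_at" keys.
    Returns the new messages and the updated watermark.
    """
    fresh = []
    newest = last_seen
    for msg in fetch_since(last_seen - OVERLAP):
        if msg["id"] in seen_ids:
            continue  # replayed by the overlap window: expected, not an error
        seen_ids.add(msg["id"])
        fresh.append(msg)
        newest = max(newest, msg["received_at"])
    return fresh, newest
```

In a real system `seen_ids` would be the durable message store rather than an in-memory set, so the overlap stays safe across restarts.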
Do not dedupe only by Message-ID
The Message-ID header is useful, but it is sender-generated. As defined in RFC 5322, it is part of the email message format, not a guarantee from your inbox provider that the message was delivered exactly once to your automation.
Use provider-attested identifiers when available, scoped to the inbox that received the message. If you need a fallback, compute a normalized hash from stable fields, such as inbox ID, envelope recipient, sender, subject, normalized body text, and received timestamp bucket. Do not build your only dedupe rule around the rendered HTML body or a sender-controlled header.
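One possible shape for the fallback hash is shown below. It is a sketch under assumptions: the function name, the 60-second bucket, and the exact normalization steps are choices made for illustration, and a provider-issued message ID should still take priority whenever one exists.

```python
import hashlib
from datetime import datetime


def _normalize_address(address: str) -> str:
    # Lowercase only the domain; local parts can be case-sensitive.
    local, _, domain = address.strip().rpartition("@")
    return f"{local}@{domain.lower()}"


def normalized_message_hash(inbox_id: str, recipient: str, sender: str,
                            subject: str, body_text: str,
                            received_at: datetime,
                            bucket_seconds: int = 60) -> str:
    """Fallback dedupe key built from stable fields (not a provider ID)."""
    # Use the provider's received time so refetching the same message
    # always lands in the same bucket.
    bucket = int(received_at.timestamp()) // bucket_seconds
    parts = [
        inbox_id,
        _normalize_address(recipient),
        _normalize_address(sender),
        " ".join(subject.split()),     # collapse whitespace
        " ".join(body_text.split()),   # plain text only, never rendered HTML
        str(bucket),
    ]
    # The unit separator keeps adjacent fields from running together.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```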
A good dedupe design has multiple layers:
| Layer | Example key | What it protects |
|---|---|---|
| Message storage | inbox_id + provider_message_id | Prevents storing the same fetched message twice |
| Normalized fallback | inbox_id + normalized_message_hash | Handles missing or unstable provider IDs |
| Artifact extraction | attempt_id + artifact_type + value_hash | Prevents consuming the same OTP or link twice |
| Business action | attempt_id + action_type | Prevents duplicate signup, reset, or agent continuation |
This layered approach is especially important for OTP and magic-link flows. A user or test runner may request a resend, and two different email messages may contain the same valid artifact. Message-level dedupe alone will not stop duplicate artifact consumption.
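To make the layers concrete, here is a minimal sketch that expresses three of the keys from the table as database uniqueness constraints. It uses SQLite only because it ships with Python; the table layout and the `insert_if_absent` helper are assumptions for the example, not a prescribed schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE messages (
    inbox_id TEXT NOT NULL,
    provider_message_id TEXT NOT NULL,
    PRIMARY KEY (inbox_id, provider_message_id)
);
CREATE TABLE artifacts (
    attempt_id TEXT NOT NULL,
    artifact_type TEXT NOT NULL,
    value_hash TEXT NOT NULL,
    PRIMARY KEY (attempt_id, artifact_type, value_hash)
);
CREATE TABLE actions (
    attempt_id TEXT NOT NULL,
    action_type TEXT NOT NULL,
    PRIMARY KEY (attempt_id, action_type)
);
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory store; a real deployment would use a durable file or server."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def insert_if_absent(conn: sqlite3.Connection, table: str, **cols: str) -> bool:
    """Return True if the row was new, False if its key already existed."""
    if table not in ("messages", "artifacts", "actions"):
        raise ValueError(f"unknown table: {table}")
    names = ", ".join(cols)
    marks = ", ".join("?" for _ in cols)
    # INSERT OR IGNORE turns a duplicate key into a silent no-op.
    cur = conn.execute(
        f"INSERT OR IGNORE INTO {table} ({names}) VALUES ({marks})",
        tuple(cols.values()),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the database enforces uniqueness, two workers racing on the same key cannot both see a "new" result, which is the property application-level checks alone cannot give you.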
Make the poller idempotent by default
The poller should be safe to restart at any point. It should also be safe if two workers accidentally run for the same inbox. That means your database constraints and state transitions need to enforce correctness, not just your application logic.
Here is provider-neutral pseudocode for a duplicate-safe polling loop:
    function waitForMessage(inboxId, matcher, deadline):
        state = loadPollState(inboxId)
        backoff = newBackoff(min=500ms, max=5s, jitter=true)
        while now < deadline:
            # Only one worker at a time may poll a given inbox.
            lease = tryAcquireLease(inboxId, ttl=15s)
            if not lease:
                sleep(shortJitter)
                continue
            page = emailApi.listMessages(inboxId, cursor=state.cursor, limit=state.limit)
            for message in page.messages:
                key = providerMessageKey(inboxId, message)
                insertMessageIfAbsent(key, message)    # no-op on replay
                if matcher(message):
                    artifact = extractArtifact(message)
                    if insertArtifactIfAbsent(inboxId, artifact):
                        markConsumed(inboxId, artifact)
                        releaseLease(lease)
                        return artifact
            # Advance the cursor only after the whole page was handled.
            saveCursor(inboxId, page.nextCursor)
            state.cursor = page.nextCursor
            releaseLease(lease)
            sleep(backoff.next())
        raise Timeout
The key idea is that duplicate insertion is a normal no-op. A message can be fetched twice, but it cannot create two message records. An artifact can be extracted twice, but it cannot be consumed twice. The cursor can replay after a crash, but replay does not change the outcome.
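For readers who want something executable, the pseudocode above can be approximated in Python as follows. This is a simplified sketch: the lease is left out, `list_messages` is a hypothetical callable standing in for a provider client, and `MemoryStore` is an in-memory placeholder for a durable database with unique constraints.

```python
import random
import time


def wait_for_artifact(list_messages, store, inbox_id, matcher, extract,
                      timeout=30.0, sleep=time.sleep):
    """Poll until a not-yet-consumed artifact appears or the deadline passes.

    list_messages(inbox_id, cursor) -> (messages, next_cursor) is an assumed
    interface, not a real provider API.
    """
    deadline = time.monotonic() + timeout
    delay = 0.5
    while time.monotonic() < deadline:
        cursor = store.load_cursor(inbox_id)
        messages, next_cursor = list_messages(inbox_id, cursor)
        for msg in messages:
            store.insert_message_if_absent(inbox_id, msg["id"])  # replay is a no-op
            if matcher(msg):
                artifact = extract(msg)
                if store.insert_artifact_if_absent(inbox_id, artifact):
                    return artifact  # first consumer wins
        # Advance only after the whole page was handled idempotently.
        store.save_cursor(inbox_id, next_cursor)
        remaining = max(0.0, deadline - time.monotonic())
        sleep(min(delay * random.uniform(0.5, 1.0), remaining))
        delay = min(delay * 2, 5.0)
    raise TimeoutError(f"no matching message for inbox {inbox_id}")


class MemoryStore:
    """In-memory stand-in; a real store would be durable and shared."""

    def __init__(self):
        self.cursors, self.messages, self.artifacts = {}, set(), set()

    def load_cursor(self, inbox_id):
        return self.cursors.get(inbox_id)

    def save_cursor(self, inbox_id, cursor):
        self.cursors[inbox_id] = cursor

    def insert_message_if_absent(self, inbox_id, message_id):
        key = (inbox_id, message_id)
        is_new = key not in self.messages
        self.messages.add(key)
        return is_new

    def insert_artifact_if_absent(self, inbox_id, artifact):
        key = (inbox_id, artifact)
        is_new = key not in self.artifacts
        self.artifacts.add(key)
        return is_new
```

Returning before the cursor is saved is deliberate: if the process restarts, the page is fetched again and every insert is a no-op.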
Match narrowly before extracting anything
A poller that scans all recent email and picks the first matching subject line is fragile. Narrow matching reduces duplicates, prevents stale message selection, and gives better timeout errors.
Use matchers that combine several signals:
| Matcher signal | Why it helps |
|---|---|
| inbox_id | Keeps the search inside the attempt-specific inbox |
| Recipient address | Confirms the message was sent to the expected address |
| received_at >= created_at | Excludes stale messages from prior runs |
| Sender or domain allowlist | Reduces unrelated noise |
| Correlation token | Ties the message to a specific run or account |
| Expected purpose | Distinguishes signup, reset, login, and invite flows |
For LLM agents, do not expose an entire inbox and ask the model to choose the right email. Put the matching logic in code, then return a minimal result to the agent, such as an OTP, a verified URL, or a typed status. If you need a stable JSON shape for automation, the Mailhook guide on email to JSON schemas for agents and QA is a useful companion.
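A matcher that combines the signals from the table might look like the sketch below. The message field names (`inbox_id`, `to`, `from`, `received_at`, `subject`, `text`) are assumptions rather than a documented schema, `from` is assumed to be a bare address, and the expected purpose is approximated with a subject keyword.

```python
def make_matcher(inbox_id, recipient, created_at, allowed_sender_domains,
                 correlation_token, subject_keyword):
    """Build a predicate that requires every signal to agree.

    All message field names used here are illustrative assumptions.
    """
    def matches(msg) -> bool:
        sender_domain = msg["from"].rsplit("@", 1)[-1].lower()
        return (
            msg["inbox_id"] == inbox_id
            and msg["to"].lower() == recipient.lower()
            and msg["received_at"] >= created_at          # excludes stale mail
            and sender_domain in allowed_sender_domains    # allowlist
            and correlation_token in msg["text"]           # ties mail to this run
            and subject_keyword.lower() in msg["subject"].lower()
        )
    return matches
```

Because the predicate is plain code, a timeout can log exactly which criteria were in force, and the agent only ever sees the artifact extracted from a message that passed all of them.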
Use deadlines, backoff, and jitter instead of fixed sleeps
Fixed sleeps create two problems. If the sleep is too short, the test flakes. If it is too long, every run is slower than necessary. Polling should use an overall deadline with a bounded retry cadence.
A practical strategy is to start with a short interval for the first few seconds, then back off with jitter until the deadline. The exact values depend on the workflow and provider limits, but the structure is more important than the numbers.
Use these rules:
- Set a request timeout shorter than the overall wait deadline.
- Add jitter so parallel CI jobs do not poll at the same instant.
- Stop at a clear deadline and report why no message matched.
- Respect provider rate limits and retry-after signals.
- Avoid starting a second polling loop for the same inbox while the first one is active.
When a timeout happens, include the inbox ID, attempt ID, last cursor, number of messages inspected, and the matcher criteria in your logs. That turns a flaky failure into a debuggable event.
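A deadline-driven loop with exponential backoff and jitter can be sketched as follows. The function names and the default intervals are assumptions for illustration; the `clock` and `sleep` parameters exist only so the loop can be tested without real waiting.

```python
import random
import time


def backoff_delays(min_delay=0.5, max_delay=5.0, factor=2.0, rng=random.random):
    """Yield exponentially growing delays with jitter, capped at max_delay."""
    delay = min_delay
    while True:
        # "Equal jitter": keep half the delay, randomize the other half.
        yield delay / 2 + rng() * delay / 2
        delay = min(delay * factor, max_delay)


def poll_until(check, timeout, clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns a non-None value or the deadline passes."""
    deadline = clock() + timeout
    attempts = 0
    for delay in backoff_delays():
        attempts += 1
        result = check()
        if result is not None:
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(
                f"no match after {attempts} attempts in {timeout}s"
            )
        sleep(min(delay, remaining))  # never sleep past the deadline
```

The request timeout of whatever `check` calls should be shorter than `timeout`, otherwise a single slow request can overrun the overall deadline.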
Share dedupe between webhooks and polling fallback
Even when polling is the focus, many reliable email workflows use webhooks as the primary path and polling as a fallback. Webhooks reduce latency and avoid the request volume of constant polling. Polling catches missed webhook deliveries, deployment windows, or temporary endpoint failures.
The important rule is that webhook ingestion and polling ingestion must write to the same message store with the same dedupe keys. If a webhook processes the message first, the poller should see the existing message and do nothing. If the poller processes it first, the later webhook should also be a no-op.
Mailhook supports real-time webhook notifications and a polling API. When you use webhooks, verify signed payloads before processing. Polling can then act as a recovery mechanism without creating duplicate side effects.
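The shared-store rule can be illustrated with a single ingestion function that both paths call. This is a sketch: `SharedIngest` is a hypothetical in-memory stand-in for a durable store, and `verify_signature` shows a generic HMAC-SHA256 check only. The actual signing scheme, header names, and encoding are defined by the provider, so consult its documentation rather than assuming this format.

```python
import hashlib
import hmac


class SharedIngest:
    """One dedupe store used by both the webhook handler and the poller."""

    def __init__(self):
        self._seen = set()
        self.processed = []  # records which path handled each message first

    def ingest(self, source: str, inbox_id: str, message_id: str) -> bool:
        """Process the message once; return False if it was already handled."""
        key = (inbox_id, message_id)  # same key no matter which path delivers
        if key in self._seen:
            return False
        self._seen.add(key)
        self.processed.append((source, message_id))
        return True


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Generic HMAC-SHA256 check (assumed scheme, not a provider contract)."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, signature_hex)
```

Whichever path arrives second gets `False` back and performs no side effects, so the fallback poller can run freely alongside webhooks.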
Batch polling without global confusion
Batch polling can reduce overhead when you operate many temporary inboxes, but it can also create duplicate bugs if state is tracked too broadly. Keep polling state per inbox, even if retrieval is batched.
A safe batch design follows these principles:
- Shard work by inbox_id, not by a global timestamp alone.
- Keep per-inbox cursors or watermarks.
- Apply the same message-level unique constraints as single-inbox polling.
- Process each inbox independently so one malformed message does not block the batch.
- Emit metrics per inbox, attempt, and batch run.
This is where structured JSON email output helps. Instead of scraping HTML or reparsing raw MIME in every worker, your batch processor can apply consistent matching, dedupe, and extraction rules to normalized message data.
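A batch pass that keeps state per inbox could look like this sketch. The `list_messages` callable, the dict of cursors, and the `seen` set are placeholders for a provider client and durable storage; none of them reflect a specific API.

```python
def poll_batch(inbox_ids, list_messages, cursors, seen, handle):
    """Poll many inboxes in one pass while keeping state per inbox.

    list_messages(inbox_id, cursor) -> (messages, next_cursor) is an assumed
    interface. Returns a dict of inbox_id -> exception for inboxes that failed.
    """
    errors = {}
    for inbox_id in inbox_ids:
        try:
            messages, next_cursor = list_messages(inbox_id, cursors.get(inbox_id))
            for msg in messages:
                key = (inbox_id, msg["id"])
                if key in seen:
                    continue  # already stored by an earlier pass
                handle(inbox_id, msg)
                seen.add(key)
            # Advance this inbox's cursor only after its page succeeded.
            cursors[inbox_id] = next_cursor
        except Exception as exc:
            # One failing inbox must not block the rest of the batch.
            errors[inbox_id] = exc
    return errors
```

A failing inbox keeps its old cursor and is retried on the next pass, so `handle` needs to be idempotent for any messages it processed before the failure.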
Normalize before storing derived artifacts
Deduplication depends on stable input. If two workers parse the same email differently, they may produce different hashes or artifact records. Normalize messages before extracting artifacts.
For email automation, normalize conservatively:
| Data | Normalization rule |
|---|---|
| Addresses | Lowercase domains, preserve local-part unless your policy says otherwise |
| Headers | Store repeated headers as arrays or preserve raw plus normalized views |
| Timestamps | Convert to a single timezone (UTC is the usual choice) and keep provider received time |
| Body text | Prefer text/plain when available and normalize whitespace for extraction |
| URLs | Parse and validate host, scheme, and path before use |
| Attachments | Store metadata and hashes separately from raw content |
For AI agents, create a minimized message view. The agent usually does not need raw HTML, full headers, or unrelated body text. It needs the typed artifact and enough provenance to explain where it came from.
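A few of the normalization rules from the table can be sketched as small pure functions. The function names are illustrative, and choosing UTC as the single timezone is an assumption, although a common one.

```python
from datetime import datetime, timezone


def normalize_address(address: str) -> str:
    """Lowercase the domain and preserve the local part as received."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return f"{local}@{domain.lower()}"


def normalize_body_text(text: str) -> str:
    """Collapse runs of whitespace so extraction and hashing see stable input."""
    return " ".join(text.split())


def normalize_timestamp(value: datetime) -> datetime:
    """Convert an aware datetime to UTC; refuse naive values."""
    if value.tzinfo is None:
        raise ValueError("naive timestamps are ambiguous; attach a timezone first")
    return value.astimezone(timezone.utc)
```

Running every worker through the same functions is what keeps hashes and artifact records identical regardless of which worker parsed the message.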
Track the metrics that reveal duplicate bugs
Duplicate processing bugs are much easier to fix when you can see them. Add observability around both polling behavior and dedupe behavior.
| Metric | What it tells you |
|---|---|
| poll_attempt_count | How many requests were needed before a match |
| duplicate_message_count | How often the poller sees already-stored messages |
| cursor_replay_count | Whether workers are restarting or replaying pages often |
| artifact_duplicate_count | Whether resends or retries are producing repeated codes |
| time_to_first_message | Delivery and polling latency |
| timeout_count | Matching, delivery, or provider issues |
| concurrent_lease_conflict_count | Multiple workers competing for the same inbox |
Logs should include identifiers, not sensitive content. Record inbox_id, attempt_id, provider message ID, cursor version, matcher summary, and artifact hash. Avoid logging full verification links, OTPs, or raw message bodies unless you have strict retention and access controls.
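One way to log identifiers without leaking artifacts is sketched below. The helper names and the event shape are assumptions. The fingerprint is keyed with HMAC because a plain hash of a six-digit OTP can be reversed by trying all one million candidates.

```python
import hashlib
import hmac
from collections import Counter

METRICS = Counter()  # stand-in for a real metrics client


def artifact_fingerprint(value: str, key: bytes) -> str:
    """Keyed hash so a short OTP cannot be brute-forced from a log line."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def observe_message(seen, inbox_id, attempt_id, message_id, artifact, key):
    """Update duplicate counters and return a log-safe event dict."""
    msg_key = (inbox_id, message_id)
    duplicate = msg_key in seen
    seen.add(msg_key)
    if duplicate:
        METRICS["duplicate_message_count"] += 1
    return {
        "inbox_id": inbox_id,
        "attempt_id": attempt_id,
        "provider_message_id": message_id,
        "artifact_hash": artifact_fingerprint(artifact, key),  # never the raw value
        "duplicate": duplicate,
    }
```

The same artifact always yields the same fingerprint under one key, so repeated codes remain visible in logs without the code itself appearing.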
Security considerations for AI agents and LLM workflows
Email is untrusted input. A message can contain prompt-injection text, malicious links, tracking HTML, or misleading sender fields. Duplicate prevention and security should work together.
For agent-facing polling flows, apply these guardrails:
- Keep polling, matching, and dedupe in deterministic code.
- Return only the minimal artifact the agent needs.
- Validate verification links against an allowlist before use.
- Never let the model decide whether a message is a duplicate.
- Redact secrets and artifacts in logs.
- Use signed webhook verification if webhooks participate in the same pipeline.
This keeps the LLM tool surface small. The agent can ask for the verification result, but it does not need authority over raw mailbox contents.
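The link allowlist guardrail can be sketched with the standard library URL parser. The function name and the exact policy (HTTPS only, exact host match, no embedded credentials) are assumptions; adjust them to your own threat model.

```python
from urllib.parse import urlsplit


def is_allowed_link(url: str, allowed_hosts: set) -> bool:
    """Accept only https links whose exact host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    if parts.scheme != "https":
        return False
    if parts.username or parts.password:
        return False  # blocks https://trusted.host@evil.host/ tricks
    host = (parts.hostname or "").lower()
    # Exact match, so look-alike subdomains of another domain are rejected.
    return host in allowed_hosts
```

Run this check in deterministic code before the agent ever receives the link, so a prompt-injected message cannot talk the model into following an unlisted host.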
A practical Mailhook pattern for duplicate-safe polling
A clean Mailhook-based workflow looks like this:
- Create a disposable inbox through the API and store the inbox_id, email address, attempt ID, and deadline.
- Trigger the external flow, such as signup, login, password reset, or verification.
- Poll the inbox for structured JSON messages using durable state and a bounded deadline.
- Insert each message with a unique key scoped to the inbox.
- Match narrowly, extract the minimal artifact, and store that artifact with a consume-once key.
- Return the artifact to your test runner or agent, then move the inbox toward cleanup according to your lifecycle policy.
Mailhook also provides real-time webhooks, signed payloads for webhook security, shared domains for fast setup, custom domain support for controlled environments, and batch email processing for higher-volume workflows. For exact request formats and integration details, refer to Mailhook llms.txt.
Frequently Asked Questions
How do I stop duplicate emails when polling an email API?
Use a dedicated inbox per attempt, store a durable cursor or watermark, insert messages with unique constraints, and dedupe extracted artifacts before triggering downstream actions.
Should I dedupe by the email Message-ID header?
Not by itself. Message-ID is sender-generated. Prefer provider message identifiers scoped to the inbox, then use a normalized content hash as a fallback if needed.
Is polling worse than webhooks for duplicate prevention?
Not necessarily. Webhooks are often lower latency, but both webhooks and polling can duplicate. The solution is a shared idempotent ingestion layer, so either path can replay safely.
How often should I poll for verification emails?
Use a deadline-based loop with backoff and jitter. Start quickly for interactive flows, then slow down while respecting provider rate limits. Avoid fixed sleeps as your main waiting strategy.
What should an AI agent receive from a polled email?
The agent should receive a minimal, typed artifact, such as an OTP, verified magic link, or status. Keep raw email parsing, duplicate detection, and link validation in deterministic code.
Build duplicate-safe email polling with Mailhook
If your QA suite or AI agent workflow depends on email, make polling deterministic from the start. Mailhook gives you programmable temp inboxes, structured JSON emails, RESTful API access, a polling API, real-time webhooks, signed payloads, and shared or custom domain options.
Explore Mailhook to create disposable inboxes via API, or start with the llms.txt integration reference to wire duplicate-safe email polling into your agents and automation.