Best Practices for Polling Email APIs Without Duplicates

Polling an email API looks simple: request messages, sleep, repeat. Duplicates are where it becomes an engineering problem.

A test runner may process the same verification code twice. An AI agent may click the same magic link after a retry. A background worker may re-read the same page because a cursor was not saved before a crash. The result is flaky CI, confusing logs, and actions that are hard to prove safe.

The goal is not to pretend duplicates will never happen. Email delivery, HTTP retries, webhooks, and polling loops are all much easier to make reliable when you assume duplicate reads are normal. The best polling design makes every message, artifact, and downstream action idempotent.

What polling without duplicates really means

In production, duplicate-free usually means duplicate-safe. Your email API or polling client may legitimately return the same message more than once, especially during retries, pagination, fallback recovery, or worker restarts.

A robust system guarantees three outcomes:

  • Each logical email message is stored once.
  • Each extracted artifact, such as an OTP or verification link, is consumed once.
  • Each downstream action, such as completing signup or resuming an agent workflow, runs once per intended attempt.

This distinction matters for QA automation and LLM agents. A human can often notice that two messages are the same. A program needs stable identifiers, durable state, and explicit idempotency rules.

Why duplicate emails happen when polling APIs

Duplicate processing usually comes from the consumer, not from the email itself. The same email can be fetched multiple times for good reasons, and your code needs to treat that as expected behavior.

| Duplicate source | Common symptom | Best prevention |
| --- | --- | --- |
| Overlapping pollers | Two workers process the same OTP | Use a lease or lock per inbox |
| Cursor replay after crash | The same page is fetched again | Insert idempotently before advancing the cursor |
| Timestamp-based windows | Boundary messages appear twice or are missed | Prefer provider cursors when available |
| Sender retry or resend | Two emails contain the same code | Dedupe at the artifact layer |
| Webhook plus polling fallback | Push and pull both deliver the same message | Share one dedupe store across both paths |
| Header reuse or missing headers | Message-ID is not unique enough | Use provider IDs scoped to inbox, with a hash fallback |

[Diagram: an isolated email inbox flows into a polling worker, then a deduplication store, then an automation or AI agent action.]

Start with an isolated inbox per attempt

The easiest duplicate to fix is the one you never create. Polling a shared mailbox forces your code to distinguish old messages, unrelated messages, retried messages, and parallel test messages. That is where most flaky email automation begins.

For test suites, verification flows, and agent workflows, create or allocate a dedicated inbox for each attempt. Treat the inbox as a resource with its own identifier, not just a string email address.

| Field | Purpose |
| --- | --- |
| inbox_id | Stable handle used for polling and correlation |
| email | Address passed to the system under test |
| attempt_id | Test run, signup attempt, or agent task identifier |
| created_at | Start of the valid matching window |
| active_until | Deadline for receiving expected messages |
| cursor | Last durable polling position, if supported |
| state | Active, draining, closed, or expired |
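As a rough illustration, the fields above could be held in a small record type. This is a sketch only: the class names, the `accepts` helper, and the example values are assumptions made for illustration, not part of any provider's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class InboxState(str, Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    CLOSED = "closed"
    EXPIRED = "expired"


@dataclass
class InboxAttempt:
    """One inbox allocated to one attempt (hypothetical record shape)."""
    inbox_id: str
    email: str
    attempt_id: str
    created_at: datetime
    active_until: datetime
    cursor: Optional[str] = None
    state: InboxState = InboxState.ACTIVE

    def accepts(self, received_at: datetime) -> bool:
        """True when a message timestamp falls inside the valid window."""
        return self.created_at <= received_at <= self.active_until


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
attempt = InboxAttempt(
    inbox_id="inbox_123",
    email="signup-123@example.test",
    attempt_id="run-42",
    created_at=start,
    active_until=start + timedelta(minutes=5),
)
```

Keeping `created_at` and `active_until` on the record lets the poller reject stale or late messages without consulting any other state.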

Mailhook is built around this inbox-first model: you can create disposable inboxes via API, receive emails as structured JSON, and consume them through webhooks or a polling API. For the exact integration contract, use the canonical Mailhook llms.txt reference.

Prefer cursors over timestamp polling

A polling loop needs a way to remember what it has already inspected. The safest option is an opaque server cursor when the API provides one. A cursor usually represents a position in the provider’s ordered message stream and avoids many edge cases around clock skew, equal timestamps, and inclusive time boundaries.

If you are working with a polling API, follow these cursor rules:

  • Store the cursor durably, not only in process memory.
  • Scope the cursor to one inbox and one query shape.
  • Advance the cursor only after processing the full page idempotently.
  • Treat replayed cursors as normal, not as errors.
  • Keep a dedupe store even if the cursor seems reliable.

If a provider only supports a "since" timestamp filter, use a small overlap window and dedupe aggressively. For example, poll from slightly before the last observed timestamp, then ignore messages already stored. This is safer than using a strict timestamp boundary that can miss messages created at the same instant.
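The overlap-window approach can be sketched as follows. The `fetch_since` callable, the message shape (`id` and `received_at` keys), and the five-second overlap are assumptions chosen for illustration; a real provider will have its own query parameters and limits.

```python
from datetime import datetime, timedelta, timezone

OVERLAP = timedelta(seconds=5)  # assumed value; tune to the provider


def poll_with_overlap(fetch_since, seen_ids, last_seen_at):
    """Fetch from slightly before the watermark and skip stored messages.

    fetch_since is a hypothetical callable: datetime -> list of dicts
    that each carry "id" and "received_at".
    """
    new_messages = []
    newest = last_seen_at
    for msg in fetch_since(last_seen_at - OVERLAP):
        if msg["id"] in seen_ids:
            continue  # replay from the overlap window: expected no-op
        seen_ids.add(msg["id"])
        new_messages.append(msg)
        newest = max(newest, msg["received_at"])
    return new_messages, newest


# Fake mailbox: m1 and m2 share the same timestamp.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
MAILBOX = [
    {"id": "m1", "received_at": t0},
    {"id": "m2", "received_at": t0},
    {"id": "m3", "received_at": t0 + timedelta(seconds=2)},
]


def fake_fetch(since):
    return [m for m in MAILBOX if m["received_at"] >= since]


seen = {"m1"}  # m1 was stored on an earlier poll
fresh, watermark = poll_with_overlap(fake_fetch, seen, t0)
```

Here `m2` arrived at the same instant as `m1`. A strict "greater than the last timestamp" query would have skipped it, while the overlap window returns it and the dedupe set filters out the replayed `m1`.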

Do not dedupe only by Message-ID

The Message-ID header is useful, but it is sender-generated. As defined in RFC 5322, it is part of the email message format, not a guarantee from your inbox provider that the message was delivered exactly once to your automation.

Use provider-attested identifiers when available, scoped to the inbox that received the message. If you need a fallback, compute a normalized hash from stable fields, such as inbox ID, envelope recipient, sender, subject, normalized body text, and received timestamp bucket. Do not build your only dedupe rule around the rendered HTML body or a sender-controlled header.
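A fallback hash along these lines might look like the sketch below. The function name, the field selection, and the 60-second bucket are assumptions. Lowercasing the addresses here is a choice made only for hashing, not a rule for how addresses should be stored.

```python
import hashlib
import re
from datetime import datetime, timezone


def _collapse(text):
    """Collapse runs of whitespace so formatting noise does not change the hash."""
    return re.sub(r"\s+", " ", text).strip()


def normalized_message_hash(inbox_id, recipient, sender, subject,
                            body_text, received_at, bucket_seconds=60):
    """Hash stable fields of a message (hypothetical fallback key)."""
    bucket = int(received_at.timestamp()) // bucket_seconds
    parts = [
        inbox_id,
        recipient.strip().lower(),
        sender.strip().lower(),
        _collapse(subject),
        _collapse(body_text),
        str(bucket),
    ]
    # Unit separator avoids ambiguity between adjacent fields.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


h1 = normalized_message_hash(
    "inbox_1", "user@example.test", "no-reply@app.example.test",
    "Your code", "Code:  123456\r\n",
    datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc),
)
h2 = normalized_message_hash(
    "inbox_1", "User@Example.test", "no-reply@app.example.test",
    "Your  code", "Code: 123456",
    datetime(2024, 1, 1, 12, 0, 40, tzinfo=timezone.utc),
)
```

One caveat of time bucketing: two copies that land on opposite sides of a bucket boundary hash differently, which is one more reason to treat this hash as a fallback rather than the primary key.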

A good dedupe design has multiple layers:

| Layer | Example key | What it protects |
| --- | --- | --- |
| Message storage | inbox_id + provider_message_id | Prevents storing the same fetched message twice |
| Normalized fallback | inbox_id + normalized_message_hash | Handles missing or unstable provider IDs |
| Artifact extraction | attempt_id + artifact_type + value_hash | Prevents consuming the same OTP or link twice |
| Business action | attempt_id + action_type | Prevents duplicate signup, reset, or agent continuation |

This layered approach is especially important for OTP and magic-link flows. A user or test runner may request a resend, and two different email messages may contain the same valid artifact. Message-level dedupe alone will not stop duplicate artifact consumption.
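One way to enforce these layers is to let the database own the uniqueness rules. The sketch below uses SQLite from the Python standard library. The table layout and the `insert_if_absent` helper are illustrative assumptions, and a production system would likely use its own schema and database.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (
    inbox_id TEXT NOT NULL,
    provider_message_id TEXT NOT NULL,
    PRIMARY KEY (inbox_id, provider_message_id)
);
CREATE TABLE artifacts (
    attempt_id TEXT NOT NULL,
    artifact_type TEXT NOT NULL,
    value_hash TEXT NOT NULL,
    PRIMARY KEY (attempt_id, artifact_type, value_hash)
);
CREATE TABLE actions (
    attempt_id TEXT NOT NULL,
    action_type TEXT NOT NULL,
    PRIMARY KEY (attempt_id, action_type)
);
""")


def insert_if_absent(table, columns, values):
    """Return True only when the row was newly inserted.

    table and columns are trusted constants in this sketch, never user input.
    """
    placeholders = ", ".join("?" for _ in values)
    sql = f"INSERT OR IGNORE INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    with conn:  # commit on success, roll back on error
        return conn.execute(sql, values).rowcount == 1


MSG_COLS = ("inbox_id", "provider_message_id")
ART_COLS = ("attempt_id", "artifact_type", "value_hash")
otp_hash = hashlib.sha256(b"123456").hexdigest()

first_fetch = insert_if_absent("messages", MSG_COLS, ("inbox_1", "msg_a"))
replayed_fetch = insert_if_absent("messages", MSG_COLS, ("inbox_1", "msg_a"))
resend = insert_if_absent("messages", MSG_COLS, ("inbox_1", "msg_b"))

first_artifact = insert_if_absent("artifacts", ART_COLS, ("run-42", "otp", otp_hash))
same_code_again = insert_if_absent("artifacts", ART_COLS, ("run-42", "otp", otp_hash))
```

The resend (`msg_b`) is a new message and is stored, but the OTP it carries maps to an existing artifact row, so the second consume attempt is rejected at the artifact layer.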

Make the poller idempotent by default

The poller should be safe to restart at any point. It should also be safe if two workers accidentally run for the same inbox. That means your database constraints and state transitions need to enforce correctness, not just your application logic.

Here is provider-neutral pseudocode for a duplicate-safe polling loop:

function waitForMessage(inboxId, matcher, deadline):
  state = loadPollState(inboxId)            # durable cursor and page limit
  backoff = newBackoff(min=500ms, max=5s, jitter=true)

  while now < deadline:
    lease = tryAcquireLease(inboxId, ttl=15s)
    if not lease:
      sleep(shortJitter)                    # another worker holds this inbox
      continue

    try:
      page = emailApi.listMessages(inboxId, cursor=state.cursor, limit=state.limit)

      for message in page.messages:
        key = providerMessageKey(inboxId, message)
        insertMessageIfAbsent(key, message) # no-op when already stored

        if matcher(message):
          artifact = extractArtifact(message)
          if insertArtifactIfAbsent(inboxId, artifact):
            markConsumed(inboxId, artifact)
            return artifact                 # cursor not advanced; a replay is a no-op

      # Advance only after the whole page has been stored idempotently.
      saveCursor(inboxId, page.nextCursor)
      state.cursor = page.nextCursor        # keep the in-memory copy in sync
    finally:
      releaseLease(lease)                   # also runs on return and on errors

    sleep(backoff.next())

  raise Timeout

The key idea is that duplicate insertion is a normal no-op. A message can be fetched twice, but it cannot create two message records. An artifact can be extracted twice, but it cannot be consumed twice. The cursor can replay after a crash, but replay does not change the outcome.
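The lease step in the loop can be illustrated with a small in-memory table. This is a single-process sketch with assumed names (`LeaseTable`, `try_acquire`, `release`). Workers on different machines would need the same compare-and-set logic in a shared store.

```python
import time
import uuid


class LeaseTable:
    """Per-inbox leases with a TTL (in-memory, single-process sketch)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # inbox_id -> (token, expires_at)

    def try_acquire(self, inbox_id, ttl_seconds):
        """Return a lease token, or None if an unexpired lease exists."""
        now = self._clock()
        current = self._leases.get(inbox_id)
        if current is not None and current[1] > now:
            return None
        token = uuid.uuid4().hex
        self._leases[inbox_id] = (token, now + ttl_seconds)
        return token

    def release(self, inbox_id, token):
        """Release only if the caller still owns the lease."""
        current = self._leases.get(inbox_id)
        if current is not None and current[0] == token:
            del self._leases[inbox_id]


# A fake clock makes expiry deterministic for the example.
fake_now = [0.0]
leases = LeaseTable(clock=lambda: fake_now[0])

worker_a = leases.try_acquire("inbox_1", 15)
worker_b = leases.try_acquire("inbox_1", 15)   # blocked while A holds it
fake_now[0] = 16.0                             # A's lease has expired
worker_c = leases.try_acquire("inbox_1", 15)   # C takes over
leases.release("inbox_1", worker_a)            # stale token: no effect
worker_d = leases.try_acquire("inbox_1", 15)   # still blocked by C
```

The token check in `release` matters: a worker whose lease expired must not be able to free a lease that another worker has since acquired. The lease only reduces wasted work. The unique constraints remain what guarantees correctness.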

Match narrowly before extracting anything

A poller that scans all recent email and picks the first matching subject line is fragile. Narrow matching reduces duplicates, prevents stale message selection, and gives better timeout errors.

Use matchers that combine several signals:

| Matcher signal | Why it helps |
| --- | --- |
| inbox_id | Keeps the search inside the attempt-specific inbox |
| Recipient address | Confirms the message was sent to the expected address |
| received_at >= created_at | Excludes stale messages from prior runs |
| Sender or domain allowlist | Reduces unrelated noise |
| Correlation token | Ties the message to a specific run or account |
| Expected purpose | Distinguishes signup, reset, login, and invite flows |
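A matcher that combines these signals could be sketched like this. The message keys (`inbox_id`, `to`, `from`, `received_at`, `subject`, `text`) and the `Expectation` type are assumptions for illustration. Map them to whatever JSON shape your provider actually returns.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Expectation:
    """What the current attempt expects to receive (hypothetical shape)."""
    inbox_id: str
    recipient: str
    created_at: datetime
    allowed_sender_domains: frozenset
    correlation_token: str
    purpose_keyword: str  # e.g. "verify", "reset", "invite"


def matches(message, exp):
    """Require every signal to agree before extracting anything."""
    sender_domain = message["from"].rsplit("@", 1)[-1].lower()
    return (
        message["inbox_id"] == exp.inbox_id
        and message["to"].lower() == exp.recipient.lower()
        and message["received_at"] >= exp.created_at
        and sender_domain in exp.allowed_sender_domains
        and exp.correlation_token in message["text"]
        and exp.purpose_keyword in message["subject"].lower()
    )


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expectation = Expectation(
    inbox_id="inbox_1",
    recipient="signup-1@example.test",
    created_at=created,
    allowed_sender_domains=frozenset({"app.example.test"}),
    correlation_token="run-42",
    purpose_keyword="verify",
)
good = {
    "inbox_id": "inbox_1",
    "to": "signup-1@example.test",
    "from": "no-reply@app.example.test",
    "received_at": created + timedelta(seconds=3),
    "subject": "Verify your email",
    "text": "Attempt run-42: your code is 123456",
}
stale = dict(good, received_at=created - timedelta(hours=1))
```

Because every condition must hold, a stale message from a previous run fails on the timestamp alone, even when its subject and sender look right.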

For LLM agents, do not expose an entire inbox and ask the model to choose the right email. Put the matching logic in code, then return a minimal result to the agent, such as an OTP, a verified URL, or a typed status. If you need a stable JSON shape for automation, the Mailhook guide on email to JSON schemas for agents and QA is a useful companion.

Use deadlines, backoff, and jitter instead of fixed sleeps

Fixed sleeps create two problems. If the sleep is too short, the test flakes. If it is too long, every run is slower than necessary. Polling should use an overall deadline with a bounded retry cadence.

A practical strategy is to start with a short interval for the first few seconds, then back off with jitter until the deadline. The exact values depend on the workflow and provider limits, but the structure is more important than the numbers.

Use these rules:

  • Set a request timeout shorter than the overall wait deadline.
  • Add jitter so parallel CI jobs do not poll at the same instant.
  • Stop at a clear deadline and report why no message matched.
  • Respect provider rate limits and retry-after signals.
  • Avoid starting a second polling loop for the same inbox while the first one is active.

When a timeout happens, include the inbox ID, attempt ID, last cursor, number of messages inspected, and the matcher criteria in your logs. That turns a flaky failure into a debuggable event.
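A deadline-driven wait with jittered backoff might look like the sketch below. The function names and default intervals are assumptions. The sleep and clock are injectable only so the example can run without real waiting.

```python
import random
import time


def backoff_delays(min_s=0.5, max_s=5.0, factor=2.0, rng=random.random):
    """Yield jittered delays whose ceiling grows from min_s up to max_s."""
    ceiling = min_s
    while True:
        # Jitter picks a value between half the ceiling and the ceiling.
        yield ceiling * (0.5 + 0.5 * rng())
        ceiling = min(max_s, ceiling * factor)


def wait_until(check, timeout_s, delays=None, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a non-None value or the deadline passes."""
    if delays is None:
        delays = backoff_delays()
    deadline = clock() + timeout_s
    attempts = 0
    for delay in delays:
        attempts += 1
        result = check()
        if result is not None:
            return result, attempts
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"no match after {attempts} attempts in {timeout_s}s")
        sleep(min(delay, remaining))  # never sleep past the deadline


# Deterministic demo: fake clock, fake sleep, jitter pinned to the ceiling.
fake = {"now": 0.0}


def fake_sleep(seconds):
    fake["now"] += seconds


answers = iter([None, None, "otp-artifact"])
result, attempts = wait_until(
    lambda: next(answers),
    timeout_s=30,
    delays=backoff_delays(rng=lambda: 1.0),
    sleep=fake_sleep,
    clock=lambda: fake["now"],
)
```

In real use, `check` would perform one poll and return the matched artifact or `None`, and the `TimeoutError` message would carry the diagnostic fields described above.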

Share dedupe between webhooks and polling fallback

Even when polling is the focus, many reliable email workflows use webhooks as the primary path and polling as a fallback. Webhooks reduce latency and infrastructure churn. Polling catches missed webhook deliveries, deployment windows, or temporary endpoint failures.

The important rule is that webhook ingestion and polling ingestion must write to the same message store with the same dedupe keys. If a webhook processes the message first, the poller should see the existing message and do nothing. If the poller processes it first, the later webhook should also be a no-op.

Mailhook supports real-time webhook notifications and a polling API. When you use webhooks, verify signed payloads before processing. Polling can then act as a recovery mechanism without creating duplicate side effects.
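The shared-store rule can be sketched with a single ingestion function that both paths call. Everything here is hypothetical: the `MessageStore` class, the handler names, and the message keys are illustrative, and signature verification is assumed to have happened before `on_webhook` runs.

```python
class MessageStore:
    """In-memory stand-in for a table keyed on (inbox_id, provider_message_id)."""

    def __init__(self):
        self._rows = {}

    def insert_if_absent(self, inbox_id, provider_message_id, payload):
        key = (inbox_id, provider_message_id)
        if key in self._rows:
            return False
        self._rows[key] = payload
        return True


processed = []  # records which path actually handled each message


def ingest(store, source, message):
    """Single entry point used by both the webhook and the poller."""
    is_new = store.insert_if_absent(message["inbox_id"], message["id"], message)
    if is_new:
        processed.append((source, message["id"]))
    return is_new


def on_webhook(store, payload):
    # Assumes the payload signature was already verified upstream.
    return ingest(store, "webhook", payload)


def on_poll_page(store, messages):
    return [ingest(store, "poll", m) for m in messages]


store = MessageStore()
msg = {"inbox_id": "inbox_1", "id": "msg_a", "text": "code 123456"}
late = {"inbox_id": "inbox_1", "id": "msg_b", "text": "code 654321"}

webhook_result = on_webhook(store, msg)            # push path wins the race
poll_results = on_poll_page(store, [msg, late])    # pull path sees msg again
```

Whichever path arrives first does the work, and the other becomes a no-op. Here the poller also picks up `msg_b`, standing in for a message whose webhook never arrived.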

Batch polling without global confusion

Batch polling can reduce overhead when you operate many temporary inboxes, but it can also create duplicate bugs if state is tracked too broadly. Keep polling state per inbox, even if retrieval is batched.

A safe batch design follows these principles:

  • Shard work by inbox_id, not by a global timestamp alone.
  • Keep per-inbox cursors or watermarks.
  • Apply the same message-level unique constraints as single-inbox polling.
  • Process each inbox independently so one malformed message does not block the batch.
  • Emit metrics per inbox, attempt, and batch run.
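These principles can be sketched as a batch loop that keeps a cursor per inbox and isolates failures. The `fetch_page` and `handle_message` callables are assumptions standing in for your provider client and your idempotent handler.

```python
def poll_batch(inbox_ids, cursors, fetch_page, handle_message):
    """Poll many inboxes while keeping cursor state and failures per inbox.

    fetch_page(inbox_id, cursor) -> (messages, next_cursor)   # hypothetical
    handle_message(inbox_id, message) must be idempotent.
    """
    report = {}
    for inbox_id in inbox_ids:
        try:
            messages, next_cursor = fetch_page(inbox_id, cursors.get(inbox_id))
            for message in messages:
                handle_message(inbox_id, message)
            # Advance this inbox's cursor only after its page succeeded.
            cursors[inbox_id] = next_cursor
            report[inbox_id] = "ok"
        except Exception as exc:  # one bad inbox must not block the batch
            report[inbox_id] = f"failed: {exc}"
    return report


PAGES = {
    "inbox_a": ([{"id": "a1"}], "cur_a1"),
    "inbox_b": ([{"id": None}], "cur_b1"),  # malformed message
    "inbox_c": ([{"id": "c1"}], "cur_c1"),
}
handled = []


def fake_fetch(inbox_id, cursor):
    return PAGES[inbox_id]


def handler(inbox_id, message):
    if message["id"] is None:
        raise ValueError("missing id")
    handled.append((inbox_id, message["id"]))


cursors = {}
report = poll_batch(["inbox_a", "inbox_b", "inbox_c"], cursors, fake_fetch, handler)
```

The malformed message in `inbox_b` leaves that inbox's cursor untouched, so the page is retried on the next run, while `inbox_a` and `inbox_c` advance normally.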

This is where structured JSON email output helps. Instead of scraping HTML or reparsing raw MIME in every worker, your batch processor can apply consistent matching, dedupe, and extraction rules to normalized message data.

Normalize before storing derived artifacts

Deduplication depends on stable input. If two workers parse the same email differently, they may produce different hashes or artifact records. Normalize messages before extracting artifacts.

For email automation, normalize conservatively:

| Data | Normalization rule |
| --- | --- |
| Addresses | Lowercase domains, preserve local-part unless your policy says otherwise |
| Headers | Store repeated headers as arrays or preserve raw plus normalized views |
| Timestamps | Convert to a single timezone and keep provider received time |
| Body text | Prefer text/plain when available and normalize whitespace for extraction |
| URLs | Parse and validate host, scheme, and path before use |
| Attachments | Store metadata and hashes separately from raw content |
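A few of these rules can be sketched as small pure functions. The function names are assumptions, and the timestamp helper expects timezone-aware values. A naive datetime would be interpreted as local time by `astimezone`.

```python
import re
from datetime import datetime, timedelta, timezone


def normalize_address(address):
    """Lowercase the domain, preserve the local part."""
    trimmed = address.strip()
    local, sep, domain = trimmed.rpartition("@")
    return f"{local}@{domain.lower()}" if sep else trimmed


def normalize_text(body):
    """Collapse whitespace so extraction sees one stable string."""
    return re.sub(r"\s+", " ", body).strip()


def to_utc(timestamp):
    """Convert an aware datetime to UTC."""
    return timestamp.astimezone(timezone.utc)


address = normalize_address("  Alice@Example.COM ")
body = normalize_text("Your code:\r\n   123456 \n")
received = to_utc(datetime(2024, 1, 1, 13, 0, tzinfo=timezone(timedelta(hours=1))))
```

Running the same pure functions in every worker is what keeps hashes and artifact keys identical across retries.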

For AI agents, create a minimized message view. The agent usually does not need raw HTML, full headers, or unrelated body text. It needs the typed artifact and enough provenance to explain where it came from.

Track the metrics that reveal duplicate bugs

Duplicate processing bugs are much easier to fix when you can see them. Add observability around both polling behavior and dedupe behavior.

| Metric | What it tells you |
| --- | --- |
| poll_attempt_count | How many requests were needed before a match |
| duplicate_message_count | How often the poller sees already-stored messages |
| cursor_replay_count | Whether workers are restarting or replaying pages often |
| artifact_duplicate_count | Whether resends or retries are producing repeated codes |
| time_to_first_message | Delivery and polling latency |
| timeout_count | Matching, delivery, or provider issues |
| concurrent_lease_conflict_count | Multiple workers competing for the same inbox |

Logs should include identifiers, not sensitive content. Record inbox_id, attempt_id, provider message ID, cursor version, matcher summary, and artifact hash. Avoid logging full verification links, OTPs, or raw message bodies unless you have strict retention and access controls.
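A log record that carries identifiers and a fingerprint instead of the artifact itself might be built like this. The field names and the key handling are assumptions. A keyed HMAC is used because a plain hash of a short OTP could be reversed by enumerating all possible codes.

```python
import hashlib
import hmac
import json

LOG_HASH_KEY = b"demo-key-load-from-secret-storage"  # placeholder for the example


def artifact_fingerprint(value, key=LOG_HASH_KEY):
    """Keyed, truncated digest that can be compared but not read back."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def log_record(inbox_id, attempt_id, provider_message_id, artifact_value):
    """Build a JSON log line without the raw OTP or link."""
    return json.dumps(
        {
            "inbox_id": inbox_id,
            "attempt_id": attempt_id,
            "provider_message_id": provider_message_id,
            "artifact_hash": artifact_fingerprint(artifact_value),
        },
        sort_keys=True,
    )


record = log_record("inbox_1", "run-42", "msg_a", "123456")
```

Two log lines with the same `artifact_hash` reveal that a code was seen twice, which is enough for debugging duplicates without exposing the code.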

Security considerations for AI agents and LLM workflows

Email is untrusted input. A message can contain prompt-injection text, malicious links, tracking HTML, or misleading sender fields. Duplicate prevention and security should work together.

For agent-facing polling flows, apply these guardrails:

  • Keep polling, matching, and dedupe in deterministic code.
  • Return only the minimal artifact the agent needs.
  • Validate verification links against an allowlist before use.
  • Never let the model decide whether a message is a duplicate.
  • Redact secrets and artifacts in logs.
  • Use signed webhook verification if webhooks participate in the same pipeline.
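The link allowlist guardrail can be sketched with the standard library URL parser. The allowed host and the HTTPS-only rule are assumptions for the example. A real policy may also restrict paths or query parameters.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = frozenset({"app.example.test"})  # assumed allowlist


def is_safe_verification_link(url, allowed_hosts=ALLOWED_HOSTS):
    """Accept only HTTPS links whose exact host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    return (
        parts.scheme == "https"
        and parts.hostname is not None
        and parts.hostname in allowed_hosts
        # Reject user-info tricks such as https://trusted@evil.example/
        and parts.username is None
        and parts.password is None
    )


ok = is_safe_verification_link("https://app.example.test/verify?token=abc")
lookalike = is_safe_verification_link("https://app.example.test.evil.example/verify")
userinfo = is_safe_verification_link("https://app.example.test@evil.example/verify")
plain_http = is_safe_verification_link("http://app.example.test/verify")
```

Comparing the parsed hostname for exact equality, rather than checking whether the URL string contains the trusted domain, is what defeats the lookalike and user-info variants.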

This keeps the LLM tool surface small. The agent can ask for the verification result, but it does not need authority over raw mailbox contents.

A practical Mailhook pattern for duplicate-safe polling

A clean Mailhook-based workflow looks like this:

  1. Create a disposable inbox through the API and store the inbox_id, email address, attempt ID, and deadline.
  2. Trigger the external flow, such as signup, login, password reset, or verification.
  3. Poll the inbox for structured JSON messages using durable state and a bounded deadline.
  4. Insert each message with a unique key scoped to the inbox.
  5. Match narrowly, extract the minimal artifact, and store that artifact with a consume-once key.
  6. Return the artifact to your test runner or agent, then move the inbox toward cleanup according to your lifecycle policy.

Mailhook also provides real-time webhooks, signed payloads for webhook security, shared domains for fast setup, custom domain support for controlled environments, and batch email processing for higher-volume workflows. For exact request formats and integration details, refer to Mailhook llms.txt.

Frequently Asked Questions

How do I stop duplicate emails when polling an email API? Use a dedicated inbox per attempt, store a durable cursor or watermark, insert messages with unique constraints, and dedupe extracted artifacts before triggering downstream actions.

Should I dedupe by the email Message-ID header? Not by itself. Message-ID is sender-generated. Prefer provider message identifiers scoped to the inbox, then use a normalized content hash as a fallback if needed.

Is polling worse than webhooks for duplicate prevention? Not necessarily. Webhooks are often lower latency, but both webhooks and polling can duplicate. The solution is a shared idempotent ingestion layer, so either path can replay safely.

How often should I poll for verification emails? Use a deadline-based loop with backoff and jitter. Start quickly for interactive flows, then slow down while respecting provider rate limits. Avoid fixed sleeps as your main waiting strategy.

What should an AI agent receive from a polled email? The agent should receive a minimal, typed artifact, such as an OTP, verified magic link, or status. Keep raw email parsing, duplicate detection, and link validation in deterministic code.

Build duplicate-safe email polling with Mailhook

If your QA suite or AI agent workflow depends on email, make polling deterministic from the start. Mailhook gives you programmable temp inboxes, structured JSON emails, RESTful API access, a polling API, real-time webhooks, signed payloads, and shared or custom domain options.

Explore Mailhook to create disposable inboxes via API, or start with the llms.txt integration reference to wire duplicate-safe email polling into your agents and automation.
