What makes parsing raw email so complex?

Email uses decades-old standards like RFC 5322 and MIME that include multipart bodies, various encodings (base64, quoted-printable), different character sets, and complex header structures that can be folded or duplicated.

Should I build my own email parser or use an API service?

For automation and AI agents, using a programmable inbox API that provides structured JSON is typically more reliable than building custom parsing, which requires handling numerous edge cases and security considerations.

How do I extract OTPs and verification links safely from emails?

Use structured JSON output with normalized content, prefer text/plain over HTML when available, validate links against allowlists, and extract artifacts using tight patterns with context checks rather than parsing raw HTML.

What's the best approach for email automation in testing?

Use isolated inboxes per test run, implement webhook-first delivery with polling fallback, assert on stable properties like sender domain and OTP presence rather than HTML presentation, and keep raw messages available for debugging.

Open an Email Programmatically: From Raw to JSON

Email is one of the last “human-first” surfaces many systems still depend on. But if you’re building an AI agent, an LLM toolchain, or a QA harness, you eventually need to open an email programmatically, extract just the useful artifacts (OTP, magic link, invoice ID, reset URL), and move on.

The hard part is that email arrives as a messy, decades-old stack of standards: RFC 5322 headers, MIME multipart bodies, odd encodings, and HTML that was never meant to be parsed by tests (or agents). This guide walks through what “raw email” actually is, why it’s tricky, and how to reliably convert it into a JSON shape your automation can trust.

What it means to “open an email” programmatically

When humans “open an email,” the email client quietly does a lot of work:

Parses the message format (headers plus body)
Decodes transfer encodings (base64, quoted-printable)
Picks a body to display (usually text/plain or HTML)
Unpacks attachments
Normalizes dates, addresses, and character sets

Programmatically, you need to decide what “open” means for your workflow. For automation, “open” usually means:

Locate the right message deterministically (no brittle mailbox searches)
Parse and normalize it into a stable schema
Extract a small, verifiable artifact (OTP, link, token)
Log enough to debug failures without leaking sensitive content

A good mental model is: treat email like an untrusted inbound event, not like a document.

Raw email, the formats you actually receive

Most systems ultimately represent an email as a raw RFC 5322 message: a blob of text and bytes composed of headers and a body. If you need the standards references, start with RFC 5322 (Internet Message Format) and the MIME family, RFC 2045 through RFC 2049.

A “raw” message typically includes:

Headers: key/value pairs like From, To, Subject, Date, Message-ID, plus many others
Body: sometimes plain text, often HTML, frequently multipart with boundaries
Attachments: represented as MIME parts, commonly base64 encoded
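To make that concrete, here is a minimal sketch of a hand-written raw message parsed with Python's standard `email` package. The addresses, subject, and Message-ID are made-up sample values, not output from any real system.

```python
from email import message_from_bytes, policy

# A tiny hand-written RFC 5322 message: headers, one blank line, then the body.
RAW = (
    b"From: Example <noreply@example.com>\r\n"
    b"To: user@example.org\r\n"
    b"Subject: Your login code\r\n"
    b"Date: Sun, 01 Feb 2026 20:12:33 +0000\r\n"
    b"Message-ID: <abc123@example.com>\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Your code is 123456\r\n"
)

# policy.default selects the modern API (EmailMessage, decoded headers).
msg = message_from_bytes(RAW, policy=policy.default)

print(msg["Subject"])
print(msg["Message-ID"])
print(msg.get_content_type())
print(msg.get_content().strip())
```

Real messages are rarely this tidy, which is what the next section is about.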

MIME is why “just parse the body” fails

If you only ever saw plain text emails, parsing would be easy. In practice:

Many messages are multipart/alternative (both text/plain and text/html)
Some are multipart/mixed (body plus attachments)
Some contain nested multiparts
Bodies can be encoded (quoted-printable, base64)
Character sets vary (UTF-8, ISO-8859-1, and more)

This is why regexing HTML or splitting on blank lines becomes fragile quickly.
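A small sketch of why the naive approach fails, using only the standard library. It builds a multipart/alternative message (sample content, invented for illustration), splits on the first blank line the way a naive parser would, and then compares that with walking the MIME tree.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a multipart/alternative message, then serialize it to raw bytes.
outgoing = EmailMessage()
outgoing["From"] = "noreply@example.com"
outgoing["To"] = "user@example.org"
outgoing["Subject"] = "Verify your account"
outgoing.set_content("Your code is 123456")
outgoing.add_alternative("<p>Your code is <b>123456</b></p>", subtype="html")
raw = outgoing.as_bytes()

# Naive approach: treat everything after the first blank line as "the body".
# The result still contains boundary markers and per-part headers.
naive_body = raw.split(b"\n\n", 1)[1]

# Structured approach: parse, then walk the MIME tree.
parsed = message_from_bytes(raw, policy=policy.default)
part_types = [part.get_content_type() for part in parsed.walk()]
print(part_types)
```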

From raw to JSON: a normalization pipeline that holds up in automation

A robust “raw to JSON” pipeline has a few clear stages. This is implementation-agnostic: you can do it with a library in your own service, or consume JSON produced by an inbox API.

Figure: flow diagram of an incoming email passing through five stages: Raw RFC 5322, MIME parse, Decode + normalize, Extract links/OTP, Output JSON for tests and LLM agents.

Stage 1: Parse structure (headers, MIME tree)

At this stage you want to:

Parse headers safely (handle folded headers, duplicates)
Build a MIME tree of parts
Identify candidate bodies (text/plain, text/html)
Identify attachments (filename, content-type, size)
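One way to sketch Stage 1 with the standard `email` package is shown below. The function name `describe_structure` and the shape of its return value are illustrative choices, not a standard API.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def describe_structure(raw: bytes) -> dict:
    """List repeated headers, candidate bodies, and attachments for one message."""
    msg = message_from_bytes(raw, policy=policy.default)
    bodies, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.is_attachment():
            payload = part.get_payload(decode=True) or b""
            attachments.append({
                "filename": part.get_filename(),
                "content_type": part.get_content_type(),
                "size": len(payload),
            })
        elif part.get_content_type() in ("text/plain", "text/html"):
            bodies.append(part.get_content_type())
    return {
        # get_all keeps every occurrence of a repeated header.
        "received_headers": msg.get_all("Received", []),
        "bodies": bodies,
        "attachments": attachments,
    }


# Sample message with one attachment (content is a stand-in, not a real PDF).
demo = EmailMessage()
demo["From"] = "billing@example.com"
demo["To"] = "user@example.org"
demo["Subject"] = "Your invoice"
demo.set_content("See attached invoice.")
demo.add_attachment(b"%PDF-1.4 demo", maintype="application",
                    subtype="pdf", filename="invoice.pdf")

structure = describe_structure(demo.as_bytes())
print(structure)
```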

Stage 2: Decode and normalize

Normalization is where most automation reliability comes from:

Decode transfer encodings (quoted-printable, base64)
Normalize line endings
Convert text to a consistent Unicode representation
Parse Date into an ISO timestamp (but keep the raw value for debugging)
Normalize address fields into structured objects (name, address)
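The normalization steps above can be sketched with the standard library as follows. The sample message is invented: it uses an encoded display name, a non-UTC Date header, and a quoted-printable ISO-8859-1 body so that each step has something to do.

```python
from datetime import timezone
from email import message_from_bytes, policy
from email.utils import getaddresses, parsedate_to_datetime

RAW = (
    b"From: =?utf-8?q?Caf=C3=A9_Support?= <noreply@example.com>\r\n"
    b"To: user@example.org\r\n"
    b"Date: Sun, 01 Feb 2026 21:12:33 +0100\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Votre code est 123456 =E0 utiliser.\r\n"
)

msg = message_from_bytes(RAW, policy=policy.default)

# get_content() undoes quoted-printable and decodes the declared charset;
# line endings are then normalized to "\n".
text = msg.get_content().replace("\r\n", "\n").strip()

# Keep the header text for debugging, and emit an ISO timestamp in UTC.
date_raw = str(msg["Date"])
date_iso = parsedate_to_datetime(date_raw).astimezone(timezone.utc).isoformat()

# Address headers become structured objects.
sender = [
    {"name": name or None, "address": address}
    for name, address in getaddresses([str(msg["From"])])
]

print(text, date_iso, sender)
```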

Stage 3: Choose and sanitize content

For automation and agents, prefer predictable content:

Prefer text/plain when available
Keep HTML, but treat it as secondary (good for rendering, risky for parsing)
Remove or ignore dangerous elements (scripts, unexpected redirects)

Stage 4: Extract automation artifacts

Instead of “understanding the whole email,” extract what your workflow needs:

Verification links (validated against an allowlist of final target hosts)
OTP candidates (with tight patterns and context checks)
Key identifiers (order ID, ticket ID)
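As a sketch of "tight patterns and context checks" for OTPs: the example below assumes six-digit numeric codes and a small vocabulary of context words. Both are assumptions to adapt to what your sender actually emits.

```python
import re

# Exactly six digits that are not part of a longer number.
OTP_PATTERN = re.compile(r"(?<!\d)(\d{6})(?!\d)")
# Words expected near a real code (assumed vocabulary).
CONTEXT_PATTERN = re.compile(r"\b(code|otp|passcode|verification)\b", re.IGNORECASE)


def extract_otp(text: str, window: int = 40):
    """Return the single 6-digit code with OTP-like context nearby, else None."""
    candidates = set()
    for match in OTP_PATTERN.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window]
        if CONTEXT_PATTERN.search(context):
            candidates.add(match.group(1))
    # Ambiguity is treated as a failure rather than a guess.
    return candidates.pop() if len(candidates) == 1 else None


print(extract_otp("Your code is 123456"))
print(extract_otp("Order 9876543210 has shipped"))
```

Returning `None` when two different codes match is deliberate: a test that fails loudly is easier to debug than one that submits the wrong code.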

Stage 5: Emit JSON with stable fields

Your JSON output should support:

Deterministic matching (message_id, inbox_id, correlation IDs)
Simple assertions (subject contains, from domain equals)
Minimal artifact extraction (otp, verification_url)
Debuggability (raw headers snapshot, received timestamp)

Here’s a helpful way to think about mapping raw email to JSON fields.

Raw email element	What it looks like	JSON you want for automation	Why it matters
Message-ID header	`Message-ID: <abc@domain>`	`message_id`	Deduplication and idempotency
Date header	`Date: Tue, 30 Jan...`	`date` (ISO), `date_raw`, plus a receiver-side `received_at`	Timing assertions, debugging delays
From/To	RFC 5322 address forms	`from: {name, address}`, `to: [...]`	Reliable sender checks
MIME parts	multipart boundaries	`text`, `html`, `attachments[]`	Avoid parsing the wrong part
Transfer encoding	base64, quoted-printable	decoded strings and bytes	Prevent garbage output
Links in body	HTML anchors, plain URLs	`links[]` (normalized)	Safer magic-link handling
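A minimal sketch of that mapping, using the standard `email` package, might look like the following. The function names and the exact field set are illustrative, and `received_at` is assumed to be supplied by whatever system accepted the message rather than read from the message itself.

```python
import json
from datetime import datetime, timezone
from email import message_from_bytes, policy


def _addresses(header) -> list:
    """Turn a parsed address header into [{name, address}, ...]."""
    if header is None:
        return []
    return [{"name": a.display_name or None, "address": a.addr_spec}
            for a in header.addresses]


def email_to_json(raw: bytes, received_at: datetime) -> dict:
    msg = message_from_bytes(raw, policy=policy.default)
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    senders = _addresses(msg["From"])
    return {
        "message_id": str(msg["Message-ID"]) if msg["Message-ID"] else None,
        "received_at": received_at.astimezone(timezone.utc).isoformat(),
        # Sender-provided Date header text, as exposed by the parser.
        "date_raw": str(msg["Date"]) if msg["Date"] else None,
        "from": senders[0] if senders else None,
        "to": _addresses(msg["To"]),
        "subject": str(msg["Subject"] or ""),
        "text": text_part.get_content() if text_part else None,
        "html": html_part.get_content() if html_part else None,
        "attachments": [
            {"filename": part.get_filename(),
             "content_type": part.get_content_type()}
            for part in msg.iter_attachments()
        ],
    }


RAW = (
    b"From: Example <noreply@example.com>\r\n"
    b"To: user@example.org\r\n"
    b"Subject: Your login code\r\n"
    b"Date: Sun, 01 Feb 2026 20:12:33 +0000\r\n"
    b"Message-ID: <abc123@example.com>\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Your code is 123456\r\n"
)

doc = email_to_json(RAW, datetime(2026, 2, 1, 20, 12, 35, tzinfo=timezone.utc))
print(json.dumps(doc, indent=2))
```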

Gotchas that break naive “open email” implementations

Even mature teams get burned by the same email edge cases. If you’re building a programmatic “open email” path, design for these up front.

Duplicate and folded headers

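Headers can legally repeat (Received usually appears several times), and long headers can be folded across multiple lines. If you naïvely map headers into a dictionary, later duplicates overwrite earlier ones, and if you split on newlines without unfolding, you truncate values.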
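Both problems are easy to reproduce. The sketch below uses an invented message with two Received headers and a folded Subject, parsed with Python's standard `email` package.

```python
from email import message_from_bytes, policy

RAW = (
    b"Received: from mx1.example.net by mx2.example.org; Sun, 01 Feb 2026 20:12:31 +0000\r\n"
    b"Received: from app.example.com by mx1.example.net; Sun, 01 Feb 2026 20:12:30 +0000\r\n"
    b"Subject: Your verification code for the\r\n"
    b" Example application\r\n"
    b"\r\n"
    b"body\r\n"
)

msg = message_from_bytes(RAW, policy=policy.default)

# Collapsing headers into a dict silently keeps only the last duplicate.
as_dict = dict(msg.items())

# get_all keeps every occurrence, in the order it appeared.
all_received = msg.get_all("Received")

# The parser unfolds the continuation line into a single value.
subject = str(msg["Subject"])

print(len(as_dict), len(all_received), subject)
```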

Choosing the wrong body

A lot of systems accidentally parse:

An HTML tracking pixel section instead of the user-visible content
A footer instead of the OTP line
A forwarded message inside the email

Prefer text/plain when possible, and be explicit about how you pick the “primary” body.

Encodings and character sets

If you do not consistently decode transfer encoding and charset, you will see:

Broken Unicode
Missing punctuation, which can break OTP extraction
Incorrect comparisons in tests
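A short sketch of what skipping each decoding step looks like in practice. The body text is a made-up sample, base64-encoded at runtime so the example stays self-contained.

```python
import base64
from email import message_from_bytes, policy

body = "Your code is 123456 (valid for 10 minutes) ✓"
RAW = (
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + base64.b64encode(body.encode("utf-8")) + b"\r\n"
)

msg = message_from_bytes(RAW, policy=policy.default)

# Skipping transfer decoding: still base64 text, so no OTP is visible.
undecoded = msg.get_payload()

# Decoding bytes with the wrong charset: the text is readable but corrupted.
wrong_charset = msg.get_payload(decode=True).decode("iso-8859-1")

# Handling both transfer encoding and charset: the original text.
decoded = msg.get_content()

print("123456" in undecoded, wrong_charset == body, decoded == body)
```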

Time is not a single field

Email timestamps are messy. The Date header is sender-provided and not always trustworthy. Your receiving system’s timestamp is often more useful for latency and timeouts.
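A small sketch of keeping the two timestamps separate. The `received_at` value here is a hard-coded stand-in for whatever your receiving system records on arrival.

```python
from datetime import datetime, timezone
from email import message_from_bytes, policy
from email.utils import parsedate_to_datetime

RAW = (
    b"Date: Sun, 01 Feb 2026 12:00:00 -0800\r\n"
    b"Subject: Your login code\r\n"
    b"\r\n"
    b"Your code is 123456\r\n"
)
msg = message_from_bytes(RAW, policy=policy.default)

# Claimed by the sender: useful context, not a source of truth.
sender_date = parsedate_to_datetime(str(msg["Date"]))

# Stamped by the receiver (stand-in value): use this for timeouts and latency.
received_at = datetime(2026, 2, 1, 20, 0, 5, tzinfo=timezone.utc)

# Both datetimes are timezone-aware, so the subtraction is safe.
skew_seconds = (received_at - sender_date).total_seconds()
print(skew_seconds)
```

A large or negative skew usually points at a sender clock problem, not a delivery delay, which is why timeouts are better measured from the receiver side.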

HTML parsing is a security boundary

If you run agents against email content, treat HTML as adversarial input. A safe strategy is:

Extract candidate links, then validate them against allowlists
Avoid “clicking” unknown URLs in automation
Keep raw content for audit, but do not feed full HTML into an LLM by default
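The extract-then-validate strategy can be sketched with the standard library's `html.parser` and `urllib.parse`. The allowlist contents are assumptions for illustration, and this sketch checks only the link's own host; it does not follow redirects.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"example.com", "auth.example.com"}  # assumed allowlist


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags without executing or rendering anything."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_allowed_links(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    allowed = []
    for link in collector.links:
        parts = urlsplit(link)
        # Exact host match over HTTPS only; lookalike hosts are rejected.
        if parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS:
            allowed.append(link)
    return allowed


sample_html = (
    '<a href="https://example.com/verify?token=abc">Verify</a>'
    '<a href="https://example.com.evil.test/verify">Verify</a>'
    '<a href="https://example.com@evil.test/verify">Verify</a>'
    '<a href="http://example.com/insecure">Verify</a>'
    '<a href="javascript:alert(1)">Verify</a>'
)
print(extract_allowed_links(sample_html))
```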

For deeper reliability guidance on parsing identifiers like Message-ID and related fields, Mailhook has a separate post focused on header parsing, "Headers Email Guide: What to Parse for Reliability."

A pragmatic JSON contract for LLM agents

Agents work best with small, structured inputs. Instead of giving an LLM an entire email (especially HTML), provide a compact JSON object that is:

Deterministic
Minimal
Traceable back to the raw message

An example “agent-safe” shape might look like this:

{
  "message_id": "<...>",
  "received_at": "2026-02-01T20:12:33Z",
  "from": {"address": "sender@example.com", "name": "Example"},
  "to": [{"address": "recipient@example.com", "name": null}],
  "subject": "Your login code",
  "text": "Your code is 123456",
  "links": ["https://example.com/verify?token=..."],
  "attachments": [{"filename": "invoice.pdf", "content_type": "application/pdf", "size": 48211}]
}

You can then add a second layer: a tiny extraction object your tests or agent tools actually consume (for example { "otp": "123456" }). This keeps your workflow simple and reduces LLM exposure to hostile content.

Build it yourself vs consume JSON from an inbox API

You have two broad approaches:

Parse raw emails yourself (via IMAP/POP, direct SMTP ingest, or provider APIs)
Use a programmable inbox service that gives you structured JSON and deterministic retrieval

Here’s a decision table that tends to match real-world engineering tradeoffs.

Approach	Best for	Common pain points	Typical outcome
IMAP mailbox scraping	Quick prototypes	Flaky searches, concurrency collisions, slow polling	Breaks in CI and parallel runs
Provider APIs (Gmail/Graph)	Internal tooling with accounts	OAuth, quotas, long-lived identities	Works, but heavy operationally
Run your own SMTP capture	Local integration tests	Deliverability differences vs real email	Great locally, incomplete in staging
Programmable inbox API with JSON output	QA automation, LLM agents, verification flows	Need to integrate another API	Most deterministic for automation

If your core need is “open an email programmatically and get JSON,” the key property is machine-readable output that doesn’t require HTML scraping.

Using Mailhook to open an email as JSON (webhook-first, polling fallback)

Mailhook is built around programmable disposable inboxes. Instead of creating a full email account, you create an inbox via API, use the generated address in your workflow, then receive messages as structured JSON.

Relevant Mailhook capabilities (from the product description):

Disposable inbox creation via API
Structured JSON email output
RESTful API access
Real-time webhook notifications
Polling API for emails
Signed payloads for security
Batch email processing
Shared domains and custom domain support

Because APIs evolve, the source of truth for endpoints and payloads is Mailhook’s implementation reference. Make sure to review llms.txt before you wire up agent tools or tests:

Mailhook llms.txt

Reference flow (conceptual)

A reliable automation flow looks like this:

Create a new inbox for the run (or agent session)
Trigger the system under test to send an email to that address
Wait for delivery (prefer webhook, use polling as fallback)
Consume the JSON payload
Extract only what you need (OTP/link)

Here is pseudocode that illustrates the shape of the integration without assuming any specific endpoint names:

# Pseudocode: consult https://mailhook.co/llms.txt for exact API fields and routes.

inbox = mailhook.create_inbox(
  webhook_url="https://your-service.example/mailhook/webhook"
)

email_address = inbox["address"]
inbox_id = inbox["inbox_id"]

app.trigger_signup(email=email_address)

# Webhook-first: your webhook handler stores the JSON message keyed by inbox_id.
# Polling fallback: wait with timeout and backoff.

message = mailhook.wait_for_message(inbox_id=inbox_id, timeout_seconds=60)

# extract_otp and extract_allowed_link are your own helpers; "text" may be absent.
otp = extract_otp(message.get("text") or "")
verify_url = extract_allowed_link(message.get("links", []))

assert otp is not None or verify_url is not None
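The wait step is where most flakiness hides, so here is one possible sketch of a deterministic wait with timeout and exponential backoff. It is not Mailhook's client: `fetch` is a hypothetical callable you supply, which might first check the store your webhook handler writes to and then fall back to a polling request.

```python
import time
from typing import Callable, Optional


def wait_for_message(
    fetch: Callable[[], Optional[dict]],
    timeout_seconds: float = 60.0,
    initial_delay: float = 0.5,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it returns a message or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    delay = initial_delay
    while True:
        message = fetch()
        if message is not None:
            return message
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"no message within {timeout_seconds} seconds")
        # Never sleep past the deadline; double the delay up to max_delay.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)


# Usage with a fake fetch that succeeds on the third attempt.
responses = iter([None, None, {"subject": "Your login code"}])
print(wait_for_message(lambda: next(responses), timeout_seconds=10,
                       sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the function testable without real delays.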

Verify webhook signatures

If you accept inbound webhooks, treat them like any other external request:

Verify the signature (Mailhook supports signed payloads)
Use idempotency to handle retries
Store only what you need, for as long as you need it

Again, the exact signing scheme and headers should come from the contract in llms.txt.
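For orientation only, here is what a generic HMAC-SHA256 check plus idempotent handling can look like. This is an assumed scheme, not Mailhook's documented one: the hex-encoded signature, the shared secret, and the function names are all hypothetical placeholders.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute the HMAC over the raw request body and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


_seen_message_ids = set()  # in production this would be a persistent store


def handle_once(message_id: str) -> bool:
    """Return True the first time a message_id is seen, False on retries."""
    if message_id in _seen_message_ids:
        return False
    _seen_message_ids.add(message_id)
    return True


# Usage with made-up values.
secret = b"example-shared-secret"
body = b'{"message_id": "<abc123@example.com>", "subject": "Your login code"}'
signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_signature(secret, body, signature))
print(verify_signature(secret, body + b" ", signature))
```

Verify against the raw request bytes, before any JSON parsing or re-serialization, since even whitespace changes alter the digest.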

Design tips that make email automation boring (in a good way)

The goal is not to “parse email perfectly,” it’s to make your automation predictable.

Prefer isolation and correlation

If multiple test runs or agent sessions share an inbox, you reintroduce the hardest problem: figuring out which message belongs to which run. Isolated inboxes avoid mailbox searching entirely.

Assert on intent, not presentation

HTML changes constantly. Your assertions should target stable properties:

Sender domain
Subject intent
Presence of a single OTP
A verification link whose host is in an allowlist
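Those four assertions can be sketched as a single helper over the normalized JSON. The field names follow the example shape earlier in this article; the expected domain, subject keyword, and six-digit OTP format are assumptions.

```python
import re
from urllib.parse import urlsplit


def assert_verification_email(message: dict, sender_domain: str,
                              allowed_hosts: set) -> None:
    """Assert on stable properties of the message, never on HTML layout."""
    # Sender domain
    assert message["from"]["address"].rsplit("@", 1)[-1] == sender_domain

    # Subject intent (keyword match, not exact wording)
    assert "code" in message["subject"].lower()

    # Presence of exactly one distinct six-digit OTP
    otps = set(re.findall(r"(?<!\d)\d{6}(?!\d)", message["text"]))
    assert len(otps) == 1

    # Every link host must be in the allowlist
    hosts = {urlsplit(url).hostname for url in message["links"]}
    assert hosts <= allowed_hosts


sample = {
    "from": {"address": "noreply@example.com", "name": "Example"},
    "subject": "Your login code",
    "text": "Your code is 123456",
    "links": ["https://example.com/verify?token=abc"],
}
assert_verification_email(sample, "example.com", {"example.com"})
print("ok")
```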

Keep the raw message available for debugging

When something fails, you want to know:

Did the message arrive?
What headers did it have?
Did you parse the correct MIME part?

This is where “raw plus normalized JSON” is helpful. The automation runs on normalized fields, while engineers debug with the raw context.

Where this leaves you

To open an email programmatically in 2026, you have two realistic options:

Become an email parsing expert (RFC 5322, MIME edge cases, encoding quirks, security pitfalls)
Use an inbox abstraction that already does the normalization and gives you JSON that your tests and agents can consume

If your primary need is agent workflows and QA reliability, the winning strategy is usually: treat email like an event stream, isolate inboxes per run, and consume structured JSON.

If you want to implement this with Mailhook, start with the contract in Mailhook llms.txt and design your tools around deterministic waits (webhook-first, polling fallback) and minimal artifact extraction.