What makes an email-to-JSON schema 'minimal'?

A minimal schema prioritizes stable identifiers over pretty fields, enables deterministic selection, treats email content as untrusted input, and can be extended without breaking existing integrations.

Why do you need both message_id and delivery_id?

message_id is used for deduplicating the message itself, while delivery_id handles webhook retries and multiple deliveries of the same message, ensuring at-least-once delivery safety.

How should AI agents handle email content safely?

Expose a minimized agent view with only essential fields like message_id, received_at, from.address, and artifacts. If body text is needed, include only plain text with length limits and strip quoted replies.

What should QA tests focus on when testing email automation?

Focus on correlation (message belongs to current attempt), intent (sender domain, template family), artifact correctness (OTP format, link validation), and idempotency, while avoiding assertions on exact subject lines or HTML structure.

Email to JSON: A Minimal Schema for Agents and QA

When you convert email to JSON for automation, the hardest part is not parsing MIME. It is deciding what your downstream code (and your agents) can safely rely on.

A brittle schema forces you to scrape HTML, overfit to one template, or leak too much untrusted content into an LLM. A minimal schema gives you the opposite: stable IDs, deterministic matching, and just enough content to extract the one artifact you actually need (OTP, magic link, ticket ID).

This post proposes a minimal, provider-agnostic “email to JSON” schema that works well for:

LLM agents that need a tight, tool-like interface
QA and CI pipelines that need deterministic assertions
Verification flows (sign-up, password reset, email login)

If you are integrating Mailhook specifically, the canonical, machine-readable integration reference is published at mailhook.co/llms.txt.

What “minimal” really means for email to JSON

“Minimal” does not mean “tiny.” It means:

Stable identifiers over pretty fields: IDs and timestamps beat display names and rendered HTML.
Deterministic selection: you can decide which message to use without guessing.
Layered trust: treat all email content as untrusted input, and keep agent-facing views constrained.
Extensible without breaking: you can add fields later without changing the contract.

A useful mental model is: you want a schema that can support the same workflow whether the message is plain text, multipart/alternative, forwarded, or slightly changed by a template update.

The minimal schema (recommended fields)

At minimum, you want five groups of fields:

Identity (for idempotency and dedupe)
Routing (who it was sent to, and which inbox it belongs to)
Content (text first, HTML optional)
Artifacts (OTP, verification URL, or other extracted outputs)
Provenance (raw source or a reference to it, for debugging)

Below is a practical schema that stays small, but is robust enough for agents and QA.

1) Identity and lifecycle

You need two IDs in most real systems:

A message identifier (dedupe the message itself)
A delivery identifier (dedupe webhook retries or multiple deliveries of the same message)

Also include a received timestamp, and (optionally) an expires time if your inbox is ephemeral.
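
As a rough illustration of why the two IDs do different jobs, here is a minimal sketch of a classifier that separates webhook retries from repeated deliveries of the same message. The `DedupeStore` class and its return labels are hypothetical names for this post, and the in-memory sets stand in for whatever durable store you would actually use.

```python
class DedupeStore:
    """In-memory stand-in for a durable store (assumption: you would persist these keys)."""

    def __init__(self) -> None:
        self.seen_deliveries: set[str] = set()
        self.seen_messages: set[str] = set()

    def classify(self, payload: dict) -> str:
        """Return 'retry', 'duplicate_message', or 'new' for an incoming payload."""
        delivery_id = payload.get("delivery_id")  # recommended field, may be absent
        message_id = payload["message_id"]        # required field

        # Same delivery seen before: a webhook retry, safe to acknowledge and skip.
        if delivery_id is not None and delivery_id in self.seen_deliveries:
            return "retry"
        if delivery_id is not None:
            self.seen_deliveries.add(delivery_id)

        # New delivery of a message we already processed.
        if message_id in self.seen_messages:
            return "duplicate_message"
        self.seen_messages.add(message_id)
        return "new"
```

Only payloads classified as `new` should advance workflow state; the other two outcomes can be acknowledged and dropped.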

2) Routing (envelope and addresses)

Automations fail when they only look at To: (a header) instead of the actual routed recipient (the envelope).

Your JSON should clearly expose:

The routed recipient(s) (envelope recipients, when available)
The parsed From, To, Cc headers as structured address objects

Structured address objects should separate:

address (the actual email)
name (display name, optional)

Do not force downstream code to re-parse RFC 5322 address lists if you can avoid it.
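
One way to produce those structured objects at ingestion time is Python's standard `email.utils.getaddresses`. The sketch below is illustrative only: the function name `parse_address_list` is made up for this post, and the addresses in the usage note are placeholders.

```python
from email.utils import getaddresses


def parse_address_list(header_value: str) -> list[dict]:
    """Turn a raw From/To/Cc header value into structured address objects."""
    parsed = []
    for name, address in getaddresses([header_value]):
        if not address:
            # Skip entries the parser could not resolve to an address.
            continue
        parsed.append({"address": address, "name": name or None})
    return parsed
```

For example, `parse_address_list("Example Team <noreply@example.com>, qa@example.test")` yields two objects, the second with `name` set to `None`.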

3) Content (prefer `text`)

For automation and agents, text/plain is the default. HTML is often present but unsafe to “interpret” (tracking pixels, hidden text, open redirects, prompt injection embedded in markup).

A minimal schema should include:

text: normalized plain text body (string, possibly empty)
html: raw HTML body (optional)

In many pipelines, you will also want a subject, but treat it as a hint, not as a primary key.
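
A text-first extraction step could look roughly like the following, using only the standard library `email` package. The function name `extract_content` and the whitespace normalization rules are assumptions for illustration; your own normalization policy may differ.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def extract_content(raw_source: str) -> dict:
    """Parse RFC 5322 source and return subject, normalized text, and optional HTML."""
    msg = message_from_string(raw_source, policy=policy.default)
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    text = text_part.get_content() if text_part is not None else ""
    # Normalize line endings and trailing whitespace so assertions stay stable.
    lines = [line.rstrip() for line in text.replace("\r\n", "\n").split("\n")]
    subject = msg["subject"]
    return {
        "subject": str(subject) if subject is not None else None,
        "text": "\n".join(lines).strip(),  # always a string, possibly empty
        "html": html_part.get_content() if html_part is not None else None,
    }


# Build a small multipart/alternative message to exercise the function.
sample = EmailMessage()
sample["Subject"] = "Your verification code"
sample.set_content("Your code is 483921.\n")
sample.add_alternative("<p>Your code is <b>483921</b>.</p>", subtype="html")
content = extract_content(sample.as_string())
```

Note that `text` falls back to an empty string rather than to stripped HTML, which keeps the "HTML is hostile" rule explicit for downstream code.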

4) Artifacts (the reason you opened the email)

Most workflows do not need the whole message. They need an artifact:

OTP code
Verification link
Password reset link
One-time sign-in link

Put artifacts in their own structured section so your test harness or agent tool can be “artifact-first,” instead of scraping.
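
To make "artifact-first" concrete, here is a small sketch of an extractor that works on the plain text body. Everything specific in it is an assumption: six-digit numeric OTPs, HTTPS-only links, the `verification_url` key, and the allowlisted host `app.example.com` are placeholders to adapt to your own flows.

```python
import re
from urllib.parse import urlparse

OTP_PATTERN = re.compile(r"(?<!\d)\d{6}(?!\d)")   # assumption: 6-digit numeric codes
URL_PATTERN = re.compile(r"https://[^\s<>\"']+")  # assumption: HTTPS links only
ALLOWED_LINK_HOSTS = {"app.example.com"}          # hypothetical allowlist


def extract_artifacts(text: str) -> dict:
    """Extract an OTP and/or an allowlisted link from plain text."""
    artifacts: dict = {}

    # Only accept an OTP when exactly one distinct candidate exists,
    # so ambiguous messages fail loudly instead of guessing.
    codes = set(OTP_PATTERN.findall(text))
    if len(codes) == 1:
        artifacts["otp"] = {"code": codes.pop()}

    for candidate in URL_PATTERN.findall(text):
        url = candidate.rstrip(".,)")  # drop trailing sentence punctuation
        if urlparse(url).hostname in ALLOWED_LINK_HOSTS:
            artifacts["verification_url"] = {"url": url}
            break

    return artifacts
```

Refusing to pick between two candidate codes is a deliberate choice here: a missing artifact is easier to debug than a wrong one.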

5) Provenance and debugging

Even if your default path is “minimal,” you need an escape hatch for debugging and audits:

raw: the original RFC 5322 source (or a reference to retrieve it)
headers: a normalized map, or a curated subset of headers

If you store raw email, apply retention limits and access controls appropriate to your environment.

A concrete minimal schema (field table)

The table below is a compact contract you can implement and version.

Field	Type	Required	Why it exists (agents + QA)
`message_id`	string	Yes	Stable message-level dedupe, correlation across systems. Often derived from `Message-ID`, but do not assume it is always present or unique without normalization.
`delivery_id`	string	Recommended	Dedupe webhook retries and “at-least-once” delivery semantics.
`inbox_id`	string	Recommended	Makes receipt deterministic: consumers fetch from a specific inbox instead of scanning a shared mailbox.
`received_at`	string (RFC 3339)	Yes	Deterministic time budget, ordering hints, observability.
`expires_at`	string (RFC 3339), optional	Optional	Useful for disposable inboxes and cleanup in CI.
`from`	object	Yes	Structured sender identity for matchers and audit.
`to`	array	Yes	Intended recipients (header). Useful but not authoritative for routing.
`envelope_to`	array	Recommended	The routed recipients (envelope). Critical for reliable routing and debugging.
`subject`	string	Optional	Debugging and coarse filtering. Avoid using as a primary matcher.
`headers`	object	Optional	For reliability and audit. Prefer a curated set, not an unbounded dump, for agent-facing views.
`text`	string	Yes	Primary content for deterministic extraction.
`html`	string	Optional	Debugging or fallback extraction only. Treat as hostile.
`attachments`	array	Optional	Most flows ignore attachments, but you need metadata when they exist.
`artifacts`	object	Recommended	OTPs, links, and other extracted outputs your workflow actually needs.
`raw`	string or object reference	Optional	Ground truth for debugging parsing edge cases.

A minimal attachment object can be:

filename (string)
content_type (string)
size_bytes (number)
content_base64 (string, optional) or download_url (string, optional)

Choose one: embed content, or provide a retrievable reference, depending on your security model.
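
If you want to enforce the contract at the boundary, a lightweight validator is often enough. The sketch below checks the fields marked "Yes" in the table plus the attachment rule; the function name `validate_message` and the error strings are illustrative, not part of any provider's API.

```python
from datetime import datetime

# Fields marked "Yes" in the table above.
REQUIRED_FIELDS = ("message_id", "received_at", "from", "to", "text")


def validate_message(message: dict) -> list[str]:
    """Return a list of contract violations (empty when the message conforms)."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in message:
            errors.append(f"missing required field: {field}")

    received_at = message.get("received_at")
    if isinstance(received_at, str):
        try:
            # fromisoformat on older Pythons does not accept a trailing "Z".
            datetime.fromisoformat(received_at.replace("Z", "+00:00"))
        except ValueError:
            errors.append("received_at is not a valid timestamp")

    if "text" in message and not isinstance(message["text"], str):
        errors.append("text must be a string")

    for index, attachment in enumerate(message.get("attachments", [])):
        if "content_base64" in attachment and "download_url" in attachment:
            errors.append(f"attachments[{index}] has both embedded content and a reference")
    return errors
```

Unknown extra fields are ignored on purpose, so the schema can grow without breaking existing consumers.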

Figure: an email JSON object with five labeled groups (Identity, Routing, Content, Artifacts, and Provenance), each pointing to example fields such as message_id, envelope_to, text, otp, and raw.

Example: verification email as minimal JSON

Here is an example payload that is small, testable, and agent-friendly. It includes both message identity and an extracted artifact.

{
  "message_id": "msg_01J7Z6D5Y8H9K2...",
  "delivery_id": "dly_01J7Z6D61T0F...",
  "inbox_id": "inb_01J7Z6CZZQ1M...",
  "received_at": "2026-03-09T21:08:12Z",
  "from": { "address": "[email protected]", "name": "Example" },
  "to": [{ "address": "[email protected]", "name": null }],
  "envelope_to": ["[email protected]"],
  "subject": "Your verification code",
  "text": "Your code is 483921. It expires in 10 minutes.",
  "artifacts": {
    "otp": { "code": "483921", "confidence": "high" }
  }
}

Notes:

The agent does not need HTML.
QA assertions can be written against artifacts.otp.code, not a template.
delivery_id lets you make your webhook handler idempotent.

Designing an agent-safe view (don’t hand an LLM the whole email)

For LLM agents, the most reliable pattern is to expose two representations:

Full message JSON (for engineers, debugging, and controlled processing)
Minimized agent view (for the agent tool call)

A minimized view can be as simple as:

message_id, received_at
from.address
artifacts only

This reduces prompt injection surface area and prevents the agent from “wandering” through untrusted content.

If your agent must inspect body text, consider including:

text only
a hard length limit
a policy that strips quoted replies and signatures
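
A minimized view can be produced by a small projection function. The sketch below is one possible shape, not a prescribed one: the 500-character limit, the `>` quote prefix, and the `-- ` signature delimiter are simplifying assumptions, and real-world reply formats vary more than this.

```python
MAX_AGENT_TEXT_CHARS = 500  # assumption: pick a limit that fits your prompt budget


def strip_quoted_and_signature(text: str) -> str:
    """Drop '>' quoted lines and everything after a conventional '-- ' signature line."""
    kept = []
    for line in text.splitlines():
        if line.rstrip() == "--":
            break
        if line.startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def to_agent_view(message: dict, include_text: bool = False) -> dict:
    """Project the full message JSON down to the fields an agent is allowed to see."""
    view = {
        "message_id": message["message_id"],
        "received_at": message["received_at"],
        "from": {"address": message["from"]["address"]},
        "artifacts": message.get("artifacts", {}),
    }
    if include_text:
        cleaned = strip_quoted_and_signature(message.get("text", ""))
        view["text"] = cleaned[:MAX_AGENT_TEXT_CHARS]
    return view
```

Because the view is built by allowlisting fields rather than deleting them, new fields added to the full schema never leak into the agent's context by accident.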

QA: deterministic assertions and matchers

Once you have a minimal schema, your test harness should become boring in a good way.

Good QA assertions typically check:

Correlation: message belongs to the current attempt (via inbox isolation or a correlation token you control)
Intent: sender domain, expected template family, or presence of a specific artifact type
Artifact correctness: OTP format, link host allowlist, token length
Idempotency: the same delivery_id does not advance state twice

Avoid asserting on:

exact subject lines (often localized or A/B tested)
HTML structure and CSS
display names
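
In a test harness, those four checks might be written along these lines. This is a sketch under assumptions: the six-digit OTP format, the helper names, and the `state` dictionary are all hypothetical and stand in for your own fixtures.

```python
import re


def check_verification_email(message: dict, *, inbox_id: str, sender_domain: str) -> None:
    """Assert on stable fields only; subject, HTML, and display names are ignored."""
    # Correlation: the message arrived in the inbox created for this attempt.
    assert message["inbox_id"] == inbox_id
    # Intent: expected sender domain and artifact type.
    assert message["from"]["address"].rsplit("@", 1)[-1].lower() == sender_domain
    assert "otp" in message["artifacts"]
    # Artifact correctness: assumption of a 6-digit numeric code.
    assert re.fullmatch(r"\d{6}", message["artifacts"]["otp"]["code"])


def apply_once(state: dict, message: dict) -> bool:
    """Advance state only the first time a given delivery_id is seen."""
    if message["delivery_id"] in state["processed"]:
        return False
    state["processed"].add(message["delivery_id"])
    state["verified_count"] += 1
    return True
```

The idempotency check then becomes a plain assertion: calling `apply_once` twice with the same message must leave `verified_count` at 1.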

Webhooks and polling: what the schema must support

Whether you receive emails via webhooks or polling, the JSON must let you implement two reliability properties:

At-least-once delivery safety (duplicates can happen)
Deterministic selection (choose the right message under retries)

That is why delivery_id and message_id matter, even in a “minimal” model.
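
Deterministic selection can be sketched as a pure function over whatever messages you have received so far. The policy below (newest matching message wins, ties broken by `message_id`) is one reasonable choice for verification flows, not the only one; the function and parameter names are made up for this post.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2026-03-09T21:08:12Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_message(messages: list[dict], *, inbox_id: str, not_before: str,
                   artifact_type: str) -> Optional[dict]:
    """Pick one message deterministically, regardless of retries or arrival order."""
    threshold = parse_ts(not_before)
    by_message_id: dict = {}
    for message in messages:
        if message.get("inbox_id") != inbox_id:
            continue
        if parse_ts(message["received_at"]) < threshold:
            continue
        if artifact_type not in message.get("artifacts", {}):
            continue
        # Collapse multiple deliveries of the same message into one candidate.
        by_message_id.setdefault(message["message_id"], message)
    if not by_message_id:
        return None
    return max(by_message_id.values(),
               key=lambda m: (parse_ts(m["received_at"]), m["message_id"]))
```

Passing the attempt's start time as `not_before` keeps stale messages from an earlier attempt out of the candidate set.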

If you are consuming webhook events, also include (either in headers or payload metadata):

a timestamp
a signature (or signed payload)

Then your consumer can verify authenticity before processing.
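
For orientation only, here is what a generic timestamp-plus-HMAC check can look like. This is not Mailhook's verification scheme: the signing string (`timestamp.body`), the hex-encoded SHA-256 digest, and the five-minute tolerance are all assumptions chosen for the sketch, so follow your provider's documented rules in real code.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # assumption: reject events older or newer than 5 minutes


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Hypothetical scheme: hex HMAC-SHA256 over '<timestamp>.<raw body>'."""
    return hmac.new(secret, timestamp.encode() + b"." + body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, timestamp: str, signature: str, body: bytes,
                   now: Optional[float] = None) -> bool:
    """Check freshness first, then compare signatures in constant time."""
    current = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(current - sent_at) > MAX_SKEW_SECONDS:
        return False
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Verify against the raw request bytes, before any JSON parsing, so that re-serialization cannot change what was signed.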

Mailhook supports real-time webhook notifications and signed payloads, plus a polling API fallback. For the exact request/response shapes and verification rules, use the canonical reference at mailhook.co/llms.txt.

Implementation notes (how this maps to Mailhook without overfitting)

Mailhook’s core idea is “email as a tool primitive”:

Create disposable inboxes via API
Receive emails as structured JSON
Get webhook notifications in real time (or poll)

To stay provider-agnostic, you can define an internal interface like:

create_inbox() -> { inbox_id, email, expires_at? }
wait_for_message(inbox_id, matcher, deadline) -> MessageJSON
extract_artifact(MessageJSON) -> { otp | url | ... }
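
In Python, that interface could be sketched as a `Protocol` plus a polling helper. The `list_messages` method and the `InMemoryProvider` test double are hypothetical names introduced here; a real adapter would wrap your provider's actual API calls behind the same shape.

```python
import time
from typing import Callable, Protocol


class InboxProvider(Protocol):
    """Provider-agnostic surface; method names are illustrative."""

    def create_inbox(self) -> dict: ...
    def list_messages(self, inbox_id: str) -> list[dict]: ...


def wait_for_message(provider: InboxProvider, inbox_id: str,
                     matcher: Callable[[dict], bool],
                     deadline_seconds: float, poll_interval: float = 0.5) -> dict:
    """Poll until a message satisfies the matcher or the time budget runs out."""
    deadline = time.monotonic() + deadline_seconds
    while True:
        for message in provider.list_messages(inbox_id):
            if matcher(message):
                return message
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no matching message in {inbox_id}")
        time.sleep(poll_interval)


class InMemoryProvider:
    """Test double; a real adapter would call the provider's API instead."""

    def __init__(self) -> None:
        self.inboxes: dict = {}

    def create_inbox(self) -> dict:
        inbox_id = f"inb_{len(self.inboxes) + 1}"
        self.inboxes[inbox_id] = []
        return {"inbox_id": inbox_id, "email": f"{inbox_id}@inbox.example.test"}

    def list_messages(self, inbox_id: str) -> list[dict]:
        return list(self.inboxes[inbox_id])
```

Keeping the matcher as a plain callable lets tests and agents share the same waiting logic while supplying their own selection rules.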

Mailhook is designed to fit this model. Use shared domains for quick starts, or custom domain support when you need allowlisting and environment isolation.

The “minimal schema” checklist

If you are reviewing an email-to-JSON integration (or a vendor), this short checklist catches most schema problems:

You can dedupe webhook retries (delivery_id or equivalent)
You can dedupe messages (message_id or equivalent)
You can tie the message to an isolated inbox (inbox_id)
You get text content reliably, without scraping HTML
You can represent artifacts separately from the body
You can retrieve raw source (or at least enough provenance) when debugging

A minimal schema is not about losing information. It is about making the default path deterministic, safe, and easy to assert on.

If you want to implement this with Mailhook, start with the canonical contract at mailhook.co/llms.txt and the product overview at mailhook.co.
