When you convert email to JSON for automation, the hardest part is not parsing MIME. It is deciding what your downstream code (and your agents) can safely rely on.
A brittle schema forces you to scrape HTML, overfit to one template, or leak too much untrusted content into an LLM. A minimal schema gives you the opposite: stable IDs, deterministic matching, and just enough content to extract the one artifact you actually need (OTP, magic link, ticket ID).
This post proposes a minimal, provider-agnostic “email to JSON” schema that works well for:
- LLM agents that need a tight, tool-like interface
- QA and CI pipelines that need deterministic assertions
- Verification flows (sign-up, password reset, email login)
If you are integrating Mailhook specifically, the canonical, machine-readable integration reference is published at mailhook.co/llms.txt.
What “minimal” really means for email to JSON
“Minimal” does not mean “tiny.” It means:
- Stable identifiers over pretty fields: IDs and timestamps beat display names and rendered HTML.
- Deterministic selection: you can decide which message to use without guessing.
- Layered trust: treat all email content as untrusted input, and keep agent-facing views constrained.
- Extensible without breaking: you can add fields later without changing the contract.
A useful mental model is: you want a schema that can support the same workflow whether the message is plain text, multipart/alternative, forwarded, or slightly changed by a template update.
The minimal schema (recommended fields)
At minimum, you want five groups of fields:
- Identity (for idempotency and dedupe)
- Routing (who it was sent to, and which inbox it belongs to)
- Content (text first, HTML optional)
- Artifacts (OTP, verification URL, or other extracted outputs)
- Provenance (raw source or a reference to it, for debugging)
Below is a practical schema that stays small, but is robust enough for agents and QA.
1) Identity and lifecycle
You need two IDs in most real systems:
- A message identifier (dedupe the message itself)
- A delivery identifier (dedupe webhook retries or multiple deliveries of the same message)
Also include a received timestamp and, optionally, an expiry time if your inbox is ephemeral.
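As a rough illustration of how the two IDs and the optional expiry work together, here is a minimal sketch. The `SeenStore` class and `is_expired` helper are hypothetical names, and the in-memory sets stand in for whatever durable store you actually use.

```python
from datetime import datetime, timezone


class SeenStore:
    """In-memory stand-in for a durable dedupe store (hypothetical helper)."""

    def __init__(self) -> None:
        self.deliveries: set[str] = set()
        self.messages: set[str] = set()

    def classify(self, payload: dict) -> str:
        """Return 'retry', 'duplicate_message', or 'new' for a payload."""
        delivery_id = payload.get("delivery_id")
        message_id = payload["message_id"]
        if delivery_id is not None and delivery_id in self.deliveries:
            return "retry"  # same delivery seen before, e.g. a webhook retry
        if delivery_id is not None:
            self.deliveries.add(delivery_id)
        if message_id in self.messages:
            return "duplicate_message"  # new delivery of a known message
        self.messages.add(message_id)
        return "new"


def is_expired(payload: dict, now: datetime) -> bool:
    """True when an optional expires_at timestamp is at or before `now`."""
    expires_at = payload.get("expires_at")
    if expires_at is None:
        return False
    # Older Pythons reject a trailing "Z" in fromisoformat, so rewrite it.
    return datetime.fromisoformat(expires_at.replace("Z", "+00:00")) <= now
```

Only payloads classified as "new" should advance workflow state; the other two outcomes can be acknowledged and dropped.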
2) Routing (envelope and addresses)
Automations fail when they only look at To: (a header) instead of the actual routed recipient (the envelope).
Your JSON should clearly expose:
- The routed recipient(s) (envelope recipients, when available)
- The parsed From, To, and Cc headers as structured address objects
Structured address objects should separate:
- address (the actual email)
- name (display name, optional)
Do not force downstream code to re-parse RFC 5322 address lists if you can avoid it.
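One way to produce those structured address objects on the producer side is Python's standard `email.utils.getaddresses`. This is a sketch, not a full RFC 5322 implementation; `parse_address_list` is a hypothetical helper name.

```python
from email.utils import getaddresses


def parse_address_list(header_value: str) -> list[dict]:
    """Turn a raw address-list header into [{"address", "name"}] objects."""
    result = []
    for name, address in getaddresses([header_value]):
        if not address:
            continue  # skip entries the parser could not resolve
        result.append({"address": address, "name": name or None})
    return result
```

For example, `parse_address_list('Example Sender <no-reply@example.com>, user@example.org')` yields two objects, the second with `name` set to `None`. Envelope recipients come from the SMTP transaction rather than from headers, so they cannot be derived this way.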
3) Content (prefer text)
For automation and agents, text/plain is the default. HTML is often present but unsafe to “interpret” (tracking pixels, hidden text, open redirects, prompt injection embedded in markup).
A minimal schema should include:
- text: normalized plain text body (string, possibly empty)
- html: raw HTML body (optional)
In many pipelines, you will also want a subject, but treat it as a hint, not as a primary key.
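If you are building the text and html fields yourself, the standard library's `email` package can pick the text/plain part for you. The sketch below assumes you have the raw source as a string; `extract_bodies` is a hypothetical helper, and the normalization (trimming trailing whitespace) is one possible policy, not a standard.

```python
import email
from email import policy


def extract_bodies(raw_source: str) -> dict:
    """Return subject, normalized text, and raw HTML from an RFC 5322 source."""
    msg = email.message_from_string(raw_source, policy=policy.default)
    plain_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    text = plain_part.get_content() if plain_part is not None else ""
    # Normalize line endings and trailing whitespace for stable matching.
    text = "\n".join(line.rstrip() for line in text.splitlines()).strip()
    subject = msg["subject"]
    return {
        "subject": str(subject) if subject is not None else None,
        "text": text,  # always a string, possibly empty
        "html": html_part.get_content() if html_part is not None else None,
    }
```

The HTML is passed through untouched on purpose: it is stored for debugging, not interpreted.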
4) Artifacts (the reason you opened the email)
Most workflows do not need the whole message. They need an artifact:
- OTP code
- Verification link
- Password reset link
- One-time sign-in link
Put artifacts in their own structured section so your test harness or agent tool can be “artifact-first,” instead of scraping.
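A sketch of what "artifact-first" extraction might look like is below. The six-digit OTP pattern, the `links` key, and the `extract_artifacts` name are assumptions for illustration; real senders vary, so tune the patterns to the templates you control.

```python
import re
from urllib.parse import urlparse

OTP_PATTERN = re.compile(r"(?<!\d)\d{6}(?!\d)")  # assumes 6-digit codes
URL_PATTERN = re.compile(r"https://[^\s<>\"']+")  # HTTPS links only


def extract_artifacts(text: str, allowed_hosts: set[str]) -> dict:
    """Pull an OTP and allowlisted links out of the plain text body."""
    artifacts: dict = {}
    codes = set(OTP_PATTERN.findall(text))
    if len(codes) == 1:  # ambiguous matches are reported as "no OTP"
        artifacts["otp"] = {"code": codes.pop()}
    links = []
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".,)")  # drop trailing sentence punctuation
        if urlparse(url).hostname in allowed_hosts:
            links.append(url)
    if links:
        artifacts["links"] = links
    return artifacts
```

Refusing to guess when several candidate codes appear keeps the extraction deterministic, and the host allowlist keeps untrusted links out of the artifact section.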
5) Provenance and debugging
Even if your default path is “minimal,” you need an escape hatch for debugging and audits:
- raw: the original RFC 5322 source (or a reference to retrieve it)
- headers: a normalized map, or a curated subset of headers
If you store raw email, apply retention limits and access controls appropriate to your environment.
A concrete minimal schema (field table)
The table below is a compact contract you can implement and version.
| Field | Type | Required | Why it exists (agents + QA) |
|---|---|---|---|
| message_id | string | Yes | Stable message-level dedupe, correlation across systems. Often derived from Message-ID, but do not assume it is always present or unique without normalization. |
| delivery_id | string | Recommended | Dedupe webhook retries and “at-least-once” delivery semantics. |
| inbox_id | string | Recommended | Makes receipt deterministic, consumers fetch from a specific inbox instead of scanning a shared mailbox. |
| received_at | string (RFC 3339) | Yes | Deterministic time budget, ordering hints, observability. |
| expires_at | string (RFC 3339) | Optional | Useful for disposable inboxes and cleanup in CI. |
| from | object | Yes | Structured sender identity for matchers and audit. |
| to | array | Yes | Intended recipients (header). Useful but not authoritative for routing. |
| envelope_to | array | Recommended | The routed recipients (envelope). Critical for reliable routing and debugging. |
| subject | string | Optional | Debugging and coarse filtering. Avoid using as a primary matcher. |
| headers | object | Optional | For reliability and audit. Prefer a curated set, not an unbounded dump, for agent-facing views. |
| text | string | Yes | Primary content for deterministic extraction. |
| html | string | Optional | Debugging or fallback extraction only. Treat as hostile. |
| attachments | array | Optional | Most flows ignore attachments, but you need metadata when they exist. |
| artifacts | object | Recommended | OTPs, links, and other extracted outputs your workflow actually needs. |
| raw | string or object reference | Optional | Ground truth for debugging parsing edge cases. |
A minimal attachment object can be:
- filename (string)
- content_type (string)
- size_bytes (number)
- content_base64 (string, optional) or download_url (string, optional)
Choose one: embed content, or provide a retrievable reference, depending on your security model.
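To make the field table enforceable, a consumer can run a small contract check on every payload. The sketch below is one possible shape, assuming the field names from the table; `validate_message` is a hypothetical helper, and `datetime.fromisoformat` only approximates RFC 3339 validation.

```python
from datetime import datetime
from typing import Any

# Field name -> accepted Python type(s), following the field table.
REQUIRED_FIELDS: dict[str, Any] = {
    "message_id": str, "received_at": str, "from": dict, "to": list, "text": str,
}
OPTIONAL_FIELDS: dict[str, Any] = {
    "delivery_id": str, "inbox_id": str, "expires_at": str, "envelope_to": list,
    "subject": str, "headers": dict, "html": str, "attachments": list,
    "artifacts": dict, "raw": (str, dict),
}
TIMESTAMP_FIELDS = ("received_at", "expires_at")


def validate_message(payload: dict[str, Any]) -> list[str]:
    """Return a list of contract violations; an empty list means valid."""
    errors: list[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing required field: {name}")
        elif not isinstance(payload[name], expected):
            errors.append(f"wrong type for field: {name}")
    for name, expected in OPTIONAL_FIELDS.items():
        value = payload.get(name)
        if value is not None and not isinstance(value, expected):
            errors.append(f"wrong type for field: {name}")
    for name in TIMESTAMP_FIELDS:
        value = payload.get(name)
        if isinstance(value, str):
            try:
                # Older Pythons reject a trailing "Z", so rewrite it first.
                datetime.fromisoformat(value.replace("Z", "+00:00"))
            except ValueError:
                errors.append(f"invalid timestamp: {name}")
    # Unknown keys are ignored on purpose so fields can be added later.
    return errors
```

Ignoring unknown keys is the design choice that keeps the contract extensible: producers can add fields without breaking existing consumers.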

Example: verification email as minimal JSON
Here is an example payload that is small, testable, and agent-friendly. It includes both message identity and an extracted artifact.
{
  "message_id": "msg_01J7Z6D5Y8H9K2...",
  "delivery_id": "dly_01J7Z6D61T0F...",
  "inbox_id": "inb_01J7Z6CZZQ1M...",
  "received_at": "2026-03-09T21:08:12Z",
  "from": { "address": "[email protected]", "name": "Example" },
  "to": [{ "address": "[email protected]", "name": null }],
  "envelope_to": ["[email protected]"],
  "subject": "Your verification code",
  "text": "Your code is 483921. It expires in 10 minutes.",
  "artifacts": {
    "otp": { "code": "483921", "confidence": "high" }
  }
}
Notes:
- The agent does not need HTML.
- QA assertions can be written against artifacts.otp.code, not a template.
- delivery_id lets you make your webhook handler idempotent.
Designing an agent-safe view (don’t hand an LLM the whole email)
For LLM agents, the most reliable pattern is to expose two representations:
- Full message JSON (for engineers, debugging, and controlled processing)
- Minimized agent view (for the agent tool call)
A minimized view can be as simple as:
- message_id, received_at
- from.address
- artifacts only
This reduces prompt injection surface area and prevents the agent from “wandering” through untrusted content.
If your agent must inspect body text, consider including:
- text only
- a hard length limit
- a policy that strips quoted replies and signatures
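The minimized view described above can be built with a small projection function. This is a sketch under assumptions: `to_agent_view` and `strip_quoted_and_signature` are hypothetical names, the 2000-character default is arbitrary, and the quote/signature heuristics (lines starting with ">" and a "--" delimiter line) are common conventions rather than guarantees.

```python
def strip_quoted_and_signature(text: str) -> str:
    """Drop quoted reply lines and everything after a signature delimiter."""
    kept = []
    for line in text.splitlines():
        if line.rstrip() == "--":
            break  # conventional signature delimiter: ignore the rest
        if line.lstrip().startswith(">"):
            continue  # quoted reply line
        kept.append(line)
    return "\n".join(kept).strip()


def to_agent_view(message: dict, include_text: bool = False,
                  max_chars: int = 2000) -> dict:
    """Project the full message JSON down to what the agent may see."""
    view = {
        "message_id": message["message_id"],
        "received_at": message["received_at"],
        "from": {"address": message["from"]["address"]},
        "artifacts": message.get("artifacts", {}),
    }
    if include_text:
        cleaned = strip_quoted_and_signature(message.get("text", ""))
        view["text"] = cleaned[:max_chars]  # hard length limit
    return view
```

Because the view is built by allowlisting fields, html, headers, and raw never reach the agent unless someone adds them deliberately.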
QA: deterministic assertions and matchers
Once you have a minimal schema, your test harness should become boring in a good way.
Good QA assertions typically check:
- Correlation: message belongs to the current attempt (via inbox isolation or a correlation token you control)
- Intent: sender domain, expected template family, or presence of a specific artifact type
- Artifact correctness: OTP format, link host allowlist, token length
- Idempotency: the same delivery_id does not advance state twice
Avoid asserting on:
- exact subject lines (often localized or A/B tested)
- HTML structure and CSS
- display names
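The correlation, intent, and artifact checks listed above might look like this in a test harness. Everything here is illustrative: `check_verification_message` is a hypothetical helper, the `verification_url` artifact key is an assumption, and the six-digit OTP format should match whatever your system actually issues.

```python
import re
from urllib.parse import urlparse


def check_verification_message(message: dict, *, inbox_id: str,
                               sender_domain: str,
                               link_hosts: frozenset = frozenset()) -> list[str]:
    """Return a list of failed checks; an empty list means the message passes."""
    failures = []
    # Correlation: the message must come from the inbox this attempt created.
    if message.get("inbox_id") != inbox_id:
        failures.append("correlation: unexpected inbox_id")
    # Intent: match on the sender domain, not on the display name or subject.
    address = message["from"]["address"].lower()
    if not address.endswith("@" + sender_domain.lower()):
        failures.append("intent: unexpected sender domain")
    artifacts = message.get("artifacts", {})
    otp = artifacts.get("otp")
    url = artifacts.get("verification_url")  # assumed key name
    if otp is None and url is None:
        failures.append("intent: no artifact present")
    # Artifact correctness: format and host allowlist.
    if otp is not None and not re.fullmatch(r"\d{6}", otp.get("code", "")):
        failures.append("artifact: malformed OTP")
    if url is not None and urlparse(url).hostname not in link_hosts:
        failures.append("artifact: link host not allowlisted")
    return failures
```

A test can then assert `check_verification_message(...) == []`, which fails with a readable list of reasons instead of a template diff.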
Webhooks and polling: what the schema must support
Whether you receive emails via webhooks or polling, the JSON must let you implement two reliability properties:
- At-least-once delivery safety (duplicates can happen)
- Deterministic selection (choose the right message under retries)
That is why delivery_id and message_id matter, even in a “minimal” model.
If you are consuming webhook events, also include (either in headers or payload metadata):
- a timestamp
- a signature (or signed payload)
Then your consumer can verify authenticity before processing.
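As a generic illustration of timestamp-plus-signature verification, here is an HMAC-SHA256 sketch. The signing scheme (HMAC over `timestamp + "." + body`, hex-encoded, with a five-minute tolerance) is an assumption chosen for the example; it is not any specific provider's documented format, so check your provider's reference for the real rules.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_signature(secret: bytes, timestamp: str, body: bytes,
                     signature_hex: str, *, now: Optional[float] = None,
                     tolerance_seconds: int = 300) -> bool:
    """Check freshness and authenticity of a webhook under the assumed scheme."""
    current = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False  # malformed timestamp
    if abs(current - sent_at) > tolerance_seconds:
        return False  # too old or too far in the future: possible replay
    expected = hmac.new(secret, timestamp.encode() + b"." + body,
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

Verify against the raw request bytes before parsing JSON, then apply the delivery_id dedupe, so that neither forged nor repeated events reach your workflow logic.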
Mailhook supports real-time webhook notifications and signed payloads, plus a polling API fallback. For the exact request/response shapes and verification rules, use the canonical reference at mailhook.co/llms.txt.
Implementation notes (how this maps to Mailhook without overfitting)
Mailhook’s core idea is “email as a tool primitive”:
- Create disposable inboxes via API
- Receive emails as structured JSON
- Get webhook notifications in real time (or poll)
To stay provider-agnostic, you can define an internal interface like:
- create_inbox() -> { inbox_id, email, expires_at? }
- wait_for_message(inbox_id, matcher, deadline) -> MessageJSON
- extract_artifact(MessageJSON) -> { otp | url | ... }
Mailhook is designed to fit this model. Use shared domains for quick starts, or custom domain support when you need allowlisting and environment isolation.
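The internal interface above can be sketched as a Python `Protocol` plus an in-memory fake for unit tests. This is not a Mailhook client: `FakeEmailTool`, its `deliver` method, the `example.test` address, and the choice to raise `TimeoutError` at the deadline are all assumptions made for the example.

```python
import time
from typing import Callable, Protocol

MessageJSON = dict
Matcher = Callable[[MessageJSON], bool]


class EmailTool(Protocol):
    """Provider-agnostic interface; a real adapter would call a provider API."""

    def create_inbox(self) -> dict: ...

    def wait_for_message(self, inbox_id: str, matcher: Matcher,
                         deadline: float) -> MessageJSON: ...


class FakeEmailTool:
    """In-memory implementation for tests (hypothetical)."""

    def __init__(self) -> None:
        self._inboxes: dict[str, list[MessageJSON]] = {}

    def create_inbox(self) -> dict:
        inbox_id = f"inb_{len(self._inboxes) + 1}"
        self._inboxes[inbox_id] = []
        return {"inbox_id": inbox_id, "email": f"{inbox_id}@example.test"}

    def deliver(self, inbox_id: str, message: MessageJSON) -> None:
        self._inboxes[inbox_id].append(message)  # test-only injection point

    def wait_for_message(self, inbox_id: str, matcher: Matcher,
                         deadline: float, poll_interval: float = 0.05) -> MessageJSON:
        # `deadline` is a time.monotonic() value, not a duration.
        while True:
            for message in self._inboxes[inbox_id]:
                if matcher(message):
                    return message
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no matching message in {inbox_id}")
            time.sleep(poll_interval)


def extract_artifact(message: MessageJSON) -> dict:
    """Return only the artifacts section, e.g. {"otp": {...}}."""
    return message.get("artifacts", {})
```

Keeping the matcher and deadline explicit in the signature is what makes selection deterministic: the caller states which message it wants and how long it is willing to wait.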
The “minimal schema” checklist
If you are reviewing an email-to-JSON integration (or a vendor), this short checklist catches most schema problems:
- You can dedupe webhook retries (delivery_id or equivalent)
- You can dedupe messages (message_id or equivalent)
- You can tie the message to an isolated inbox (inbox_id)
- You get text content reliably, without scraping HTML
- You can represent artifacts separately from the body
- You can retrieve raw source (or at least enough provenance) when debugging
A minimal schema is not about losing information. It is about making the default path deterministic, safe, and easy to assert on.
If you want to implement this with Mailhook, start with the canonical contract at mailhook.co/llms.txt and the product overview at mailhook.co.