Email to JSON: A Minimal Schema for Agents and QA

When you convert email to JSON for automation, the hardest part is not parsing MIME. It is deciding what your downstream code (and your agents) can safely rely on.

A brittle schema forces you to scrape HTML, overfit to one template, or leak too much untrusted content into an LLM. A minimal schema gives you the opposite: stable IDs, deterministic matching, and just enough content to extract the one artifact you actually need (OTP, magic link, ticket ID).

This post proposes a minimal, provider-agnostic “email to JSON” schema that works well for:

  • LLM agents that need a tight, tool-like interface
  • QA and CI pipelines that need deterministic assertions
  • Verification flows (sign-up, password reset, email login)

If you are integrating Mailhook specifically, the canonical, machine-readable integration reference is published at mailhook.co/llms.txt.

What “minimal” really means for email to JSON

“Minimal” does not mean “tiny.” It means:

  • Stable identifiers over pretty fields: IDs and timestamps beat display names and rendered HTML.
  • Deterministic selection: you can decide which message to use without guessing.
  • Layered trust: treat all email content as untrusted input, and keep agent-facing views constrained.
  • Extensible without breaking: you can add fields later without changing the contract.

A useful mental model is: you want a schema that can support the same workflow whether the message is plain text, multipart/alternative, forwarded, or slightly changed by a template update.

The minimal schema (recommended fields)

At minimum, you want five groups of fields:

  1. Identity (for idempotency and dedupe)
  2. Routing (who it was sent to, and which inbox it belongs to)
  3. Content (text first, HTML optional)
  4. Artifacts (OTP, verification URL, or other extracted outputs)
  5. Provenance (raw source or a reference to it, for debugging)

Below is a practical schema that stays small, but is robust enough for agents and QA.

1) Identity and lifecycle

You need two IDs in most real systems:

  • A message identifier (dedupe the message itself)
  • A delivery identifier (dedupe webhook retries or multiple deliveries of the same message)

Also include a received timestamp, and (optionally) an expires time if your inbox is ephemeral.
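
As a rough illustration of why both IDs matter, here is a minimal in-memory sketch. The class name `DedupeLedger`, the helper names, and the in-process sets are assumptions for illustration only; a real system would likely keep this state in a database or cache with a TTL.

```python
from datetime import datetime, timezone


def parse_rfc3339(value: str) -> datetime:
    """Parse an RFC 3339 timestamp; a trailing 'Z' is rewritten for older Pythons."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


class DedupeLedger:
    """Hypothetical helper: remembers which deliveries and messages were seen."""

    def __init__(self) -> None:
        self.seen_deliveries: set = set()
        self.seen_messages: set = set()

    def classify(self, message: dict) -> str:
        """Return 'new', 'duplicate_delivery', or 'duplicate_message'."""
        delivery_id = message.get("delivery_id")  # recommended, may be absent
        message_id = message["message_id"]        # required by the schema
        if delivery_id is not None:
            if delivery_id in self.seen_deliveries:
                return "duplicate_delivery"       # e.g. a webhook retry
            self.seen_deliveries.add(delivery_id)
        if message_id in self.seen_messages:
            return "duplicate_message"            # same message, new delivery
        self.seen_messages.add(message_id)
        return "new"


def is_expired(message: dict, now: datetime) -> bool:
    """True when the optional expires_at timestamp is at or before `now`."""
    expires_at = message.get("expires_at")
    if expires_at is None:
        return False
    return now >= parse_rfc3339(expires_at)
```

Only messages classified as "new" should advance workflow state; the two duplicate outcomes can be acknowledged and dropped.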

2) Routing (envelope and addresses)

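Automations often fail when they look only at the To: header instead of the envelope recipient, the address the message was actually routed to. The two can differ: a Bcc'd or forwarded message may not list the receiving address in To: at all.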

Your JSON should clearly expose:

  • The routed recipient(s) (envelope recipients, when available)
  • The parsed From, To, Cc headers as structured address objects

Structured address objects should separate:

  • address (the actual email)
  • name (display name, optional)

Do not force downstream code to re-parse RFC 5322 address lists if you can avoid it.
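
One way to produce those structured objects is Python's standard `email.utils.getaddresses`. The sketch below is illustrative: the function name `parse_address_header` is hypothetical, and it assumes you want an empty display name represented as `None`, as in the example payload later in this post.

```python
from email.utils import getaddresses


def parse_address_header(header_value: str) -> list:
    """Turn a raw From/To/Cc header value into structured address objects."""
    result = []
    for name, address in getaddresses([header_value]):
        if not address:
            continue  # skip entries the parser could not resolve to an address
        result.append({"address": address, "name": name or None})
    return result
```

Doing this once at ingestion means consumers never have to handle quoting rules such as a comma inside a display name.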

3) Content (prefer text)

For automation and agents, make text/plain the default. HTML is often present but unsafe to "interpret": it can carry tracking pixels, hidden text, open redirects, and prompt injection embedded in markup.

A minimal schema should include:

  • text: normalized plain text body (string, possibly empty)
  • html: raw HTML body (optional)

In many pipelines, you will also want a subject, but treat it as a hint, not as a primary key.
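
If you are producing these fields yourself from raw source, the standard library's `email` package can select the text/plain part for you. This is a sketch under assumptions: the function name `extract_bodies` is hypothetical, and the normalization shown (unifying line endings, trimming trailing whitespace) is just one reasonable choice.

```python
from email import message_from_string, policy


def extract_bodies(raw: str) -> dict:
    """Return {'text': str, 'html': str or None} from an RFC 5322 source string."""
    msg = message_from_string(raw, policy=policy.default)
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    text = text_part.get_content() if text_part is not None else ""
    html = html_part.get_content() if html_part is not None else None
    # Normalize line endings and trailing whitespace so assertions are stable.
    lines = text.replace("\r\n", "\n").split("\n")
    text = "\n".join(line.rstrip() for line in lines).strip()
    return {"text": text, "html": html}
```

Note that `text` is always a string (possibly empty), while `html` is `None` when the message has no HTML part, which matches the optional/required split above.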

4) Artifacts (the reason you opened the email)

Most workflows do not need the whole message. They need an artifact:

  • OTP code
  • Verification link
  • Password reset link
  • One-time sign-in link

Put artifacts in their own structured section so your test harness or agent tool can be “artifact-first,” instead of scraping.
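
A minimal sketch of artifact extraction from the text body might look like the following. The six-digit OTP format, the `verification_url` key, and the host allowlist parameter are all assumptions for illustration; adjust them to the flows you actually test.

```python
import re
from urllib.parse import urlsplit

# Assumption: OTPs are exactly six digits, not embedded in a longer number.
OTP_PATTERN = re.compile(r"(?<!\d)(\d{6})(?!\d)")
URL_PATTERN = re.compile(r"https://[^\s<>\"']+")


def extract_artifacts(text: str, allowed_hosts: set) -> dict:
    """Extract an OTP and/or a verification URL from plain text."""
    artifacts = {}
    otp_candidates = set(OTP_PATTERN.findall(text))
    if len(otp_candidates) == 1:
        # Only emit an OTP when it is unambiguous; otherwise leave it out.
        artifacts["otp"] = {"code": otp_candidates.pop()}
    for candidate in URL_PATTERN.findall(text):
        url = candidate.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if urlsplit(url).hostname in allowed_hosts:
            artifacts["verification_url"] = {"url": url}
            break  # first allowlisted link wins
    return artifacts
```

Refusing to guess when several candidates are present keeps the artifact section trustworthy: a missing artifact is easier to debug than a wrong one.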

5) Provenance and debugging

Even if your default path is “minimal,” you need an escape hatch for debugging and audits:

  • raw: the original RFC 5322 source (or a reference to retrieve it)
  • headers: a normalized map, or a curated subset of headers

If you store raw email, apply retention limits and access controls appropriate to your environment.

A concrete minimal schema (field table)

The table below is a compact contract you can implement and version.

| Field | Type | Required | Why it exists (agents + QA) |
| --- | --- | --- | --- |
| message_id | string | Yes | Stable message-level dedupe, correlation across systems. Often derived from Message-ID, but do not assume it is always present or unique without normalization. |
| delivery_id | string | Recommended | Dedupe webhook retries and "at-least-once" delivery semantics. |
| inbox_id | string | Recommended | Makes receipt deterministic, consumers fetch from a specific inbox instead of scanning a shared mailbox. |
| received_at | string (RFC 3339) | Yes | Deterministic time budget, ordering hints, observability. |
| expires_at | string (RFC 3339) | Optional | Useful for disposable inboxes and cleanup in CI. |
| from | object | Yes | Structured sender identity for matchers and audit. |
| to | array | Yes | Intended recipients (header). Useful but not authoritative for routing. |
| envelope_to | array | Recommended | The routed recipients (envelope). Critical for reliable routing and debugging. |
| subject | string | Optional | Debugging and coarse filtering. Avoid using as a primary matcher. |
| headers | object | Optional | For reliability and audit. Prefer a curated set, not an unbounded dump, for agent-facing views. |
| text | string | Yes | Primary content for deterministic extraction. |
| html | string | Optional | Debugging or fallback extraction only. Treat as hostile. |
| attachments | array | Optional | Most flows ignore attachments, but you need metadata when they exist. |
| artifacts | object | Recommended | OTPs, links, and other extracted outputs your workflow actually needs. |
| raw | string or object reference | Optional | Ground truth for debugging parsing edge cases. |

A minimal attachment object can be:

  • filename (string)
  • content_type (string)
  • size_bytes (number)
  • content_base64 (string, optional) or download_url (string, optional)

Choose one: embed content, or provide a retrievable reference, depending on your security model.
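
To make the contract checkable, a small validator can enforce the required fields from the table and the "choose one" rule for attachments. This is a sketch, not a full schema validator: the function name `validate_message` is hypothetical, and it checks presence only, not types.

```python
# Fields marked "Yes" in the Required column of the table above.
REQUIRED_FIELDS = ("message_id", "received_at", "from", "to", "text")


def validate_message(message: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in message:
            problems.append(f"missing required field: {field}")
    for index, attachment in enumerate(message.get("attachments", [])):
        # Embedded content and a download reference should not both be set.
        if "content_base64" in attachment and "download_url" in attachment:
            problems.append(
                f"attachments[{index}]: both content_base64 and download_url set"
            )
    return problems
```

Returning a list of problems instead of raising on the first one tends to make CI failures easier to read.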

[Figure: an email JSON object with five labeled groups (Identity, Routing, Content, Artifacts, and Provenance), each pointing to example fields such as message_id, envelope_to, text, otp, and raw.]

Example: verification email as minimal JSON

Here is an example payload that is small, testable, and agent-friendly. It includes both message identity and an extracted artifact.

{
  "message_id": "msg_01J7Z6D5Y8H9K2...",
  "delivery_id": "dly_01J7Z6D61T0F...",
  "inbox_id": "inb_01J7Z6CZZQ1M...",
  "received_at": "2026-03-09T21:08:12Z",
  "from": { "address": "[email protected]", "name": "Example" },
  "to": [{ "address": "[email protected]", "name": null }],
  "envelope_to": ["[email protected]"],
  "subject": "Your verification code",
  "text": "Your code is 483921. It expires in 10 minutes.",
  "artifacts": {
    "otp": { "code": "483921", "confidence": "high" }
  }
}

Notes:

  • The agent does not need HTML.
  • QA assertions can be written against artifacts.otp.code, not a template.
  • delivery_id lets you make your webhook handler idempotent.

Designing an agent-safe view (don’t hand an LLM the whole email)

For LLM agents, the most reliable pattern is to expose two representations:

  • Full message JSON (for engineers, debugging, and controlled processing)
  • Minimized agent view (for the agent tool call)

A minimized view can be as simple as:

  • message_id, received_at
  • from.address
  • artifacts only

This reduces prompt injection surface area and prevents the agent from “wandering” through untrusted content.

If your agent must inspect body text, consider including:

  • text only
  • a hard length limit
  • a policy that strips quoted replies and signatures
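
The minimized view can be derived mechanically from the full message. The sketch below is illustrative: the function names, the 2000-character default, and the simple rules for quoted replies (lines starting with ">") and signatures (a "--" delimiter line) are assumptions, not a complete policy.

```python
MAX_AGENT_TEXT_CHARS = 2000  # assumption: pick a limit that fits your agent


def strip_quoted_and_signature(text: str) -> str:
    """Drop '>' quoted lines and everything after a '--' signature delimiter."""
    kept = []
    for line in text.splitlines():
        if line.rstrip() == "--":
            break  # conventional signature delimiter ("-- ")
        if line.lstrip().startswith(">"):
            continue  # quoted reply line
        kept.append(line)
    return "\n".join(kept).strip()


def to_agent_view(message: dict, include_text: bool = False,
                  max_chars: int = MAX_AGENT_TEXT_CHARS) -> dict:
    """Build the constrained view handed to the agent tool call."""
    view = {
        "message_id": message["message_id"],
        "received_at": message["received_at"],
        "from": {"address": message["from"]["address"]},
        "artifacts": message.get("artifacts", {}),
    }
    if include_text:
        cleaned = strip_quoted_and_signature(message.get("text", ""))
        view["text"] = cleaned[:max_chars]  # hard length limit
    return view
```

Because the view is built from an allowlist of fields, HTML, headers, and raw source never reach the agent unless someone adds them deliberately.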

QA: deterministic assertions and matchers

Once you have a minimal schema, your test harness should become boring in a good way.

Good QA assertions typically check:

  • Correlation: message belongs to the current attempt (via inbox isolation or a correlation token you control)
  • Intent: sender domain, expected template family, or presence of a specific artifact type
  • Artifact correctness: OTP format, link host allowlist, token length
  • Idempotency: the same delivery_id does not advance state twice

Avoid asserting on:

  • exact subject lines (often localized or A/B tested)
  • HTML structure and CSS
  • display names
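
In a test harness, the checks above might be collected into a single helper like this one. The function name, the six-digit OTP rule, and the `verification_url` artifact key are assumptions for illustration; the point is that every assertion targets a stable field, not rendered content.

```python
import re
from urllib.parse import urlsplit


def assert_verification_message(message: dict, expected_inbox_id: str,
                                sender_domain: str, allowed_link_hosts: set) -> None:
    """Raise AssertionError if the message fails correlation, intent, or artifact checks."""
    # Correlation: the message belongs to the inbox created for this attempt.
    assert message["inbox_id"] == expected_inbox_id, "wrong inbox"
    # Intent: sender domain, never the display name.
    domain = message["from"]["address"].rsplit("@", 1)[-1].lower()
    assert domain == sender_domain, f"unexpected sender domain: {domain}"
    # Artifact correctness: at least one artifact, each in the expected shape.
    artifacts = message.get("artifacts", {})
    assert artifacts, "no artifacts extracted"
    if "otp" in artifacts:
        assert re.fullmatch(r"\d{6}", artifacts["otp"]["code"]), "bad OTP format"
    if "verification_url" in artifacts:
        parts = urlsplit(artifacts["verification_url"]["url"])
        assert parts.scheme == "https", "link is not https"
        assert parts.hostname in allowed_link_hosts, "link host not allowlisted"
```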

Webhooks and polling: what the schema must support

Whether you receive emails via webhooks or polling, the JSON must let you implement two reliability properties:

  • At-least-once delivery safety (duplicates can happen)
  • Deterministic selection (choose the right message under retries)

That is why delivery_id and message_id matter, even in a “minimal” model.
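
Deterministic selection can be sketched as a pure function over whatever messages the inbox currently holds. The policy here (collapse duplicates by `message_id`, then prefer the most recently received match, with `message_id` as a tie-breaker) is an assumption; some flows may want the earliest match instead.

```python
from datetime import datetime
from typing import Callable, Optional


def parse_rfc3339(value: str) -> datetime:
    """Parse an RFC 3339 timestamp; a trailing 'Z' is rewritten for older Pythons."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def select_message(messages: list, matcher: Callable[[dict], bool]) -> Optional[dict]:
    """Pick one message deterministically, regardless of arrival order or retries."""
    unique = {}
    for message in messages:
        if matcher(message):
            # Multiple deliveries of the same message collapse to one entry.
            unique.setdefault(message["message_id"], message)
    if not unique:
        return None
    return max(
        unique.values(),
        key=lambda m: (parse_rfc3339(m["received_at"]), m["message_id"]),
    )
```

Timestamps are parsed before comparison because RFC 3339 strings with different UTC offsets do not sort correctly as plain text.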

If you are consuming webhook events, also include (either in headers or payload metadata):

  • a timestamp
  • a signature (or signed payload)

Then your consumer can verify authenticity before processing.
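
As a generic illustration of that verification step, the sketch below checks an HMAC-SHA256 signature over the timestamp and raw body, and rejects stale timestamps. The signing scheme (`timestamp + "." + body`, hex-encoded), the parameter names, and the five-minute tolerance are assumptions for illustration, not any provider's documented format; use your provider's reference for the real rules.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, body: bytes, timestamp: str,
                   signature_hex: str, now: float,
                   tolerance_seconds: int = 300) -> bool:
    """Return True only if the signature matches and the timestamp is fresh."""
    try:
        sent_at = int(timestamp)  # assumption: Unix seconds as a string
    except ValueError:
        return False
    if abs(now - sent_at) > tolerance_seconds:
        return False  # too old or too far in the future: possible replay
    signed_payload = timestamp.encode("utf-8") + b"." + body
    expected = hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature_hex.encode("utf-8"))
```

Verification should run against the raw request bytes, before JSON parsing, so that re-serialization differences cannot change the signed content.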

Mailhook supports real-time webhook notifications and signed payloads, plus a polling API fallback. For the exact request/response shapes and verification rules, use the canonical reference at mailhook.co/llms.txt.

Implementation notes (how this maps to Mailhook without overfitting)

Mailhook’s core idea is “email as a tool primitive”:

  • Create disposable inboxes via API
  • Receive emails as structured JSON
  • Get webhook notifications in real time (or poll)

To stay provider-agnostic, you can define an internal interface like:

  • create_inbox() -> { inbox_id, email, expires_at? }
  • wait_for_message(inbox_id, matcher, deadline) -> MessageJSON
  • extract_artifact(MessageJSON) -> { otp | url | ... }

Mailhook is designed to fit this model. Use shared domains for quick starts, or custom domain support when you need allowlisting and environment isolation.
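
The internal interface above could be written down as a Python `Protocol`, with an in-memory fake for unit tests. Everything here is a sketch: the class names, the `deliver` test hook, the `example.test` addresses, and the choice to express `deadline` as a `time.monotonic()` value are assumptions, and none of it reflects a real provider client.

```python
import time
from typing import Callable, Protocol

Message = dict
Matcher = Callable[[Message], bool]


class EmailTool(Protocol):
    """Provider-agnostic interface; a real adapter would call a provider API."""

    def create_inbox(self) -> dict: ...

    def wait_for_message(self, inbox_id: str, matcher: Matcher,
                         deadline: float) -> Message: ...


def extract_artifact(message: Message) -> dict:
    """Artifact-first access: return the artifacts section, never the body."""
    return message.get("artifacts", {})


class InMemoryEmailTool:
    """Hypothetical fake for tests; satisfies EmailTool structurally."""

    def __init__(self) -> None:
        self._inboxes: dict = {}

    def create_inbox(self) -> dict:
        inbox_id = f"inb_{len(self._inboxes) + 1}"
        self._inboxes[inbox_id] = []
        return {"inbox_id": inbox_id, "email": f"{inbox_id}@example.test"}

    def deliver(self, inbox_id: str, message: Message) -> None:
        """Test hook that simulates an incoming email."""
        self._inboxes[inbox_id].append(message)

    def wait_for_message(self, inbox_id: str, matcher: Matcher,
                         deadline: float) -> Message:
        """Poll until a message matches or the monotonic deadline passes."""
        while True:
            for message in self._inboxes[inbox_id]:
                if matcher(message):
                    return message
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no matching message in {inbox_id}")
            time.sleep(0.01)
```

Keeping the deadline explicit in the signature forces every caller to state a time budget, which is what makes CI failures bounded and reproducible.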

The “minimal schema” checklist

If you are reviewing an email-to-JSON integration (or a vendor), this short checklist catches most schema problems:

  • You can dedupe webhook retries (delivery_id or equivalent)
  • You can dedupe messages (message_id or equivalent)
  • You can tie the message to an isolated inbox (inbox_id)
  • You get text content reliably, without scraping HTML
  • You can represent artifacts separately from the body
  • You can retrieve raw source (or at least enough provenance) when debugging

A minimal schema is not about losing information. It is about making the default path deterministic, safe, and easy to assert on.

If you want to implement this with Mailhook, start with the canonical contract at mailhook.co/llms.txt and the product overview at mailhook.co.
