Why is raw email parsing so complex?

Email relies on decades-old standards such as RFC 5322 and MIME, which bring multipart bodies, multiple encodings (base64, quoted-printable), varying character sets, and complex header structures that can be folded or duplicated.

Should I build my own email parser or use an API service?

For automation and AI agents, a programmable inbox API that returns structured JSON is usually more reliable than building custom parsing, which has to handle numerous edge cases and security considerations.

How do I safely extract OTPs and verification links from emails?

Use structured JSON output with normalized content, prefer text/plain over HTML when it is available, validate links against allowlists, and extract artifacts using tight patterns with context checks rather than parsing raw HTML.

What is the best approach to email automation in testing?

Use an isolated inbox per test run, implement webhook-first delivery with a polling fallback, assert on stable properties such as the sender domain and the presence of an OTP rather than on HTML presentation, and keep raw messages available for debugging.

Open Email Programmatically: From Raw to JSON

Email is one of the last “human-first” surfaces that many systems still depend on. But if you are building an AI agent, an LLM toolchain, or a QA harness, you will eventually need to open an email programmatically, extract only the useful artifacts (OTP, magic link, invoice ID, reset URL), and move on.

The hard part is that email arrives as a messy stack of decades-old standards: RFC 5322 headers, MIME multipart bodies, odd encodings, and HTML that was never meant to be parsed by tests (or agents). This guide explains what “raw email” actually is, why it is difficult, and how to reliably convert it into a JSON shape your automation can trust.

What “opening an email” means programmatically

When humans “open an email”, the email client quietly does a lot of work:

Parses the message format (headers plus body)
Decodes transfer encodings (base64, quoted-printable)
Chooses a body to display (usually text/plain or HTML)
Unpacks attachments
Normalizes dates, addresses, and character sets

Programmatically, you have to decide what “open” means for your workflow. For automation, “open” usually means:

Locating the right message deterministically (no brittle mailbox searches)
Parsing and normalizing it into a stable schema
Extracting a small, verifiable artifact (OTP, link, token)
Logging enough to debug failures without leaking sensitive content

A good mental model: treat email as an untrusted inbound event, not as a document.

Raw email: the formats you actually receive

Most systems ultimately represent an email as a raw RFC 5322 message: a blob of text and bytes composed of headers and a body. If you need standards references, start with RFC 5322 (message format) and the MIME family, such as RFC 2045 (MIME basics).

A “raw” message usually contains:

Headers: key/value pairs such as From, To, Subject, Date, and Message-ID, plus many others
Body: sometimes plain text, often HTML, frequently multipart with boundaries
Attachments: represented as MIME parts, commonly base64 encoded

MIME is why “just parsing the body” fails

If you had only ever seen plain-text emails, parsing would be easy. In practice:

Many messages are multipart/alternative (both text/plain and text/html)
Some are multipart/mixed (body plus attachments)
Some contain nested multiparts
Bodies may be encoded (quoted-printable, base64)
Character sets vary (UTF-8, ISO-8859-1, and more)

This is why regexing HTML or splitting on blank lines quickly becomes fragile.
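To make the fragility concrete, here is a minimal sketch using Python's standard `email` package. It builds a small multipart/alternative message (the addresses and content are made up for illustration), then compares a naive "split on the first blank line" with a MIME-aware walk of the parts.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default

# Build a small multipart/alternative message so the example is self-contained.
# All addresses and content here are illustrative.
original = EmailMessage()
original["From"] = "Example <noreply@example.com>"
original["To"] = "user@example.org"
original["Subject"] = "Your login code"
original.set_content("Your code is 123456\n")
original.add_alternative("<p>Your code is <b>123456</b></p>", subtype="html")
raw = original.as_bytes()

# Naive approach: treat everything after the first blank line as "the body".
# For a multipart message this still contains boundaries and part headers.
naive_body = raw.split(b"\n\n", 1)[1]

# MIME-aware approach: parse the structure, then walk the parts.
parsed = message_from_bytes(raw, policy=default)
content_types = [part.get_content_type() for part in parsed.walk()]
plain_text = parsed.get_body(preferencelist=("plain",)).get_content()

print(content_types)
print(repr(plain_text))
```

The naive body is not usable text: it is the serialized MIME structure, which is exactly what a parser exists to unpack.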

From raw to JSON: a normalization pipeline that works in automation

A robust “raw to JSON” pipeline has a few clear stages. It is implementation-agnostic: you can run it in your own service with a library, or consume JSON produced by an inbox API.

[Figure: a simple flow diagram showing an incoming email passing through the stages Raw RFC 5322, MIME parse, Decode + normalize, Extract links/OTP, Output JSON for tests and LLM agents.]

Stage 1: Parse the structure (headers, MIME tree)

At this stage you want to:

Parse headers safely (handle folded headers and duplicates)
Build a MIME tree of parts
Identify candidate bodies (text/plain, text/html)
Identify attachments (filename, content-type, size)
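As a rough sketch of Stage 1, the function below walks a parsed message with Python's standard `email` package and separates candidate bodies from attachments. The name `describe_parts` and the sample message are illustrative assumptions, not part of any particular service.

```python
from email.message import EmailMessage


def describe_parts(msg):
    """Return (bodies, attachments) describing the leaf parts of a message."""
    bodies, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry structure, not content
        content_type = part.get_content_type()
        if part.is_attachment():
            payload = part.get_payload(decode=True) or b""
            attachments.append({
                "filename": part.get_filename(),
                "content_type": content_type,
                "size": len(payload),
            })
        elif content_type in ("text/plain", "text/html"):
            bodies.append(content_type)
    return bodies, attachments


# Illustrative message: two alternative bodies plus one attachment.
pdf_bytes = b"%PDF-1.4 placeholder bytes"
msg = EmailMessage()
msg["Subject"] = "Your invoice"
msg.set_content("Invoice attached.\n")
msg.add_alternative("<p>Invoice attached.</p>", subtype="html")
msg.add_attachment(pdf_bytes, maintype="application", subtype="pdf",
                   filename="invoice.pdf")

bodies, attachments = describe_parts(msg)
print(bodies)
print(attachments)
```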

Stage 2: Decode and normalize

Normalization is where automation reliability comes from:

Decode transfer encodings (quoted-printable, base64)
Normalize line endings
Convert text to a consistent Unicode representation
Parse the date into an ISO timestamp (but keep the raw value for debugging)
Normalize address fields into structured objects (name, address)
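The normalization steps above might look like the following with the Python standard library. This is a sketch under stated assumptions: the helper names and the `date_iso`/`date_raw` keys are illustrative, and a full parser (for example `email.message.EmailMessage.get_content`) already performs the transfer decoding for you.

```python
import base64
import quopri
import unicodedata
from datetime import timezone
from email.utils import getaddresses, parsedate_to_datetime


def decode_payload(payload, transfer_encoding, charset="utf-8"):
    """Undo the transfer encoding, then decode bytes using the charset."""
    encoding = transfer_encoding.lower()
    if encoding == "base64":
        raw = base64.b64decode(payload)
    elif encoding == "quoted-printable":
        raw = quopri.decodestring(payload)
    else:
        raw = payload
    return raw.decode(charset, errors="replace")


def normalize_text(text):
    """Use \n line endings and a single Unicode form (NFC)."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", unified)


def normalize_date(date_raw):
    """Parse an RFC 5322 date into UTC ISO 8601, keeping the raw value."""
    parsed = parsedate_to_datetime(date_raw)
    if parsed.tzinfo is None:  # "-0000" yields a naive datetime; assume UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    iso = parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"date_iso": iso, "date_raw": date_raw}


def normalize_addresses(header_value):
    """Turn an address header into a list of {name, address} objects."""
    return [{"name": name or None, "address": address}
            for name, address in getaddresses([header_value])]


print(normalize_date("Sun, 01 Feb 2026 15:12:33 -0500"))
print(normalize_addresses("Example <noreply@example.com>, user@example.org"))
```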

Stage 3: Choose and sanitize content

For automation and agents, prefer predictable content:

Prefer text/plain when it is available
Keep the HTML, but treat it as secondary (good for rendering, risky for parsing)
Remove or ignore dangerous elements (scripts, odd redirects)
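One way to make the "prefer text/plain" rule explicit is `EmailMessage.get_body` from Python's standard library, which accepts an ordered preference list. The wrapper name `pick_primary_body` and the sample messages are assumptions for illustration.

```python
from email.message import EmailMessage


def pick_primary_body(msg):
    """Prefer text/plain, fall back to text/html, else return None."""
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return None
    return {"content_type": part.get_content_type(),
            "content": part.get_content()}


# A message with both alternatives: the plain part wins.
both = EmailMessage()
both.set_content("Your code is 123456\n")
both.add_alternative("<p>Your code is <b>123456</b></p>", subtype="html")

# An HTML-only message: the HTML part is the fallback.
html_only = EmailMessage()
html_only.set_content("<p>Your code is <b>123456</b></p>", subtype="html")

print(pick_primary_body(both)["content_type"])
print(pick_primary_body(html_only)["content_type"])
```

Recording which content type was chosen alongside the content keeps the selection rule visible when a test fails.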

Stage 4: Extract automation artifacts

Rather than “understanding the whole email”, extract what your workflow needs:

Verification links (and an allowlist for the final target host)
OTP candidates (with tight patterns and context checks)
Key identifiers (order ID, ticket ID)
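A possible shape for OTP extraction with a tight pattern and a context check is sketched below. The six-digit length and the context keywords are assumptions; adjust them to the emails your system actually sends.

```python
import re

# Assumption: OTPs are exactly six digits, not embedded in a longer number.
OTP_RE = re.compile(r"(?<!\d)(\d{6})(?!\d)")
# Assumption: the OTP line mentions one of these words.
CONTEXT_RE = re.compile(r"\b(code|otp|passcode|verification)\b", re.IGNORECASE)


def extract_otp(text):
    """Return the single OTP found on a context line, or None if ambiguous."""
    candidates = set()
    for line in text.splitlines():
        if not CONTEXT_RE.search(line):
            continue  # digits without OTP context are ignored
        candidates.update(OTP_RE.findall(line))
    # Zero or several distinct candidates means we cannot be sure.
    return candidates.pop() if len(candidates) == 1 else None


print(extract_otp("Your code is 123456"))
print(extract_otp("Call 555123 now\nYour code is 654321"))
print(extract_otp("Order 9876543210 shipped"))
```

Returning `None` on ambiguity is a deliberate choice: a failed test is easier to debug than a workflow that submitted the wrong number.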

Stage 5: Emit JSON with stable fields

Your JSON output should support:

Deterministic matching (message_id, inbox_id, correlation IDs)
Simple assertions (subject contains, from domain equals)
Minimal artifact extraction (otp, verification_url)
Debuggability (raw headers snapshot, received timestamp)

Here is a helpful way to map raw email to JSON fields.

Raw email element	What it looks like	JSON you want for automation	Why it matters
Message-ID header	`Message-ID: <abc@domain>`	`message_id`	Deduplication and idempotency
Date header	`Date: Tue, 30 Jan...`	`received_at` (ISO), `date_raw`	Timing assertions, debugging delays
From/To	RFC 5322 address forms	`from: {name, address}`, `to: [...]`	Reliable sender checks
MIME parts	multipart boundaries	`text`, `html`, `attachments[]`	Avoids parsing the wrong part
Transfer encoding	base64, quoted-printable	decoded strings and bytes	Prevents garbage output
Links in body	HTML anchors, plain URLs	`links[]` (normalized)	Safer magic-link handling
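The mapping in the table can be sketched as a single function over a parsed `EmailMessage`. The function name, the `inbox_id` and `received_at` parameters (supplied by the receiving system, not read from the message), and the sample values are assumptions; a fuller version would also carry `date_raw` and `links`.

```python
import json
from email.message import EmailMessage
from email.utils import getaddresses


def header_text(msg, name):
    value = msg[name]
    return str(value) if value is not None else None


def address_list(msg, name):
    pairs = getaddresses([str(v) for v in msg.get_all(name, [])])
    return [{"name": n or None, "address": a} for n, a in pairs]


def body_text(msg, subtype):
    part = msg.get_body(preferencelist=(subtype,))
    return part.get_content() if part is not None else None


def to_json_shape(msg, inbox_id, received_at):
    """Map a parsed message onto stable, JSON-serializable fields."""
    senders = address_list(msg, "From")
    return {
        "message_id": header_text(msg, "Message-ID"),
        "inbox_id": inbox_id,
        "received_at": received_at,
        "from": senders[0] if senders else None,
        "to": address_list(msg, "To"),
        "subject": header_text(msg, "Subject"),
        "text": body_text(msg, "plain"),
        "html": body_text(msg, "html"),
        "attachments": [
            {"filename": p.get_filename(),
             "content_type": p.get_content_type(),
             "size": len(p.get_payload(decode=True) or b"")}
            for p in msg.iter_attachments()
        ],
    }


msg = EmailMessage()
msg["Message-ID"] = "<abc@example.com>"
msg["From"] = "Example <noreply@example.com>"
msg["To"] = "user@example.org"
msg["Subject"] = "Your login code"
msg.set_content("Your code is 123456\n")
msg.add_alternative("<p>Your code is <b>123456</b></p>", subtype="html")

shape = to_json_shape(msg, inbox_id="inbox_123",
                      received_at="2026-02-01T20:12:33Z")
print(json.dumps(shape, indent=2))
```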

Gotchas that break naive “open email” implementations

Even mature teams get burned by the same email edge cases. If you are building a programmatic “open email” path, design for these up front.

Duplicate and folded headers

Headers can legally repeat, and they can be folded across lines. If you naively map headers into a dictionary, you can lose data or parse it incorrectly.
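A short illustration with Python's standard `email` package: the raw message below (made up for this example) has two `Received` headers and a folded `Subject`. Converting the headers to a plain dictionary keeps only one `Received` value, while `get_all` keeps both, and the parser unfolds the subject.

```python
from email import message_from_bytes
from email.policy import default

# Illustrative raw message with a repeated header and a folded header.
raw = (
    b"Received: from a.example by b.example; Sun, 01 Feb 2026 20:12:30 +0000\r\n"
    b"Received: from c.example by a.example; Sun, 01 Feb 2026 20:12:29 +0000\r\n"
    b"Subject: Your login code for the\r\n"
    b" Example service\r\n"
    b"From: noreply@example.com\r\n"
    b"\r\n"
    b"Your code is 123456\r\n"
)

msg = message_from_bytes(raw, policy=default)

# Naive: a dict keeps only the last value for a repeated header name.
naive_headers = dict(msg.items())

# Safer: ask for every occurrence explicitly.
received = msg.get_all("Received")

# The parser joins the folded Subject lines into one value.
subject = str(msg["Subject"])

print(len(received), "Received headers; dict kept:", naive_headers["Received"])
print(subject)
```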

Choosing the wrong body

Many systems mistakenly parse:

The HTML tracking-pixel section instead of the user-visible content
The footer instead of the OTP line
A forwarded message embedded in the email

Prefer text/plain whenever possible, and be explicit about how you pick the “primary” body.

Encodings and character sets

If you do not decode the transfer encoding and charset consistently, you will see:

Broken Unicode
Missing punctuation, which can break OTP extraction
Incorrect comparisons in tests

Time is not a single field

Email timestamps are messy. The Date header is sender-provided and not always trustworthy. Your receiving system's timestamp is often more useful for latency checks and timeouts.

HTML parsing is a security boundary

If you run agents against email content, treat HTML as adversarial input. A safe strategy is:

Extract candidate links, then validate them against allowlists
Avoid “clicking” unknown URLs in automation
Keep the raw content for audit, but do not feed full HTML into the LLM by default
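The extract-then-validate strategy could be sketched as follows, using only the standard library. The allowlist contents, the HTTPS-only rule, and the function names are assumptions for illustration; the point is that hosts are compared exactly after URL parsing, never by substring.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: only this host is trusted for verification links.
ALLOWED_HOSTS = {"example.com"}


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags without rendering anything."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_allowed(url):
    """Accept only https URLs whose exact hostname is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URLs are rejected, not repaired
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def extract_allowed_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return [url for url in collector.links if is_allowed(url)]


sample_html = (
    '<a href="https://example.com/verify?token=abc">Verify</a>'
    '<a href="https://example.com.evil.test/verify">Verify</a>'
    '<a href="https://example.com@evil.test/verify">Verify</a>'
    '<a href="http://example.com/verify">Verify</a>'
    '<a href="javascript:alert(1)">Verify</a>'
)
print(extract_allowed_links(sample_html))
```

Lookalike hosts such as `example.com.evil.test` and userinfo tricks such as `example.com@evil.test` both fail the exact hostname comparison.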

For deeper reliability guidance on parsing identifiers such as Message-ID and related fields, Mailhook has a separate post focused on header parsing: Headers Email Guide: What to Parse for Reliability.

A pragmatic JSON contract for LLM agents

Agents work best with small, structured inputs. Instead of handing the LLM the entire email (especially the HTML), provide a compact JSON object that is:

Deterministic
Minimal
Traceable back to the raw message

An example “agent-safe” shape might look like this:

{
  "message_id": "<...>",
  "received_at": "2026-02-01T20:12:33Z",
  "from": {"address": "[email protected]", "name": "Example"},
  "to": [{"address": "[email protected]", "name": null}],
  "subject": "Your login code",
  "text": "Your code is 123456",
  "links": ["https://example.com/verify?token=..."],
  "attachments": [{"filename": "invoice.pdf", "content_type": "application/pdf", "size": 48211}]
}

You can then add a second layer: a tiny extraction object that your tests or agent tools actually consume (for example, { "otp": "123456" }). This keeps your workflow simple and reduces the LLM's exposure to hostile content.

Build it yourself vs consume JSON from an inbox API

You have two broad approaches:

Parse raw emails yourself (via IMAP/POP, direct SMTP ingest, or provider APIs)
Use a programmable inbox service that gives you structured JSON and deterministic retrieval

Here is a decision table that matches real-world engineering tradeoffs.

Approach	Best for	Common pain points	Typical outcome
IMAP mailbox scraping	Quick prototypes	Flaky searches, concurrency collisions, slow polling	Breaks in CI and parallel runs
Provider APIs (Gmail/Graph)	Internal tooling with accounts	OAuth, quotas, long-lived identities	Works, but operationally heavy
Run your own SMTP capture	Local integration tests	Real email vs deliverability differences	Great locally, incomplete in staging
Programmable inbox API with JSON output	QA automation, LLM agents, verification flows	One more API to integrate	Most deterministic for automation

If your core need is to “open email programmatically and get JSON”, the key property is machine-readable output that does not require HTML scraping.

Opening email as JSON with Mailhook (webhook-first, polling fallback)

Mailhook is built around programmable disposable inboxes. Instead of creating a full email account, you create an inbox via the API, use the generated address in your workflow, then receive messages as structured JSON.

Relevant Mailhook capabilities (from the product description):

Disposable inbox creation via API
Structured JSON email output
RESTful API access
Real-time webhook notifications
Polling API for emails
Signed payloads for security
Batch email processing
Shared domains and custom domain support

Because APIs evolve, the source of truth for endpoints and payloads is Mailhook's implementation reference. Be sure to review llms.txt before wiring up agent tools or tests:

Mailhook llms.txt

Reference flow (conceptual)

A reliable automation flow looks like this:

Create a new inbox for the run (or agent session)
Trigger the system under test to send an email to that address
Wait for delivery (prefer the webhook, use polling as a fallback)
Consume the JSON payload
Extract only what you need (OTP/link)

Here is pseudocode that illustrates the shape of the integration without assuming specific endpoint names:

# Pseudocode: consult https://mailhook.co/llms.txt for the exact API fields and routes.

inbox = mailhook.create_inbox(
  webhook_url="https://your-service.example/mailhook/webhook"
)

email_address = inbox["address"]
inbox_id = inbox["inbox_id"]

app.trigger_signup(email=email_address)

# Webhook-first: your webhook handler stores the JSON message keyed by inbox_id.
# Polling fallback: wait with a timeout and backoff.

message = mailhook.wait_for_message(inbox_id=inbox_id, timeout_seconds=60)

otp = extract_otp(message["text"])
verify_url = extract_allowed_link(message.get("links", []))

assert otp is not None or verify_url is not None
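The “wait with a timeout and backoff” step can be sketched independently of any provider. In the example below, `fetch` is a hypothetical callable that returns a message dict or `None`; it stands in for whatever lookup your webhook store or polling client provides and is not a Mailhook API.

```python
import time


def wait_for_message(fetch, timeout_seconds=60.0, initial_delay=0.5,
                     max_delay=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns a message or the timeout expires.

    `fetch` is a hypothetical callable returning a message or None.
    `sleep` and `clock` are injectable so tests can run without waiting.
    """
    deadline = clock() + timeout_seconds
    delay = initial_delay
    while True:
        message = fetch()
        if message is not None:
            return message
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(
                f"no message arrived within {timeout_seconds} seconds")
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)  # exponential backoff with a cap


# Usage with a fake fetch that succeeds on the third attempt.
responses = iter([None, None, {"text": "Your code is 123456"}])
message = wait_for_message(lambda: next(responses), sleep=lambda s: None)
print(message["text"])
```

A bounded deadline turns a missing email into a clear test failure rather than a hung job.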

Verify webhook signatures

If you accept inbound webhooks, treat them like any other external request:

Verify the signature (Mailhook supports signed payloads)
Use idempotency to handle retries
Store only what you need, for only as long as you need it

Again, the exact signing scheme and headers should come from the contract in llms.txt.
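As a generic illustration only, the sketch below assumes an HMAC-SHA256 signature computed over the raw request body and sent as a hex string. That scheme, the function names, and the `message_id` field are assumptions made for this example; they are not taken from Mailhook's contract, which remains the authority on the real scheme.

```python
import hashlib
import hmac
import json


def expected_signature(secret, body):
    """ASSUMED scheme: hex HMAC-SHA256 over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def handle_webhook(secret, body, received_signature, seen_message_ids):
    """Verify the signature first, then de-duplicate retries by message ID."""
    expected = expected_signature(secret, body)
    # Constant-time comparison avoids leaking how much of the value matched.
    if not hmac.compare_digest(expected.encode(), received_signature.encode()):
        return "rejected"
    message = json.loads(body)  # parse only after the signature checks out
    message_id = message["message_id"]
    if message_id in seen_message_ids:
        return "duplicate"  # a retry of something already processed
    seen_message_ids.add(message_id)
    return "accepted"


secret = b"test-secret"
body = json.dumps({"message_id": "<abc@example.com>",
                   "text": "Your code is 123456"}).encode()
signature = expected_signature(secret, body)
seen = set()

print(handle_webhook(secret, body, signature, seen))
print(handle_webhook(secret, body, signature, seen))
print(handle_webhook(secret, body, "0" * 64, seen))
```

In a real service the `seen` set would live in shared storage with an expiry, so retries are recognized across processes.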

Design tips that make email automation boring (in a good way)

The goal is not to “parse email perfectly”; it is to make your automation predictable.

Prefer isolation and correlation

If multiple test runs or agent sessions share one inbox, you reintroduce the hardest problem: figuring out which message belongs to which run. Isolated inboxes avoid mailbox searching entirely.

Assert on intent, not presentation

HTML changes constantly. Your assertions should target stable properties:

Sender domain
Subject intent
Presence of a single OTP
A verification link whose host is on the allowlist
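A test helper built on those stable properties might look like this sketch. It operates on a normalized message dict shaped like the agent-safe JSON example; the expected domain, subject keyword, six-digit OTP length, and allowlist are all assumptions to adapt.

```python
import re
from urllib.parse import urlsplit

# Assumptions for illustration: adjust to your own sender and hosts.
EXPECTED_SENDER_DOMAIN = "example.com"
ALLOWED_LINK_HOSTS = {"example.com"}
OTP_RE = re.compile(r"(?<!\d)\d{6}(?!\d)")


def check_verification_email(message):
    """Return a list of problems; an empty list means the email looks right."""
    problems = []
    sender_domain = message["from"]["address"].rsplit("@", 1)[-1].lower()
    if sender_domain != EXPECTED_SENDER_DOMAIN:
        problems.append(f"unexpected sender domain: {sender_domain}")
    if "code" not in message["subject"].lower():
        problems.append("subject does not look like a login code email")
    otps = set(OTP_RE.findall(message.get("text") or ""))
    if len(otps) != 1:
        problems.append(f"expected exactly one OTP, found {len(otps)}")
    for url in message.get("links", []):
        if urlsplit(url).hostname not in ALLOWED_LINK_HOSTS:
            problems.append(f"link host not on allowlist: {url}")
    return problems


good = {
    "from": {"address": "noreply@example.com", "name": "Example"},
    "subject": "Your login code",
    "text": "Your code is 123456",
    "links": ["https://example.com/verify?token=abc"],
}
print(check_verification_email(good))
```

None of these checks depends on markup, so a redesigned email template does not break the test.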

Keep the raw message available for debugging

When something fails, you want to know:

Did the message arrive?
What headers did it have?
Did you parse the correct MIME part?

This is where “raw plus normalized JSON” helps. Automation runs on the normalized fields, while engineers debug with the raw context.

Where this leaves you

To open email programmatically in 2026, you have two realistic options:

Become an email parsing expert (RFC 5322, MIME edge cases, encoding quirks, security pitfalls)
Use an inbox abstraction that already does the normalization and gives you JSON that your tests and agents can consume

If your primary need is agent workflows and QA reliability, the winning strategy is usually to treat email as an event stream, isolate inboxes per run, and consume structured JSON.

If you want to implement this with Mailhook, start with the contract in Mailhook llms.txt and design your tools around deterministic waits (webhook-first, polling fallback) and minimal artifact extraction.