Why is deduplication so important in an email pipeline?

Email delivery is typically at-least-once, so the same message can arrive more than once. Without proper deduplication you will face duplicate processing, incorrect analytics, and consume-once violations. You need to dedupe at three layers: delivery, message, and artifact.

How do you normalize email addresses safely?

When normalizing email addresses, lowercase only the domain part and preserve the local part exactly as received. Do not trust display names as identifiers, and avoid regex parsing. Always store both the parsed structure and the original string.

How do you prepare emails for LLM agents?

Give LLM agents a minimized, sanitized view containing only the necessary fields: message_id, parsed addresses, subject (optional), plain text body (length-capped), and pre-extracted artifacts. Keep raw HTML, full headers, and other untrusted content out of it.

Which storage strategy is best for an email pipeline?

Use three layers: Raw (original source, immutable), Normalized (canonical JSON for downstream), and Derived (extracted artifacts). This structure allows easy reprocessing when normalization or extraction logic changes.

How do you handle webhook failures?

Use a webhook-first approach with a polling fallback. Make webhook handlers idempotent and implement signature verification. Use polling as a backup for when webhooks fail or the endpoint is down.

Cleaning emails in a pipeline: deduplication, normalization, storage

Email looks like a simple input, but in real automation it behaves like an unreliable event stream: retries, duplicates, encoding problems, multi-part bodies, and inconsistent headers. If you are feeding inbound mail into an LLM agent, a QA harness, or a data pipeline, “clean emails” does not mean pretty formatting. It means deduplication, deterministic normalization, and storage that preserves provenance.

This guide presents a practical pipeline design that turns inbound email into stable, queryable records you can trust, without brittle HTML scraping or “sleep(10)” guessing.

What “clean emails” means in a pipeline

For pipeline work, a “clean” email record usually has these properties:

Stable identifiers: you can refer to “this message” consistently across webhook retries, polling loops, and reprocessing.
Normalized structure: headers, addresses, bodies, and attachments are represented with predictable types and encodings.
Clear lineage: you can always trace a derived field (such as an OTP) back to the original message and the exact extraction logic.
Idempotent ingestion: the pipeline can safely handle at-least-once delivery.
Safe agent consumption: LLMs see a minimized, sanitized view, not raw HTML and untrusted headers.

A useful mental model is to treat email ingestion like clickstream or payment events: you need an append-only log, a canonical normalized representation, and derived tables for fast consumption.

Step 1: Deduplicate correctly (message, delivery, artifact)

Email pipelines often fail because teams dedupe at the wrong layer.

The three dedupe layers

Delivery dedupe (transport layer)

The same message can be delivered multiple times because of SMTP retries, greylisting, webhook retries, or polling races.

Message dedupe (content identity)

Two deliveries can represent the same logical message. You need one canonical record.

Artifact dedupe (what you actually need)

In verification flows, the “thing you act on” is often an OTP or a magic link. You want to consume the artifact once, even if you received the message several times.

Dedupe keys that actually work

No single field is universally reliable. Use a tiered strategy and store all candidates.

Layer	Goal	Good dedupe key candidates	Notes
Delivery	Do not process the same delivery twice	`provider_delivery_id` (if available), webhook event id, `(inbox_id, message_id, delivered_at)`	Webhooks are typically at-least-once, so assume duplicates.
Message	One row per logical email	RFC `Message-ID` (normalized), hash of raw source, `(inbox_id, internal_message_id)`	`Message-ID` can be missing or duplicated, so keep a fallback.
Artifact	“Consume once” semantics	`sha256(artifact_type + canonical_value + context)`	Best for OTPs and verification URLs.

Practical dedupe algorithm

Use a deterministic sequence:

If your provider gives you a stable internal message_id, use it as the primary key within the inbox.
Store the RFC Message-ID (normalized) and a content hash (for example, a hash of the raw source or of a canonicalized subset).
Upsert the canonical message record by primary key, and record each delivery event in a separate table.
When extracting artifacts, compute an artifact_hash and enforce a unique constraint to guarantee consume-once.

This is the backbone of idempotency: you accept duplicates, but your database state stays correct.
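The artifact step can be sketched as follows. This is a minimal illustration, not a reference implementation: the `ArtifactLedger` class is a hypothetical in-memory stand-in for a table with a unique constraint on `artifact_hash`, and joining the parts with a separator character is an assumption added here so that different field splits cannot produce the same hash input.

```python
import hashlib


def artifact_hash(artifact_type: str, canonical_value: str, context: str) -> str:
    """Hash the three parts, separated so ("ab", "c") differs from ("a", "bc")."""
    material = "\x1f".join([artifact_type, canonical_value, context])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class ArtifactLedger:
    """Hypothetical in-memory stand-in for a unique constraint on artifact_hash."""

    def __init__(self):
        self._seen = set()

    def consume_once(self, artifact_type: str, canonical_value: str, context: str) -> bool:
        """Return True the first time an artifact is seen, False on every repeat."""
        key = artifact_hash(artifact_type, canonical_value, context)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

What goes into `context` is a design choice; an inbox id or a flow id keeps the same OTP value in two unrelated inboxes from colliding.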

Step 2: Normalize email into a deterministic shape

Normalization is where most “pipeline pain” hides. The goal is not to model every corner of MIME perfectly, but to build a stable contract for automation.

Normalize addresses without breaking edge cases

Common mistakes:

Lowercasing the entire address, which can be incorrect because the local part is technically case-sensitive.
Treating display names as trustworthy identifiers.
Parsing addresses with regex.

Prefer a mail parsing library and normalize conservatively:

Lowercase only the domain.
Preserve the local part exactly as received.
Store both the parsed structure and the original string.

A pragmatic normalized address object looks like this:

Field	Type	Example
`original`	string	`"Jane Doe" <Jane.Doe+qa@example.com>`
`address`	string	`Jane.Doe+qa@example.com`
`local`	string	`Jane.Doe+qa`
`domain`	string	`example.com`
`display_name`	string or null	`Jane Doe`
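A minimal sketch of this address object, using `parseaddr` from Python's standard library rather than a regex. The function name `normalize_address` and the exact dictionary layout are illustrative assumptions that mirror the table above.

```python
from email.utils import parseaddr


def normalize_address(original: str) -> dict:
    """Parse one address header value and lowercase only the domain."""
    display_name, addr = parseaddr(original)
    # Split on the last "@" so unusual local parts are left untouched.
    local, sep, domain = addr.rpartition("@")
    if not sep:
        # No "@" found: keep whatever was parsed as the local part.
        local, domain = addr, ""
    domain = domain.lower()
    return {
        "original": original,
        "address": f"{local}@{domain}" if domain else local,
        "local": local,
        "domain": domain,
        "display_name": display_name or None,
    }
```

For example, `normalize_address('"Jane Doe" <Jane.Doe+qa@Example.COM>')` keeps `Jane.Doe+qa` as received and stores the domain as `example.com`.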

Normalize headers with a trust model

Headers are attacker-controlled input. A good normalized representation:

Preserves the raw header block (for forensic debugging).
Produces a parsed map that handles folded lines and duplicate headers.
Separates “high-trust for dedupe/trace” fields (such as Message-ID) from “low-trust UI fields” (such as Subject).

If you want a reference point for how messy raw email can be, skim the core format in RFC 5322.
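One way to get these three properties is Python's standard `email` package, sketched below. The function name and the `trace` / `display` keys are hypothetical; the parser with `policy.default` unfolds continuation lines, and `get_all` returns every occurrence of a repeated header.

```python
import re
from email import policy
from email.parser import BytesParser


def normalize_headers(raw: bytes) -> dict:
    """Keep the raw header block and build a parsed, duplicate-aware map."""
    # Everything before the first blank line is the untouched header block.
    raw_block = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)[0]
    msg = BytesParser(policy=policy.default).parsebytes(raw, headersonly=True)

    parsed = {}
    for name in msg.keys():
        key = name.lower()
        if key not in parsed:
            # get_all returns every occurrence, so duplicates are not lost.
            parsed[key] = [str(value) for value in msg.get_all(name)]

    def first(key):
        values = parsed.get(key)
        return values[0] if values else None

    return {
        "raw": raw_block.decode("utf-8", errors="replace"),
        "parsed": parsed,
        "trace": {"message_id": first("message-id")},  # used for dedupe/trace
        "display": {"subject": first("subject")},      # low-trust, UI only
    }
```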

Normalize bodies for automation (text-first)

For pipelines and LLM agents:

Prefer text/plain when it is available.
Store the HTML, but do not let your automation depend on brittle HTML selectors.
Consider producing a “safe text” field by stripping tracking pixels, collapsing whitespace, and limiting length.
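A text-first “safe text” field might look like the sketch below. It only covers the text/plain path: when no plain part exists it returns an empty string instead of attempting HTML conversion, which is a simplification made here. The function name and the 4000-character default are assumptions.

```python
import re
from email import message_from_bytes, policy


def safe_text(raw: bytes, max_chars: int = 4000) -> str:
    """Return the text/plain body with collapsed whitespace and a length cap."""
    msg = message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""  # no plain part; HTML fallback is deliberately out of scope
    text = part.get_content().replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()[:max_chars]
```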

Normalize timestamps with intent

An email carries multiple timestamps:

Date: header: sender-provided, and can be wrong.
Transport timestamps: your provider's receipt time, usually the most reliable.

At a minimum, store:

received_at (provider receipt time, canonical for ordering)
date_header (optional, for display or diagnostics)
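As a rough sketch, the two fields can be produced like this. It assumes `received_at` arrives as a timezone-aware `datetime` from the provider and treats an unparseable Date: header as missing rather than as an error; the function name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def normalize_timestamps(received_at: datetime, date_header) -> dict:
    """Return canonical received_at plus a best-effort parsed Date: header."""
    parsed = None
    if date_header:
        try:
            parsed = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            parsed = None  # sender-provided value is untrusted; drop it if invalid
    if parsed is not None and parsed.tzinfo is not None:
        parsed = parsed.astimezone(timezone.utc)
    return {
        "received_at": received_at.astimezone(timezone.utc).isoformat(),
        "date_header": parsed.isoformat() if parsed is not None else None,
    }
```

Ordering and windowing queries should use `received_at` only; `date_header` is kept for display and diagnostics.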

Normalize attachments as metadata + content pointer

Do not dump attachments into logs or agent prompts.

A storage-friendly model:

Keep attachment metadata (filename, content-type, size).
Store a hash for integrity.
Store the bytes in object storage, referenced by key.
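The model above could be sketched like this. `put_object` is a hypothetical callable standing in for whatever object store you use, and the content-addressed `attachments/<sha256>` key layout is an assumption, not a requirement.

```python
import hashlib
from email import message_from_bytes, policy


def attachment_records(raw: bytes, put_object) -> list:
    """Store attachment bytes via put_object(key, data) and return metadata only."""
    msg = message_from_bytes(raw, policy=policy.default)
    records = []
    for part in msg.iter_attachments():
        data = part.get_payload(decode=True) or b""
        digest = hashlib.sha256(data).hexdigest()
        key = f"attachments/{digest}"  # content-addressed key (assumed layout)
        put_object(key, data)
        records.append({
            "filename": part.get_filename(),
            "content_type": part.get_content_type(),
            "size": len(data),
            "sha256": digest,
            "storage_key": key,
        })
    return records
```

Only the returned records belong in the normalized JSON; the bytes stay behind the storage key.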

Step 3: Store for reprocessing, not just for the “happy path”

A pipeline that cannot be replayed is a pipeline you cannot trust.

Recommended storage layers

Most teams end up with three layers:

Raw: the original RFC 5322 source (or the provider's raw payload), immutable.
Normalized: the canonical JSON representation used by downstream systems.
Derived: artifacts extracted for specific workflows, such as OTPs, verification URLs, and ticket IDs.

This structure makes it easy to re-run normalization or extraction when templates change.

Minimal relational schema

Here is a practical baseline schema (it works in Postgres, MySQL, and similar databases):

Table	Purpose	Key columns
`email_messages`	One row per logical message	`pk`, `inbox_id`, `provider_message_id`, `rfc_message_id`, `received_at`, `normalized_json`, `raw_ref`
`email_deliveries`	Every delivery attempt / event	`pk`, `message_pk`, `delivered_at`, `source` (webhook/poll), `event_id`
`email_artifacts`	Consume-once derived records	`pk`, `message_pk`, `artifact_type`, `artifact_value`, `artifact_hash` (unique), `extracted_at`

Two important operational notes:

Put a unique constraint on (inbox_id, provider_message_id), or on whichever message primary key you chose.
Put a unique constraint on artifact_hash for consume-once semantics.
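To make the two constraints concrete, here is a small sketch using SQLite from Python's standard library so it runs anywhere. The column types, the unique `event_id` on deliveries, and the helper names are assumptions; `INSERT OR IGNORE` is SQLite syntax, and on Postgres the rough equivalent is `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE email_messages (
    pk INTEGER PRIMARY KEY,
    inbox_id TEXT NOT NULL,
    provider_message_id TEXT NOT NULL,
    rfc_message_id TEXT,
    received_at TEXT NOT NULL,
    normalized_json TEXT NOT NULL,
    raw_ref TEXT,
    UNIQUE (inbox_id, provider_message_id)
);
CREATE TABLE email_deliveries (
    pk INTEGER PRIMARY KEY,
    message_pk INTEGER NOT NULL REFERENCES email_messages(pk),
    delivered_at TEXT NOT NULL,
    source TEXT NOT NULL,
    event_id TEXT UNIQUE  -- assumption: deliveries are deduped by event id
);
CREATE TABLE email_artifacts (
    pk INTEGER PRIMARY KEY,
    message_pk INTEGER NOT NULL REFERENCES email_messages(pk),
    artifact_type TEXT NOT NULL,
    artifact_value TEXT NOT NULL,
    artifact_hash TEXT NOT NULL UNIQUE,
    extracted_at TEXT NOT NULL
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_message(conn, inbox_id, provider_message_id, received_at, normalized_json) -> int:
    """Insert the message if it is new, then return its primary key either way."""
    conn.execute(
        "INSERT OR IGNORE INTO email_messages "
        "(inbox_id, provider_message_id, received_at, normalized_json) "
        "VALUES (?, ?, ?, ?)",
        (inbox_id, provider_message_id, received_at, normalized_json),
    )
    row = conn.execute(
        "SELECT pk FROM email_messages WHERE inbox_id = ? AND provider_message_id = ?",
        (inbox_id, provider_message_id),
    ).fetchone()
    return row[0]


def insert_artifact_if_new(conn, message_pk, artifact_type, value, artifact_hash, extracted_at) -> bool:
    """Return True only when this artifact_hash had not been stored before."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO email_artifacts "
        "(message_pk, artifact_type, artifact_value, artifact_hash, extracted_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (message_pk, artifact_type, value, artifact_hash, extracted_at),
    )
    return cur.rowcount == 1
```

Letting the database reject the duplicate, instead of checking first and inserting second, avoids a race between two workers that receive the same delivery at the same time.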

Retention and privacy by design

Email can contain secrets and personal data. Even in QA, teams accidentally route production-like content into test inboxes.

Treat retention as a first-class part of “clean emails”:

Keep raw content only for the minimum window needed for debugging and replay.
Redact or tokenize sensitive fields in logs.
If you store raw content, encrypt it at rest.
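For the log redaction point, a very small sketch is shown below. The two patterns (any URL, and any standalone run of 4 to 8 digits) are illustrative assumptions and will both over-match and under-match on real mail; treat them as a starting point rather than a complete filter.

```python
import re

# Illustrative patterns only; tune them to the artifacts your flows actually carry.
URL_RE = re.compile(r"https?://\S+")
CODE_RE = re.compile(r"\b\d{4,8}\b")


def redact_for_logs(text: str) -> str:
    """Replace URLs and short numeric codes before the text reaches a log line."""
    text = URL_RE.sub("[REDACTED_URL]", text)
    return CODE_RE.sub("[REDACTED_CODE]", text)
```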

Step 4: Ingest reliably (webhook-first, polling fallback)

Even “perfect normalization” fails if ingestion is flaky.

A robust pattern is:

Webhook-first for low-latency eventing.
Polling fallback for resilience when webhooks fail, queues back up, or your endpoint is down.

Webhook delivery is typically at-least-once, so your handler must be idempotent.

Webhook ingestion pseudocode (idempotent)

def handle_webhook(event):
    # Helper functions used here (verify_signature, upsert_email_message, and so on)
    # are placeholders for your own implementations.
    verify_signature(event)  # reject spoofed or replayed requests

    msg = event["message"]
    inbox_id = msg["inbox_id"]
    provider_message_id = msg["id"]  # provider-stable id (example)

    # 1) Upsert the message (idempotent)
    message_pk = upsert_email_message(
        inbox_id=inbox_id,
        provider_message_id=provider_message_id,
        rfc_message_id=normalize_message_id(msg.get("headers", {}).get("message-id")),
        received_at=msg["received_at"],
        normalized_json=msg,
        raw_ref=msg.get("raw_ref"),
    )

    # 2) Record the delivery event (append-only)
    insert_delivery_event(
        message_pk=message_pk,
        event_id=event["event_id"],
        delivered_at=event["delivered_at"],
        source="webhook",
    )

    # 3) Extract artifacts (consume-once)
    for artifact in extract_artifacts(msg):
        insert_artifact_if_new(message_pk, artifact)

With this structure, webhook retries do not create duplicates, and extraction stays safe.
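The signature check in the handler is left abstract, and the real scheme depends on your provider. As a hedged illustration only, the sketch below assumes an HMAC-SHA256 signature, hex-encoded, computed over `"<timestamp>.<raw body>"` with a shared secret, plus a freshness window to limit replays. None of these details are taken from a specific provider's documentation, so check the actual contract before relying on them.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, body: bytes, timestamp: int,
                       signature_hex: str, now: int, tolerance_s: int = 300) -> bool:
    """Check an assumed HMAC-SHA256 signature over "<timestamp>.<body>"."""
    # Reject stale or far-future timestamps to narrow the replay window.
    if abs(now - timestamp) > tolerance_s:
        return False
    signed = str(timestamp).encode("ascii") + b"." + body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature_hex)
```

Verify against the raw request bytes, before any JSON parsing, since re-serialized JSON may not match what was signed.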

Batch processing for throughput

If you ingest at high volume (CI fleets, many agent runs, or backfills), batch APIs matter. Batch ingestion lets you:

Reduce per-message overhead.
Apply consistent backpressure.
Re-run extraction jobs over a slice of messages.
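On the consumer side, batching can be as simple as chunking an iterable and handing each chunk to one handler call. The sketch below is generic and does not model any particular batch API; `handle_batch` is a hypothetical callback, and the default size of 100 is arbitrary.

```python
from itertools import islice


def batched(items, size: int):
    """Yield lists of up to `size` items without loading everything at once."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_in_batches(messages, handle_batch, size: int = 100) -> int:
    """Call handle_batch once per chunk and return how many messages were handled."""
    processed = 0
    for chunk in batched(messages, size):
        handle_batch(chunk)  # e.g. one transaction or one bulk upsert per chunk
        processed += len(chunk)
    return processed
```

Because the generator pulls items lazily, a slow `handle_batch` naturally slows down reads from the source, which gives a simple form of backpressure.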

Step 5: Create an “LLM-safe view” of the email

LLMs are powerful consumers, but email is untrusted input and a common prompt-injection vector. For agents, “clean emails” means providing a minimized contract.

Typical agent-facing view:

message_id, inbox_id, received_at
from, to (parsed addresses)
subject (optional)
text (plain text only, length-capped)
artifacts (already extracted OTPs or allowlisted URLs)

Keep raw HTML and full headers out of the agent context unless you have a very specific reason.
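A minimal sketch of building that view from a normalized record. The input keys (`text`, `artifacts`, and so on) assume a normalized JSON shape like the one described in this guide; adapt them to your own record. The important property is that the view is built from an allowlist, so any new field stays hidden from the agent until someone adds it on purpose.

```python
ALLOWED_FIELDS = ("message_id", "inbox_id", "received_at", "from", "to", "subject")


def build_agent_view(message: dict, max_text_chars: int = 2000) -> dict:
    """Copy only allowlisted fields, cap the text, and pass through artifacts."""
    view = {key: message.get(key) for key in ALLOWED_FIELDS}
    view["text"] = (message.get("text") or "")[:max_text_chars]
    view["artifacts"] = list(message.get("artifacts") or [])
    return view
```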

Figure: inbound email arrives at an ingestion service, passes through three stages (Deduplicate, Normalize, Store), and then branches to Analytics and LLM Agent tools.

Where Mailhook fits in a clean email pipeline

If you are building automation around disposable inboxes (for LLM agents, QA, or verification flows), Mailhook provides primitives that map well onto the pipeline above:

Create disposable inboxes via API.
Receive emails as structured JSON.
Deliver events via real-time webhooks.
Use polling as a fallback retrieval mechanism.
Verify signed payloads for webhook security.
Process messages in batches.
Choose shared domains or bring a custom domain.

For exact request/response fields and the canonical integration contract, use Mailhook's machine-readable reference: llms.txt. You can also start from the main Mailhook site.

Final checklist for “clean emails”

If you only implement a few things, do these:

Deduplicate by design: separate delivery events from canonical messages, and enforce unique constraints.
Normalize deterministically: conservative address normalization, parsed headers with duplicates handled, text-first bodies.
Store for replay: keep raw (briefly) plus normalized JSON, and derive artifacts into a consume-once table.
Assume at-least-once: webhook handlers must be idempotent.
Keep agents on a leash: give LLMs a minimized email view and extracted artifacts, not raw HTML.

Clean emails are not just nicer data; they are the difference between pipelines you can scale and pipelines you have to babysit.
