Why shouldn't you use regex for email validation?

Email regexes often produce both false positives and false negatives. They are complex and can also cause performance issues. Dedicated parsing libraries are more reliable.

How should you handle plus addressing (+)?

Plus addressing is provider-specific. Gmail supports it, but many corporate systems do not. Define your policy clearly and keep a configurable strategy.

Why isn't a DNS check sufficient for deliverability?

A DNS check only verifies that the domain exists, not that the mailbox does. For real deliverability, you need to send a verification email and confirm the response.

Email Addresses in Automation: Validation and Edge Cases

Email is one of the internet's oldest “APIs”, and it still sits on the critical path of onboarding, password resets, alerts, and identity workflows. That makes email addresses a surprisingly high-leverage surface area for automation: if your validation is too strict, you block real users and generate support tickets. If it is too lax, you ship flaky tests, leak data, or open injection vulnerabilities.

For AI agents and LLM-driven automations, the problem gets harder: agents often extract addresses from messy text, then act on them in systems with different rules. This guide covers practical validation layers and the edge cases that commonly break automated flows.

What “valid” means (and why teams disagree)

In production and in test harnesses, “valid email address” can mean at least four different things:

Meaning of “valid”	What you are actually checking	Where it belongs	Common failure mode
Syntactically valid	The string can be parsed as an email address	Client and API boundary	An overly strict regex rejects legitimate addresses
Routable domain	The domain has DNS records (often MX, sometimes A/AAAA)	Server side	Always treating “no MX” as invalid (some domains accept mail on A/AAAA)
Deliverable mailbox	The mailbox exists and accepts mail	Only by sending a message (verification)	SMTP “probe” logic gets blocked or rate limited, or violates policy
Acceptable for your product	Your product-specific rules (no role accounts, no plus tags, etc.)	Product layer	Product rules accidentally break enterprise or international users

Key automation lesson: separate parsing from policy. Parse addresses with a real parser, then apply your business rules explicitly.

Standards vs reality: the gap that creates edge cases

If you validate against the full email grammar, you will accept addresses that many systems never see in practice. If you validate only against “what Gmail accepts”, you will reject plenty of real addresses.

Useful references for what is allowed at the protocol level:

RFC 5322 (message format, mailbox syntax)
RFC 5321 (SMTP, including length constraints)
RFC 6531 (SMTPUTF8 for internationalized email)

Also note that many frontends rely on the HTML “email” input type, which intentionally uses a simplified syntax that differs from RFC 5322 (for example, it does not accept quoted local parts) and is not a full RFC validator. See the WHATWG HTML spec for type="email".

Practical constraints you should enforce

These constraints are widely used and align with SMTP limits:

Total length: 254 characters max is the commonly enforced limit, derived from SMTP constraints (the practical ceiling many systems use).
Local part length: 64 characters max.
No control characters: especially \r and \n.
Trim surrounding whitespace: but do not silently remove internal whitespace.

Even if your parser can accept more, enforcing these limits prevents downstream failures in vendors, CRMs, and transactional email providers.
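
The limits above can be sketched as one small pre-check that runs before any real parsing. This is a minimal illustration, not a validator: the function name `check_basic_constraints` and the wording of the problem messages are assumptions made for this example, and it counts characters, which equals the SMTP octet count only for ASCII input.

```python
def check_basic_constraints(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the basic limits pass."""
    problems = []
    # Trim surrounding whitespace only; internal characters are left untouched.
    address = raw.strip()
    # Reject ASCII control characters, which covers \r and \n.
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in address):
        problems.append("contains control characters")
    if len(address) > 254:
        problems.append("longer than 254 characters")
    # Split on the last "@" so a quoted local part containing "@" stays intact.
    local, sep, _domain = address.rpartition("@")
    if not sep:
        problems.append("missing @")
    elif len(local) > 64:
        problems.append("local part longer than 64 characters")
    return problems
```

Returning a list of problems rather than a boolean keeps the failure explainable when a test or an agent run rejects an address.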

Edge cases that break automation (and what to do about them)

Most flaky “email tests” are not about sending email. They are about differences in how systems interpret addresses.

1) Plus addressing (subaddressing)

Example: [email protected]

Many providers route +tag to the same mailbox, which is great for automation and correlation.
Some corporate systems and legacy identity providers reject +.

Automation guidance: if you control address generation for tests, keep a configurable strategy: prefer plus tags for correlation, but be able to fall back to a simple runid-uuid@domain.
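
A configurable strategy like the one described can be as small as a single flag. The sketch below is hypothetical: `make_test_address`, the default mailbox `qa`, and the `example.com` domain are placeholders chosen for illustration, not part of any real tool.

```python
import uuid


def make_test_address(run_id: str, mailbox: str = "qa",
                      domain: str = "example.com",
                      use_plus_tag: bool = True) -> str:
    """Build a run-correlated test address using one of two strategies."""
    if use_plus_tag:
        # Preferred: the run id rides in the +tag, so one mailbox serves all runs.
        return f"{mailbox}+{run_id}@{domain}"
    # Fallback for systems that reject "+": a unique local part per run.
    return f"{run_id}-{uuid.uuid4().hex[:8]}@{domain}"
```

Switching `use_plus_tag` per target environment lets the same test suite run against systems that reject + without changing the tests themselves.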

2) Dot normalization (Gmail-specific behavior)

Example: [email protected] and [email protected] are equivalent in Gmail, but not in most other systems.

Automation guidance: never assume that dots are ignored. Treat the full local part string as the identifier unless you are intentionally applying a provider-specific rule.

3) Case sensitivity in the local part

Technically, the local part can be case-sensitive, but almost all modern providers treat it as case-insensitive.

Automation guidance: store the original for display, but compare using a normalized form that you define (typically lowercasing the domain, and optionally the local part if your product treats it as case-insensitive). Make this a deliberate policy decision.
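
One way to keep that decision explicit is to derive a separate comparison key and never overwrite the stored original. The helper below is a sketch; `comparison_key` and its `fold_local_case` flag are names invented for this example.

```python
def comparison_key(address: str, fold_local_case: bool = True) -> str:
    """Return the form used for equality checks; store the original separately."""
    local, _, domain = address.strip().rpartition("@")
    if fold_local_case:
        # Policy choice: treat the local part as case-insensitive.
        local = local.lower()
    # Domains are case-insensitive, so lowercasing them is always safe.
    # Dots and plus tags are deliberately left alone.
    return f"{local}@{domain.lower()}"
```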

4) Quoted local parts (valid but rarely supported)

Example: "john..doe"@example.com (yes, quotes can change what is allowed)

Automation guidance: decide early whether to support quoted local parts. Many SaaS products intentionally reject them to reduce complexity. If you reject them, document it and return a clear error.

5) Internationalized email (EAI) and IDNs

Examples:

Local part with Unicode: miyuki.さくら@例え.テスト (requires SMTPUTF8 support)
Internationalized domain names: user@bücher.example (often converted to punycode internally)

Automation guidance: supporting IDNs (Unicode domains) is easier than supporting Unicode local parts. If you accept Unicode domains, convert them using an IDNA library and validate the ASCII (punycode) form for DNS checks.
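
As a rough illustration of the conversion step, Python's built-in `idna` codec can produce the ASCII form of a domain. Note the assumption: the built-in codec implements the older IDNA 2003 rules, so a project that needs IDNA 2008 behavior would typically swap in the third-party `idna` package instead. The helper name `domain_to_ascii` is made up for this sketch.

```python
def domain_to_ascii(domain: str) -> str:
    """Convert a possibly-Unicode domain to its ASCII (punycode) form.

    Raises UnicodeError if a label cannot be encoded.
    """
    # The built-in codec follows IDNA 2003; lower() normalizes ASCII-only input.
    return domain.encode("idna").decode("ascii").lower()


print(domain_to_ascii("bücher.example"))  # xn--bcher-kva.example
```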

6) Domain literals

Example: user@[192.168.1.10]

These are syntactically valid in some contexts but almost never acceptable in consumer signup.

Automation guidance: reject domain literals unless you have a specific enterprise use case. They complicate security review and can create surprising routing behavior.

7) Display names and angle brackets (common in extraction)

Example: Jane Doe <[email protected]>

Humans write this format constantly, and LLMs often output it when summarizing.

Automation guidance: your input validation can reject this, but your extraction pipeline should handle it. Instead of a regex, use a parser that can extract the addr-spec portion.
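
In Python, the standard library's `email.utils.parseaddr` is one parser that separates the display name from the addr-spec. The wrapper below is a minimal sketch; `extract_addr_spec` is a hypothetical helper name, the `jane@example.com` address is a placeholder, and the `"@" in addr` check is only a guard against non-address input, not full validation.

```python
from email.utils import parseaddr
from typing import Optional


def extract_addr_spec(text: str) -> Optional[str]:
    """Pull the addr-spec out of 'Name <addr>' style input, or return None."""
    _display_name, addr = parseaddr(text)
    # parseaddr is lenient, so confirm the result at least looks like an address.
    return addr if "@" in addr else None


print(extract_addr_spec("Jane Doe <jane@example.com>"))  # jane@example.com
```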

8) Trailing punctuation from natural language

Examples:

email me at [email protected].
send to ([email protected])

Automation guidance: when extracting from text, strip surrounding punctuation after parsing, not before. Otherwise you risk turning [email protected] into something else.
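
A small sketch of that ordering: first locate candidate tokens, then strip wrapper punctuation only from the ends of each candidate, so dots inside the address are never touched. The names `WRAPPERS` and `find_candidates` and the sample addresses are assumptions for this example. Because quotes are treated as wrappers, this simple version would damage quoted local parts, which fits a policy that rejects them anyway.

```python
# Characters that commonly wrap or trail an address in prose.
WRAPPERS = ".,;:!?()[]{}<>\"'"


def find_candidates(text: str) -> list[str]:
    """Return address-like tokens with surrounding punctuation removed."""
    candidates = []
    for token in text.split():
        if "@" not in token:
            continue
        # strip() only removes characters at the ends of the token.
        candidates.append(token.strip(WRAPPERS))
    return candidates
```

Each candidate should still go through the real parser and the policy checks; this step only cleans up what natural language wraps around an address.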

9) Multiple addresses in one field

Example: [email protected], [email protected]

This comes up when agents read “cc these people” instructions.

Automation guidance: treat “single email” fields as single, and fail loudly. If you support lists, model them as arrays at the API layer.
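
Failing loudly can look like the following sketch, which uses the standard library's `email.utils.getaddresses` to count what a field actually contains. The function name `require_single_address` and the error message are illustrative choices.

```python
from email.utils import getaddresses


def require_single_address(field: str) -> str:
    """Return the one address in the field, or raise if there are zero or many."""
    found = [addr for _name, addr in getaddresses([field]) if addr]
    if len(found) != 1:
        # Refuse to guess which address the caller meant.
        raise ValueError(f"expected exactly one address, got {len(found)}")
    return found[0]
```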

A validation pipeline that works for humans, bots, and agents

A robust approach is layered, with each layer answering one question.

A simple five-step pipeline diagram showing: 1) normalize input, 2) parse address, 3) enforce length and character policy, 4) optional DNS check, 5) verification email and inbox observation. Each step is a labeled box connected left to right.

Layer 1: Normalize (without changing meaning)

Recommended normalization for automation:

Trim leading and trailing whitespace.
Convert the domain to lowercase.
If you accept IDNs, convert the Unicode domain to ASCII (punycode) for storage and DNS.
Do not remove dots or plus tags unless you are explicitly implementing provider-specific behavior.

Layer 2: Parse, don't regex

Email regexes are notorious for false positives and false negatives. Prefer a well-maintained email parsing library in your language that can:

Parse addr-spec correctly
Reject control characters
Handle common extraction formats (Name <email@domain>)

Layer 3: Apply product policy

Keep policy checks explicit and testable:

Allow or disallow plus addressing
Allow or disallow quoted local parts
Allow or disallow Unicode local parts
Blocklists (temporary domains, known role accounts) if your product requires them
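
One way to keep these decisions explicit and testable is to put them in a small configuration object and have the check return named violations. Everything in this sketch (`EmailPolicy`, its field names and defaults, `policy_violations`) is hypothetical and only illustrates the shape; it expects an address that has already been parsed into local part and domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmailPolicy:
    """Product-level decisions, kept separate from parsing."""
    allow_plus_tags: bool = True
    allow_quoted_local: bool = False
    allow_unicode_local: bool = False
    blocked_domains: frozenset = frozenset()


def policy_violations(local: str, domain: str, policy: EmailPolicy) -> list[str]:
    """Return the policy rules an already-parsed address breaks."""
    violations = []
    if "+" in local and not policy.allow_plus_tags:
        violations.append("plus tags are not allowed")
    if local.startswith('"') and not policy.allow_quoted_local:
        violations.append("quoted local parts are not allowed")
    if not local.isascii() and not policy.allow_unicode_local:
        violations.append("Unicode local parts are not allowed")
    if domain.lower() in policy.blocked_domains:
        violations.append("domain is blocklisted")
    return violations
```

Because the policy is plain data, each rule can be unit tested on its own and changed per environment without touching the parser.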

Layer 4: Optional DNS check (helpful, not definitive)

A DNS lookup can catch obvious typos like gmial.com when the mistyped domain does not resolve, but it is not a guarantee of deliverability.

Practical rules:

Treat DNS checks as a hint for UX (for example, “did you mean…?”).
Do not block signups on “no MX” alone unless your requirements justify it.
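
The “did you mean…?” hint itself does not need a network call. The sketch below is not a DNS check: it compares the typed domain against a short list of well-known domains using the standard library's `difflib`, and could be shown alongside or instead of a lookup result. The list `KNOWN_DOMAINS`, the 0.8 cutoff, and the function name `suggest_domain` are all assumptions to tune for a real product.

```python
import difflib
from typing import Optional

# Illustrative list; a real product would maintain its own.
KNOWN_DOMAINS = ["gmail.com", "yahoo.com", "outlook.com", "hotmail.com"]


def suggest_domain(domain: str) -> Optional[str]:
    """Return a likely intended domain for an obvious typo, else None."""
    domain = domain.lower()
    if domain in KNOWN_DOMAINS:
        return None
    # A high cutoff keeps suggestions limited to near-identical spellings.
    matches = difflib.get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=0.8)
    return matches[0] if matches else None


print(suggest_domain("gmial.com"))  # gmail.com
```

The result is only a suggestion to show the user; the signup should not be blocked on it.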

Layer 5: Verification is the real deliverability test

If you need to know that a mailbox is real, sending a verification email (magic link or OTP) and having the user or workflow complete it is the only reliable method.

In automated environments (CI, QA, agent runs), this is where deterministic inbox tooling matters. Mailhook's approach is to create disposable inboxes via API and receive messages as structured JSON (with webhooks or polling), which makes verification flows testable without brittle HTML parsing. For an up-to-date, canonical feature list, refer to the product's llms.txt.

Edge cases in test automation: making failures explainable

Validation bugs are painful, but automation failures are worse when you cannot explain them. The goal is to make every email-related failure actionable.

Correlate addresses to runs, not people

In tests and agent workflows, generate addresses that are clearly tied to a run identifier:

Include a short run id in the local part when allowed (for example, plus tags or a delimiter).
Avoid leaking customer-like identifiers into logs.

This makes it easy to answer “Which run generated this email?” without digging through HTML bodies.

Model retries and duplicates

Verification systems frequently resend. Your harness should expect:

Duplicate OTP emails
Out-of-order arrival
Multiple templates (welcome email plus verification email)

Instead of asserting “the first email is the verification”, assert on stable fields (recipient, subject patterns, presence of a link token), and make your code idempotent.
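
As a sketch of asserting on stable fields, the helper below filters received messages by recipient, subject pattern, and the presence of a link token, and collapses duplicates so a resend does not change the outcome. The message shape (dicts with `to`, `subject`, and `text` keys), the `token=` query parameter, and the name `find_verification_tokens` are assumptions for this example, not the format of any particular inbox API.

```python
import re

# Assumed link format: ...?token=<value>
TOKEN_RE = re.compile(r"[?&]token=([A-Za-z0-9_-]+)")


def find_verification_tokens(messages: list[dict], recipient: str) -> list[str]:
    """Collect distinct verification tokens sent to one recipient."""
    tokens = []
    for msg in messages:
        if msg.get("to", "").lower() != recipient.lower():
            continue  # belongs to another run or user
        if "verify" not in msg.get("subject", "").lower():
            continue  # e.g. a welcome email that arrived first
        match = TOKEN_RE.search(msg.get("text", ""))
        # Duplicate resends carry the same token; keep each token once.
        if match and match.group(1) not in tokens:
            tokens.append(match.group(1))
    return tokens
```

Since the result does not depend on arrival order or on how many copies were delivered, a test can assert that exactly one token was found.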

Agent workflows: extraction and document-heavy pipelines

LLM agents often sit upstream of your validation logic. They may extract email addresses from PDFs, chat logs, intake forms, or long email threads. In these contexts, you want a two-stage process:

Extract candidates (possibly multiple) with provenance (where each one came from).
Validate and select using your parsing and policy rules.
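
The two stages can be kept apart by carrying provenance on each candidate and passing the validation rule in as a function. This is a minimal sketch: `Candidate`, `select_candidates`, and the `is_acceptable` callback are hypothetical names, and the callback stands in for whatever parsing and policy checks a project already has.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    address: str
    source: str  # where it was found, e.g. "intake.pdf, page 3"


def select_candidates(candidates: list[Candidate],
                      is_acceptable: Callable[[str], bool]) -> list[Candidate]:
    """Stage two: keep candidates that pass validation, first occurrence wins."""
    seen = set()
    selected = []
    for candidate in candidates:
        key = candidate.address.lower()
        if key in seen or not is_acceptable(candidate.address):
            continue
        seen.add(key)
        selected.append(candidate)
    return selected
```

Keeping the source next to each address means a rejected or ambiguous extraction can be traced back to the document it came from.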

This is especially relevant in document-heavy domains. For example, legal teams using tools like TrialBase AI to generate litigation materials from case files may need reliable contact extraction before sending demand letters or requests. Automations that treat “Jane Doe [email protected]” as invalid without extracting the address will break at exactly the moment when speed matters.

Security and reliability pitfalls (easy to miss)

Even if you are “just validating an email”, the string is untrusted input.

Header injection: always reject or escape \r and \n to prevent attackers from injecting extra headers when the value is later used in email-sending code.
Regex performance: some complex email regexes are vulnerable to catastrophic backtracking. Use parsers or simple bounded checks.
Logging and privacy: treat email addresses as personal data. Log hashes or redacted forms when possible.
Normalization surprises: if you normalize aggressively (dots, plus tags), you can accidentally merge distinct accounts on non-Gmail domains.
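
For the logging point, a redacted form plus a short hash is often enough to correlate log lines without printing the address. The sketch below is one possible shape; `redact_for_log` and its output format are assumptions. A plain SHA-256 of an email address can be confirmed by anyone who guesses the address, so a keyed hash (HMAC with a secret) is the safer variant where that matters.

```python
import hashlib


def redact_for_log(address: str) -> str:
    """Return a log-safe form: first character, domain, and a short digest."""
    cleaned = address.strip()
    local, _, domain = cleaned.rpartition("@")
    # Hash the lowercased address so case variants correlate in logs.
    digest = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()[:12]
    return f"{local[:1]}***@{domain} (sha256:{digest})"
```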

For general input validation guidance, OWASP's Input Validation Cheat Sheet is a solid baseline.

Practical checklist for teams shipping automation

Use this for a final review of your email address handling:

Your system uses a parser for core validation, not a single regex.
You enforce length limits and reject control characters.
You separate “syntactically valid” from “deliverable”, and use verification for deliverability.
Your extraction pipeline can handle Name <email@domain> and trailing punctuation.
Your policy decisions (plus tags, quoted local parts, Unicode) are explicit and documented.
Your tests generate unique, run-correlated addresses and tolerate resends and out-of-order emails.

If your automations depend on email verification, the final step is always observability: you need to reliably see what was sent, to which address, and when, and consume it in a machine-friendly format.