Magic links are supposed to make authentication simple: open the email, click the link, and you are signed in. For LLM agents and automated QA, that same flow often becomes brittle when the agent has to “read” a rendered email, scrape a button from HTML, guess which URL is correct, and then click it safely.
A better pattern is to treat the magic link as a typed verification artifact inside an email event. Instead of scraping HTML, your test harness creates a disposable inbox, receives the message as structured JSON, extracts one allowed URL under a strict policy, and gives the agent only the minimum action it needs.
That shift matters for reliability, but it matters even more for safety. Email is untrusted input. HTML can contain misleading text, tracking wrappers, hidden content, prompt-injection instructions, malformed links, and template drift that breaks selectors. Agents should not be asked to interpret all of that.
Why HTML scraping is the wrong contract for magic link testing
HTML email is a presentation layer, not an automation API. It changes whenever marketing, product, or localization teams update templates. Buttons may become images, links may be wrapped by click-tracking services, and email clients may rewrite markup in ways your test never sees locally.
For a human, “Click the blue Sign in button” is reasonable. For an agent, it is an underspecified instruction. The agent must decide which link is the sign-in link, whether tracking links are acceptable, whether a visible button label matches the underlying href, and whether hidden or injected content should be ignored.
| Scraping rendered HTML | Extracting from structured JSON |
|---|---|
| Depends on CSS, button text, and template structure | Depends on normalized message fields and policy rules |
| Exposes the agent to the full email body | Exposes only the verified artifact or a small safe view |
| Breaks when templates or localizations change | Survives visual redesigns if URLs and matchers remain stable |
| Encourages broad browser-like behavior | Keeps URL parsing, validation, and consumption deterministic |
| Hard to debug in parallel runs | Logs stable IDs such as inbox, message, delivery, and attempt IDs |
For agent workflows, the goal is not to “understand the email.” The goal is narrower: detect the expected login email, extract the expected one-time URL, validate that it is safe to use, and consume it once.
Model the magic link as an artifact
A magic link test becomes much simpler when your harness has an artifact contract. The email message can still contain raw text and HTML for debugging, but the agent-facing output should be small and typed.
A practical artifact for magic link testing might look like this:
```json
{
  "type": "magic_link",
  "url": "https://app.example.com/auth/magic?token=...",
  "host": "app.example.com",
  "path": "/auth/magic",
  "message_id": "msg_123",
  "inbox_id": "inbox_456",
  "attempt_id": "run_789",
  "expires_hint": null
}
```
The important design choice is that the agent does not receive the full HTML and then decide what to do. Your deterministic code extracts and validates the URL first. The agent receives the result of that decision, not the raw decision surface.
If you are defining a broader email JSON schema, keep trust boundaries explicit. Provider-attested fields such as inbox ID, delivery ID, and received timestamp are usually safer for orchestration than sender-claimed headers. Derived fields, like magic_link, should be produced by your own extraction policy. For a deeper schema pattern, see Mailhook’s guide to email to JSON for agents and QA.
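To make those boundaries concrete, here is a minimal TypeScript sketch of the types involved, plus a runtime guard for the agent-facing artifact. The field names mirror the example artifact above. The `EmailJson` shape, the split into attested and claimed fields, and the `isMagicLinkArtifact` helper are illustrative assumptions, not Mailhook's actual payload format.

```typescript
// Illustrative shape only; your provider's JSON fields may differ.
interface EmailJson {
  // Provider-attested: set by the receiving service, safe for orchestration.
  inbox_id: string
  message_id: string
  delivery_id: string
  received_at: string
  // Sender-claimed: copied from the message itself, treat as untrusted input.
  from: string
  subject: string
  text?: string
  html?: string
}

// Derived: produced only by your own extraction policy.
interface MagicLinkArtifact {
  type: "magic_link"
  url: string
  host: string
  path: string
  message_id: string
  inbox_id: string
  attempt_id: string
  expires_hint: string | null
}

// Runtime guard for the agent-facing contract: a tool only forwards values
// that are well-formed https artifacts whose host/path agree with the URL.
function isMagicLinkArtifact(value: unknown): value is MagicLinkArtifact {
  if (typeof value !== "object" || value === null) return false
  const v = value as Record<string, unknown>
  if (v.type !== "magic_link" || typeof v.url !== "string") return false
  for (const key of ["host", "path", "message_id", "inbox_id", "attempt_id"]) {
    if (typeof v[key] !== "string") return false
  }
  if (v.expires_hint !== null && typeof v.expires_hint !== "string") return false
  try {
    const url = new URL(v.url)
    return url.protocol === "https:" && url.hostname === v.host && url.pathname === v.path
  } catch {
    return false // unparseable URL
  }
}
```

A wait tool can run this guard just before returning, so an extraction bug surfaces as a typed failure instead of a malformed URL reaching the agent.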
A deterministic magic link testing flow for agents
The reliable flow has five steps. Each step reduces ambiguity before the agent is allowed to act.
- Create one disposable inbox per attempt: Do not reuse a shared mailbox across agent runs. A new inbox gives the attempt a clean message set, removes stale-link ambiguity, and makes retries safer.
- Trigger the magic link email with correlation: Use the generated email address in the sign-in flow. If your application supports a state, run ID, tenant ID, or test-only correlation header, include it. If not, rely on inbox isolation plus narrow sender and subject matchers.
- Receive the message as JSON: Prefer real-time webhooks for low-latency arrival, with polling as a bounded fallback. The message should arrive as structured data, not as something the agent has to scrape from a mailbox UI.
- Extract and validate exactly one URL: Parse candidate URLs from the text and structured fields, score them against an allowlist, reject unsafe hosts or schemes, and return only the accepted magic link artifact.
- Consume once and record the result: Mark the artifact as used by attempt ID and URL hash. Store enough metadata to debug failures, but avoid logging raw tokens unless your security policy explicitly allows it.
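Step 5 is small enough to sketch. The version below is a hypothetical in-memory `ConsumedArtifacts` registry keyed by attempt ID plus a SHA-256 hash of the URL, built on Node's `crypto` module. A shared harness would more likely enforce the same key with a database unique constraint.

```typescript
import { createHash } from "node:crypto"

// Hypothetical consume-once registry. Only the hash of the URL is kept,
// so the raw token never needs to be stored or logged.
class ConsumedArtifacts {
  private readonly seen = new Map<string, string>() // key -> ISO timestamp of first use

  static urlHash(url: string): string {
    return createHash("sha256").update(url).digest("hex")
  }

  // Returns true the first time an (attemptId, url) pair is seen, false afterwards.
  markConsumed(attemptId: string, url: string, now: Date = new Date()): boolean {
    const key = `${attemptId}:${ConsumedArtifacts.urlHash(url)}`
    if (this.seen.has(key)) return false
    this.seen.set(key, now.toISOString())
    return true
  }
}
```

Returning a boolean instead of throwing lets the caller decide whether a second use is a test failure or an expected duplicate delivery.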
With Mailhook, the inbox and delivery pieces map directly to programmable disposable inboxes, structured JSON email output, real-time webhooks, polling fallback, signed payloads, shared domains, and custom domain support. For exact integration details, use the canonical Mailhook llms.txt reference.
Extract links without scraping HTML
“Without HTML scraping” does not mean “never inspect a message body.” It means you do not treat the rendered HTML as the source of truth. You should avoid DOM selectors like a.button.primary, visual text like “the first blue button,” and browser-style rendering logic.
A safer extractor uses a policy-driven pipeline:
- Prefer `text/plain` when available, because it is less likely to hide content behind visual markup.
- Parse URLs with a real URL parser, not a single broad regex.
- Allow only expected schemes, usually `https`.
- Allow only expected hosts, such as `app.example.com` or a dedicated auth domain.
- Require expected paths or query parameters, such as `/auth/magic`, `/login/verify`, `token`, or `state`.
- Reject URLs in unsubscribe, preference, tracking, social, or documentation sections unless explicitly allowed.
- Return zero or one artifact. If two links score equally, fail closed and log candidates for debugging.
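The last rule is the one most often skipped. Here is a sketch of a fail-closed selector in the spirit of the `chooseSingleWinner` step in the extraction code below. The `ScoredCandidate` shape is an assumption. Deduplicating by URL first matters: the same link usually appears in both the text and HTML parts, and without deduplication it would tie with itself.

```typescript
interface ScoredCandidate {
  href: string
  score: number
}

// Return the single highest-scoring candidate, or null when there are no
// candidates or when two distinct URLs share the top score (fail closed).
function chooseSingleWinner(candidates: ScoredCandidate[]): ScoredCandidate | null {
  // Keep one entry per href, so a link present in text and HTML counts once.
  const byHref = new Map<string, ScoredCandidate>()
  for (const c of candidates) {
    const existing = byHref.get(c.href)
    if (!existing || c.score > existing.score) byHref.set(c.href, c)
  }
  const ranked = Array.from(byHref.values()).sort((a, b) => b.score - a.score)
  const [first, second] = ranked
  if (!first) return null
  if (second && second.score === first.score) return null // ambiguous
  return first
}
```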
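Here is a provider-agnostic TypeScript sketch of the extraction layer. The types and the helpers it declares are placeholders for your own implementations:

```typescript
// EmailJson, MagicLinkPolicy, MagicLinkArtifact and the declared helpers are placeholders.
declare function stripHtmlToText(html: string): string
declare function extractUrls(text: string): string[]
declare function scoreMagicLink(url: URL, email: EmailJson, policy: MagicLinkPolicy): number
declare function chooseSingleWinner(scored: { url: URL; score: number }[]): { url: URL; score: number } | null

function tryParseUrl(raw: string): URL | null {
  try {
    return new URL(raw) // WHATWG URL parser, not string slicing
  } catch {
    return null // malformed URLs are dropped, never guessed at
  }
}

function extractMagicLink(email: EmailJson, policy: MagicLinkPolicy): MagicLinkArtifact {
  // Prefer text/plain; only convert HTML when it exists.
  const bodies = [email.text, email.html ? stripHtmlToText(email.html) : undefined]
    .filter((body): body is string => Boolean(body))

  // Deduplicate by normalized href so a link present in both parts is not a "tie".
  const unique = new Map<string, URL>()
  for (const raw of extractUrls(bodies.join("\n"))) {
    const url = tryParseUrl(raw)
    if (url) unique.set(url.href, url)
  }

  const candidates = Array.from(unique.values())
    .filter((url) => url.protocol === "https:")
    .filter((url) => policy.allowedHosts.includes(url.hostname))
    .filter((url) => policy.allowedPaths.some((path) => url.pathname.startsWith(path)))
    .filter((url) => policy.requiredQueryParams.some((name) => url.searchParams.has(name)))

  const scored = candidates.map((url) => ({ url, score: scoreMagicLink(url, email, policy) }))
  const best = chooseSingleWinner(scored)
  if (!best) {
    // Fail closed: zero candidates, or several equally good ones.
    throw new Error("No unique magic link matched the policy")
  }

  return {
    type: "magic_link",
    url: best.url.href,
    host: best.url.hostname,
    path: best.url.pathname,
    inbox_id: email.inbox_id,
    message_id: email.message_id,
    received_at: email.received_at,
  }
}
```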
HTML may still be useful as a fallback source if the sender does not include a plain-text part. In that case, convert HTML to text with a conservative sanitizer, then run the same URL policy. Do not let the agent render the HTML and choose a link. The extraction decision should remain deterministic code.
For URL handling, use platform URL parsers aligned with the WHATWG URL standard rather than custom string slicing. If your system ever resolves redirects or fetches URLs before handing them to an agent, apply SSRF defenses such as host allowlists, private network blocking, and redirect limits. The OWASP SSRF Prevention Cheat Sheet is a useful baseline.
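As an illustration of parser-based checks, the sketch below validates a single candidate with the global `URL` class, which implements the WHATWG parser in Node and browsers, and returns a reason code on rejection. The policy shape and the reason names (`host_not_allowed`, `missing_token_param`, and so on) are assumptions chosen to match the logging advice later in this article.

```typescript
interface MagicLinkPolicy {
  allowedHosts: string[]
  allowedPaths: string[]
  requiredQueryParams: string[]
}

type RejectReason =
  | "unparseable"
  | "scheme_not_allowed"
  | "host_not_allowed"
  | "path_not_allowed"
  | "missing_token_param"

type CandidateCheck = { ok: true; url: URL } | { ok: false; reason: RejectReason }

function parseOrNull(raw: string): URL | null {
  try {
    return new URL(raw)
  } catch {
    return null
  }
}

function checkCandidate(raw: string, policy: MagicLinkPolicy): CandidateCheck {
  const url = parseOrNull(raw)
  if (!url) return { ok: false, reason: "unparseable" }
  if (url.protocol !== "https:") return { ok: false, reason: "scheme_not_allowed" }
  // Exact hostname match rejects lookalikes and userinfo tricks.
  if (!policy.allowedHosts.includes(url.hostname)) return { ok: false, reason: "host_not_allowed" }
  // Match whole path segments: "/auth/magic" allows "/auth/magic/x" but not "/auth/magic-evil".
  const pathOk = policy.allowedPaths.some(
    (p) => url.pathname === p || url.pathname.startsWith(p.endsWith("/") ? p : `${p}/`),
  )
  if (!pathOk) return { ok: false, reason: "path_not_allowed" }
  if (!policy.requiredQueryParams.some((name) => url.searchParams.has(name))) {
    return { ok: false, reason: "missing_token_param" }
  }
  return { ok: true, url }
}
```

The userinfo case shows why string checks are risky: `https://app.example.com@evil.test/` starts with the allowed host, but the parser reports its hostname as `evil.test`.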
Keep the agent interface small
A common mistake is to give the LLM a general mailbox-reading tool. That invites prompt injection and makes runs harder to reproduce. For magic link testing, expose a narrow tool contract instead.
```json
{
  "tool": "wait_for_magic_link",
  "input": {
    "inbox_id": "inbox_456",
    "attempt_id": "run_789",
    "timeout_ms": 60000,
    "allowed_hosts": ["app.example.com"],
    "allowed_paths": ["/auth/magic"]
  },
  "output": {
    "status": "found",
    "artifact": {
      "type": "magic_link",
      "url": "https://app.example.com/auth/magic?token=...",
      "message_id": "msg_123"
    }
  }
}
```
This tool does three things the agent should not improvise: it waits for the right message, validates the link, and returns only the artifact. The model can decide when to call the tool, but the tool enforces the security and reliability policy.
For higher assurance, split the flow into two tools: one that waits for and extracts the magic link, and another that consumes the link in a constrained browser or HTTP client. That second tool should enforce the same host allowlist and should not follow arbitrary redirects unless your policy permits them.
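Here is a sketch of that second tool at the HTTP level, assuming an injected `fetcher` so it stays client-agnostic. In Node 18+ the fetcher could wrap `fetch(url, { redirect: "manual" })`, which is expected to hand back the 3xx response instead of following it. That wiring, the `HopResponse` shape, and the three-hop limit are assumptions, and cookie handling, which a real sign-in needs, is left out.

```typescript
interface HopResponse {
  status: number
  location: string | null
}

type Fetcher = (url: string) => Promise<HopResponse>

// Follow redirects manually, enforcing the host allowlist on every hop and
// capping the number of redirects. Returns the final URL and its status.
async function consumeMagicLink(
  startUrl: string,
  allowedHosts: string[],
  fetcher: Fetcher,
  maxRedirects = 3,
): Promise<{ url: string; status: number }> {
  let current = new URL(startUrl)
  for (let hop = 0; hop <= maxRedirects; hop++) {
    // Checked for the first URL and for every redirect target.
    if (current.protocol !== "https:" || !allowedHosts.includes(current.hostname)) {
      throw new Error(`blocked navigation to ${current.hostname}`)
    }
    const res = await fetcher(current.href)
    const isRedirect = res.status >= 300 && res.status < 400
    if (!isRedirect) return { url: current.href, status: res.status }
    if (!res.location) throw new Error("redirect without a Location header")
    current = new URL(res.location, current) // resolves relative Location values
  }
  throw new Error(`exceeded ${maxRedirects} redirects`)
}
```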
Webhook-first waiting, polling fallback
Magic link tests often fail because the wait strategy is vague. Fixed sleeps are too short when email is delayed and too long when the message arrives instantly. Agents make this worse because they may retry early, trigger multiple emails, or process the wrong resend.
Use webhooks as the primary delivery path. A webhook lets your system react when the email arrives, verify the signed payload, dedupe the delivery, and wake the waiting test or agent. Keep the webhook handler fast: verify, persist, acknowledge, and process asynchronously.
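The handler shape can be sketched as follows. The signature scheme here, a hex HMAC-SHA256 of the raw body with a shared secret, is a common webhook convention used as an assumption. It is not Mailhook's documented format, so take the real header name and algorithm from its reference. The in-memory `Set` stands in for a persistent dedupe store, and `enqueue` stands in for persisting the payload for asynchronous processing.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto"

// ASSUMED scheme: hex HMAC-SHA256 over the raw request body.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest()
  const received = Buffer.from(signatureHex, "hex")
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected)
}

const seenDeliveries = new Set<string>() // stand-in for a persistent dedupe store

// Verify, dedupe, persist (enqueue), acknowledge. Returns the HTTP status to send.
function handleWebhook(
  rawBody: string,
  signatureHex: string,
  secret: string,
  enqueue: (payload: unknown) => void,
): number {
  if (!verifySignature(rawBody, signatureHex, secret)) return 401
  const payload = JSON.parse(rawBody) as { delivery_id?: string }
  if (!payload.delivery_id) return 400
  if (seenDeliveries.has(payload.delivery_id)) return 200 // duplicate: acknowledge only
  seenDeliveries.add(payload.delivery_id)
  enqueue(payload) // extraction and waking the waiting test happen elsewhere
  return 200
}
```

Verification runs on the raw body before JSON parsing, and duplicates are still acknowledged so the sender stops retrying.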
Polling remains useful as a fallback. If a webhook is delayed, missed, or disabled in a local environment, the agent tool can poll the inbox with a deadline. The polling loop should use a cursor or seen-message set, backoff, and a hard timeout.
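A bounded polling loop might look like this sketch. `listMessages` is a hypothetical wrapper around your provider's polling API, and `matches` applies the same matchers as the webhook path. The delay values are illustrative defaults, not recommendations.

```typescript
interface MessageSummary {
  message_id: string
  received_at: string
}

// Poll with a seen-message set, exponential backoff, and a hard deadline.
async function pollForMessage(
  listMessages: () => Promise<MessageSummary[]>,
  matches: (m: MessageSummary) => boolean,
  timeoutMs: number,
  initialDelayMs = 500,
  maxDelayMs = 5000,
): Promise<MessageSummary | null> {
  const deadline = Date.now() + timeoutMs
  const seen = new Set<string>()
  let delay = initialDelayMs
  while (Date.now() < deadline) {
    for (const message of await listMessages()) {
      if (seen.has(message.message_id)) continue // each message is evaluated once
      seen.add(message.message_id)
      if (matches(message)) return message
    }
    // Never sleep past the deadline.
    const wait = Math.min(delay, Math.max(0, deadline - Date.now()))
    await new Promise((resolve) => setTimeout(resolve, wait))
    delay = Math.min(delay * 2, maxDelayMs)
  }
  return null // typed timeout the tool can report, not an open-ended retry
}
```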
Mailhook supports both real-time webhook notifications and a polling API, so you can make webhooks the fast path without making them the only path. If you use webhooks, verify signed payloads before processing. Mailhook’s article on webhook payload authenticity covers the threat model in more detail.
Validation policy: what to check before returning the link
A magic link is a credential. Treat it like a bearer token. The test harness should apply policy before it returns the link to an agent or test runner.
| Check | Why it matters | Example rule |
|---|---|---|
| Scheme | Prevents unsafe or unexpected actions | Require `https:` |
| Host | Prevents phishing and prompt-injected links | Allow only `app.example.com` |
| Path | Distinguishes auth links from marketing links | Require `/auth/magic` or `/login/verify` |
| Query parameters | Confirms the URL looks like a login artifact | Require `token`, `code`, or `state` |
| Sender matcher | Reduces spoofed or unrelated messages | Match expected sender domain |
| Subject or template signal | Helps distinguish sign-in from welcome or alert emails | Require “Sign in” or internal template ID |
| Attempt correlation | Prevents stale links across retries | Match run ID, state, or isolated inbox |
| Consume-once key | Prevents duplicate processing | Unique on artifact hash plus attempt ID |
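The scheme, host, path, and query rows are URL checks. The sender, subject, and correlation rows apply to the message itself and can run before any URL is extracted. Here is a minimal sketch with a hypothetical `MessagePolicy` shape and check names borrowed from the table:

```typescript
interface InboundMessage {
  inbox_id: string // provider-attested
  from: string // sender-claimed: a matcher, not proof of identity
  subject: string
}

interface MessagePolicy {
  expectedInboxId: string // the isolated inbox created for this attempt
  senderDomains: string[]
  subjectIncludes: string
}

// Returns the names of failed checks; an empty array means the message
// may proceed to URL extraction.
function checkMessage(msg: InboundMessage, policy: MessagePolicy): string[] {
  const failures: string[] = []
  if (msg.inbox_id !== policy.expectedInboxId) failures.push("attempt_correlation")
  // Handles both "a@b.com" and "Name <a@b.com>".
  const domain = msg.from.slice(msg.from.lastIndexOf("@") + 1).replace(/>$/, "").toLowerCase()
  if (!policy.senderDomains.includes(domain)) failures.push("sender_matcher")
  if (!msg.subject.includes(policy.subjectIncludes)) failures.push("subject_signal")
  return failures
}
```

Exact domain matching deliberately rejects subdomains. Add them to `senderDomains` explicitly if your mail comes from one.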
The policy should fail closed. If no candidate matches, do not ask the LLM to “look again” in the raw email. Return a structured error with candidate metadata, such as host and path, but not raw secrets.
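One way to shape that error, sketched with hypothetical type names: keep the host, path, and reason for each rejected candidate, and drop query strings and fragments, where tokens usually live. If your links carry the token in the path instead, redact the path as well.

```typescript
interface RejectedCandidate {
  raw: string
  reason: string
}

interface PolicyFailure {
  status: "policy_failure"
  candidate_count: number
  candidates: { host: string; path: string; reason: string }[]
}

// Build a fail-closed result that is safe to return to an agent or log.
function toPolicyFailure(rejected: RejectedCandidate[]): PolicyFailure {
  return {
    status: "policy_failure",
    candidate_count: rejected.length,
    candidates: rejected.map(({ raw, reason }) => {
      try {
        const url = new URL(raw)
        return { host: url.hostname, path: url.pathname, reason } // no query, no fragment
      } catch {
        return { host: "", path: "", reason } // unparseable: keep only the reason
      }
    }),
  }
}
```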
Test cases that catch real magic link failures
A good magic link testing suite should intentionally exercise more than the happy path. These cases usually find the bugs that only appear in CI or agent-driven automation.
| Scenario | Expected behavior |
|---|---|
| One valid sign-in email arrives | Extract one magic link and consume it once |
| Two resend emails arrive | Select the newest valid message or enforce a resend policy |
| Duplicate webhook delivery occurs | Dedupe by delivery ID and artifact hash |
| Email template changes visual button markup | Test still passes if the URL policy is unchanged |
| Plain-text part is missing | Fall back to sanitized HTML-to-text extraction, not DOM scraping |
| Link points to an unexpected host | Reject and log a policy failure |
| Link is expired | Surface a typed expired_link or login failure state |
| Agent receives an unrelated email | Ignore it based on inbox, sender, subject, and path matchers |
| Prompt-injection text appears in the email | Do not expose the raw body to the agent |
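For the resend row, one possible policy is sketched below: among messages that already passed the matchers, use the most recently received one. This assumes your application invalidates older links when it sends a new one, and that `received_at` is a provider-attested ISO 8601 timestamp. The `ReceivedMessage` shape is hypothetical.

```typescript
interface ReceivedMessage {
  message_id: string
  received_at: string // ISO 8601, attested by the provider
  valid: boolean // result of the sender, subject, and URL policy checks
}

// Pick the newest message that passed policy, or null if none did.
function selectNewestValid(messages: ReceivedMessage[]): ReceivedMessage | null {
  let newest: ReceivedMessage | null = null
  for (const m of messages) {
    if (!m.valid) continue
    if (!newest || Date.parse(m.received_at) > Date.parse(newest.received_at)) newest = m
  }
  return newest
}
```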
The point is to test the contract you actually rely on: delivery, message matching, URL extraction, link safety, and one-time consumption. Visual template testing can exist separately, but it should not be the foundation of agent authentication tests.
Observability without leaking tokens
When magic link testing fails, you need enough detail to debug quickly. You do not need to dump the entire message body or raw link token into CI logs.
Log stable, non-secret identifiers: attempt_id, inbox_id, message_id, delivery_id, sender, subject hash, received timestamp, candidate count, selected host, selected path, and artifact hash. If the test fails because no link matched, log rejected candidate reasons such as host_not_allowed, missing_token_param, or ambiguous_candidates.
For CI artifacts, store normalized JSON with sensitive fields redacted. Keep raw email access restricted and time-limited. This is especially important for LLM pipelines, where logs may be reused for debugging, evaluation, or traces.
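A redaction helper for those CI artifacts might look like this sketch. It keeps the scheme, host, and path, replaces every query value, drops the fragment, and adds a SHA-256 hash so failures can be correlated across runs without storing the token. The function name and output fields are assumptions. The helper uses only the WHATWG `URL` API and Node's `crypto` module.

```typescript
import { createHash } from "node:crypto"

function redactForLog(rawUrl: string): { redacted_url: string; artifact_hash: string } {
  const url = new URL(rawUrl)
  // Collect keys first so the params are not mutated while iterating.
  for (const key of Array.from(new Set(url.searchParams.keys()))) {
    url.searchParams.set(key, "REDACTED")
  }
  url.hash = "" // fragments can carry tokens too
  return {
    redacted_url: url.toString(),
    artifact_hash: createHash("sha256").update(rawUrl).digest("hex"),
  }
}
```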
Where Mailhook fits
Mailhook is built for this exact style of workflow: create disposable email inboxes via API, receive emails as structured JSON, and connect delivery to agent or QA systems through webhooks or polling. For magic link testing, that means you can avoid shared mailbox state, avoid UI mailbox login, and avoid HTML scraping as the integration contract.
A practical harness, where `mailClient` is your own thin wrapper over Mailhook's inbox and delivery APIs, looks like this:
```typescript
async function runMagicLinkLoginTest(agent, appClient, mailClient) {
  const inbox = await mailClient.createDisposableInbox()

  await appClient.requestMagicLink({ email: inbox.email })

  const artifact = await mailClient.waitForMagicLink({
    inbox_id: inbox.inbox_id,
    timeout_ms: 60000,
    allowed_hosts: ["app.example.com"],
    allowed_paths: ["/auth/magic"]
  })

  await agent.openAllowedUrl(artifact.url, {
    allowed_hosts: ["app.example.com"]
  })

  await appClient.assertSignedIn()
}
```
Keep this wrapper provider-agnostic inside your test harness, but use Mailhook’s primitives for inbox creation, JSON receipt, signed webhook verification, polling fallback, shared domains, custom domains, and batch email processing where they fit your scale. The implementation contract for agents and tools is available in Mailhook’s llms.txt.
Frequently Asked Questions
Can an LLM agent click a magic link safely? Yes, if the link is extracted and validated by deterministic code first. The agent should receive only an allowed URL or a typed failure, not the raw email body.
Do I need to parse HTML emails at all? Prefer text/plain and structured JSON fields. If the sender only provides HTML, convert it to sanitized text and run the same URL policy. Avoid DOM selectors and visual assumptions.
What if the email contains multiple links? Score candidates against allowed host, path, query parameters, sender, subject, and attempt correlation. If there is no unique winner, fail closed rather than asking the agent to guess.
Should I use webhooks or polling for magic link tests? Use webhooks first for fast event delivery, then keep bounded polling as a fallback. Both paths should dedupe messages and enforce the same extraction policy.
How do I prevent retries from using stale magic links? Create one disposable inbox per attempt, correlate the sign-in request where possible, and mark each extracted artifact as consumed using an attempt ID and artifact hash.
Build magic link tests agents can trust
Magic link testing does not need a mailbox UI, brittle selectors, or an LLM reading raw HTML. Treat the email as a JSON event, extract a typed artifact, validate it under a strict URL policy, and expose only the minimum safe action to the agent.
Mailhook provides the programmable temp inbox primitives for that pattern: disposable inbox creation via API, structured JSON emails, real-time webhooks, polling fallback, signed payloads, shared domains, custom domain support, and batch processing. Start with the Mailhook integration reference, then wire your agent’s wait_for_magic_link tool around a deterministic extraction policy.