Patterns · 10 min read · Mails.ai Team

How AI Agents Should Handle Email Replies

Email replies aren't just inbound messages — they carry conversation state, threading metadata, quoted history, and sender intent signals that your agent needs to parse correctly before generating a response. Get this wrong and you get broken threads, duplicate sends, context loss, and frustrated users wondering why the bot answered a question they didn't ask.

This post covers the full reply-handling stack: MIME threading, reply detection, context reconstruction, response generation, and the loop-prevention logic every production agent needs.

Why reply handling is harder than it looks

Replying to an email looks simple. In practice, email threading is governed by RFC-defined headers — Message-ID, In-Reply-To, and References — that email clients and servers use to group messages. An agent that ignores these headers will break threads, confuse recipients, and lose conversation history. The core challenge is reconstructing what was said before so the agent's response is contextually coherent, not just technically deliverable.

Four distinct problems your agent must solve:

  1. Thread identification — which conversation does this reply belong to?
  2. Reply detection — is this a genuine reply, or a new message that just looks like one?
  3. Context reconstruction — what's the full conversation history the agent needs to reason over?
  4. Response generation — what should the agent say, and how should it format the outbound message?

Threading mechanics: Message-ID, In-Reply-To, References

The RFC 5322 threading model ties email conversations together. Every outbound message your agent sends should include a globally unique Message-ID header. When a recipient replies, their email client sets In-Reply-To to that Message-ID and appends it to the References header alongside any prior IDs in the thread.

Message-ID: <user-reply-83hx1@gmail.com>
In-Reply-To: <agent-task-4a9f2@mails.ai>
References: <agent-task-4a9f2@mails.ai>

When an inbound message arrives, extract these three headers immediately. The References chain gives you the full ordered thread ancestry. Store them in your conversation state alongside the message body.
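As a minimal sketch of that extraction step, assuming you receive the raw RFC 5322 message as a string, Python's standard email package can pull the three headers out. The function name and the returned dict keys are illustrative choices, not part of any particular provider's API.

```python
from email import message_from_string

def extract_thread_headers(raw_message: str) -> dict:
    """Pull the three threading headers out of a raw message string."""
    msg = message_from_string(raw_message)
    return {
        "message_id": (msg.get("Message-ID") or "").strip(),
        "in_reply_to": (msg.get("In-Reply-To") or "").strip(),
        # Collapse folded header lines into one space-separated string,
        # oldest ID first
        "references": " ".join((msg.get("References") or "").split()),
    }

# Hypothetical inbound reply from a user
raw = (
    "Message-ID: <user-reply-83hx1@gmail.com>\n"
    "In-Reply-To: <agent-task-4a9f2@mails.ai>\n"
    "References: <agent-task-4a9f2@mails.ai>\n"
    "Subject: Re: Your request\n"
    "\n"
    "Thanks, that works.\n"
)
headers = extract_thread_headers(raw)
```

Missing headers come back as empty strings, so downstream code can treat "no References" and "empty References" the same way.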

Constructing Message-ID correctly

A well-formed Message-ID looks like <localpart@domain>. The local part needs enough entropy to be unique — a UUID or ULID works — and can optionally encode task or session metadata for easier lookup:

import uuid

def make_message_id(task_id: str, domain: str) -> str:
    uid = uuid.uuid4().hex[:12]
    return f"<{task_id}-{uid}@{domain}>"

Never reuse Message-IDs. Duplicates cause some servers to reject or silently discard messages, and will corrupt your thread state.

Setting reply headers on outbound messages

When your agent replies, set both In-Reply-To and References:

def build_reply_headers(inbound_message: dict, task_id: str, sending_domain: str) -> dict:
    # Default to "" so a missing header never becomes the string "None"
    in_reply_to = inbound_message.get("message_id", "")
    existing_refs = inbound_message.get("references", "")

    # Append the message we're replying to onto the references chain
    references = f"{existing_refs} {in_reply_to}".strip()

    return {
        "In-Reply-To": in_reply_to,
        "References": references,
        # make_message_id is the helper defined in the previous section
        "Message-ID": make_message_id(task_id, sending_domain),
    }

Also mirror the subject with a Re: prefix if it isn't there already. Strip and normalize before prepending — never produce Re: Re: Re:.
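One way to do that normalization, as a sketch: strip every leading reply prefix, then add exactly one back. The regex below also accepts variants such as "RE :" and "Re[2]:", which is an assumption about what you will see in the wild rather than an exhaustive list.

```python
import re

# One or more leading reply prefixes: "Re:", "RE :", "re[2]:", ...
_REPLY_PREFIX = re.compile(r"^\s*(?:re\s*(?:\[\d+\])?\s*:\s*)+", re.IGNORECASE)

def reply_subject(subject: str) -> str:
    """Return the subject with exactly one 'Re: ' prefix."""
    base = _REPLY_PREFIX.sub("", subject).strip()
    return f"Re: {base}" if base else "Re:"
```

Words that merely start with "Re", such as "Regarding", are left alone because the pattern requires a colon right after the prefix.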

Reply detection: separating real replies from noise

Not every inbound message is a genuine reply from a human. Autoresponders, vacation messages, delivery status notifications (DSNs), and mailing list digests all arrive in reply-like form. Sending an agent response to any of these creates loops or wastes quota.

A reliable reply classification pipeline checks multiple signals.

Header-based signals

| Header | Meaning |
| --- | --- |
| Auto-Submitted: auto-replied | Autoresponder; do not reply |
| Auto-Submitted: auto-generated | System-generated; do not reply |
| X-Auto-Response-Suppress: All | Suppress any automated response |
| Precedence: bulk or Precedence: list | Mailing list or bulk; skip |
| Content-Type: multipart/report | DSN or MDN; do not reply |
| Return-Path: <> (empty) | Bounce envelope; do not reply |

Check these before anything else. If any match, drop the message into a no-reply bucket and stop.
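The header checks above can be collapsed into a single predicate. This is a sketch that assumes your inbound parser hands you the headers as a plain name-to-value mapping; the function name is illustrative. It treats any Auto-Submitted value other than "no" as automated, which is how RFC 3834 defines the field.

```python
def is_automated(headers: dict[str, str]) -> bool:
    """Return True if any header marks the message as machine-generated."""
    h = {name.lower(): value.strip().lower() for name, value in headers.items()}

    # Auto-Submitted may carry parameters after a semicolon; compare the keyword only
    if h.get("auto-submitted", "no").split(";")[0].strip() != "no":
        return True
    suppress = [token.strip() for token in h.get("x-auto-response-suppress", "").split(",")]
    if "all" in suppress:
        return True
    if h.get("precedence", "") in {"bulk", "list"}:
        return True
    if h.get("content-type", "").startswith("multipart/report"):
        return True
    if h.get("return-path") == "<>":
        return True
    return False
```

Run it before any body parsing so automated mail never reaches the more expensive stages.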

Body-based signals

Vacation autoresponders don't always set headers correctly. Run a lightweight text classifier or regex pass over the body for phrases like "out of office", "automatic reply", "I am currently unavailable". These aren't user intent signals — they're noise.

For agents handling high volumes, a small fine-tuned classifier on the first 200 tokens of the body outperforms regex. That said, even a well-curated blocklist catches 95%+ of autoresponders in practice.
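A regex pass of that kind might look like the sketch below. The phrase list starts from the three examples above; the "out of the office" and "auto-reply" variants and the 1000-character scan limit are assumptions you would tune against your own traffic.

```python
import re

AUTOREPLY_PHRASES = [
    r"out of (?:the )?office",
    r"automatic reply",
    r"auto-?reply",
    r"i am currently unavailable",
]
_AUTOREPLY_RE = re.compile("|".join(AUTOREPLY_PHRASES), re.IGNORECASE)

def looks_like_autoreply(subject: str, body: str, scan_chars: int = 1000) -> bool:
    """Check the subject and the start of the body for autoresponder phrasing."""
    return bool(_AUTOREPLY_RE.search(subject) or _AUTOREPLY_RE.search(body[:scan_chars]))
```

Scanning only the start of the body keeps quoted history further down from triggering a false positive.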

Context reconstruction

Once you've confirmed a message is a genuine reply, your agent needs the full conversation history to generate a useful response. Context comes from two sources: the quoted body in the reply itself, and your stored conversation state.

Quoted body extraction

Email clients include prior messages as quoted text: lines prefixed with >, an attribution line such as "On <date>, <sender> wrote:", or a block introduced by a delimiter such as -----Original Message-----. Don't pass this quoted text raw to your LLM as if it were new user input. Parse it out, structure it, and pass it as historical context separately.

A minimal quoted-body extractor:

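import re

# Common quoting patterns: ">" prefixed lines, an "On ... wrote:" attribution
# line, or an "-----Original Message-----" / "Forwarded Message" delimiter
QUOTE_PATTERN = re.compile(
    r"^(>.*|On\b.+\bwrote:\s*$|\s*-{3,}\s*(Original|Forwarded) Message.*)",
    re.MULTILINE | re.IGNORECASE
)

def split_reply_and_quoted(body: str) -> tuple[str, str]:
    # Everything before the first quote marker is treated as new text
    match = QUOTE_PATTERN.search(body)
    if match:
        new_text = body[:match.start()].strip()
        quoted_text = body[match.start():].strip()
        return new_text, quoted_text
    return body.strip(), ""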

The new_text is what the human actually wrote this turn. The quoted_text is historical context. Pass both explicitly when building your LLM prompt.

Storing and retrieving thread state

For anything beyond single-turn replies, maintain a conversation store keyed by thread ID. Derive the thread ID from the root Message-ID — the first ID in the References chain.

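def get_thread_id(message: dict) -> str:
    refs = message.get("references", "")
    if refs:
        return refs.split()[0]  # Root message ID
    # Some clients set In-Reply-To but omit References; the "in_reply_to"
    # key is assumed to be populated by your inbound parser
    return message.get("in_reply_to") or message.get("message_id", "")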

Store each thread as an ordered list of {role, content, timestamp} objects, one per message. Truncate from the oldest end if context length becomes a concern, but always preserve the first message, since it usually contains the original task or request.
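A sketch of that truncation rule, assuming a simple cap on the number of turns rather than a token count (the function name and the max_turns parameter are illustrative):

```python
def truncate_thread(thread: list[dict], max_turns: int) -> list[dict]:
    """Keep the first turn plus the most recent turns, max_turns in total."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    if len(thread) <= max_turns:
        return list(thread)
    if max_turns == 1:
        return [thread[0]]
    # First message (the original request) + the newest max_turns - 1 turns
    return [thread[0]] + thread[-(max_turns - 1):]
```

For a six-turn thread and max_turns=3, this keeps the first turn and the last two.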

Prompt design for reply generation

The prompt structure your agent uses for reply generation matters as much as the retrieval logic. Pass conversation history in a format that makes the temporal relationship explicit:

# Placeholder; replace with your agent's actual system prompt
AGENT_SYSTEM_PROMPT = "You are an email assistant. Reply in plain text."

def build_reply_prompt(thread: list[dict], new_message: str) -> list[dict]:
    messages = [
        {"role": "system", "content": AGENT_SYSTEM_PROMPT}
    ]
    # Insert thread history as alternating user/assistant turns
    for turn in thread:
        messages.append({
            "role": turn["role"],
            "content": turn["content"]
        })
    # Append the current reply
    messages.append({"role": "user", "content": new_message})
    return messages

Include explicit instructions about email formatting in your system prompt: plain text preferred for replies (HTML is fine for initial outreach, but multi-turn replies in HTML look robotic), no unnecessary preamble, consistent sign-off.

Handling attachments in replies

If a reply includes attachments, extract and describe them before passing to the LLM. For PDFs and documents, run a text extractor. For images, use a vision model or describe the attachment type and filename. Never dump raw binary into a prompt. Pass structured metadata:

attachment_context = [
    {"filename": "invoice.pdf", "type": "application/pdf", "content": extracted_text},
    {"filename": "photo.jpg", "type": "image/jpeg", "content": "[Image: product photo, 1.2MB]"}
]
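To produce a list in that shape from a parsed message, a minimal sketch using the standard library's EmailMessage API could look like this. Real PDF text extraction and image description are out of scope here, so non-text attachments get a placeholder string; the function name is illustrative.

```python
from email.message import EmailMessage

def describe_attachments(msg: EmailMessage) -> list[dict]:
    """Build structured attachment metadata without passing raw bytes along."""
    described = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True) or b""
        content_type = part.get_content_type()
        if content_type.startswith("text/"):
            charset = part.get_content_charset() or "utf-8"
            content = payload.decode(charset, errors="replace")
        else:
            # Placeholder only; plug in a PDF extractor or vision model here
            content = f"[{content_type} attachment, {len(payload)} bytes]"
        described.append({
            "filename": part.get_filename() or "unnamed",
            "type": content_type,
            "content": content,
        })
    return described
```

The input must be an EmailMessage, for example one parsed with email.policy.default, since iter_attachments is not available on the legacy Message class.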

Loop prevention

This is the most operationally critical piece. An agent that replies to its own replies, or to an autoresponder that replies back, creates an infinite loop that can exhaust your sending quota in minutes and get your domain flagged for spam.

A multi-layer loop prevention stack:

flowchart LR
    A[Inbound message] --> B[Check Auto-Submitted header]
    B -->|auto| C[Drop no-reply]
    B -->|absent| D[Check Return-Path]
    D -->|empty| C
    D -->|present| E[Check sender vs agent addresses]
    E -->|self| C
    E -->|other| F[Check reply count for thread]
    F -->|over limit| G[Escalate to human]
    F -->|under limit| H[Generate reply]

Specific loop prevention rules

  1. Never reply to your own addresses. Maintain a registry of all agent-controlled email addresses. If From matches any of them, drop immediately.
  2. Thread reply depth cap. Set a maximum reply depth — 10-15 turns is reasonable. Beyond that, escalate to a human or send a single "I'll need to hand this off" message and stop.
  3. Idempotency on Message-ID. Track every Message-ID you've processed. If the same ID arrives twice (delivery retry, duplicate webhook), skip it. Store processed IDs in Redis with a TTL of 7 days.
  4. Rate limit per thread. No more than N agent-generated replies per thread per hour. This catches subtle loops that header checks miss.

PROCESSED_KEY = "processed_message_ids"
MAX_THREAD_REPLIES = 15
REPLY_DEPTH_KEY = "thread_reply_count"
SEVEN_DAYS = 604800  # seconds

def should_process(message_id: str, thread_id: str, redis) -> bool:
    # One key per Message-ID, so each entry expires on its own 7-day TTL.
    # SET with nx=True is atomic: only the first caller gets a truthy result.
    first_seen = redis.set(f"{PROCESSED_KEY}:{message_id}", 1, nx=True, ex=SEVEN_DAYS)
    if not first_seen:
        return False  # Already handled
    depth_key = f"{REPLY_DEPTH_KEY}:{thread_id}"
    reply_count = int(redis.get(depth_key) or 0)
    if reply_count >= MAX_THREAD_REPLIES:
        return False  # Cap reached
    redis.incr(depth_key)
    return True
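The per-thread rate limit from rule 4 is not covered by the function above. A minimal in-memory sketch of a sliding-window limiter follows; the class name is hypothetical, and a multi-process deployment would need shared storage instead of a local dict.

```python
import time
from collections import defaultdict, deque
from typing import Optional

class ThreadRateLimiter:
    """Allow at most max_replies agent replies per thread per window."""

    def __init__(self, max_replies: int, window_seconds: float = 3600.0):
        self.max_replies = max_replies
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # thread_id -> send timestamps

    def allow(self, thread_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        sent = self._sent[thread_id]
        # Drop timestamps that have aged out of the window
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_replies:
            return False
        sent.append(now)
        return True
```

Call allow() immediately before sending; a False result is a good trigger for escalating to a human rather than silently dropping the reply.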

Delivery and timing considerations

Sending agent replies instantly can itself be a signal — some spam filters are suspicious of sub-second reply times. More practically, instant replies don't give humans a chance to send a follow-up that would supersede the original question.

A 30-90 second delay before sending is a reasonable production default. During that window, check whether another message arrived from the same sender in the same thread. If it did, consolidate both into a single context pass and send one reply.
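The delay-and-consolidate behavior can be sketched as a small in-memory buffer keyed by thread and sender. The class names are hypothetical, and this version keeps the original deadline when a follow-up arrives; restarting the window instead is an equally valid design choice.

```python
from dataclasses import dataclass, field

@dataclass
class PendingReply:
    thread_id: str
    sender: str
    send_at: float
    bodies: list = field(default_factory=list)

class ReplyDebouncer:
    """Hold inbound messages for a delay and merge follow-ups into one batch."""

    def __init__(self, delay_seconds: float = 60.0):
        self.delay_seconds = delay_seconds
        self._pending = {}  # (thread_id, sender) -> PendingReply

    def add(self, thread_id: str, sender: str, body: str, now: float) -> None:
        key = (thread_id, sender)
        if key in self._pending:
            # Follow-up inside the window: consolidate, keep the deadline
            self._pending[key].bodies.append(body)
        else:
            self._pending[key] = PendingReply(
                thread_id, sender, now + self.delay_seconds, [body]
            )

    def due(self, now: float) -> list:
        """Return and remove every batch whose delay has elapsed."""
        ready = [p for p in self._pending.values() if p.send_at <= now]
        for p in ready:
            del self._pending[(p.thread_id, p.sender)]
        return ready
```

A scheduler would call due() periodically and run one context pass per returned batch, using all of its bodies.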

For inbound email parsing, make sure your webhook handler ACKs the HTTP request immediately (return 200 OK) and queues the processing job asynchronously. Parsing and LLM inference inside the webhook request handler is a reliability antipattern — timeouts will cause your provider to retry delivery.
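The shape of that split, as a sketch: the handler only enqueues and returns, and a separate worker does the slow work. An in-process queue.Queue stands in for whatever durable queue you actually use, and handle_webhook and worker are illustrative names rather than any framework's API.

```python
import queue
import threading

jobs = queue.Queue()

def handle_webhook(payload: dict) -> int:
    """Enqueue the payload and ACK immediately; no parsing or inference here."""
    jobs.put(payload)
    return 200

def worker(process) -> None:
    """Drain the queue, calling process(payload) for each job."""
    while True:
        payload = jobs.get()
        try:
            if payload is None:  # Sentinel: shut the worker down
                return
            process(payload)
        finally:
            jobs.task_done()

# Example wiring: run the worker in a background thread
processed = []
worker_thread = threading.Thread(target=worker, args=(processed.append,), daemon=True)
worker_thread.start()
```

With a real broker, the worker would also need retry and dead-letter handling, which this sketch leaves out.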

Putting it together: the reply pipeline

A production agent reply pipeline in sequence:

  1. Inbound webhook fires → ACK 200 OK, enqueue job
  2. Dequeue → extract Message-ID, In-Reply-To, References, From, Auto-Submitted
  3. Loop prevention checks (headers, sender registry, idempotency, depth cap)
  4. Split reply body from quoted text
  5. Retrieve thread state from store using root Message-ID
  6. Build LLM prompt with thread history + new message
  7. Generate response
  8. Set In-Reply-To, References, Reply-To, Message-ID on outbound message
  9. Apply send delay
  10. Send and store the agent's response in thread state

Platforms like Mails.ai expose threading metadata, inbound parsing, and reply headers as first-class primitives, which removes several of these steps from custom implementation.

For agents with complex multi-step workflows, the MCP-native email interface lets the model itself decide when to reply, what headers to set, and whether to escalate — rather than hard-coding those decisions in application logic.

Frequently Asked Questions

How do I match a reply to the original task my agent was working on?

Embed a task or session identifier in the Message-ID local part when you send the original message (e.g., <task-abc123-uid@yourdomain.com>). When a reply arrives, its References header will include this ID. Parse the task identifier from the root reference and look it up in your task store. This avoids a separate thread-to-task mapping table.
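A sketch of that lookup, assuming the <task_id-uid@domain> layout produced by the make_message_id helper earlier in this post, where the unique suffix follows the last hyphen in the local part:

```python
def task_id_from_message_id(message_id: str) -> str:
    """Recover the task ID from a Message-ID shaped like <task_id-uid@domain>."""
    local_part = message_id.strip().strip("<>").split("@", 1)[0]
    # The task ID itself may contain hyphens, so split on the last one only
    task_id, separator, _uid = local_part.rpartition("-")
    return task_id if separator else ""

def task_id_for_reply(references: str) -> str:
    """Use the root (first) ID in the References chain to find the task."""
    ids = references.split()
    return task_id_from_message_id(ids[0]) if ids else ""
```

An empty result means the thread did not start with one of your agent's messages, so treat it as a new conversation.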

What should my agent do when a reply contains a completely different question from the original?

Treat it as a topic shift, not a continuation. Include a topic-shift detection step in your prompt — instruct the model to identify whether the new message continues the prior thread or introduces a new request. If it's a new request, you can handle it in the same thread (noting the context shift) or spin up a new task and acknowledge the original thread is closed.

How do I prevent my agent from replying to bounce notifications?

Bounce messages (DSNs) have Content-Type: multipart/report with a report-type=delivery-status parameter, and typically carry an empty Return-Path: <> envelope sender. Check both. Also check for Auto-Submitted: auto-replied and Auto-Submitted: auto-generated. Any one of these alone is sufficient to skip processing.

Should my agent reply in HTML or plain text?

For initial outreach, HTML gives you formatting control and link tracking. For replies in an ongoing thread, plain text is almost always better — it matches the conversational register, renders correctly in every client, and doesn't look like a marketing email. If you need to include structured data like tables or code, use plain text formatting conventions (pipes for tables, backticks for code) rather than HTML.

How do I handle threads where multiple agents or humans are in the To/CC fields?

Parse the full To, CC, and Reply-To headers. Build a participant list and check which addresses belong to your agent vs. humans vs. other systems. For multi-agent threads, you need coordination logic to prevent both agents from replying simultaneously — a simple distributed lock keyed on thread ID with a short TTL (30-60s) works. The agent that acquires the lock replies; the other backs off.
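With Redis, that lock can be a single SET with the NX and EX options. The sketch below assumes a redis-py style client passed in by the caller; the key prefix and function name are illustrative.

```python
def try_acquire_thread_lock(redis, thread_id: str, agent_id: str,
                            ttl_seconds: int = 45) -> bool:
    """Return True if this agent now holds the reply lock for the thread."""
    # nx=True: only set if the key does not exist yet
    # ex=ttl_seconds: auto-expire so a crashed agent cannot hold the lock forever
    acquired = redis.set(f"thread_lock:{thread_id}", agent_id,
                         nx=True, ex=ttl_seconds)
    return bool(acquired)
```

The agent that gets True replies; any agent that gets False backs off and can re-check the thread after the TTL expires.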

How many rounds of back-and-forth can a typical LLM context window handle?

At roughly 200-400 tokens per email turn, a 128K context window holds about 300-600 turns. In practice, you'll hit the useful reasoning limit well before the token limit: after 20-30 dense turns, summarization-based approaches (summarizing older turns and keeping full detail only for recent ones) outperform raw context stuffing. Summarize the oldest 50% of the thread when total token count exceeds 80% of your context budget.
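The 80% / 50% rule can be sketched as follows. The token estimate is a crude characters-divided-by-four heuristic standing in for a real tokenizer, and summarize is a caller-supplied function (for example, an LLM call); both are assumptions of this sketch.

```python
from typing import Callable

def estimate_tokens(text: str) -> int:
    # Rough heuristic only; swap in your model's tokenizer for real counts
    return max(1, len(text) // 4)

def compact_thread(thread: list[dict], context_budget: int,
                   summarize: Callable[[list[dict]], str]) -> list[dict]:
    """Summarize the oldest half of the thread once it passes 80% of budget."""
    total = sum(estimate_tokens(turn["content"]) for turn in thread)
    if total <= 0.8 * context_budget or len(thread) < 2:
        return list(thread)
    half = len(thread) // 2
    summary_turn = {
        "role": "system",
        "content": f"Summary of earlier turns: {summarize(thread[:half])}",
    }
    return [summary_turn] + thread[half:]
```

Recent turns stay verbatim, so the model still sees exact wording where it matters most.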
