Patterns · 10 min read · Mails.ai Team

Structured Reply Events for Email Agents

When an AI agent sends an email, the hard part isn't the send. It's what happens when the reply arrives.

Unstructured email replies are a mess: quoted text, forwarded threads, inline responses, signature blocks, out-of-office noise, and partial reformatting by every email client on earth. If your agent is waiting for a human to reply with a decision, an approval, or a piece of data — and your reply-handling pipeline can't reliably extract that signal — the whole automation breaks down.

Structured reply events solve this. The idea: treat every inbound reply as a typed event with a schema, not a raw string to be scraped. This post covers how to design that system from scratch — message threading, payload schemas, extraction strategies, idempotency, and the edge cases that will bite you in production.


What is a structured reply event?

A structured reply event is the normalized, schema-validated representation of an inbound email reply, emitted to your agent or workflow system as a discrete, actionable event.

Instead of your agent receiving a raw MIME blob and figuring out what to do with it, the reply event looks something like this:

{
  "event_id": "evt_01HX9K2T3V7MYPZ4Q8N",
  "event_type": "email.reply.received",
  "thread_id": "thread_approval_task_8821",
  "in_reply_to": "<agent-msg-8821@mail.yourapp.com>",
  "from": {
    "address": "alice@example.com",
    "name": "Alice Chen"
  },
  "received_at": "2024-11-14T09:23:41Z",
  "body": {
    "text_clean": "Looks good. Approved.",
    "quoted_stripped": true
  },
  "intent": "approval",
  "confidence": 0.94,
  "metadata": {
    "originating_task_id": "task_8821",
    "expected_reply_type": "approval_or_rejection"
  }
}

Your agent consumes this event — not the raw email. Every field is typed, every ambiguity is resolved upstream, and the agent logic stays clean.


Threading: the foundation of reply correlation

Before you can structure a reply event, you need to know which sent message the reply is responding to. Email gives you exactly this mechanism via the In-Reply-To and References headers.

When your agent sends an email, the outbound message gets a Message-ID header:

Message-ID: <agent-msg-8821.1699954800@mail.yourapp.com>

The format matters. Use a deterministic, parseable structure:

<{task_id}.{unix_timestamp}@{sending_domain}>

This means when the reply arrives with:

In-Reply-To: <agent-msg-8821.1699954800@mail.yourapp.com>
References: <agent-msg-8821.1699954800@mail.yourapp.com>

...you can extract task_id=agent-msg-8821 directly from the Message-ID without a database lookup. That's not always enough — replies sometimes drop headers — so you also want a secondary correlation mechanism.
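
As a rough sketch of that round trip, the helpers below build and parse a Message-ID in the `<{task_id}.{unix_timestamp}@{sending_domain}>` layout. The function names and the returned dict shape are illustrative assumptions, not part of any library.

```python
import re
from typing import Optional

# Hypothetical helpers; only the Message-ID layout comes from the text above.
# The greedy task_id group backtracks to the last ".<digits>@" in the value.
MESSAGE_ID_RE = re.compile(r'^<(?P<task_id>.+)\.(?P<ts>\d+)@(?P<domain>[^@>]+)>$')

def build_message_id(task_id: str, unix_timestamp: int, sending_domain: str) -> str:
    return f'<{task_id}.{unix_timestamp}@{sending_domain}>'

def parse_message_id(header_value: str) -> Optional[dict]:
    """Return task_id, timestamp and domain, or None if the value is not in this layout."""
    match = MESSAGE_ID_RE.match(header_value.strip())
    if match is None:
        return None
    return {
        'task_id': match.group('task_id'),
        'timestamp': int(match.group('ts')),
        'domain': match.group('domain'),
    }
```

A foreign Message-ID can match this layout by accident, so the caller should still compare the returned domain against its own sending domain before trusting the task ID.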

Secondary correlation: encoded context in the Reply-To address

Set a unique Reply-To address per outbound message:

Reply-To: reply+task8821+tkn_abc123@inbound.yourapp.com

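When the recipient hits reply, their client addresses the message to that Reply-To value, so it shows up as the To address on the inbound message. Your inbound email handler parses the local part of that To address to recover the task. This works even when every threading header has been stripped, which in practice makes it your most reliable correlation method.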
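
One way to build and check such an address is sketched below. It assumes the token is an HMAC of the task ID under a server-side secret, and that task IDs never contain a `+`. Both are assumptions for illustration; the example address above does not say how its token is derived.

```python
import hashlib
import hmac
from typing import Optional

SECRET = b'replace-with-a-real-secret'   # assumption: a server-side signing key
INBOUND_DOMAIN = 'inbound.yourapp.com'   # domain taken from the example address

def _token(task_id: str) -> str:
    digest = hmac.new(SECRET, task_id.encode(), hashlib.sha256).hexdigest()
    return 'tkn_' + digest[:16]

def build_reply_to(task_id: str) -> str:
    return f'reply+{task_id}+{_token(task_id)}@{INBOUND_DOMAIN}'

def parse_reply_address(to_address: str) -> Optional[str]:
    """Return the task_id if the address is well-formed and its token verifies."""
    local, _, domain = to_address.strip().rpartition('@')
    if domain.lower() != INBOUND_DOMAIN:
        return None
    parts = local.split('+')
    if len(parts) != 3 or parts[0] != 'reply':
        return None
    _, task_id, token = parts
    # Constant-time comparison, so the check does not leak how much of the token matched
    if not hmac.compare_digest(token.encode(), _token(task_id).encode()):
        return None
    return task_id
```

Signing the token means a sender cannot point a reply at another task just by editing the number in the address.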

If you're using Mails.ai's inbound email parsing, you get the full parsed MIME payload delivered to your webhook — To, In-Reply-To, and References headers as structured fields, no raw MIME parsing required.

Threading priority order

Implement correlation in this precedence:

  1. To address local part (most reliable)
  2. In-Reply-To header matched against your sent message store
  3. References header (may list multiple ancestors)
  4. Subject line matching as a last resort (unreliable — avoid relying on this)
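
The precedence above can be sketched as a single function. The two lookup callables are hypothetical stand-ins for your own address parser and sent message store, and `headers` is assumed to be a dict with lowercase keys.

```python
from typing import Callable, Optional

def correlate_reply(
    headers: dict,
    task_from_to_address: Callable[[str], Optional[str]],
    task_from_message_id: Callable[[str], Optional[str]],
) -> Optional[tuple[str, str]]:
    """Return (task_id, method) following the precedence above, or None if unroutable."""
    # 1. To address local part
    task_id = task_from_to_address(headers.get('to', ''))
    if task_id:
        return task_id, 'to_address'

    # 2. In-Reply-To matched against the sent message store
    in_reply_to = headers.get('in-reply-to', '').strip()
    if in_reply_to:
        task_id = task_from_message_id(in_reply_to)
        if task_id:
            return task_id, 'in_reply_to'

    # 3. References lists ancestors oldest first, so check the newest one first
    for reference in reversed(headers.get('references', '').split()):
        task_id = task_from_message_id(reference)
        if task_id:
            return task_id, 'references'

    # 4. Subject matching is deliberately left out of this sketch
    return None
```

Returning the method alongside the task ID lets downstream code apply stricter checks to replies that were matched by a weaker signal.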

Payload schema design

A well-designed reply event schema works for two consumers at once: your deterministic routing logic, which needs typed fields it can switch on, and your AI agent, which needs clean text it can reason about.

Core fields

interface ReplyEvent {
  // Identity
  event_id: string;          // Idempotency key
  event_type: 'email.reply.received';
  received_at: string;       // ISO 8601

  // Threading
  thread_id: string;         // Your internal thread identifier
  in_reply_to_message_id: string;  // The Message-ID you sent
  originating_task_id: string;     // Extracted from Reply-To or Message-ID

  // Sender
  from_address: string;
  from_name: string | null;
  sender_verified: boolean;  // Did the reply come from an expected sender?

  // Content
  body_clean: string;        // Reply text, quoted content stripped
  body_html: string | null;
  has_attachments: boolean;
  attachments: Attachment[];

  // Classification (added by your pipeline)
  intent: ReplyIntent | null;
  intent_confidence: number | null;

  // Context passed through from the original send
  originating_context: Record<string, unknown>;
}

type ReplyIntent =
  | 'approval'
  | 'rejection'
  | 'question'
  | 'out_of_office'
  | 'unsubscribe'
  | 'data_provision'
  | 'ambiguous';

The body_clean field is critical

Email reply bodies are polluted. A human writes two words, and the email client appends 40 lines of quoted history. Your agent should never see that quoted content — it inflates token usage, confuses extraction, and adds zero signal.

Stripping quoted reply content is non-trivial. The > prefix convention is common but not universal. Outlook uses -----Original Message----- delimiters. Apple Mail uses a different pattern. Gmail's quoted content is often in a <blockquote> in the HTML part.

A reliable approach:

import re

OUTLOOK_DELIMITER = re.compile(
    r'[-_]{5,}\s*Original Message\s*[-_]{5,}',
    re.IGNORECASE
)
GMAIL_WROTE_PATTERN = re.compile(
    r'^On .+wrote:$',
    re.MULTILINE
)

def strip_quoted_content(text: str) -> str:
    # Strip Outlook-style delimiter
    text = OUTLOOK_DELIMITER.split(text)[0]
    
    # Strip Gmail/standard "On [date] wrote:" pattern and everything after
    match = GMAIL_WROTE_PATTERN.search(text)
    if match:
        text = text[:match.start()]
    
    # Strip > quoted lines
    lines = text.split('\n')
    clean_lines = [l for l in lines if not l.strip().startswith('>')]
    
    return '\n'.join(clean_lines).strip()

For production use, libraries like talon (Python, from Mailgun) or email-reply-parser (Ruby/JS ports available) handle more edge cases than a regex approach.
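
The regex approach only covers the plain-text part. For the HTML part, where quoted content often sits in a `<blockquote>`, a minimal sketch using only the standard library might look like the following. The class and function names are made up for illustration, and it is far less thorough than a dedicated library.

```python
from html.parser import HTMLParser

class _BlockquoteStripper(HTMLParser):
    """Collects text that sits outside any <blockquote> element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._depth = 0                 # current <blockquote> nesting level
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == 'blockquote':
            self._depth += 1
        elif tag in ('br', 'p', 'div') and self._depth == 0:
            self._chunks.append('\n')   # keep line breaks between blocks

    def handle_endtag(self, tag):
        if tag == 'blockquote' and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        lines = (line.strip() for line in ''.join(self._chunks).splitlines())
        return '\n'.join(line for line in lines if line)

def strip_html_quotes(html: str) -> str:
    parser = _BlockquoteStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The attribution line ("On ... wrote:") usually sits outside the blockquote, so the result still needs the text-level patterns applied afterwards.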


Intent classification

Once you have clean reply text, you need to classify it. For simple approval workflows, a rules-based classifier is sufficient and deterministic:

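import re

APPROVAL_KEYWORDS = {'approved', 'approve', 'yes', 'go ahead', 'lgtm', 'looks good'}
REJECTION_KEYWORDS = {'rejected', 'reject', 'no', 'deny', 'denied', 'do not proceed'}
OOO_PATTERNS = re.compile(r'out of office|on vacation|auto.?reply', re.IGNORECASE)

def _phrase_regex(phrases: set[str]) -> re.Pattern:
    # Match whole words or phrases, so multi-word entries like "go ahead" work.
    # (Intersecting with a set of single words would never match them.)
    alternatives = '|'.join(re.escape(p) for p in phrases)
    return re.compile(rf'\b(?:{alternatives})\b')

APPROVAL_RE = _phrase_regex(APPROVAL_KEYWORDS)
REJECTION_RE = _phrase_regex(REJECTION_KEYWORDS)

def classify_intent(text: str) -> tuple[str, float]:
    lower = text.lower().strip()

    # Check auto-replies first: their text can contain words like "yes" or "no"
    if OOO_PATTERNS.search(lower):
        return ('out_of_office', 1.0)

    approval_match = APPROVAL_RE.search(lower)
    rejection_match = REJECTION_RE.search(lower)

    if approval_match and not rejection_match:
        return ('approval', 0.95)
    if rejection_match and not approval_match:
        return ('rejection', 0.95)

    # Both or neither matched: leave the decision to a human or a model
    return ('ambiguous', 0.0)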

For more complex intents — where the reply contains data that needs extraction, or the language is genuinely ambiguous — a small LLM call with a structured output schema makes sense:

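from openai import OpenAI
from pydantic import BaseModel

class ExtractedField(BaseModel):
    name: str
    value: str

class IntentResult(BaseModel):
    intent: str
    confidence: float
    # Structured outputs require a closed schema (no arbitrary keys), so
    # free-form data is modeled as name/value pairs rather than an open dict
    extracted_data: list[ExtractedField]
    reasoning: str

def classify_with_llm(text: str, expected_reply_type: str) -> IntentResult:
    client = OpenAI()
    prompt = f"""
Classify this email reply. The sender was expected to provide: {expected_reply_type}

Reply text:
{text}

Return the intent, confidence (0-1), any extracted data, and brief reasoning.
"""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format=IntentResult
    )
    parsed = response.choices[0].message.parsed
    if parsed is None:
        # The model refused or returned nothing that fits the schema
        raise ValueError('LLM returned no structured intent')
    return parsed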

Use gpt-4o-mini or an equivalent small model for classification. It's faster and cheaper, and you're not doing complex reasoning — just intent extraction.
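
One way to combine the two classifiers is to run the rules first and call the model only when they come back ambiguous. The sketch below takes both as plain callables so it stays independent of any particular implementation. The 0.8 threshold and the function name are assumptions to tune per workflow.

```python
from typing import Callable

Classifier = Callable[[str], tuple[str, float]]

def classify_with_fallback(
    text: str,
    rules: Classifier,
    llm: Classifier,
    min_confidence: float = 0.8,   # assumption: tune per workflow
) -> tuple[str, float, str]:
    """Return (intent, confidence, source); low-confidence results become 'ambiguous'."""
    intent, confidence = rules(text)
    if intent != 'ambiguous' and confidence >= min_confidence:
        return intent, confidence, 'rules'

    # Rules were inconclusive, so pay for the model call
    intent, confidence = llm(text)
    if confidence >= min_confidence:
        return intent, confidence, 'llm'
    return 'ambiguous', confidence, 'llm'
```

Recording which classifier produced the result makes it easier to audit decisions later and to measure how often the model is actually needed.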


Idempotency and deduplication

Email infrastructure is not exactly-once. Replies can be delivered more than once. Your webhook handler can receive duplicates from retry logic. Your agent must not approve the same request twice because the webhook fired twice.

Every reply event needs an idempotency key derived from the email itself — not a random UUID generated at receive time. The Message-ID header of the reply is the right key:

import hashlib

def generate_event_id(reply_message_id: str) -> str:
    # Normalize and hash the Message-ID so the same reply always maps to the same key
    normalized = reply_message_id.strip('<>').lower()
    return 'evt_' + hashlib.sha256(normalized.encode()).hexdigest()[:20]

Before processing any reply event, check your event store:

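def handle_inbound_webhook(payload: dict) -> Response:
    message_id = payload['headers'].get('message-id', '').strip()
    if not message_id:
        # Hashing an empty Message-ID would give every such reply the same
        # event_id and drop all but the first as "duplicates". Acknowledge it
        # and hand it to human review instead.
        return Response(status=200, body={'status': 'unroutable', 'reason': 'missing_message_id'})

    event_id = generate_event_id(message_id)

    if event_store.exists(event_id):
        # Already processed: return 200 to acknowledge without reprocessing
        return Response(status=200, body={'status': 'duplicate', 'event_id': event_id})

    event = build_reply_event(payload, event_id)

    # Store the event and enqueue it in one transaction. A unique constraint on
    # event_id is still needed to catch two concurrent deliveries that both
    # pass the exists() check above.
    with transaction():
        event_store.insert(event_id, event)
        queue.enqueue('process_reply_event', event)

    return Response(status=200, body={'status': 'accepted', 'event_id': event_id})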

Always return 200 to the webhook source, even for duplicates. A non-2xx response is usually treated as a failed delivery, and most webhook senders will retry it, so an error status on a duplicate just invites more duplicates.
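
A check followed by a separate insert can race when two deliveries arrive at the same moment. One way to close that gap is to let the database enforce uniqueness, sketched here with SQLite from the standard library. The table and function names are illustrative assumptions.

```python
import json
import sqlite3

def open_event_store(path: str = ':memory:') -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS reply_events ('
        'event_id TEXT PRIMARY KEY, payload TEXT NOT NULL)'
    )
    return conn

def record_once(conn: sqlite3.Connection, event_id: str, event: dict) -> bool:
    """Insert the event; return True if it was new, False if it was already stored."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            'INSERT OR IGNORE INTO reply_events (event_id, payload) VALUES (?, ?)',
            (event_id, json.dumps(event)),
        )
    # rowcount is 1 when the row was inserted and 0 when the primary key already existed
    return cursor.rowcount == 1
```

Only the caller that gets `True` back should enqueue the event for processing. Everyone else acknowledges and stops.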


Handling the awkward cases

Out-of-office storms

When your agent emails a distribution list, you may get dozens of OOO auto-replies. Classify and discard them immediately — don't route them to your agent. Flag the task as "awaiting reply from a human" and move on.

One subtlety: OOO messages often don't set In-Reply-To, or they set it correctly but come from a noreply address. Your classification layer should check sender address patterns alongside content.
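
A heuristic along those lines is sketched below. It also checks the Auto-Submitted header, which RFC 3834 defines for automatic responses and which the text above does not mention. The sender and content patterns are assumptions and will need extending for real traffic, and `headers` is assumed to use lowercase keys.

```python
import re

AUTO_SENDER_RE = re.compile(
    r'^(no-?reply|do-?not-?reply|mailer-daemon|postmaster)@', re.IGNORECASE
)
OOO_TEXT_RE = re.compile(
    r'out of (the )?office|on vacation|auto.?reply|automatic reply', re.IGNORECASE
)

def is_auto_reply(headers: dict, from_address: str, subject: str, body: str) -> bool:
    """Heuristic check combining a header, the sender pattern, and the content."""
    # RFC 3834: any Auto-Submitted value other than "no" marks an automatic message
    value = headers.get('auto-submitted', '').split(';')[0].strip().lower()
    if value and value != 'no':
        return True
    if AUTO_SENDER_RE.match(from_address.strip()):
        return True
    return bool(OOO_TEXT_RE.search(subject) or OOO_TEXT_RE.search(body))
```

Checking the header first is cheap and avoids depending on the wording of the auto-reply, which varies by language and mail system.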

Partial approvals and conditional replies

Humans don't always give clean yes/no answers. "Yes, but only if the budget is under $10k" is a conditional approval. Your intent schema should support this:

type ReplyIntent =
  | 'approval'
  | 'conditional_approval'  // approved with conditions
  | 'rejection'
  | 'question'
  | ...

interface ConditionalApproval {
  intent: 'conditional_approval';
  conditions: string[];  // extracted condition strings
}

Don't flatten conditional replies to ambiguous just because they're not a clean yes or no. Extract the conditions and route them to your agent with the full context.
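
For simple phrasings, a rules-based pass can pull the condition out before you fall back to a model. This is a rough sketch with an assumed list of trigger phrases. It stops at the first period or semicolon, so it will truncate conditions that contain one.

```python
import re
from typing import Optional

# Assumed trigger phrases; extend to match how your recipients actually write
CONDITION_RE = re.compile(
    r'\b(?:but only if|only if|provided that|as long as|on condition that)\b'
    r'\s*(?P<cond>[^.;\n]+)',
    re.IGNORECASE,
)
AFFIRMATIVE_RE = re.compile(r'^\s*(?:yes|approved|ok|okay|go ahead)\b', re.IGNORECASE)

def extract_conditional_approval(text: str) -> Optional[dict]:
    """Return a dict shaped like ConditionalApproval, or None if the text does not fit."""
    if not AFFIRMATIVE_RE.search(text):
        return None
    conditions = [m.group('cond').strip() for m in CONDITION_RE.finditer(text)]
    if not conditions:
        return None
    return {'intent': 'conditional_approval', 'conditions': conditions}
```

Anything this returns `None` for should go through the normal classification path instead of being treated as a plain approval.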

Reply chains vs. fresh replies

Sometimes a human forwards the original email to a colleague, who then replies. The From address changes but the In-Reply-To header still points to your original message. Your system should:

  1. Correlate the reply to the task via In-Reply-To
  2. Flag that the reply came from an unexpected sender (sender_verified: false)
  3. Route to a human review queue or apply stricter validation before acting

For sensitive workflows — financial approvals, access grants — only accept replies from the original recipient. Check from_address against your expected sender list.
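
A minimal sketch of that check, using the standard library to pull the address out of the From header. The route names are hypothetical labels for your own queues.

```python
from email.utils import parseaddr

def verify_sender(from_header: str, expected_senders: set[str]) -> bool:
    """True if the reply's From address is one of the expected recipients."""
    _, address = parseaddr(from_header)
    return address.lower() in {sender.lower() for sender in expected_senders}

def route_reply(from_header: str, expected_senders: set[str], sensitive: bool) -> str:
    """Return 'agent', 'human_review', or 'reject' (hypothetical route names)."""
    if verify_sender(from_header, expected_senders):
        return 'agent'
    # Unexpected sender: refuse outright for sensitive tasks, otherwise ask a human
    return 'reject' if sensitive else 'human_review'
```

The From header is easy to spoof, so for sensitive workflows this comparison only means something when paired with sender authentication such as a valid DKIM signature.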


Emitting events to your agent

Once you've built and validated the reply event, how does your agent consume it?

Webhook to agent endpoint — Simplest. Your inbound pipeline POSTs the structured event to an agent API endpoint. Works well if your agent is a long-running service.

Queue-based consumption — Your pipeline enqueues the event to SQS, RabbitMQ, or similar. Your agent polls or subscribes. Better for handling load spikes and enabling replay.

MCP tool invocation — If your agent is built on the Model Context Protocol, the reply event can trigger a tool call that resurfaces the task in the agent's context window. MCP-native email handling is a natural fit here — the agent registers a receive_reply tool, and the infrastructure calls it when a structured reply event arrives.

For complex agent workflows where email is one of many I/O channels, the MCP approach keeps your agent's reasoning loop clean. The agent doesn't poll for replies — replies arrive as tool invocations that re-enter the conversation at the right point in the task.


Schema versioning

Your reply event schema will evolve. Add a schema_version field from day one:

{
  "schema_version": "1.2",
  "event_type": "email.reply.received",
  ...
}

Version your schemas semantically: minor versions for non-breaking additions (new optional fields), major versions for structural changes requiring consumer updates. Store old event payloads in their original version; don't backfill or mutate. When consuming events, handle unknown fields gracefully rather than rejecting them.
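
On the consumer side, tolerant reading might look like the sketch below. It assumes the first number in `schema_version` is the one that changes when consumers must update. The field list is an illustrative subset, not the full schema.

```python
SUPPORTED_MAJOR = 1
# Assumed subset of the fields this consumer understands
KNOWN_FIELDS = {'schema_version', 'event_id', 'event_type', 'thread_id', 'body_clean', 'intent'}

def parse_schema_version(value: str) -> tuple[int, int]:
    major, _, minor = value.partition('.')
    return int(major), int(minor or 0)

def load_event(raw: dict) -> dict:
    """Keep known fields, ignore unknown ones, refuse an unsupported first component."""
    major, _ = parse_schema_version(str(raw['schema_version']))
    if major != SUPPORTED_MAJOR:
        raise ValueError(f'unsupported schema_version: {raw["schema_version"]}')
    # Unknown fields are dropped silently rather than treated as errors
    return {key: value for key, value in raw.items() if key in KNOWN_FIELDS}
```

Dropping unknown fields lets producers add optional fields without coordinating a release with every consumer.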


Frequently Asked Questions

How do I handle replies where the In-Reply-To header is missing?

Some email clients — particularly mobile ones and certain corporate mail gateways — strip threading headers. Your fallback order should be: (1) match on the Reply-To address local part you encoded when sending, (2) fuzzy-match on subject line with a Re: prefix against your sent message store, (3) if neither matches, classify as unroutable and route to a human review queue. Don't attempt to guess — an incorrectly correlated reply event can cause your agent to take an action against the wrong task.

Should I strip signatures from the reply body?

Yes, but it's harder than stripping quoted content. Signatures are typically preceded by -- (dash dash space, per RFC 3676) but many clients omit the space or use custom delimiters. For agent workflows, stripping aggressively and losing a few words of signature is better than passing a 20-line signature block to your LLM classifier. Libraries like talon have dedicated signature detection models trained on real email data.

What's the right timeout for waiting on a reply?

It depends entirely on the workflow. For urgent approvals, set a 4–8 hour timeout and escalate. For non-urgent data collection, 48–72 hours is reasonable. Always implement a timeout handler that either escalates, re-sends the request, or marks the task as stalled. Don't let tasks silently hang in a "waiting for reply" state indefinitely — instrument your reply wait queue with age-based alerts.
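
A timeout sweep can be as small as the sketch below. The task kinds, the default limit, and the shape of each task dict are assumptions. The limits are picked from the ranges mentioned above.

```python
from datetime import datetime, timedelta

# Assumed task kinds, with limits taken from the ranges suggested above
REPLY_TIMEOUTS = {
    'urgent_approval': timedelta(hours=8),
    'data_collection': timedelta(hours=72),
}
DEFAULT_TIMEOUT = timedelta(hours=48)

def find_overdue(waiting_tasks: list[dict], now: datetime) -> list[str]:
    """Return task_ids whose wait for a reply has exceeded the limit for their kind."""
    overdue = []
    for task in waiting_tasks:
        limit = REPLY_TIMEOUTS.get(task['kind'], DEFAULT_TIMEOUT)
        if now - task['sent_at'] > limit:
            overdue.append(task['task_id'])
    return overdue
```

Run it on a schedule and feed the result into whichever handler escalates, re-sends, or marks the task as stalled.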

How do I prevent replay attacks where someone re-sends an old reply email?

Two controls: (1) your idempotency check on Message-ID prevents a literal duplicate from being processed twice, and (2) implement a reply window — after N hours or days, mark the task as closed and reject any replies that arrive with In-Reply-To pointing to that task's message. Check received_at against the task's expected reply window before processing. For high-stakes workflows, also verify that the reply DKIM signature is valid for the sender's domain.
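
The reply window check could be sketched as follows, assuming timestamps are stored as ISO 8601 strings like the `received_at` field and that the task carries a closed flag. Both function names are illustrative.

```python
from datetime import datetime, timedelta

def parse_iso8601(value: str) -> datetime:
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on
    return datetime.fromisoformat(value.replace('Z', '+00:00'))

def accept_reply(sent_at: str, received_at: str, window: timedelta, task_closed: bool) -> bool:
    """True if the task is still open and the reply arrived inside the window."""
    if task_closed:
        return False
    sent = parse_iso8601(sent_at)
    received = parse_iso8601(received_at)
    # A reply stamped before the send time is rejected as well
    return sent <= received <= sent + window
```

Use the time your own server received the message rather than the Date header, since the sender controls the latter.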

Can I use this pattern for bulk email campaigns?

Structured reply events are designed for transactional, agent-initiated emails where you have a 1:1 relationship between sent message and expected reply. Bulk campaign replies are a different problem — typically handled by unsubscribe processing, bounce classification, and engagement tracking rather than reply intent extraction. This pattern is the wrong tool for that job.

How should I handle replies with attachments?

Process the attachment metadata in your event schema (filename, MIME type, size) and store the attachment content separately in object storage (S3, GCS). Include a signed URL or reference ID in the event payload — don't inline binary content in your event. Your agent can then fetch the attachment on demand. For workflows expecting attachments (e.g., "please reply with the signed contract"), classify has_attachments: true as part of the intent signal and route accordingly.
