
When an AI agent sends an email, the hard part isn't the send. It's what happens when the reply arrives.
Unstructured email replies are a mess: quoted text, forwarded threads, inline responses, signature blocks, out-of-office noise, and partial reformatting by every email client on earth. If your agent is waiting for a human to reply with a decision, an approval, or a piece of data — and your reply-handling pipeline can't reliably extract that signal — the whole automation breaks down.
Structured reply events solve this. The idea: treat every inbound reply as a typed event with a schema, not a raw string to be scraped. This post covers how to design that system from scratch — message threading, payload schemas, extraction strategies, idempotency, and the edge cases that will bite you in production.
What is a structured reply event?
A structured reply event is the normalized, schema-validated representation of an inbound email reply, emitted to your agent or workflow system as a discrete, actionable event.
Instead of your agent receiving a raw MIME blob and figuring out what to do with it, the reply event looks something like this:
{
  "event_id": "evt_01HX9K2T3V7MYPZ4Q8N",
  "event_type": "email.reply.received",
  "thread_id": "thread_approval_task_8821",
  "in_reply_to": "<agent-msg-8821.1699954800@mail.yourapp.com>",
  "from": {
    "address": "alice@example.com",
    "name": "Alice Chen"
  },
  "received_at": "2024-11-14T09:23:41Z",
  "body": {
    "text_clean": "Looks good. Approved.",
    "quoted_stripped": true
  },
  "intent": "approval",
  "confidence": 0.94,
  "metadata": {
    "originating_task_id": "task_8821",
    "expected_reply_type": "approval_or_rejection"
  }
}
Your agent consumes this event — not the raw email. Every field is typed, every ambiguity is resolved upstream, and the agent logic stays clean.
Threading: the foundation of reply correlation
Before you can structure a reply event, you need to know which sent message the reply is responding to. Email gives you exactly this mechanism via the In-Reply-To and References headers.
When your agent sends an email, the outbound message gets a Message-ID header:
Message-ID: <agent-msg-8821.1699954800@mail.yourapp.com>
The format matters. Use a deterministic, parseable structure:
<{task_id}.{unix_timestamp}@{sending_domain}>
This means when the reply arrives with:
In-Reply-To: <agent-msg-8821.1699954800@mail.yourapp.com>
References: <agent-msg-8821.1699954800@mail.yourapp.com>
...you can extract task_id=agent-msg-8821 directly from the In-Reply-To value, which echoes your Message-ID, without a database lookup. That's not always enough, because replies sometimes arrive with their threading headers dropped, so you also want a secondary correlation mechanism.
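A minimal sketch of that parse, assuming the `<{task_id}.{unix_timestamp}@{sending_domain}>` format above and task IDs that contain no dots. The names `parse_agent_message_id` and `ParsedMessageId` are hypothetical:

```python
import re
from typing import NamedTuple, Optional

# Matches <{task_id}.{unix_timestamp}@{sending_domain}>; angle brackets are optional.
MESSAGE_ID_RE = re.compile(
    r'^<?(?P<task_id>[^.@<>\s]+)\.(?P<ts>\d+)@(?P<domain>[^@<>\s]+)>?$'
)

class ParsedMessageId(NamedTuple):
    task_id: str
    sent_at: int   # unix timestamp embedded at send time
    domain: str

def parse_agent_message_id(header_value: str) -> Optional[ParsedMessageId]:
    """Return the embedded fields, or None if the ID was not minted by us."""
    m = MESSAGE_ID_RE.match(header_value.strip())
    if m is None:
        return None
    return ParsedMessageId(m['task_id'], int(m['ts']), m['domain'].lower())
```

Returning None for foreign Message-IDs lets the caller fall through to the next correlation method instead of raising.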
Secondary correlation: encoded context in the Reply-To address
Set a unique Reply-To address per outbound message:
Reply-To: reply+task8821+tkn_abc123@inbound.yourapp.com
Your inbound email handler parses the local part of the To address on the received message. Even if every threading header gets stripped, the To address is always intact. In practice, this is your most reliable correlation method.
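One way to build and parse such an address is sketched below. It assumes the `reply+{task}+{token}` layout shown above, task IDs that are lowercase and contain no `+`, and a token that is a truncated HMAC of the task ID. The signing scheme, `SECRET`, and the function names are illustrative assumptions, not something the address format itself requires:

```python
import hashlib
import hmac
from typing import Optional

INBOUND_DOMAIN = "inbound.yourapp.com"    # domain from the example above
SECRET = b"replace-with-a-real-secret"    # hypothetical signing key

def _token(task_id: str) -> str:
    digest = hmac.new(SECRET, task_id.encode(), hashlib.sha256).hexdigest()
    return "tkn_" + digest[:16]

def build_reply_to(task_id: str) -> str:
    return f"reply+{task_id}+{_token(task_id)}@{INBOUND_DOMAIN}"

def parse_reply_to(address: str) -> Optional[str]:
    """Return the task ID if the address is well-formed and its token verifies."""
    local, _, domain = address.strip().lower().rpartition("@")
    parts = local.split("+")
    if domain != INBOUND_DOMAIN or len(parts) != 3 or parts[0] != "reply":
        return None
    _, task_id, token = parts
    # Constant-time comparison; compare bytes so non-ASCII input cannot raise.
    if not hmac.compare_digest(token.encode(), _token(task_id).encode()):
        return None
    return task_id
```

Signing the token means a sender cannot redirect a reply to another task just by editing the task ID in the address.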
If you're using Mails.ai's inbound email parsing, you get the full parsed MIME payload delivered to your webhook — To, In-Reply-To, and References headers as structured fields, no raw MIME parsing required.
Threading priority order
Implement correlation in this precedence:
- To address local part (most reliable)
- In-Reply-To header matched against your sent message store
- References header (may list multiple ancestors)
- Subject line matching as a last resort (unreliable; avoid relying on this)
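The precedence above can be sketched as a single function. This assumes header names arrive lowercased in a dict and that `sent_index` maps the Message-IDs you sent to task IDs; both are assumptions about your pipeline, and `correlate` is a hypothetical name. Subject matching is left out on purpose, and token verification on the To address is omitted for brevity:

```python
import re
from email.utils import getaddresses
from typing import Mapping, Optional, Tuple

REPLY_ADDR_RE = re.compile(r'^reply\+(?P<task>[^+@]+)\+[^+@]+@', re.IGNORECASE)
MSG_ID_RE = re.compile(r'<[^<>\s]+>')

def correlate(headers: Mapping[str, str],
              sent_index: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (task_id, method), or None if the reply cannot be correlated."""
    # 1. To address local part
    for _name, addr in getaddresses([headers.get('to', '')]):
        m = REPLY_ADDR_RE.match(addr)
        if m:
            return m['task'], 'to_address'
    # 2. In-Reply-To matched against the sent message store
    for msg_id in MSG_ID_RE.findall(headers.get('in-reply-to', '')):
        if msg_id in sent_index:
            return sent_index[msg_id], 'in_reply_to'
    # 3. References lists ancestors oldest first, so check the nearest one first
    for msg_id in reversed(MSG_ID_RE.findall(headers.get('references', ''))):
        if msg_id in sent_index:
            return sent_index[msg_id], 'references'
    return None
```

Returning the method alongside the task ID makes it easy to log how often each fallback is actually used.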
Payload schema design
A well-designed reply event schema works for two consumers at once: your deterministic routing logic, which needs typed fields it can switch on, and your AI agent, which needs clean text it can reason about.
Core fields
interface ReplyEvent {
  // Identity
  event_id: string;                 // Idempotency key
  event_type: 'email.reply.received';
  received_at: string;              // ISO 8601

  // Threading
  thread_id: string;                // Your internal thread identifier
  in_reply_to_message_id: string;   // The Message-ID you sent
  originating_task_id: string;      // Extracted from Reply-To or Message-ID

  // Sender
  from_address: string;
  from_name: string | null;
  sender_verified: boolean;         // Did the reply come from an expected sender?

  // Content
  body_clean: string;               // Reply text, quoted content stripped
  body_html: string | null;
  has_attachments: boolean;
  attachments: Attachment[];

  // Classification (added by your pipeline)
  intent: ReplyIntent | null;
  intent_confidence: number | null;

  // Context passed through from the original send
  originating_context: Record<string, unknown>;
}

type ReplyIntent =
  | 'approval'
  | 'rejection'
  | 'question'
  | 'out_of_office'
  | 'unsubscribe'
  | 'data_provision'
  | 'ambiguous';
The body_clean field is critical
Email reply bodies are polluted. A human writes two words, and the email client appends 40 lines of quoted history. Your agent should never see that quoted content — it inflates token usage, confuses extraction, and adds zero signal.
Stripping quoted reply content is non-trivial. The > prefix convention is common but not universal. Outlook uses -----Original Message----- delimiters. Apple Mail uses a different pattern. Gmail's quoted content is often in a <blockquote> in the HTML part.
A reliable approach:
import re

OUTLOOK_DELIMITER = re.compile(
    r'[-_]{5,}\s*Original Message\s*[-_]{5,}',
    re.IGNORECASE
)
GMAIL_WROTE_PATTERN = re.compile(
    r'^On .+wrote:$',
    re.MULTILINE
)

def strip_quoted_content(text: str) -> str:
    # Strip Outlook-style delimiter and everything after it
    text = OUTLOOK_DELIMITER.split(text)[0]
    # Strip Gmail/standard "On [date] wrote:" pattern and everything after
    match = GMAIL_WROTE_PATTERN.search(text)
    if match:
        text = text[:match.start()]
    # Strip > quoted lines
    lines = text.split('\n')
    clean_lines = [line for line in lines if not line.strip().startswith('>')]
    return '\n'.join(clean_lines).strip()
For production use, libraries like talon (Python, from Mailgun) or email-reply-parser (Ruby/JS ports available) handle more edge cases than a regex approach.
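The regex approach above only sees the plain-text part. For the HTML part, where quoted content typically sits inside a `<blockquote>`, a rough standard-library sketch is to collect only the text outside any blockquote. The class and function names here are hypothetical, and real-world HTML mail will need more care than this:

```python
from html.parser import HTMLParser

class _QuoteStripper(HTMLParser):
    """Collects text that is not nested inside any <blockquote>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._depth = 0                 # current <blockquote> nesting level
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == 'blockquote':
            self._depth += 1
        elif tag in ('br', 'p', 'div') and self._depth == 0:
            self._chunks.append('\n')   # keep line breaks between blocks

    def handle_endtag(self, tag):
        if tag == 'blockquote' and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        return ''.join(self._chunks).strip()

def strip_html_quotes(html: str) -> str:
    parser = _QuoteStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The "On ... wrote:" attribution line usually sits just outside the blockquote, so the result can still be passed through strip_quoted_content to remove it.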
Intent classification
Once you have clean reply text, you need to classify it. For simple approval workflows, a rules-based classifier is sufficient and deterministic:
import re

APPROVAL_KEYWORDS = {'approved', 'approve', 'yes', 'go ahead', 'lgtm', 'looks good'}
REJECTION_KEYWORDS = {'rejected', 'reject', 'no', 'deny', 'denied', 'do not proceed'}
OOO_PATTERNS = re.compile(r'out of office|on vacation|auto.?reply', re.IGNORECASE)

def contains_keyword(text: str, keywords: set[str]) -> bool:
    # Match each keyword as a whole word or phrase. Intersecting a set of single
    # words with the keyword set would never match multi-word entries such as
    # 'go ahead' or 'looks good'.
    return any(re.search(r'\b' + re.escape(k) + r'\b', text) for k in keywords)

def classify_intent(text: str) -> tuple[str, float]:
    lower = text.lower().strip()
    if OOO_PATTERNS.search(lower):
        return ('out_of_office', 1.0)
    approval_match = contains_keyword(lower, APPROVAL_KEYWORDS)
    rejection_match = contains_keyword(lower, REJECTION_KEYWORDS)
    if approval_match and not rejection_match:
        return ('approval', 0.95)
    if rejection_match and not approval_match:
        return ('rejection', 0.95)
    return ('ambiguous', 0.0)
For more complex intents — where the reply contains data that needs extraction, or the language is genuinely ambiguous — a small LLM call with a structured output schema makes sense:
from openai import OpenAI
from pydantic import BaseModel

class ExtractedField(BaseModel):
    name: str
    value: str

class IntentResult(BaseModel):
    intent: str
    confidence: float
    # A list of name/value pairs rather than an open-ended dict: strict
    # structured-output schemas generally need every object's keys declared.
    extracted_data: list[ExtractedField]
    reasoning: str

def classify_with_llm(text: str, expected_reply_type: str) -> IntentResult:
    client = OpenAI()
    prompt = f"""
Classify this email reply. The sender was expected to provide: {expected_reply_type}
Reply text:
{text}
Return the intent, confidence (0-1), any extracted data, and brief reasoning.
"""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format=IntentResult
    )
    return response.choices[0].message.parsed
Use gpt-4o-mini or an equivalent small model for classification. It's faster and cheaper, and you're not doing complex reasoning — just intent extraction.
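The two classifiers combine naturally: run the rules first and call the model only when the rules cannot decide. A minimal sketch, assuming both classifiers are passed in as callables returning `(intent, confidence)`. The 0.8 threshold and the name `classify_with_fallback` are illustrative assumptions:

```python
from typing import Callable, Tuple

Classifier = Callable[[str], Tuple[str, float]]

def classify_with_fallback(text: str,
                           rules: Classifier,
                           llm: Classifier,
                           min_confidence: float = 0.8) -> Tuple[str, float, str]:
    """Return (intent, confidence, source), where source is 'rules' or 'llm'."""
    intent, confidence = rules(text)
    if intent != 'ambiguous' and confidence >= min_confidence:
        return intent, confidence, 'rules'
    # The rules could not decide, so pay for one model call.
    intent, confidence = llm(text)
    if confidence < min_confidence:
        # Low-confidence model output is treated as ambiguous rather than acted on.
        return 'ambiguous', confidence, 'llm'
    return intent, confidence, 'llm'
```

Recording the source on the event lets you audit how many decisions came from the deterministic path.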
Idempotency and deduplication
Email infrastructure is not exactly-once. Replies can be delivered more than once. Your webhook handler can receive duplicates from retry logic. Your agent must not approve the same request twice because the webhook fired twice.
Every reply event needs an idempotency key derived from the email itself — not a random UUID generated at receive time. The Message-ID header of the reply is the right key:
import hashlib

def generate_event_id(reply_message_id: str) -> str:
    # Normalize and hash the Message-ID
    normalized = reply_message_id.strip().strip('<>').lower()
    return 'evt_' + hashlib.sha256(normalized.encode()).hexdigest()[:20]
Before processing any reply event, check your event store:
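# Response, event_store, queue, transaction, and build_reply_event stand in for
# your web framework and storage layer.
def handle_inbound_webhook(payload: dict) -> Response:
    message_id = payload['headers'].get('message-id', '')
    if not message_id.strip():
        # An empty Message-ID would hash every such reply to the same event_id,
        # so later ones would be dropped as duplicates. Send these to review.
        queue.enqueue('manual_review', payload)
        return Response(status=200, body={'status': 'needs_review'})
    event_id = generate_event_id(message_id)
    if event_store.exists(event_id):
        # Already processed: return 200 to acknowledge without reprocessing
        return Response(status=200, body={'status': 'duplicate', 'event_id': event_id})
    event = build_reply_event(payload, event_id)
    # Atomic: store the event and enqueue for processing
    with transaction():
        event_store.insert(event_id, event)
        queue.enqueue('process_reply_event', event)
    return Response(status=200, body={'status': 'accepted', 'event_id': event_id})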
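Always return a 2xx status to the webhook source, even for duplicates. Many webhook senders treat any non-2xx response as a failed delivery and retry, so answering a duplicate with an error status just brings the same message back to your endpoint.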
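A separate exists-then-insert check can race when two deliveries arrive at the same moment. One way to close that gap is to let a unique constraint decide, sketched here with SQLite from the standard library. The table name and function names are hypothetical, and a production store would likely be a different database:

```python
import json
import sqlite3

def init_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reply_events ("
        "event_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )

def record_once(conn: sqlite3.Connection, event_id: str, event: dict) -> bool:
    """Insert the event; return False if this event_id was already stored."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT OR IGNORE INTO reply_events (event_id, payload) VALUES (?, ?)",
            (event_id, json.dumps(event)),
        )
        # rowcount is 1 when the row was inserted, 0 when the key already existed
        return cur.rowcount == 1
```

Only the caller that gets True goes on to enqueue the event; every other delivery is acknowledged and dropped.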
Handling the awkward cases
Out-of-office storms
When your agent emails a distribution list, you may get dozens of OOO auto-replies. Classify and discard them immediately — don't route them to your agent. Flag the task as "awaiting reply from a human" and move on.
One subtlety: OOO messages often don't set In-Reply-To, or they set it correctly but come from a noreply address. Your classification layer should check sender address patterns alongside content.
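A header-and-sender check along those lines might look like the sketch below. `Auto-Submitted` is defined by RFC 3834; the `Precedence` values and the `X-Autoreply` / `X-Autorespond` headers are non-standard conventions that some systems set, so treat them as heuristics. The function name and the address patterns are assumptions:

```python
import re
from typing import Mapping

NOREPLY_RE = re.compile(
    r'^(no-?reply|do-?not-?reply|mailer-daemon|postmaster)@', re.IGNORECASE
)

def looks_automated(headers: Mapping[str, str], from_address: str) -> bool:
    """Heuristic: True if the message appears to be machine-generated."""
    h = {k.lower(): v.strip().lower() for k, v in headers.items()}
    # RFC 3834: any Auto-Submitted value other than "no" marks an automatic message.
    auto_submitted = h.get('auto-submitted', 'no').split(';')[0].strip()
    if auto_submitted != 'no':
        return True
    # Non-standard but commonly seen markers.
    if h.get('precedence') in {'bulk', 'junk', 'auto_reply'}:
        return True
    if 'x-autoreply' in h or 'x-autorespond' in h:
        return True
    return bool(NOREPLY_RE.match(from_address.strip()))
```

Running this before content classification means most auto-replies never reach the text classifier at all.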
Partial approvals and conditional replies
Humans don't always give clean yes/no answers. "Yes, but only if the budget is under $10k" is a conditional approval. Your intent schema should support this:
type ReplyIntent =
  | 'approval'
  | 'conditional_approval'   // approved with conditions
  | 'rejection'
  | 'question'
  | ...

interface ConditionalApproval {
  intent: 'conditional_approval';
  conditions: string[];      // extracted condition strings
}
Don't flatten conditional replies to ambiguous just because they're not a clean yes or no. Extract the conditions and route them to your agent with the full context.
Reply chains vs. fresh replies
Sometimes a human forwards the original email to a colleague, who then replies. The From address changes but the In-Reply-To header still points to your original message. Your system should:
- Correlate the reply to the task via In-Reply-To
- Flag that the reply came from an unexpected sender (sender_verified: false)
- Route to a human review queue or apply stricter validation before acting
For sensitive workflows — financial approvals, access grants — only accept replies from the original recipient. Check from_address against your expected sender list.
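A minimal sketch of that check, assuming you stored the expected recipient addresses at send time. Lowercasing the whole address is a pragmatic assumption (local parts are technically case-sensitive), and the function names are hypothetical:

```python
from email.utils import parseaddr

def normalize_address(raw: str) -> str:
    # parseaddr handles both "Alice Chen <alice@example.com>" and bare addresses.
    _name, addr = parseaddr(raw)
    return addr.strip().lower()

def is_expected_sender(from_header: str, expected: set[str]) -> bool:
    """True only if the From address matches one of the expected senders."""
    sender = normalize_address(from_header)
    if not sender:
        return False
    return sender in {normalize_address(e) for e in expected}
```

The From header alone is easy to spoof, so for sensitive workflows this result is only meaningful alongside an authentication check such as DKIM.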
Emitting events to your agent
Once you've built and validated the reply event, how does your agent consume it?
Webhook to agent endpoint — Simplest. Your inbound pipeline POSTs the structured event to an agent API endpoint. Works well if your agent is a long-running service.
Queue-based consumption — Your pipeline enqueues the event to SQS, RabbitMQ, or similar. Your agent polls or subscribes. Better for handling load spikes and enabling replay.
MCP tool invocation — If your agent is built on the Model Context Protocol, the reply event can trigger a tool call that resurfaces the task in the agent's context window. MCP-native email handling is a natural fit here — the agent registers a receive_reply tool, and the infrastructure calls it when a structured reply event arrives.
For complex agent workflows where email is one of many I/O channels, the MCP approach keeps your agent's reasoning loop clean. The agent doesn't poll for replies — replies arrive as tool invocations that re-enter the conversation at the right point in the task.
Schema versioning
Your reply event schema will evolve. Add a schema_version field from day one:
{
"schema_version": "1.2",
"event_type": "email.reply.received",
...
}
Version your schemas semantically: minor versions for non-breaking additions (new optional fields), major versions for structural changes requiring consumer updates. Store old event payloads in their original version; don't backfill or mutate them. When consuming events, handle unknown fields gracefully rather than rejecting them.
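On the consumer side, this could look like the sketch below. It assumes a two-part version such as "1.2" in which the first number changes only for breaking changes. `KNOWN_FIELDS`, `SUPPORTED_MAJOR`, and `load_event` are hypothetical names:

```python
SUPPORTED_MAJOR = 1

# Fields this consumer understands; anything else is ignored, not rejected.
KNOWN_FIELDS = {
    'schema_version', 'event_id', 'event_type', 'received_at', 'thread_id',
    'in_reply_to_message_id', 'originating_task_id', 'from_address',
    'body_clean', 'intent', 'intent_confidence',
}

def load_event(payload: dict) -> dict:
    """Accept any event with a supported major version and drop unknown fields."""
    major, _, _minor = str(payload.get('schema_version', '')).partition('.')
    if not major.isdigit():
        raise ValueError('missing or malformed schema_version')
    if int(major) != SUPPORTED_MAJOR:
        raise ValueError(f'unsupported schema major version: {major}')
    return {k: v for k, v in payload.items() if k in KNOWN_FIELDS}
```

A producer can then add optional fields in a later minor version without breaking consumers that have not been updated yet.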
Frequently Asked Questions
How do I handle replies where the In-Reply-To header is missing?
Some email clients, particularly mobile ones and certain corporate mail gateways, strip threading headers. Your fallback order should be: (1) match on the local part of the per-message Reply-To address you encoded when sending, which arrives as the To address of the reply, (2) fuzzy-match on subject line with a Re: prefix against your sent message store, (3) if neither matches, classify as unroutable and route to a human review queue. Beyond that, don't guess: an incorrectly correlated reply event can cause your agent to take an action against the wrong task.
Should I strip signatures from the reply body?
Yes, but it's harder than stripping quoted content. Signatures are typically preceded by -- (dash dash space, per RFC 3676) but many clients omit the space or use custom delimiters. For agent workflows, stripping aggressively and losing a few words of signature is better than passing a 20-line signature block to your LLM classifier. Libraries like talon have dedicated signature detection models trained on real email data.
What's the right timeout for waiting on a reply?
It depends entirely on the workflow. For urgent approvals, set a 4–8 hour timeout and escalate. For non-urgent data collection, 48–72 hours is reasonable. Always implement a timeout handler that either escalates, re-sends the request, or marks the task as stalled. Don't let tasks silently hang in a "waiting for reply" state indefinitely — instrument your reply wait queue with age-based alerts.
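A timeout handler can be reduced to one decision function that a periodic sweep calls for each waiting task. The sketch below assumes one reminder before escalation and a fixed per-attempt timeout. `WaitingTask`, `next_action`, and the default values are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class WaitingTask:
    task_id: str
    sent_at: datetime        # when the request was first sent (UTC)
    timeout: timedelta       # how long to wait per attempt
    reminders_sent: int = 0

def next_action(task: WaitingTask, now: datetime, max_reminders: int = 1) -> str:
    """Return 'wait', 'resend', or 'escalate' for a task still awaiting a reply."""
    # Each reminder already sent buys one more timeout period.
    deadline = task.sent_at + task.timeout * (task.reminders_sent + 1)
    if now < deadline:
        return 'wait'
    if task.reminders_sent < max_reminders:
        return 'resend'
    return 'escalate'

# Example: an urgent approval with a 4-hour timeout, checked 5 hours after sending.
sent = datetime(2024, 11, 14, 9, 0, tzinfo=timezone.utc)
task = WaitingTask('task_8821', sent, timedelta(hours=4))
action = next_action(task, sent + timedelta(hours=5))
```

Because the function is pure, the same logic can drive both the sweep job and the age-based alerts.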
How do I prevent replay attacks where someone re-sends an old reply email?
Two controls: (1) your idempotency check on Message-ID prevents a literal duplicate from being processed twice, and (2) implement a reply window — after N hours or days, mark the task as closed and reject any replies that arrive with In-Reply-To pointing to that task's message. Check received_at against the task's expected reply window before processing. For high-stakes workflows, also verify that the reply DKIM signature is valid for the sender's domain.
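The reply-window check itself is small. A sketch, assuming ISO 8601 timestamps like the received_at field above and a window you choose per workflow. The function names are hypothetical:

```python
from datetime import datetime, timedelta

def parse_ts(value: str) -> datetime:
    # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
    return datetime.fromisoformat(value.replace('Z', '+00:00'))

def within_reply_window(sent_at: str, received_at: str, window: timedelta) -> bool:
    """True if the reply arrived after the send and before the window closed."""
    sent, received = parse_ts(sent_at), parse_ts(received_at)
    return sent <= received <= sent + window
```

Use the time your own server received the message for received_at rather than the reply's Date header, which the sender controls.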
Can I use this pattern for bulk email campaigns?
Structured reply events are designed for transactional, agent-initiated emails where you have a 1:1 relationship between sent message and expected reply. Bulk campaign replies are a different problem — typically handled by unsubscribe processing, bounce classification, and engagement tracking rather than reply intent extraction. This pattern is the wrong tool for that job.
How should I handle replies with attachments?
Process the attachment metadata in your event schema (filename, MIME type, size) and store the attachment content separately in object storage (S3, GCS). Include a signed URL or reference ID in the event payload — don't inline binary content in your event. Your agent can then fetch the attachment on demand. For workflows expecting attachments (e.g., "please reply with the signed contract"), classify has_attachments: true as part of the intent signal and route accordingly.
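A sketch of the metadata record that could travel in the event while the bytes go to object storage. The field names, the key layout, and `make_attachment_ref` are assumptions, and the upload and signed-URL steps are left out because they depend on your storage provider:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class AttachmentRef:
    filename: str
    mime_type: str
    size_bytes: int
    sha256: str
    storage_key: str   # where the bytes live in object storage

def make_attachment_ref(event_id: str, filename: str,
                        mime_type: str, content: bytes) -> AttachmentRef:
    # Keep only the final path component so a hostile filename cannot
    # escape the storage prefix.
    safe_name = filename.replace('\\', '/').rsplit('/', 1)[-1]
    if safe_name in ('', '.', '..'):
        safe_name = 'attachment'
    digest = hashlib.sha256(content).hexdigest()
    key = f'replies/{event_id}/{digest[:16]}/{safe_name}'
    return AttachmentRef(safe_name, mime_type, len(content), digest, key)
```

Including the content hash in the key keeps two different files with the same name from overwriting each other.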