Architecture · 11 min read · Mails.ai Team

Building an Email-Driven Agent Workflow

Email is one of the oldest async communication protocols, and it maps surprisingly well onto agent architectures. An email arrives, carries structured intent, triggers computation, and expects a reply — sometimes immediately, sometimes after a multi-step process that takes days. That's a workflow, not just a message.

This post covers the full engineering picture: parsing inbound email into agent-usable events, designing the state machine that drives agent behavior, dispatching to tools, maintaining thread coherence, and sending replies that actually reach the inbox.

What makes email a good agent interface

Email is async by design, carries arbitrary context in a thread, and every participant already has a client. Unlike REST APIs or chat interfaces, email requires no SDK on the user's side — the protocol is the SDK. For agents handling tasks that span hours or days (contract review, multi-party approval, expense reconciliation), email threads provide a natural audit trail and resumption point.

The tradeoffs are real, though. Email is unstructured, often ambiguous, and can arrive out of order. A production agent workflow has to handle all three.

Step 1 — Receiving email as a structured event

Inbound email parsing is the entry point. When a message arrives at your agent's address (agent@yourdomain.com), you need it delivered to your application as a structured payload — not stored in an IMAP mailbox you have to poll.

Most email infrastructure providers support webhook delivery of inbound mail. The raw MIME message gets parsed and POSTed to your endpoint as JSON. A minimal inbound payload looks like this:

{
  "message_id": "<CABk7Ry3z...@mail.gmail.com>",
  "from": { "name": "Alice Chen", "email": "alice@example.com" },
  "to": [{ "email": "agent@yourdomain.com" }],
  "subject": "Review this contract draft",
  "text": "Hi, attached is the MSA. Can you flag any indemnification clauses?",
  "html": "<p>Hi, attached is the MSA...</p>",
  "attachments": [
    { "filename": "msa_draft.pdf", "content_type": "application/pdf", "url": "https://..." }
  ],
  "headers": {
    "In-Reply-To": null,
    "References": null,
    "Message-ID": "<CABk7Ry3z...@mail.gmail.com>"
  },
  "timestamp": "2025-01-15T14:23:01Z"
}

The In-Reply-To and References headers are critical — they're how you know whether this is a new conversation or part of an existing thread. A null In-Reply-To means a new workflow should be created. A populated one means you're resuming.

Webhook security

Verify every inbound webhook. Most providers sign the payload with an HMAC-SHA256 signature in a request header. Validate it before processing:

import hmac
import hashlib

def verify_webhook(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature)

Skip this, and anyone who discovers your endpoint can inject arbitrary tasks into your agent.
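One detail that is easy to get wrong: the signature covers the raw request bytes, not a re-serialized version of the parsed JSON. The sketch below illustrates that with a made-up secret and body. The `sign` helper is hypothetical and only mimics what a provider would do on its side; the exact header name and encoding vary by provider.

```python
import hashlib
import hmac


def sign(payload: bytes, secret: str) -> str:
    """Hex HMAC-SHA256 of the raw body, as a provider might compute it."""
    return hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()


def verify_webhook(payload: bytes, signature: str, secret: str) -> bool:
    # Same check as above, repeated so this snippet runs on its own.
    return hmac.compare_digest(sign(payload, secret), signature)


secret = "test-secret"  # hypothetical value for illustration
raw_body = b'{"subject": "Review this contract draft"}'
signature = sign(raw_body, secret)

# The untouched raw body verifies.
valid = verify_webhook(raw_body, signature, secret)

# Any change to the bytes, including re-serialising the JSON, breaks the match.
tampered = verify_webhook(
    raw_body.replace(b"contract", b"invoice"), signature, secret
)
```

In practice this means reading the request body as bytes before any JSON parsing middleware touches it.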

Step 2 — Thread identity and state management

Email threads are identified by the References header chain, not by subject line. Two messages with the same subject but different Message-ID trees are separate conversations. Your state store needs to key on thread identity, not subject.

A practical approach: extract the root Message-ID from the References header and use it as your workflow ID.

def extract_thread_id(headers: dict) -> str:
    references = headers.get("References", "")
    message_id = headers.get("Message-ID", "")
    
    if references:
        # First ID in the References chain is the root
        return references.strip().split()[0]
    else:
        # This is the first message; it becomes the root
        return message_id

Store your workflow state keyed by this thread ID:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class WorkflowState:
    thread_id: str
    # One of: "created", "planning", "executing", "awaiting_human",
    # "composing", "complete", "failed" (see Step 3)
    status: str
    messages: list[dict]  # Full thread history
    tool_calls: list[dict]  # What the agent has done
    created_at: datetime
    updated_at: datetime
    metadata: dict  # Task-specific data

Use a durable store — Redis with persistence, Postgres, or DynamoDB. In-memory state dies with your process, and email workflows often span multiple deployments.

Step 3 — Designing the agent state machine

An email-driven agent isn't a simple request-response handler. It's a state machine where each inbound message is an event that drives transitions.

flowchart LR
    A[New Email Received] --> B[Parse and Classify]
    B --> C{Thread Exists}
    C -->|No| D[Create Workflow]
    C -->|Yes| E[Resume Workflow]
    D --> F[Initial LLM Planning]
    E --> G[Append Message to Context]
    F --> H[Tool Dispatch]
    G --> H
    H --> I{Tools Complete}
    I -->|Yes| J[Generate Reply]
    I -->|Need Human Input| K[Send Clarification Email]
    K --> L[State = Awaiting Human]
    L --> A
    J --> M[Send Reply and Mark Complete]

The state machine has a few key states:

  • created — thread ID assigned, initial message stored
  • planning — LLM deciding what tools to run
  • executing — tools running (possibly async, possibly long-running)
  • awaiting_human — agent sent a question, waiting for reply
  • composing — LLM generating the final response
  • complete — reply sent, workflow done
  • failed — something went wrong; needs intervention

The awaiting_human state is what separates email agents from chatbots. The agent sends an email, halts, and picks up days later when the user replies. Your state machine has to survive that gap.
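One way to keep those states honest is to write the allowed transitions down as data and reject everything else. The transition table below is an assumption inferred from the flowchart and state list above, not a definitive specification; adjust it to your own workflow.

```python
# Which states each state may move to. Inferred from the diagram above.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "created": {"planning", "failed"},
    "planning": {"executing", "awaiting_human", "failed"},
    "executing": {"composing", "awaiting_human", "failed"},
    # A human reply re-enters planning with the new message in context.
    "awaiting_human": {"planning", "failed"},
    "composing": {"complete", "failed"},
    # A follow-up email on a finished thread reopens it.
    "complete": {"planning"},
    # Terminal until someone intervenes manually.
    "failed": set(),
}


class InvalidTransition(Exception):
    pass


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target}")
    return target
```

Because the table is plain data, it can be persisted or logged alongside the workflow, which helps when debugging a thread that resumed days after it stalled.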

Step 4 — Email classification before agent dispatch

Not every inbound email should trigger the same agent behavior. A production system routes messages to different handlers based on intent. Email classification at the entry point keeps your orchestrator from having to handle every edge case.

Classify on at least these dimensions:

Dimension      | Examples                                  | Why it matters
Intent         | request, reply, acknowledgment, complaint | Drives workflow type
Priority       | urgent, normal, low                       | Affects response SLA
Entity type    | contract, invoice, support ticket         | Routes to correct agent
Requires human | yes/no                                    | Escalation logic

A classifier LLM call with a structured output schema:

from typing import Literal

from pydantic import BaseModel

class EmailClassification(BaseModel):
    intent: Literal["new_request", "reply", "acknowledgment", "complaint", "spam"]
    entity_type: Literal["contract", "invoice", "support", "general"] | None
    priority: Literal["high", "normal", "low"]
    requires_human_review: bool
    confidence: float  # Model's self-reported confidence, 0.0 to 1.0

# `llm`, `router`, and `escalate_to_human` stand in for your own LLM client,
# handler registry, and escalation hook.
def classify_and_route(thread_id: str, email_payload: dict) -> None:
    subject = email_payload["subject"]
    body = email_payload["text"]

    # Pass email body + subject to LLM with structured output
    classification = llm.structured_output(
        prompt=f"Classify this email:\nSubject: {subject}\nBody: {body[:2000]}",
        schema=EmailClassification
    )

    if classification.intent == "spam" or classification.confidence < 0.6:
        return  # Drop or quarantine

    if classification.requires_human_review:
        escalate_to_human(thread_id, classification)
        return

    # Route to appropriate handler
    router.dispatch(classification.entity_type, thread_id, email_payload)

Step 5 — Tool dispatch and async execution

Once the agent has classified and planned, it runs tools. Email workflows fit naturally with long-running async tools — PDF extraction, database lookups, API calls to third-party systems.

The key engineering decision: synchronous vs asynchronous tool execution.

For tools that finish in under 30 seconds, execute inline before replying. For longer operations, push to a job queue (Celery, BullMQ, or a cloud task queue), store the job IDs in WorkflowState.tool_calls, and poll or use callbacks to resume.

async def execute_tools(plan: list[ToolCall], state: WorkflowState) -> list[ToolResult]:
    results = []
    for call in plan:
        if call.estimated_duration_seconds < 30:
            result = await tool_registry.execute(call)
            results.append(result)
        else:
            job_id = await queue.enqueue(
                tool=call.tool_name,
                args=call.arguments,
                callback_url=f"/workflows/{state.thread_id}/tool-complete"
            )
            state.tool_calls.append({"call": call, "job_id": job_id, "status": "pending"})
    
    await state_store.save(state)
    return results

When an async tool completes and POSTs to your callback, check whether all pending tools are done, then proceed to the compose step.
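The "are all pending tools done?" check can be sketched as a small pure function over the `tool_calls` list. It assumes each entry has the `job_id` and `status` keys written by `execute_tools` above; the `record_tool_result` name and the `result` key are illustrative.

```python
def record_tool_result(tool_calls: list[dict], job_id: str, result: dict) -> bool:
    """Mark one job complete; return True when no job is still pending."""
    for entry in tool_calls:
        if entry["job_id"] == job_id:
            # Overwriting is harmless, so a duplicate callback is safe.
            entry["status"] = "complete"
            entry["result"] = result
            break
    else:
        # The loop finished without finding the job.
        raise KeyError(f"unknown job_id: {job_id}")
    return all(entry["status"] == "complete" for entry in tool_calls)
```

A callback handler would load the workflow state, call this function, save the state, and move to the compose step only when it returns True.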

Step 6 — Maintaining thread coherence in replies

This is where most implementations break. To maintain a proper email thread, your reply must set the right headers. Miss them, and the agent's reply starts a new thread in the recipient's email client instead of continuing the conversation.

Required headers on outbound replies:

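def build_reply_headers(
    original_message_id: str,
    original_references: str | None,
    original_subject: str,
) -> dict:
    # Append the message being answered to the existing References chain
    references_chain = original_references or ""
    if original_message_id not in references_chain.split():
        references_chain = f"{references_chain} {original_message_id}".strip()

    # Prefix "Re:" only if the subject doesn't already carry it
    subject = original_subject
    if not subject.lower().startswith("re:"):
        subject = f"Re: {subject}"

    return {
        "In-Reply-To": original_message_id,
        "References": references_chain,
        "Subject": subject,
    }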

Also preserve the To/CC structure of the original thread. If the original was sent to agent@yourdomain.com and CC'd two humans, your reply should typically CC those same humans. Strip email addresses that belong to your own agent infrastructure.
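A minimal sketch of that reply-all logic, assuming addresses have already been extracted as plain strings. The `reply_recipients` function and its parameters are hypothetical names for illustration.

```python
def reply_recipients(
    sender: str,
    to: list[str],
    cc: list[str],
    agent_addresses: set[str],
) -> tuple[list[str], list[str]]:
    """Return (to, cc) for a reply-all that excludes the agent's own addresses."""
    own = {address.lower() for address in agent_addresses}
    # The original sender becomes the primary recipient.
    reply_to = [sender]
    seen = {sender.lower()} | own
    reply_cc = []
    # Everyone else on the original To/CC lines is CC'd once, in order.
    for address in to + cc:
        key = address.lower()
        if key not in seen:
            seen.add(key)
            reply_cc.append(address)
    return reply_to, reply_cc
```

Comparing lowercased addresses avoids CC'ing the same person twice when clients disagree on capitalization.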

Step 7 — Deliverability for automated senders

Agents send a lot of email. If your agent handles 50 support tickets a day, that's 50+ outbound messages from a single address. From a new domain or IP with no sending history, even that volume can draw scrutiny from spam filters. Agent-generated text can also have patterns (consistent sentence structure, low variation) that content-based filters may flag. Together, those add up to a deliverability problem.

The foundations:

Authentication: SPF, DKIM, and DMARC are non-negotiable. SPF authorizes your sending IPs, DKIM signs each message cryptographically, and DMARC tells receiving servers what to do when a message passes neither check with a domain aligned to the From address. Without all three in place, your deliverability is unpredictable.

Dedicated sending IP: Shared IPs mean shared reputation. If another tenant on your email provider's shared pool gets blocklisted, your agent's mail suffers. A dedicated IP isolates your reputation, so it depends only on your own sending behavior.

Warm-up: New IPs start with no reputation. Send too much too fast and mailbox providers throttle or block you. Ramp from ~50/day to full volume over 4-6 weeks.

Content: Agents should vary reply phrasing, avoid all-uppercase words, and keep image-to-text ratios reasonable. A reply that's 90% boilerplate will pattern-match to bulk mail.
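To make the warm-up ramp concrete, here is one possible schedule: a daily cap that grows geometrically from the starting volume to the target. The geometric shape, the 5,000/day target, and the 35-day ramp are assumptions chosen for illustration; your provider may recommend a different curve.

```python
def warmup_daily_cap(
    day: int,
    start: int = 50,
    target: int = 5000,
    ramp_days: int = 35,
) -> int:
    """Send cap for `day` (0-based), growing geometrically from start to target."""
    if day >= ramp_days:
        return target
    # Constant daily growth factor that reaches `target` on day `ramp_days`.
    growth = (target / start) ** (1 / ramp_days)
    return min(target, round(start * growth**day))
```

An outbound queue can consult this cap each day and defer anything over the limit, which keeps a sudden burst of agent replies from undoing weeks of reputation building.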

Mails.ai's sender reputation tooling surfaces these signals in real time, so you know when deliverability degrades before users start reporting missing emails.

Step 8 — Idempotency and failure handling

Email infrastructure can deliver duplicates. Your webhook endpoint might receive the same inbound message twice after a network retry. Your agent must be idempotent.

Key mechanisms:

  1. Deduplicate on Message-ID — Store processed Message-IDs. On receipt, check before processing:

    # SET with NX and EX is atomic: it records the ID and gives each
    # message its own 7-day TTL, so concurrent retries can't both pass.
    is_new = await redis.set(f"processed:{message_id}", "1", nx=True, ex=86400 * 7)
    if not is_new:
        return 200  # Already handled
    
  2. Idempotent tool calls — Use deterministic IDs for external API calls (e.g., f"{thread_id}:{tool_name}:{input_hash}") so retries don't double-book, double-charge, or double-send.

  3. Dead letter queue — Workflows that fail repeatedly should move to a dead letter queue with full context preserved, not get silently dropped. A human review step then decides whether to retry or escalate.
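The deterministic ID from item 2 can be sketched as follows. The `idempotency_key` helper is a hypothetical name, and truncating the hash to 16 hex characters is an arbitrary choice for readability.

```python
import hashlib
import json


def idempotency_key(thread_id: str, tool_name: str, arguments: dict) -> str:
    """Build a stable key so a retried tool call maps to the same ID."""
    # sort_keys makes the hash independent of dict insertion order.
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    input_hash = hashlib.sha256(canonical.encode()).hexdigest()[:16]
    return f"{thread_id}:{tool_name}:{input_hash}"
```

Pass the key to any external API that supports idempotency keys, or store it yourself and skip the call when the key has already been seen.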

Putting it together: a minimal end-to-end flow

Here's the full request lifecycle for a contract-review agent:

  1. Alice emails contracts@yourdomain.com with a PDF attachment
  2. Inbound webhook fires at POST /inbound
  3. Signature validated, Message-ID deduplicated
  4. Thread ID extracted from Message-ID (null References → new thread)
  5. Workflow created in state store, status = created
  6. Classifier LLM call → {intent: "new_request", entity_type: "contract", priority: "normal"}
  7. Planner LLM call → [{tool: "extract_pdf", args: {url: attachment_url}}, {tool: "analyze_clauses", args: {...}}]
  8. Both tools execute async, job IDs stored in state
  9. Tool callbacks fire over the next 45 seconds as PDF extraction and clause analysis complete
  10. All tools complete → status = composing
  11. Composer LLM call with full tool results → reply text generated
  12. Reply sent with correct In-Reply-To and References headers via email API
  13. Workflow status = complete

Alice replies with questions → In-Reply-To header matches the agent's reply, same thread ID resolved → workflow resumes at step 6 with full context.

Frequently Asked Questions

How do you handle emails where the References header is missing or malformed?

Some email clients, especially older mobile clients, drop References headers. Fall back to subject-line matching with a time window: if you receive a message with no References but the subject matches an active workflow's subject (after stripping Re: / Fwd: prefixes) within the last 72 hours, treat it as a continuation. Log it as a header-less match for debugging. This isn't perfect, since unrelated messages that share a subject can be merged into one workflow, so treat these matches as lower confidence than header-based ones.

Should the agent reply from the same address it receives email on, or a different one?

Same address is simpler and maintains thread coherence — the From on the reply matches the To on the original. Use a different reply address only if you have a specific reason (e.g., routing replies to a different handler). If you do, make sure your DKIM signing domain and SPF records cover the reply-from address, and set Reply-To explicitly on the original outbound.

How do you prevent the agent from creating infinite reply loops with other automated systems?

Check for loop indicators before processing: an Auto-Submitted: auto-generated or Auto-Submitted: auto-replied header signals an automated sender. Also check for Precedence: bulk or Precedence: list. Maintain a counter in your workflow state: if you've sent more than N replies without a human-initiated message in the thread, halt and escalate. Three is a reasonable default.

What's the right way to handle attachments in agent workflows?

Never store raw attachment content in your workflow state — PDFs and images bloat your state store fast. On inbound, download attachments immediately (webhook URLs often expire in minutes), upload to object storage (S3, GCS), and store only the stable object URL in your state. Pass that URL to tool calls. For outbound attachments, same pattern: generate or retrieve the file, upload to storage, and pass the URL to your email send API.

How do you handle multi-party threads where multiple humans email the agent?

The thread ID remains stable regardless of who sends the message. What changes is that your agent needs to understand who in the thread has authority for what. Store the original requester's email separately from subsequent contributors. When a third party joins the thread (detected by a new From address not seen in prior messages), classify their intent independently — they might be providing information, or they might be trying to override the original request. Emit an event to your workflow engine and let your business logic decide how to handle it.

Can this workflow architecture handle high volume — thousands of emails per day?

Yes, but you need to architect for it. The inbound webhook endpoint should be stateless and respond in under 200ms (do all heavy lifting async). Use a job queue with multiple workers for tool execution. Partition your state store by thread ID for horizontal scaling. The main bottleneck at scale is usually LLM latency and rate limits — budget your token usage and use batching or caching where classification patterns repeat.
