
AI agents fail. Networks drop. Retries happen. Without idempotency built into your email sending layer, those failures produce duplicate emails — a confirmation sent three times, an invoice delivered twice, a password reset that confuses the recipient. For human-operated apps this is annoying. For automated agents firing at scale, it's a trust-destroying, deliverability-killing problem.
This post walks through the mechanisms, data structures, and failure modes you need to design an idempotent email pipeline — one where sending the same logical message twice produces exactly one delivery.
What idempotency means for email
Idempotency means that repeating the same operation produces the same outcome as doing it once. For email, the outcome you want to guarantee is: one logical message results in exactly one delivery to the recipient's inbox. The challenge is that SMTP has no native deduplication: submit the same message twice and it is delivered twice. You have to build deduplication above the transport layer.
The key insight is that idempotency in email isn't about the SMTP session — it's about the intent to send. You need to capture that intent, assign it a stable identifier, and check that identifier before every send attempt. If it's been seen before and the send succeeded, skip. If it failed, retry safely. If it's in-flight, wait or deduplicate.
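The three outcomes above can be written down as a small decision function. This is a minimal sketch, assuming the store can report one of three states for a key; the `SendState` enum and the `next_action` name are illustrative and not part of any library.

```python
from enum import Enum
from typing import Optional


class SendState(Enum):
    """Recorded state of one logical send intent (names are illustrative)."""
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    IN_FLIGHT = "in_flight"


def next_action(previous: Optional[SendState]) -> str:
    """Map what the store knows about an intent to what the caller should do."""
    if previous is None:
        return "send"   # never seen: this is the first attempt
    if previous is SendState.SUCCEEDED:
        return "skip"   # already delivered once
    if previous is SendState.FAILED:
        return "retry"  # nothing went out, so trying again is safe
    return "wait"       # another attempt is in progress
```

The rest of this post is about making that lookup atomic and durable.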
The idempotency key: your deduplication primitive
An idempotency key is a stable, deterministic string that uniquely identifies one logical send operation. When your agent decides "send order confirmation to user@example.com for order #4821," that decision should produce exactly one key — regardless of how many times the agent code runs.
How to construct a good key
Don't use random UUIDs generated at send time. They're different on every retry, which defeats the purpose. Instead, derive keys from the semantic content of the intent:

    import hashlib
    import json


    def make_idempotency_key(event_type: str, entity_id: str, recipient: str, version: int = 1) -> str:
        payload = json.dumps({
            "type": event_type,
            "entity": entity_id,
            "to": recipient,
            "v": version
        }, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()


    # Example: order confirmation
    key = make_idempotency_key(
        event_type="order.confirmation",
        entity_id="order_4821",
        recipient="user@example.com"
    )
    # => stable SHA-256 regardless of when or how many times this runs

The version field matters. If you intentionally want to resend — say the order was refunded and reprocessed — bump the version. That produces a new key, which is correct behavior: it's a new logical intent.
Key scope tradeoffs
| Scope | Example key components | Risk |
|---|---|---|
| Too narrow | order_id + recipient | Doesn't distinguish email type; suppresses legitimate resends |
| Too broad | recipient + timestamp | Collisions across distinct intents; doesn't deduplicate |
| Right | event_type + entity_id + recipient + version | Unique per logical send intent, stable across retries |
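To make the "too narrow" row concrete, here is a small sketch that hashes two different emails for the same order. The `key_from` helper and the `order.shipped` event type are assumptions made for illustration; only `order.confirmation` appears in the examples above.

```python
import hashlib
import json


def key_from(**components: str) -> str:
    """Hash a set of named components into a stable key."""
    payload = json.dumps(components, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


order, recipient = "order_4821", "user@example.com"

# Too narrow: both emails for this order hash to the same key,
# so the second one would be suppressed as a duplicate.
narrow_confirmation = key_from(entity=order, to=recipient)
narrow_shipping = key_from(entity=order, to=recipient)

# Right scope: the event type separates the two intents.
scoped_confirmation = key_from(type="order.confirmation", entity=order, to=recipient, v="1")
scoped_shipping = key_from(type="order.shipped", entity=order, to=recipient, v="1")
```

With the narrow scope the shipping notice never goes out; with the full scope each intent gets its own key while retries of either one still collapse.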
The deduplication store
The key is only useful if you check it somewhere persistent before sending. That store needs to be fast (you're checking on every send attempt, potentially thousands per minute), atomic (two concurrent retries must not both pass the check), durable (a process restart can't clear it), and TTL-aware (keys for old intents can expire).
Redis with atomic SET NX EX (set if not exists, with expiry) is the standard choice:

    import redis

    r = redis.Redis(host='localhost', port=6379, decode_responses=True)


    def attempt_send(idempotency_key: str, send_fn, ttl_seconds: int = 86400 * 7) -> dict:
        """
        Returns {"status": "sent", "message_id": ...},
        {"status": "duplicate", "original_message_id": ...},
        or {"status": "in_flight"}.
        """
        lock_key = f"email:lock:{idempotency_key}"
        result_key = f"email:result:{idempotency_key}"
        # Atomic claim: only one caller wins this
        claimed = r.set(lock_key, "claimed", nx=True, ex=ttl_seconds)
        if not claimed:
            # Already sent or in-flight
            stored = r.get(result_key)
            if stored:
                return {"status": "duplicate", "original_message_id": stored}
            # In-flight: another process is sending right now
            return {"status": "in_flight"}
        # We won the claim, so actually send
        try:
            message_id = send_fn()
            r.set(result_key, message_id, ex=ttl_seconds)
            return {"status": "sent", "message_id": message_id}
        except Exception:
            # Release lock on failure so a retry can try again
            r.delete(lock_key)
            raise

Note the failure handling: if send_fn() throws, you delete the lock key. This allows a future retry to reclaim it. Leave the lock in place after a failure and you permanently suppress a send that never happened — the opposite of what you want.
What to store as the result
Store the Message-ID returned by your email provider, not just a boolean. The Message-ID (format: <unique-string@domain>) is the header that mail clients use, together with In-Reply-To and References, to thread messages, and it is what your logs reference for debugging. Storing it means that when a retry hits the deduplication check, you can return the original Message-ID, which is useful for audit trails and for connecting downstream events like opens and clicks back to the original intent.
Failure modes and how to handle each
Idempotent systems fail in specific, predictable ways.
1. Provider accepted but returned error
Some providers accept a message into their queue, but your code still sees a failure: the HTTP response comes back as a 5xx, or a network timeout drops it after the message was enqueued. Your code retries, but the provider already has the message.
Defense: Use provider-level idempotency keys. Some email APIs accept an Idempotency-Key or equivalent header. Pass your derived key there. The provider will deduplicate on their side if you retry with the same key within their window (often 24 hours).

    import httpx


    def send_via_api(idempotency_key: str, payload: dict) -> str:
        response = httpx.post(
            "https://api.email-provider.com/send",
            json=payload,
            headers={
                "Authorization": "Bearer ...",
                "Idempotency-Key": idempotency_key
            },
            timeout=10.0
        )
        response.raise_for_status()
        return response.json()["message_id"]

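The provider-side key only helps if every retry reuses it. The following is a rough sketch of a retry wrapper that does so, assuming a `post` callable that takes the key and returns a message ID; `TransientSendError` and `send_with_retries` are hypothetical names, and which failures count as transient depends on your provider.

```python
import time
from typing import Callable


class TransientSendError(Exception):
    """Hypothetical error for timeouts and 5xx responses worth retrying."""


def send_with_retries(
    idempotency_key: str,
    post: Callable[[str], str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Call post(key) until it succeeds, passing the same key every time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return post(idempotency_key)
        except TransientSendError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Because the key is an argument rather than something generated inside the loop, a retry after a lost response reaches the provider with the identifier it has already seen.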
2. Agent restarts mid-workflow
An agent runs a multi-step workflow: fetch context → compose email → send → update database. The process crashes after send but before the database update. On restart, the application database has no record of the send, so an agent that consults only the database re-derives the same idempotency key and sends again. If the crash lands between the send and the result write, the lock exists with no stored result, and callers see in_flight until the lock expires.
Defense: Treat the deduplication store as the source of truth for send status — not your application database. Check the store first on any resume path, not just on first attempt.
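One way to picture that resume path is the sketch below. It assumes the result key layout `email:result:<key>` used earlier, a store object with a `get` method (a Redis client created with `decode_responses=True` or a plain dict both fit), and two hypothetical callables: `send`, which performs the guarded send, and `mark_sent_in_db`, which updates your application database.

```python
from typing import Callable, Optional, Protocol


class ResultStore(Protocol):
    def get(self, key: str) -> Optional[str]: ...


def resume_send_step(
    store: ResultStore,
    idempotency_key: str,
    send: Callable[[], str],
    mark_sent_in_db: Callable[[str], None],
) -> str:
    """Finish the 'send then update database' step after a restart."""
    # Ask the deduplication store first; the database may be stale.
    message_id = store.get(f"email:result:{idempotency_key}")
    if message_id is None:
        message_id = send()       # no recorded result: go through the guarded send
    mark_sent_in_db(message_id)   # bring the database back in line
    return message_id
```

The database update becomes a repair step that can run any number of times, while the decision to send rests on the store alone.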
3. Distributed agents racing
Two agent instances independently decide to send the same email (e.g., both pick up the same job from a queue that delivers at-least-once). Both derive the same key and hit the store at the same millisecond.
Defense: The SET NX atomicity in Redis handles this. Only one caller gets true from SET NX; the other gets nil and then either reads the stored result or, if the winner has not finished yet, sees the in-flight state. No lock striping needed unless your throughput requires it (>50k keys/sec on a single Redis node).
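The race is easy to reproduce without Redis. The sketch below uses a toy in-memory store whose `set_nx` is made atomic with a mutex, which stands in for the atomicity Redis provides on the server; `InMemoryStore` and `race` are illustrative names, not a real client API.

```python
import threading


class InMemoryStore:
    """Toy stand-in for Redis: set_nx is atomic because of the mutex."""

    def __init__(self) -> None:
        self._data = {}
        self._mutex = threading.Lock()

    def set_nx(self, key: str, value: str) -> bool:
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def race(store: InMemoryStore, key: str, workers: int = 8) -> int:
    """Start all workers at once and count how many win the claim."""
    wins = []
    barrier = threading.Barrier(workers)

    def worker() -> None:
        barrier.wait()               # line everyone up before claiming
        if store.set_nx(key, "claimed"):
            wins.append(1)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(wins)
```

However the threads interleave, exactly one claim succeeds, which is the property the real store has to give you.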
4. Legitimate resend requested
A user clicks "resend confirmation." You want to send again even though the original key was used.
Defense: This is exactly what the version field in your key construction is for. The user action should increment the version, producing a new key:

    # Original send: version=1
    # User-triggered resend: version=2
    key = make_idempotency_key("order.confirmation", "order_4821", "user@example.com", version=2)

Request-reply flow for idempotent agents

    sequenceDiagram
        participant Agent
        participant DedupeStore
        participant EmailAPI
        participant Recipient
        Agent->>DedupeStore: SET NX email_lock_KEY claimed
        alt Lock claimed
            DedupeStore-->>Agent: OK claimed
            Agent->>EmailAPI: POST send with Idempotency-Key header
            EmailAPI-->>Agent: 200 message_id abc123
            Agent->>DedupeStore: SET email_result_KEY abc123
            EmailAPI->>Recipient: Deliver message
        else Lock exists and result stored
            DedupeStore-->>Agent: nil lock exists
            Agent->>DedupeStore: GET email_result_KEY
            DedupeStore-->>Agent: abc123
            Agent-->>Agent: Skip send return cached result
        else Lock exists no result yet
            DedupeStore-->>Agent: nil lock exists
            Agent->>DedupeStore: GET email_result_KEY
            DedupeStore-->>Agent: nil no result yet
            Agent-->>Agent: Return in_flight status
        end

TTL strategy for the deduplication store
Keys can't live forever — you'll exhaust memory. But they need to live long enough to cover all realistic retry windows.
- Transactional emails (order confirmations, password resets): 7-day TTL covers any plausible retry scenario
- Scheduled agent emails (weekly reports, digest emails): TTL should be at least 2x your scheduling interval — if your agent runs weekly, use a 14-day TTL
- High-frequency notifications: 24-hour TTL is usually sufficient; these are time-sensitive and a next-period re-trigger should send fresh
Store lock keys and result keys with the same TTL. If your result key expires before the lock key (or vice versa), you create inconsistent state.
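The guidelines above reduce to a small lookup. This sketch assumes three category labels (`transactional`, `scheduled`, `notification`) that are invented here for illustration; the durations are the ones listed above.

```python
from typing import Optional

DAY = 86400  # seconds


def ttl_for(category: str, schedule_interval_seconds: Optional[int] = None) -> int:
    """Return one TTL to use for both the lock key and the result key."""
    if category == "transactional":
        return 7 * DAY
    if category == "scheduled":
        if schedule_interval_seconds is None:
            raise ValueError("scheduled emails need a schedule interval")
        return 2 * schedule_interval_seconds  # at least twice the interval
    if category == "notification":
        return DAY
    raise ValueError(f"unknown category: {category}")
```

For a weekly report, `ttl_for("scheduled", 7 * DAY)` gives the 14-day TTL mentioned above. Returning a single number per intent makes it harder to set the lock and result keys to different lifetimes by accident.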
Idempotency at the DKIM/SPF layer
Deduplication keeps you from sending duplicates, but you still need the underlying emails to be trusted by receiving servers. Each message your agent sends should carry a valid DKIM signature tied to your sending domain. Duplicate detection happens before the SMTP layer, so your DKIM infrastructure isn't involved in deduplication. But here's the subtlety: a replayed copy of a message you already signed will still pass DKIM verification, because DKIM validates the signed headers and body and has no built-in replay protection. Your deduplication store is what prevents semantic duplicates, not the authentication layer.
This is why dedicated IP infrastructure and sender reputation matter for agents specifically — high-volume automated senders that occasionally produce duplicates (before idempotency is wired up) accumulate spam complaints that stick to the IP, not just the message.
Integrating with event-driven agent architectures
Most agents aren't calling a send function directly — they're responding to events from a queue (SQS, Kafka, Pub/Sub). Those queues are almost always at-least-once delivery. That means your idempotency layer isn't optional; it's the primary defense against duplicate processing.

    import logging

    logger = logging.getLogger(__name__)


    class RetryableError(Exception):
        """Signals the queue consumer to re-enqueue the event with backoff."""


    def handle_event(event: dict):
        # make_idempotency_key and attempt_send are defined earlier in this post;
        # send_order_confirmation is your own send function.
        # Derive key from event content, not from event metadata
        key = make_idempotency_key(
            event_type=event["type"],
            entity_id=event["data"]["order_id"],
            recipient=event["data"]["email"]
        )
        result = attempt_send(
            idempotency_key=key,
            send_fn=lambda: send_order_confirmation(event["data"])
        )
        if result["status"] == "duplicate":
            # Log but don't error: this is expected behavior
            logger.info("Deduplicated send key=%s original_id=%s",
                        key, result["original_message_id"])
        elif result["status"] == "in_flight":
            # Re-enqueue with backoff: another worker is handling it
            raise RetryableError("Send in flight, retry later")

Deriving the key from event content (order_id, email) rather than event metadata (event_id, timestamp) ensures that even if the same logical event is published twice with different IDs, you still deduplicate correctly. Event IDs are not stable semantic identifiers.
Observability: what to instrument
Idempotency without observability is flying blind. Instrument these metrics:
- Deduplication rate (email.send.deduplicated / email.send.attempted): should be low in steady state; spikes indicate upstream retry storms or agent loops
- Lock contention rate: how often you hit the in_flight case; high contention suggests you need distributed locking or your workers are processing the same events in parallel too aggressively
- Key TTL distribution: are keys expiring before you expect? Indicates your TTL is too short for your retry window
- Provider-level deduplication: some providers (including Mails.ai) expose whether a send was deduplicated at their layer — reconcile this against your store to catch cases where your deduplication didn't fire but theirs did
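As a starting point, the first two metrics can be derived from the status values that the send path already returns. This is a minimal in-process sketch; the `SendMetrics` class is hypothetical, and in production you would emit these counts to whatever metrics system you already run.

```python
from collections import Counter


class SendMetrics:
    """Counts send outcomes by the status string returned from the send path."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, status: str) -> None:
        self.counts["attempted"] += 1
        if status == "duplicate":
            self.counts["deduplicated"] += 1
        elif status == "in_flight":
            self.counts["in_flight"] += 1

    def _rate(self, name: str) -> float:
        attempted = self.counts["attempted"]
        return self.counts[name] / attempted if attempted else 0.0

    def deduplication_rate(self) -> float:
        return self._rate("deduplicated")

    def contention_rate(self) -> float:
        return self._rate("in_flight")
```

Calling `record(result["status"])` after every attempt is enough to chart both rates over time.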
Frequently Asked Questions
How is an idempotency key different from a Message-ID?
A Message-ID (e.g., <abc123@yourdomain.com>) is assigned by the sending client or mail system when the message is created and appears in the email headers, where mail clients use it, together with In-Reply-To and References, to group replies into threads. An idempotency key is your application-level identifier for the intent to send, checked before the send ever happens. The Message-ID is the result; the idempotency key is the gate. You store the Message-ID as the result value in your deduplication store.
What happens if my Redis instance goes down?
You have two options: fail open (skip the deduplication check and send anyway, accepting potential duplicates) or fail closed (refuse to send until the store is available). For most agent workflows, fail open is preferable — a rare duplicate is better than missed delivery. Gate your fail-open path with a circuit breaker and alert aggressively so you restore the store quickly.
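One possible shape for a fail-open path behind a circuit breaker is sketched below. It assumes a `claim` callable that returns True when the lock was won and False for a duplicate, and raises a hypothetical `StoreUnavailable` when the store cannot be reached. Here the breaker's job is to stop calling a dead store for a cooldown period; thresholds and names are illustrative.

```python
import time
from typing import Callable


class StoreUnavailable(Exception):
    """Hypothetical error raised when the deduplication store is unreachable."""


class FailOpenGuard:
    def __init__(self, max_failures: int = 5, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._open_until = 0.0
        self.unguarded_sends = 0  # alert on this counter

    def should_send(self, claim: Callable[[], bool]) -> bool:
        now = self._clock()
        if now < self._open_until:
            self.unguarded_sends += 1
            return True               # breaker open: skip the store entirely
        try:
            won = claim()
        except StoreUnavailable:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._open_until = now + self._cooldown
                self._failures = 0
            self.unguarded_sends += 1
            return True               # fail open: accept a possible duplicate
        self._failures = 0
        return won
```

The `unguarded_sends` counter is the number to alert on, since each increment is a send that went out without a deduplication check.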
Can I use a relational database instead of Redis for the deduplication store?
Yes, with caveats. You need a table with a unique index on the idempotency key and an atomic upsert (INSERT ... ON CONFLICT DO NOTHING in PostgreSQL). The performance is fine for moderate volumes (< 1k sends/sec). Above that, write contention on a single table becomes a bottleneck, and Redis's O(1) atomic SET NX is the better tool.
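Here is a rough sketch of the claim step against a relational table. It uses Python's built-in `sqlite3` as a stand-in so the example runs anywhere; SQLite 3.24 or newer accepts the same `ON CONFLICT ... DO NOTHING` clause, but the table name, column names, and helper functions are assumptions, and a PostgreSQL driver would use its own placeholder syntax.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE email_sends ("
        " idempotency_key TEXT PRIMARY KEY,"  # the unique index
        " message_id TEXT)"
    )
    return conn


def try_claim(conn: sqlite3.Connection, key: str) -> bool:
    """Insert the key; return True only if this call created the row."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO email_sends (idempotency_key) VALUES (?) "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (key,),
        )
    return cur.rowcount == 1


def record_result(conn: sqlite3.Connection, key: str, message_id: str) -> None:
    with conn:
        conn.execute(
            "UPDATE email_sends SET message_id = ? WHERE idempotency_key = ?",
            (message_id, key),
        )
```

The affected-row count plays the role that the SET NX return value plays in Redis: one row inserted means you won the claim, zero means someone else already holds it.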
How do I handle idempotency for bulk sends — e.g., an agent sending 10,000 newsletter emails?
Each recipient gets their own idempotency key: make_idempotency_key("newsletter.issue_42", "issue_42", recipient_email). The batch is the logical grouping but individual deliveries are the atomic unit. This lets you safely retry a failed batch without re-sending to addresses that already received it.
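A batch retry under that scheme might look like the sketch below. To stay self-contained it repeats the key function from earlier and uses a plain set in place of the deduplication store; `send_batch` and `send_one` are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Iterable, Set


def make_idempotency_key(event_type: str, entity_id: str, recipient: str, version: int = 1) -> str:
    payload = json.dumps(
        {"type": event_type, "entity": entity_id, "to": recipient, "v": version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def send_batch(issue_id: str, recipients: Iterable[str], already_sent: Set[str],
               send_one: Callable[[str], None]) -> int:
    """Send to every recipient not yet recorded; return how many were sent."""
    sent = 0
    for recipient in recipients:
        key = make_idempotency_key(f"newsletter.{issue_id}", issue_id, recipient)
        if key in already_sent:
            continue              # delivered in an earlier run of this batch
        send_one(recipient)       # may raise and abort the batch
        already_sent.add(key)     # record only after the send succeeded
        sent += 1
    return sent
```

If the batch dies halfway through, running it again with the same `already_sent` state picks up at the first recipient that was never recorded.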
Do email providers support idempotency keys natively?
Some do. Check your provider's API documentation for Idempotency-Key header support or equivalent fields. When available, pass your derived key — this gives you a second deduplication layer at the provider level, which covers the specific failure mode where the provider accepts your message but the HTTP response is lost in transit. Your application-level store is still necessary because provider-side idempotency windows are typically short (24 hours) and don't cover semantic intent across different API sessions.
How should I version idempotency keys when email content changes?
Content changes fall into two categories. If the change is incidental (you fixed a typo in a template), the logical intent is the same — don't bump the version, the recipient doesn't need the email again. If the change represents a new business event (the order was modified, the price changed), that's a new intent — bump the version or include the content-relevant field (like order_version) in the key derivation. Tie key versioning to semantic intent, not to template revisions.
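As a small illustration of the second category, the sketch below folds an `order_version` field into the key while leaving template details out of it. The `make_versioned_key` function and its parameter names are assumptions for this example.

```python
import hashlib
import json


def make_versioned_key(event_type: str, entity_id: str, recipient: str,
                       order_version: int) -> str:
    """Derive a key that changes only when the business entity changes."""
    payload = json.dumps(
        {
            "type": event_type,
            "entity": entity_id,
            "to": recipient,
            "order_version": order_version,  # bumped when the order is modified
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


# A template typo fix changes nothing that feeds the key, so the key is unchanged.
before_fix = make_versioned_key("order.confirmation", "order_4821", "user@example.com", 1)
after_fix = make_versioned_key("order.confirmation", "order_4821", "user@example.com", 1)

# A modified order is a new intent and gets a new key.
modified = make_versioned_key("order.confirmation", "order_4821", "user@example.com", 2)
```

Since the template revision never enters the payload, redeploying a corrected template cannot trigger a resend by accident.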