What is Temporal?
Temporal is a durable execution platform that guarantees workflows survive crashes, restarts, and deployments. Netflix runs 100K+ workflows/day on Temporal, with Datadog processing millions monthly. For AI agents, Temporal solves the "Restart Tax"—a 15-minute agent that crashes at 99% loses $4.50 in compute. Temporal persists every step as an event history, enabling deterministic replay from failure points. It provides exactly-once semantics (no duplicate charges), infinite wait support for human-in-the-loop approvals, and time-travel debugging via complete audit trails.
Agents are workflows, not functions.
A customer support agent doesn't just answer one question. It retrieves customer data from a CRM, searches a knowledge base, generates a response with an LLM, sends an email via SendGrid, and updates ticket status in Zendesk. If any step fails, the entire task must retry from the last successful step, not from scratch. This is durable execution.
Traditional approaches fail at this:
- Stateless functions (AWS Lambda): Lose state on failure, must restart from beginning, timeout after 15 minutes
- Message queues (SQS, RabbitMQ): Manual state management, brittle retry logic, no visibility into workflow progress
- Cron jobs: Polling is inefficient, no fault tolerance, can't handle event-driven workflows
Temporal solves this with three guarantees that matter for production agents:
- Durable execution: Workflows survive crashes, restarts, and deployments without losing state
- Deterministic replay: Re-execute workflows from event history to reconstruct state and continue
- Exactly-once semantics: No duplicate side effects—critical for billing customers, sending emails, or calling external APIs
This article is a technical blueprint for architects and TPMs building agent orchestration on Temporal. We'll cover architecture fundamentals, scaling to Netflix-level throughput (hundreds of thousands of workflows per day), and operational patterns learned from companies like Datadog (millions of workflows per month), Stripe, HashiCorp, and Coinbase.
1. Temporal Architecture Fundamentals
Temporal models your application as Workflow Executions and Activity Executions coordinated by a central Temporal Service and executed by stateless Workers polling named Task Queues. Each workflow's complete evolution is stored as an Event History—an immutable log of every decision, activity execution, timer, signal, and completion.
Core Components
Workflows are durable, language-level functions (Go, TypeScript, Java, Python, .NET) that orchestrate tasks. Unlike stateless functions, workflows persist their state as an event log, not in-memory variables. You write workflows using normal language features—loops, conditionals, composition—while delegating all side effects to activities.
Activities are units of work that perform side effects: HTTP calls, database operations, LLM requests. Temporal records their scheduling, completion, or failure as events and applies retry policies, timeouts, and heartbeats according to the configuration you specify in code. Activities can fail and retry without affecting workflow state—the workflow simply waits for the activity to eventually succeed or exhaust retries.
Workers are your processes that host workflow and activity code. They long-poll task queues over gRPC for workflow and activity tasks assigned by the Temporal Service. Workers are stateless—if one crashes, another picks up the work. You scale workers horizontally (10, 100, 1,000 instances) to handle more concurrent workflows.
Task Queues are named queues that route work to workers. Workflows and activities subscribe to specific queues (e.g., default, llm-workers, high-priority), enabling routing strategies like "send GPU-intensive LLM work to GPU-equipped workers" or "prioritize urgent customer tickets."
The Temporal Service is a horizontally scalable control plane that stores event histories and manages workflow execution. It's backed by a persistence layer—Cassandra for multi-region, high-throughput deployments (10,000s of workflows/second), or PostgreSQL/MySQL for simpler, lower-throughput use cases (<10,000 workflows/second). The service exposes APIs for starting workflows, sending signals, querying state, and listing execution histories.
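To make these moving parts concrete, here is a minimal worker sketch with the Python SDK, assuming the SupportAgentWorkflow and activities shown later in this article (the server address and queue name are placeholders):

from temporalio.client import Client
from temporalio.worker import Worker

async def main() -> None:
    # Connect to the Temporal Service (self-hosted or Temporal Cloud)
    client = await Client.connect("localhost:7233")
    # A stateless worker: hosts workflow + activity code and long-polls one task queue
    worker = Worker(
        client,
        task_queue="default",
        workflows=[SupportAgentWorkflow],
        activities=[get_customer_data, search_knowledge_base, generate_llm_response, send_email],
    )
    await worker.run()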
Event Sourcing: How It All Works
Temporal uses event sourcing to achieve durable execution. Instead of storing workflow state as snapshots (like a database record), it stores the complete history of events that produced that state. Here's the execution flow:
- Client starts a workflow: client.start_workflow(SupportAgentWorkflow, ticket_id="12345") (see the client sketch after this list)
- Temporal Service creates workflow in the database and places a task on the workflow's task queue
- Worker polls the task queue, picks up the workflow task
- Worker executes workflow code: Calls activities, sets timers, waits for signals
- Every decision is logged as an event: "ActivityScheduled", "ActivityCompleted", "TimerStarted", "TimerFired", "WorkflowCompleted"
- Event history is persisted to Cassandra or PostgreSQL
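Step 1 of this flow in code, as a minimal client-side sketch (the address, workflow ID, and task queue name are placeholders):

from temporalio.client import Client

async def start_ticket(ticket_id: str) -> str:
    client = await Client.connect("localhost:7233")
    # Writes WorkflowExecutionStarted to the event history and enqueues the first workflow task
    handle = await client.start_workflow(
        SupportAgentWorkflow.run,
        ticket_id,
        id=f"support-{ticket_id}",
        task_queue="default",
    )
    return await handle.result()  # blocks until the workflow completes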
When a worker crashes mid-workflow, here's what happens:
- Temporal Service detects the failure (worker stops polling or workflow task times out)
- Service assigns the workflow to a new worker
- New worker reads the event history from the database
- Replays the workflow code, skipping already-executed activities (their results are in the event history)
- Continues from the last successful step
This is deterministic replay—the cornerstone of Temporal's reliability guarantee.
2. Deterministic Replay: The Core Guarantee
Replay is what makes Temporal magical for long-running agents. When a workflow resumes after a crash or deployment, the new worker doesn't know what the previous worker was doing. But it can reconstruct the exact in-memory state by replaying the event history.
How Replay Works
The worker re-executes the workflow code from the beginning, but instead of actually running activities (which would duplicate side effects), it reads activity results from the event history. The workflow function "fast-forwards" through already-executed code paths until it reaches the point where it was interrupted, then continues with new logic.
Example: A support agent workflow that fetches customer data, searches a knowledge base, calls an LLM, and sends an email.
from datetime import timedelta

from temporalio import workflow

@workflow.defn
class SupportAgentWorkflow:
    @workflow.run
    async def run(self, ticket_id: str) -> str:
        # Step 1: Retrieve customer data
        customer = await workflow.execute_activity(
            get_customer_data,
            ticket_id,
            start_to_close_timeout=timedelta(seconds=10),
        )
        # Step 2: Search knowledge base
        docs = await workflow.execute_activity(
            search_knowledge_base,
            customer.query,
            start_to_close_timeout=timedelta(seconds=5),
        )
        # Step 3: Generate LLM response (multiple arguments are passed via args=[...])
        response = await workflow.execute_activity(
            generate_llm_response,
            args=[customer.query, docs],
            start_to_close_timeout=timedelta(seconds=30),
        )
        # Step 4: Send email
        await workflow.execute_activity(
            send_email,
            args=[customer.email, response],
            start_to_close_timeout=timedelta(seconds=10),
        )
        return "resolved"

Scenario: Worker crashes after Step 3 completes but before Step 4 starts.
On replay:
- New worker reads event history: [WorkflowStarted, ActivityScheduled(get_customer_data), ActivityCompleted(customer=...), ActivityScheduled(search_knowledge_base), ActivityCompleted(docs=...), ActivityScheduled(generate_llm_response), ActivityCompleted(response=...)]
- Re-executes workflow code:
  - Step 1: Sees ActivityCompleted(customer=...) in history → returns cached result instantly
  - Step 2: Sees ActivityCompleted(docs=...) in history → returns cached result instantly
  - Step 3: Sees ActivityCompleted(response=...) in history → returns cached result instantly
  - Step 4: Not in history → schedules the activity and generates an ActivityScheduled(send_email) event
- Worker sends email (Step 4 executes for the first time)
- Workflow completes
The LLM wasn't called twice. The email wasn't sent twice. The workflow resumed exactly where it left off.
Determinism Constraints
For replay to work, workflow code must be deterministic: given the same event history, it must make the same decisions every time. This imposes constraints:
❌ No random numbers: Math.random() produces different values on replay
❌ No system time: Date.now() changes between executions
❌ No external state: Reading from databases, file systems, or APIs during workflow logic (not activities)
❌ No non-deterministic libraries: UUID generation, shuffling arrays, anything with side effects
✅ Use Temporal APIs (see the sketch below):
- workflow.uuid4() generates UUIDs that are deterministic across replays (the Python SDK name; other SDKs expose equivalents)
- workflow.now() returns a deterministic timestamp taken from the event history
- Activities encapsulate all side effects
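A small sketch of these replay-safe helpers inside a workflow (the class name and log line are illustrative):

from datetime import timedelta
from temporalio import workflow

@workflow.defn
class DeterministicHelpers:
    @workflow.run
    async def run(self) -> str:
        request_id = str(workflow.uuid4())              # same value on every replay
        deadline = workflow.now() + timedelta(hours=1)  # time read from the event history
        workflow.logger.info("request %s due %s", request_id, deadline)
        return request_id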
Example of non-deterministic workflow (BAD):
@workflow.run
async def bad_workflow():
    user_id = random.randint(1, 1000)  # ❌ Changes on replay
    result = await workflow.execute_activity(call_api, user_id)
    return result

On replay, random.randint() generates a different user_id, causing the workflow to schedule a different activity than the first time. Temporal detects the mismatch and throws a non-determinism error.
Example of deterministic workflow (GOOD):
@workflow.run
async def good_workflow(user_id: int):
    # user_id passed as argument (deterministic input)
    result = await workflow.execute_activity(call_api, user_id)
    return result

Exactly-Once Semantics for Side Effects
Temporal records each activity result exactly once, and replay never re-runs a completed activity. Here's how:
- First execution: Worker schedules the activity, Temporal logs ActivityScheduled, the worker executes the activity code, and Temporal logs ActivityCompleted with the result
- Replay: Worker sees ActivityCompleted in history, skips execution, and returns the cached result
- Retry on failure: If the activity fails, Temporal applies the retry policy and runs another attempt. The activity code executes again on each attempt, so external calls should be idempotent (see the billing pattern below); the workflow only sees the final outcome, success or exhausted retries.
This is critical for agents that charge customers ($0.99 per resolution), send emails (no spam), or update external systems (no duplicate records).
3. Why Production Agents Need Temporal
Agents execute multi-step, cross-system processes with complex failure modes, human involvement, and long latencies. Temporal was designed for exactly this.
Long-Running Tasks
Agents don't finish in milliseconds. A support ticket might take 30 seconds (retrieve data, LLM call, send email). A legal research task might take hours (search case law, summarize, generate memo, wait for lawyer review). A sales follow-up might wait 3 days before sending the next email.
Stateless functions (Lambda) timeout after 15 minutes. Message queues require manual state persistence. Temporal workflows run indefinitely—days, months, years—with state persisted in event history.
Example: Sales follow-up agent that waits 3 days between emails.
@workflow.run
async def sales_followup_workflow(lead_id: str):
    # Send initial email
    await workflow.execute_activity(send_email, args=[lead_id, "Initial outreach"],
                                    start_to_close_timeout=timedelta(seconds=30))
    # Wait 3 days (a durable Temporal timer, not a cron job)
    await asyncio.sleep(timedelta(days=3).total_seconds())
    # Send follow-up
    await workflow.execute_activity(send_email, args=[lead_id, "Follow-up"],
                                    start_to_close_timeout=timedelta(seconds=30))

The workflow sleeps for 3 days without occupying any worker resources. Temporal sets a timer in the event history, and the worker isn't involved until the timer fires. This enables patterns like "send a sequence of 5 emails over 2 weeks" or "escalate to manager if no response in 24 hours" without brittle cron jobs or external schedulers (a sketch of the drip pattern follows).
Failure Recovery
LLM APIs rate-limit (OpenAI 429 errors). Databases timeout. External APIs go down. Workers crash. Temporal handles all of this automatically.
Activity retry policies let you configure retries with exponential backoff:
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def call_openai_api(prompt: str) -> str:
    # May fail with a 429 rate-limit error
    response = await openai.ChatCompletion.acreate(  # async call in the pre-1.0 openai SDK
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Configure retry policy (inside the workflow)
await workflow.execute_activity(
    call_openai_api,
    prompt,
    start_to_close_timeout=timedelta(seconds=60),
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=1),
        maximum_interval=timedelta(seconds=60),
        maximum_attempts=10,
        backoff_coefficient=2.0,  # 1s, 2s, 4s, 8s, 16s, 32s, 60s, 60s...
    ),
)

If the OpenAI API returns a 429, Temporal automatically retries with exponential backoff. The workflow doesn't need to know about retries; it just waits for the activity to eventually succeed or exhaust attempts. The attempt count and last failure are visible in the Temporal UI, making failures debuggable.
If a worker crashes mid-workflow, deterministic replay ensures the workflow resumes from the last successful step. You don't lose progress. You don't restart from scratch. The agent picks up exactly where it left off.
Human-in-the-Loop (HITL)
Many agents need human approval before taking action. A legal AI drafts a memo, but a lawyer must review it before sending to the client. A support agent escalates a complex issue to a human, waits for resolution, then continues the workflow.
Temporal's signals and wait conditions are designed for HITL:
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class LegalReviewWorkflow:
    def __init__(self) -> None:
        self.approved = False

    @workflow.signal
    async def approve(self):
        self.approved = True

    @workflow.run
    async def run(self, case_id: str) -> str:
        # Step 1: Generate legal memo
        memo = await workflow.execute_activity(
            generate_memo, case_id, start_to_close_timeout=timedelta(minutes=5)
        )
        # Step 2: Wait for lawyer approval (could take hours);
        # wait_condition raises asyncio.TimeoutError if the deadline passes
        try:
            await workflow.wait_condition(
                lambda: self.approved, timeout=timedelta(hours=24)
            )
        except asyncio.TimeoutError:
            # Timeout: escalate to senior partner
            await workflow.execute_activity(
                escalate_to_partner, case_id, start_to_close_timeout=timedelta(seconds=30)
            )
            return "escalated"
        # Step 3: Send approved memo to client
        await workflow.execute_activity(
            send_memo, memo, start_to_close_timeout=timedelta(seconds=30)
        )
        return "sent"

External client sends the approval signal:

# Lawyer reviews memo in the UI, clicks "Approve"
handle = client.get_workflow_handle(workflow_id="legal-12345")
await handle.signal(LegalReviewWorkflow.approve)

The workflow waits for up to 24 hours. If the lawyer approves within that window, the signal sets self.approved = True, the wait condition unblocks, and the workflow sends the memo. If 24 hours pass with no approval, the workflow escalates to a senior partner.
During the wait, the workflow isn't running. It's persisted in the database as an event history. The worker isn't blocked. The workflow can wait for days if needed.
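The review UI also needs to read workflow state without interrupting it. A sketch that adds a query handler to the same workflow (the handler name is illustrative):

# Inside LegalReviewWorkflow:
    @workflow.query
    def review_status(self) -> str:
        return "approved" if self.approved else "awaiting_review"

# From the review UI's backend:
handle = client.get_workflow_handle(workflow_id="legal-12345")
status = await handle.query(LegalReviewWorkflow.review_status)  # e.g. "awaiting_review"

Queries return the workflow's current in-memory state (reconstructed via replay if necessary) without writing new events, so they are safe to call from dashboards and polling UIs.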
Multi-Agent Orchestration
Agents often call other agents. A sales agent calls a legal agent to review a contract. A support agent calls a billing agent to process a refund. Temporal's child workflows model this naturally:
@workflow.run
async def sales_agent_workflow(deal_id: str):
    # Step 1: Call legal agent (child workflow)
    contract_review = await workflow.execute_child_workflow(
        LegalAgentWorkflow.run,
        deal_id,
        id=f"legal-{deal_id}",
        task_queue="legal-workers",  # Route to GPU workers if needed
    )
    # Step 2: Call billing agent (child workflow)
    invoice = await workflow.execute_child_workflow(
        BillingAgentWorkflow.run,
        args=[deal_id, contract_review.amount],
        id=f"billing-{deal_id}",
        task_queue="billing-workers",
    )
    # Step 3: Finalize deal
    return await workflow.execute_activity(
        close_deal, args=[contract_review, invoice],
        start_to_close_timeout=timedelta(seconds=30),
    )

Child workflows:
- Run in parallel if you use asyncio.gather() (see the sketch below)
- Have their own event histories (isolated failure domains)
- Can be cancelled by the parent (handle.cancel())
- Fail up to the parent if they fail (or you can catch exceptions and compensate)
This enables complex multi-agent systems where the parent workflow coordinates specialists, each with their own retries, timeouts, and failure handling.
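A sketch of that fan-out with asyncio.gather(), assuming a hypothetical ComplianceAgentWorkflow alongside the LegalAgentWorkflow above:

# Inside a parent workflow: both children run concurrently, each with its own event history
legal_review, compliance_report = await asyncio.gather(
    workflow.execute_child_workflow(
        LegalAgentWorkflow.run, deal_id,
        id=f"legal-{deal_id}", task_queue="legal-workers",
    ),
    workflow.execute_child_workflow(
        ComplianceAgentWorkflow.run, deal_id,
        id=f"compliance-{deal_id}", task_queue="compliance-workers",
    ),
)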
Exactly-Once Execution for Billing
If you charge customers $0.99 per resolved ticket (like Intercom Fin), you cannot charge them twice because of a retry. Temporal never re-executes a billing activity that has already completed, and pairing every attempt with an idempotency key ensures that even a retried attempt produces only one charge.
Pattern: Use activity IDs as idempotency keys.
@activity.defn
async def charge_customer(ticket_id: str, amount: float) -> str:
    # Use ticket_id as the idempotency key in Stripe
    # (assumes an async-capable Stripe client or wrapper)
    charge = await stripe.Charge.create(
        amount=int(amount * 100),  # cents
        currency="usd",
        idempotency_key=f"ticket-{ticket_id}",
    )
    return charge.id

If the activity retries, Stripe sees the same idempotency key and returns the original charge without creating a duplicate. Combined with Temporal's activity deduplication (based on event history), this ensures exactly-once billing.
4. Scaling Temporal for Production
Netflix runs hundreds of thousands of workflows per day on Temporal. Datadog runs millions of workflows per month. Here's how to scale Temporal to that level.
Worker Scaling
Workers are stateless and scale horizontally. To handle more workflows:
- Deploy more worker processes (10 → 100 → 1,000 instances)
- Tune worker concurrency: Each worker can execute multiple workflows and activities concurrently. Configure maxConcurrentWorkflowTaskExecutionSize (default: 100) and maxConcurrentActivityExecutionSize (default: 100), exposed as max_concurrent_workflow_tasks and max_concurrent_activities in the Python SDK, based on CPU and memory (see the sketch after this list).
- Autoscale based on task queue depth: Use KEDA (Kubernetes Event-Driven Autoscaler) or Horizontal Pod Autoscaler to scale workers when the task queue backlog grows.
Example autoscaling metric:
# KEDA ScaledObject for Temporal workers
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: temporal-worker-scaler
spec:
  scaleTargetRef:
    name: temporal-worker
  minReplicaCount: 10
  maxReplicaCount: 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: temporal_task_queue_depth
        threshold: '100'  # Scale up if >100 tasks pending
        query: sum(temporal_task_queue_depth{queue="default"})

Task Queue Routing
Not all workflows have the same resource requirements. LLM-heavy workflows need GPU workers. High-priority customer tickets should go to dedicated workers. Low-priority batch jobs can share general-purpose workers.
Temporal's task queues enable routing strategies:
# Start workflow on a specific task queue
await client.start_workflow(
    SupportAgentWorkflow.run,
    "12345",                     # ticket_id
    id="support-12345",
    task_queue="high-priority",  # Route to dedicated workers
)

# LLM-heavy workflow
await client.start_workflow(
    LegalResearchWorkflow.run,
    "67890",                     # case_id
    id="legal-research-67890",
    task_queue="llm-workers",    # Route to GPU-equipped workers
)

Worker deployment:
- default-workers: General-purpose CPU workers (100 instances)
- llm-workers: GPU workers for LLM inference (10 instances with A100 GPUs)
- high-priority-workers: Dedicated workers for urgent tickets (20 instances, low latency)
Each worker pool polls only its designated task queue, ensuring workload isolation.
Event History Limits
Temporal enforces a hard cap of 51,200 events or 50 MB per workflow execution, with warnings starting at 10,240 events or 10 MB. If a workflow generates too many events (e.g., a loop that schedules 100,000 activities), it will fail.
Solution: Continue-As-New
@workflow.run
async def long_running_workflow(start: int = 0):
    for i in range(start, 100000):
        await workflow.execute_activity(process_item, i)
        # Reset the workflow every 1,000 iterations
        if i > 0 and i % 1000 == 0:
            workflow.continue_as_new(i + 1)  # Start a new execution, passing the counter forward

continue_as_new() closes the current workflow execution and starts a fresh one with a new event history, passing forward any state needed to continue. This keeps event histories small and replay fast.
When to use Continue-As-New:
- Long-running workflows that generate >10,000 events
- Infinite loops (e.g., "process messages from queue forever")
- Workflows that fan out to many child workflows (>1,000 children)
Storage: Cassandra vs PostgreSQL
Cassandra:
- Pros: Multi-region, multi-DC, write-optimized (perfect for event sourcing), handles >10,000 workflows/second
- Cons: Complex to operate, requires expertise in Cassandra tuning, replication, and compaction
- Use case: High-throughput, mission-critical workloads (Netflix, Uber)
PostgreSQL:
- Pros: Simpler to operate, strong consistency, easier to debug (standard SQL)
- Cons: Single-node bottleneck, <10,000 workflows/second
- Use case: Lower-throughput deployments, teams without Cassandra expertise
When to use Cassandra:
- >10,000 workflows/second (beyond what a single relational database can sustain)
- Multi-region failover required
- Mission-critical workloads (billing, compliance)
When to use PostgreSQL:
- <10,000 workflows/second
- Simpler operations preferred
- Single-region deployment
Temporal Cloud vs Self-Hosted
Temporal Cloud:
- Pros: Fully managed (no ops), automatic upgrades, multi-region failover, usage-based pricing
- Cons: Cost (pay per "Action"—workflow events, activity executions, heartbeats, signals)
- Pricing model: Actions per second × total Actions + storage
Self-Hosted:
- Pros: Full control, potential cost savings at scale, custom tuning
- Cons: Operational complexity (Cassandra clusters, upgrades, backups, monitoring)
Cost crossover: Temporal Cloud is typically cheaper for <1M workflows/month. Beyond that, self-hosted can be more cost-effective, but factor in engineering time for operations.
Example: Datadog self-hosts Temporal and has publicly discussed the operational challenges: scaling Cassandra clusters, managing event history growth, ensuring reliable upgrades. For smaller teams, Temporal Cloud offloads this burden.
5. Versioning and Deployment
Workflows can run for months or years. During that time, you'll deploy new code. How do you ensure new code is compatible with workflows started on old code?
Workflow Versioning
Use the SDK's versioning API to branch workflow logic on a marker recorded in the event history: get_version()/GetVersion() in the Go and Java SDKs, workflow.patched() in the Python and TypeScript SDKs:

@workflow.run
async def my_workflow():
    if workflow.patched("my-change"):
        # New behavior (workflows that have not yet run the old code path)
        result = await workflow.execute_activity(new_activity)
    else:
        # Old behavior (replaying histories recorded before the deploy)
        result = await workflow.execute_activity(old_activity)
    return result

When new code executes this line for the first time, patched("my-change") records a marker in the event history and returns True, so new workflows take the new path. When a history recorded before the deploy is replayed, the marker is absent, patched() returns False, and the old path is taken, so replay stays deterministic. Once no pre-deploy executions remain, the branch can be removed (the Python SDK provides deprecate_patch() for that cleanup).
Deployment Strategies
Blue-Green Deployment:
- Deploy v2 workers alongside v1 workers (both poll the same task queue)
- Route new workflow starts to v2 workers (via task queue or workflow versioning)
- Wait for v1 workflows to drain (monitor active workflow count)
- Decommission v1 workers
Canary Deployment:
- Deploy v2 workers at 10% capacity
- Route 10% of new workflows to v2
- Monitor error rates, latency, quality metrics
- Gradually increase to 50%, 100%
Replay Testing: Before deploying new workflow code, run replay tests in CI: fetch production event histories, replay them against new code, assert no non-determinism errors. This catches breaking changes before they hit production.
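A minimal sketch of such a CI test using the Python SDK's Replayer, assuming histories have been exported to JSON (the file path, workflow ID, and test framework are placeholders):

import pytest
from temporalio.client import WorkflowHistory
from temporalio.worker import Replayer

@pytest.mark.asyncio
async def test_replay_production_histories() -> None:
    replayer = Replayer(workflows=[SupportAgentWorkflow])
    with open("histories/support-12345.json") as f:
        history = WorkflowHistory.from_json("support-12345", f.read())
    # Raises if the new code diverges from the recorded history (non-determinism)
    await replayer.replay_workflow(history)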
6. Observability and Debugging
Temporal's Web UI provides time-ordered audit trails of every workflow execution. You can inspect:
- Event history: Full sequence of workflow decisions, activity executions, timers, signals
- Pending activities: What is the workflow waiting for?
- Stack traces: If an activity failed, see the error message and stack trace
- Replay: Re-execute the workflow locally with a debugger
Time-Travel Debugging
When a workflow fails in production:
- Open the workflow execution in Temporal Web UI
- Copy the event history (JSON export)
- Replay the workflow locally with the production event history
- Step through the code with a debugger to understand the failure
This is impossible with stateless functions (Lambda loses state on failure). With Temporal, you have a complete audit trail of what the agent did, in what order, with what data.
Metrics and Alerting
Temporal exports metrics via OpenTelemetry:
- Workflow start/completion rates (workflows/sec)
- Activity failure rates (% of activities failing)
- Task queue depth (backlog of pending workflows)
- Worker utilization (% of workers busy)
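One way to expose these from a Python worker process, sketched here with the SDK's Prometheus telemetry option (the bind address and port are illustrative):

from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig

async def connect_with_metrics() -> Client:
    # SDK metrics (task latencies, poll counts, failures) are served at http://<host>:9464/metrics
    runtime = Runtime(
        telemetry=TelemetryConfig(metrics=PrometheusConfig(bind_address="0.0.0.0:9464"))
    )
    return await Client.connect("localhost:7233", runtime=runtime)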
Integrate with Datadog, Prometheus, or Grafana to alert on:
- High activity failure rate (>5% failures → investigate)
- Task queue backlog (>1,000 pending → scale workers)
- Slow workflows (p95 latency >10s → optimize activities)
7. Temporal vs Alternatives
Temporal vs AWS Step Functions
Step Functions:
- State machine defined in JSON (Amazon States Language)
- Tightly integrated with AWS services (Lambda, S3, DynamoDB)
- Declarative, no code
- Limitation: Complex logic is hard to express in JSON (loops and conditionals are limited and verbose)
Temporal:
- Workflows written in code (Python, Go, TypeScript, Java)
- Full programming language (loops, functions, composition)
- Multi-cloud, portable
When Step Functions wins: Simple state machines (<10 states), all-AWS infrastructure
When Temporal wins: Complex workflows (100+ steps), multi-cloud, need to version workflow logic over years
Temporal vs Apache Airflow
Airflow:
- DAG-based batch workflows
- Scheduler-driven (cron-like)
- Designed for data pipelines, ETL
- Limitation: Minute-level latency (not real-time), no durable execution (tasks restart from scratch on failure)
Temporal:
- Event-driven (workflows start immediately on demand)
- Sub-second latency (real-time)
- Durable execution (workflows resume from last step on failure)
When Airflow wins: Daily batch ETL jobs, data engineering workflows
When Temporal wins: Real-time agents (customer support, sales), long-running workflows with complex failure handling
Temporal vs Cadence
Cadence is Temporal's predecessor, developed at Uber. Temporal is a fork of Cadence (2019) by the same core team.
Why Temporal forked:
- Independent governance (Temporal is a separate company)
- Faster iteration (Temporal ships features faster)
- Better docs, SDKs, community
Today: Temporal has broader adoption, better Cloud offering, larger ecosystem. Cadence is still used at Uber and some enterprises, but most new deployments choose Temporal.
8. Real-World Scale
Netflix
Runs hundreds of thousands of workflows per day on Temporal, with projections toward a million workflows per day as usage expands. Temporal powers critical workflows like media encoding, content delivery, and experimentation pipelines.
Datadog
Runs millions of workflows per month across 100+ internal teams. Self-hosts Temporal on Cassandra. Datadog's engineering team has publicly discussed challenges of operating Temporal at scale: managing Cassandra clusters, tuning event history retention, ensuring reliable upgrades.
Other Adopters
- Stripe: Financial workflows, payment orchestration
- HashiCorp: Terraform Cloud orchestration
- Coinbase: Crypto transaction workflows
- Snap: Snapchat infrastructure workflows
- Box: Content workflow automation
These companies moved from homegrown orchestrators to Temporal for increased reliability, developer productivity, and observability.
9. When NOT to Use Temporal
Temporal is powerful, but not always the right tool:
Stateless, Short-Lived Tasks
If your task is a single API call with no retries needed (e.g., "fetch user profile from API"), Lambda or a simple HTTP handler is simpler than Temporal. Temporal's overhead (task queue polling, event persistence) adds ~100ms of latency.
Very Low Latency (<10ms)
Temporal's task queue polling and event persistence add latency. For <10ms operations, use gRPC, HTTP, or direct function calls.
Simple CRUD Operations
If you're just reading/writing a database with no orchestration, Temporal is overkill. Use a REST API or GraphQL.
Temporal shines when:
- Workflows span multiple systems (CRM, LLM, email, ticketing)
- Long-running (>1 minute, often hours/days)
- Complex failure handling (retries, compensations, human escalations)
- Need exactly-once semantics (billing, compliance)
- Must survive deployments without losing state
10. The Bottom Line
Temporal is infrastructure for production agents. It provides the same durable execution guarantees that databases provide for data—but for workflows.
If you're building agents that:
- Orchestrate multi-step processes across systems
- Run for more than a few seconds
- Need to survive crashes, deployments, and external API failures
- Involve humans in the loop
- Coordinate multiple specialized agents
- Charge customers or call external APIs (where exactly-once matters)
...then Temporal is the orchestration engine you need.
Trade-offs:
- Learning curve: Deterministic workflows, event history limits, versioning
- Operational complexity: Self-hosting requires Cassandra expertise (or pay for Temporal Cloud)
- Latency overhead: ~100ms minimum (not suitable for <10ms operations)
Benefits:
- No lost state: Workflows survive crashes, restarts, deployments
- Observability: Complete audit trail of every decision, activity, signal
- Exactly-once execution: No duplicate charges, emails, or database writes
- Evolvability: Version workflows over months/years without breaking in-flight executions
Netflix, Datadog, and Stripe run millions of workflows per month on Temporal. If it's good enough for them, it's good enough for your agents.
Getting started:
- Start with the single-agent workflow pattern (1 workflow = 1 task)
- Use Temporal Cloud (avoid operational complexity early on)
- Scale workers horizontally as load grows
- Add multi-agent orchestration when you have specialized agents (legal, billing, etc.)
- Monitor event history size, use Continue-As-New if workflows grow large
Temporal is the durable execution engine that lets you build agents that just work—even when everything else fails.