What is Temporal?
Temporal is a durable execution platform that guarantees workflows survive crashes, restarts, and deployments. Netflix runs 100K+ workflows/day on Temporal, with Datadog processing millions monthly. For AI agents, Temporal solves the "Restart Tax"—a 15-minute agent that crashes at 99% loses $4.50 in compute. Temporal persists every step as an event history, enabling deterministic replay from failure points. It provides exactly-once semantics (no duplicate charges), infinite wait support for human-in-the-loop approvals, and time-travel debugging via complete audit trails.
Agents are workflows, not functions.
A customer support agent doesn't just answer one question. It retrieves customer data from a CRM, searches a knowledge base, generates a response with an LLM, sends an email via SendGrid, and updates ticket status in Zendesk. If any step fails, the entire task must retry from the last successful step, not from scratch. This is durable execution.
Traditional approaches fail at this:
- Stateless functions (AWS Lambda): Lose state on failure, must restart from beginning, timeout after 15 minutes
- Message queues (SQS, RabbitMQ): Manual state management, brittle retry logic, no visibility into workflow progress
- Cron jobs: Polling is inefficient, no fault tolerance, can't handle event-driven workflows
Temporal solves this with three guarantees that matter for production agents:
- Durable execution: Workflows survive crashes, restarts, and deployments without losing state
- Deterministic replay: Re-execute workflows from event history to reconstruct state and continue
- Exactly-once semantics: No duplicate side effects—critical for billing customers, sending emails, or calling external APIs
This article is a technical blueprint for architects and TPMs building agent orchestration on Temporal. We'll cover architecture fundamentals, scaling to Netflix-level throughput (hundreds of thousands of workflows per day), and operational patterns learned from companies like Datadog (millions of workflows per month), Stripe, HashiCorp, and Coinbase.
1. Temporal Architecture Fundamentals
Temporal models your application as Workflow Executions and Activity Executions coordinated by a central Temporal Service and executed by stateless Workers polling named Task Queues. Each workflow's complete evolution is stored as an Event History—an immutable log of every decision, activity execution, timer, signal, and completion.
Core Components
Workflows are durable, language-level functions (Go, TypeScript, Java, Python, .NET) that orchestrate tasks. Unlike stateless functions, workflows persist their state as an event log, not in-memory variables. You write workflows using normal language features—loops, conditionals, composition—while delegating all side effects to activities.
Activities are units of work that perform side effects: HTTP calls, database operations, LLM requests. Temporal records their scheduling, completion, or failure as events and applies retry policies, timeouts, and heartbeats according to the configuration you specify in code. Activities can fail and retry without affecting workflow state—the workflow simply waits for the activity to eventually succeed or exhaust retries.
Workers are your processes that host workflow and activity code. They long-poll task queues over gRPC for workflow and activity tasks assigned by the Temporal Service. Workers are stateless—if one crashes, another picks up the work. You scale workers horizontally (10, 100, 1,000 instances) to handle more concurrent workflows.
Task Queues are named queues that route work to workers. Workflows and activities subscribe to specific queues (e.g., default, llm-workers, high-priority), enabling routing strategies like "send GPU-intensive LLM work to GPU-equipped workers" or "prioritize urgent customer tickets."
The Temporal Service is a horizontally scalable control plane that stores event histories and manages workflow execution. It's backed by a persistence layer—Cassandra for multi-region, high-throughput deployments (10,000s of workflows/second), or PostgreSQL/MySQL for simpler, lower-throughput use cases (<10,000 workflows/second). The service exposes APIs for starting workflows, sending signals, querying state, and listing execution histories.
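To make these moving parts concrete, here is a minimal worker sketch with the Python SDK, assuming the SupportAgentWorkflow and activities shown later in this article (the server address and queue name are placeholders):

from temporalio.client import Client
from temporalio.worker import Worker

async def main() -> None:
    # Connect to the Temporal Service (self-hosted or Temporal Cloud)
    client = await Client.connect("localhost:7233")
    # A stateless worker: hosts workflow + activity code and long-polls one task queue
    worker = Worker(
        client,
        task_queue="default",
        workflows=[SupportAgentWorkflow],
        activities=[get_customer_data, search_knowledge_base, generate_llm_response, send_email],
    )
    await worker.run()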
Event Sourcing: How It All Works
Temporal uses event sourcing to achieve durable execution. Instead of storing workflow state as snapshots (like a database record), it stores the complete history of events that produced that state. Here's the execution flow:
- Client starts a workflow: client.start_workflow(SupportAgentWorkflow, ticket_id="12345") (see the client sketch after this list)
- Temporal Service creates workflow in the database and places a task on the workflow's task queue
- Worker polls the task queue, picks up the workflow task
- Worker executes workflow code: Calls activities, sets timers, waits for signals
- Every decision is logged as an event: "ActivityScheduled", "ActivityCompleted", "TimerStarted", "TimerFired", "WorkflowCompleted"
- Event history is persisted to Cassandra or PostgreSQL
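Step 1 of this flow in code, as a minimal client-side sketch (the address, workflow ID, and task queue name are placeholders):

from temporalio.client import Client

async def start_ticket(ticket_id: str) -> str:
    client = await Client.connect("localhost:7233")
    # Writes WorkflowExecutionStarted to the event history and enqueues the first workflow task
    handle = await client.start_workflow(
        SupportAgentWorkflow.run,
        ticket_id,
        id=f"support-{ticket_id}",
        task_queue="default",
    )
    return await handle.result()  # blocks until the workflow completes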
When a worker crashes mid-workflow, here's what happens:
- Temporal Service detects the failure (worker stops polling or workflow task times out)
- Service assigns the workflow to a new worker
- New worker reads the event history from the database
- Replays the workflow code, skipping already-executed activities (their results are in the event history)
- Continues from the last successful step
This is deterministic replay—the cornerstone of Temporal's reliability guarantee.
2. Deterministic Replay: The Core Guarantee
Replay is what makes Temporal magical for long-running agents. When a workflow resumes after a crash or deployment, the new worker doesn't know what the previous worker was doing. But it can reconstruct the exact in-memory state by replaying the event history.
How Replay Works
The worker re-executes the workflow code from the beginning, but instead of actually running activities (which would duplicate side effects), it reads activity results from the event history. The workflow function "fast-forwards" through already-executed code paths until it reaches the point where it was interrupted, then continues with new logic.
Example: A support agent workflow that fetches customer data, searches a knowledge base, calls an LLM, and sends an email.
from datetime import timedelta

from temporalio import workflow

@workflow.defn
class SupportAgentWorkflow:
    @workflow.run
    async def run(self, ticket_id: str) -> str:
        # Step 1: Retrieve customer data
        customer = await workflow.execute_activity(
            get_customer_data,
            ticket_id,
            start_to_close_timeout=timedelta(seconds=10),
        )
        # Step 2: Search knowledge base
        docs = await workflow.execute_activity(
            search_knowledge_base,
            customer.query,
            start_to_close_timeout=timedelta(seconds=5),
        )
        # Step 3: Generate LLM response (multiple arguments are passed via args=[...])
        response = await workflow.execute_activity(
            generate_llm_response,
            args=[customer.query, docs],
            start_to_close_timeout=timedelta(seconds=30),
        )
        # Step 4: Send email
        await workflow.execute_activity(
            send_email,
            args=[customer.email, response],
            start_to_close_timeout=timedelta(seconds=10),
        )
        return "resolved"

Scenario: Worker crashes after Step 3 completes but before Step 4 starts.
On replay:
- New worker reads event history: [WorkflowStarted, ActivityScheduled(get_customer_data), ActivityCompleted(customer=...), ActivityScheduled(search_knowledge_base), ActivityCompleted(docs=...), ActivityScheduled(generate_llm_response), ActivityCompleted(response=...)]
- Re-executes workflow code:
  - Step 1: Sees ActivityCompleted(customer=...) in history → returns cached result instantly
  - Step 2: Sees ActivityCompleted(docs=...) in history → returns cached result instantly
  - Step 3: Sees ActivityCompleted(response=...) in history → returns cached result instantly
  - Step 4: Not in history → schedules the activity and generates an ActivityScheduled(send_email) event
- Worker sends email (Step 4 executes for the first time)
- Workflow completes
The LLM wasn't called twice. The email wasn't sent twice. The workflow resumed exactly where it left off.
Determinism Constraints
For replay to work, workflow code must be deterministic: given the same event history, it must make the same decisions every time. This imposes constraints:
❌ No random numbers: Math.random() produces different values on replay
❌ No system time: Date.now() changes between executions
❌ No external state: Reading from databases, file systems, or APIs during workflow logic (not activities)
❌ No non-deterministic libraries: UUID generation, shuffling arrays, anything with side effects
✅ Use Temporal APIs (see the sketch below):
- workflow.uuid4() generates UUIDs that are deterministic across replays (the Python SDK name; other SDKs expose equivalents)
- workflow.now() returns a deterministic timestamp taken from the event history
- Activities encapsulate all side effects
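A small sketch of these replay-safe helpers inside a workflow (the class name and log line are illustrative):

from datetime import timedelta
from temporalio import workflow

@workflow.defn
class DeterministicHelpers:
    @workflow.run
    async def run(self) -> str:
        request_id = str(workflow.uuid4())              # same value on every replay
        deadline = workflow.now() + timedelta(hours=1)  # time read from the event history
        workflow.logger.info("request %s due %s", request_id, deadline)
        return request_id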
Example of non-deterministic workflow (BAD):
@workflow.run
async def bad_workflow():
    user_id = random.randint(1, 1000)  # ❌ Changes on replay
    result = await workflow.execute_activity(call_api, user_id)
    return result

On replay, random.randint() generates a different user_id, causing the workflow to schedule a different activity than the first time. Temporal detects the mismatch and throws a non-determinism error.
Example of deterministic workflow (GOOD):
@workflow.run
async def good_workflow(user_id: int):
    # user_id passed as argument (deterministic input)
    result = await workflow.execute_activity(call_api, user_id)
    return result

Exactly-Once Semantics for Side Effects
Temporal records each activity result exactly once, and replay never re-runs a completed activity. Here's how:
- First execution: Worker schedules the activity, Temporal logs ActivityScheduled, the worker executes the activity code, and Temporal logs ActivityCompleted with the result
- Replay: Worker sees ActivityCompleted in history, skips execution, and returns the cached result
- Retry on failure: If the activity fails, Temporal applies the retry policy and runs another attempt. The activity code executes again on each attempt, so external calls should be idempotent (see the billing pattern below); the workflow only sees the final outcome, success or exhausted retries.
This is critical for agents that charge customers ($0.99 per resolution), send emails (no spam), or update external systems (no duplicate records).
3. Why Production Agents Need Temporal
Agents execute multi-step, cross-system processes with complex failure modes, human involvement, and long latencies. Temporal was designed for exactly this.
Long-Running Tasks
Agents don't finish in milliseconds. A support ticket might take 30 seconds (retrieve data, LLM call, send email). A legal research task might take hours (search case law, summarize, generate memo, wait for lawyer review). A sales follow-up might wait 3 days before sending the next email.
Stateless functions (Lambda) timeout after 15 minutes. Message queues require manual state persistence. Temporal workflows run indefinitely—days, months, years—with state persisted in event history.
Example: Sales follow-up agent that waits 3 days between emails.
@workflow.run
async def sales_followup_workflow(lead_id: str):
    # Send initial email
    await workflow.execute_activity(send_email, args=[lead_id, "Initial outreach"],
                                    start_to_close_timeout=timedelta(seconds=30))
    # Wait 3 days (a durable Temporal timer, not a cron job)
    await asyncio.sleep(timedelta(days=3).total_seconds())
    # Send follow-up
    await workflow.execute_activity(send_email, args=[lead_id, "Follow-up"],
                                    start_to_close_timeout=timedelta(seconds=30))

The workflow sleeps for 3 days without occupying any worker resources. Temporal sets a timer in the event history, and the worker isn't involved until the timer fires. This enables patterns like "send a sequence of 5 emails over 2 weeks" or "escalate to manager if no response in 24 hours" without brittle cron jobs or external schedulers (a sketch of the drip pattern follows).
Failure Recovery
LLM APIs rate-limit (OpenAI 429 errors). Databases timeout. External APIs go down. Workers crash. Temporal handles all of this automatically.
Activity retry policies let you configure retries with exponential backoff:
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def call_openai_api(prompt: str) -> str:
    # May fail with a 429 rate-limit error
    response = await openai.ChatCompletion.acreate(  # async call in the pre-1.0 openai SDK
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Configure retry policy (inside the workflow)
await workflow.execute_activity(
    call_openai_api,
    prompt,
    start_to_close_timeout=timedelta(seconds=60),
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=1),
        maximum_interval=timedelta(seconds=60),
        maximum_attempts=10,
        backoff_coefficient=2.0,  # 1s, 2s, 4s, 8s, 16s, 32s, 60s, 60s...
    ),
)

If the OpenAI API returns a 429, Temporal automatically retries with exponential backoff. The workflow doesn't need to know about retries; it just waits for the activity to eventually succeed or exhaust attempts. The attempt count and last failure are visible in the Temporal UI, making failures debuggable.
If a worker crashes mid-workflow, deterministic replay ensures the workflow resumes from the last successful step. You don't lose progress. You don't restart from scratch. The agent picks up exactly where it left off.
Human-in-the-Loop (HITL)
Many agents need human approval before taking action. A legal AI drafts a memo, but a lawyer must review it before sending to the client. A support agent escalates a complex issue to a human, waits for resolution, then continues the workflow.
Temporal's signals and wait conditions are designed for HITL:
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class LegalReviewWorkflow:
    def __init__(self) -> None:
        self.approved = False

    @workflow.signal
    async def approve(self):
        self.approved = True

    @workflow.run
    async def run(self, case_id: str) -> str:
        # Step 1: Generate legal memo
        memo = await workflow.execute_activity(
            generate_memo, case_id, start_to_close_timeout=timedelta(minutes=5)
        )
        # Step 2: Wait for lawyer approval (could take hours);
        # wait_condition raises asyncio.TimeoutError if the deadline passes
        try:
            await workflow.wait_condition(
                lambda: self.approved, timeout=timedelta(hours=24)
            )
        except asyncio.TimeoutError:
            # Timeout: escalate to senior partner
            await workflow.execute_activity(
                escalate_to_partner, case_id, start_to_close_timeout=timedelta(seconds=30)
            )
            return "escalated"
        # Step 3: Send approved memo to client
        await workflow.execute_activity(
            send_memo, memo, start_to_close_timeout=timedelta(seconds=30)
        )
        return "sent"

External client sends the approval signal:

# Lawyer reviews memo in the UI, clicks "Approve"
handle = client.get_workflow_handle(workflow_id="legal-12345")
await handle.signal(LegalReviewWorkflow.approve)

The workflow waits for up to 24 hours. If the lawyer approves within that window, the signal sets self.approved = True, the wait condition unblocks, and the workflow sends the memo. If 24 hours pass with no approval, the workflow escalates to a senior partner.
During the wait, the workflow isn't running. It's persisted in the database as an event history. The worker isn't blocked. The workflow can wait for days if needed.
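The review UI also needs to read workflow state without interrupting it. A sketch that adds a query handler to the same workflow (the handler name is illustrative):

# Inside LegalReviewWorkflow:
    @workflow.query
    def review_status(self) -> str:
        return "approved" if self.approved else "awaiting_review"

# From the review UI's backend:
handle = client.get_workflow_handle(workflow_id="legal-12345")
status = await handle.query(LegalReviewWorkflow.review_status)  # e.g. "awaiting_review"

Queries return the workflow's current in-memory state (reconstructed via replay if necessary) without writing new events, so they are safe to call from dashboards and polling UIs.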
Multi-Agent Orchestration
Agents often call other agents. A sales agent calls a legal agent to review a contract. A support agent calls a billing agent to process a refund. Temporal's child workflows model this naturally:
@workflow.run
async def sales_agent_workflow(deal_id: str):
    # Step 1: Call legal agent (child workflow)
    contract_review = await workflow.execute_child_workflow(
        LegalAgentWorkflow.run,
        deal_id,
        id=f"legal-{deal_id}",
        task_queue="legal-workers",  # Route to GPU workers if needed
    )
    # Step 2: Call billing agent (child workflow)
    invoice = await workflow.execute_child_workflow(
        BillingAgentWorkflow.run,
        args=[deal_id, contract_review.amount],
        id=f"billing-{deal_id}",
        task_queue="billing-workers",
    )
    # Step 3: Finalize deal
    return await workflow.execute_activity(
        close_deal, args=[contract_review, invoice],
        start_to_close_timeout=timedelta(seconds=30),
    )

Child workflows:
- Run in parallel if you use asyncio.gather() (see the sketch below)
- Have their own event histories (isolated failure domains)
- Can be cancelled by the parent (handle.cancel())
- Fail up to the parent if they fail (or you can catch exceptions and compensate)
This enables complex multi-agent systems where the parent workflow coordinates specialists, each with their own retries, timeouts, and failure handling.
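A sketch of that fan-out with asyncio.gather(), assuming a hypothetical ComplianceAgentWorkflow alongside the LegalAgentWorkflow above:

# Inside a parent workflow: both children run concurrently, each with its own event history
legal_review, compliance_report = await asyncio.gather(
    workflow.execute_child_workflow(
        LegalAgentWorkflow.run, deal_id,
        id=f"legal-{deal_id}", task_queue="legal-workers",
    ),
    workflow.execute_child_workflow(
        ComplianceAgentWorkflow.run, deal_id,
        id=f"compliance-{deal_id}", task_queue="compliance-workers",
    ),
)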
Exactly-Once Execution for Billing
If you charge customers $0.99 per resolved ticket (like Intercom Fin), you cannot charge them twice because of a retry. Temporal never re-executes a billing activity that has already completed, and pairing every attempt with an idempotency key ensures that even a retried attempt produces only one charge.
Pattern: Use activity IDs as idempotency keys.
@activity.defn
async def charge_customer(ticket_id: str, amount: float) -> str:
    # Use ticket_id as the idempotency key in Stripe
    # (assumes an async-capable Stripe client or wrapper)
    charge = await stripe.Charge.create(
        amount=int(amount * 100),  # cents
        currency="usd",
        idempotency_key=f"ticket-{ticket_id}",
    )
    return charge.id

If the activity retries, Stripe sees the same idempotency key and returns the original charge without creating a duplicate. Combined with Temporal's activity deduplication (based on event history), this ensures exactly-once billing.
4. Scaling Temporal for Production
Netflix runs hundreds of thousands of workflows per day on Temporal. Datadog runs millions of workflows per month. Here's how to scale Temporal to that level.
Worker Scaling
Workers are stateless and scale horizontally. To handle more workflows:
- Deploy more worker processes (10 → 100 → 1,000 instances)
- Tune worker concurrency: Each worker can execute multiple workflows and activities concurrently. Configure maxConcurrentWorkflowTaskExecutionSize (default: 100) and maxConcurrentActivityExecutionSize (default: 100), exposed as max_concurrent_workflow_tasks and max_concurrent_activities in the Python SDK, based on CPU and memory (see the sketch after this list).
- Autoscale based on task queue depth: Use KEDA (Kubernetes Event-Driven Autoscaler) or Horizontal Pod Autoscaler to scale workers when the task queue backlog grows.
Example autoscaling metric:
# KEDA ScaledObject for Temporal workers
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: temporal-worker-scaler
spec:
  scaleTargetRef:
    name: temporal-worker
  minReplicaCount: 10
  maxReplicaCount: 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: temporal_task_queue_depth
        threshold: '100'  # Scale up if >100 tasks pending
        query: sum(temporal_task_queue_depth{queue="default"})

Task Queue Routing
Not all workflows have the same resource requirements. LLM-heavy workflows need GPU workers. High-priority customer tickets should go to dedicated workers. Low-priority batch jobs can share general-purpose workers.
Temporal's task queues enable routing strategies:
# Start workflow on a specific task queue
await client.start_workflow(
    SupportAgentWorkflow.run,
    "12345",                     # ticket_id
    id="support-12345",
    task_queue="high-priority",  # Route to dedicated workers
)

# LLM-heavy workflow
await client.start_workflow(
    LegalResearchWorkflow.run,
    "67890",                     # case_id
    id="legal-research-67890",
    task_queue="llm-workers",    # Route to GPU-equipped workers
)

Worker deployment:
- default-workers: General-purpose CPU workers (100 instances)
- llm-workers: GPU workers for LLM inference (10 instances with A100 GPUs)
- high-priority-workers: Dedicated workers for urgent tickets (20 instances, low latency)
Each worker pool polls only its designated task queue, ensuring workload isolation.
Event History Limits
Temporal enforces a hard cap of 51,200 events or 50 MB per workflow execution, with warnings starting at 10,240 events or 10 MB. If a workflow generates too many events (e.g., a loop that schedules 100,000 activities), it will fail.
Solution: Continue-As-New
@workflow.run
async def long_running_workflow(start: int = 0):
    for i in range(start, 100000):
        await workflow.execute_activity(process_item, i)
        # Reset the workflow every 1,000 iterations
        if i > 0 and i % 1000 == 0:
            workflow.continue_as_new(i + 1)  # Start a new execution, passing the counter forward

continue_as_new() closes the current workflow execution and starts a fresh one with a new event history, passing forward any state needed to continue. This keeps event histories small and replay fast.
When to use Continue-As-New:
- Long-running workflows that generate >10,000 events
- Infinite loops (e.g., "process messages from queue forever")
- Workflows that fan out to many child workflows (>1,000 children)
Storage: Cassandra vs PostgreSQL
Cassandra:
- Pros: Multi-region, multi-DC, write-optimized (perfect for event sourcing), handles >10,000 workflows/second
- Cons: Complex to operate, requires expertise in Cassandra tuning, replication, and compaction
- Use case: High-throughput, mission-critical workloads (Netflix, Uber)
PostgreSQL:
- Pros: Simpler to operate, strong consistency, easier to debug (standard SQL)
- Cons: Single-node bottleneck, <10,000 workflows/second
- Use case: Lower-throughput deployments, teams without Cassandra expertise
When to use Cassandra:
- >10,000 workflows/second (beyond what a single relational database can sustain)
- Multi-region failover required
- Mission-critical workloads (billing, compliance)
When to use PostgreSQL:
- <10,000 workflows/second
- Simpler operations preferred
- Single-region deployment
Temporal Cloud vs Self-Hosted
Temporal Cloud:
- Pros: Fully managed (no ops), automatic upgrades, multi-region failover, usage-based pricing
- Cons: Cost (pay per "Action"—workflow events, activity executions, heartbeats, signals)
- Pricing model: Actions per second × total Actions + storage
Self-Hosted:
- Pros: Full control, potential cost savings at scale, custom tuning
- Cons: Operational complexity (Cassandra clusters, upgrades, backups, monitoring)
Cost crossover: Temporal Cloud is typically cheaper for <1M workflows/month. Beyond that, self-hosted can be more cost-effective, but factor in engineering time for operations.
Example: Datadog self-hosts Temporal and has publicly discussed the operational challenges: scaling Cassandra clusters, managing event history growth, ensuring reliable upgrades. For smaller teams, Temporal Cloud offloads this burden.
5. Versioning and Deployment
Workflows can run for months or years. During that time, you'll deploy new code. How do you ensure new code is compatible with workflows started on old code?
Workflow Versioning
Use the SDK's versioning API to branch workflow logic on a marker recorded in the event history: get_version()/GetVersion() in the Go and Java SDKs, workflow.patched() in the Python and TypeScript SDKs:

@workflow.run
async def my_workflow():
    if workflow.patched("my-change"):
        # New behavior (workflows that have not yet run the old code path)
        result = await workflow.execute_activity(new_activity)
    else:
        # Old behavior (replaying histories recorded before the deploy)
        result = await workflow.execute_activity(old_activity)
    return result

When new code executes this line for the first time, patched("my-change") records a marker in the event history and returns True, so new workflows take the new path. When a history recorded before the deploy is replayed, the marker is absent, patched() returns False, and the old path is taken, so replay stays deterministic. Once no pre-deploy executions remain, the branch can be removed (the Python SDK provides deprecate_patch() for that cleanup).
Deployment Strategies
Blue-Green Deployment:
- Deploy v2 workers alongside v1 workers (both poll the same task queue)
- Route new workflow starts to v2 workers (via task queue or workflow versioning)
- Wait for v1 workflows to drain (monitor active workflow count)
- Decommission v1 workers
Canary Deployment:
- Deploy v2 workers at 10% capacity
- Route 10% of new workflows to v2
- Monitor error rates, latency, quality metrics
- Gradually increase to 50%, 100%
Replay Testing: Before deploying new workflow code, run replay tests in CI: fetch production event histories, replay them against new code, assert no non-determinism errors. This catches breaking changes before they hit production.
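A minimal sketch of such a CI test using the Python SDK's Replayer, assuming histories have been exported to JSON (the file path, workflow ID, and test framework are placeholders):

import pytest
from temporalio.client import WorkflowHistory
from temporalio.worker import Replayer

@pytest.mark.asyncio
async def test_replay_production_histories() -> None:
    replayer = Replayer(workflows=[SupportAgentWorkflow])
    with open("histories/support-12345.json") as f:
        history = WorkflowHistory.from_json("support-12345", f.read())
    # Raises if the new code diverges from the recorded history (non-determinism)
    await replayer.replay_workflow(history)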
6. Observability and Debugging
Temporal's Web UI provides time-ordered audit trails of every workflow execution. You can inspect:
- Event history: Full sequence of workflow decisions, activity executions, timers, signals
- Pending activities: What is the workflow waiting for?
- Stack traces: If an activity failed, see the error message and stack trace
- Replay: Re-execute the workflow locally with a debugger
Time-Travel Debugging
When a workflow fails in production:
- Open the workflow execution in Temporal Web UI
- Copy the event history (JSON export)
- Replay the workflow locally with the production event history
- Step through the code with a debugger to understand the failure
This is impossible with stateless functions (Lambda loses state on failure). With Temporal, you have a complete audit trail of what the agent did, in what order, with what data.
Metrics and Alerting
Temporal exports metrics via OpenTelemetry:
- Workflow start/completion rates (workflows/sec)
- Activity failure rates (% of activities failing)
- Task queue depth (backlog of pending workflows)
- Worker utilization (% of workers busy)
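One way to expose these from a Python worker process, sketched here with the SDK's Prometheus telemetry option (the bind address and port are illustrative):

from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig

async def connect_with_metrics() -> Client:
    # SDK metrics (task latencies, poll counts, failures) are served at http://<host>:9464/metrics
    runtime = Runtime(
        telemetry=TelemetryConfig(metrics=PrometheusConfig(bind_address="0.0.0.0:9464"))
    )
    return await Client.connect("localhost:7233", runtime=runtime)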
Integrate with Datadog, Prometheus, or Grafana to alert on:
- High activity failure rate (>5% failures → investigate)
- Task queue backlog (>1,000 pending → scale workers)
- Slow workflows (p95 latency >10s → optimize activities)
7. Temporal vs Alternatives
Temporal vs AWS Step Functions
Step Functions:
- State machine defined in JSON (Amazon States Language)
- Tightly integrated with AWS services (Lambda, S3, DynamoDB)
- Declarative, no code
- Limitation: Complex logic is hard to express in JSON (loops and conditionals are limited and verbose)
Temporal:
- Workflows written in code (Python, Go, TypeScript, Java)
- Full programming language (loops, functions, composition)
- Multi-cloud, portable
When Step Functions wins: Simple state machines (<10 states), all-AWS infrastructure
When Temporal wins: Complex workflows (100+ steps), multi-cloud, need to version workflow logic over years
Temporal vs Apache Airflow
Airflow:
- DAG-based batch workflows
- Scheduler-driven (cron-like)
- Designed for data pipelines, ETL
- Limitation: Minute-level latency (not real-time), no durable execution (tasks restart from scratch on failure)
Temporal:
- Event-driven (workflows start immediately on demand)
- Sub-second latency (real-time)
- Durable execution (workflows resume from last step on failure)
When Airflow wins: Daily batch ETL jobs, data engineering workflows
When Temporal wins: Real-time agents (customer support, sales), long-running workflows with complex failure handling
Temporal vs Cadence
Cadence is Temporal's predecessor, developed at Uber. Temporal is a fork of Cadence (2019) by the same core team.
Why Temporal forked:
- Independent governance (Temporal is a separate company)
- Faster iteration (Temporal ships features faster)
- Better docs, SDKs, community
Today: Temporal has broader adoption, better Cloud offering, larger ecosystem. Cadence is still used at Uber and some enterprises, but most new deployments choose Temporal.
8. Real-World Scale
Netflix
Runs hundreds of thousands of workflows per day on Temporal, with projections toward a million workflows per day as usage expands. Temporal powers critical workflows like media encoding, content delivery, and experimentation pipelines.
Datadog
Runs millions of workflows per month across 100+ internal teams. Self-hosts Temporal on Cassandra. Datadog's engineering team has publicly discussed challenges of operating Temporal at scale: managing Cassandra clusters, tuning event history retention, ensuring reliable upgrades.
Other Adopters
- Stripe: Financial workflows, payment orchestration
- HashiCorp: Terraform Cloud orchestration
- Coinbase: Crypto transaction workflows
- Snap: Snapchat infrastructure workflows
- Box: Content workflow automation
These companies moved from homegrown orchestrators to Temporal for increased reliability, developer productivity, and observability.
9. When NOT to Use Temporal
Temporal is powerful, but not always the right tool:
Stateless, Short-Lived Tasks
If your task is a single API call with no retries needed (e.g., "fetch user profile from API"), Lambda or a simple HTTP handler is simpler than Temporal. Temporal's overhead (task queue polling, event persistence) adds ~100ms of latency.
Very Low Latency (<10ms)
Temporal's task queue polling and event persistence add latency. For <10ms operations, use gRPC, HTTP, or direct function calls.
Simple CRUD Operations
If you're just reading/writing a database with no orchestration, Temporal is overkill. Use a REST API or GraphQL.
Temporal shines when:
- Workflows span multiple systems (CRM, LLM, email, ticketing)
- Long-running (>1 minute, often hours/days)
- Complex failure handling (retries, compensations, human escalations)
- Need exactly-once semantics (billing, compliance)
- Must survive deployments without losing state
10. The Bottom Line
Temporal is infrastructure for production agents. It provides the same durable execution guarantees that databases provide for data—but for workflows.
If you're building agents that:
- Orchestrate multi-step processes across systems
- Run for more than a few seconds
- Need to survive crashes, deployments, and external API failures
- Involve humans in the loop
- Coordinate multiple specialized agents
- Charge customers or call external APIs (where exactly-once matters)
...then Temporal is the orchestration engine you need.
Trade-offs:
- Learning curve: Deterministic workflows, event history limits, versioning
- Operational complexity: Self-hosting requires Cassandra expertise (or pay for Temporal Cloud)
- Latency overhead: ~100ms minimum (not suitable for <10ms operations)
Benefits:
- No lost state: Workflows survive crashes, restarts, deployments
- Observability: Complete audit trail of every decision, activity, signal
- Exactly-once execution: No duplicate charges, emails, or database writes
- Evolvability: Version workflows over months/years without breaking in-flight executions
Netflix, Datadog, and Stripe run millions of workflows per month on Temporal. If it's good enough for them, it's good enough for your agents.
Getting started:
- Start with the single-agent workflow pattern (1 workflow = 1 task)
- Use Temporal Cloud (avoid operational complexity early on)
- Scale workers horizontally as load grows
- Add multi-agent orchestration when you have specialized agents (legal, billing, etc.)
- Monitor event history size, use Continue-As-New if workflows grow large
Temporal is the durable execution engine that lets you build agents that just work—even when everything else fails.