Why Temporal

State	Drift	Cost
Lives in scheduler DB, app DB, S3, Kafka, operator memory.	Every handoff is a chance for the systems to disagree.	Drift is what the 2 AM page is.

Concern	Cron	Airflow	Step Fns	Kafka alone	Temporal
Durable state	✗	partial	✓	✗	✓
Retries built-in	✗	✓	✓	✗	✓
Long human waits	✗	✗	partial	✗	✓
Code, not config	shell	Python	JSON	✗	✓
Multi-language SDK	✗	Python	n/a	✓	✓
Vendor neutral	✓	✓	✗	✓	✓
Years-long runs	✗	✗	✗	n/a	✓

Aspect	Runbook	Playbook
Purpose	Execute a specific operational task	Handle a broader operational scenario
Scope	Narrow and task-focused	Broad and scenario-focused
Structure	Step-by-step procedure	Workflows, decisions, branches, and runbooks
Decision making	Minimal; follow prescribed steps	Explicit decision points and branching actions
Example	"Restart a failed database service"	"Respond to a database outage incident"
Temporal signal	Repeated manual recovery steps	Repeatable cross-system coordination

Open with energy. Read the subtitle out loud - it's the hook. This is a 25-minute talk: aim to land 3 ideas, not 30. The 3 are: (1) state drifts across systems, (2) durable execution is application code that survives process death, (3) you'd use it where you write runbooks today. Ask for a show of hands: "who's been paged for a half-finished workflow?"

30 seconds max. Establish credibility, then move on. If the room is mixed (engineers + managers), mention you've shipped Temporal to production at companies with both Java and Go stacks - it builds trust faster.

Use these as quick chat prompts. Ask for "yes" / "no" or a one-word answer; do not discuss every response.
- Have you ever been paged for a half-finished business process?
- Have you had to manually reconcile whether an external API call succeeded?
- Have you written retry logic that later needed a retry limit, timeout, or backoff policy?
- Have you debugged a stuck cron, queue consumer, or scheduled job after it was already too late?
- Have you used a workflow/orchestration tool before: Airflow, Step Functions, Camunda, Conductor, Argo, or something homegrown?
- If you already use Temporal, where: local experiments, one service, or production-critical workflows?

Don't dwell. ~20 seconds. The agenda is signposting so the audience can locate themselves later. Keep it short - orientation, not detail.

Section divider. Pause for 2-3 seconds. The next 3 slides set up the pain the rest of the talk resolves. Tone shift: slow down, get serious about real outages the audience has lived through.

Read the bullets in voices. The first sounds like an e-commerce backend, the second like a data team, the third like an ops/devex flow. The audience will recognize at least one - that's the point. Land the final line with weight: "look easy until one step fails."

Pace yourself - each bullet is a real incident shape. After the third one ("crashed between DB write and publish") pause; that's the moment people nod because they remember a specific outage. Closing line is the slogan to repeat: "runbook, not a button."

About 5 minutes for the four stacks combined. Don't bash any tool - each solves a real problem. The framing is "what each leaves to you."

Quick. Cron isn't a strawman - it's where many teams start. The reason it breaks is incidental complexity that nobody owns: logs scattered, lock files invented, monitoring bolted on.

For Airflow users in the room, validate that Airflow IS great for what it was built for. The pitch is: don't replace your DAGs - move the cross-system flows out of Airflow into Temporal, keep the data DAGs in Airflow.

The JSON-state-machine point lands hardest with engineers. Ask: "would you review a 4000-line JSON state machine for an order workflow as readily as Java?" The Lambda 15-minute cap is the sneaky one - many teams hit it and don't realize it for months.

Don't position as Kafka vs Temporal. Position as Kafka + Temporal: Kafka is the bus between teams, Temporal is the brain inside a team. The quote at the bottom is the line they'll quote back at you - say it slowly.

The reframe. The problem isn't any single tool - it's that workflow state is scattered across N+1 places. Make eye contact, lean into "drift is what the 2 AM page is." It's the bridge to the Temporal section.

Tone shift again - from problem to solution. Energy back up. The talk inflects here; if you're 11 minutes in, you're on schedule.

THIS IS THE CENTRAL CONCEPT - this slide defines the term, the next one shows it. Spend ~30 seconds here on the definition. The persistence is automatic; you don't write checkpoint/restore code. Then advance to the code.

Spend ~60 seconds here. Walk the code: this is normal Java. There's no special framework. The methods are just method calls. The MAGIC is the last line. Then say: "the Workflow doesn't care which JVM is running it. The state lives in the cluster, not on a host."

Go is often the clearest SDK for engineers coming from backend services. Point out the shape: ExecuteActivity records a command in history, and Get waits for the durable result. If a Worker dies after reserveInventory, replay rebuilds paymentID and reservationID from history before scheduling ship.

Use this to defuse "is this Java-only?" concerns. Python is async-first, but the mental model is the same: durable Workflow decisions, side effects in Activities, result replay from history.

The 4 steps are the entire model. If they only remember this one slide, the rest follows. Use a whiteboard metaphor: "imagine someone took notes of every decision your program made; you can replay those notes to recreate the program's state."
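
If someone asks how the "notes" metaphor works mechanically, the following is a minimal sketch in plain Python. It is a toy model, not the Temporal SDK: `Context`, `execute_activity`, `order_workflow`, and `WorkerCrash` are hypothetical names invented for illustration, and a Python list stands in for the event history the cluster keeps.

```python
class WorkerCrash(Exception):
    """Simulates the worker process dying mid-workflow."""


class Context:
    """Toy workflow context that records each activity result in a history list."""

    def __init__(self, history):
        self.history = history   # stands in for durable event history
        self.position = 0        # how many activity calls this run has made
        self.executed = []       # activities this worker actually ran

    def execute_activity(self, name, fn, *args):
        if self.position < len(self.history):
            # Replay: the result is already recorded, so the side effect is skipped.
            recorded_name, result = self.history[self.position]
            assert recorded_name == name, "workflow code must be deterministic"
        else:
            # First execution: run the side effect and record its result.
            result = fn(*args)
            self.history.append((name, result))
            self.executed.append(name)
        self.position += 1
        return result


def order_workflow(ctx, order_id, crash_before_ship=False):
    payment_id = ctx.execute_activity("charge", lambda o: f"pay-{o}", order_id)
    reservation_id = ctx.execute_activity("reserve", lambda o: f"res-{o}", order_id)
    if crash_before_ship:
        raise WorkerCrash()
    return ctx.execute_activity(
        "ship", lambda p, r: f"shipped({p},{r})", payment_id, reservation_id
    )


history = []                       # survives the simulated crash

first = Context(history)
try:
    order_workflow(first, "42", crash_before_ship=True)
except WorkerCrash:
    pass                           # the worker is gone; the history is not

second = Context(history)          # a different worker picks the workflow up
result = order_workflow(second, "42")

print(first.executed)              # ['charge', 'reserve']
print(second.executed)             # ['ship']
print(result)                      # shipped(pay-42,res-42)
```

The second run re-executes the workflow function from the top, but the first two calls are answered from history, so only `ship` runs as a real side effect. That is the same shape as the Go example, where replay rebuilds paymentID and reservationID before scheduling ship.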

Quick fly-through - 90 seconds. Don't go deep on any one. The point is breadth: ALL of these are built-in. Each bullet represents code your team currently writes and maintains. The "Workflow.sleep(Duration.ofDays(30))" line gets a chuckle from people who have written cron-replacement logic.
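
To make "code your team currently writes and maintains" concrete, here is a minimal sketch of a hand-rolled retry helper with exponential backoff, in plain Python. The names (`retry`, `flaky_charge`) and the default values are assumptions chosen for illustration; this is not Temporal code.

```python
import time


def retry(fn, max_attempts=5, initial_delay=1.0, backoff=2.0,
          max_delay=30.0, sleep=time.sleep):
    """Call fn until it succeeds or max_attempts is reached."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                      # out of attempts: surface the error
            sleep(delay)                   # wait before the next attempt
            delay = min(delay * backoff, max_delay)


attempts = []
sleeps = []                                # record delays instead of sleeping


def flaky_charge():
    attempts.append(1)
    if len(attempts) < 4:                  # fail three times, then succeed
        raise ConnectionError("payment gateway timeout")
    return "charged"


result = retry(flaky_charge, sleep=sleeps.append)
print(result)                              # charged
print(sleeps)                              # [1.0, 2.0, 4.0]
```

Note what the helper does not cover: the attempt counter and the current delay live in process memory, so a restart mid-retry starts over from zero. That gap is the contrast to draw with built-in retries.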

30 seconds. This is the mental model summary. Most important: Task Queue is JUST a string - it's not Kafka. It's not a database. It's a routing key. This often confuses people coming from message-queue thinking.

Section divider. The next 3 slides are the "show, don't tell" moment. Each slide is a complete pattern in ~15 lines of Java.

Walk top to bottom. Stop on `saga.addCompensation(...)` - explain that this is registered IMMEDIATELY after the forward step succeeds. If anything fails between the forward step and the compensation registration, that step has no compensation to run. So you write them paired. The catch handler runs the compensations in LIFO order. Compensations also retry.
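
For a whiteboard follow-up, the pairing and LIFO behaviour can be sketched in plain Python. This is a toy model under stated assumptions: the `Saga` class below is a hypothetical stand-in written for this note, not the Java SDK's class, and the step functions only append to a log.

```python
class Saga:
    """Toy saga: remembers compensations and runs them newest-first."""

    def __init__(self):
        self._compensations = []

    def add_compensation(self, fn, *args):
        self._compensations.append((fn, args))

    def compensate(self):
        while self._compensations:
            fn, args = self._compensations.pop()   # LIFO: undo the latest step first
            fn(*args)


log = []


def charge(order):
    log.append(f"charge {order}")


def refund(order):
    log.append(f"refund {order}")


def reserve(order):
    log.append(f"reserve {order}")


def release(order):
    log.append(f"release {order}")


def ship(order):
    raise RuntimeError("carrier unavailable")      # the step that fails


def place_order(order):
    saga = Saga()
    try:
        charge(order)
        saga.add_compensation(refund, order)       # paired with charge
        reserve(order)
        saga.add_compensation(release, order)      # paired with reserve
        ship(order)
        return "completed"
    except RuntimeError:
        saga.compensate()
        return "compensated"


outcome = place_order("42")
print(outcome)   # compensated
print(log)       # ['charge 42', 'reserve 42', 'release 42', 'refund 42']
```

The log shows the inventory release running before the refund, the reverse of the order in which the forward steps ran.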

The point is at the bottom. Don't read the whole builder - point at `setIntervals(Duration.ofHours(1))` and `setJitter(...)` and say: "this is what your cron line wished it was." For Airflow users: this replaces the scheduler, not your DAG logic.

This is the killer slide for product teams. The "wait for human approval" pattern is often a custom-built monstrosity. Here it's three lines. The Workflow doesn't poll. The Worker doesn't keep a thread. The state lives in the cluster; when the Signal arrives, a Worker (maybe a different one) picks up the Workflow and the await unblocks.
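
If the room wants to see why no thread or polling loop is needed, the idea can be sketched in plain Python. This is a toy model, not Temporal's Signal API: `expense_workflow`, `worker_tick`, `send_signal`, and the event names are hypothetical, and a dict stands in for state held by the cluster.

```python
def expense_workflow(events):
    """Decide the next step purely from recorded events.

    Returns (status, new_events) and holds no state of its own.
    """
    new_events = []
    if "request_sent" not in events:
        new_events.append("request_sent")     # e.g. notify the approver once
    if "approved" not in events:
        return "waiting", new_events          # nothing to do until a signal arrives
    if "reimbursed" not in events:
        new_events.append("reimbursed")
    return "completed", new_events


def worker_tick(store, workflow_id):
    """Any worker can run this; all state comes from the store."""
    events = store[workflow_id]
    status, new_events = expense_workflow(events)
    events.extend(new_events)
    return status


def send_signal(store, workflow_id, name):
    """Record an external event, such as a human clicking Approve."""
    store[workflow_id].append(name)


store = {"exp-1": []}                          # stands in for the cluster

status_a = worker_tick(store, "exp-1")         # worker A starts the workflow
send_signal(store, "exp-1", "approved")        # days later, the approval arrives
status_b = worker_tick(store, "exp-1")         # worker B finishes it

print(status_a)          # waiting
print(status_b)          # completed
print(store["exp-1"])    # ['request_sent', 'approved', 'reimbursed']
```

Between the two ticks nothing is running and nothing is held in memory; the second tick could happen on a different machine because everything it needs is in the store.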

Pivot to credibility/social proof. The next two slides answer the implicit question: "who else is using this and what for?"

Pick the one that matches the audience. If they're in fintech, dwell on payments. If they're a platform team, dwell on infrastructure provisioning. The AI agents bullet is newest and lands hardest in 2024+ rooms.

Skim. The point is breadth - this isn't a niche tool. Stripe and Snap are the strongest names for fintech. Netflix for data platforms. Datadog for SRE-leaning teams. If the audience asks for case studies later, point them at https://temporal.io/case-studies.

Don't read the table - point at one row and discuss. The most useful row is "Long human waits" because it surprises people. Cron and Airflow are not built for "wait for a human for 3 days." The "Vendor neutral" row matters for Step Functions skeptics.

Critical slide for credibility. If you don't show the limits, the audience suspects you're selling. The runbook line at the bottom is the test: "if the next person on call would need a runbook to recover, it's a Workflow."

This is a vocabulary reset before the exercise. Many teams use "runbook" and "playbook" interchangeably. For this talk, make the distinction operational: runbook is reactive recovery; playbook is repeatable coordination.

Do not make either one sound bad. A good SRE team needs both. The key point is that repeated runbook execution is evidence that the system has pushed application state recovery onto humans.

Use this as the memory hook. The important distinction is containment: a playbook can contain runbooks, but it also carries scenario judgment, branching, and coordination.

Use examples:
- Runbook: "payment charged but order not shipped" recovery.
- Playbook: "new enterprise customer onboarding" or "security exception approval" where the steps are known but involve humans and systems.

This sets up the discussion exercise: participants should classify their own workflow as runbook-shaped, playbook-shaped, or not Temporal-shaped.

Online ILT flow:
1. Set a 5-minute timer and ask everyone to type their answer privately first.
2. At 2 minutes, ask them to paste a short version in chat: "<workflow> / <failure mode> / <Temporal-shaped or not>".
3. If the platform supports breakouts and the group is >8 people, use 3-minute pairs before chat share-out. Otherwise keep it all in chat.
4. Pick two examples: one strong Temporal fit and one non-fit. Ask each person to unmute for 30 seconds only if they are comfortable.
5. Close by tying answers back to the runbook test: if the next on-call would need step-by-step recovery instructions, it may be a Workflow.

We're 20 minutes in. The last 5 minutes are about giving them an action. The mistake here is recommending a big migration. Don't. Recommend ONE workflow.

Step 5 is the lesson. The biggest mistake teams make is REDESIGNING during the migration. You don't. You move first, then improve. Mechanical migration is faster, easier to verify, and lets you compare apples to apples.

End with a concrete action. If the room is on laptops, ask them to run it. The dev server is genuinely 5 minutes; the brag is fair. For remote audiences, this is the screen they screenshot.

The three things they should remember. Read them slowly. Each one maps to the agenda's central claim. If they only remember the "runbook → Workflow" habit, the talk worked.

Close strong. Land on the quote. Don't immediately segue to Q&A - let it sit. Then: "Questions?"

Leave on screen during Q&A. The slides URL gets photographed; make sure the QR code works if you've added one for in-person events.

Why Temporal

Gaurav Agarwal

As an instructor

What I need from you

Class progression

Some tips

Some tips (continued)

Show of hands

Agenda

Setup

The orchestration problem

Every backend has these

What goes wrong

Today

Where today's tools break

Cron + scripts

Airflow

AWS Step Functions

Kafka by itself

What they share

Durable Execution

What Temporal does differently

Durable execution

A Workflow, in code

Same idea in Go

Same idea in Python

The trick: event history

What Durable Execution gives you - for free

Mental Model

Four primitives

Concretely

What it looks like

A Saga, end to end

A schedule that survives deploys

A human step, no plumbing

Production

Real-world use cases

Where it earns its keep

Who's running it

Pick your poison

Where Temporal is not the answer

Operations

Runbook vs playbook

Runbook vs playbook

Simple analogy

What Temporal changes

Discuss

Adoption

Where to start

A two-week plan

Local dev in 5 minutes

Takeaways

Resources