Previous slide Next slide Toggle fullscreen Toggle overview view Open presenter view
Fundamentals
Temporal Fundamentals
Gaurav Agarwal
Course Agenda
Day
Topic
Lab focus
1
Foundations - durable execution mental model
Hello Temporal, Event History
2
Reliability + interactions
Signals, Updates, Schedules
3
Kafka integration
End-to-end Kafka pipeline, DLQ
4
Production engineering
Replay tests, dashboards
5
Saga + Spring Boot + capstone
Capstone Workflow
6
AWS migration + containers
Glue, K8s, KEDA
Build-it-yourself labs (starter code, no solutions) live in challenges/ — each lab slide links its own challenge.
Day 1
Foundations
Rethinking orchestration as durable application code.
Lab · Day 1
Local dev setup
Challenge → day-01-foundations/lab-1-local-dev-setup
make check
make temporal
open http://127.0.0.1:8233
temporal operator namespace list
Goal: every laptop shows the default namespace in the Web UI.
Day 1
Why Temporal exists
The failure modes of cron- and DAG-based orchestration.
Day 1 · on-ramp
Why Temporal exists
Where this fits — Your first hour: before any API, the problem every orchestrator you've used leaves on your plate.
Why it matters — If the failure modes don't land, the rest of the week looks like needless ceremony.
By the end — You'll name the work cron, Airflow, Step Functions and Kafka each leave to you .
Every backend has these
"Charge the card, ship the order, send the receipt."
"Pull from S3, transform with Spark, write to Snowflake."
"Wait for the human approval, then provision the tenant."
"Retry the flaky API for an hour, then page the on-call."
These are workflows . They look easy until one step fails.
What goes wrong
The third call timed out. Did it succeed?
The Lambda was killed at minute 14 of 15.
The Kafka consumer crashed between the DB write and the publish.
The cron didn't fire. Nobody noticed for two days.
The retry loop never had a budget.
Recovery is a runbook , not a button.
Stack
What it solves
What it leaves to you
Cron + scripts
Triggering on a schedule
All state, retries, recovery
Airflow
Scheduling DAGs
Cross-system state, retries on top
Step Functions
State machines in JSON
Code review, Lambda 15-min cap
Kafka alone
Transport between systems
Per-key state, idempotency
Drift across all of them = the 2 AM page.
DAGs vs durable execution
Apache Airflow
Temporal
Paradigm
A DAG of tasks you wire up
Idiomatic code that runs top-to-bottom
Trigger
Schedule-driven — hourly, midnight, quarter-end
Event-driven — API call, webhook, message
Recovery
Retry a task by its position in the graph
Replay history; resume mid-function
Sweet spot
Move + transform data on a schedule
Coordinate microservices & business logic
Airflow moves data A→B. Temporal runs a durable function that survives crashes.
Day 1
Core concepts
Workflows, Activities, Workers, Task Queues.
Day 1 · on-ramp
Core concepts
Where this fits — You've seen what breaks; these are the four pieces Temporal gives you to fix it.
Why it matters — Every Workflow, saga and pipeline later this week is built from just these four.
By the end — You'll read Workflow, Activity, Worker and Task Queue in one file and run them.
The four primitives
Workflow
Activity
Worker
Task Queue
Durable function. State is the event history. Deterministic.
Arbitrary code with side effects. Retried independently.
Long-lived process polling one or more Task Queues.
A string name. Routes work to a Worker pool.
Workflow
@WorkflowInterface
public interface OrdersWorkflow {
@WorkflowMethod
void run (String batchDate) ;
}
Annotated Java interface + impl.
The impl is your durable function .
State lives in event history; the impl is replayable.
Activity
@ActivityInterface
public interface OrdersActivities {
String extract (String batchDate) ;
String transform (String rawUri) ;
void load (String cleanUri) ;
}
Unrestricted code: HTTP, DB, files, anything.
Retried independently of the Workflow.
95% of your real production code lives here.
Worker + Task Queue
WorkerFactory factory = WorkerFactory.newInstance(client);
Worker worker = factory.newWorker("orders" );
worker.registerWorkflowImplementationTypes(OrdersWorkflowImpl.class);
worker.registerActivitiesImplementations(new OrdersActivitiesImpl ());
factory.start();
Worker = long-lived JVM polling orders Task Queue.
The Task Queue string routes work to a Worker pool.
Javadoc: WorkerFactory · WorkerFactoryOptions · WorkerOptions
One Worker hosts many types
A single Worker registers n Workflow types + m Activity types.
That one process can run any of the n × m combinations.
Registration is just a lookup table: type name → your code.
The two calls build it: registerWorkflowImplementationTypes → the n , registerActivitiesImplementations → the m .
One Worker, one Task Queue, the whole catalogue of work it knows how to do.
Many Workers split the work
Scale out to l Workers across the fleet.
Each Worker registers some(n) Workflows ; some(m) Activities — same two register… calls, a subset each.
Partition by Task Queue: route each subset to the pool that registers it.
Same model, more processes. You choose how to slice the catalogue.
Why split: Compute & I/O
Workflows — orchestration. Cheap, deterministic, mostly waiting.
Activities — the real work: compute -heavy or I/O -bound.
Isolate them: CPU-bound and I/O-bound Activities starve each other on one pool.
Separate Workers → tune slots, scale, and hardware per workload.
Splitting the fleet isn't bureaucracy — it's matching processes to the shape of the work.
Routing Activities by Task Queue
An Activity runs on the Task Queue its stub names (ActivityOptions.setTaskQueue) — default is the Workflow's own queue. The unit of "who registers what" is the Task Queue , not the Worker.
Variant
Topology
Each Worker registers
One Worker, everything
1 queue, 1 pool
all n Workflows + m Activities
Split by Task Queue
k queues, k pools
the subset routed to its queue
Same queue, l replicas
1 queue, l identical pods
the same full set — pods interchangeable
Mismatch (bug)
route to a queue no pool serves
nothing runs it → task retries until timeout
Subset registration is legal across Task Queues; within one queue every Worker must be homogeneous.
One Workflow, three pools
Task Queue Worker pool (registers)
OrderWorkflow ─→ "orders" ─────→ Orchestrator [OrderWorkflow]
│
├ charge ─→ "payments" ─────→ Payments pool [PaymentActivities]
└ render ─→ "media" ─────→ Media pool [MediaActivities] (GPU)
No single pool registers everything. Each stub's setTaskQueue picks the pool — scale and hardware follow the queue.
Routing in code
PaymentActivities pay = Workflow.newActivityStub(PaymentActivities.class,
ActivityOptions.newBuilder()
.setTaskQueue("payments" )
.setStartToCloseTimeout(Duration.ofSeconds(30 )).build());
MediaActivities media = Workflow.newActivityStub(MediaActivities.class,
ActivityOptions.newBuilder()
.setTaskQueue("media" )
.setStartToCloseTimeout(Duration.ofMinutes(5 )).build());
String charged = pay.charge(orderId);
String receipt = media.render(orderId);
The payments Worker registers only PaymentActivities; the media Worker only MediaActivities. Neither knows the other exists.
Airflow → Temporal map
Airflow
Temporal
DAG
Workflow
Operator / Task
Activity
Worker
Worker (long-lived JVM)
default_queue
Task Queue
XCom
A normal Java return value
ExternalTaskSensor
Signal / signalWithStart
BranchPythonOperator
if / switch in Java
Sensor poll loop
Workflow.await(predicate)
You stop describing shape. You start writing behavior.
At a glance
Feature
Apache Airflow
Temporal
Primary domain
Data engineering & batch
App development & microservices
State management
Central metadata DB of task status
Event-sourced history, replayed
Latency
High — polling, seconds to start
Low — gRPC, sub-second
Waiting / sleep
Costs a worker slot or a sensor
Native & cheap — sleep for a year
Scaling limit
Scheduler + metadata DB
Workflow history size
Not better-or-worse — different problems. Match the tool to the shape of the work.
Day 1
The Temporal SDK
The library that turns ordinary code into durable executions.
Day 1 · on-ramp
The Temporal SDK
Where this fits — You've met the primitives; now the library that turns ordinary Java into a durable execution.
Why it matters — Knowing which package does what heads off the "why won't this compile or replay" confusion later.
By the end — You'll know where Workflow, WorkflowClient and the annotations live, and what each is for.
What the SDK actually does
One dependency, two jobs. It's how you talk to the cluster, and it's how you run your code under the cluster's rules:
Client side — connect to Frontend over gRPC; start, signal, query, and update Workflows.
Worker side — long-poll Task Queues, dispatch Workflow & Activity tasks, replay history.
Determinism runtime — the Workflow.* toolkit (time, random, sleep, threads) that survives replay.
Plumbing — payload (de)serialization, retries, interceptors, metrics, TLS.
You write annotated classes and plain methods; the SDK makes them durable.
Java SDK — the packages
Package
You use it for
Key types
io.temporal.client
Start / signal / query from outside
WorkflowClient, WorkflowStub, WorkflowOptions
io.temporal.worker
Host & poll
Worker, WorkerFactory, WorkerOptions
io.temporal.workflow
Write Workflow code
Workflow, Async, Promise, Saga, @WorkflowInterface
io.temporal.activity
Write Activities
Activity, ActivityOptions, @ActivityInterface
io.temporal.common
Shared config
RetryOptions, converters, interceptors
io.temporal.serviceclient
The gRPC connection
WorkflowServiceStubs, WorkflowServiceStubsOptions (TLS / API-key)
Javadoc: WorkflowClient · Worker · Workflow · Activity · WorkflowServiceStubs · full index
Inside Workflow code, reach for the Workflow statics, never the JDK equivalents:
Workflow.sleep(Duration.ofDays(30 ));
long now = Workflow.currentTimeMillis();
UUID id = Workflow.randomUUID();
Workflow.await(() -> approved);
var act = Workflow.newActivityStub(OrdersActivities.class, opts);
var child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
Workflow.getLogger(getClass()).info("replay-aware logging" );
Every value the Worker can't reproduce on replay comes from Workflow.*.
WorkflowClient — the way in from outside
WorkflowClient client = WorkflowClient.newInstance(
WorkflowServiceStubs.newLocalServiceStubs());
GreetingWorkflow wf = client.newWorkflowStub(GreetingWorkflow.class,
WorkflowOptions.newBuilder()
.setWorkflowId("hello-temporal-demo" )
.setTaskQueue("hello-temporal" ).build());
String result = wf.greet("Ada" );
The same client signals, queries, updates, and describes running Workflows.
Javadoc: WorkflowClient · WorkflowServiceStubs · WorkflowOptions
Add it to your build
<dependency >
<groupId > io.temporal</groupId >
<artifactId > temporal-sdk</artifactId >
<version > 1.32.1</version >
</dependency >
implementation 'io.temporal:temporal-sdk:1.32.1'
implementation 'io.temporal:temporal-testing:1.32.1'
temporal-sdk is the only runtime dependency; add temporal-testing for Day 4.
Serialization — your data becomes Payloads
Every value crossing the boundary is converted to a Payload (bytes + metadata) by the DataConverter , then converted back on the other side. That includes Workflow args & result, Activity args & result, Signals, Queries, and Updates.
greet("Ada") ──DataConverter──→ Payload {
metadata: { "encoding": "json/plain" }
data: "Ada" ← UTF-8 bytes
} ──→ stored in Event History
Default chain tries converters in order : null → byte[] → Protobuf → JSON (Jackson) . Your POJOs/records land on JSON.
The encoding metadata tells the receiver how to decode — both sides must run a compatible converter.
You pass objects; the SDK ships bytes. Keep them JSON-friendly and both ends agree.
Javadoc: DataConverter · DefaultDataConverter · PayloadConverter
Serialization — the limits
The "so what" of shipping bytes through history:
Must serialize cleanly — default JSON needs fields/getters and a no-arg constructor. No live handles: sockets, threads, streams, Connections.
Size — one payload hard-caps at 2 MB (the gRPC message limit). The SDK warns at ~256 KB . The whole history caps at 50 MB / 51,200 events .
Replay cost — every arg & result is stored in history, then re-read on every replay. Big payloads mean slow replay and a bloating history.
Schema evolution — a payload outlives the code that wrote it. Add fields and tolerate unknown ones. Don't repurpose or remove a field: an in-flight Workflow will fail to deserialize old history.
History is a control plane, not a data bus . Keep payloads small, stable, and JSON-clean. Big blobs go to S3 — pass a URI , not the bytes (Day 6) .
Options at a glance — wiring the client & Worker
Every layer of the bootstrap path has its own *Options builder. From the outside in:
Builder
Configures
Applied when
WorkflowServiceStubsOptions
Connection: target host, TLS, API key, metrics scope
building the service stubs
WorkflowClientOptions
Namespace, data converter, client interceptors
building the WorkflowClient
WorkerFactoryOptions
Cross-Worker: sticky cache, virtual workflow threads, Worker interceptors
WorkerFactory.newInstance
WorkerOptions
Per-queue slots, tuner, virtual threads
factory.newWorker
WorkflowImplementationOptions
Per-type: fail-on exception types, per-Activity defaults
registering a Workflow impl
Each builder is newBuilder() … .build() — same shape everywhere, so they read the same.
Options at a glance — starting & retrying work
These travel with each execution rather than the Worker:
Builder
Configures
Applied when
WorkflowOptions
ID, Task Queue, run/execution timeouts, retry, ID-reuse policy
starting a Workflow
ChildWorkflowOptions
Same set + parent-close policy
starting a Child Workflow
ScheduleOptions
Memo & search attributes for the Schedule
createSchedule
ActivityOptions
Timeouts, heartbeat, Task Queue, retry
building an Activity stub
LocalActivityOptions
Timeouts + retry for short, local Activities
building a local Activity stub
RetryOptions
Backoff, max attempts, non-retryable types
nested inside the four above
RetryOptions is never set alone — it's the retry block inside a Workflow/Activity options builder.
Day 1 · optional
Other SDKs — same model
Python and Go ship the identical primitives in idiomatic form.
Day 1 · on-ramp
Other SDKs — same model
Where this fits — A short detour: the same primitives you just learned, in Python and Go.
Why it matters — Your platform is polyglot — a Go Worker and a Java Worker can share one Workflow contract.
By the end — You'll recognize the identical model across all three SDKs and read each in idiomatic form.
Optional
Python SDK — temporalio
from temporalio.client import Client
from temporalio.worker import Worker
client = await Client.connect("localhost:7233" , namespace="default" )
handle = await client.start_workflow(
GreetingWorkflow.run, "Ada" ,
id ="hello-temporal-demo" , task_queue="hello-temporal" )
print (await handle.result())
worker = Worker(client, task_queue="hello-temporal" ,
workflows=[GreetingWorkflow], activities=[greet])
await worker.run()
pip install temporalio — async-native: @workflow.defn / @activity.defn, workflow.* toolkit.
Optional
Go SDK — go.temporal.io/sdk
c, _ := client.Dial(client.Options{HostPort: "localhost:7233" })
defer c.Close()
we, _ := c.ExecuteWorkflow(ctx,
client.StartWorkflowOptions{ID: "hello-temporal-demo" , TaskQueue: "hello-temporal" },
GreetingWorkflow, "Ada" )
var result string
we.Get(ctx, &result)
A Workflow is a plain func(ctx workflow.Context, ...); the workflow package holds the determinism toolkit.
Optional
Cross-SDK at a glance
Concept
Java
Python
Go
Package
io.temporal:temporal-sdk
temporalio (pip)
go.temporal.io/sdk
Connect
WorkflowClient
Client.connect
client.Dial
Define Workflow
@WorkflowInterface
@workflow.defn
func(ctx, …)
Run a Worker
WorkerFactory / Worker
Worker(...).run()
worker.New(...).Run()
Determinism API
Workflow.*
workflow module
workflow package
Call an Activity
typed Activity stub
workflow.execute_activity
workflow.ExecuteActivity
Different syntax, one runtime contract — pick the language, keep the model.
Day 1
Event sourcing & deterministic replay
The single concept that breaks the most Airflow brains.
Day 1 · on-ramp
Event sourcing & deterministic replay
Where this fits — The one idea under everything: how a Workflow survives a crash and resumes mid-flight.
Why it matters — This is the concept that most often breaks Airflow intuition — get it now, debug less later.
By the end — You'll explain why Workflow code re-executes while Activity results are replayed from history.
The replay rule
When a Worker resumes a Workflow:
It re-runs the Workflow code from the start.
Replays recorded events to reconstruct local state.
Reaches the next undecided point.
Continues from there.
Different decision than the recorded history = non-determinism error.
Five families of non-determinism
long now = System.currentTimeMillis();
long now = Workflow.currentTimeMillis();
int n = new Random ().nextInt(10 );
int n = Workflow.newRandom().nextInt(10 );
Files.writeString(path, "x" );
Thread.sleep(60_000 );
Workflow.sleep(Duration.ofMinutes(1 ));
for (var e : hashMap.entrySet()) { ... }
Durable sleep
Workflow.sleep(Duration.ofDays(30 ));
The Worker forgets this Workflow.
The server fires a timer in 30 days.
Some Worker - maybe a different one - picks it up and continues.
No JVM stays alive. Survives every deploy in between.
The Event History
A Workflow's state is an append-only log of events — what the server
persists and the Worker replays. A one-Activity run records:
WorkflowExecutionStarted ← input, Task Queue
WorkflowTaskScheduled / Started / Completed
ActivityTaskScheduled / Started / Completed
WorkflowTaskScheduled / Started / Completed
WorkflowExecutionCompleted ← result
If it isn't an event, it didn't happen — this log is the source of truth.
Event History retention
While a Workflow is open , its Event History is the live source of truth.
After a Workflow closes , history is kept for the Namespace retention period .
Cleanup is automatic: Temporal schedules a retention timer when the execution closes.
The default CLI-created Namespace retention is 3 days ; the minimum is 1 day .
Changing retention affects newly closed executions only, not already-closed ones.
Retention prunes closed histories. It does not shrink a running Workflow's history.
Pruning a growing history
You do not delete individual events from an open Workflow.
Use Continue-As-New to checkpoint state and start a fresh Run.
Same Workflow ID, new Run ID, new Event History.
Use it for long-running Workflows, entity Workflows, and high-event loops.
Temporal warns around 10,240 events / 10 MB ; hard limit is 51,200 events / 50 MB .
Continue-As-New prunes live history by starting the next run with only the state you pass forward.
Limits you design around
Per Run (one Workflow execution) — the caps that decide when to Continue-As-New:
What
SDK warns
Hard cap
Escape hatch
Events in the history
10,240
51,200
Continue-As-New
History size
10 MB
50 MB
Continue-As-New
A single payload
~256 KB
2 MB
S3 reference (Day 6)
Pending Activities / Child Workflows
—
~2,000 each
bounded batches, not all at once
Open Runs — exactly one per Workflow ID per Namespace; the ID frees up once that run closes.
These are server defaults (dynamic config), but the event and size caps are real stops. Blow past 51,200 events / 50 MB and the server terminates the Workflow.
The history caps are the true ceiling. Continue-As-New is how every unbounded Workflow stays under them.
Lab · Day 1
Hello Temporal
Challenge → day-01-foundations/lab-2-hello-temporal
make run-hello
make run-hello-starter
Worker and starter are separate processes , just like in production. They share only the Task Queue name. Order doesn't matter — start the Workflow first and the server holds it on the queue until a Worker polls.
Lab · Day 1
Hello Temporal — read the history
Workflow appears in the Web UI under default namespace.
Click into it; open the Event History tab.
Identify WorkflowExecutionStarted and the ActivityTask* events.
Restart the Worker mid-run; the Workflow resumes. That's the lesson.
Wait — two Task Queues in one history?
Each WorkflowTaskScheduled event records the Task Queue it was dispatched on — and they differ:
"taskQueue" : { "name" : "hello-temporal" , "kind" : "TASK_QUEUE_KIND_NORMAL" }
"taskQueue" : {
"name" : "Dhruvtara.local:0506aff1-295c-4b05-bf85-1c362f3fffec" ,
"kind" : "TASK_QUEUE_KIND_STICKY" ,
"normalName" : "hello-temporal"
}
name — for a sticky queue, an auto-generated, per-Worker host:uuid.
normalName — the real queue you named; what sticky falls back to.
Normal vs Sticky Task Queues
NORMAL — the durable queue you name. Every Worker on that name competes for its tasks. All Activities and the first Workflow Task of each run land here.
STICKY — an ephemeral, per-Worker queue the SDK creates automatically. After a Worker runs a Workflow Task, it caches the run in memory . The server then routes that run's next Workflow Tasks back to the same Worker, which continues without replaying the whole history.
Sticky is pure Worker optimisation (the "Workflow cache"). You never name one, and Activities are never sticky.
Sticky execution — and its fallback
Later Workflow Tasks for a run go to its sticky queue first .
If that Worker is gone or busy past StickyScheduleToStartTimeout (~5s default), the task falls back to the normal queue .
Another Worker then picks it up and replays from event 1 to rebuild state — correctness is never at risk, only the replay shortcut is lost.
This is exactly why the Workflow survives when you kill the Worker mid-run (the Hello lab's lesson). The sticky cache is an optimisation; the normal queue plus history is the guarantee.
Split brain — two Workers, one run
Worker A ── runs Workflow Task ──╳ (network drops; cache still warm)
│ server waits…
WorkflowTaskTimeout ▼ (~10s) → task re-scheduled
Worker B ── polls, replays from event 1, completes ──► history advances ✓
Worker A ── reconnects, submits its result ─────────► REJECTED (stale) ✗
The server accepts a completion only if it matches the run's current history version — a compare-and-swap . First writer wins.
Worker B's result is committed; A's is a stale write against an old version, so it's discarded. No coordination between the Workers required.
Exactly-once on Workflow state ; at-least-once on Activity execution → Activities must be idempotent.
There is never a "merge two histories" step. Persistence serialises the writes; the loser is simply rejected.
Stateful for speed, stateless for truth
Sticky cache — a Worker keeps the run's in-memory state to skip replay. Pure speed .
Replay — any other Worker rebuilds that exact state from the event history. Pure correctness .
Determinism — the contract that makes replay reproduce the same state, on a different Worker, byte-for-byte.
Three views of one mechanism: the cache is disposable, the history is the truth, determinism is the bridge. Kill any Worker at any instant and the run is fine — its state was only ever a cached replay of events.
Workflow ID vs Run ID
Workflow ID
Run ID
Who sets it
you
the server
Example
hello-temporal-demo
019ecbcd-6981-7b00-…
Means
the Workflow's business identity
one execution attempt
Reuse
reusable over time
unique, immutable
New one on
—
retry · continue-as-new · reset
Signal / query / cancel / describe target the Workflow ID (latest run by default).
One Workflow ID → a chain of Runs ; only one open at a time per namespace.
If starting a Workflow fails
StartWorkflowExecution is a client RPC. Failure means different things:
Case
What happened
Server rejects request
No Workflow started: bad Namespace, auth, invalid options, ID conflict
Connectivity / timeout
Ambiguous: the request may never have reached Temporal, or the response was lost after a successful start
Start accepted
WorkflowExecutionStarted is persisted; Workflow is durable from that point
First Workflow Task fails
Start still succeeded; the running Workflow now follows retry/failure rules
Use a stable Workflow ID for business idempotency.
On retry after an ambiguous client error, handle AlreadyStarted / Use Existing deliberately.
A Worker does not need to be online for start to succeed; the first Workflow Task waits on the Task Queue.
The commit point is the WorkflowExecutionStarted event, not the client's HTTP/gRPC response.
Start retry pattern
String workflowId = "order-" + orderId;
OrderWorkflow wf = client.newWorkflowStub(OrderWorkflow.class,
WorkflowOptions.newBuilder().setWorkflowId(workflowId).setTaskQueue("orders" ).build());
try {
WorkflowClient.start(wf::run, orderId);
} catch (WorkflowExecutionAlreadyStarted e) {
}
Retry the start with the same Workflow ID. A duplicate start becomes "already exists," not duplicate business work.
Three names you'll conflate
Name
What it is
Here
Workflow Type
the Workflow function / class
GreetingWorkflow
Workflow ID
the instance you started
hello-temporal-demo
Task Queue
the routing name Workers poll
hello-temporal
Type is recorded from your code as WorkflowExecutionStarted.workflowType — the function name (Go) or @WorkflowMethod interface (Java).
ID you pick per run; Task Queue wires Workers to work.
Three independent strings — changing one doesn't touch the others.
"…Completed" — which one?
Three events end in Completed ; don't conflate them:
Event
Means
WorkflowTaskCompleted
a Worker finished deciding — ran your code to the next await
ActivityTaskCompleted
one Activity returned its result
WorkflowExecutionCompleted
the whole Workflow finished — the terminal event
Many WorkflowTaskCompleted (one per decision) + one per Activity.
Exactly one WorkflowExecutionCompleted, always last.
WorkflowTask = "the Worker thought"; ActivityTask = "real work ran."
Day 1
Architecture
What's inside the box.
Day 1 · on-ramp
Architecture
Where this fits — You've used Temporal from the outside; now a look inside the box you've been talking to.
Why it matters — Knowing the services and task flow demystifies timeouts, sticky queues and "stuck" Workflows.
By the end — You'll trace one Workflow start through Frontend, History, Matching and your Worker.
The cluster
┌──────────────┐
SDK / CLI ────→ │ Frontend │ gRPC API
└──────┬───────┘
┌──────▼───────┐
│ History │ workflow state machine
└──────┬───────┘
┌──────▼───────┐
│ Matching │ Task Queue dispatch
└──────┬───────┘
┌──────▼───────┐
│ Persistence │ PostgreSQL / Cassandra
└──────────────┘
Your Workers connect outbound to Frontend on :7233.
The four services
Frontend
History
Matching
Worker
Stateless gRPC gateway. Auth, rate-limiting, routing, request validation. Every SDK/CLI call lands here.
Owns Workflow Execution state. Writes the event history, runs the state machine, enqueues tasks. Sharded.
Hosts Task Queues. Matches tasks from History to Workers polling by queue name.
Internal background service: replication, archival, schedules, batch ops, cleanup. Not your Worker.
Your application Worker is a client of this cluster, not part of it.
History Service & shards
Workflow state is partitioned into shards (e.g. 512 / 4096); each shard owns a slice of executions by hashed Workflow ID.
A shard is owned by exactly one History host at a time. That means a single writer and no per-workflow contention.
Ownership is a rangeID lease in persistence. If a partition makes two hosts both claim a shard, the first write bumps the lease; the other gets shard ownership lost and drops it — split brain resolved by the DB, not by merging.
Each shard drives its executions and processes internal task queues :
Internal queue
Drives
Transfer tasks
Push Workflow/Activity tasks to Matching; start child workflows
Timer tasks
Fire durable timers, Workflow.sleep, timeouts, retries
Visibility tasks
Update the searchable/visibility store
Replication tasks
Ship events to other clusters (multi-cluster)
Scaling History = more shards spread across more History hosts.
Three task types
Task
Worker does
Result
Workflow Task
Resume Workflow code until it blocks or completes
Commands back to History (schedule activity, start timer, complete)
Activity Task
Execute your Activity code (side effects allowed)
Success/failure reported to History
Query Task
Run a read-only query over current state
Value returned; history not advanced
History produces tasks; Matching dispatches them; your Worker pulls and runs them.
Lifecycle: one Workflow start
1 Client StartWorkflowExecution → Frontend → History (owning shard)
2 History persist [WorkflowExecutionStarted, WorkflowTaskScheduled]
+ transfer task → Matching enqueues on the Task Queue
3 Worker PollWorkflowTask (Frontend→Matching); Matching pings History
→ persist WorkflowTaskStarted, hand the events to the Worker
4 Worker run code until it blocks → RespondWorkflowTaskCompleted
command: ScheduleActivityTask
→ persist [WorkflowTaskCompleted, ActivityTaskScheduled] → Matching
5 Worker PollActivityTask → persist ActivityTaskStarted → run Activity
6 Worker RespondActivityTaskCompleted
→ persist [ActivityTaskCompleted, WorkflowTaskScheduled] → Matching
7 Worker final Workflow Task → command: CompleteWorkflowExecution
→ persist [WorkflowTaskCompleted, WorkflowExecutionCompleted] ✓
Nothing the Worker does is durable until History persists the resulting Event . Every arrow above is a write that happens before any Worker sees it.
…and the Event History it leaves
1 WorkflowExecutionStarted ← the client's StartWorkflowExecution
2 WorkflowTaskScheduled
3 WorkflowTaskStarted ┐ Workflow Task #1
4 WorkflowTaskCompleted ┘ → command: ScheduleActivityTask
5 ActivityTaskScheduled
6 ActivityTaskStarted
7 ActivityTaskCompleted
8 WorkflowTaskScheduled
9 WorkflowTaskStarted ┐ Workflow Task #2
10 WorkflowTaskCompleted ┘ → command: CompleteWorkflowExecution
11 WorkflowExecutionCompleted
The Worker issues Commands ; the Service records each as an Event .
This append-only history is the Workflow's state — not the Worker's memory.
Replay feeds these Events back; the code's regenerated Commands must match, in order.
Lifecycle: who acts, and when
Client Frontend History Matching Worker
│───────→│───────→│ │ │ StartWorkflowExecution
│ │ │───────→│ │ persist WFExecStarted,WFTScheduled → Matching
│ │ │ │←───────│ PollWorkflowTask
│ │ │←───────│ │ RecordWorkflowTaskStarted
│ │ │────────────────→│ persist WFTStarted; deliver task + events
│ │ │ │ │ Worker runs code → blocks
│ │ │←────────────────│ RespondWFTCompleted [ScheduleActivity]
│ │ │───────→│ │ persist WFTCompleted,ActivityTaskScheduled → Matching
│ │ │ │ │ … Activity task, then final WFT → Completed
The Worker addresses neither History nor Matching directly — it polls through Frontend, Matching dispatches, History owns the record.
Lifecycle: every Command becomes an Event
Worker issues a Command
Service records Event(s)
ScheduleActivityTask
ActivityTaskScheduled
StartTimer
TimerStarted
SignalExternalWorkflowExecution
SignalExternalWorkflowExecutionInitiated
StartChildWorkflowExecution
StartChildWorkflowExecutionInitiated
CompleteWorkflowExecution
WorkflowExecutionCompleted
ContinueAsNewWorkflowExecution
WorkflowExecutionContinuedAsNew
The Worker only ever proposes Commands. History decides, records the matching Event, and dispatches the resulting task. On replay, the code re-derives the same Commands from these Events. That's the determinism contract.
Lifecycle: the Worker's inner loop
┌─→ ① poll Task Queue → Workflow Task + history
│ ② replay history → rebuild SDK state
│ ③ run code from the cursor until it blocks
│ ④ collect the Commands the code produced
└── ⑤ RespondWorkflowTaskCompleted → History appends events
(the next Workflow Task is scheduled; loop repeats)
Every Workflow Task replays the whole history so far , then runs only the new tail.
That loop is why a crash is a non-event: a fresh Worker re-enters at ① and rebuilds state at ②.
Lifecycle: the states a run moves through
┌─→ Completed
├─→ Failed
start ─→ Running ──┼─→ Timed Out
├─→ Canceled
├─→ Terminated
└─→ Continued-As-New → new run, fresh history
The whole event trace lives inside Running; every other box is terminal.
Failed / Timed Out may be retried per the Workflow's Retry Policy — a new run.
Continue-As-New ends this run and starts a fresh one: the trick for long-running Workflows that would otherwise grow history without bound.
The detail page's More Actions menu has three ways to intervene — not interchangeable:
Cancel
Terminate
Reset
Workflow gets a say?
yes — a request it can catch
no — killed at once
replays instead
Cleanup / compensation?
yes, if you coded it
no
n/a
Effect
graceful stop
hard stop
rewind to an earlier event , re-run from there
Use when
"stop, but tidy up"
"it's wedged, stop now"
"bad deploy/bug — replay with fixed code"
Cancel is cooperative. Terminate is kill -9. Reset is a time machine — a new Run from a past point.
What's on your laptop
temporal server start-dev bundles:
Frontend + History + Matching + internal Worker
PostgreSQL (or in-memory)
Web UI on :8233
Prometheus metrics on :7234 with --metrics-port
One binary today. Same gRPC contract as production.
Lab · Day 1 · optional
Hello Temporal on Docker
Challenge → day-01-foundations/lab-2-hello-temporal-docker
Same Workflow, real cluster. Only the environment changes:
make stack-temporal
make run-connect
make run-connect-starter
Single binary → four services + PostgreSQL , each its own container.
No env, no creds → the plaintext branch defaults to 127.0.0.1:7233.
State now survives restarts — durable Postgres , not in-memory.
Lab · Day 1 · optional
Hello Temporal on the Cloud
Challenge → day-01-foundations/lab-2-hello-temporal-cloud
Same worker — Cloud creds make Connections.fromEnv() take the TLS branch:
export TEMPORAL_ADDRESS=...:7233 TEMPORAL_NAMESPACE=my-ns.acct
export TEMPORAL_API_KEY=...
make run-connect
make run-connect-starter
Set the namespace explicitly (my-ns.acct), not default.
Execution lands in the Cloud UI, not your laptop.
Lab · Day 1
Read the Event History
Challenge → day-01-foundations/lab-3-reading-event-history
temporal workflow show \
--workflow-id hello-temporal-demo --output json \
| jq '.events[].eventType'
Identify in order:
WorkflowTaskScheduled / Started / Completed - the decision loop
ActivityTaskScheduled / Started / Completed - the work loop
WorkflowExecutionCompleted - final outcome
Day 1 · the whole picture
How it all fits together
You start a Workflow by a business ID — your durable handle to it.
A Worker polling a Task Queue picks it up and runs your Workflow code.
Workflow code calls Activities for anything with side effects — retried for you.
Every step is appended to the Event History — the one source of truth.
Crash anywhere, and the Worker replays that history : Workflow code re-executes, Activity results are restored.
Durable execution = your code, plus a history that lets it resume exactly where it left off.
Day 1 takeaways
One model: Workflow code re-executes on replay; Activity results are recorded.
One discipline: keep Workflow code deterministic; do all I/O in Activities.
One habit: pick Workflow IDs from business identity. They're durable handles.
Day 2
Building reliable Workflows
Async, retries, heartbeats - and the ways you interact with running executions.
Day 2
Async and parallel Activity execution
Promises, not threads.
Day 2 · on-ramp
Async and parallel Activity execution
Where this fits — Day 1 ran one Activity at a time; today you fan work out and wait on many.
Why it matters — This is how a 10-minute serial pipeline becomes a 1-minute parallel one — safely.
By the end — You'll start Activities concurrently with Promises and join them without threads.
Blocking: the model you already have
Normal code runs one line at a time . A call blocks — the next line waits until it returns.
String a = priceBook();
String b = priceLamp();
Two calls that don't depend on each other still run back-to-back . If each takes 1s, you wait 2s — for no reason.
Blocking is the default everywhere you've coded. Async is just about not waiting when you don't have to.
A Promise is a claim ticket
Order at a coffee shop: you hand over the order, get a number on a receipt, and step aside. The receipt isn't the coffee — it's a promise of coffee, handed to you instantly .
Placing the order = Async.function(...) — starts the work, returns a Promise right away.
Walking up when your number is called = .get() — this is the only step that waits.
Getting the ticket is instant; only .get() blocks. That gap is the whole idea.
Same calls, different timing
Order three coffees, then collect them — all three baristas work at once. Order-wait, order-wait, order-wait and you've tripled the time for the same drinks.
Sequential: order→wait order→wait order→wait ~3 drinks of time
Async: order order order → wait for all ~1 drink of time
Sequential vs parallel isn't a different API — it's when you collect. Start everything first, collect last.
Exercise · The coffee run
You're getting coffee for 3 teammates. Each order takes the barista 4 minutes.
On paper, order these 6 steps so all three are ready in ~4 minutes, not 12 :
Collect Dana's coffee · 2. Place Amir's order · 3. Collect Amir's coffee
Place Sam's order · 5. Place Dana's order · 6. Collect Sam's coffee
Which steps are the Async.function(...) calls? Which are the .get()s?
Why async — the mental model
An Activity call is a durable async call . It returns a Promise (a future result), not a blocked thread:
Async.function(act::method, arg) schedules the Activity and hands back a Promise immediately .
The Workflow parks on .get() — the Worker thread is freed. One JVM holds tens of thousands of parked Workflows as heap state , not threads.
The one rule: start every Activity you want in parallel before you .get() any of them.
Sequential vs parallel is just when you call .get() — same API, different timing.
Sequential vs Async
String rawUri = activities.extract(batchDate);
String cleanUri = activities.transform(rawUri);
Promise<String> rawUri = Async.function(activities::extract, batchDate);
Promise<String> auditUri = Async.function(activities::extract, batchDate + "-audit" );
String cleanUri = activities.transform(rawUri.get());
String cleanAuditUri = activities.transform(auditUri.get());
Promise.get() blocks the Workflow loop , not an OS thread.
Exercise · Predict the clock
Each extract takes 2s ; each transform takes 1s .
Promise<String> raw = Async.function(activities::extract, a);
Promise<String> audit = Async.function(activities::extract, b);
String x = activities.transform(raw.get());
String y = activities.transform(audit.get());
How long does a fully sequential version (4 blocking calls) take?
How long does this version take — and why isn't it faster everywhere ?
Hint — the two extracts overlap; the two transforms still run one at a time.
Fan-out / fan-in, in plain terms
A mailroom with 100 letters: hand all 100 out at once (fan-out ), wait until the last clerk finishes (join ), then stack every reply into one pile (fan-in ).
Fan-out — start N independent jobs without waiting between them.
Fan-in — combine the N results once they're all back.
Split → do in parallel → combine. The code on the next slides is just this sentence in Java.
Java note: list.stream().map(f).toList() = "run f on each item, collect the results" — a for-loop in one line.
Fan-out / fan-in — the shape
The everyday parallel pattern: do N independent things, then combine. Three steps:
Fan-out — map each item to an Async.function(...) call, collecting the Promises into a list (don't .get() yet).
Join — Promise.allOf(list).get() parks until every branch finishes.
Fan-in — reduce the resolved values (sum, merge, collect).
The whole trick is step 1: build the list of Promises before you await any — that's what makes it parallel instead of a sequential loop.
Fan-out / fan-in
import io.temporal.workflow.Async;
import io.temporal.workflow.Promise;
List<Promise<Integer>> counts =
partitions.stream()
.map(p -> Async.function(activities::processPartition, p))
.toList();
Promise.allOf(counts).get();
int total = counts.stream().mapToInt(Promise::get).sum();
All partitions run in parallel; the Workflow suspends across all of them.
Javadoc: Async · Promise
Exercise · Spot the serial bug
This looks parallel but runs one Activity at a time. Find the line that kills the concurrency:
int total = 0 ;
for (String p : partitions) {
Promise<Integer> c = Async.function(activities::processPartition, p);
total += c.get();
}
Pair up, 2 minutes. Then rewrite it so all partitions run at once.
More async shapes — procedure & race
Two variants on the same parking model:
Async.procedure — for void Activities (no return); Async.function is for ones that return a value.
Promise.allOf waits for every branch; Promise.anyOf wakes on the first to finish — a race.
Pick allOf to gather all results, anyOf to act on the fastest and move on.
Async.procedure & first-to-finish
List<Promise<Void>> sends =
userIds.stream().map(id -> Async.procedure(notify::send, id)).toList();
Promise.allOf(sends).get();
Promise<String> primary = Async.function(notify::askPrimary, q);
Promise<String> fallback = Async.function(notify::askFallback, q);
Promise.anyOf(primary, fallback).get();
allOf waits for every branch; anyOf wakes on the first.
When one branch fails
By default a failed branch's .get() throws — and an unhandled throw aborts the whole fan-out.
To survive partial failure , wrap each branch's .get() in a try/catch and record a per-branch outcome.
The other branches keep their results; you decide what a partial success means.
All-or-nothing is the default; per-branch handling is a deliberate choice.
Partial failure in a fan-out
Map<Integer, Promise<String>> futures = new LinkedHashMap <>();
for (int p : partitions)
futures.put(p, Async.function(activities::processWithStatus, p));
Map<Integer, String> result = new LinkedHashMap <>();
for (var e : futures.entrySet()) {
try { result.put(e.getKey(), e.getValue().get()); }
catch (ActivityFailure failure) {
result.put(e.getKey(), "FAILED: " + failure.getMessage());
}
}
One branch failing doesn't sink the others - collect per-branch outcomes.
Bounding a fan-out
Unbounded Async.function over 10k items schedules all 10k at once — fine for the Workflow loop, but it can flatten a fragile downstream that only tolerates ~20 in-flight calls.
Gate the loop with an in-Workflow counter and Workflow.await(() -> inFlight < max) — the deterministic equivalent of a bounded thread pool.
Don't reach for java.util.concurrent.Semaphore: blocking a real thread isn't replay-safe. Workflow.await is the durable wait.
Keep the pipeline full without overwhelming the dependency — cap concurrency, not total work.
Bounded fan-out
int [] inFlight = {0 };
List<Promise<Void>> sends = new ArrayList <>();
for (String id : userIds) {
Workflow.await(() -> inFlight[0 ] < maxInFlight);
inFlight[0 ]++;
sends.add(Async.procedure(notify::send, id)
.thenApply(ignored -> { inFlight[0 ]--; return null ; }));
}
Promise.allOf(sends).get();
thenApply decrements as each branch finishes; await releases the next only when there's room.
Lab · Day 2
Async + parallel activities
Challenge → day-02-reliability/lab-1-async-parallel-activities
make run-async
make run-async-starter
In the Web UI:
Note that all ActivityTaskScheduled events appear with the same timestamp.
Compare to a sequential variant: events stagger.
Pair: predict what happens if one of three parallel Activities fails.
Day 2
Retries, timeouts, heartbeats
Know what each setting controls or you'll misuse all of them.
Day 2 · on-ramp
Retries, timeouts, heartbeats
Where this fits — Async got work running; now you make each step survive a flaky downstream.
Why it matters — Pick the wrong timeout and you either hang forever or give up too soon — both page someone.
By the end — You'll know what each of the three timeouts controls and when a heartbeat earns its keep.
Three timeouts
Setting
Controls
startToCloseTimeout
One attempt's wall-clock budget
scheduleToCloseTimeout
Total budget across all retry attempts
scheduleToStartTimeout
How long an Activity sits in the queue before pickup
heartbeatTimeout
Max gap between heartbeats; detects Worker death
If you can't say why a timeout is 5 minutes, it's wrong.
Setting them deliberately
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
ActivityOptions.newBuilder()
.setStartToCloseTimeout(Duration.ofMinutes(5 ))
.setScheduleToCloseTimeout(Duration.ofMinutes(30 ))
.setHeartbeatTimeout(Duration.ofSeconds(30 ))
.setRetryOptions(
RetryOptions.newBuilder()
.setInitialInterval(Duration.ofSeconds(5 ))
.setBackoffCoefficient(2.0 )
.setMaximumInterval(Duration.ofMinutes(1 ))
.setMaximumAttempts(6 )
.build())
.build();
6 × 5min attempts + 6 backoff waits ≈ 33min — set scheduleToClose to bound it.
Javadoc: ActivityOptions · RetryOptions
Activity retry vs Workflow retry
Two different retry scopes — people set the wrong one:
Activity RetryPolicy
Workflow RetryPolicy
Set on
ActivityOptions
WorkflowOptions
Retries
one Activity attempt
the whole Workflow run
Default
on — unlimited attempts
off — no retry
For
transient I/O failures (the 99% case)
crash-only restart of an entire run
Almost all retry logic belongs on the Activity — that's where side effects and flakiness live.
A Workflow retry restarts from event 0 as a new Run — rarely what you want.
The knob you reach for daily is the Activity RetryPolicy; Workflow retry is the rare exception.
Heartbeats — why
A long Activity (export 1000 pages, transcode a video) has a problem: if the Worker dies at page 900, a plain retry restarts at page 0 .
A heartbeat is a periodic "still alive — at page N" ping from the Activity to the server.
It buys two things. Liveness : miss the heartbeatTimeout and the server reschedules the Activity on another Worker. Resumability : the retry reads the last heartbeat detail and continues from page N .
Only long-running Activities need it; short ones finish before any timeout.
A heartbeat checkpoints an Activity , the way continue-as-new checkpoints a Workflow .
Heartbeats
public String exportLargeTable (String tableName) {
for (int page = 0 ; page < 1000 ; page++) {
try {
exportPage(tableName, page);
Activity.getExecutionContext().heartbeat(page);
} catch (ActivityCanceledException | ActivityPausedException stop) {
cleanupPartialExport(tableName, page); throw stop;
}
}
return "s3://exports/" + tableName;
}
On retry, read the last heartbeat detail and resume from page N .
Resume from the last heartbeat
public String backfill (String dataset) {
ActivityExecutionContext ctx = Activity.getExecutionContext();
int startPage = ctx.getHeartbeatDetails(Integer.class).orElse(0 );
for (int page = startPage; page < 100_000 ; page++) {
copyPage(dataset, page);
ctx.heartbeat(page);
}
return "backfilled " + dataset;
}
getHeartbeatDetails(...).orElse(0) is the resume API — empty on attempt 1, the last checkpoint on every retry.
Racing an Activity against a deadline
Sometimes a long Activity must be abandoned if it overruns — but you still want it to clean up first.
A CancellationScope wraps the Activity; cancelling the scope sends a cancellation to it.
That cancellation arrives on the Activity's next heartbeat , which throws. The Activity catches it and deletes partial work before exiting.
Drive it with Workflow.await(deadline, …) to enforce the time budget.
Cancellation is cooperative — it rides the heartbeat, so only heartbeating Activities can be stopped mid-flight.
Javadoc: CancellationScope
CancellationScope - race against a deadline
CompletablePromise<String> result = Workflow.newPromise();
CancellationScope scope = Workflow.newCancellationScope(
() -> result.completeFrom(Async.function(exports ::exportLargeTable, table)));
scope.run();
if (!Workflow.await(deadline, result::isCompleted)) {
scope.cancel("export deadline exceeded" );
throw ApplicationFailure.newFailure("export timed out" , "ExportTimeout" );
}
return result.get();
Cancellation flows to the Activity via heartbeat; it cleans up partial work.
Day 2
Determinism, reinforced
The rules that keep replay honest.
Day 2 · on-ramp
Determinism, reinforced
Where this fits — A checkpoint: the handful of rules that keep replay honest, now that you're writing real logic.
Why it matters — These are the traps that pass code review and then fail on the first replay in production.
By the end — You'll spot the common non-determinism bugs before the SDK does.
Common traps
Map.Entry.getKey() iteration over HashMap - JVM-version-dependent.
Instant.now(), LocalDateTime.now().
UUID.randomUUID() → use Workflow.randomUUID().
CompletableFuture, ExecutorService → use Async.function, Workflow.newPromise.
Throwing checked exceptions across the Workflow boundary - prefer ApplicationFailure.
The replay tests on Day 4 catch all of these.
Day 2
Signals and Queries
Push data in. Pull data out.
Day 2 · on-ramp
Signals and Queries
Where this fits — So far Workflows ran start-to-finish alone; now you talk to one while it's running.
Why it matters — This is how a long-running Workflow takes input and exposes state — without a database.
By the end — You'll push a Signal in and pull state out with a Query, and read both in the history.
Talking to a running Workflow — which one?
A Workflow is a live object , not a job you fire and forget. Three ways to interact, picked by intent :
Signal
Query
Update
Direction
push in
read out
call in → out
Synchronous?
no — fire-and-forget
yes
yes
Change state?
yes
no (read-only)
yes
In the history?
yes
no
yes
Can be rejected?
no
n/a
yes (validator)
Signal = "here's an event." Query = "what's your state?" Update = "do this and tell me the result."
Javadoc: @SignalMethod · @QueryMethod · @UpdateMethod
Signals - push data in
@WorkflowInterface
interface ApprovalWorkflow {
@WorkflowMethod String run (String requestId) ;
@SignalMethod void approve (String approver) ;
@QueryMethod String currentState () ;
}
@Override
public String run (String requestId) {
Workflow.await(() -> state.startsWith("APPROVED" ));
return state;
}
@Override
public void approve (String approver) { state = "APPROVED by " + approver; }
Queries - pull data out
@Override
public String currentState () { return state; }
Read-only function over current in-memory state .
No history events. No Activities. No side effects.
Synchronous and cheap. Routed to whichever Worker has the workflow cached.
Lab · Day 2
Signals + Queries
Challenge → day-02-reliability/lab-2-signals-and-queries
make run-approval
make run-approval-starter
temporal workflow signal --workflow-id approval-demo \
--name approve --input '"alice"'
temporal workflow query --workflow-id approval-demo \
--type currentState
Send the Signal before the Workflow starts; see what happens.
Day 2
Updates
Synchronous, validated, write-capable RPC into a running Workflow.
Day 2 · on-ramp
Updates
Where this fits — Signals are fire-and-forget; an Update is the request/response version that can reject bad input.
Why it matters — It replaces the "Signal, then poll a Query" dance with one validated, blocking call.
By the end — You'll call into a running Workflow and get a result back, with a validator guarding the door.
Updates — a validated, blocking call
An Update is the request/response way into a Workflow: the caller waits for a result, and the Workflow can reject bad input before recording anything.
Mark the handler @UpdateMethod; add an optional @UpdateValidatorMethod that runs first .
Validator throws → the Update is rejected and never hits history (cheap, no audit noise).
Accepted → it mutates state and returns a value to the caller.
Use an Update (not a Signal) when the caller needs the outcome, or you must validate before admitting.
@UpdateMethod + @UpdateValidatorMethod
@WorkflowInterface
interface CartWorkflow {
@WorkflowMethod String checkout (String cartId) ;
@UpdateMethod int addItem (String sku, int quantity) ;
@UpdateValidatorMethod(updateName = "addItem")
void validateAddItem (String sku, int quantity) ;
}
Caller blocks on .getResult().
Validator runs before the update is admitted to history.
Reject cheaply; don't pollute the audit trail.
Don't want to block? startUpdate
A typed addItem(...) call blocks until the Update finishes. When you'd rather fire it and keep working:
startUpdate(...) returns a handle as soon as the Update reaches the stage you pick.
The WorkflowUpdateStage is required — it says how far to wait before handing back the handle. Pick ACCEPTED or COMPLETED.
Call handle.getResult() later, only when you actually need the value.
Same Update; the only choice is when you block — now (typed call) or later (handle).
startUpdate - start now, get result later
WorkflowStub stub = client.newUntypedWorkflowStub(workflowId);
WorkflowUpdateHandle<Integer> handle =
stub.startUpdate("addItem" , WorkflowUpdateStage.COMPLETED,
Integer.class, "book" , 2 );
int itemCount = handle.getResult();
WorkflowUpdateStage is required : ACCEPTED or COMPLETED.
A typed-stub addItem(...) call blocks outright; startUpdate hands back a handle.
Start-or-signal — one idempotent entry
A Kafka consumer — or any event source — can't know whether the Workflow for a key already exists. So bare start() throws on the second event for that key.
signalWithStart means: start the Workflow if absent, then deliver this Signal. It's atomic, every time.
First event for order-123 starts it; every later event just signals the running execution.
The idempotent entry point for event-driven Workflows — never branch on "does it exist yet?"
signalWithStart
BatchRequest batch = client.newSignalWithStartRequest();
batch.add(workflow::run, orderId);
batch.add(workflow::orderEvent, event);
client.signalWithStart(batch);
First event for a key: workflow starts.
Later events: signal the existing execution.
Bare start() throws WorkflowExecutionAlreadyStarted on event #2.
Create + update in one round trip
Sometimes you want to start a Workflow and immediately ask it something , in a single call:
startUpdateWithStart creates the Workflow if absent, applies an Update, and returns its result — one round trip.
The Update half is the same validated @UpdateMethod from before.
The synchronous cousin of signalWithStart: start-or-reuse, but it hands back a result.
startUpdateWithStart
WithStartWorkflowOperation<String> start =
WithStartWorkflowOperation.newBuilder(workflow::process)
.setArguments(request.orderId()).build();
WorkflowUpdateHandle<String> update =
client.startUpdateWithStart(
start, "submit" , WorkflowUpdateStage.COMPLETED, String.class, request);
return update.getResult();
One round trip. Creates the Workflow if absent, applies the Update, returns.
Lab · Day 2
Updates
Challenge → day-02-reliability/lab-3-updates
make run-approval
make run-approval-starter
temporal workflow update execute --workflow-id approval-demo \
--name changeNote --input '"expedite before close of business"'
Verify the response is the new workflow state , not a generic 202.
Day 2
Schedules
Replacing Airflow's scheduler.
Day 2 · on-ramp
Schedules
Where this fits — This is the piece that directly replaces Airflow's scheduler.
Why it matters — Cron specs, catchup and overlap policy live in Temporal now — one less system to babysit.
By the end — You'll define a recurring Workflow and control what happens when runs pile up.
Defining a Schedule
A Schedule is a durable server object that starts a Workflow on a spec — Temporal's replacement for cron / Airflow's scheduler.
A Schedule has three parts: an action (which Workflow, which Task Queue), a spec (when to fire), and a policy (overlap, jitter, catchup).
The server owns it; you manage it with ScheduleClient — create, pause, trigger, delete.
You define what and when ; the server fires it. No always-on scheduler process of your own.
Javadoc: Schedule · ScheduleClient
Hourly schedule
import io.temporal.client.schedules.*;
Schedule schedule = Schedule.newBuilder()
.setAction(ScheduleActionStartWorkflow.newBuilder()
.setWorkflowType(OrdersWorkflow.class)
.setOptions(WorkflowOptions.newBuilder().setTaskQueue("orders" ).build())
.build())
.setSpec(ScheduleSpec.newBuilder()
.setIntervals(List.of(new ScheduleIntervalSpec (Duration.ofHours(1 ))))
.setJitter(Duration.ofMinutes(5 ))
.build())
.setPolicy(SchedulePolicy.newBuilder()
.setOverlap(ScheduleOverlapPolicy.SCHEDULE_OVERLAP_POLICY_BUFFER_ONE)
.build())
.build();
scheduleClient.createSchedule("hourly-orders" , schedule, ScheduleOptions.newBuilder().build());
Durable Temporal object. Survives redeploy. Overlap is explicit .
Javadoc: Schedule · io.temporal.client.schedules
Calendar specs, catchup & overlap
Two questions every scheduler must answer — Temporal makes both explicit knobs:
When — a cron expression or calendar/interval spec, with optional jitter to avoid thundering herds.
If it's late or still running — catchupWindow bounds how much it backfills after downtime. The overlap policy decides what happens when the previous run hasn't finished.
Airflow's schedule_interval + catchup + max_active_runs, but as explicit, bounded settings.
Cron, catchup & overlap
ScheduleSpec.newBuilder()
.setCronExpressions(List.of("0 9 * * *" ))
.setJitter(Duration.ofMinutes(5 )).build();
SchedulePolicy.newBuilder()
.setCatchupWindow(Duration.ofHours(1 ))
.setOverlap(ScheduleOverlapPolicy.SCHEDULE_OVERLAP_POLICY_SKIP).build();
Overlap policy
When a run is still going
SKIP
drop the new run
BUFFER_ONE / BUFFER_ALL
queue one / queue all
ALLOW_ALL
run concurrently
CANCEL_OTHER / TERMINATE_OTHER
stop the running one first
Lab · Day 2
Schedules
Challenge → day-02-reliability/lab-4-schedules
make run-schedules
temporal schedule list
temporal schedule describe --schedule-id daily-sales-report-schedule
Discuss:
What does SCHEDULE_OVERLAP_POLICY_SKIP mean for a 90-minute job that fires hourly?
Pause + resume from the CLI; observe what the schedule does.
Day 2
Child Workflows and timeouts
When to compose. How to bound.
Day 2 · on-ramp
Child Workflows and timeouts
Where this fits — You've built single Workflows. Now you compose them, and you cap how long anything runs.
Why it matters — Picking child vs. Activity, and the right timeout, keeps big workflows from sprawling.
By the end — You'll decide when to spawn a child and how to cap a Workflow's lifetime.
Child Workflow vs Activity — when?
Both let a parent delegate work. Reach for a Child Workflow only when the sub-task is itself a workflow :
Use an Activity when…
Use a Child Workflow when…
it's one unit of work (a call, a job)
it has its own multi-step orchestration
no independent retries/timeouts needed
it needs its own history, timeouts, Task Queue
the result fits the parent's history
its events would bloat the parent's history
—
you want it on a different Worker pool
Default to an Activity. Compose a Child Workflow when the sub-task deserves its own identity and history .
Child Workflows
FraudWorkflow fraud = Workflow.newChildWorkflowStub(FraudWorkflow.class,
ChildWorkflowOptions.newBuilder().setTaskQueue("fraud" ).build());
ShippingWorkflow shipping = Workflow.newChildWorkflowStub(ShippingWorkflow.class,
ChildWorkflowOptions.newBuilder().setTaskQueue("shipping" ).build());
Promise<String> fraudDecision = Async.function(fraud::check, orderId);
Promise<String> shippingPlan = Async.function(shipping::plan, orderId);
Promise.allOf(fraudDecision, shippingPlan).get();
Children get independent identity, history, Task Queue, timeouts .
Fanning out children — ParentClosePolicy
A nightly billing run spawns one child Workflow per tenant — each independently queryable and retryable by its own ID.
By default a child terminates when its parent closes. ParentClosePolicy.PARENT_CLOSE_POLICY_ABANDON lets long tenant jobs outlive the coordinator .
Start every child before joining (collect the Promises first) so all tenants bill in parallel.
ABANDON decouples child lifetime from the parent — the coordinator's job is to launch, not to babysit.
Per-tenant child fan-out
for (String tenantId : tenantIds) {
TenantBillingWorkflow child = Workflow.newChildWorkflowStub(
TenantBillingWorkflow.class,
ChildWorkflowOptions.newBuilder()
.setWorkflowId("billing-" + tenantId)
.setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_ABANDON)
.setWorkflowExecutionTimeout(Duration.ofHours(1 ))
.build());
futures.put(tenantId, Async.function(child::run, tenantId));
}
futures.forEach((id, p) -> results.put(id, p.get()));
Each tenant gets its own Workflow ID and history; ABANDON lets it finish even if the parent doesn't.
Workflow timeouts
WorkflowOptions.newBuilder()
.setWorkflowExecutionTimeout(Duration.ofDays(7 ))
.setWorkflowRunTimeout(Duration.ofHours(12 ))
.setTaskQueue("orders" )
.build();
WorkflowExecutionTimeout - hard cap, all continuations.
WorkflowRunTimeout - cap for this run; forces continuation.
Day 2 takeaways
One async pattern: Async.function + Promise.allOf. Yields the Workflow loop, not threads.
One Kafka/REST rule: signalWithStart , never bare start.
One sync RPC: startUpdateWithStart for "POST and wait for result."
Day 3
Kafka integration
Kafka is the bus between teams. Temporal is the brain inside one team.
Day 3
Temporal + Kafka architecture
Different jobs. Used together.
Day 3 · on-ramp
Temporal + Kafka architecture
Where this fits — You live in Kafka already; today is where Kafka and Temporal divide the work.
Why it matters — Used wrong, they overlap and fight. Used right, each does the job it's best at.
By the end — You'll say which problems belong to Kafka and which to Temporal, and why.
Who owns what
Concern
Owner
Append-only event log, replayable by offset
Kafka
Fan-out to many independent consumers
Kafka
State of a single business transaction
Temporal
Retry / timeout / compensation logic
Temporal
Long-running human / external waits
Temporal
Kafka tells you what happened . Temporal tells you where we are .
Kafka consumer as Activity
@Override
public List<String> pollBatch (String topic) {
consumer.subscribe(List.of(topic));
List<String> values = new ArrayList <>();
while (values.size() < 100 ) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1 ));
for (ConsumerRecord<String, String> record : records) {
values.add(record.value());
Activity.getExecutionContext()
.heartbeat(record.topic() + ":" + record.partition() + ":" + record.offset());
}
}
consumer.commitSync();
return values;
}
Emitting events back to Kafka
When a Workflow needs to publish a result, wrap the producer in an Activity:
The send is a side effect → it lives in an Activity, never in Workflow code.
Make the producer idempotent (enable.idempotence, acks=all) and give it a stable key . Then a retried Activity won't duplicate.
Temporal is at-least-once. Add an idempotent keyed producer and you get effectively-once per key.
The Worker may retry the send; idempotent + keyed is what makes that safe.
Producer Activity
properties.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true" );
properties.put(ProducerConfig.ACKS_CONFIG, "all" );
@Override
public void publishOutcome (String orderId, String outcome) {
producer.send(new ProducerRecord <>("order-outcomes" , orderId, outcome)).join();
}
Idempotent producer + stable key = at-least-once becomes effectively-once by key.
Downstream still dedupes.
Atomic DB write + event — the outbox
You can't atomically write your DB and publish to Kafka . They're two systems with no shared transaction — and 2PC is the thing to avoid.
Write the business row and an outbox row in one DB transaction . They commit or fail together.
A separate Activity (or Debezium) reads the outbox and publishes to Kafka, marking rows sent.
Turn "two side effects across systems" into "one local transaction + a relay."
Outbox pattern
transactionTemplate.execute(status -> {
orderRepository.save(order);
outboxRepository.save(new OutboxMessage (
"order-events" , order.id(),
json.serialize(new OrderAccepted (order.id()))));
return null ;
});
One DB transaction = atomic business row + outbox row.
Publish to Kafka in a separate Activity / Debezium.
Two side effects across systems can't be atomic without 2PC. Outbox sidesteps that.
Day 3
Signal-driven Workflows
Replacing Kafka-triggered Airflow DAGs.
Day 3 · on-ramp
Signal-driven Workflows
Where this fits — The concrete migration: a Kafka-triggered Airflow DAG becomes a Signal-driven Workflow.
Why it matters — It removes the brittle "consumer kicks off a DAG" glue you maintain today.
By the end — You'll bridge a Kafka topic to a running Workflow with Signals.
The pattern
One Workflow per business entity (e.g. per orderId).
Workflow ID = "order-" + orderId.
Kafka consumer is a thin bridge: signalWithStart for every event.
Commit offsets after signalWithStart returns.
Bare start() throws WorkflowExecutionAlreadyStarted on event #2.
The bridge
BatchRequest batch = client.newSignalWithStartRequest();
batch.add(workflow::run, orderId);
batch.add(workflow::orderEvent, record.value());
client.signalWithStart(batch);
consumer.commitSync();
First event for orderId starts the workflow.
Subsequent events signal the existing execution.
Offset commit happens after the signal lands.
Day 3
End-to-end pipeline
Kafka → Temporal → Kafka.
Day 3 · on-ramp
End-to-end pipeline
Where this fits — The pieces assembled: Kafka in, Temporal in the middle, Kafka out.
Why it matters — This is the shape of most event-driven systems you'll actually ship.
By the end — You'll run a full consume → orchestrate → produce pipeline and read it in the history.
Lab · Day 3
Run the pipeline
Challenge → day-03-kafka/lab-1-kafka-pipeline
make stack-kafka
make run-kafka
kcat -b localhost:9092 -t orders -P -k "order-1" <<< 'NEW:line-item-A'
kcat -b localhost:9092 -t order-outcomes -C -o end -f 'key=%k value=%s\n'
Send a second event for the same key. Watch it Signal the existing workflow.
Partition fan-out
Two strategies:
Outside the Workflow - one Workflow per partition. Many small histories.
Inside the Workflow - one Workflow processes a range of partitions in parallel Activities.
Pick based on whether the partitions share business state.
Inside-Workflow fan-out
List<Promise<Integer>> counts =
ranges.stream()
.map(range -> Async.function(activities::processRange, range))
.toList();
Promise.allOf(counts).get();
int total = counts.stream().mapToInt(Promise::get).sum();
Bound the fan-out: don't open 1,000 partitions inside one history.
Lab · Day 3
Fan-out by partition
Challenge → day-03-kafka/lab-2-partition-fanout
make kafka-topic TOPIC=orders PARTITIONS=4
for i in 1 2 3 4; do
echo "evt-$i " | kcat -b localhost:9092 -t orders -P -k "order-$i "
done
In the Web UI, observe 4 separate Workflow executions, one per key.
DLQ vs Temporal retry exhaustion
Failure type
Belongs in
Transient (network)
Temporal retry (free)
Poison message (malformed)
DLQ topic for triage
Business rule rejection
DLQ or audit topic
Catch-all
DLQ after Temporal exhausts
Temporal retries solve transient. DLQ catches what retry can't fix.
DLQ Activity
try {
orders.validate(orderId);
} catch (ActivityFailure exhausted) {
dlq.publish(orderId, exhausted.getMessage());
}
Workflow catches ActivityFailure (retry exhausted).
Publishes to DLQ; Workflow completes successfully.
The order failed; the Workflow did its job.
Day 3 takeaways
signalWithStart is the only correct Kafka bridge primitive.
Commit Kafka offsets only after the unit of work is durably accepted.
DLQ catches what Temporal retries cannot fix. Different problems.
Day 4
Production engineering
Versioning, sizing, observability, replay tests, namespaces, Airflow migration.
Day 4
Workflow versioning
Shipping new code without breaking in-flight Workflows.
Day 4 · on-ramp
Workflow versioning
Where this fits — Your code is running in production; now you change it without breaking in-flight Workflows.
Why it matters — A careless deploy can non-deterministically corrupt every executing Workflow at once.
By the end — You'll ship a behavior change safely with the versioning API.
Why versioning exists
Day 1: deploy v1. Workflow runs against v1 history.
Day 30: deploy v2 that reorders two Activities.
In-flight Workflow resumes against v2 code with v1 history → non-determinism error.
You need v2 code to behave like v1 until past the change-point .
Workflow.getVersion
int v = Workflow.getVersion("charge-before-reserve" , Workflow.DEFAULT_VERSION, 1 );
if (v == Workflow.DEFAULT_VERSION) {
payments.reserve(orderId);
payments.charge(orderId);
} else {
payments.charge(orderId);
payments.reserve(orderId);
}
The change-point name is identity . Treat like a migration filename. Never recycle.
Two ways a Worker adopts new code
When you deploy new code, what happens to in-flight runs? Choose per Workflow type:
PINNED — in-flight runs stay on the old code until they finish. New code only takes new runs. Right for short-lived Workflows: drain, then deploy.
AUTO_UPGRADE — long-runners pick up newer compatible code automatically. Right for Workflows that live for months.
Short-lived → PINNED. Long-lived → AUTO_UPGRADE. (getVersion, from the previous slide, patches within one definition.)
Versioning behavior
@WorkflowVersioningBehavior(VersioningBehavior.PINNED)
class ShortLivedCheckoutWorkflow implements CheckoutWorkflow { ... }
@WorkflowVersioningBehavior(VersioningBehavior.AUTO_UPGRADE)
class SubscriptionLifecycleWorkflow implements SubscriptionWorkflow { ... }
PINNED - drain in-flight on old Workers, then deploy.
AUTO_UPGRADE - long-runners pick up newer compatible code automatically.
Day 4
Worker sizing & Task Queue design
Sized for resource profile, not business domain.
Day 4 · on-ramp
Worker sizing & Task Queue design
Where this fits — Workflows work; now you make Workers carry production load.
Why it matters — Task Queues are sized by resource profile, not domain — get this wrong and you over- or under-provision.
By the end — You'll know the levers — slots, pollers, tuners — and how to set them deliberately.
The levers
Setting
Controls
Default
maxConcurrentWorkflowTaskExecutionSize
In-flight workflow decisions on this Worker
200
maxConcurrentActivityExecutionSize
In-flight Activity attempts
200
maxConcurrentLocalActivityExecutionSize
In-flight local Activities
200
ResourceBasedTuner
Auto-scale Worker slots vs CPU / memory targets
off (fixed slots)
CompositeTuner
Mix strategies: fixed workflow slots + resource-based activity slots
off
Sticky execution
Worker caches workflows; skips full replay each task
on · cache 600
setUsingVirtualWorkflowThreads(true)
Cheaper SDK Workflow threads inside the Worker Factory
false
setUsingVirtualThreads(true) (JDK 21+)
Cheaper Activity execution threads inside one Worker
false
Number of Task Queues
One pool per resource profile
1
Pollers default to 5 workflow-task + 5 activity-task per Worker; rate limits (maxActivitiesPerSecond) default to 0 = unlimited .
Workers across machines
orders Task Queue
├── worker JVM on pod-a / vm-a
├── worker JVM on pod-b / vm-b
└── worker JVM on pod-c / vm-c
Every Worker polling the same Task Queue is in the same pool .
Matching hands each task to one available poller. Workers compete for tasks; they don't coordinate directly.
Total capacity is roughly: machines × Worker processes × execution slots.
Sticky Workflow cache is per Worker process ; if that process disappears, another Worker replays from history.
Machines scale the pool. Slots limit concurrency inside each Worker.
Threads vs virtual threads
Layer
Platform threads
Virtual threads
Workflow execution
SDK workflow threads; deterministic, park at Temporal waits
setUsingVirtualWorkflowThreads(true) makes those cheaper
Activity execution
One blocking Activity can occupy one OS-backed thread
setUsingVirtualThreads(true) makes blocking I/O Activities cheaper
Temporal semantics
Same Task Queues, histories, retries, timeouts
Same semantics — only JVM scheduling cost changes
Virtual threads help blocking I/O-heavy Activity pools.
They do not make CPU-bound Activities faster; CPU still caps throughput.
Long Workflow waits still do not park a JVM thread; Temporal records timers and resumes later.
Virtual threads increase how much one JVM can hold; extra machines increase the Worker pool.
Manual sizing
Worker worker = factory.newWorker(
"io-heavy" ,
WorkerOptions.newBuilder()
.setMaxConcurrentActivityExecutionSize(200 )
.setMaxConcurrentWorkflowTaskExecutionSize(20 )
.build());
I/O-heavy workload: many concurrent Activities, few workflow tasks. Both default to 200 ; here we leave Activities at the default and cap workflow tasks down to 20.
Cheaper concurrency — virtual threads
Activity slots cost threads. On JDK 21+ , virtual threads make those threads almost free:
setUsingVirtualWorkflowThreads(true) runs execution on virtual threads . One Worker can then hold far more concurrent Activities per host — especially I/O-bound ones.
Frees you from sizing slot counts around OS-thread limits.
The easy lever for high-concurrency, I/O-heavy Activity workloads (JDK 21+ only).
Virtual-thread Worker
WorkerFactoryOptions factoryOptions =
WorkerFactoryOptions.newBuilder()
.setUsingVirtualWorkflowThreads(true )
.build();
Worker worker =
factory.newWorker(
"high-concurrency-activities" ,
WorkerOptions.newBuilder()
.setUsingVirtualThreads(true )
.setMaxConcurrentActivityExecutionSize(1_000 )
.build());
Same Task Queue. Same retries. More blocking I/O Activities can fit in one JVM.
Let the host decide — resource-based slots
Fixed slot counts are a guess. A resource-based tuner sizes the Worker to the machine instead:
You set target CPU / memory (e.g. 75% / 80%); the SDK grows or shrinks slots to hold that.
Best for mixed or unpredictable workloads, where any fixed number is wrong half the time.
Stop hand-tuning slot counts per host — target a utilisation and let it adapt.
Resource-based tuner
ResourceBasedTuner tuner =
ResourceBasedTuner.newBuilder()
.setControllerOptions(
ResourceBasedControllerOptions.newBuilder()
.setTargetMemoryUsage(0.75 )
.setTargetCpuUsage(0.80 )
.build())
.build();
Worker worker = factory.newWorker(
"payments" , WorkerOptions.newBuilder().setWorkerTuner(tuner).build());
Auto-fit Worker slot counts to host capacity. Best fit for mixed workloads.
Mix strategies per slot type
A WorkerTuner decides how many slots each task type gets. CompositeTuner takes one SlotSupplier per slot type , so each can use the strategy that fits it:
CompositeTuner(workflowTaskSupplier, activitySupplier, localActivitySupplier) — three independent suppliers, one per slot type.
FixedSizeSlotSupplier<>(N) — hands out exactly N slots, never more (same effect as setMaxConcurrent…Size(N)). Use where load is predictable — e.g. workflow-task decisions.
ResourceBasedSlotSupplier — grows / shrinks slots against CPU & memory targets. Use where load isn't — e.g. I/O-heavy Activities.
Pin the cheap, predictable slots with FixedSizeSlotSupplier; auto-size the expensive ones with ResourceBasedSlotSupplier. CompositeTuner is just the holder that lets them coexist.
CompositeTuner - mix strategies
ResourceBasedController controller =
ResourceBasedController.newSystemInfoController(
ResourceBasedControllerOptions.newBuilder()
.setTargetMemoryUsage(0.75 ).setTargetCpuUsage(0.80 ).build());
WorkerTuner tuner = new CompositeTuner (
new FixedSizeSlotSupplier <>(20 ),
ResourceBasedSlotSupplier.createForActivity(
controller, ResourceBasedSlotOptions.getDefaultInstance()),
new FixedSizeSlotSupplier <>(20 ));
Fixed where load is predictable; resource-based where it isn't.
Protecting a fragile downstream — rate limits
Slot tuners size for your throughput; sometimes the constraint is theirs — a legacy API or vendor with a hard QPS cap.
setMaxActivitiesPerSecond caps this Worker's Activity starts/sec.
setMaxTaskQueueActivitiesPerSecond caps the whole Task Queue — server-enforced across every replica, so the dependency never sees more than N QPS no matter how far you scale.
Tuners throttle for capacity; these two knobs throttle to respect a downstream limit.
Rate-limiting an Activity pool
Worker worker = factory.newWorker(
"legacy-api-calls" ,
WorkerOptions.newBuilder()
.setMaxActivitiesPerSecond(50 )
.setMaxTaskQueueActivitiesPerSecond(100 )
.build());
Scale to 20 replicas and the vendor still sees ≤ 100 QPS — the task-queue cap is enforced server-side.
Day 4
Observability
Metrics on day one.
Day 4 · on-ramp
Observability
Where this fits — Before you trust this in production, you need to see inside it.
Why it matters — Temporal emits the metrics; wiring them on day one saves the 2 AM blind debugging session.
By the end — You'll expose SDK metrics, add your own, and trace one request across Workflow and Activities.
Key SDK metrics
Metric
Tells you
temporal_workflow_task_schedule_to_start_latency
Worker capacity vs demand
temporal_workflow_completed_total
Throughput
temporal_workflow_failed_total
Real failures
temporal_activity_execution_failed_total
Bad downstream / retry config
temporal_sticky_cache_size
Replay overhead / memory health
temporal_activity_schedule_to_start_latency
Activity backlog
Wire via Micrometer → Prometheus → your existing Grafana.
Micrometer wiring
PrometheusMeterRegistry registry = new PrometheusMeterRegistry (PrometheusConfig.DEFAULT);
Scope scope = new RootScopeBuilder ()
.reporter(new MicrometerClientStatsReporter (registry))
.reportEvery(com.uber.m3.util.Duration.ofSeconds(10 ));
WorkflowServiceStubs service = WorkflowServiceStubs.newServiceStubs(
WorkflowServiceStubsOptions.newBuilder().setMetricsScope(scope).build());
Beyond the built-ins — your own metrics
The SDK emits task latencies and counts for free. For business signals, you add your own:
Grab the SDK's metrics scope inside an Activity and emit counters / gauges / timers.
They ride the same Micrometer → Prometheus pipeline as the built-in metrics.
"How many orders compensated today?" is a metric you emit — Temporal doesn't know your domain.
Custom Activity metric
class InvoiceActivitiesImpl implements InvoiceActivities {
private final Counter invoices;
InvoiceActivitiesImpl(MeterRegistry registry) {
this .invoices = Counter.builder("training_invoices_generated_total" ).register(registry);
}
@Override public String generateInvoice (String orderId) {
invoices.increment();
return "s3://invoices/" + orderId + ".pdf" ;
}
}
Same Micrometer registry as the SDK metrics; your KPIs sit beside Temporal's.
One trace across Workflow + Activities
A request that spans a Workflow and several Activities should be one distributed trace , not disconnected spans:
An OpenTelemetry interceptor carries the trace context across every boundary: Workflow → Activity → child-Workflow.
Spans land in your existing backend (Jaeger, Tempo, Honeycomb…).
Wire the interceptor once on the client + Worker; causality across the whole execution comes for free.
Tracing with OpenTelemetry
client = WorkflowClient.newInstance(service,
WorkflowClientOptions.newBuilder()
.setInterceptors(new OpenTracingClientInterceptor (otOptions))
.build());
factory = WorkerFactory.newInstance(client,
WorkerFactoryOptions.newBuilder()
.setWorkerInterceptors(new OpenTracingWorkerInterceptor ())
.build());
One trace spans client → Workflow → Activity. Needs temporal-opentracing.
Lab · Day 4
Local dashboard
Challenge → day-04-production/lab-1-observability
make stack-obs
make temporal
open http://localhost:3000
make load-transform N=50
In Grafana, open the Temporal Training - Overview dashboard and watch:
temporal_workflow_completed_total climb.
temporal_workflow_failed_total increment when you force a failure.
Day 4
Namespace strategy
Isolation boundary, not a routing primitive.
Day 4 · on-ramp
Namespace strategy
Where this fits — One more production decision: how you carve tenants and environments apart.
Why it matters — A namespace is an isolation boundary, not a routing key — confusing the two bites later.
By the end — You'll decide when to split namespaces and how retention applies to each.
When to split namespaces
Scenario
Namespace shape
Dev / staging / prod
One namespace per environment
Regulated tenant isolation
One namespace per tenant
Shared SaaS tenants
One namespace per env; tenant ID in Search Attributes
Different retention SLAs
Separate namespace per retention class
Namespace ≠ Task Queue. Task Queue routes work; Namespace bounds it.
Access control stops at the namespace
Self-hosted: mTLS authenticates the connection; two Go hooks authorize the call.
Hook
Job
ClaimMapper
token → who you are + your namespace Role
Authorizer
claims + API call → allow / deny
The default Authorizer only checks you hold some role on the namespace — it ignores the Role level and the Task Queue.
Finer than a namespace (per-Task-Queue, per-Workflow) = a custom Authorizer — the one hook that sees the request.
Cloud: RBAC is fixed (account + namespace roles); Workers use Service Accounts + scoped API keys ; finer grain = more namespaces.
No config knob between "one namespace" and "write your own Authorizer."
Namespace retention operations
Retention is Namespace-level policy for closed Workflow histories.
temporal operator namespace describe --namespace default
temporal operator namespace update --namespace default --retention 7d
temporal workflow delete --namespace default \
--workflow-id <workflow-id> --run-id <run-id>
Reduce future storage: lower Namespace retention.
Remove one closed execution now: temporal workflow delete.
Preserve old histories externally: enable History Archival.
Finding live Workflows — Search Attributes
Metrics give you aggregates; Search Attributes let ops filter and group individual executions — "all orders in REFUNDING for tenant acme."
Register custom keys once per namespace : temporal operator search-attribute create --name OrderStage --type Keyword.
From inside the Workflow, Workflow.upsertTypedSearchAttributes(...) tags the execution; update it as state changes so temporal workflow list and the UI stay current.
Indexed, queryable metadata on a running Workflow — the bridge from a Workflow to your ops dashboards.
Typed Search Attributes from a Workflow
private static final SearchAttributeKey<String> ORDER_STAGE =
SearchAttributeKey.forKeyword("OrderStage" );
Workflow.upsertTypedSearchAttributes(ORDER_STAGE.valueSet("RECEIVED" ));
activities.charge(orderId);
Workflow.upsertTypedSearchAttributes(ORDER_STAGE.valueSet("CHARGED" ));
Now temporal workflow list --query "OrderStage='CHARGED'" finds every matching live execution.
Day 4
Testing
In-process Workflow tests with time skipping.
Day 4 · on-ramp
Testing
Where this fits — You've written Workflows; now you test them without a server or a real clock.
Why it matters — Fast in-process tests with time-skipping make day-long timers testable in milliseconds.
By the end — You'll unit-test a Workflow with mocked Activities and skipped time.
Testing Workflows — no server, no clock
Workflow logic is deterministic, so you can test it in-process — no Docker, no real Temporal, no waiting:
TestWorkflowEnvironment runs a Worker + client inside your test JVM.
Time-skipping is the headline: a Workflow that sleeps 30 days completes in milliseconds . The test clock jumps straight to the next timer.
Unit-test orchestration like ordinary code — fast and hermetic.
Javadoc: TestWorkflowEnvironment
TestWorkflowEnvironment
TestWorkflowEnvironment env = TestWorkflowEnvironment.newInstance();
Worker worker = env.newWorker("reminder" );
worker.registerWorkflowImplementationTypes(ReminderWorkflowImpl.class);
worker.registerActivitiesImplementations(new ReminderActivitiesImpl ());
env.start();
ReminderWorkflow stub = env.getWorkflowClient().newWorkflowStub(
ReminderWorkflow.class,
WorkflowOptions.newBuilder().setTaskQueue("reminder" ).build());
String result = stub.run("hello" );
No Docker. No network. Time skipping - a 30-day reminder completes in milliseconds.
JUnit 5 + mocked Activities
For real test suites, skip the manual env wiring and isolate the logic:
TestWorkflowExtension injects the env / worker / typed stub into each @Test.
Mock the Activities (Mockito) so the test exercises Workflow logic only — zero network, zero DB.
Assert the orchestration: given these Activity results, the Workflow reaches this outcome.
JUnit 5 extension + mocked Activities
@RegisterExtension
static final TestWorkflowExtension ext = TestWorkflowExtension.newBuilder()
.setWorkflowTypes(ReminderWorkflowImpl.class).setDoNotStart(true ).build();
@Test
void completes (TestWorkflowEnvironment env, Worker worker, ReminderWorkflow wf) {
ReminderActivities activities = mock(ReminderActivities.class);
when (activities.lookupEmail("u1" )).thenReturn("u1@example.com" );
worker.registerActivitiesImplementations(activities);
env.start();
assertEquals("sent to u1@example.com" , wf.remind("u1" ));
}
Extension injects env/worker/stub; Mockito mocks Activities - zero I/O.
Lab · Day 4
In-process tests
Challenge → day-04-production/lab-2-testing-workflows
make run-testing
The test uses Workflow.sleep(Duration.ofDays(1)).
It still completes in <1 second.
Try changing the sleep to 30 days; same test time.
Day 4
Workflow replay testing
Catching determinism regressions before they reach production.
Day 4 · on-ramp
Workflow replay testing
Where this fits — The test that specifically guards the determinism rules from Day 2.
Why it matters — It catches the version-breaking change before the deploy, not after.
By the end — You'll replay a recorded history against new code and catch the regression.
Capture & replay
temporal workflow show --workflow-id order-1001 \
--output json > histories/order-1001.json
@Test
void replaysProductionHistory () throws Exception {
WorkflowReplayer.replayWorkflowExecutionFromResource(
"histories/order-1001.json" , OrderSagaWorkflowImpl.class);
}
Refactor breaks an in-flight workflow → CI fails before you ship.
What a non-determinism failure looks like
Reorder two Activities, then replay the old history — the replayer rejects it:
[TMPRL1100] nondeterministic workflow:
history event is ActivityTaskScheduled (ActivityId: 5, Name: Extract …)
replay command is ScheduleActivityTask (ActivityId: 5, Name: Load …)
History scheduled Extract as command #5; the new code emits Load there.
The SDK won't guess — it fails the replay rather than corrupt state.
Real output from the lab's TestReorderedCodeBreaksReplay. In CI this fires before the deploy, not on a live run.
Lab · Day 4
Build a replay corpus
Challenge → day-04-production/lab-3-replay-testing
For a Workflow you wrote on Day 1-2:
Run 3 executions covering: happy path, retry, cancellation.
Capture each with temporal workflow show ... --output json.
Drop them into src/test/resources/histories/.
Add a WorkflowReplayer test per file.
Modify the Workflow to reorder Activities; watch the test fail.
Javadoc: WorkflowReplayer
Day 4
Migrating Airflow DAGs
A decision framework.
Day 4 · on-ramp
Migrating Airflow DAGs
Where this fits — Stepping back: a framework for deciding what to move, and in what order.
Why it matters — Not every DAG should move; knowing which saves wasted migrations.
By the end — You'll have a decision rule for migrate-or-keep and a safe migration order.
Migrate or not?
DAG shape
Verdict
Simple ETL on a fixed schedule
Stay on Airflow, or move scheduler to Temporal Schedules
Cross-system orchestration with retries and human steps
Migrate (sweet spot)
Kafka-triggered, one execution per key
Migrate (Day 3 pattern)
Pure data transformation
Don't migrate. Spark / dbt territory
Long-running waits (hours, days, humans)
Migrate. Airflow handles this poorly
Tight Airflow operator coupling
Wrap in Activities; the operator is the unit
Migration order that works
Pick one DAG that hurts in production.
Map operators → Activities mechanically. Don't redesign.
Run side by side for a release cycle.
Cut over after the Temporal version is clean for two weeks.
Redesign only after stable. Now use Signals, Updates, Schedules.
Don't migrate everything. Migrate where Temporal earns its keep.
Day 4 takeaways
Versioning is about preserving old histories, not just deploying new code.
Size Workers for resource profile , not business domain.
Replay tests are the single safety net for Workflow code changes.
Not everything is a Workflow. Migrate where Temporal earns its keep.
Day 5
Saga, Spring Boot & capstone
Real-world Workflow walkthrough. Then build one.
Day 5
Order-processing saga
The canonical demo: payment → inventory → ship; compensate on failure.
Day 5 · on-ramp
Order-processing saga
Where this fits — Everything so far, assembled into the canonical real workflow: pay → reserve → ship.
Why it matters — Distributed transactions have no rollback; the saga is how you undo across services.
By the end — You'll read a saga's forward steps and compensations and watch one undo itself in history.
The Saga pattern — forward steps + undo
A saga is a sequence of steps. If a later step fails, you undo the earlier ones yourself — there's no distributed transaction to roll back for you:
After each forward Activity, register its compensation (the inverse Activity).
On failure, run the registered compensations — by default LIFO (reverse order).
The SDK's Saga helper just tracks the compensation stack; you write the undo Activities.
"Authorize → reserve → ship", with a matching "cancel → restore → notify" if anything throws.
Javadoc: Saga
The saga
public String process (String orderId) {
Saga saga = new Saga (new Saga .Options.Builder().setParallelCompensation(false ).build());
try {
String paymentId = activities.authorizePayment(orderId);
saga.addCompensation(activities::cancelPayment, paymentId);
String reservationId = activities.reserveInventory(orderId);
saga.addCompensation(activities::restoreInventory, reservationId);
activities.ship(orderId);
return "COMPLETED" ;
} catch (RuntimeException failure) {
saga.compensate();
activities.sendFailureNotification(orderId, failure.getMessage());
return "COMPENSATED" ;
}
}
Orchestration vs choreography
Orchestration - one central Workflow coordinates all steps & compensations. Single audit trail. Temporal's natural shape.
Choreography - each service reacts to events and emits its own. No central state.
Temporal supports both. A Workflow can be the orchestrator. Or it can be one service's durable participant inside a larger event choreography.
For cross-team flows from Airflow + Kafka, orchestration wins.
Choreography with Temporal
Event-choreographed system
Temporal participant
Kafka topic carries facts: OrderPlaced, PaymentCaptured, ShipmentFailed
Workflow owns one business key: order-123
Services publish events after local commits
Bridge delivers events with signalWithStart
No one process owns the whole company flow
This team still gets durable state, retries, timers, and audit
Choreography outside; orchestration inside the Workflow boundary.
Choreography example
Order service owns the order lifecycle, but reacts to events from other teams.
Runnable demo: examples/runnable/13-choreography.
@KafkaListener(topics = "order-domain-events")
void onEvent (DomainEvent event) {
OrderProcessWorkflow workflow = client.newWorkflowStub(
OrderProcessWorkflow.class,
WorkflowOptions.newBuilder()
.setWorkflowId("order-" + event.orderId())
.setTaskQueue("orders" )
.build());
BatchRequest batch = client.newSignalWithStartRequest();
batch.add(workflow::run, event.orderId());
batch.add(workflow::onEvent, event);
client.signalWithStart(batch);
}
The bridge is thin. The Workflow decides what the event means for this order.
When to choose which
Choose orchestration when...
Choose choreography when...
One team owns the end-to-end business outcome
Several teams must evolve independently
You need one place for compensations and timeouts
Events are the stable contract between domains
Operators need one execution history to debug
Consumers should be added without changing a central flow
The flow is user-facing or SLA-bound
The flow is naturally eventually consistent
Most real systems mix them: events between domains, Workflows inside domains.
Compensation rules
Register compensation immediately after the forward step succeeds.
Compensations are business logic , not generic undo.
Compensations get their own retry policy. Test the failing case.
Idempotency on forward AND compensation steps.
Lab · Day 5
Run the saga
Challenge → day-05-saga-spring/lab-1-order-saga-walkthrough
make run-saga
Start the happy path:
temporal workflow start --task-queue orders \
--type OrderSagaWorkflow --workflow-id order-OK \
--input '"order-1001"'
Run the saga: force a failure
temporal workflow start --task-queue orders \
--type OrderSagaWorkflow --workflow-id order-fail \
--input '"fail-at-ship"'
In the Web UI, watch the compensations fire in reverse order.
Day 5
Saga in Spring Boot
Wiring + interaction patterns.
Day 5 · on-ramp
Saga in Spring Boot
Where this fits — The same saga, wired the way your services actually run — in Spring Boot.
Why it matters — This is the production plumbing: client beans, Worker registration, sync and async entry points.
By the end — You'll drive a saga over HTTP and over Kafka from a Spring app.
Temporal in Spring Boot — the shape
Temporal isn't a framework you hand control to. It's a client + Worker you wire into Spring's lifecycle:
Expose WorkflowServiceStubs → WorkflowClient → WorkerFactory as @Beans. Then bind the factory's start / shutdown to the app context.
Activities are Spring beans — inject DataSources, HTTP clients, repositories as usual.
Drive Workflows from a @RestController (start / signal / query); the Worker just polls in the background.
In production use the temporal-spring-boot-starter; the next slide is the manual wiring it automates.
Manual Spring config
@Configuration
class TemporalConfig {
@Bean WorkflowServiceStubs workflowServiceStubs () {
return WorkflowServiceStubs.newLocalServiceStubs(); }
@Bean WorkflowClient workflowClient (WorkflowServiceStubs s) {
return WorkflowClient.newInstance(s); }
@Bean(initMethod = "start", destroyMethod = "shutdown")
WorkerFactory workerFactory (WorkflowClient c, OrderActivities a) {
WorkerFactory f = WorkerFactory.newInstance(c);
Worker w = f.newWorker("orders" );
w.registerWorkflowImplementationTypes(OrderSagaWorkflowImpl.class);
w.registerActivitiesImplementations(a);
return f;
}
}
Javadoc: WorkflowServiceStubs · WorkflowClient · WorkerFactory
Driving the saga synchronously (HTTP)
A @RestController POST that needs the result uses an Update-with-start:
startUpdateWithStart creates the saga Workflow (if absent) and submits the order in one call.
The handler blocks on getResult() and returns the outcome — one round trip; the caller gets the answer.
Synchronous request/response over a durable Workflow — the client never knows it isn't a plain service call.
Sync interaction (Update)
WithStartWorkflowOperation<String> start =
WithStartWorkflowOperation.newBuilder(workflow::process)
.setArguments(request.orderId()).build();
WorkflowUpdateHandle<String> update = client.startUpdateWithStart(
start, "submit" , WorkflowUpdateStage.COMPLETED, String.class, request);
return update.getResult();
POST endpoint blocks until the workflow returns. One round trip.
Driving the saga from Kafka (async)
A @KafkaListener that just needs to kick off work uses signal-with-start:
signalWithStart starts the saga for order-<id> if absent, then delivers the event — idempotent per key.
No blocking, no result — fire the event, commit the offset.
Same Workflow, two front doors: a synchronous HTTP Update, and an async Kafka Signal.
Async interaction (Signal)
@KafkaListener(topics = "orders")
void onOrder (OrderRequest request) {
OrderSagaWorkflow workflow = client.newWorkflowStub(
OrderSagaWorkflow.class,
WorkflowOptions.newBuilder()
.setWorkflowId("order-" + request.orderId())
.setTaskQueue("orders" ).build());
BatchRequest batch = client.newSignalWithStartRequest();
batch.add(workflow::process, request.orderId());
batch.add(workflow::onUpdate, request);
client.signalWithStart(batch);
}
Continue-as-new — why
History grows with every event . A Workflow that loops forever — a subscription, a counter, an actor — would grow its history without bound. That means slower replay and, eventually, hard limits.
Continue-as-new atomically ends the current run and starts a fresh one — same Workflow ID, clean history.
You hand forward only the state the next run needs: a checkpoint , not a memory dump.
To callers it's still one Workflow ID; underneath it's a chain of Runs .
Reach for it on unbounded loops and periodic/cron-style Workflows.
Continue-as-new
@Override
public void run (String subscriptionId, int eventCount) {
while (true ) {
Workflow.await(this ::hasNextEvent);
handleNextEvent();
eventCount++;
if (eventCount >= 1000 ) {
Workflow.continueAsNew(subscriptionId, 0 );
}
}
}
Continue-as-new is a checkpoint , not a memory dump. Carry only what's needed.
Lab · Day 5
Saga in Spring Boot
Challenge → day-05-saga-spring/lab-2-saga-spring-boot
Wire the saga into a Spring Boot app:
Register the Worker as a @Component with WorkerFactory lifecycle bound to the app context.
Drive the Workflow from a @RestController — start, signal, query.
Inject Activity dependencies (DB, HTTP clients) as Spring beans.
Goal: the saga runs inside Spring Boot, started from an HTTP endpoint.
Day 5
Capstone
Redesign a Kafka-triggered Airflow DAG as a Temporal Saga.
Day 5 · on-ramp
Capstone
Where this fits — You build: take a Kafka-triggered Airflow DAG and redesign it as a Temporal saga.
Why it matters — It's the rehearsal for the migration you'll do back at work.
By the end — You'll have a working saga you designed, against a clear acceptance bar.
The task
A customer signup flow. Kafka event customer-signup arrives with {userId, email, plan}. The DAG runs four tasks: create user, charge first month, provision tenant, send welcome email. Failure handling today is ad hoc.
Redesign it as a Saga. Demonstrate one compensation path.
Deliverable plan (75 min)
Time
Deliverable
0-10
Sketch Workflow signature + Activity interface + compensation order on paper
10-50
Implement enough Java to run the happy path + one failure path
50-65
Wire the Kafka trigger with signalWithStart
65-75
Run end-to-end against the local stack; demo one compensation
Acceptance criteria
At least three forward steps.
Compensation registered immediately after each step.
@KafkaListener triggering via signalWithStart.
One demonstrated failure → compensation visible in the Web UI history.
Lab · Day 5
Capstone
Challenge → day-05-saga-spring/lab-3-capstone
make stack-kafka
make temporal
Go. 75 minutes. Walk the room every 15. Unstick people on Spring config -
the lesson is in the saga shape, not the wiring.
Capstone review (25 min)
Two or three volunteer pairs share screen. The room critiques. Cover:
How did they decide what was a Workflow vs an Activity?
Where did they put idempotency keys?
Orchestration or choreography? Why?
What would they change for a 30-day saga?
Q&A + open migration planning (20 min)
Anchor questions if the room is quiet:
Pick one Airflow DAG. What's the first thing that would break in Temporal?
What's your team's hardest distributed-transaction failure? Would a Saga have caught it?
Where does "I think it ran but I'm not sure" happen in your stack? That's a Workflow.
Day 5 takeaways
Compensation is business logic, not generic undo. Design it on purpose.
Sync Updates via startUpdateWithStart for sync APIs.
Async Signals via signalWithStart for event-driven triggers.
Continue-as-new is a checkpoint, not a memory dump.
Day 6
AWS migration & container workloads
Replacing Glue + Lambda + Step Functions. Running Workers in Kubernetes.
Day 6
The AWS orchestration problem
Hidden complexity in Lambda + Glue + Step Functions.
Day 6 · on-ramp
The AWS orchestration problem
Where this fits — The morning's premise: where Lambda + Glue + Step Functions hide their complexity.
Why it matters — Naming the pain points tells you exactly what Temporal is replacing, and why.
By the end — You'll map each AWS piece to the state and recovery burden it leaves you.
State scatters
EventBridge rule fires.
Lambda validates, transforms, sometimes orchestrates.
Step Functions declares state transitions in JSON.
Glue runs Spark / Python shell, writes results to S3.
S3 is the handoff medium.
Every handoff = a chance for state to disagree. Recovery is a runbook.
AWS → Temporal map
AWS shape
Temporal shape
EventBridge → Lambda → Step Functions
Consumer (or thin Lambda) signalWithStarts a Workflow
Step Functions JSON states
Workflow branches through Java code
Glue Python writes S3 checkpoints
Activity returns result; history records the run
Glue Spark heavy transform
Activity starts Glue, heartbeats runId while polling
CloudWatch Lambda retry
RetryOptions with typed ApplicationFailure
S3 handoff between Lambdas
Activity returns S3 URI
DynamoDB checkpoint table
Workflow event history
When to keep AWS compute
Service
Keep when
Replace when
Lambda
<100ms, IAM-bound, one-shot
Multi-step coordination, retries, long waits
Glue Spark
Large distributed transforms (>10 GB)
Pure data movement; small batches
Glue Python
Tiny scripts (<1 min) with Glue catalog
Anything you'd write as a Java Activity
Step Functions
Already wired, low-change
Anything needing human steps or code review
Temporal supervises; AWS executes the heavy lift.
Day 6
Glue Spark as an Activity
The canonical supervise-AWS-compute pattern.
Day 6 · on-ramp
Glue Spark as an Activity
Where this fits — The core AWS pattern: supervise long external compute from a Workflow.
Why it matters — It's the template for wrapping any slow, out-of-process job — Glue, EMR, a batch API.
By the end — You'll launch and supervise a Glue job as an Activity, heartbeating its progress.
Supervising external compute — the pattern
You can't make a 30-minute Glue / EMR / Batch job itself durable. So wrap it in an Activity that supervises the job:
Start the external job — get back a job-run ID.
Poll its status in a loop, heartbeating the run ID each time.
Succeed / fail the Activity when the job finishes.
The heartbeat means a Worker restart re-attaches to the same run and resumes polling — it doesn't relaunch the job.
The Workflow stays oblivious — it just sees "a long Activity."
Temporal owns the orchestration & durability ; AWS still owns the compute .
Glue activity
@Override
public String runGlueJob (String jobName, String inputS3Uri) {
String runId = glue.startJobRun(
StartJobRunRequest.builder()
.jobName(jobName)
.arguments(Map.of("--input" , inputS3Uri))
.build()).jobRunId();
while (true ) {
Activity.getExecutionContext().heartbeat(runId);
JobRun jobRun = glue.getJobRun(
GetJobRunRequest.builder().jobName(jobName).runId(runId).build()).jobRun();
if (jobRun.jobRunState() == JobRunState.SUCCEEDED) return runId;
if (Set.of(FAILED, TIMEOUT, STOPPED).contains(jobRun.jobRunState()))
throw ApplicationFailure.newFailure(jobRun.errorMessage(), "GlueJobFailed" );
Thread.sleep(Duration.ofSeconds(15 ).toMillis());
}
}
Lab · Day 6
Glue Activity (LocalStack)
Challenge → day-06-aws-containers/lab-1-glue-activity
make stack-aws
make aws-init
make run-aws
awslocal s3 cp /tmp/test.csv s3://imports-incoming/test.csv
scripts/start-workflow.sh transform 1 ImportWorkflow "s3://imports-incoming/test.csv"
Watch the heartbeats in the Web UI as the polling loop runs.
Big data? Pass a reference, not the bytes
Every payload — Workflow args, results, Signals — is stored in history . History has a hard 2 MB cap, and the SDK warns around 256 KB.
Keep the bytes in S3 (or any blob store); pass only a URI + metadata through the Workflow.
The Activity reads/writes the object; history stays tiny and replay stays fast.
History is a control plane, not a data bus — move references, not megabytes.
S3 reference payloads
record TransformRequest (String inputS3Uri, String outputPrefix) {}
record TransformResult (String outputS3Uri, long rowCount) {}
Workflow history holds URIs + counts .
Activity owns the bytes.
Hard cap 2 MB per payload (SDK warns ~256 KB); large data via S3.
Workflow history is small. URIs travel cheap.
Encrypting payloads — the codec
By default, Workflow inputs, results, and Activity args are stored in history as plaintext . For sensitive data, encrypt at the SDK boundary:
A PayloadCodec encrypts on the way out and decrypts on the way in — the server only ever stores ciphertext.
A codec server lets the Web UI and CLI decrypt for authorized viewers. So history stays human-readable, but it's never exposed at rest.
End-to-end encryption — the Temporal server (and its operators) never see your plaintext.
Codec server - encrypt payloads
class EncryptionCodec implements PayloadCodec {
public List<Payload> encode (List<Payload> p) { }
public List<Payload> decode (List<Payload> p) { }
}
DataConverter converter = new CodecDataConverter (
DefaultDataConverter.newDefaultInstance(), List.of(new EncryptionCodec ()));
WorkflowClient.newInstance(service,
WorkflowClientOptions.newBuilder().setDataConverter(converter).build());
Server stores ciphertext only. A standalone codec server lets the Web UI decode on demand.
Lab · Day 6
Replace S3 checkpoints
Challenge → day-06-aws-containers/lab-2-s3-checkpointing
Take a hypothetical existing pipeline that writes a checkpoint S3 key after every step.
Identify which checkpoints are resume points . Those become Workflow state.
Identify which checkpoints are handoffs . Those become Activity return URIs.
Sketch the Workflow signature. What's input? What's output?
No new code; redesign on paper. 15 minutes.
From state machine to code
Migrating an AWS Step Functions state machine? The mapping is direct. The payoff: you leave ASL/JSON behind.
States → plain code — choices become if, parallel states become Async, retriers become a RetryPolicy.
No 25,000-event / 1-year ceilings , no JSON DSL — it's just a Workflow function.
Activities wrap the same Lambdas / jobs you already invoke.
A state machine becomes ordinary control flow you can read, test, and debug.
Step Functions → Temporal
@Override
public void run (String inputS3Uri) {
ValidatedFile file = activities.validate(inputS3Uri);
String transformedUri = activities.transform(file.cleanInputUri());
LoadResult loadResult = activities.load(transformedUri);
activities.publishNotification(loadResult);
}
Compare to the equivalent ASL: ~30 lines of JSON state machine with Resource arns.
Lab · Day 6
Migrate a Step Functions pipeline
Challenge → day-06-aws-containers/lab-3-stepfunctions-migration
Take a four-state ASL state machine (validate → transform → load → notify):
Map each Task state to an Activity; the state machine becomes Workflow code.
Replace Retry / Catch blocks with Temporal RetryOptions + try/catch.
Run it end-to-end against LocalStack.
Goal: the ASL JSON is gone; control flow lives in Workflow code.
Day 6
Events in, events out, config
SQS triggers a Workflow · SNS notifies the world · SSM holds the config.
Day 6 · on-ramp
Events in, events out, config
Where this fits — You've migrated the orchestration ; now wire the edges : an inbound trigger, an outbound notification, and externalized config.
Why it matters — These are the seams where AWS at-least-once delivery meets Temporal at-least-once execution — idempotency is the whole game.
By the end — You'll trigger a Workflow from SQS, fan out a result via SNS, and boot a Worker from SSM Parameter Store.
SQS as the trigger
An EventBridge → Lambda → StartExecution chain collapses into a long-poll SQS consumer that signalWithStarts a Workflow. The consumer is plain glue. It lives outside any Workflow, so the determinism rules don't apply.
"Standard queues ensure at-least-once message delivery, but due to the highly distributed architecture, more than one copy of a message might be delivered, and messages may occasionally arrive out of order."
— Amazon SQS Developer Guide · docs.aws.amazon.com
The duplicate is fine : signalWithStart is idempotent on the Workflow Id.
Why signalWithStart absorbs duplicates
"When you send this operation, if there is a running Workflow Execution with the given Workflow Id, it will be Signaled. Otherwise, a new Workflow Execution starts and is immediately sent the Signal."
— Temporal Documentation · docs.temporal.io
Derive the Workflow Id from the event (e.g. import-<bucket>-<key>).
First delivery starts ; any redelivery signals the same run. No duplicate Workflow.
Delete the SQS message only after the signal is durable in Temporal — at-least-once, never at-most-once.
SQS → signalWithStart bridge
for (Message m : resp.messages()) {
FileEvent event = parse(m.body());
ImportWorkflow stub = client.newWorkflowStub(ImportWorkflow.class,
WorkflowOptions.newBuilder()
.setWorkflowId("import-" + event.bucket() + "-" + event.key())
.setTaskQueue("transform" ).build());
BatchRequest batch = client.newSignalWithStartRequest();
batch.add(stub::run, event.s3Uri());
client.signalWithStart(batch);
sqs.deleteMessage(DeleteMessageRequest.builder()
.queueUrl(queueUrl).receiptHandle(m.receiptHandle()).build());
}
Long poll (waitTimeSeconds(20)), not a hot spin — fewer empty receives, lower cost.
Lab · Day 6
SQS event trigger
Challenge → day-06-aws-containers/lab-6-sqs-event-trigger
make stack-aws && make aws-init
awslocal sqs send-message --queue-url \
http://localhost:4566/000000000000/imports-events \
--message-body '{"bucket":"imports-incoming","key":"orders.csv","s3Uri":"s3://imports-incoming/incoming/orders.csv"}'
One Workflow starts; the redelivery just re-signals it. Idempotency, observed.
SNS to notify outward
The final notify step moves into Temporal as an Activity that publishes to SNS. Subscribers stay decoupled.
"The Fanout scenario is when a message published to an SNS topic is replicated and pushed to multiple endpoints, such as Firehose delivery streams, Amazon SQS queues, HTTP(S) endpoints, and Lambda functions."
— Amazon SNS Developer Guide · docs.aws.amazon.com
Publishing is I/O → it's an Activity , never Workflow code.
Activities are at-least-once, so a retry may double-publish . To handle it, put workflowId+runId in the message and let subscribers dedup.
Idempotent by design — both sides
"Activities follow an at-least-once execution model." … "Because Activities may be retried due to failures, it's strongly recommended to make them idempotent."
— Temporal Documentation · docs.temporal.io
Edge
AWS guarantee
Make it safe
SQS in
at-least-once delivery
Workflow Id from the event → signalWithStart dedupes
SNS out
at-least-once publish
carry workflowId+runId; subscribers dedup (or FIFO topic)
The pattern is the same on every edge: an idempotency key the receiver can dedup on.
Lab · Day 6
SNS fan-out notify
Challenge → day-06-aws-containers/lab-7-sns-fanout-notify
make stack-aws && make aws-init
awslocal sqs receive-message --queue-url \
http://localhost:4566/000000000000/imports-complete-sub
The subscribed SQS queue receives the completion event — the publisher just moved into an Activity.
SSM Parameter Store for config & secrets
Stop baking config into the image or passing secrets as plaintext env. Read them from SSM Parameter Store at Worker startup.
"SecureString parameters encrypt values using AWS Key Management Service, making them a practical choice for lightweight encrypted configuration values…"
— AWS Systems Manager User Guide · docs.aws.amazon.com
Config loads in process-startup code , not Workflow code (reading SSM in a Workflow is non-deterministic).
If a step needs a value at runtime, read it inside an Activity — for example, a fetchApiKey Activity that decrypts a SecureString.
Boot the Worker from SSM
WorkerConfig cfg = loadConfig(ssm, "/temporal-training/worker/" );
WorkflowServiceStubs service = WorkflowServiceStubs.newServiceStubs(
WorkflowServiceStubsOptions.newBuilder().setTarget(cfg.temporalAddress()).build());
WorkflowClient client = WorkflowClient.newInstance(
service, WorkflowClientOptions.newBuilder().setNamespace(cfg.namespace()).build());
setTarget / setNamespace build the real stubs from config — not hard-coded, not newLocalServiceStubs().
Lab · Day 6
Config & secrets from SSM
Challenge → day-06-aws-containers/lab-8-ssm-parameter-store
make stack-aws && make aws-init
awslocal ssm get-parameters-by-path \
--path /temporal-training/worker/ --with-decryption
Boot the Worker from Parameter Store; the SecureString is decrypted via KMS, never logged.
Day 6
Workers as containers
No HTTP server. Process-level probes. Graceful shutdown.
Day 6 · on-ramp
Workers as containers
Where this fits — The afternoon: running Workers the way the rest of your services run — as containers.
Why it matters — A Worker has no HTTP port, so health checks and shutdown work differently than your web apps.
By the end — You'll containerize a Worker, set the right probes, and autoscale it on queue backlog.
Mental model
A Worker is a long-lived process polling Task Queues outbound .
No inbound traffic. No Service, no Ingress.
Health = "is the process polling?" Use a pgrep exec probe, or an HTTP /health endpoint (Actuator or HttpServer) that returns 200 only after WorkerFactory.start().
Graceful shutdown = drain in-flight Activities; SIGTERM, then heartbeat-cancel.
Dockerfile
FROM maven:3.9 -eclipse-temurin-17 AS build
WORKDIR /src
COPY pom.xml .
COPY src ./src
RUN mvn -q -DskipTests package
FROM eclipse-temurin:17 -jre
WORKDIR /app
COPY --from=build /src/target/worker.jar /app/worker.jar
ENV JAVA_TOOL_OPTIONS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75"
ENTRYPOINT ["java" , "-jar" , "/app/worker.jar" ]
Lab · Day 6
Docker build
Challenge → day-06-aws-containers/lab-4-worker-container
cd examples/runnable/08-aws-containers
docker build -t temporal-transform-worker:dev .
docker run --rm \
-e TEMPORAL_ADDRESS=host.docker.internal:7233 \
temporal-transform-worker:dev
Confirm the Worker connects to the host's Temporal and starts polling.
Workers are just Deployments
A Temporal Worker is a stateless process that polls — so it's an ordinary Kubernetes Deployment:
No ingress, no Service, no ports — Workers dial out to Temporal; nothing connects in.
Scale with replicas — each pod is an identical Worker on the same Task Queue.
Liveness = "is the process polling?" (next: KEDA scales the replicas on backlog).
Nothing special to deploy: containerise the Worker and run N replicas.
Kubernetes Deployment
spec:
replicas: 2
strategy:
rollingUpdate:
maxUnavailable: 0
maxSurge: 1
template:
spec:
terminationGracePeriodSeconds: 120
containers:
- name: worker
image: <ecr>/temporal-transform-worker:latest
readinessProbe:
exec: { command: ["sh" , "-c" , "pgrep -f worker.jar > /dev/null" ] }
livenessProbe:
exec: { command: ["sh" , "-c" , "pgrep -f worker.jar > /dev/null" ] }
terminationGracePeriodSeconds ≥ longest startToCloseTimeout.
Lab · Day 6
K8s deploy on kind
Challenge → day-06-aws-containers/lab-5-kubernetes-keda
make kind-up
make kind-load
kubectl apply -f examples/runnable/08-aws-containers/k8s-worker-deployment.yaml
kubectl rollout status deployment/temporal-transform-worker
kubectl logs -l app=temporal-transform-worker --tail =20
Confirm the Worker polls the cluster's Temporal address.
Autoscaling on Task Queue backlog
CPU is the wrong scaling signal for a Worker. The right signal is how much work is waiting :
KEDA scales the Worker Deployment on a Temporal metric : the Task Queue's backlog / schedule-to-start lag.
Backlog grows → add Worker pods; the queue drains → scale back down (even to zero).
Scale to queue depth, not CPU — the metric that actually means "we're falling behind."
KEDA temporal scaler
kind: ScaledObject
spec:
scaleTargetRef: { name: temporal-transform-worker }
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: temporal
metadata:
endpoint: temporal-frontend.temporal.svc.cluster.local:7233
namespace: production
taskQueue: transform
queueType: ActivityTaskQueue
targetQueueSize: "20"
Native scaler polls DescribeTaskQueue. No Prometheus exporter needed.
Lab · Day 6
KEDA autoscale
Challenge → day-06-aws-containers/lab-5-kubernetes-keda
kubectl apply -f examples/runnable/08-aws-containers/keda-scaledobject.yaml
make load-transform N=200
kubectl get scaledobject,pods -l app=temporal-transform-worker -w
Watch replica count climb from 1 → ~5 as backlog grows.
KEDA autoscaling, live
300 workflow tasks backlogged on transform, nothing draining → KEDA scales the Deployment to max:
$ kubectl get scaledobject,hpa,deploy
ScaledObject temporal-transform-worker TRIGGER=temporal READY=True ACTIVE=True
HPA keda-hpa-…transform-worker TARGETS=30/20 (avg) REPLICAS=10
Deployment transform-worker READY=10/10 AVAILABLE=10
Real run on kind: an in-cluster Temporal + KEDA; replicas went 1 → 10 within one poll.
What you're seeing
KEDA scaled transform-worker from 1 → 10 off the Temporal task-queue backlog:
KEDA's Temporal scaler polls DescribeTaskQueue — no Prometheus exporter needed for this one.
Backlog ÷ targetQueueSize (20) drives the HPA: 300 / 20 = 15 → capped at maxReplicaCount 10 .
Drain the backlog and it scales back to minReplicaCount — which can be 0 (scale-to-zero).
CPU/memory would never catch this — Workers are I/O-bound; queue depth is the real demand signal.
Glue Activities in containers
Worker pod runs Glue-orchestration Activities.
IRSA , not access keys: eks.amazonaws.com/role-arn on the ServiceAccount.
Outbound to Temporal frontend (Cloud or self-hosted ELB).
Outbound to AWS APIs via VPC endpoints.
No bundled access keys. IRSA + VPC endpoints is the production shape.
Temporal Cloud vs EKS self-hosted
Concern
Temporal Cloud
EKS self-hosted
Setup time
Hours
Weeks
Persistence
Managed (Cassandra)
You run PostgreSQL / Cassandra
Upgrades
Automatic
You schedule
Multi-region
Built-in (premium)
You design replication
Cost shape
Per-action
Fixed infra
Audit / compliance
SOC2, HIPAA tiers
You provide evidence
Use Cloud unless you have a specific reason not to.
Lab · Day 6
End-to-end S3 → Temporal → S3
Challenge → day-06-aws-containers/lab-2-s3-checkpointing
make stack-aws
make temporal
make run-aws
awslocal s3 cp test-input.csv s3://imports-incoming/
scripts/start-workflow.sh transform end2end ImportWorkflow "s3://imports-incoming/test-input.csv"
awslocal s3 ls s3://imports-output/
Verify:
Three Activity completions in the Web UI.
Workflow history holds URIs + rowCount, not file bytes.
Day 6 · optional
Running it on real AWS
ECS · EKS · Aurora · Route 53 — the production shapes (real account required).
Day 6 · on-ramp
Running it on real AWS
Where this fits — The same Worker you containerized, now on the AWS primitives a real deployment uses.
Why it matters — The Worker barely changes; what changes is identity (IRSA/task roles), scaling (CloudWatch vs KEDA), and state (Aurora, Route 53).
By the end — You'll read the manifests for ECS autoscaling, EKS+IRSA, an idempotent Aurora sink, and Route 53 failover — and know the traps.
ECS Fargate — autoscale without KEDA
ECS has no native Temporal scaler . So you assemble what KEDA bundles: publish Task Queue backlog as a custom CloudWatch metric , then target-track it.
"Amazon ECS leverages the Application Auto Scaling service to provide this functionality." … target tracking: "Increase or decrease the number of tasks that your service runs based on a target value for a specific metric."
— Amazon ECS Developer Guide · docs.aws.amazon.com
A tiny backlog publisher calls DescribeTaskQueue → PutMetricData (Temporal/Worker / TaskQueueBacklog).
Application Auto Scaling target-tracks it — the CloudWatch analog of KEDA's targetQueueSize.
The Service has no load balancer — Workers are outbound-only (same as the Deployment).
Lab · Day 6 · optional
ECS Fargate autoscaling
Challenge → day-06-aws-containers/lab-9-ecs-fargate-autoscaling
Optional · real AWS account required. No make targets — walk the manifests under examples/07-aws-containers/aws/, apply only with an account, and tear down after (Fargate + CloudWatch cost real money).
ecs_task_definition.json — secrets from SSM, stopTimeout: 120 (graceful drain).
ecs_autoscaling.json — target-track the backlog metric, min 1 / max 10.
EKS — the kind lab, for real
The Deployment and the same KEDA ScaledObject run unchanged on EKS. The only real additions: a real cluster, and IRSA instead of LocalStack's dummy keys.
"Instead of creating and distributing your AWS credentials to the containers or using the Amazon EC2 instance's role, you associate an IAM role with a Kubernetes service account and configure your Pods to use the service account."
— Amazon EKS User Guide · docs.aws.amazon.com
Annotate the Worker's ServiceAccount with eks.amazonaws.com/role-arn → Activities call S3/Glue/SNS/SSM with no static keys .
Backlog grows → KEDA adds pods → Cluster Autoscaler / Karpenter adds nodes .
Lab · Day 6 · optional
EKS + IRSA
Challenge → day-06-aws-containers/lab-10-eks-irsa-deployment
Optional · real AWS account required. The kind cluster from Lab 6.5 is the local stand-in; this is the production graduation. eksctl delete cluster when done.
# proof there are no static keys — run from inside the pod:
$ aws sts get-caller-identity
Arn: arn:aws:sts::<acct>:assumed-role/temporal-worker-irsa/...
Same manifests as kind — only serviceAccountName + the ECR image change.
Aurora — effectively-once over at-least-once
The load step persists to Aurora (Postgres-compatible). Activities are at-least-once, so a retried load must not double-insert.
"Amazon Aurora (Aurora) is a fully managed relational database engine that's compatible with MySQL and PostgreSQL."
— Amazon Aurora User Guide · docs.aws.amazon.com
Make the write idempotent : INSERT … ON CONFLICT (idempotency_key) DO NOTHING, one transaction per batch.
The key must be stable across attempts — derive it from workflowId, never wall-clock/random.
Bounded connection pool (HikariCP) sized to Activity concurrency; creds via SSM / IAM DB auth — the relational sibling of the DynamoDB conditional-write.
Lab · Day 6 · optional
Aurora transactional sink
Challenge → day-06-aws-containers/lab-11-aurora-transactional-sink
Optional · real AWS account required for Aurora — but the idempotency logic is drillable against any local Postgres , which is the real lesson.
INSERT INTO loaded_rows (idempotency_key, ...)
VALUES (:wf_id || ':' || :batch, ...)
ON CONFLICT (idempotency_key) DO NOTHING;
Kill the Worker mid-load; the row count is identical after retry.
Route 53 — reachability ≠ replication
For a self-hosted frontend, Route 53 gives Workers a stable DNS name and health-checked failover. (On Temporal Cloud you get a managed global endpoint — skip this.)
"Failover routing lets you route traffic to a resource when the resource is healthy or to a different resource when the first resource is unhealthy."
— Amazon Route 53 Developer Guide · docs.aws.amazon.com
The trap: DNS failover makes a frontend reachable — it does not replicate Workflow state. Cross-region durability needs Temporal multi-cluster replication , not DNS.
Long-lived gRPC connections don't re-resolve per call — failover helps reconnects , so set sane TTLs.
Lab · Day 6 · optional
Route 53 failover endpoint
Challenge → day-06-aws-containers/lab-12-route53-failover
Optional · real AWS account required (hosted zone + health checks). For self-hosters only.
PRIMARY/SECONDARY alias records → regional frontend NLBs, gated by health checks.
TEMPORAL_ADDRESS points at the stable name, never a raw LB hostname.
Remember: this is frontend reachability , not data durability .
Day 6 takeaways
Temporal replaces orchestration state , not all compute. Keep Glue Spark; replace Step Functions JSON.
Every AWS edge is at-least-once — SQS in, SNS out. An idempotency key the receiver dedups on reconciles it with Temporal's at-least-once Activities (signalWithStart, ON CONFLICT).
Externalize config/secrets to SSM ; read bootstrap at startup, per-run secrets in an Activity.
Workers have no inbound traffic. Use exec probes or add Actuator deliberately.
Scale on queue backlog : KEDA's native scaler on EKS; a published CloudWatch metric on ECS. Identity via IRSA / task roles , never static keys.
Route 53 failover is frontend reachability, not data replication — cross-region DR needs Temporal multi-cluster replication.
Cloud is the default for new deployments. Self-host only with a clear reason.
Course close
What you have now
A complete Temporal mental model and the patterns to ship with.
Where to go next
Take the capstone from Day 5 back to your team. Ship it side-by-side with the existing implementation.
Stand up the observability stack in your real env. Get the metrics flowing first.
Start the replay corpus . One captured history per non-trivial Workflow.
Pick one Airflow DAG to migrate using the framework.
The four habits
When you'd write a runbook, write a Workflow instead.
Workflow code is deterministic; all I/O lives in Activities.
signalWithStart / startUpdateWithStart are the bridge primitives.
Capture histories; replay them in CI.
Your hardest distributed-transaction bug today is a feature Temporal already solved.
The cost is learning a new model. The reward is fewer runbooks.
6 days × 4 hours.
Each day mirrors a day in lecture_notes/Day-XX.md.
Lab slides are marked - laptops out, fingers on keyboards.
Pace check: end of Day 1 should leave the room with one Workflow running.
Quick orientation slide.
Don't dwell - each Day cover slide opens the detailed agenda for that block.
4 hours. Morning interleaves concepts with the first hands-on labs;
afternoon goes deeper on the event history.
Get everyone's environment green before teaching anything else.
Wait until every laptop is green.
Pair the stragglers.
Don't proceed without this; the rest of the day depends on it.
Open in VSCode: examples/01-foundations/airflow_dag_vs_temporal_workflow.java + .py - DAG shape vs durable code, side by side.
Read in different voices.
Each shape will resonate with someone in the room.
The slogan to repeat across the day: runbook, not a button.
Don't bash any tool.
Each solves a real problem.
The point is the seam each leaves open, not that any is bad.
This is the paradigm shift, stated once, early. Everything on Day 1 builds on
"durable function," not "graph of tasks."
Airflow is schedule-first and polls; Temporal reacts to events with sub-second
latency - that's why it fits user-facing flows Airflow can't serve.
Open in VSCode: examples/01-foundations/core_primitives.java - all four roles in one file.
Run: make run-hello
Four cards, four primitives.
Task Queue is JUST A STRING.
Not Kafka.
Not a DB.
It's a routing key.
This is just a Java interface.
The SDK uses the @WorkflowInterface annotation to identify it via reflection.
factory.start() kicks off the long-poll loop.
Workers connect outbound.
The mental trap is "one Worker = one Workflow". It isn't.
A Worker is a process that holds a registry. Whatever shows up on its Task Queue,
if the type is registered, it runs it.
Start here: the simplest topology is a single Worker that knows everything.
Why slice it instead of running l identical Workers?
Different work has different needs: a GPU box registers only the inference
Activity; a lightweight pool registers the orchestration Workflows.
A Worker only pulls work it has registered. Mismatched Task Queue = task sits
unhandled. This is a config bug people hit on day one.
This is the "so what" behind the l-Workers slide.
A workflow task is microseconds of decision-making. An Activity might pin a core
for a minute or block on a slow API.
Put them on the same Worker and the slow ones eat all the slots. Separate Task
Queues let you size each pool independently — more activity slots here, a bigger
box there. Tie back to WorkerOptions on the previous slide.
The rule in one line: route by Task Queue, register the subset that queue needs.
The failure mode is the day-one bug - an Activity routed to a queue no Worker
serves just sits there retrying. Homogeneous within a queue, heterogeneous across.
Open in VSCode: examples/01-foundations/activity_task_routing.java · Run: make run-routing
Demo: make run-routing (one Worker process, 3 pools) + make run-routing-starter.
The two pools log independently - each Activity lands on the queue its stub
named, and no single Worker registered the full set. Runnable end-to-end in
examples/runnable/15-task-queue-routing (Java/Python/Go).
For Airflow rooms, this slide is the moment of recognition.
XCom-becomes-a-return-value gets the biggest reaction.
The scaling-limit row is the honest one: Temporal isn't free of limits, it just
moves them. History size is the constraint you design around (continue-as-new on
Day 5).
Pair this with the migration framework on Day 4 - "migrate where Temporal earns
its keep."
The cluster is language-agnostic; the SDK is how YOUR language speaks to it. Every
primitive from the last section (Workflow, Activity, Worker, Task Queue) is a type
in this library. Java is our working language; Python and Go appear at the end so
the room knows the same model ships everywhere.
Reference throughout: javadoc.io/doc/io.temporal/temporal-sdk/latest
The headline: you don't call a REST API to "do durable execution." You link a
library, and the library both reaches the cluster AND hosts your code under the
replay contract. The four bullets map 1:1 to the packages on the next slide.
Don't read every cell - point at the split: client+serviceclient are the OUTSIDE
(your HTTP handler / starter), worker is the HOST, workflow+activity are the CODE
you write, common is the shared knobs. This is the whole API surface in one frame;
the Javadoc index has this exact package list.
This is the single most-used class in the SDK and the bridge back to the
non-determinism slides coming next. The rule: if the JDK version would give a
different answer on replay, there's a Workflow.* that records it instead.
This is the starter side from the Hello lab. Three steps: connect, make a stub,
call it. A direct method call blocks for the result; WorkflowClient.start(wf::greet,
"Ada") returns immediately with a WorkflowExecution handle - the async path.
One artifact gets you everything on the package slide. temporal-testing (the
TestWorkflowEnvironment + replay harness) lands on Day 4 - flag it now so the
dependency is already familiar.
The bullet from "What the SDK actually does" (payload (de)serialization) cashed
out. The mental model: nothing custom crosses the wire - it's all Payloads.
DefaultDataConverter is a chain of PayloadConverters; JSON is the catch-all that
handles records/POJOs, which is why "make it JSON-serializable" is the rule.
Override the converter at WorkflowClientOptions (Day 6 codec slide stacks a
PayloadCodec on top to encrypt). Limits are the next slide - the "so what".
This is the slide they asked "limits?" about. Four kinds of limit, not just size:
serializability, size, replay cost, and schema evolution - the last is the
sneaky one. Tie size back to the 2 MB / 256 KB figures and the 50 MB / 51,200
event history cap from the event-sourcing section. The fix for big data (S3
reference payload) and for sensitive data (PayloadCodec encryption) both land on
Day 6 - this slide plants the rule, Day 6 shows the pattern. Jackson default:
unknown properties are ignored, so additive changes are safe; renaming a field
is a breaking change for any open Workflow whose history holds the old name.
This is the reference slide people screenshot. Walk it outside-in: stubs are the
socket, client is the namespace-scoped entry point, factory owns the JVM-wide
cache + threads, worker is per-Task-Queue, impl options are per-Workflow-type.
Contrast with the previous slide: those configure the Worker once at boot; these
are per-execution and can change call to call. RetryOptions is the common nested
piece — point back to the "Setting them deliberately" slide.
Optional / awareness only - the room is Java-first. The point of these three
slides: the mental model is the product, not the Java API. A polyglot shop runs
Workflows in one language and Workers in another against the same cluster.
Same three moves - connect, start, run a worker - in async Python. The Workflow.*
toolkit becomes the workflow module (workflow.sleep, workflow.now, workflow.uuid4).
No annotations - Go uses function signatures and the workflow.Context. Same shape:
Dial, ExecuteWorkflow, Get. worker.New(c, "queue", opts).Run() hosts it.
The takeaway slide for the optional block: the columns differ, the rows don't.
This is why a team can adopt Temporal without a language rewrite.
Open in VSCode: examples/01-foundations/deterministic_replay_bad.java vs deterministic_replay_good.java - diff them side by side.
Whiteboard moment.
Walk through with arrows.
"Replay" doesn't re-execute side effects - Activity results are READ from history.
Reference card. The throughline: anything whose value the Worker can't reproduce
on replay must be sourced from history, not recomputed. Walk each family:
1. TIME - System.currentTimeMillis() returns a new value every replay. The first
run records 10:00:00; a replay tomorrow recomputes 10:00:01 and the code
branches differently. Workflow.currentTimeMillis() returns the value recorded
in history, so every replay sees the same instant.
2. RANDOM - same trap. new Random() reseeds from the system clock; replay gets a
different number. Workflow.newRandom() seeds deterministically from the run and
records the seed, so the sequence is reproducible. (Use this for jitter, IDs,
A/B bucketing - not java.util.Random.)
3. I/O - reading a file, calling an HTTP endpoint, or hitting a DB gives a
different answer each replay AND fires the side effect twice. There is no
Workflow.* substitute - the fix is to MOVE it into an Activity. Activities run
once and their result is recorded; replay reads the result from history.
4. CONCURRENCY - the biggest aha. Thread.sleep blocks a Worker thread for the full
duration; Workflow.sleep records a timer and the Worker FORGETS the Workflow
entirely (lead-in to the next slide). Same rule for threads/locks: use
Workflow.newThread / Async / Workflow primitives, never raw java.lang.Thread,
so the SDK controls scheduling deterministically.
5. ITERATION ORDER - HashMap/HashSet have no guaranteed order, and it can differ
across JVM versions or runs. If you iterate one to make a decision (pick first,
sum in order, branch on order) replay can diverge. Fix: use a TreeMap /
LinkedHashMap or sort the keys before iterating. "risky," not "always wrong" -
it only breaks if order affects a recorded decision.
Tie it back to the replay rule slide: every one of these makes the re-run reach a
DIFFERENT decision than history = non-determinism error, Workflow stuck/failed.
Ask: "how would you wait 30 days for an email opt-in today?" Compare to one line.
This is the artifact event sourcing produces. Don't dissect every event yet
(that's Lab 1.3) - the goal is that when they open the UI in the next lab, the
list isn't a wall of noise: they can spot the lifecycle bookends and the
Activity trio. Each Workflow Task = one "the Worker woke up, decided, recorded."
`localhost:8233` opens here. The left rail (Workflows, Schedules, Workers,
Batch...) is the whole product; Workflows is where you live on Day 1. The
columns are named on the right, beside the screen.
Click any row on the previous screen to land here. Point at Input "Ada" and the
Result up top, then the numbered event list beside it. The parts are named on
the right.
The consolidated "limits?" reference — pairs with the retention/pruning slides
above and the serialization-limits slide in the SDK section. Two kinds of limit:
SIZE/COUNT (the table — hit the hard cap and the server terminates the run) and
STRUCTURAL (one open run per ID; ~2,000 pending activities/children before the
Workflow Task fails). All numbers are dynamic-config defaults: historyCount
warn/error 10,240/51,200, historySize 10/50 MB, blobSize ~256 KB/2 MB,
NumPendingActivities/ChildExecutions 2,000. Escape hatches: Continue-As-New
(Day 5) for history growth, S3 reference payloads (Day 6) for big blobs, bounded
fan-out for pending-work caps. Say out loud: you never tune these up — you design
under them.
This is the real production topology, not a toy wiring: the Worker is a
long-lived deployment; the starter is whatever fires work - an HTTP handler, a
cron, a CLI. Both only ever talk to the server.
Have one person KILL the Worker mid-run on purpose.
The Workflow completes when the Worker restarts.
This is the most important moment of Day 1.
This is the #1 "huh?" when people first read a history. They set the queue to
hello-temporal but see a hostname:uuid on later Workflow Tasks. The next slide
explains why. JSON is from the Event History "JSON" toggle / `workflow show -o json`.
This is the question students ask: "what if a disconnected Worker comes back?"
The answer is the CAS on history version - not locking, not leader election at
the Worker level. Workers are interchangeable; the DB is the arbiter. Tie the
activity-idempotency point back to the retries lab.
The synthesis slide. Students meet sticky cache, replay, and determinism as
three separate topics; this is where they click into one. Cache = derived view;
history = source of truth; determinism = why the view is reproducible elsewhere.
The classic Day-1 confusion. Workflow ID is yours and reusable; Run ID is one
physical attempt. continue-as-new keeps the ID, mints a new Run - the Day-5 lab.
On the Workflows list: Type column = GreetingWorkflow, Workflow ID column =
hello-temporal-demo. People assume the Type is the ID, or that the Task Queue
must match one of them. It doesn't - they're orthogonal.
Run: make run-hello, then read the Web UI event history. Dump it from the CLI with examples/01-foundations/history_cli.sh.
Simplified mental model first - the next slide shows the real topology.
Trace one Workflow start: SDK → Frontend → History (write
WorkflowExecutionStarted) → Matching → Worker polls.
Frontend is a gateway in front of three peer services; History and Matching both
own persistence. Your Workers live OUTSIDE this box and connect outbound to
Frontend on :7233. Keep the source credit on the slide.
The naming trap: "Worker Service" inside the cluster is internal background work.
The Worker YOU write and deploy is a separate process polling Matching via Frontend.
Walk this slowly on the whiteboard. Two services the Worker never talks to
directly: it polls through Frontend, Matching dispatches, History owns the
record. Matching → History RecordWorkflowTaskStarted is the easily-missed hop:
the Worker pulling a task is itself an event (WorkflowTaskStarted) before code
runs.
The key insight: nothing the Worker does is trusted until History has written
the resulting event. Crash anywhere and replay rebuilds from the persisted
history. Next slide turns this trace sideways into the ledger it produces.
This is the "not just an interaction diagram" payoff: same run, turned 90° into
the durable record. Eleven events, eleven state transitions.
Point out the rhythm: every Workflow Task is a Scheduled→Started→Completed
triple, and a Command (ScheduleActivityTask, CompleteWorkflowExecution) only
ever lands as an Event. The Worker proposes; History decides and records.
Tie forward to determinism: on replay the SDK re-runs the code and checks the
Commands it regenerates against events 4 and 10. Reorder your Activities and
event 4 won't match — that's the non-determinism error they'll meet on Day 3.
Same seven steps as the trace, turned into lanes so the asymmetry is visible:
all the persistence and decisioning happens in the middle (History), the Worker
only ever talks to the edge (Frontend). The two left-pointing arrows into
History — RecordWorkflowTaskStarted and RespondWFTCompleted — are the moments a
poll/response turns into a persisted event.
The single rule behind every lifecycle diagram: a Command is intent; an Event is
fact. The names even rhyme — Schedule→Scheduled, Start→Started/Initiated. This is
why "do I/O in Activities, not Workflow code": only Commands round-trip through
History, so only Command-shaped effects are durable and replayable.
The mechanism under the trace. Step ② is the one that surprises people — the
Worker doesn't resume a paused thread, it re-runs the code from the top against
the recorded history every single Workflow Task. The Worker is stateless between
tasks; the history is the state. Sticky cache (mention WorkerOptions) is just an
optimization that skips ② when the same Worker holds the run in memory.
Zoom all the way out. The previous slides were one trip through Running; this is
the shape of the whole execution. Worth naming the distinction now: Canceled is
cooperative (the code gets a chance to clean up), Terminated is forceful (the
server kills it, no cleanup). Continue-As-New seeds Day 2's long-running
patterns — same Workflow ID, brand-new Run ID and empty history.
The most dangerous menu for newcomers. Terminate forfeits compensation - prefer
Cancel. Reset recovers from a bad code deploy: pick an event, reset, the Worker
replays forward with current code. All three are in the per-Workflow More Actions.
Connections.fromEnv() is the shared base; Cloud (next) feeds it credentials and
takes the TLS branch. Worker and starter are separate processes, both reading
the same env.
Mirror of the Cloud slide and built on the SAME worker - here the plaintext
branch, Cloud the TLS branch. Stop 'make temporal' first; the stack binds :7233.
Good moment to restart the temporal container and show the Workflow survived.
Punchline to say out loud: laptop -> Docker -> Cloud is all env, not a rewrite.
Optional - demo-only if attendees have no Cloud creds. Same shared module as the
Docker slide (examples/runnable/01b-hello-temporal-anywhere); with no creds it
falls back to plaintext, so it still compiles and runs against make temporal.
This grep-able view is the production debugging starting point.
Show it now; it'll come back on Day 4 for replay tests.
The first slogan to repeat.
If only one thing sticks for Day 1, it's the re-execution-vs-recorded-results
distinction.
4 hours: morning is reliability mechanics; afternoon is signals/queries/
updates/schedules/children.
Lots of code.
Open in VSCode: examples/02-reliability/async_activity.java, parallel_fanout_allof.java
Run: make run-async
Start from what they know. Everyone has written blocking code; name it so the
contrast on the next slides has something to push against. No Temporal yet -
this is pure "how code waits". For the non-async crowd, this slide is the floor.
The claim-ticket analogy is the load-bearing metaphor for the entire section.
A Promise is a receipt for a result that isn't ready yet. Async hands it back
immediately; .get() is the only thing that waits. Land this hard before any code.
The punchline that makes "parallel" cost nothing. Same orders, same counter -
the only variable is whether you wait between orders or after all of them.
No threads, no executor pools. Hold here until heads nod.
Answer: place all three orders first (2, 4, 5 in any order), THEN collect all
three (1, 3, 6). Placing = Async.function; collecting = .get(). The trap answer
interleaves place/collect - that's the sequential 12-minute version.
2-minute solo, then reveal. Tie each step back to the API name out loud.
The lead-in before any Async code. The whole section is one idea: a Promise lets
the Workflow wait without occupying a thread, so "parallel" costs nothing. Get
this and the fan-out code is obvious.
Sequential: 2+2+1+1 = 6s. This version: the two extracts overlap (2s), then the
transforms run sequentially (1+1) = 4s total. Lesson: only the work you START
before waiting overlaps; anything you call sequentially after a .get() is still
sequential. Follow-up: "how would you parallelize the transforms too?" - start
both, then get both.
Name the pattern with a physical picture before any stream()/Promise code. Many
in the room haven't met "fan-out/fan-in" as vocabulary - it's just split-work-
then-combine. The stream() gloss matters: half the room may not read Java streams
fluently, and the next slide leans on .stream().map(...).toList().
This is the lead-in before the stream().map(Async.function).toList() one-liner.
If they get "collect all Promises, THEN join," the code reads itself. The classic
bug is calling .get() inside the map - that serializes it.
One JVM hosts tens of thousands of suspended Workflows.
Each one is just heap state, not a parked thread.
The bug: .get() is INSIDE the loop, so each pass waits for its own Activity
before the next one starts - fully sequential despite Async.function. Fix:
collect every Promise first (stream().map(Async.function).toList()), THEN
allOf().get() and sum. This is THE classic mistake; seeing it once inoculates
them. Connect back to the coffee run - this is order-wait, order-wait.
Lead-in before the procedure/anyOf code. Same idea as fan-out, two more tools:
void Activities, and racing N calls when you only need the first answer.
Example: examples/02-reliability/async_procedure_and_race.java
Lead-in before the try/catch-per-branch code. The question to pose: "3 of 4
priced fine, 1 failed - do you fail the order or price 3?" That's a design call.
Example: examples/02-reliability/partial_failure.java
Lead-in before the bounded-fanout code. The naive fan-out is "all at once"; the
real-world constraint is a downstream QPS/connection cap. The counter + await is
the replay-safe way to throttle inside Workflow code - never a JDK Semaphore.
Open in VSCode: examples/02-reliability/bounded_fanout.java
Have students sketch on paper before running.
Then run and verify their prediction was right (or wrong - even better).
This is the proof for lab step 1. Read ascending so events 1-7 are on screen.
The three ActivityTaskScheduled in a row, under ONE WorkflowTaskCompleted, is the
whole point - image left, the parts named on the right.
Open in VSCode: examples/02-reliability/retry_and_timeouts.java, heartbeat_long_activity.java
The arithmetic is the lesson.
Bring a calculator if you don't trust the audience to do it on paper.
This is the live view while an Activity is between attempts - the retry counter,
the next-retry countdown, and the last error. Captured mid-backoff from the
retries lab (ChargeCard fails twice, then succeeds). Fields named on the right.
The mix-up: people put MaximumAttempts on WorkflowOptions expecting their Activity
to retry. It doesn't - Activity retries are configured on ActivityOptions and are
ON by default (unlimited). Workflow retry is opt-in and restarts the whole run.
The lead-in before the loop. Without heartbeats a dead Worker is invisible until
startToClose fires, and the retry redoes everything. With them: fast failure
detection + resume-where-you-left-off.
Open in VSCode: examples/02-reliability/heartbeat_long_activity.java
Open in VSCode: examples/02-reliability/heartbeat_resume_from_checkpoint.java
The previous slide promised "resume from page N"; this is the call that delivers
it. heartbeat(page) writes the checkpoint; getHeartbeatDetails reads it back on
the next attempt. Without this read a retry silently restarts at page 0.
Lead-in before the CancellationScope code. Ties back to heartbeats: cancellation
reaches a running Activity the same way liveness does - via the heartbeat.
Example: examples/02-reliability/cancellation_scope.java
Open in VSCode: examples/02-reliability/workflow_time.java - durable sleep records TimerStarted; no thread parks.
Reinforcement, not new content.
The students saw the families yesterday; this is the "what bites in production"
list.
Open in VSCode: examples/03-interactions/signals_queries.java
Run: make run-approval
This is the lead-in BEFORE any annotation soup. Decide by intent first, then the
@SignalMethod / @QueryMethod / @UpdateMethod code is obvious. Queries never touch
history (read-only, served from the cached state); signals and updates do.
Async, recorded in history, wakes any await predicate.
Demo: pick currentState, Run Query, read the result. Note Event History is 0 here
- a Query never appears in history. Empty tab = no Worker polling / run not cached.
The "before workflow starts" twist: signalWithStart later in the day will make
this explicit.
Open in VSCode: examples/03-interactions/update_completed.java, update_with_start.java
Run: make run-approval
Lead-in before the @UpdateMethod interface. The validator is the headline:
reject bad input synchronously without polluting history.
Open in VSCode: examples/03-interactions/update_completed.java
Lead-in before the startUpdate code. Parallels startUpdate to Async.function:
both hand back a handle so you can do other work before collecting the result.
Example: examples/03-interactions/update_completed.java
Lead-in before the signalWithStart code. THE foot-gun fix - the Day-3 Kafka
bridge depends entirely on this. Bare start() crashes on event #2 for a key.
This is THE foot-gun.
Every team copies bare WorkflowClient.start() from a tutorial and crashes on the
second Kafka message for the same key.
Lead-in before the startUpdateWithStart code. Use when the very first interaction
both creates the Workflow and needs an answer (e.g. submit-and-confirm).
Open in VSCode: examples/03-interactions/update_with_start.java
The blocking-call shape is what makes Updates the modern primitive.
Signal+Query is older and works against older clusters; Update is the right tool
when the caller wants the result.
Capstone for the section: the SAME run, after a Query (not shown - leaves no
trace), an Update (changeNote), and a Signal (approve). Point at the Signaled
event and the Update Accepted/Completed pair, named on the right.
Open in VSCode: examples/03-interactions/schedule_interval.java, schedule_cron_overlap.java
Run: make run-schedules
Lead-in before the Schedule.newBuilder code. Three parts: action + spec + policy.
Contrast with a cron line on a box that dies when the box dies.
Open in VSCode: examples/03-interactions/schedule_interval.java
Lead-in before the cron/overlap code + table. The overlap policy is the real
lesson - "your hourly job takes 90 min, now what?" has five named answers.
Example: examples/03-interactions/schedule_cron_overlap.java
The Schedules tab is its own left-rail section, separate from Workflows. Created
from examples/runnable/04-schedules. The parts are named on the right.
Compare to "your DAG runs hourly but the 3 AM run takes 90 minutes" - in Airflow
you set max_active_runs.
Here you set ScheduleOverlapPolicy.
Open in VSCode: examples/03-interactions/child_workflow.java, workflow_and_run_timeouts.java
The lead-in before the stub code. 90% of delegation is Activities; children are
for when the sub-task is a real orchestration in its own right.
Open in VSCode: examples/03-interactions/child_workflow.java
Click into the parent → Relationships. The tree + table make "each child is its
own Workflow" concrete. Click any child row to jump to its own page/history.
Lead-in before the tenant-fanout code. Two ideas: per-tenant child identity, and
ParentClosePolicy.ABANDON so a finished (or continued-as-new) parent doesn't kill
in-flight children. Contrast the default: children terminate with the parent.
Open in VSCode: examples/03-interactions/tenant_fanout.java
Workflow timeouts ≠ Activity timeouts.
These are top-level execution caps, not per-attempt budgets.
Three slogans for Day 2.
Each one is a foot-gun saved.
Day 3 is half conceptual (where does Kafka end and Temporal start?) and half
hands-on (full Kafka → Temporal → Kafka loop).
Open in VSCode: examples/04-kafka/kafka_consumer_activity.java, producer_activity_idempotent.java, outbox_activity.java
Disable auto-commit.
Always.
Commit after the unit of work succeeds.
Heartbeat the topic:partition:offset so retries can resume.
Lead-in before the producer config. The point: publishing is a side effect
(Activity), and the retry story forces idempotence + a stable key.
Open in VSCode: examples/04-kafka/producer_activity_idempotent.java
Lead-in before the outbox transaction code. The problem statement is the lesson:
DB-write + Kafka-publish can't be atomic, so make the publish derive from a row.
Open in VSCode: examples/04-kafka/outbox_activity.java
Open in VSCode: examples/04-kafka/signal_bridge.java - signalWithStart, not bare start.
Run: make run-kafka
Open in VSCode: examples/04-kafka/signal_bridge.java
Run: make run-kafka (Worker + bridge). Produce with make kafka-produce TOPIC=orders, tail with make kafka-consume TOPIC=order-outcomes.
Have students fire two events for one key.
The second event should NOT start a new workflow.
If it does, they're using bare start - debug it.
Captured live: kcat produced 3 events (created/paid/shipped) for key 100 to the
'orders' topic; the bridge did SignalWithStartWorkflow each time. One Started,
the rest Signaled. The activity (PublishOutcome → order-outcomes topic) is the
producer half. Docs: https://docs.temporal.io/develop/java
Open in VSCode: examples/04-kafka/partition_fanout.java
Open in VSCode: examples/04-kafka/dlq_after_retry_exhaustion.java
Heaviest day on production rigour.
Lots of ops content.
Two big labs: metrics dashboard and replay tests.
Open in VSCode: examples/05-production/get_version_patch.java, versioning_behavior.java
Open in VSCode: examples/05-production/get_version_patch.java
Lead-in before the @WorkflowVersioningBehavior code. The decision is run lifetime:
you can drain a 5-minute checkout; you can't drain a 6-month subscription.
Open in VSCode: examples/05-production/versioning_behavior.java
Open in VSCode: examples/05-production/worker_options_manual.java, worker_tuner.java, composite_tuner.java, virtual_threads.java
The diagnostic view. "My Workflow is stuck" → open Workers/Pollers; if it's empty,
no Worker is polling that Task Queue. Capacity lives here too: too few pollers for
the backlog shows up as schedule-to-start latency - the Observability section next.
This is the missing bridge between the Pollers tab and the tuning knobs.
Kubernetes replicas, VM count, and multiple JVMs all just add pollers to the same
queue. Temporal does not shard a Workflow across Workers; each Workflow Task or
Activity Task is leased to one Worker at a time.
Important framing: virtual threads are a Java runtime implementation detail, not
a Temporal distribution feature. They reduce per-blocking-call thread cost inside
a Worker. Task Queues decide which machine gets the task; WorkerOptions decide
how many tasks this process runs at once.
Open in VSCode: examples/05-production/worker_options_manual.java
Lead-in before the virtual-threads code. Slots map to threads; virtual threads
lift the ceiling for blocking I/O Activities without a thread-per-slot cost.
Open in VSCode: examples/05-production/virtual_threads.java
Lead-in before the ResourceBasedTuner code. Contrast with the previous "Manual
sizing" slide: same goal (right concurrency), but driven by live resource use.
Open in VSCode: examples/05-production/worker_tuner.java
Lead-in before the CompositeTuner code. The realistic answer is "both": pin the
cheap deterministic workflow slots with a FixedSizeSlotSupplier, auto-size the
expensive activity slots with a ResourceBasedSlotSupplier. CompositeTuner doesn't
size anything itself — it just wires one supplier per slot type.
Example: examples/05-production/composite_tuner.java
Lead-in before the rate-limit code. Distinguish the two knobs: per-Worker (local,
multiplies with replicas) vs task-queue-wide (global, server-enforced). The second
is the one that actually protects a shared dependency under horizontal scaling.
Open in VSCode: examples/05-production/rate_limited_activity_pool.java
Open in VSCode: examples/05-production/micrometer_metrics.java, custom_activity_metric.java, otel_tracing.java
Stack: make stack-obs, then make grafana
Open in VSCode: examples/05-production/micrometer_metrics.java
Lead-in before the custom-metric code. Built-ins tell you the system is healthy;
custom metrics tell you the business is. Same export path, so it's cheap to add.
Example: examples/05-production/custom_activity_metric.java
Lead-in before the OTel interceptor code. The win is correlation: without it,
the Activity spans are orphans; with it, you see the whole request as one tree.
Example: examples/05-production/otel_tracing.java
Captured live: a Worker exporting tally→Prometheus metrics on :9464, Prometheus
scraping it, Grafana rendering the provisioned overview. Read beside the image.
Open in VSCode: examples/05-production/namespace_strategy.md
Maps to instructor Demo 3 (custom-authorizer harness): a working Authorizer that
pins each caller to a Role AND a Task Queue. The teaching point: the namespace is
the boundary the platform enforces; everything finer is code you own.
Lead-in before the Search Attributes code. Contrast with metrics (aggregate) and
memo (attached but NOT indexed). Keys are registered per namespace first, then
upserted from Workflow code. This is how the "tenant ID in Search Attributes" from
the namespace slide actually becomes filterable.
Open in VSCode: examples/05-production/search_attributes_ops.java
Open in VSCode: examples/05-production/junit5_extension_mockito_test.java; runnable test in examples/runnable/06-testing/ReminderWorkflowTest.java
Run: make run-testing (no server needed)
Lead-in before the TestWorkflowEnvironment code. Time-skipping is the "wow":
durable timers normally make long Workflows untestable; here they run instantly.
Open in VSCode: examples/runnable/06-testing/src/test/java/training/temporal/testing/ReminderWorkflowTest.java · Run: make run-testing
Lead-in before the JUnit5 + Mockito code. The pattern: real Workflow, fake
Activities. You're testing the decisions, not the side effects.
Example: examples/05-production/junit5_extension_mockito_test.java
Open in VSCode: examples/05-production/replay_test.java
Open in VSCode: examples/05-production/replay_test.java
Captured from examples/runnable/11-determinism-replay (go test -run
TestReorderedCodeBreaksReplay -v). TMPRL1100 is the non-determinism error code -
recorded history vs replay command diverge at event 5.
This is the safety net for the rest of the year.
Encourage students to take this pattern back to their team and seed a corpus.
4 hours.
Morning is saga + Spring.
Afternoon is capstone (75 min of build time + 25 min review + 20 min Q&A).
Open in VSCode: examples/06-saga-spring/saga_compensation.java (full project: examples/runnable/07-saga/)
Run: make run-saga
Lead-in before the Saga code. The mental model is a stack: push a compensation
after each success; on failure, pop them in reverse. Temporal runs your undo.
Open in VSCode: examples/06-saga-spring/saga_compensation.java (full project: examples/runnable/07-saga/)
War story: when team #3 silently drops an event in a choreographed flow, nobody
notices for 36 hours.
Temporal's log shows it immediately.
Open in VSCode: examples/06-saga-spring/java/choreography_bridge.java + python/choreography_bridge.py + go/choreography_bridge.go
This is the answer to "does Temporal support choreography?"
Yes, but do not turn every service into one giant shared Workflow. Keep bounded
contexts independent. Use events at team boundaries, then use Temporal inside a
team boundary when the service has stateful, retrying, long-running logic.
The listener does not contain business process logic. It makes the event durable
inside Temporal, then commits the Kafka offset after Temporal accepts it.
Avoid presenting this as religion. The useful rule is ownership. If one team
owns the outcome, orchestrate. If no team should own all downstream behavior,
choreograph between teams and use Temporal locally.
Captured from examples/runnable/07-saga with input fail-at-ship. The Result is
"COMPENSATED" - a handled business outcome, not a crash. The arc is named on the
right; the takeaways continue on the next slide.
Open in VSCode: examples/06-saga-spring/spring_temporal_config.java, kafka_listener_trigger.java, sync_saga_update.java, async_saga_signal.java, continue_as_new.java
The lead-in before the @Configuration soup. Three beans + lifecycle is the whole
trick; Activities being Spring beans is what makes DI/testing feel native.
This is the underlying wiring.
In production, prefer the temporal-spring-boot-starter and let it do this.
Lead-in before the sync HTTP code. This is the "two front doors" pair with the
next slide: HTTP Update here, Kafka Signal next - same Workflow underneath.
Open in VSCode: examples/06-saga-spring/sync_saga_update.java
Lead-in before the KafkaListener code. Reinforces signalWithStart from Day 3:
the consumer never checks "does this order's Workflow exist yet?"
Open in VSCode: examples/06-saga-spring/async_saga_signal.java
The lead-in before the loop code. The trigger is history size, not elapsed time:
if a Workflow keeps appending events forever, continue-as-new resets the slate.
Open in VSCode: examples/06-saga-spring/continue_as_new.java
Captured from examples/runnable/12-continue-as-new (processed 9 events across 3
Runs). The First/Previous Execution links are the chain - walk them backwards.
Scaffold from examples/runnable/07-saga/ (Run: make run-saga). Challenge: day-05-saga-spring/lab-3-capstone.
Hold the line on time.
At 50 minutes, stop everyone and check in.
If most are stuck, slow down; if most are done, pull review forward.
Resist correcting code style.
Focus on the four questions above.
They are what the cohort will face on real systems.
4 hours: morning is AWS migration; afternoon is containers + KEDA.
Two big labs: containerized Worker on kind, and KEDA autoscale.
Open in VSCode: examples/07-aws-containers/aws_mapping.md, step_functions_before.json vs step_functions_after_temporal.java
Open in VSCode: examples/07-aws-containers/glue_activity.java, s3_reference_payload.java
Run: make run-aws
The lead-in before the polling loop. This is THE pattern for any external async
job (Glue, EMR, Batch, SageMaker): start → poll+heartbeat → settle.
Open in VSCode: examples/07-aws-containers/glue_activity.java · Run: make run-aws
Lead-in before the record types. The rule: if a payload could be big, it goes to
blob storage and the Workflow carries the pointer. Keeps history small + fast.
Lead-in before the PayloadCodec code. The threat model: history is durable
storage; treat it like any DB column you'd encrypt. Codec server = controlled
decryption for the UI.
Example: examples/07-aws-containers/codec_server.java
Lead-in before the before/after code. The pitch: you stop maintaining ASL and
its limits; the orchestration is code your team already knows.
Open in VSCode: examples/07-aws-containers/step_functions_before.json vs step_functions_after_temporal.java
Open in VSCode: examples/07-aws-containers/{java,python,go}/sqs_signal_bridge.*,
sns_publish_activity.*, ssm_parameter_config.*
The morning's missing third: what *drives* a Workflow, how it talks back out,
and where its config/secrets live. All three run on free LocalStack.
Source: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues.html
The point: SQS gives you at-least-once. Don't fight it - make the consumer
idempotent by keying the Workflow Id off the event, so a redelivery re-signals
the same run instead of starting a duplicate.
Source: https://docs.temporal.io/sending-messages
Deleting before the signal is durable = a lost event on crash. Order matters:
signal first, delete second.
Open in VSCode: examples/07-aws-containers/java/sqs_signal_bridge.java
Source: https://docs.aws.amazon.com/sns/latest/dg/welcome.html
The publisher moved into Temporal; the fan-out topology (SNS -> SQS/Lambda/HTTP)
is unchanged. Only "who calls Publish" changed.
Source: https://docs.temporal.io/develop/python/best-practices/error-handling
This slide is the thesis of the whole "edges" sub-section: at-least-once meets
at-least-once; idempotency keys reconcile them.
Source: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html
The determinism trap: SSM values can change between replays. Bootstrap config at
startup; per-run secrets via an Activity. This is also the on-ramp to IRSA/task
roles (next sub-section) - no static keys.
Open in VSCode: examples/07-aws-containers/ssm_parameter_config.java
The conceptual slide said "config in startup code, secrets via an Activity"; this
is the code. getParametersByPath loads the tree at boot; setTarget/setNamespace
build the non-local stubs from it. fetchApiKey keeps the secret read behind the
Activity boundary.
Open in VSCode: examples/07-aws-containers/Dockerfile, worker_deployment.yaml, keda_scaledobject.yaml
Open in VSCode: examples/07-aws-containers/Dockerfile (runnable: examples/runnable/08-aws-containers/Dockerfile)
Lead-in before the Deployment YAML. The "no Service/ingress" point surprises
people from request/response services - Workers are outbound-only.
Open in VSCode: examples/07-aws-containers/worker_deployment.yaml
Lead-in before the ScaledObject YAML. Ties back to worker sizing + the
schedule-to-start metric: that latency IS the autoscaling trigger.
Open in VSCode: examples/07-aws-containers/keda_scaledobject.yaml
Captured live: kind cluster + KEDA, an in-cluster dev server, a 300-task backlog
on 'transform'. The repo manifests assume an in-cluster temporal-frontend +
production namespace; this used a streamlined equivalent.
These four are NOT on free LocalStack (Pro-only), so the labs are conceptual +
reference manifests under examples/07-aws-containers/aws/ - no make targets.
Walk them as lecture; apply only with an account. Each costs real money - tear
down after. Labs 9-12.
Source: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
Contrast with lab 5: KEDA polls DescribeTaskQueue FOR you. On ECS you wire the
poller -> metric -> scaler yourself. Same idea (scale on queue depth), more glue.
Manifests: examples/07-aws-containers/aws/ecs_*.json + backlog_publisher.md
Source: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
The portability story: kind -> EKS is a no-op for the Worker. IRSA is the
production answer to the dummy creds used all morning. Requires the cluster's
OIDC provider enabled, or IRSA silently fails.
Manifests: examples/07-aws-containers/aws/eksctl-cluster.yaml, irsa-serviceaccount.yaml, eks-worker-deployment.yaml
Source: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html
This is the at-least-once thesis again, in SQL. The DynamoDB example
(dynamodb_idempotency.*) is the NoSQL sibling. Aurora can also be Temporal's own
persistence store when self-hosting - but that's an aside; the lab's focus is
Aurora-as-application-sink.
Files: examples/07-aws-containers/aws/aurora_schema.sql, aurora_load_activity.java, aurora_terraform.tf
Source: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html
The big honesty point: students must not mistake DNS failover for DR. It's
frontend HA only. Pair with Temporal multi-cluster replication for true
cross-region. Files: examples/07-aws-containers/aws/route53_failover.tf, route53_records.md