Temporal Fundamentals

Day	Topic	Lab focus
1	Foundations - durable execution mental model	Hello Temporal, Event History
2	Reliability + interactions	Signals, Updates, Schedules
3	Kafka integration	End-to-end Kafka pipeline, DLQ
4	Production engineering	Replay tests, dashboards
5	Saga + Spring Boot	Saga in Spring Boot
6	AWS migration + containers	Glue, K8s, KEDA

Stack	What it solves	What it leaves to you
Cron + scripts	Triggering on a schedule	All state, retries, recovery
Airflow	Scheduling DAGs	Cross-system state, retries on top
Step Functions	State machines in JSON	Code review, Lambda 15-min cap
Kafka alone	Transport between systems	Per-key state, idempotency

	Apache Airflow	Temporal
Paradigm	A DAG of tasks you wire up	Idiomatic code that runs top-to-bottom
Trigger	Schedule-driven: hourly, midnight, quarter-end	Event-driven: API call, webhook, message
Recovery	Retry a task by its position in the graph	Replay history; resume mid-function
Sweet spot	Move + transform data on a schedule	Coordinate microservices & business logic

Workflow	Activity	Worker	Task Queue
Durable function. State is the event history. Deterministic.	Arbitrary code with side effects. Retried independently.	Long-lived process polling one or more Task Queues.	A string name. Routes work to a Worker pool.

Variant	Topology	Each Worker registers
One Worker, everything	1 queue, 1 pool	all n Workflows + m Activities
Split by Task Queue	k queues, k pools	the subset routed to its queue
Same queue, l replicas	1 queue, l identical pods	the same full set, pods interchangeable
Mismatch (bug)	route to a queue no pool serves	nothing runs it → task waits unpicked until it times out

Airflow	Temporal
DAG	Workflow
Operator / Task	Activity
Worker	Worker (long-lived JVM)
`default_queue`	Task Queue
XCom	A normal Java return value
`ExternalTaskSensor`	Signal / `signalWithStart`
`BranchPythonOperator`	`if` / `switch` in Java
Sensor poll loop	`Workflow.await(predicate)`

Feature	Apache Airflow	Temporal
Primary domain	Data engineering & batch	App development & microservices
State management	Central metadata DB of task status	Event-sourced history, replayed
Latency	High, polling, seconds to start	Low, gRPC, sub-second
Waiting / sleep	Costs a worker slot or a sensor	Native & cheap, sleep for a year
Scaling limit	Scheduler + metadata DB	Workflow history size

Package	You use it for	Key types
`io.temporal.client`	Start / signal / query from outside	`WorkflowClient`, `WorkflowStub`, `WorkflowOptions`
`io.temporal.worker`	Host & poll	`Worker`, `WorkerFactory`, `WorkerOptions`
`io.temporal.workflow`	Write Workflow code	`Workflow`, `Async`, `Promise`, `Saga`, `@WorkflowInterface`
`io.temporal.activity`	Write Activities	`Activity`, `ActivityOptions`, `@ActivityInterface`
`io.temporal.common`	Shared config	`RetryOptions`, converters, interceptors
`io.temporal.serviceclient`	The gRPC connection	`WorkflowServiceStubs`, `WorkflowServiceStubsOptions` (TLS / API-key)

Builder	Configures	Applied when
`WorkflowServiceStubsOptions`	Connection: target host, TLS, API key, metrics scope	building the service stubs
`WorkflowClientOptions`	Namespace, data converter, client interceptors	building the `WorkflowClient`
`WorkerFactoryOptions`	Cross-Worker: sticky cache, virtual workflow threads, Worker interceptors	`WorkerFactory.newInstance`
`WorkerOptions`	Per-queue slots, tuner, virtual threads	`factory.newWorker`
`WorkflowImplementationOptions`	Per-type: fail-on exception types, per-Activity defaults	registering a Workflow impl

Builder	Configures	Applied when
`WorkflowOptions`	ID, Task Queue, run/execution timeouts, retry, ID-reuse policy	starting a Workflow
`ChildWorkflowOptions`	Same set + parent-close policy	starting a Child Workflow
`ScheduleOptions`	Memo & search attributes for the Schedule	`createSchedule`
`ActivityOptions`	Timeouts, heartbeat, Task Queue, retry	building an Activity stub
`LocalActivityOptions`	Timeouts + retry for short, local Activities	building a local Activity stub
`RetryOptions`	Backoff, max attempts, non-retryable types	nested inside all of the above except `ScheduleOptions`

Concept	Java	Python	Go
Package	`io.temporal:temporal-sdk`	`temporalio` (pip)	`go.temporal.io/sdk`
Connect	`WorkflowClient`	`Client.connect`	`client.Dial`
Define Workflow	`@WorkflowInterface`	`@workflow.defn`	`func(ctx, …)`
Run a Worker	`WorkerFactory` / `Worker`	`Worker(...).run()`	`worker.New(...).Run()`
Determinism API	`Workflow.*`	`workflow` module	`workflow` package
Call an Activity	typed Activity stub	`workflow.execute_activity`	`workflow.ExecuteActivity`

What	SDK warns	Hard cap	Escape hatch
Events in the history	10,240	51,200	Continue-As-New
History size	10 MB	50 MB	Continue-As-New
A single payload	~256 KB	2 MB	S3 reference (Day 6)
Pending Activities / Child Workflows		~2,000 each	bounded batches, not all at once

	Workflow ID	Run ID
Who sets it	you	the server
Example	`hello-temporal-demo`	`019ecbcd-6981-7b00-…`
Means	the Workflow's business identity	one execution attempt
Reuse	reusable over time	unique, immutable
New one on		retry · continue-as-new · reset

Case	What happened
Server rejects request	No Workflow started: bad Namespace, auth, invalid options, ID conflict
Connectivity / timeout	Ambiguous: the request may never have reached Temporal, or the response was lost after a successful start
Start accepted	`WorkflowExecutionStarted` is persisted; Workflow is durable from that point
First Workflow Task fails	Start still succeeded; the running Workflow now follows retry/failure rules

Name	What it is	Here
Workflow Type	the Workflow function / class	`GreetingWorkflow`
Workflow ID	the instance you started	`hello-temporal-demo`
Task Queue	the routing name Workers poll	`hello-temporal`

Event	Means
`WorkflowTaskCompleted`	a Worker finished deciding: ran your code to the next await
`ActivityTaskCompleted`	one Activity returned its result
`WorkflowExecutionCompleted`	the whole Workflow finished, the terminal event

Frontend	History	Matching	Worker
Stateless gRPC gateway. Auth, rate-limiting, routing, request validation. Every SDK/CLI call lands here.	Owns Workflow Execution state. Writes the event history, runs the state machine, enqueues tasks. Sharded.	Hosts Task Queues. Matches tasks from History to Workers polling by queue name.	Internal background service: replication, archival, schedules, batch ops, cleanup. Not your Worker.

Internal queue	Drives
Transfer tasks	Push Workflow/Activity tasks to Matching; start child workflows
Timer tasks	Fire durable timers, `Workflow.sleep`, timeouts, retries
Visibility tasks	Update the searchable/visibility store
Replication tasks	Ship events to other clusters (multi-cluster)

Task	Worker does	Result
Workflow Task	Resume Workflow code until it blocks or completes	Commands back to History (schedule activity, start timer, complete)
Activity Task	Execute your Activity code (side effects allowed)	Success/failure reported to History
Query Task	Run a read-only query over current state	Value returned; history not advanced

Worker issues a Command	Service records Event(s)
`ScheduleActivityTask`	`ActivityTaskScheduled`
`StartTimer`	`TimerStarted`
`SignalExternalWorkflowExecution`	`SignalExternalWorkflowExecutionInitiated`
`StartChildWorkflowExecution`	`StartChildWorkflowExecutionInitiated`
`CompleteWorkflowExecution`	`WorkflowExecutionCompleted`
`ContinueAsNewWorkflowExecution`	`WorkflowExecutionContinuedAsNew`

	Cancel	Terminate	Reset
Workflow gets a say?	yes: a request it can catch	no: killed at once	replays instead
Cleanup / compensation?	yes, if you coded it	no	n/a
Effect	graceful stop	hard stop	rewind to an earlier event, re-run from there
Use when	"stop, but tidy up"	"it's wedged, stop now"	"bad deploy/bug, replay with fixed code"

Setting	Controls
`startToCloseTimeout`	One attempt's wall-clock budget
`scheduleToCloseTimeout`	Total budget across all retry attempts
`scheduleToStartTimeout`	How long an Activity sits in the queue before pickup
`heartbeatTimeout`	Max gap between heartbeats; detects Worker death

	Activity `RetryPolicy`	Workflow `RetryPolicy`
Set on	`ActivityOptions`	`WorkflowOptions`
Retries	one Activity attempt	the whole Workflow run
Default	on: unlimited attempts	off: no retry
For	transient I/O failures (the 99% case)	crash-only restart of an entire run
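The backoff that a retry policy describes can be worked through by hand. The sketch below is plain JDK arithmetic, not the Temporal SDK: `BackoffSchedule` and `delayBeforeRetry` are hypothetical names, and the interval, coefficient, and cap in `main` are illustrative values (check the `RetryOptions` Javadoc for the real defaults).

```java
import java.time.Duration;

/** Plain-JDK sketch of exponential backoff with a cap. Hypothetical helper, not SDK code. */
final class BackoffSchedule {

    /** Delay before retry number {@code retry} (1 = first retry), capped at maxInterval. */
    static Duration delayBeforeRetry(int retry, Duration initial, double coefficient, Duration maxInterval) {
        if (retry < 1) {
            throw new IllegalArgumentException("retry must be >= 1");
        }
        double millis = initial.toMillis() * Math.pow(coefficient, retry - 1);
        return Duration.ofMillis((long) Math.min(millis, maxInterval.toMillis()));
    }

    public static void main(String[] args) {
        Duration initial = Duration.ofSeconds(1);   // illustrative value
        Duration cap = Duration.ofSeconds(100);     // illustrative value
        for (int retry = 1; retry <= 8; retry++) {
            System.out.println("retry " + retry + " waits " + delayBeforeRetry(retry, initial, 2.0, cap));
        }
    }
}
```

With these inputs the delay doubles each time until it reaches the cap, which is why an unlimited-attempts policy on an Activity does not hammer a failing downstream.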

	Signal	Query	Update
Direction	push in	read out	call in → out
Synchronous?	no, fire-and-forget	yes	yes
Change state?	yes	no (read-only)	yes
In the history?	yes	no	yes
Can be rejected?	no	n/a	yes (validator)

Overlap policy	When a run is still going
`SKIP`	drop the new run
`BUFFER_ONE` / `BUFFER_ALL`	queue one / queue all
`ALLOW_ALL`	run concurrently
`CANCEL_OTHER` / `TERMINATE_OTHER`	stop the running one first

Use an Activity when…	Use a Child Workflow when…
it's one unit of work (a call, a job)	it has its own multi-step orchestration
Activity-level retries/timeouts are enough	it needs its own history, timeouts, Task Queue
the result fits the parent's history	its events would bloat the parent's history
	you want it on a different Worker pool

Concern	Owner
Append-only event log, replayable by offset	Kafka
Fan-out to many independent consumers	Kafka
State of a single business transaction	Temporal
Retry / timeout / compensation logic	Temporal
Long-running human / external waits	Temporal

Failure type	Belongs in
Transient (network)	Temporal retry (free)
Poison message (malformed)	DLQ topic for triage
Business rule rejection	DLQ or audit topic
Catch-all	DLQ after Temporal exhausts

Setting	Controls	Default
`maxConcurrentWorkflowTaskExecutionSize`	In-flight workflow decisions on this Worker	`200`
`maxConcurrentActivityExecutionSize`	In-flight Activity attempts	`200`
`maxConcurrentLocalActivityExecutionSize`	In-flight local Activities	`200`
`ResourceBasedTuner`	Auto-scale Worker slots vs CPU / memory targets	off (fixed slots)
`CompositeTuner`	Mix strategies: fixed workflow slots + resource-based activity slots	off
Sticky execution	Worker caches workflows; skips full replay each task	on · cache `600`
`setUsingVirtualWorkflowThreads(true)`	Cheaper SDK Workflow threads inside the Worker Factory	`false`
`setUsingVirtualThreads(true)` (JDK 21+)	Cheaper Activity execution threads inside one Worker	`false`
Number of Task Queues	One pool per resource profile	`1`

Layer	Platform threads	Virtual threads
Workflow execution	SDK workflow threads; deterministic, park at Temporal waits	`setUsingVirtualWorkflowThreads(true)` makes those cheaper
Activity execution	One blocking Activity can occupy one OS-backed thread	`setUsingVirtualThreads(true)` makes blocking I/O Activities cheaper
Temporal semantics	Same Task Queues, histories, retries, timeouts	Same semantics, only JVM scheduling cost changes

Metric	Tells you
`temporal_workflow_task_schedule_to_start_latency`	Worker capacity vs demand
`temporal_workflow_completed_total`	Throughput
`temporal_workflow_failed_total`	Real failures
`temporal_activity_execution_failed_total`	Bad downstream / retry config
`temporal_sticky_cache_size`	Replay overhead / memory health
`temporal_activity_schedule_to_start_latency`	Activity backlog

SDK	Native tracing module	OTel path
Go	`contrib/opentelemetry` (+ legacy `…/opentracing`)	native interceptor
Python	`contrib.opentelemetry` only	native interceptor
Java	`temporal-opentracing` only	OpenTracing interceptors + OTel shim

Scenario	Namespace shape
Dev / staging / prod	One namespace per environment
Regulated tenant isolation	One namespace per tenant
Shared SaaS tenants	One namespace per env; tenant ID in Search Attributes
Different retention SLAs	Separate namespace per retention class

Hook	Job
`ClaimMapper`	token → who you are + your namespace Role
`Authorizer`	claims + API call → allow / deny

DAG shape	Verdict
Simple ETL on a fixed schedule	Stay on Airflow, or move scheduler to Temporal Schedules
Cross-system orchestration with retries and human steps	Migrate (sweet spot)
Kafka-triggered, one execution per key	Migrate (Day 3 pattern)
Pure data transformation	Don't migrate. Spark / dbt territory
Long-running waits (hours, days, humans)	Migrate. Airflow handles this poorly
Tight Airflow operator coupling	Wrap in Activities; the operator is the unit

Event-choreographed system	Temporal participant
Kafka topic carries facts: `OrderPlaced`, `PaymentCaptured`, `ShipmentFailed`	Workflow owns one business key: `order-123`
Services publish events after local commits	Bridge delivers events with `signalWithStart`
No one process owns the whole company flow	This team still gets durable state, retries, timers, and audit

Choose orchestration when...	Choose choreography when...
One team owns the end-to-end business outcome	Several teams must evolve independently
You need one place for compensations and timeouts	Events are the stable contract between domains
Operators need one execution history to debug	Consumers should be added without changing a central flow
The flow is user-facing or SLA-bound	The flow is naturally eventually consistent

AWS shape	Temporal shape
EventBridge → Lambda → Step Functions	Consumer (or thin Lambda) `signalWithStart`s a Workflow
Step Functions JSON states	Workflow branches through Java code
Glue Python writes S3 checkpoints	Activity returns result; history records the run
Glue Spark heavy transform	Activity starts Glue, heartbeats runId while polling
CloudWatch Lambda retry	`RetryOptions` with typed `ApplicationFailure`
S3 handoff between Lambdas	Activity returns S3 URI
DynamoDB checkpoint table	Workflow event history

Service	Keep when	Replace when
Lambda	<100ms, IAM-bound, one-shot	Multi-step coordination, retries, long waits
Glue Spark	Large distributed transforms (>10 GB)	Pure data movement; small batches
Glue Python	Tiny scripts (<1 min) with Glue catalog	Anything you'd write as a Java Activity
Step Functions	Already wired, low-change	Anything needing human steps or code review

Resume what?	Mechanism	Owns the checkpoint
A failed step in a multi-job pipeline	History replays completed Activities, doesn't re-run them	Temporal
Polling one run after a Worker crash	Heartbeat `runId` + `getHeartbeatDetails` re-attach	Temporal
A point inside one Spark job	Glue Job Bookmarks, or split the job into steps	Glue

Edge	AWS guarantee	Make it safe
SQS in	at-least-once delivery	Workflow ID from the event → `signalWithStart` dedupes
SNS out	at-least-once publish	carry `workflowId`+`runId`; subscribers dedup (or FIFO topic)
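Why a Workflow ID derived from the event makes redelivery safe can be shown with a toy model. This is plain Java, not the Temporal client: `SignalWithStartModel`, `workflowIdFor`, and the `order-` prefix are hypothetical, and the map stands in for the server's "one open run per ID" rule.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of signalWithStart keyed by a business ID. Hypothetical names, not SDK code. */
final class SignalWithStartModel {
    // workflowId -> signals delivered to the single open run for that ID
    private final Map<String, List<String>> openRuns = new LinkedHashMap<>();
    int starts = 0; // how many runs were actually started

    /** The same event always maps to the same Workflow ID. */
    static String workflowIdFor(String orderId) {
        return "order-" + orderId;
    }

    /** Starts a run only if none is open for the ID, then delivers the signal either way. */
    void signalWithStart(String workflowId, String signal) {
        List<String> inbox = openRuns.get(workflowId);
        if (inbox == null) {
            inbox = new ArrayList<>();
            openRuns.put(workflowId, inbox);
            starts++;
        }
        inbox.add(signal);
    }

    List<String> signals(String workflowId) {
        return openRuns.getOrDefault(workflowId, List.of());
    }

    public static void main(String[] args) {
        SignalWithStartModel server = new SignalWithStartModel();
        String id = workflowIdFor("123");
        server.signalWithStart(id, "OrderPlaced");
        server.signalWithStart(id, "OrderPlaced"); // at-least-once redelivery of the same message
        System.out.println("runs started: " + server.starts);
        System.out.println("signals seen: " + server.signals(id));
    }
}
```

In this model the redelivered message creates no second run, but the signal itself still arrives twice, so the signal handler should be written to tolerate duplicates.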

Concern	Temporal Cloud	EKS self-hosted
Setup time	Hours	Weeks
Persistence	Managed (Cassandra)	You run PostgreSQL / Cassandra
Upgrades	Automatic	You schedule
Multi-region	Built-in (premium)	You design replication
Cost shape	Per-action	Fixed infra
Audit / compliance	SOC2, HIPAA tiers	You provide evidence

Time (min)	Deliverable
0-10	Sketch Workflow signature + Activity interface + compensation order on paper
10-50	Implement enough Java to run the happy path + one failure path
50-65	Wire the Kafka trigger with `signalWithStart`
65-75	Run end-to-end against the local stack; demo one compensation

6 days × 4 hours. Each day mirrors a day in lecture_notes/Day-XX.md. Lab slides are marked - laptops out, fingers on keyboards. Pace check: by the end of Day 1, everyone in the room should have one Workflow running.

Quick orientation slide. Don't dwell - each Day cover slide opens the detailed agenda for that block.

4 hours. Morning interleaves concepts with the first hands-on labs; afternoon goes deeper on the event history. Get everyone's environment green before teaching anything else.

Wait until every laptop is green. Pair the stragglers. Don't proceed without this; the rest of the day depends on it.

Open in VSCode: examples/01-foundations/java/airflow_dag_vs_temporal_workflow.java + .py - DAG shape vs durable code, side by side.

Read in different voices. Each shape will resonate with someone in the room.

The slogan to repeat across the day: runbook, not a button.

Don't bash any tool. Each solves a real problem. The point is the seam each leaves open, not that any is bad.

This is the paradigm shift, stated once, early. Everything on Day 1 builds on "durable function," not "graph of tasks." Airflow is schedule-first and polls; Temporal reacts to events with sub-second latency - that's why it fits user-facing flows Airflow can't serve.

Open in VSCode: examples/01-foundations/java/core_primitives.java - all four roles in one file. Run: make run-hello

Four cards, four primitives. Task Queue is JUST A STRING. Not Kafka. Not a DB. It's a routing key.

This is just a Java interface. The SDK uses the @WorkflowInterface annotation to identify it via reflection.

factory.start() kicks off the long-poll loop. Workers connect outbound.

The mental trap is "one Worker = one Workflow". It isn't. A Worker is a process that holds a registry. Whatever shows up on its Task Queue, if the type is registered, it runs it. Start here: the simplest topology is a single Worker that knows everything.

Why slice it instead of running l identical Workers? Different work has different needs: a GPU box registers only the inference Activity; a lightweight pool registers the orchestration Workflows. A Worker only pulls work it has registered. Mismatched Task Queue = task sits unhandled. This is a config bug people hit on day one.

This is the "so what" behind the l-Workers slide. A workflow task is microseconds of decision-making. An Activity might pin a core for a minute or block on a slow API. Put them on the same Worker and the slow ones eat all the slots. Separate Task Queues let you size each pool independently, more activity slots here, a bigger box there. Tie back to WorkerOptions on the previous slide.

The rule in one line: route by Task Queue, register the subset that queue needs. The failure mode is the day-one bug - an Activity routed to a queue no Worker serves just sits there, unpicked, until it times out. Homogeneous within a queue, heterogeneous across.
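The routing rule can be modeled in a few lines of plain Java. This is a toy sketch, not the Temporal SDK: `RoutingModel`, `register`, and `dispatch` are hypothetical names, and the queue and type names are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of Task Queue routing. Not Temporal SDK code; all names are hypothetical. */
final class RoutingModel {
    // queue name -> types registered by the Worker pool polling that queue
    private final Map<String, Set<String>> pools = new HashMap<>();

    /** A Worker pool polls one queue and registers only the subset that queue needs. */
    void register(String taskQueue, String... types) {
        pools.computeIfAbsent(taskQueue, q -> new HashSet<>()).addAll(Arrays.asList(types));
    }

    /** Reports what happens to a task of the given type routed to the given queue. */
    String dispatch(String taskQueue, String type) {
        Set<String> registered = pools.get(taskQueue);
        if (registered == null) {
            return "UNSERVED";       // no pool polls this queue: the task waits until it times out
        }
        if (!registered.contains(type)) {
            return "NOT_REGISTERED"; // a pool polls the queue but cannot run this type
        }
        return "RUNS";
    }

    public static void main(String[] args) {
        RoutingModel model = new RoutingModel();
        model.register("orchestration", "OrderWorkflow");
        model.register("gpu", "InferenceActivity");
        System.out.println(model.dispatch("gpu", "InferenceActivity"));
        System.out.println(model.dispatch("gpuu", "InferenceActivity")); // typo in the queue name
    }
}
```

The second call shows the mismatch case: the queue name is only a string, so a typo routes work to a queue nobody polls and nothing reports an error at start time.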

Open in VSCode: examples/01-foundations/java/activity_task_routing.java · Run: make run-routing

Demo: make run-routing (one Worker process, 3 pools) + make run-routing-starter. The two pools log independently - each Activity lands on the queue its stub named, and no single Worker registered the full set. Runnable end-to-end in examples/runnable/15-task-queue-routing (Java/Python/Go).

For Airflow rooms, this slide is the moment of recognition. XCom-becomes-a-return-value gets the biggest reaction.

The scaling-limit row is the honest one: Temporal isn't free of limits, it just moves them. History size is the constraint you design around (continue-as-new on Day 5). Pair this with the migration framework on Day 4 - "migrate where Temporal earns its keep."

The cluster is language-agnostic; the SDK is how YOUR language speaks to it. Every primitive from the last section (Workflow, Activity, Worker, Task Queue) is a type in this library. Java is our working language; Python and Go appear at the end so the room knows the same model ships everywhere. Reference throughout: javadoc.io/doc/io.temporal/temporal-sdk/latest

The headline: you don't call a REST API to "do durable execution." You link a library, and the library both reaches the cluster AND hosts your code under the replay contract. The four bullets map 1:1 to the packages on the next slide.

Don't read every cell - point at the split: client+serviceclient are the OUTSIDE (your HTTP handler / starter), worker is the HOST, workflow+activity are the CODE you write, common is the shared knobs. This is the whole API surface in one frame; the Javadoc index has this exact package list.

This is the single most-used class in the SDK and the bridge to the non-determinism slides coming next. The rule: if the plain JDK call would give a different answer on replay, there's a Workflow.* equivalent that records it instead.

This is the starter side from the Hello lab. Three steps: connect, make a stub, call it. A direct method call blocks for the result; WorkflowClient.start(wf::greet, "Ada") returns immediately with a WorkflowExecution handle - the async path.

One artifact gets you everything on the package slide. temporal-testing (the TestWorkflowEnvironment + replay harness) lands on Day 4 - flag it now so the dependency is already familiar.

The bullet from "What the SDK actually does" (payload (de)serialization) cashed out. The mental model: nothing custom crosses the wire - it's all Payloads. DefaultDataConverter is a chain of PayloadConverters; JSON is the catch-all that handles records/POJOs, which is why "make it JSON-serializable" is the rule. Override the converter at WorkflowClientOptions (Day 6 codec slide stacks a PayloadCodec on top to encrypt). Limits are the next slide - the "so what".

This is the slide they asked "limits?" about. Four kinds of limit, not just size: serializability, size, replay cost, and schema evolution - the last is the sneaky one. Tie size back to the 2 MB / 256 KB figures and the 50 MB / 51,200 event history cap from the event-sourcing section. The fix for big data (S3 reference payload) and for sensitive data (PayloadCodec encryption) both land on Day 6 - this slide plants the rule, Day 6 shows the pattern. Default converter behavior: unknown properties are ignored (the SDK's default Jackson mapper turns off FAIL_ON_UNKNOWN_PROPERTIES, which stock Jackson leaves on), so additive changes are safe; renaming a field is a breaking change for any open Workflow whose history holds the old name.

This is the reference slide people screenshot. Walk it outside-in: stubs are the socket, client is the namespace-scoped entry point, factory owns the JVM-wide cache + threads, worker is per-Task-Queue, impl options are per-Workflow-type.

Contrast with the previous slide: those configure the Worker once at boot; these are per-execution and can change call to call. RetryOptions is the common nested piece; point back to the "Setting them deliberately" slide.

Optional / awareness only - the room is Java-first. The point of these three slides: the mental model is the product, not the Java API. A polyglot shop can run Workflows in one language and Activities in another against the same cluster.

Same three moves - connect, start, run a worker - in async Python. The Workflow.* toolkit becomes the workflow module (workflow.sleep, workflow.now, workflow.uuid4).

No annotations - Go uses function signatures and the workflow.Context. Same shape: Dial, ExecuteWorkflow, Get. worker.New(c, "queue", opts).Run() hosts it.

The takeaway slide for the optional block: the columns differ, the rows don't. This is why a team can adopt Temporal without a language rewrite.

Open in VSCode: examples/01-foundations/java/deterministic_replay_bad.java vs deterministic_replay_good.java - diff them side by side.

Whiteboard moment. Walk through with arrows. "Replay" doesn't re-execute side effects - Activity results are READ from history.

Reference card. The throughline: anything whose value the Worker can't reproduce on replay must be sourced from history, not recomputed. Walk each family:
1. TIME - System.currentTimeMillis() returns a new value every replay. The first run records 10:00:00; a replay tomorrow recomputes a different instant and the code branches differently. Workflow.currentTimeMillis() returns the value recorded in history, so every replay sees the same instant.
2. RANDOM - same trap. new Random() reseeds from the system clock; replay gets a different number. Workflow.newRandom() seeds deterministically from the run and records the seed, so the sequence is reproducible. (Use this for jitter, IDs, A/B bucketing - not a bare new Random().)
3. I/O - reading a file, calling an HTTP endpoint, or hitting a DB gives a different answer each replay AND fires the side effect twice. There is no Workflow.* substitute - the fix is to MOVE it into an Activity. Activities run once and their result is recorded; replay reads the result from history.
4. CONCURRENCY - the biggest aha. Thread.sleep blocks a Worker thread for the full duration; Workflow.sleep records a timer and the Worker FORGETS the Workflow entirely (lead-in to the next slide). Same rule for threads/locks: use Workflow.newThread / Async / Workflow primitives, never raw java.lang.Thread, so the SDK controls scheduling deterministically.
5. ITERATION ORDER - HashMap/HashSet have no guaranteed order, and it can differ across JVM versions or runs. If you iterate one to make a decision (pick first, sum in order, branch on order), replay can diverge. Fix: use a TreeMap / LinkedHashMap or sort the keys before iterating. "Risky," not "always wrong" - it only breaks if order affects a recorded decision.
Tie it back to the replay rule slide: every one of these makes the re-run reach a DIFFERENT decision than history, which means a non-determinism error and a stuck or failed Workflow.
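The iteration-order fix is ordinary JDK code and can be shown without any Temporal dependency. A minimal sketch, assuming a decision that depends on "the first key"; `DeterministicOrder`, `stableKeys`, and the warehouse names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of the iteration-order fix: sort before deciding. Names are illustrative. */
final class DeterministicOrder {

    /** Returns the keys in sorted order, so "pick first" or "process in order" is stable. */
    static List<String> stableKeys(Map<String, Integer> map) {
        return new ArrayList<>(new TreeMap<>(map).keySet());
    }

    public static void main(String[] args) {
        Map<String, Integer> stock = new HashMap<>(); // HashMap: iteration order is unspecified
        stock.put("warehouse-b", 3);
        stock.put("warehouse-a", 5);
        stock.put("warehouse-c", 1);

        // Deciding on stock.keySet().iterator().next() would depend on HashMap internals.
        // Deciding on the sorted view depends only on the data.
        System.out.println("first warehouse: " + stableKeys(stock).get(0));
    }
}
```

Copying into a `TreeMap` at the decision point keeps the fix local; declaring the map as a `TreeMap` or `LinkedHashMap` from the start works equally well.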

Ask: "how would you wait 30 days for an email opt-in today?" Compare to one line.

This is the artifact event sourcing produces. Don't dissect every event yet (that's Lab 1.3) - the goal is that when they open the UI in the next lab, the list isn't a wall of noise: they can spot the lifecycle bookends and the Activity trio. Each Workflow Task = one "the Worker woke up, decided, recorded."

`localhost:8233` opens here. The left rail (Workflows, Schedules, Workers, Batch...) is the whole product; Workflows is where you live on Day 1. The columns are named on the right, beside the screen.

Click any row on the previous screen to land here. Point at Input "Ada" and the Result up top, then the numbered event list beside it. The parts are named on the right.

The consolidated "limits?" reference; it pairs with the retention/pruning slides above and the serialization-limits slide in the SDK section. Two kinds of limit: SIZE/COUNT (the table; hit the hard cap and the server terminates the run) and STRUCTURAL (one open run per ID; ~2,000 pending activities/children before the Workflow Task fails). All numbers are dynamic-config defaults: historyCount warn/error 10,240/51,200, historySize 10/50 MB, blobSize ~256 KB/2 MB, NumPendingActivities/ChildExecutions 2,000. Escape hatches: Continue-As-New (Day 5) for history growth, S3 reference payloads (Day 6) for big blobs, bounded fan-out for pending-work caps. Say out loud: you never tune these up, you design under them.
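"Bounded fan-out" comes down to splitting the work list before starting anything. The sketch below is plain Java with hypothetical names (`BoundedBatches`, `partition`); the batch size of 500 is an arbitrary illustration, chosen only to sit well under the ~2,000 pending cap.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of bounded batching for fan-out. Plain JDK; names and sizes are illustrative. */
final class BoundedBatches {

    /** Splits items into consecutive batches of at most maxBatch elements. */
    static <T> List<List<T>> partition(List<T> items, int maxBatch) {
        if (maxBatch <= 0) {
            throw new IllegalArgumentException("maxBatch must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += maxBatch) {
            int end = Math.min(i + maxBatch, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> orderIds = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            orderIds.add(i);
        }
        // In a Workflow, each batch would be started and fully awaited before the next begins.
        List<List<Integer>> batches = partition(orderIds, 500);
        System.out.println(batches.size() + " batches, first has " + batches.get(0).size() + " items");
    }
}
```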

This is the real production topology, not a toy wiring: the Worker is a long-lived deployment; the starter is whatever fires work - an HTTP handler, a cron, a CLI. Both only ever talk to the server.

Have one person KILL the Worker mid-run on purpose. The Workflow completes when the Worker restarts. This is the most important moment of Day 1.

This is the #1 "huh?" when people first read a history. They set the queue to hello-temporal but see a hostname:uuid on later Workflow Tasks. The next slide explains why. JSON is from the Event History "JSON" toggle / `workflow show -o json`.

This is the question students ask: "what if a disconnected Worker comes back?" The answer is the compare-and-swap (CAS) on history version - not locking, not leader election at the Worker level. Workers are interchangeable; the DB is the arbiter. Tie the activity-idempotency point back to the retries lab.
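The compare-and-swap idea fits in a few lines. This is a toy model of the principle, not how the Temporal server is implemented: `HistoryCas` and `commit` are hypothetical names, and an `AtomicLong` stands in for the versioned row in the database.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of a version-checked commit. Hypothetical names; not server code. */
final class HistoryCas {
    private final AtomicLong version = new AtomicLong(0);

    long currentVersion() {
        return version.get();
    }

    /** The commit lands only if the caller worked from the latest version. */
    boolean commit(long versionSeenByWorker) {
        return version.compareAndSet(versionSeenByWorker, versionSeenByWorker + 1);
    }

    public static void main(String[] args) {
        HistoryCas history = new HistoryCas();
        long seenByA = history.currentVersion(); // Worker A picks up the task, then disconnects
        long seenByB = history.currentVersion(); // the task times out; Worker B picks it up

        System.out.println("B commits: " + history.commit(seenByB)); // B is first, so it wins
        System.out.println("A commits: " + history.commit(seenByA)); // A's view is stale
    }
}
```

Neither Worker holds a lock: the late one is simply rejected because the version it read is no longer current.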

The synthesis slide. Students meet sticky cache, replay, and determinism as three separate topics; this is where they click into one. Cache = derived view; history = source of truth; determinism = why the view is reproducible elsewhere.

The classic Day-1 confusion. Workflow ID is yours and reusable; Run ID is one physical attempt. continue-as-new keeps the ID, mints a new Run - the Day-5 lab.

On the Workflows list: Type column = GreetingWorkflow, Workflow ID column = hello-temporal-demo. People assume the Type is the ID, or that the Task Queue must match one of them. Neither is true - the three names are orthogonal.

Run: make run-hello, then read the Web UI event history. Dump it from the CLI with examples/01-foundations/history_cli.sh.

Simplified mental model first - the next slide shows the real topology. Trace one Workflow start: SDK → Frontend → History (write WorkflowExecutionStarted) → Matching → Worker polls.

Frontend is a gateway in front of three peer services; History and Matching both own persistence. Your Workers live OUTSIDE this box and connect outbound to Frontend on :7233. Keep the source credit on the slide.

The naming trap: "Worker Service" inside the cluster is internal background work. The Worker YOU write and deploy is a separate process polling Matching via Frontend.

Walk this slowly on the whiteboard. Two services the Worker never talks to directly: it polls through Frontend, Matching dispatches, History owns the record. Matching → History RecordWorkflowTaskStarted is the easily-missed hop: the Worker pulling a task is itself an event (WorkflowTaskStarted) before code runs. The key insight: nothing the Worker does is trusted until History has written the resulting event. Crash anywhere and replay rebuilds from the persisted history. Next slide turns this trace sideways into the ledger it produces.

This is the "not just an interaction diagram" payoff: same run, turned 90° into the durable record. Eleven events, eleven state transitions. Point out the rhythm: every Workflow Task is a Scheduled→Started→Completed triple, and a Command (ScheduleActivityTask, CompleteWorkflowExecution) only ever lands as an Event. The Worker proposes; History decides and records. Tie forward to determinism: on replay the SDK re-runs the code and checks the Commands it regenerates against events 4 and 10. Reorder your Activities and event 4 won't match; that's the non-determinism error they'll meet on Day 3.
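The replay check described above can be sketched as a list comparison. This is a simplified plain-Java model, not the SDK's actual matching logic: `ReplayCheck`, `firstMismatch`, and the `charge` / `ship` Activity names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the replay check: regenerated Commands must agree with recorded history. */
final class ReplayCheck {

    /**
     * Returns the index of the first position where replay disagrees with history, or -1.
     * Commands beyond the end of the recorded list are new progress, not a mismatch.
     */
    static int firstMismatch(List<String> recorded, List<String> regenerated) {
        for (int i = 0; i < recorded.size(); i++) {
            if (i >= regenerated.size()) {
                return i; // history holds a step the code no longer produces
            }
            if (!recorded.get(i).equals(regenerated.get(i))) {
                return i; // the code now asks for something different at this step
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> recorded = Arrays.asList("ScheduleActivityTask:charge", "ScheduleActivityTask:ship");
        List<String> sameCode = Arrays.asList("ScheduleActivityTask:charge", "ScheduleActivityTask:ship");
        List<String> reordered = Arrays.asList("ScheduleActivityTask:ship", "ScheduleActivityTask:charge");

        System.out.println("same code: " + firstMismatch(recorded, sameCode));
        System.out.println("reordered: " + firstMismatch(recorded, reordered));
    }
}
```

Swapping two Activity calls changes nothing about what the code "means" to a human reader, yet the very first regenerated Command no longer matches the record, which is the failure the note warns about.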

Same seven steps as the trace, turned into lanes so the asymmetry is visible: all the persistence and decisioning happens in the middle (History), the Worker only ever talks to the edge (Frontend). The two left-pointing arrows into History, RecordWorkflowTaskStarted and RespondWFTCompleted, are the moments a poll/response turns into a persisted event.

The single rule behind every lifecycle diagram: a Command is intent; an Event is fact. The names even rhyme, Schedule→Scheduled, Start→Started/Initiated. This is why "do I/O in Activities, not Workflow code": only Commands round-trip through History, so only Command-shaped effects are durable and replayable.

The mechanism under the trace. Step ② is the one that surprises people: the Worker doesn't resume a paused thread, it re-runs the code from the top against the recorded history every single Workflow Task. The Worker is stateless between tasks; the history is the state. Sticky cache (mention WorkerFactoryOptions, where the cache is sized) is just an optimization that skips ② when the same Worker holds the run in memory.
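"Re-run from the top against recorded history" can be made concrete with a toy replay engine. This is a plain-Java model under simplifying assumptions (history holds only Activity results, in order); `ReplayModel`, `activity`, and `workflow` are hypothetical names, not SDK types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Toy model of replay. Not the Temporal SDK; every name here is hypothetical. */
final class ReplayModel {
    private final List<String> history; // recorded Activity results, in order
    private int cursor = 0;             // next recorded result to consume
    int activityExecutions = 0;         // how many times real Activity code ran

    ReplayModel(List<String> history) {
        this.history = history;
    }

    /** Reads the result from history if one is recorded; otherwise runs the body and records it. */
    String activity(Supplier<String> body) {
        if (cursor < history.size()) {
            return history.get(cursor++); // replay: the side effect is NOT repeated
        }
        activityExecutions++;
        String result = body.get();
        history.add(result);
        cursor++;
        return result;
    }

    /** The "Workflow code": it always starts from the first line. */
    static String workflow(ReplayModel ctx, String name) {
        String greeting = ctx.activity(() -> "Hello, " + name);
        return ctx.activity(() -> greeting + "!");
    }

    public static void main(String[] args) {
        List<String> history = new ArrayList<>();

        ReplayModel firstRun = new ReplayModel(history);
        System.out.println(workflow(firstRun, "Ada") + " executions=" + firstRun.activityExecutions);

        // A different Worker, holding nothing in memory, rebuilds the same state from history.
        ReplayModel afterCrash = new ReplayModel(history);
        System.out.println(workflow(afterCrash, "Ada") + " executions=" + afterCrash.activityExecutions);
    }
}
```

The second run reaches the same result without executing any Activity body, which is the sense in which the Worker is stateless and the history is the state.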

Zoom all the way out. The previous slides were one trip through Running; this is the shape of the whole execution. Worth naming the distinction now: Canceled is cooperative (the code gets a chance to clean up), Terminated is forceful (the server kills it, no cleanup). Continue-As-New seeds Day 2's long-running patterns, same Workflow ID, brand-new Run ID and empty history.

The most dangerous menu for newcomers. Terminate forfeits compensation - prefer Cancel. Reset recovers from a bad code deploy: pick an event, reset, the Worker replays forward with current code. All three are in the per-Workflow More Actions.

Connections.fromEnv() is the shared base; Cloud (next) feeds it credentials and takes the TLS branch. Worker and starter are separate processes, both reading the same env. This slide mirrors the Cloud slide and uses the SAME worker: the plaintext branch here, the TLS branch on Cloud. Stop 'make temporal' first; the stack binds :7233. Good moment to restart the temporal container and show the Workflow survived.

Punchline to say out loud: laptop -> Docker -> Cloud is all env, not a rewrite. Optional - demo-only if attendees have no Cloud creds. Same shared module as the Docker slide (examples/runnable/01b-hello-temporal-anywhere); with no creds it falls back to plaintext, so it still compiles and runs against make temporal.

This grep-able view is the production debugging starting point. Show it now; it'll come back on Day 4 for replay tests.

The first slogan to repeat. If only one thing sticks for Day 1, it's the re-execution-vs-recorded-results distinction.

4 hours: morning is reliability mechanics; afternoon is signals/queries/updates/schedules/children. Lots of code.

Open in VSCode: examples/02-reliability/java/async_activity.java, parallel_fanout_allof.java Run: make run-async

Start from what they know. Everyone has written blocking code; name it so the contrast on the next slides has something to push against. No Temporal yet - this is pure "how code waits". For the non-async crowd, this slide is the floor.

The claim-ticket analogy is the load-bearing metaphor for the entire section. A Promise is a receipt for a result that isn't ready yet. Async hands it back immediately; .get() is the only thing that waits. Land this hard before any code.

The punchline that makes "parallel" cost nothing. Same orders, same counter - the only variable is whether you wait between orders or after all of them. No threads, no executor pools. Hold here until heads nod.

Answer: place all three orders first (2, 4, 5 in any order), THEN collect all three (1, 3, 6). Placing = Async.function; collecting = .get(). The trap answer interleaves place/collect - that's the sequential 12-minute version. 2-minute solo, then reveal. Tie each step back to the API name out loud.

The lead-in before any Async code. The whole section is one idea: a Promise lets the Workflow wait without occupying a thread, so "parallel" costs nothing. Get this and the fan-out code is obvious.

Sequential: 2+2+1+1 = 6s. This version: the two extracts overlap (2s), then the transforms run sequentially (1+1) = 4s total. Lesson: only the work you START before waiting overlaps; anything you call sequentially after a .get() is still sequential. Follow-up: "how would you parallelize the transforms too?" - start both, then get both.
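
If the arithmetic needs to be on screen, here is a toy timing model. It is a sketch, not Temporal code: a "phase" is everything started before the next wait, a phase lasts as long as its slowest task, and phases add up. The class and method names are hypothetical; the durations (2s extracts, 1s transforms) are the ones from the slide.

```java
import java.util.Arrays;
import java.util.List;

/** Toy timing model: tasks started together overlap; phases run one after another. */
final class OverlapModel {

    /** Each inner list is one phase: everything started before the next wait. */
    static int totalSeconds(List<List<Integer>> phases) {
        int total = 0;
        for (List<Integer> phase : phases) {
            int longest = 0;
            for (int seconds : phase) {
                longest = Math.max(longest, seconds);
            }
            total += longest; // a phase lasts as long as its slowest task
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Integer>> sequential = Arrays.asList(
                Arrays.asList(2), Arrays.asList(2), Arrays.asList(1), Arrays.asList(1));
        List<List<Integer>> thisVersion = Arrays.asList(
                Arrays.asList(2, 2), Arrays.asList(1), Arrays.asList(1));
        List<List<Integer>> bothParallel = Arrays.asList(
                Arrays.asList(2, 2), Arrays.asList(1, 1));
        System.out.println(totalSeconds(sequential));   // 6
        System.out.println(totalSeconds(thisVersion));  // 4
        System.out.println(totalSeconds(bothParallel)); // 3
    }
}
```

The third case is the answer to the follow-up question: start both transforms, then get both, and the total drops to 3s.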

Name the pattern with a physical picture before any stream()/Promise code. Many in the room haven't met "fan-out/fan-in" as vocabulary - it's just split-work-then-combine. The stream() gloss matters: half the room may not read Java streams fluently, and the next slide leans on .stream().map(...).toList().

This is the lead-in before the stream().map(Async.function).toList() one-liner. If they get "collect all Promises, THEN join," the code reads itself. The classic bug is calling .get() inside the map - that serializes it.

One JVM hosts tens of thousands of suspended Workflows. Each one is just heap state, not a parked thread.

The bug: .get() is INSIDE the loop, so each pass waits for its own Activity before the next one starts - fully sequential despite Async.function. Fix: collect every Promise first (stream().map(Async.function).toList()), THEN allOf().get() and sum. This is THE classic mistake; seeing it once inoculates them. Connect back to the coffee run - this is order-wait, order-wait.
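
A plain-JDK sketch of the same bug and fix, for anyone who wants to see the ordering without a Temporal server. CompletableFuture stands in for Promise, supplyAsync for Async.function, and join() for .get(); price() and the log format are hypothetical. This is an analogy only: inside real Workflow code you would use the SDK's Promise, not CompletableFuture or JDK threads.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class FanOutOrder {

    /** Stand-in for an Activity call; the name and pricing rule are hypothetical. */
    static int price(int item) {
        return item * 10;
    }

    /** The bug: join() inside the loop, so it is place, wait, place, wait. */
    static int sequentialByAccident(List<Integer> items, List<String> log) {
        int sum = 0;
        for (int item : items) {
            log.add("place:" + item);
            CompletableFuture<Integer> f = CompletableFuture.supplyAsync(() -> price(item));
            sum += f.join(); // blocks before the next call has even been placed
            log.add("collect:" + item);
        }
        return sum;
    }

    /** The fix: place every order first, THEN wait for all of them, then sum. */
    static int fanOutFanIn(List<Integer> items, List<String> log) {
        List<CompletableFuture<Integer>> futures = items.stream()
                .map(item -> {
                    log.add("place:" + item);
                    return CompletableFuture.supplyAsync(() -> price(item));
                })
                .collect(Collectors.toList());
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        int sum = 0;
        for (int i = 0; i < futures.size(); i++) {
            sum += futures.get(i).join(); // already complete, returns immediately
            log.add("collect:" + items.get(i));
        }
        return sum;
    }
}
```

Both methods return the same sum. Only the log differs: the buggy one alternates place/collect, the fixed one shows every place before the first collect.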

Lead-in before the procedure/anyOf code. Same idea as fan-out, two more tools: void Activities, and racing N calls when you only need the first answer.

Example: examples/02-reliability/java/async_procedure_and_race.java

Lead-in before the try/catch-per-branch code. The question to pose: "3 of 4 priced fine, 1 failed - do you fail the order or price 3?" That's a design call.

Example: examples/02-reliability/java/partial_failure.java

Lead-in before the bounded-fanout code. The naive fan-out is "all at once"; the real-world constraint is a downstream QPS/connection cap. The counter + await is the replay-safe way to throttle inside Workflow code - never a JDK Semaphore.

Open in VSCode: examples/02-reliability/java/bounded_fanout.java

Have students sketch on paper before running. Then run and verify their prediction was right (or wrong - even better).

This is the proof for lab step 1. Read ascending so events 1-7 are on screen. The three ActivityTaskScheduled in a row, under ONE WorkflowTaskCompleted, is the whole point - image left, the parts named on the right.

Open in VSCode: examples/02-reliability/java/retry_and_timeouts.java, heartbeat_long_activity.java

The arithmetic is the lesson. Bring a calculator if you don't trust the audience to do it on paper.
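
A calculator substitute, assuming the arithmetic on the slide is the exponential backoff schedule: wait before retry n = min(initial × coefficient^(n-1), maximum interval). The sketch below is plain arithmetic, not SDK code. The class and method names are hypothetical, and it ignores any jitter.

```java
import java.util.ArrayList;
import java.util.List;

final class BackoffSchedule {

    /** Waits in seconds before retry 1, 2, ... under capped exponential backoff. */
    static List<Double> waits(double initial, double coefficient, double maxInterval, int retries) {
        List<Double> out = new ArrayList<>();
        double next = initial;
        for (int i = 0; i < retries; i++) {
            out.add(Math.min(next, maxInterval)); // the cap flattens the curve
            next *= coefficient;
        }
        return out;
    }

    /** Total time spent waiting between attempts. */
    static double total(List<Double> waits) {
        double sum = 0;
        for (double w : waits) {
            sum += w;
        }
        return sum;
    }
}
```

With an initial interval of 1s, a coefficient of 2.0, and a 100s cap, waits(1, 2, 100, 8) gives 1, 2, 4, 8, 16, 32, 64, 100: the eighth wait would be 128s but is capped, and the total wait is 227s.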

This is the live view while an Activity is between attempts - the retry counter, the next-retry countdown, and the last error. Captured mid-backoff from the retries lab (ChargeCard fails twice, then succeeds). Fields named on the right.

The mix-up: people put MaximumAttempts on WorkflowOptions expecting their Activity to retry. It doesn't - Activity retries are configured on ActivityOptions and are ON by default (unlimited). Workflow retry is opt-in and restarts the whole run.

The lead-in before the loop. Without heartbeats a dead Worker is invisible until startToClose fires, and the retry redoes everything. With them: fast failure detection + resume-where-you-left-off.

Open in VSCode: examples/02-reliability/java/heartbeat_long_activity.java

Open in VSCode: examples/02-reliability/java/heartbeat_resume_from_checkpoint.java

The previous slide promised "resume from page N"; this is the call that delivers it. heartbeat(page) writes the checkpoint; getHeartbeatDetails reads it back on the next attempt. Without this read a retry silently restarts at page 0.
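
A minimal simulation of the write-then-read-back loop, in plain Java so it runs without a server. The lastHeartbeat field stands in for the heartbeat details the server keeps between attempts; the class, the failAt crash switch, and the page numbering are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class CheckpointedExport {

    /** Stand-in for the heartbeat details that survive a failed attempt. */
    private Optional<Integer> lastHeartbeat = Optional.empty();

    /** Every page that was actually processed, across all attempts. */
    final List<Integer> processed = new ArrayList<>();

    /** One attempt: resume after the last checkpoint; crash before page failAt (-1 = never). */
    void attempt(int totalPages, int failAt) {
        // Read the checkpoint back. Drop this line and every retry starts at page 0.
        int start = lastHeartbeat.map(p -> p + 1).orElse(0);
        for (int page = start; page < totalPages; page++) {
            if (page == failAt) {
                throw new IllegalStateException("worker died before page " + page);
            }
            processed.add(page);               // the real work
            lastHeartbeat = Optional.of(page); // checkpoint: page N is done
        }
    }
}
```

Run attempt(5, 3), let it throw, then run attempt(5, -1): the second attempt starts at page 3 and each page is processed exactly once.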

Lead-in before the CancellationScope code. Ties back to heartbeats: cancellation reaches a running Activity the same way liveness does - via the heartbeat.

Example: examples/02-reliability/java/cancellation_scope.java

Open in VSCode: examples/02-reliability/java/workflow_time.java - durable sleep records TimerStarted; no thread parks.

Reinforcement, not new content. The students saw the families yesterday; this is the "what bites in production" list.

Open in VSCode: examples/03-interactions/java/signals_queries.java Run: make run-approval

This is the lead-in BEFORE any annotation soup. Decide by intent first, then the @SignalMethod / @QueryMethod / @UpdateMethod code is obvious. Queries never touch history (read-only, served from the cached state); signals and updates do.

Async, recorded in history, wakes any await predicate.

Demo: pick currentState, Run Query, read the result. Note Event History is 0 here - a Query never appears in history. Empty tab = no Worker polling / run not cached.

The "before workflow starts" twist: signalWithStart later in the day will make this explicit.

Open in VSCode: examples/03-interactions/java/update_completed.java, update_with_start.java Run: make run-approval

Lead-in before the @UpdateMethod interface. The validator is the headline: reject bad input synchronously without polluting history.

Open in VSCode: examples/03-interactions/java/update_completed.java

Lead-in before the startUpdate code. Draw the parallel between startUpdate and Async.function: both hand back a handle so you can do other work before collecting the result.

Example: examples/03-interactions/java/update_completed.java

Lead-in before the signalWithStart code. THE foot-gun fix - the Day-3 Kafka bridge depends entirely on this. Bare start() crashes on event #2 for a key.

This is THE foot-gun. Every team copies bare WorkflowClient.start() from a tutorial and crashes on the second Kafka message for the same key.
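
To make the failure concrete without Kafka or a server, here is a toy model of the two calls. ToyServer is hypothetical and keeps only a map of open Workflow IDs; it is a sketch of the semantics (start fails on an ID that is already open, signalWithStart starts if absent and then delivers the signal), not the SDK API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ToyServer {

    /** Open runs: workflowId -> signals received so far. */
    final Map<String, List<String>> running = new HashMap<>();
    int starts = 0;

    /** Bare start: fails if a run with this ID is already open. */
    void start(String workflowId) {
        if (running.containsKey(workflowId)) {
            throw new IllegalStateException("already started: " + workflowId);
        }
        running.put(workflowId, new ArrayList<>());
        starts++;
    }

    /** Start if absent, then deliver the signal, as one operation. */
    void signalWithStart(String workflowId, String signal) {
        if (!running.containsKey(workflowId)) {
            start(workflowId);
        }
        running.get(workflowId).add(signal);
    }
}
```

Feed it created, paid, shipped for one key: signalWithStart ends with one start and three signals, while a second bare start() for the same key throws.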

Lead-in before the startUpdateWithStart code. Use when the very first interaction both creates the Workflow and needs an answer (e.g. submit-and-confirm).

Open in VSCode: examples/03-interactions/java/update_with_start.java

The blocking-call shape is what makes Updates the modern primitive. Signal+Query is older and works against older clusters; Update is the right tool when the caller wants the result.

Capstone for the section: the SAME run, after a Query (not shown - leaves no trace), an Update (changeNote), and a Signal (approve). Point at the Signaled event and the Update Accepted/Completed pair, named on the right.

Open in VSCode: examples/03-interactions/java/schedule_interval.java, schedule_cron_overlap.java Run: make run-schedules

Lead-in before the Schedule.newBuilder code. Three parts: action + spec + policy. Contrast with a cron line on a box that dies when the box dies.

Open in VSCode: examples/03-interactions/java/schedule_interval.java

Lead-in before the cron/overlap code + table. The overlap policy is the real lesson - "your hourly job takes 90 min, now what?" has five named answers.

Example: examples/03-interactions/java/schedule_cron_overlap.java

The Schedules tab is its own left-rail section, separate from Workflows. Created from examples/runnable/04-schedules. The parts are named on the right.

Compare to "your DAG runs hourly but the 3 AM run takes 90 minutes" - in Airflow you set max_active_runs. Here you set ScheduleOverlapPolicy.

Open in VSCode: examples/03-interactions/java/child_workflow.java, workflow_and_run_timeouts.java

The lead-in before the stub code. 90% of delegation is Activities; children are for when the sub-task is a real orchestration in its own right.

Open in VSCode: examples/03-interactions/java/child_workflow.java

Click into the parent → Relationships. The tree + table make "each child is its own Workflow" concrete. Click any child row to jump to its own page/history.

Lead-in before the tenant-fanout code. Two ideas: per-tenant child identity, and ParentClosePolicy.ABANDON so a finished (or continued-as-new) parent doesn't kill in-flight children. Contrast the default: children terminate with the parent.

Open in VSCode: examples/03-interactions/java/tenant_fanout.java

Workflow timeouts ≠ Activity timeouts. These are top-level execution caps, not per-attempt budgets.

Three slogans for Day 2. Each one is a foot-gun saved.

Day 3 is half conceptual (where does Kafka end and Temporal start?) and half hands-on (full Kafka → Temporal → Kafka loop).

Open in VSCode: examples/04-kafka/java/kafka_consumer_activity.java, producer_activity_idempotent.java, outbox_activity.java

Disable auto-commit. Always. Commit after the unit of work succeeds. Heartbeat the topic:partition:offset so retries can resume.
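
A config-level sketch of the first and last sentences. It uses only the JDK Properties class, so it compiles without the Kafka client; bootstrap.servers, group.id, and enable.auto.commit are standard consumer settings, while the class and method names are hypothetical.

```java
import java.util.Properties;

final class ConsumerConfigSketch {

    /** Consumer settings with auto-commit off; offsets are committed by hand after the work succeeds. */
    static Properties consumerProps(String bootstrapServers, String groupId) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        p.setProperty("group.id", groupId);
        p.setProperty("enable.auto.commit", "false");
        return p;
    }

    /** The checkpoint string to heartbeat, in the topic:partition:offset shape from the note. */
    static String checkpoint(String topic, int partition, long offset) {
        return topic + ":" + partition + ":" + offset;
    }
}
```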

Lead-in before the producer config. The point: publishing is a side effect (Activity), and the retry story forces idempotence + a stable key.

Open in VSCode: examples/04-kafka/java/producer_activity_idempotent.java

Lead-in before the outbox transaction code. The problem statement is the lesson: DB-write + Kafka-publish can't be atomic, so make the publish derive from a row.

Open in VSCode: examples/04-kafka/java/outbox_activity.java

Open in VSCode: examples/04-kafka/java/signal_bridge.java - signalWithStart, not bare start. Run: make run-kafka

Open in VSCode: examples/04-kafka/java/signal_bridge.java

Run: make run-kafka (Worker + bridge). Produce with make kafka-produce TOPIC=orders, tail with make kafka-consume TOPIC=order-outcomes.

Have students fire two events for one key. The second event should NOT start a new workflow. If it does, they're using bare start - debug it.

Captured live: kcat produced 3 events (created/paid/shipped) for key 100 to the 'orders' topic; the bridge did SignalWithStartWorkflow each time. One Started, the rest Signaled. The activity (PublishOutcome → order-outcomes topic) is the producer half. Docs: https://docs.temporal.io/develop/java

Open in VSCode: examples/04-kafka/java/partition_fanout.java

Open in VSCode: examples/04-kafka/java/dlq_after_retry_exhaustion.java

Heaviest day on production rigour. Lots of ops content. Two big labs: metrics dashboard and replay tests.

Open in VSCode: examples/05-production/java/get_version_patch.java, versioning_behavior.java

Open in VSCode: examples/05-production/java/get_version_patch.java

Lead-in before the @WorkflowVersioningBehavior code. The decision is run lifetime: you can drain a 5-minute checkout; you can't drain a 6-month subscription.

Open in VSCode: examples/05-production/java/versioning_behavior.java

Open in VSCode: examples/05-production/java/worker_options_manual.java, worker_tuner.java, composite_tuner.java, virtual_threads.java

Corrected shorthand of the levers table: one Worker runs n Workflows + m Activities off one queue. a/b are integer slot counts (both default 200); c/d are JDK-21 virtual-thread booleans (both default false). The trap the original snippet hid: c is set on the Factory, not the Worker, and c vs d is Workflow- vs Activity-thread scope, not "threads vs virtual threads".

Same a/b/c/d shorthand extended to the rest of the table, with defaults. e is the third slot count. f/g are the poller pairs (5 + 5). h/i are the two rate caps (per-Worker vs queue-wide, server-enforced). j replaces the fixed slot counts with a tuner. k is the sticky cache. l is the topology lever. Lettering continues e–l so it reads as one continuous list with the previous slide.

The diagnostic view. "My Workflow is stuck" → open Workers/Pollers; if it's empty, no Worker is polling that Task Queue. Capacity lives here too: too few pollers for the backlog shows up as schedule-to-start latency - the Observability section next.

This is the missing bridge between the Pollers tab and the tuning knobs. Kubernetes replicas, VM count, and multiple JVMs all just add pollers to the same queue. Temporal does not shard a Workflow across Workers; each Workflow Task or Activity Task is leased to one Worker at a time.

Important framing: virtual threads are a Java runtime implementation detail, not a Temporal distribution feature. They reduce per-blocking-call thread cost inside a Worker. Task Queues decide which machine gets the task; WorkerOptions decide how many tasks this process runs at once.

Open in VSCode: examples/05-production/java/worker_options_manual.java

Lead-in before the virtual-threads code. Slots map to threads; virtual threads lift the ceiling for blocking I/O Activities without a thread-per-slot cost.

Open in VSCode: examples/05-production/java/virtual_threads.java

Lead-in before the ResourceBasedTuner code. Contrast with the previous "Manual sizing" slide: same goal (right concurrency), but driven by live resource use.

Open in VSCode: examples/05-production/java/worker_tuner.java

Lead-in before the CompositeTuner code. The realistic answer is "both": pin the cheap deterministic workflow slots with a FixedSizeSlotSupplier, auto-size the expensive activity slots with a ResourceBasedSlotSupplier. CompositeTuner doesn't size anything itself; it just wires one supplier per slot type.

Example: examples/05-production/java/composite_tuner.java

Lead-in before the rate-limit code. Distinguish the two knobs: per-Worker (local, multiplies with replicas) vs task-queue-wide (global, server-enforced). The second is the one that actually protects a shared dependency under horizontal scaling.
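
The "multiplies with replicas" point as arithmetic. This is a sketch with hypothetical names, not SDK code: it only computes the ceiling a downstream dependency could see under each knob.

```java
final class RateCaps {

    /** A per-Worker cap is local: every replica gets its own budget. */
    static double ceilingWithPerWorkerCap(double perWorkerPerSecond, int replicas) {
        return perWorkerPerSecond * replicas;
    }

    /** A task-queue-wide cap is shared: replicas split it, so the ceiling does not grow. */
    static double ceilingWithQueueCap(double queuePerSecond, int replicas) {
        return queuePerSecond;
    }
}
```

At 10 per second per Worker, scaling from 1 to 8 replicas raises the ceiling from 10 to 80 per second; a queue-wide cap of 10 per second stays at 10 no matter how many replicas poll.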

Open in VSCode: examples/05-production/java/rate_limited_activity_pool.java

Open in VSCode: examples/05-production/java/micrometer_metrics.java, custom_activity_metric.java, otel_tracing.java Stack: make stack-obs, then make grafana

Open in VSCode: examples/05-production/java/micrometer_metrics.java

Lead-in before the custom-metric code. Built-ins tell you the system is healthy; custom metrics tell you the business is. Same export path, so it's cheap to add.

Example: examples/05-production/java/custom_activity_metric.java

Lead-in before the OTel interceptor code. The win is correlation: without it, the Activity spans are orphans; with it, you see the whole request as one tree.

Runnable lab: examples/runnable/19-distributed-tracing (Java/Go/Python). Snippet: examples/05-production/java/otel_tracing.java. Java reaches OTel via the OpenTracing shim; Go and Python have native OTel interceptors. The propagator gotcha is real - without W3C TraceContext every span became its own root.

This is the payoff: 8 spans, one trace, across two services. Temporal uses FOLLOWS_FROM for the worker-side continuation (RunWorkflow follows StartWorkflow) and CHILD_OF for scheduling (StartActivity under RunWorkflow). Verified live.

Why this matters in three SDKs: Java is the interesting one - no native OTel module, so you keep the OpenTracing interceptors and bridge with the shim. The migration story is "swap the tracer, keep the wiring."

Captured live: a Worker exporting tally→Prometheus metrics on :9464, Prometheus scraping it, Grafana rendering the provisioned overview. Read beside the image.

Lead-in before the Search Attributes code. Contrast with metrics (aggregate) and memo (attached but NOT indexed). Keys are registered per namespace first, then upserted from Workflow code. This is how the "tenant ID in Search Attributes" from the namespace slide actually becomes filterable.

Open in VSCode: examples/05-production/java/search_attributes_ops.java

Open in VSCode: examples/05-production/namespace_strategy.md

Maps to instructor Demo 3 (custom-authorizer harness): a working Authorizer that pins each caller to a Role AND a Task Queue. The teaching point: the namespace is the boundary the platform enforces; everything finer is code you own.

Open in VSCode: examples/05-production/java/junit5_extension_mockito_test.java; runnable test in examples/runnable/06-testing/ReminderWorkflowTest.java Run: make run-testing (no server needed)

Lead-in before the TestWorkflowEnvironment code. Time-skipping is the "wow": durable timers normally make long Workflows untestable; here they run instantly.

Open in VSCode: examples/runnable/06-testing/src/test/java/training/temporal/testing/ReminderWorkflowTest.java · Run: make run-testing

Lead-in before the JUnit5 + Mockito code. The pattern: real Workflow, fake Activities. You're testing the decisions, not the side effects.

Example: examples/05-production/java/junit5_extension_mockito_test.java

Open in VSCode: examples/05-production/java/replay_test.java

Captured from examples/runnable/11-determinism-replay (go test -run TestReorderedCodeBreaksReplay -v). TMPRL1100 is the non-determinism error code - recorded history vs replay command diverge at event 5.

This is the safety net for the rest of the year. Encourage students to take this pattern back to their team and seed a corpus.

4 hours. Saga + Spring Boot: sync Updates, async Signals, continue-as-new. The capstone build (75 min build + 25 review + 20 Q&A) moved to the course close - it now caps all six days rather than just Day 5.

Open in VSCode: examples/06-saga-spring/java/saga_compensation.java (full project: examples/runnable/07-saga/) Run: make run-saga

Lead-in before the Saga code. The mental model is a stack: push a compensation after each success; on failure, pop them in reverse. Temporal runs your undo.
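
The stack model in plain Java, as a warm-up before the real Saga code. SagaSketch is hypothetical and hand-rolled on a Deque; the reserve/charge/ship steps and their undo names are made up for illustration. It shows only the push-on-success, pop-in-reverse mechanics, with none of the durability Temporal adds.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class SagaSketch {

    /** A forward step that may fail. */
    interface Step {
        void run() throws Exception;
    }

    private final Deque<Runnable> compensations = new ArrayDeque<>();

    /** Run a step; register its undo only after the step succeeded. */
    void step(Step action, Runnable undo) throws Exception {
        action.run();
        compensations.push(undo);
    }

    /** Pop and run the undos, newest first. */
    void compensate() {
        while (!compensations.isEmpty()) {
            compensations.pop().run();
        }
    }

    /** Hypothetical order flow: reserve, charge, ship; ship can be forced to fail. */
    static String placeOrder(boolean failAtShip, List<String> log) {
        SagaSketch saga = new SagaSketch();
        try {
            saga.step(() -> log.add("reserve"), () -> log.add("release"));
            saga.step(() -> log.add("charge"), () -> log.add("refund"));
            saga.step(() -> {
                if (failAtShip) {
                    throw new IllegalStateException("carrier down");
                }
                log.add("ship");
            }, () -> log.add("cancel-shipment"));
            return "COMPLETED";
        } catch (Exception e) {
            saga.compensate(); // ship never succeeded, so its undo was never pushed
            return "COMPENSATED";
        }
    }
}
```

With failAtShip set, the log reads reserve, charge, refund, release and the result is "COMPENSATED": the undos run in reverse, and the failed step has nothing to undo.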

Open in VSCode: examples/06-saga-spring/java/saga_compensation.java (full project: examples/runnable/07-saga/)

War story: when team #3 silently drops an event in a choreographed flow, nobody notices for 36 hours. Temporal's log shows it immediately.

Open in VSCode: examples/06-saga-spring/java/choreography_bridge.java + python/choreography_bridge.py + go/choreography_bridge.go

This is the answer to "does Temporal support choreography?" Yes, but do not turn every service into one giant shared Workflow. Keep bounded contexts independent. Use events at team boundaries, then use Temporal inside a team boundary when the service has stateful, retrying, long-running logic.

The listener does not contain business process logic. It makes the event durable inside Temporal, then commits the Kafka offset after Temporal accepts it.

Avoid presenting this as religion. The useful rule is ownership. If one team owns the outcome, orchestrate. If no team should own all downstream behavior, choreograph between teams and use Temporal locally.

Captured from examples/runnable/07-saga with input fail-at-ship. The Result is "COMPENSATED" - a handled business outcome, not a crash. The arc is named on the right; the takeaways continue on the next slide.

The generic integration, before any saga specifics. Open in VSCode: examples/06-saga-spring/java/spring_temporal_config.java (manual wiring) and the runnable starter app examples/runnable/16-spring-boot. Saga-specific triggers (sync Update, Kafka Signal) come in the next section.

The lead-in before the @Configuration soup. Three beans + lifecycle is the whole trick; Activities being Spring beans is what makes DI/testing feel native.

This is the underlying wiring. In production, prefer the temporal-spring-boot-starter and let it do this.

The payoff after the manual @Configuration slide. Same three responsibilities - client, Worker registration, lifecycle - now declared, not coded. Activities stay Spring beans so DI/testing are unchanged. Runnable: examples/runnable/16-spring-boot (Java only), `make run-spring`, then POST /greetings. Note the auto-discovery log lines naming the task queue - that IS the Worker the starter stood up.

OPTIONAL hands-on (or run it live as a demo). One Spring process is both the client and the Worker. Show the startup log auto-discovering GreetingWorkflowImpl onto the 'greetings' queue, POST to start+block for the result, GET to Query, then open greeting-Ada in the Web UI. Skip if short on time; the saga-in-Spring lab is the required one. Lands "client + Worker you wire in" with zero Temporal config.

Now the saga-specific wiring on top of the Spring Boot + Temporal basics. Open in VSCode: examples/06-saga-spring/java/sync_saga_update.java, async_saga_signal.java, kafka_listener_trigger.java, continue_as_new.java

Lead-in before the sync HTTP code. This is the "two front doors" pair with the next slide: HTTP Update here, Kafka Signal next - same Workflow underneath.

Open in VSCode: examples/06-saga-spring/java/sync_saga_update.java

Lead-in before the KafkaListener code. Reinforces signalWithStart from Day 3: the consumer never checks "does this order's Workflow exist yet?"

Open in VSCode: examples/06-saga-spring/java/kafka_listener_trigger.java

The lead-in before the loop code. The trigger is history size, not elapsed time: if a Workflow keeps appending events forever, continue-as-new resets the slate.
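
A toy model of "resets the slate", assuming a simple per-Run event budget as the trigger. Both the class and the maxEventsPerRun threshold are hypothetical; real Workflow code would decide from history length or size and then hand the remaining state to the next Run.

```java
import java.util.ArrayList;
import java.util.List;

final class ContinueAsNewSketch {

    /** Returns how many events each Run handled before handing off to the next one. */
    static List<Integer> runs(int totalEvents, int maxEventsPerRun) {
        List<Integer> perRun = new ArrayList<>();
        int remaining = totalEvents;
        while (remaining > 0) {
            int handled = Math.min(remaining, maxEventsPerRun);
            perRun.add(handled);  // this Run's history holds only these events
            remaining -= handled; // carried forward as the next Run's input
        }
        return perRun;
    }
}
```

runs(9, 3) returns [3, 3, 3]: nine events, three Runs, and no single history ever holds more than three.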

Open in VSCode: examples/06-saga-spring/java/continue_as_new.java

Captured from examples/runnable/12-continue-as-new (processed 9 events across 3 Runs). The First/Previous Execution links are the chain - walk them backwards.

4 hours: morning is AWS migration; afternoon is containers + KEDA. Two big labs: containerized Worker on kind, and KEDA autoscale.

Open in VSCode: examples/07-aws-containers/aws_mapping.md, step_functions_before.json vs step_functions_after_temporal.java

Open in VSCode: examples/07-aws-containers/java/glue_activity.java, s3_reference_payload.java Run: make run-aws

The lead-in before the polling loop. This is THE pattern for any external async job (Glue, EMR, Batch, SageMaker): start → poll+heartbeat → settle.

Open in VSCode: examples/07-aws-containers/java/glue_activity.java · Run: make run-aws

The question students always ask: "can Temporal resume my Glue job from step 3?" The honest answer is a question back: what's a step? Temporal can't reach inside one Spark script; that's opaque. But if your pipeline is extract → transform → load as three jobs, model each as an Activity and Temporal resumes at the failed one for free. Finer than that (mid-Spark) is Glue Job Bookmarks' job, not Temporal's. The heartbeat row is the re-attach case from the previous slides.

Open in VSCode: examples/runnable/17-spring-glue-pipeline

The payoff slide. Three Glue jobs, three Activities, one Workflow. When `load` fails, you fix it and the retry doesn't re-pay for extract + transform; their results come straight out of event history. This is the "resume from the failed step" guarantee the previous slide promised, made concrete. Contrast with Step Functions, where resuming mid-state-machine is a console-diving runbook.

Lead-in before the record types. The rule: if a payload could be big, it goes to blob storage and the Workflow carries the pointer. Keeps history small + fast.

Lead-in before the PayloadCodec code. The threat model: history is durable storage; treat it like any DB column you'd encrypt. Codec server = controlled decryption for the UI.

Example: examples/07-aws-containers/java/codec_server.java

Lead-in before the before/after code. The pitch: you stop maintaining ASL and its limits; the orchestration is code your team already knows.

Open in VSCode: examples/07-aws-containers/step_functions_before.json vs step_functions_after_temporal.java

Open in VSCode: examples/07-aws-containers/{java,python,go}/sqs_signal_bridge.*, sns_publish_activity.*, ssm_parameter_config.* The morning's missing third: what *drives* a Workflow, how it talks back out, and where its config/secrets live. All three run on free LocalStack.

Source: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues.html The point: SQS gives you at-least-once. Don't fight it - make the consumer idempotent by keying the Workflow Id off the event, so a redelivery re-signals the same run instead of starting a duplicate.

Source: https://docs.temporal.io/sending-messages Deleting before the signal is durable = a lost event on crash. Order matters: signal first, delete second.
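
The ordering argument as a runnable toy. SqsOrderingSketch is hypothetical: a Deque stands in for the queue (a message stays until deleted), a list stands in for what the Workflow durably received, and the contains() check stands in for whatever dedupe-by-event-key the real bridge uses.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class SqsOrderingSketch {

    /** Toy queue: a message stays visible until it is deleted. */
    final Deque<String> queue = new ArrayDeque<>();

    /** What the Workflow durably received. */
    final List<String> signaled = new ArrayList<>();

    /** Wrong order: a crash after the delete and before the signal loses the event. */
    void deleteThenSignal(boolean crashBetween) {
        String msg = queue.poll(); // delete
        if (crashBetween) {
            return;                // process dies here
        }
        signaled.add(msg);         // signal
    }

    /** Right order: a crash after the signal leaves the message for redelivery. */
    void signalThenDelete(boolean crashBetween) {
        String msg = queue.peek(); // receive, not yet deleted
        if (!signaled.contains(msg)) {
            signaled.add(msg);     // idempotent signal
        }
        if (crashBetween) {
            return;                // process dies here
        }
        queue.poll();              // delete
    }
}
```

Crash the wrong order once and the event is gone from both sides. Crash the right order once, let the redelivery run, and the event lands exactly once.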

Open in VSCode: examples/07-aws-containers/java/sqs_signal_bridge.java

Source: https://docs.aws.amazon.com/sns/latest/dg/welcome.html The publisher moved into Temporal; the fan-out topology (SNS -> SQS/Lambda/HTTP) is unchanged. Only "who calls Publish" changed.

Source: https://docs.temporal.io/develop/python/best-practices/error-handling This slide is the thesis of the whole "edges" sub-section: at-least-once meets at-least-once; idempotency keys reconcile them.

The capstone of the AWS morning: nothing new, just composition. validate (S3 list) → runGlueJob (the supervise-compute pattern from earlier) → publishValidation (SNS). The starter from Day 5 is what hosts it. Runnable: examples/runnable/17.

Open in VSCode: examples/runnable/17-spring-glue-pipeline/java/.../GlueStitchWorkflowImpl.java

Open in VSCode: examples/runnable/17-spring-glue-pipeline/java/.../SqsTriggerBridge.java

Source: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html The determinism trap: SSM values can change between replays. Bootstrap config at startup; per-run secrets via an Activity. This is also the on-ramp to IRSA/task roles (next sub-section) - no static keys.

Open in VSCode: examples/07-aws-containers/java/ssm_parameter_config.java

The conceptual slide said "config in startup code, secrets via an Activity"; this is the code. getParametersByPath loads the tree at boot; setTarget/setNamespace build the non-local stubs from it. fetchApiKey keeps the secret read behind the Activity boundary.

Open in VSCode: examples/07-aws-containers/Dockerfile, worker_deployment.yaml, keda_scaledobject.yaml

Open in VSCode: examples/07-aws-containers/Dockerfile (runnable: examples/runnable/08-aws-containers/java/Dockerfile)

Lead-in before the Deployment YAML. The "no Service/ingress" point surprises people from request/response services - Workers are outbound-only.

Open in VSCode: examples/07-aws-containers/worker_deployment.yaml

Lead-in before the ScaledObject YAML. Ties back to worker sizing + the schedule-to-start metric: that latency IS the autoscaling trigger.

Open in VSCode: examples/07-aws-containers/keda_scaledobject.yaml

Captured live: kind cluster + KEDA, an in-cluster dev server, a 300-task backlog on 'transform'. The repo manifests assume an in-cluster temporal-frontend + production namespace; this used a streamlined equivalent.

These four are NOT on free LocalStack (paid-tier emulators), so the labs are conceptual + reference manifests under examples/07-aws-containers/aws/ - no make targets. Walk them as lecture; apply only with an account. Each costs real money - tear down after. Labs 9-12.

Source: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html The headline: ECS has no native Temporal scaler the way EKS has KEDA. So you rebuild what KEDA bundles - publish backlog as a CloudWatch metric, target-track it. The wiring is on the next slide.

Contrast with lab 5: KEDA polls DescribeTaskQueue FOR you. On ECS you wire the poller -> metric -> scaler yourself. Same idea (scale on queue depth), more glue. Manifests: examples/07-aws-containers/aws/ecs_*.json + backlog_publisher.md

Source: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html The portability story: kind -> EKS is a no-op for the Worker. IRSA is the production answer to the dummy creds used all morning. Requires the cluster's OIDC provider enabled, or IRSA silently fails. Manifests: examples/07-aws-containers/aws/eksctl-cluster.yaml, irsa-serviceaccount.yaml, eks-worker-deployment.yaml

Source: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html This is the at-least-once thesis again, in SQL: a retried load Activity must not double-insert. The mechanics that make it safe are on the next slide.

The DynamoDB example (dynamodb_idempotency.*) is the NoSQL sibling. Aurora can also be Temporal's own persistence store when self-hosting - but that's an aside; the lab's focus is Aurora-as-application-sink. Files: examples/07-aws-containers/aws/aurora_schema.sql, aurora_load_activity.java, aurora_terraform.tf

Source: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html The big honesty point: students must not mistake DNS failover for DR. It's frontend HA only. Pair with Temporal multi-cluster replication for true cross-region. Files: examples/07-aws-containers/aws/route53_failover.tf, route53_records.md

Scaffold from examples/runnable/07-saga/ (Run: make run-saga). Challenge: day-05-saga-spring/lab-3-capstone.

Hold the line on time. At 50 minutes, stop everyone and check in. If most are stuck, slow down; if most are done, pull review forward.

Resist correcting code style. Focus on the four questions above. They are what the cohort will face on real systems.

Fundamentals

Temporal Fundamentals

Gaurav Agarwal

Course Agenda

Day 1

Foundations

Lab · Day 1

Local dev setup

Day 1

Why Temporal exists

Day 1 · on-ramp

Why Temporal exists

Every backend has these

What goes wrong

Tools you've shipped with

DAGs vs durable execution

Day 1

Core concepts

Day 1 · on-ramp

Core concepts

The four primitives

Workflow

Activity

Worker + Task Queue

One Worker hosts many types

Many Workers split the work

Why split: Compute & I/O

Routing Activities by Task Queue

One Workflow, three pools

Routing in code

Airflow → Temporal map

At a glance

Day 1

The Temporal SDK

Day 1 · on-ramp

The Temporal SDK

What the SDK actually does

Java SDK: the packages

Workflow: the deterministic toolkit

WorkflowClient: the way in from outside

Add it to your build

Serialization: your data becomes Payloads

Serialization: the limits

Options at a glance: wiring the client & Worker

Options at a glance: starting & retrying work

Day 1 · optional

Other SDKs: same model

Day 1 · on-ramp

Other SDKs: same model

Optional

Python SDK: temporalio

Optional

Go SDK: go.temporal.io/sdk

Optional

Cross-SDK at a glance

Day 1

Event sourcing & deterministic replay

Day 1 · on-ramp

Event sourcing & deterministic replay

The replay rule

Five families of non-determinism

Durable sleep

The Event History

Web UI: Workflows list

Web UI: Event History

Event History retention

Pruning a growing history

Limits you design around

Lab · Day 1

Hello Temporal

Lab · Day 1

Hello Temporal: read the history

Wait: two Task Queues in one history?

Normal vs Sticky Task Queues

Sticky execution: and its fallback

Split brain: two Workers, one run

Stateful for speed, stateless for truth

Workflow ID vs Run ID

If starting a Workflow fails

Start retry pattern
