For me the main issue with these systems is that durable execution is still seen as a special case of backend execution. I think the real value is in admitting that every POST/PUT should kick off a durable execution, but that doesn't seem to match the design, which treats these workflows as quite heavy and expensive and bases its pricing on that.

What we need is an opinionated framework that doesn't allow you to do anything except durable workflows, so your junior devs stop doing two POSTs in a row thinking things will be OK.
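A rough sketch of what I mean (entirely hypothetical framework, made-up names, just to illustrate the constraint):

    # Hypothetical framework: a POST handler cannot perform side effects
    # directly; the only thing it may return is a durable workflow to
    # enqueue, so "two POSTs in a row" from handler code is impossible by
    # construction.
    from dataclasses import dataclass

    DURABLE_QUEUE = []  # stand-in for a persisted queue/table

    @dataclass
    class Enqueue:
        workflow: str
        payload: dict

    def post_handler(fn):
        def wrapper(request):
            result = fn(request)
            if not isinstance(result, Enqueue):
                raise TypeError("POST handlers may only enqueue a durable workflow")
            DURABLE_QUEUE.append(result)  # persisted before the request is acked
            return {"status": "accepted", "workflow": result.workflow}
        return wrapper

    @post_handler
    def create_order(request):
        # no raw HTTP calls allowed here; just describe the work to be done
        return Enqueue("fulfill_order", {"order_id": request["order_id"]})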





The "constraining functions to only be durable" idea is really interesting to me and would solve the main gotcha of the article.

It'd be an interesting experiment to take memory snapshots after each step in a workflow, which an API like Firecracker might support, though it would likely add even more overhead than current engines in terms of cost and storage. I think some durable execution engines have experimented with this type of system before, but I can't find a source now - perhaps someone has a link to one of these.

There's also been some work, for example in the Temporal Python SDK, to override the asyncio event loop so that regular calls like `sleep` work as durable calls instead, to reduce the risk to developers. I'm not sure how well this generalizes.
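To make the event-loop idea concrete, here's a toy Python sketch (not how the Temporal SDK actually implements it): each completed wait is recorded in a persisted history, so a replay skips the real sleep.

    # Toy sketch, not Temporal's actual mechanism: record each completed
    # timer in a persisted history so that a replay after a crash returns
    # immediately instead of waiting again.
    import asyncio
    import time

    class DurableClock:
        def __init__(self, history):
            self.history = history  # persisted list of completed timers
            self.pos = 0

        async def sleep(self, seconds):
            if self.pos < len(self.history):  # replaying: timer already fired
                self.pos += 1
                return
            await asyncio.sleep(seconds)      # first execution: really wait
            self.history.append({"timer": seconds, "fired_at": time.time()})
            self.pos += 1

    # An SDK that overrides the event loop would route plain asyncio.sleep
    # calls through something like DurableClock.sleep transparently.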


> It'd be an interesting experiment to take memory snapshots after each step in a workflow, which an API like Firecracker might support...I think some durable execution engines have experimented with this type of system before, but I can't find a source now - perhaps someone has a link

Trigger.dev currently uses CRIU, but I recall reading on HN (https://news.ycombinator.com/item?id=45251132) that they're moving to MicroVMs. Their website (https://feedback.trigger.dev/p/make-runs-startresume-faster-...) suggests that they're using Firecracker specifically, but I haven't seen anything beyond that. It would definitely be interesting to hear how it works out, because I'm not aware of another Durable Execution platform that has done this yet.


OK, I'm not an expert here (you most likely are), but just my 2 cents on your response: I would very much argue against making this magic. For example:

> take memory snapshots after each step in a workflow

Don't do this. Just give people explicit boundaries for where their snapshots occur and what is snapshotted, so they have control over both durability and performance. Make it clear to people that everything should be in the chain of command of the snapshotting framework: e.g. no file-local or global variables. This is already how people program web services, but somehow nobody leans into it.

The thing is, if you want people to understand durability but you also hide it from them, the framework actually becomes much harder to understand and work with.

The real golden ticket, I think, is to make readable, intuitive abstractions around durability, not to hide it behind normal-looking code.
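Concretely, something like this (made-up API, Python purely for illustration), where the persisted step results are the one and only snapshot:

    # Made-up API: the only state that survives a crash is what flows
    # through ctx.step(), so the snapshot boundaries are explicit and
    # nothing outside the framework's chain of command (globals,
    # file-local variables) can silently leak into a snapshot.
    class Context:
        def __init__(self, store):
            self.store = store  # persisted dict, e.g. a row per workflow run

        def step(self, name, fn, *args):
            if name in self.store:     # already ran before a crash: reuse result
                return self.store[name]
            result = fn(*args)
            self.store[name] = result  # the explicit snapshot boundary
            return result

    # placeholder side effects, just so the example runs
    def charge_card(order_id): return {"charged": order_id}
    def create_label(order_id): return f"label-{order_id}"
    def send_email(order_id, label): print("emailing", order_id, label)

    def handle_order(ctx, order_id):
        payment = ctx.step("charge", charge_card, order_id)
        label = ctx.step("ship", create_label, order_id)
        ctx.step("notify", send_email, order_id, label)
        return payment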

Please steal my startup.


FWIW, I think the memory snapshotting idea isn't going to work for most stacks for a few different reasons, but to speak more broadly about API design for durable execution systems, I agree completely. One of the issues with Temporal and Hatchet in their current state is that they abstract away concepts that are essential for the developer to understand while building the system, like what it means for a workflow to be durable. So you end up discovering a bunch of weird behaviors, like the "non-determinism error", when you start testing these systems without a good grasp of the fundamentals.

We're investing heavily in separating out some of these primitives that come together in a DE system but are each useful on their own: tasks, idempotency keys, and workflow state (i.e. event history). I'm not sure exactly what this API will look like in its end state, but idempotency keys, durable tasks, and event-based histories are independently useful. This problem is only true of the durable execution side of the Hatchet platform, though; I think our other primitives (task queues, streaming, concurrency, rate limiting, retries) are more widely used than our `durableTasks` feature because of this very problem you're describing.
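To make the "independently useful" point concrete, an idempotency key on its own looks roughly like this (illustrative sketch only, not our actual API):

    # Illustrative sketch: the same idempotency key always maps to the same
    # run, so a retried POST can't enqueue duplicate work, even without any
    # of the other durable execution machinery.
    import uuid

    RUNS = {}   # stand-in for a persisted idempotency-key -> run-id table
    QUEUE = []  # stand-in for the task queue

    def enqueue_once(idempotency_key, workflow, payload):
        if idempotency_key in RUNS:
            return RUNS[idempotency_key]  # duplicate request: reuse the existing run
        run_id = str(uuid.uuid4())
        RUNS[idempotency_key] = run_id    # persisted atomically with the enqueue
        QUEUE.append((run_id, workflow, payload))
        return run_id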


> I think our other primitives (task queues, streaming, concurrency, rate limiting, retries) are more widely used than our `durableTasks` feature because of this very problem you're describing

Indeed, I'm happy to hear you say this.

I think it should be the other way around: if durable tasks are properly understood, it's actually the queues/streaming/concurrency/rate limits/retries that can be abstracted away and ignored.

Funny, I never realised this before.


> The thing is, if you want people to understand durability but you also hide it from them, the framework actually becomes much harder to understand and work with.

> The real golden ticket, I think, is to make readable, intuitive abstractions around durability, not to hide it behind normal-looking code.

It's a tradeoff. People tend to want to use languages they are familiar with, even at the cost of being constrained within them. A naive DSL would not be expressive enough for the Turing completeness one needs, so effectively you'd need a new language/runtime. It's far easier to constrain an existing language than to write a new one, of course.

Some languages/runtimes are easier to apply durable/deterministic constraints to (e.g. WASM, which is deterministic by design, and JS, which has a tiny stdlib that just needs a few things like time and rand replaced), but they still don't take the ideal step you mention: putting the durable primitives and their benefits/constraints clearly in front of the dev.
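As a rough sketch of what constraining an existing runtime looks like (Python here, purely for illustration): nondeterministic calls like time and rand get recorded on first execution and replayed afterwards, so the surrounding code stays deterministic.

    # Rough sketch: memoize nondeterministic values in a persisted history
    # so a replayed execution sees exactly the same time and random values
    # as the original run.
    import random
    import time

    class RecordedEnv:
        def __init__(self, recorded):
            self.recorded = recorded  # persisted list of observed values
            self.i = 0

        def _memo(self, produce):
            if self.i < len(self.recorded):
                value = self.recorded[self.i]  # replay
            else:
                value = produce()              # first execution
                self.recorded.append(value)
            self.i += 1
            return value

        def now(self):
            return self._memo(time.time)

        def rand(self):
            return self._memo(random.random)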


This still assumes an all-encompassing, transparent durability layer; what I'm arguing for is the opposite: something that can just be a library in any language and any runtime, because it does not try to be clever about injecting durability into otherwise idiomatic code.

Just to continue the idea: you wouldn't be constraining or tagging functions; you would relinquish control to a system that closely guards how you produce side effects. For example, making a raw HTTP request from a task is prohibited, not intercepted.

Doesn't Google have a similar type system for stuff like this? I recall an old engineering blog post or similar that detailed how they handled this at scale.

This would look like a handler taking an IO token that provides a memoizing `get_or_execute` function, plus utilities for calling these handlers, correct?
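i.e., something roughly like this (names made up, Python just for illustration):

    # Rough sketch of that shape: the handler can only reach the outside
    # world through the IO token, and every effect is memoized by key, so
    # re-running the handler after a crash is safe; a raw HTTP call with
    # no token simply has no place to go.
    class IO:
        def __init__(self, results):
            self.results = results  # persisted map of key -> result

        def get_or_execute(self, key, effect, *args):
            if key in self.results:
                return self.results[key]  # already executed: return the memo
            value = effect(*args)
            self.results[key] = value     # persist before moving on
            return value

    def run_handler(handler, payload, results):
        # utility for (re)invoking a handler with its memoized IO token
        return handler(IO(results), payload)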


