
In my one encounter with one of these systems, it added new code and tooling complexity, introduced orders-of-magnitude performance overhead for most operations, and made dev and debug workflows much slower. All for... an occasional convenience far outweighed by the overall drag of using it. There are probably other environments where something like this makes sense, but I can't figure out what they are.


> All for... an occasional convenience far outweighed by the overall drag of using it

If you have any long-running operation that could be interrupted mid-run by a network fluke (or the termination of the VM running your program, or your program being OOMed, or some issue with a third-party service your app talks to, etc.), and you don't want to restart the whole thing from scratch, you could benefit from these systems. The alternative is having engineers manually try to repair the state and restart execution in just the right place, which scales very badly.
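To make that concrete, here is a minimal sketch of the checkpointing idea (the file-based store and step names are hypothetical, not any particular vendor's API): each completed step is recorded durably, so rerunning the whole function after an interruption skips straight to the first unfinished step instead of starting from scratch.

    import json
    import os

    CHECKPOINT_FILE = "checkpoints.json"  # hypothetical local checkpoint store

    def load_checkpoints():
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                return json.load(f)
        return {}

    def run_step(checkpoints, step_name, fn):
        # Skip steps that already completed in a previous (interrupted) run.
        if step_name in checkpoints:
            return checkpoints[step_name]
        result = fn()
        checkpoints[step_name] = result
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump(checkpoints, f)
        return result

    def provision_workspace():
        checkpoints = load_checkpoints()
        # Each step may fail on a network fluke; rerunning the function
        # resumes after the last recorded step rather than redoing work.
        vpc = run_step(checkpoints, "create_vpc", lambda: {"vpc_id": "vpc-123"})
        db = run_step(checkpoints, "create_db", lambda: {"db_id": "db-456"})
        return vpc, db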

I have an application that needs to stand up a bunch of cloud infrastructure (a "workspace" in which users can do research) at the press of a button, and I want to make sure the right infrastructure exists even if a deployment attempt is interrupted or the upstream definition of a workspace changes. Every month there are dozens of network flukes or 5XX errors from remote endpoints that would otherwise leave these workspaces in a broken state and in need of manual repair. Instead, the system heals itself whenever the fault clears, and I basically never have to look at it (I do periodically check the error logs, though, to confirm that the system is actually recovering from faults rather than quietly on fire behind some bug in the alerting that is keeping things silent).


The system I used didn't have any notion of repair, just retry-forever. What did you use for that? I've written service tree management tools that do that sort of thing on a single host but not any kind of distributed system.


Repair is just continuously retrying some reconciliation operation, where "reconciliation" means diffing the desired state against the current state to figure out what actions need to be performed. In my case I needed to look up the definition of a "workspace" (from a database or similar) in terms of what infrastructure should exist, query the cloud provider APIs to figure out what infrastructure did exist, and then create any missing infrastructure, delete any infrastructure that ought not exist, and update any infrastructure whose state is not how it ought to be.
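Roughly, the loop looks something like the sketch below; fetch_desired and fetch_actual are placeholders for the database lookup and the cloud-provider queries described above.

    import time

    def diff(desired, actual):
        # Return the create/update/delete actions implied by desired vs. actual state.
        to_create = desired.keys() - actual.keys()
        to_delete = actual.keys() - desired.keys()
        to_update = {k for k in desired.keys() & actual.keys() if desired[k] != actual[k]}
        return to_create, to_update, to_delete

    def reconcile_forever(fetch_desired, fetch_actual, apply_change, interval=60):
        # "Repair" is just this loop running continuously: any fault or drift
        # gets corrected the next time the diff is computed and applied.
        while True:
            try:
                desired = fetch_desired()  # e.g. workspace definition from a database
                actual = fetch_actual()    # e.g. what the cloud provider says exists
                to_create, to_update, to_delete = diff(desired, actual)
                for name in to_create:
                    apply_change("create", name, desired[name])
                for name in to_update:
                    apply_change("update", name, desired[name])
                for name in to_delete:
                    apply_change("delete", name, None)
            except Exception:
                pass  # transient faults (5XXs, network flukes) retry on the next pass
            time.sleep(interval)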

> I've written service tree management tools that do that sort of thing on a single host but not any kind of distributed system.

That’s essentially what Kubernetes is—a distributed process manager (assuming process management is what you are describing by “service tree”).


I'm not sure which one you used, but ideally it's so lightweight that the benefits outweigh the slight cost of developing with it. Besides the recovery benefit, there are observability and debugging benefits too.


I don't want to start a debate about a specific vendor, but the cost was very high. Leaky serialization of call arguments and results, then hairpinning messages across the internet and back to get to workers. 200ms of overhead for a no-op call. There was some observability benefit, but it didn't allow for debugger access and had its own special way of packaging code, so that was a net add of complexity too. That's not getting into the induced complexity caused by adding a bunch of RPC boundaries to fit their execution model. And using the thing effectively still requires understanding their runtime model. I understand the motivation, but not the technical approach.


Regardless of the vendor, it sounds like you were using the old-style model, where there is a central coordinator and a shim library that talks to a black-box binary.

The style presented in this blog post doesn't suffer from those downsides. It's all done with local databases and pure language libraries, and is completely transparent to the user.


Yeah, the system in the blog post retargeted at Postgres would be a step up from what I've used. I'm still skeptical of the underlying model of message replay for rehydration, because it makes reasoning about changes to the logic ("flows" in the post's terminology) really hard. You have to understand what the runtime is doing, how all the previous versions of the code worked, the implications for all the possible states of the cached step results, and how those logs will behave when replayed through the current flow code. I think in all worlds where transactions are necessary a central coordinator is necessary, whether it's an RDBMS under a traditional app or something fancier under one of these durable execution things.
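For concreteness, here is a rough sketch of the replay model being described (hypothetical names, not any specific runtime's API): step results are journaled in call order, and rehydration re-executes the flow function with journaled results substituted for the real calls. Reordering, renaming, or removing steps between code versions changes which journal entry each call lines up with, which is where the versioning headaches come from.

    class Journal:
        """Append-only log of completed step results, keyed by call order."""
        def __init__(self, entries=None):
            self.entries = list(entries or [])
            self.cursor = 0

        def step(self, name, fn):
            # During replay, return the recorded result instead of re-executing.
            if self.cursor < len(self.entries):
                recorded_name, result = self.entries[self.cursor]
                self.cursor += 1
                if recorded_name != name:
                    # The flow code changed since the log was written; replay
                    # no longer lines up with the recorded history.
                    raise RuntimeError(f"replay mismatch: expected {recorded_name}, got {name}")
                return result
            result = fn()
            self.entries.append((name, result))
            self.cursor += 1
            return result

    def flow(journal):
        a = journal.step("reserve_quota", lambda: 10)
        b = journal.step("create_vm", lambda: "vm-1")
        return a, b

    # The first run executes both steps and records them; a later rehydration
    # replays them from the journal. If the deployed flow code has changed in
    # the meantime, the recorded entries may no longer match the calls the new
    # code makes, which is exactly the reasoning burden described above.
    first = Journal()
    flow(first)
    flow(Journal(first.entries))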

In the end I'm left wondering what the net benefit is over, say, an actor framework that maps more directly to the notion of long-lived state with occasional activity and is easier to test.

All that said some of the vendors have raised hundreds of millions of dollars so someone must believe in the idea.


Temporal



