> Suppose, instead, we had a mechanism that allowed registering arbitrary panic hooks, and unregistering them when no longer needed, in any order. Then, we could do RAII-style resource handling: you could have a `CursesTerminal` type, which is responsible for cleaning up the terminal, and it cleans up the terminal on `Drop` and on panic. To do the latter, it would register a panic hook, and deregister that hook on `Drop`.
This doesn't get rid of unwinding at all- it's an inefficient reimplementation of it. There's a reason language implementations have switched away from having the main execution path register and unregister destructors and finally blocks, to storing them in a side table and recovering them at the time of the throw.
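For illustration, here's a minimal sketch of what unwinding plus `Drop` already gives you today (the `CursesTerminal` here is a placeholder, with print statements standing in for real terminal cleanup):

```rust
/// Placeholder type standing in for a real curses/terminal wrapper.
struct CursesTerminal;

impl CursesTerminal {
    fn new() -> Self {
        println!("entering curses mode");
        CursesTerminal
    }
}

impl Drop for CursesTerminal {
    fn drop(&mut self) {
        // Runs on normal scope exit *and* while a panic is unwinding,
        // with no explicit hook registration anywhere.
        println!("restoring the terminal");
    }
}

fn main() {
    let _term = CursesTerminal::new();
    println!("doing terminal work");
    // Uncomment to see the Drop impl still run during unwinding
    // (with the default panic=unwind):
    // panic!("something went wrong");
}
```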
Giving special treatment to code that "explicitly wants" to handle unwinding means two things:
* You have to know when an API can unwind, and you have to make it an error to unwind when the caller isn't expecting it. If this is done statically, you are getting into effect annotation territory. If this is done dynamically, you are essentially just injecting drop bombs into code that doesn't expect unwinding. Either way, you are multiplying complexity for generic code. (Not to mention you have to invent a whole new set of idioms for panic-free code.)
* You still have to be able to clean up the resources held by a caller that does expect unwinding. So all your vocabulary/glue/library code (the stuff that can't just assume panic=abort) still needs these "scoped panic hooks" in all the same places it has any level of panic awareness in Drop today.
So for anyone to actually benefit from this, they have to be writing panic-free code with whatever new static or dynamic tools come with it, and their code has to be narrowly scoped and purpose-specific enough that they could essentially already afford panic=abort today. Who is this even for?
To be very explicit about something: these are all vague design handwaves, and until they become not only concrete but sufficiently clear to handle use cases people have, they're not going to go anywhere. They're vague ideas we're thinking about. Right now, panic unwind isn't going anywhere.
Hypothetically Rust could make `Mutex<InnerBlah>` work with just two bits in the same way it makes `Option<&T>` the same size as `&T`. Annotate `InnerBlah` with the information about which bits are available and let `Mutex` use them.
There was talk of Rust allowing stride != alignment. [1] I think this would mean that if, say, `InnerBlah` has size 15 and alignment 8, `parking_lot::Mutex<InnerBlah>` could be size 16 rather than the current 24. The same would be true for an `OuterBlah` the mutex is one field of. But I don't think it'll happen.
In principle, Rust could create something like `std::num::NonZero` and its corresponding sealed trait `ZeroablePrimitive` to mark that two bits are unused. But that doesn't exist yet as far as I know.
There are also currently the unstable `rustc_layout_scalar_valid_range_start` and `rustc_layout_scalar_valid_range_end` attributes (which are used in the definition of `NonNull`, etc.), which could be used for some bit patterns.
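For a concrete picture of the niche optimization being referred to above, here's a quick size check (the results reflect what current rustc does on a typical 64-bit target):

```rust
use std::mem::size_of;
use std::num::NonZeroU32;
use std::ptr::NonNull;

fn main() {
    // The discriminant of `None` is packed into bit patterns the payload
    // type promises never to use, so the Option costs no extra space.
    assert_eq!(size_of::<Option<&u64>>(), size_of::<&u64>()); // null is the niche
    assert_eq!(size_of::<Option<NonZeroU32>>(), size_of::<u32>()); // zero is the niche
    assert_eq!(size_of::<Option<NonNull<u8>>>(), size_of::<*mut u8>()); // null again
    println!("niche-optimized layouts confirmed");
}
```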
The requirement is that the futures are not separate heap allocations, not that they are inert.
It's not at all obvious that Rust's is the only possible design that would work here. I strongly suspect it is not.
In fact, early Rust did some experimentation with exactly the sort of stack layout tricks you would need to approach this differently. For example, see Graydon's post here about the original implementation of iterators, as lightweight coroutines: https://old.reddit.com/r/ProgrammingLanguages/comments/141qm...
If it’s not inert, how do you use async in the kernel or on microcontrollers? A non-inert implementation presumes a single runtime implementation within std+compiler, and is not usable in environments where you need to implement your own meaning of dispatch.
I think the kernel and microcontroller use-case has been overstated.
A few bare metal projects use stackless coroutines (technically resumable functions) for concurrency, but it has turned out to be a much smaller use-case than anticipated. In practice, C and C++ coroutines are really not worth the pain they are to use, and Rust async has mostly taken off with heavy-duty executors like Tokio that very much don't target tiny `no_std` 16-bit microcontrollers.
The kernel actually doesn't use resumable functions for background work; it uses kernel threads. In the wider embedded world, threads are also vastly more common than people might think, and the really low-end uniprocessor systems are usually happy to block. Since these tiny systems are not juggling dozens of requests per second that block on I/O, they don't gain that much from coroutines anyway.
We mostly see bigger Rust projects use async when they have to handle concurrent requests that block on IO (network, FS, etc), and we mostly observe that the ecosystem is converging on tokio.
Threads are not free, but most embedded projects today that process requests in parallel — including the kernel — are already using them. Eager futures are more expensive than lazy futures, and less expensive than threads. They strike an interesting middle ground.
Lazy futures are extremely cheap at runtime. But we're paying a huge complexity cost in exchange, for the benefit of a very small user base that hasn't really fully materialized as we hoped it would.
> it has turned out to be a much smaller use-case than anticipated
Well, no, at the time of the design of Rust's async MVP, everyone was pretty well aware that the vast majority of the users would be writing webservers, and that the embedded use case would be a decided minority, if it ever existed at all. That Embassy exists, and that its ecosystem is as vibrant as it is, is, if anything, an unexpected triumph.
But regardless of how many people were actually expected to use it in practice, the underlying philosophy remained thus: there exist no features of Rust-the-language that are incompatible with no_std environments (e.g. Rust goes well out of its way, and introduces a lot of complexity, to make things like closures work given such constraints), and it would be exceptional and unprecedented for Rust to violate this principle when it comes to async.
Point taken, I might have formed the wrong impression at the time.
With my C++ background, I'm very much at home with that philosophy, but I think there is room for nuance in how strictly orthodox we are.
C++ does have optional language features that introduce some often unwelcome runtime overhead, like RTTI and unwinding.
Rust does not come configured for freestanding environments out of the box either. Like C++, you are opting out of language features like unwinding as well as the standard library when going freestanding.
I want to affirm that I'm convinced Rust is great for embedded. It's more that I mostly love async when I get to use it for background I/O with a full-fledged, work-stealing, thread-per-core marvel of engineering like tokio!
In freestanding Rust the I/O code is platform specific, suddenly I'd have to write the low-level async code myself, and it's not clear this makes the typical embedded project that much higher performance, or all that easy to maintain.
So, I don't want to say anything too radical. But I think the philosophy doesn't have to be as clear-cut as "no language feature may ever be incompatible with `no_std`". Offering a std-only language feature is not necessarily closing a door to embedded. We sort of already make opt-out concessions to have a friendlier experience for most people.
"Not inert" does not at all imply "a single runtime within std+compiler." You've jumped way too far in the opposite direction there.
The problem is that the particular interface Rust chose for controlling dispatch is not granular enough. When you are doing your own dispatch, you only get access to separate tasks, but for individual futures you are at the mercy of combinators like `select!` or `FuturesUnordered` that only have a narrow view of the system.
A better design would continue to avoid heap allocations and allow you to do your own dispatch, but operate in terms of individual suspended leaf futures. Combinators like `join!`/`select!`/etc. would be implemented more like they are in thread-based systems, waiting for sub-tasks to complete, rather than being responsible for driving them.
If you've got eager dispatch, I'm eager (pun intended) to learn how you'd have an executor that isn't baked into the std library and limited to a single runtime per process, because at the time of construction you need the language to schedule dispatch of the created future. This is one of the main challenges behind the pluggable executor effort: the set of executors that could be written is so varied (work stealing vs thread per core) that it's impossible to unify them without an effect system. And even then you've got the challenge of how to encode that in the language, because the executor is a global thing determined at runtime, but it's also local in the sense that you don't know which executor a given piece of code will end up actually being dispatched into, since the same async function can be invoked on different executors.
For better or worse, I think eager dispatch generally also implies not being able to cancel futures, since ownership is transferred to the executor rather than being retained by your code.
You don't need any of that, and you can keep cancellation too.
The core of an eager cooperative multitasking system does not even need the concept of an executor. You can spawn a new task by giving it some stack space and running its body to its first suspension point, right there on the current thread. When it suspends, the leaf API (e.g. `lock`) grabs the current top of the stack and stashes it somewhere, and when it's time to resume it again just runs the next part of the task right there on the current thread.
You can build different kinds of schedulers on top of this first-class ability to resume a particular leaf call in a task. For example, a `lock` integrated with a particular scheduler might queue up the resume somewhere instead of invoking it immediately. Or, a generic `lock` might be wrapped with an adapter that re-suspends and queues that up. None of this requires that the language know anything about the scheduler at all.
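To make the shape of this concrete, here's a deliberately toy sketch in today's Rust, with boxed closures standing in for the compiler-laid-out suspended frames (so none of the layout control described below): a `lock` that stashes the rest of the task and resumes it inline on `unlock`, with no executor type anywhere.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

/// Toy cooperative lock in continuation-passing style. There is no executor
/// and no `Waker`: "suspending" a task means handing the lock the rest of the
/// task as a closure, and "resuming" it means the lock calling that closure
/// right here on the current thread when `unlock` runs.
#[derive(Default)]
struct ToyLock {
    held: RefCell<bool>,
    waiters: RefCell<VecDeque<Box<dyn FnOnce(&ToyLock)>>>,
}

impl ToyLock {
    /// Acquire the lock, then run `rest` (the remainder of the task).
    fn lock(&self, rest: Box<dyn FnOnce(&ToyLock)>) {
        if *self.held.borrow() {
            // Suspension point: stash the continuation instead of blocking.
            self.waiters.borrow_mut().push_back(rest);
        } else {
            *self.held.borrow_mut() = true;
            rest(self);
        }
    }

    /// Release the lock and immediately resume one waiter, if any, on the
    /// current thread. A scheduler layered on top could queue this instead.
    fn unlock(&self) {
        let next = self.waiters.borrow_mut().pop_front();
        match next {
            Some(rest) => rest(self), // lock stays held; hand it to the next task
            None => *self.held.borrow_mut() = false,
        }
    }
}

fn main() {
    let lock = ToyLock::default();

    // "Spawn" task A by running it to its first suspension point, right here
    // on the current thread: it takes the lock and (to simulate suspending
    // while holding it) its continuation does not unlock yet.
    lock.lock(Box::new(|_| println!("task A acquired the lock")));

    // Task B tries to lock; the lock is held, so B's continuation is stashed.
    lock.lock(Box::new(|l| {
        println!("task B acquired the lock");
        l.unlock();
    }));

    // Later, something releases the lock on A's behalf; B resumes inline, here.
    lock.unlock();
}
```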
This is all typical of how higher level languages implement both stackful and stackless coroutines. The difference is that we want control over the "give it some stack space" part- we want the compiler to compute a maximum size and have us specify where to store it, whether that's on the heap (e.g. tokio::spawn) or nested in some other task's stack (e.g. join, select) or some statically-allocated storage (e.g. on a microcontroller).
(Of course the question then becomes, how do you ensure `lock` can't resume the task after it's been freed, either due to normal resumption or cancellation? Rust answers this with `Waker`, but this conflates the unit of stack ownership with the unit of scheduling, and in the process enables intermediate futures to route a given wakeup incorrectly. These must be decoupled so that `lock` can hold onto both the overall stack and the exact leaf suspension point it will eventually resume.)
Cancellation doesn't change much here. Given a task held from the "caller end" (as opposed to the leaf callee resume handles above), the language needs to provide a way to destruct the stack and let the decoupled `Waker` mechanism respond. This still propagates naturally to nested tasks like join/select arms, though there is now an additional wrinkle that a nested task may be actively running (and may even be the thing that indirectly provoked the cancellation).
On the other hand, early Rust also for instance had a tracing garbage collector; it's far from obvious to me how relevant its discarded design decisions are supposed to be to the language it is today.
This one is relevant because it avoids heap allocation while running the iterator and for loop body concurrently. Which is exactly the kind of thing that `async` does.
It avoids heap allocation in some situations. But in principle the exact same optimization could be done for stackful coroutines. Heck, right now in C I could stack-allocate an array and pass it to pthread_create as the stack for a new thread. To avoid an overlarge allocation I would need to know exactly how much stack is needed, but this is exactly the knowledge the Rust compiler already requires for async/await.
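As a loose analogy in today's Rust (not the same thing as supplying your own stack storage, which std doesn't expose): `std::thread::Builder` at least lets you bound a spawned thread's stack size, which is exactly the "how much stack is needed" knowledge being discussed. The 64 KiB figure below is an arbitrary illustration, not a computed bound.

```rust
use std::thread;

fn main() {
    // Bound the stack of a spawned thread. Unlike pthread's attribute APIs in
    // C, std only takes a size, not caller-provided storage.
    let handle = thread::Builder::new()
        .name("small-stack-worker".into())
        .stack_size(64 * 1024)
        .spawn(|| (0u64..1_000).sum::<u64>())
        .expect("failed to spawn thread");

    println!("sum = {}", handle.join().unwrap());
}
```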
What people care about are semantics. async/await leaks implementation details. One of the reasons Rust does it the way it currently does is because the implementation avoids requiring support from, e.g., LLVM, which might require some feature work to support a deeper level of integration of async without losing what benefits the current implementation provides. Rust has a few warts like this where semantics are stilted in order to confine the implementation work to the high-level Rust compiler.
> in principle the exact same optimization could be done for stackful coroutines.
Yes, I totally agree, and this is sort of what I imagine a better design would look like.
> One of the reasons Rust does it the way it currently does is because the implementation avoids requiring support from, e.g., LLVM
This I would argue is simply a failure of imagination. All you need from the LLVM layer is tail calls, and then you can manage the stack layout yourself in essentially the same way Rust manages Future layout.
You don't even need arbitrary tail calls. The compiler can limit itself to the sorts of things LLVM asks for- specific calling convention, matching function signatures, etc. when transferring control between tasks, because it can store most of the state in the stack that it laid out itself.
In order to know for sure how much stack is needed (or to replace the stack with a static allocation, which used to be common on older machines and still today in deep embedded code, and even on GPU!), you must ensure that any functions you call within your thread are non-reentrant, or else that they resort to an auxiliary stack-like allocation if reentrancy is required. This is a fundamental constraint (not something limited to current LLVM) which in practice leads you right back into the "what color are your functions?" world.
Additionally, they ignore field experience. I can tell that on VC++ the lifetime checker has only worked on small examples, as I was really into trying it out.
Microsoft even has blog posts admitting that it can only be improved with SAL-like annotations, while keeping the usual C++ semantics.
A method call like `.trunc()` is still going to be abysmally less ergonomic than `as`. It relies on inference or turbofish to pick a type, and it has all the syntactic noise of a function call on top of that.
Not to mention this sort of proliferation of micro-calls for what should be <= 1 instruction has a cost to debug performance and/or compile times (though this is something that should be fixed regardless).
> A method call like `.trunc()` is still going to be abysmally less ergonomic than `as`. It relies on inference or turbofish to pick a type, and it has all the syntactic noise of a function call on top of that.
If `as` gets repurposed for safe conversions (e.g. u32 to u64), there's some merit to the more hazardous conversions being slightly noisier. I'm all for them being no noisier than necessary, but even in my most conversion-heavy code (which has to convert regularly between usize and u64), I'd be fine writing `.into()` or `.trunc()` everywhere, as long as I don't have to write `.try_into()?` or similar.
> Not to mention this sort of proliferation of micro-calls for what should be <= 1 instruction has a cost to debug performance and/or compile times (though this is something that should be fixed regardless).
I fully expect that such methods will be inlined, likely even in debug mode (e.g. `#[inline(always)]`), and compile down to the same minimal instructions.
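As a sketch of why that's a reasonable expectation: a hypothetical `.trunc()` (the traits and names below are made up for illustration, not a proposed API) is just a thin wrapper around the truncating cast, and a trivial `#[inline(always)]` wrapper compiles down to the same instruction(s) as `as` does today.

```rust
/// Hypothetical truncating-conversion trait, mirroring how From/Into pair up.
trait TruncFrom<T> {
    fn trunc_from(value: T) -> Self;
}

impl TruncFrom<u64> for u32 {
    #[inline(always)]
    fn trunc_from(value: u64) -> Self {
        value as u32 // keep the low 32 bits, same as today's `as` cast
    }
}

trait TruncInto<T> {
    fn trunc(self) -> T;
}

// Blanket impl so call sites can write `x.trunc()` and let inference pick the
// target type, just like `.into()`.
impl<T, U: TruncFrom<T>> TruncInto<U> for T {
    #[inline(always)]
    fn trunc(self) -> U {
        U::trunc_from(self)
    }
}

fn main() {
    let x: u64 = 0x1_0000_ABCD;
    let y: u32 = x.trunc(); // truncates to the low 32 bits
    assert_eq!(y, 0xABCD);
    println!("{y:#x}");
}
```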
Yes, this is specifically what I'm disagreeing with.
> I fully expect that such methods will be inlined, likely even in debug mode (e.g. `#[inline(always)]`), and compile down to the same minimal instructions.
Many things in the language theoretically go through a trait as well, except that we have special cases in the compiler to handle those traits more efficiently. If this were a performance issue, there's no reason we couldn't do the same for `.trunc()` or `.into()`.
The compiler doesn't have to implement a call as a call; having "magic functions" calls to which are special-cased by the code generator is an old and time-honored tradition.
I've been using https://messages.google.com to get something like the desktop iMessage experience with Android- does that work for your use case? (I don't use iMessage so I could just be missing some killer feature it has, or something.)
The tokenizer is not really a good demonstration of the differences between these styles. A more representative comparison would be the later stages that build, traverse, and manipulate tree and graph data structures.
I think a reasonable comparison would have to be a DoD Rust parser vs the current Rust parser. Comparing across languages isn't very useful, because Zig has very different syntax rules, and doesn't provide diagnostics at anywhere near the same level as Rust does. The Rust compiler (and also its parser) spends an incredible amount of effort on diagnostics, to the point of actually trying to parse syntax from other languages (e.g. Python), just to warn people not to use Python syntax in Rust. Not to mention that it needs to deal with decl and proc macros, intertwine that with name resolution, etc. etc. All of this of course hurts parsing performance quite a lot, and IMO it would both make it much harder to write the whole thing in DoD and shrink the DoD performance benefits, because of all the heterogeneous functionality the Rust frontend does. Those are of course deliberate decisions of Rust that favor other things over compilation performance.
Your points here don't really make sense. There are many ways you can apply DoD to a codebase, but by far the main one (both easiest and most important) is to optimize the in-memory layout of long-lived objects. I won't claim to be familiar with the Rust compiler pipeline, but for most compilers, that means you'd have a nice compact representation for a `Token` and `AstNode` (or whatever you call those concepts), but the code between them -- i.e. the parser -- isn't really affected. In other words, all the fancy features you describe -- macros intertwined with name resolution, parsing syntax from other languages, high-quality diagnostics -- don't care about DoD!

Our approach in the Zig compiler has evolved over time, but we're slowly converging towards a style where all of the access to the memory-efficient dense representation is abstracted behind functions. So, you write your actual processing (e.g. your parser with all the features you mention) just the same; the only real difference is that when your parser wants to, for instance, get a token (as input) or emit an AST node (as output), it calls functions to do that, and those functions pull out the bytes you need into a lovely `struct` or (in Rust terms) `enum` or whatever the case may be.
Our typical style in Zig, or at least what we tend to do when writing DoD structures nowadays, is to have the function[s] for "reading" that long-lived data (e.g. getting a single token out from a memory-efficient packed representation of "all the tokens") in the implementation of the DoD type, and the functions for "writing" it in the one place that generates that thing. For instance, the parser has functions to deal with writing a "completed" AST node to the efficient representation it's building, and the AST type itself has functions (used by the next phase of the compiler pipeline, in our case a phase called AstGen) to extract data about a single AST node from that efficient representation. That way, barely any code has to actually be aware of the optimized representation being used behind the scenes. As mentioned above, what you end up with is that the actual processing phases look more-or-less identical to how they would without DoD.
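In Rust terms, a minimal sketch of that pattern might look like the following (the token kinds and fields are made up for illustration): the compact struct-of-arrays layout is visible only to `push` and `get`, and the rest of the parser just works with an ordinary `enum`.

```rust
/// Convenient per-token view used by the "processing" code.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Token {
    Ident { start: u32, len: u32 },
    Integer { start: u32, len: u32 },
    LParen { start: u32 },
}

/// One-byte tag stored in the dense representation.
#[derive(Clone, Copy)]
#[repr(u8)]
enum Tag {
    Ident,
    Integer,
    LParen,
}

/// Struct-of-arrays token table: one byte of tag plus fixed-size payload
/// per token, instead of a Vec of padded enums.
#[derive(Default)]
struct TokenList {
    tags: Vec<Tag>,
    starts: Vec<u32>,
    lens: Vec<u32>,
}

impl TokenList {
    /// "Write" side, used only by the tokenizer.
    fn push(&mut self, token: Token) {
        let (tag, start, len) = match token {
            Token::Ident { start, len } => (Tag::Ident, start, len),
            Token::Integer { start, len } => (Tag::Integer, start, len),
            Token::LParen { start } => (Tag::LParen, start, 0),
        };
        self.tags.push(tag);
        self.starts.push(start);
        self.lens.push(len);
    }

    /// "Read" side, used by the parser: reassemble a convenient enum value.
    fn get(&self, i: usize) -> Token {
        let (start, len) = (self.starts[i], self.lens[i]);
        match self.tags[i] {
            Tag::Ident => Token::Ident { start, len },
            Tag::Integer => Token::Integer { start, len },
            Tag::LParen => Token::LParen { start },
        }
    }
}

fn main() {
    let mut tokens = TokenList::default();
    tokens.push(Token::Ident { start: 0, len: 3 });
    tokens.push(Token::LParen { start: 3 });
    assert_eq!(tokens.get(1), Token::LParen { start: 3 });
    println!("{:?}", tokens.get(0));
}
```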
FWIW, I don't think the parser is our best code here: it's one of the oldest "DoD-ified" things in the Zig codebase so has some outdated patterns and questionable naming. Personally, I'm partial to `ZonGen`[0] as a fairly good example of a "processing" phase (although I'm admittedly biased!). It inputs an AST and outputs a simple tree IR for a subset of Zig which is analogous to JSON. Then, for an example of code consuming that generated IR, take a look at `print_zoir`[1], which just dumps the tree to stdout (or whatever) for debugging purposes. The interesting logic is in `PrintZon.renderNode` in that file: note how it calls `node.get`, and then just has a nice convenient tagged union (`enum` in Rust terms) value to work with.
I also don't know all the details, but the Rust parser tokens contain horrible crimes, primarily because of macros. All I wanted to say was that applying DoD to the parser in Rust would (IMO) be much more difficult than in Zig, because of language differences and different approaches to error reporting. Not saying it's impossible ofc. That being said, I don't really think so much effort would be worth it here; the gain would be minimal in the grand scheme of things, and we have bigger perf problems than parsing.
"Hand-rolled assembly" was one item in a list that also included DoD. You're reading way more into that sentence than they wrote- the claim is that DoD itself also impacts the maintainability of the codebase.