I think you might be confusing Runtime, OS and bare-metal primitives. Java virtu...

gpderetta · on March 28, 2024

Yes, of course a non-cooperative switch is more expensive than a cooperative one. But the thread model does not require preemption or even time-slice scheduling.

But with async/await cooperative switch is the only option.

f_devd · on March 28, 2024

I'm unfamiliar with a bare-metal thread model that doesn't do preemption outside of a Runtime. I imagine you'd need to effectively inject code to do a cooperative switch as there aren't many ways for a cpu to exit it's current 'task' outside of an interrupt (pre-emption) or a defer call (cooroutines/async). For Runtimes it usually also means you effectively have a cooperative switch but it's hidden away in runtime code.

Do you have an example?

avodonosov · on April 2, 2024

@f_devd, I realized that my main objection to async/await does not apply to Rust.

Thank you for staying in the discussion long enough for me to realize that completely.

I dislike async/await in Javascript because async functions can not be called synchronously from normal functions. The calling function and all its callers and all their callers need to turned async.

In Rust, since we can simply do `executor::block_on(some_async_functino())`, my objection goes away - all primitives remain fully composable. Async functions can call usual functions and vice versa.

So my first comment was to some extend a "knee-jerk reaction".

As we started to discuss thread preemption cost, I will provide some responses below. In short, I believe it can be on par with async/await.

=================================================

> I think you might be confusing Runtime, OS and bare-metal primitives.

I am not confusing, but I consider all those cases down to what happens at CPU level.

> Java virtual threads are possible because there is always the runtime which code will return to, and since it's already executing in a VM the concept of Stack/Heap Store/Loads don't really matter for performance.

They remain applicable, as at the lowest level the VM / Runtime is executed by a CPU.

> Async/await (in rust) is without a runtime,

Rust Executor is a kind of runtime, IMHO.

> and without copies or register stores/loads;

The CPU register values are still saved to memory when async function returns Poll::Pending, so that the intermediate computation results are not lost and when polled again the function continues its execution correctly. (On the level of Rust source code, the register saving corresponds to assignment of local variables of the most nested async function to the fields of the generated anonymous future).

==============================================

> In embedded you might not have a stack base, just a stack pointer, this means in order to switch to a different stack you need to copy 2 stacks. (I might be wrong here; I know some processors have linear stacks, but this might be more uncommon).

If the CPU does not have a stack base (stack segment register), saving of the stack pointer is enough to switch to another stack.

In practice, I think, even CPUs with stack segment register, most often only need to save stack pointer for context switch - all stacks of the process can live in the same segment, and even for different processes the OS can arrange the segments to have the equal segment selector. I know that switching to kernel mode usually involves changing stack segment register in addition to the stack pointer (as the kernel stack segment has different protection level).

==============================================

> On bare metal this dynamic changes significantly, in order to "switch contexts" with preemption the following steps are needed (omitting the kernel switch ops): [...] While for async/await everything already in place on the stack/heap so a context switch is: [..]

The operations you listed for bare metal are very cheap, some items in the list are just single CPU instruction. (Also, I think timer interrupts are configured once for periodic interval and don't need to be recalculated and set on every context switch).

If one expands the "go to next Future in Waker queue" item you listed for async/await in the same level of detail that you did for bare metal, the resulting list may be even longer than the bare metal list.

==============================================

The majority of the context switch cost at CPU level is when we switch to different process, so that new virtual memory mapping table needs to be loaded to the CPU, (and correspondingly, the cached mappings in TLB needs to be reset and new ones need to be computed during execution in the new context); from the need to load different descriptor tables.

Nothing of that applies to in-process green thread context switches.