Async/await with one thread is simple and well-understood. That's the Javascript model. Threads let you get all those CPUs working on the problem, and Rust helps you manage the locking. Plus, you can have threads at different priorities, which may be necessary if you're compute-bound.
Multi-threaded async/await gets ugly. If you have serious compute-bound sections, the model tends to break down, because you're effectively blocking a thread that you share with others.
Compute-bound multi-threaded code does not work as well in Rust as it should. Problems include:
- Futex congestion collapse. This tends to be a problem with some storage allocators. Many threads are hitting the same locks. In particular, growing a buffer can get very expensive in allocators where the recopying takes place with the entire storage allocator locked.
I've mentioned before that Wine's library allocator, in a .DLL that's emulating a Microsoft library, is badly prone to this problem. Performance drops by two orders of magnitude with all the CPU time going into spinlocks. Microsoft's own implementation does not have this problem.
- Starvation of unfair mutexes. Both the standard Mutex and crossbeam-channel channels are unfair. If you have multiple threads locking a resource, doing something, unlocking the resource, and repeating that cycle, one thread will win repeatedly and the others will get locked out.[1]
If you need fair mutexes, there's parking_lot. But you don't get the poisoning safety on thread panic that the standard mutexes give you (see the sketch below).
If you're not I/O bound, this gets much more complicated.
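A minimal sketch of that starvation pattern, in case anyone wants to reproduce it (the thread count, timing, and the parking_lot suggestion are mine, not from the linked thread): four threads hammer the same std::sync::Mutex in a tight lock/bump/unlock loop, and the per-thread counts frequently come out very lopsided because the releasing thread tends to win the re-acquire.

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Duration;

    fn main() {
        let stop = Arc::new(AtomicBool::new(false));
        let counts = Arc::new(Mutex::new(vec![0u64; 4]));

        let handles: Vec<_> = (0..4usize)
            .map(|i| {
                let stop = Arc::clone(&stop);
                let counts = Arc::clone(&counts);
                thread::spawn(move || {
                    while !stop.load(Ordering::Relaxed) {
                        // Lock, do a tiny bit of work, unlock, immediately re-lock.
                        // With an unfair mutex, the thread that just released the
                        // lock often re-acquires it before the others even wake up.
                        counts.lock().unwrap()[i] += 1;
                    }
                })
            })
            .collect();

        thread::sleep(Duration::from_secs(1));
        stop.store(true, Ordering::Relaxed);
        for h in handles {
            h.join().unwrap();
        }
        // Frequently very uneven. Swapping in parking_lot's fair mutex (FIFO
        // handoff, but no poisoning) evens the counts out at some throughput cost.
        println!("{:?}", *counts.lock().unwrap());
    }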
I've mostly dealt only with IO-bound computations, but the contention issues arise there as well. What's the point of having a million coroutines when IO throughput is the bound anyway? How will coroutines save me when I immediately exhaust my size-10 DB connection pool? They won't; they just make debugging and working around the issues harder, and the whole thing more difficult to reason about.
The async/await model in particular tends to end up with micro-tasks hanging in some random place in the code, and they're very hard to trace back to the cause because they can be dispersed practically anywhere.
The degree of concurrency is also rather undefined, as are priorities most of the time.
This can be partly fixed by labelling, which adds more complexity but is at least explicit. Then the programmer needs to know what to label... which they won't, and Rust has no training wheels to help with concurrency.
With threads, you have well-defined ingress and egress. Priorities are handled by the OS, and some degree of fairness is usually ensured.
> What's the point of having a million coroutines when the IO throughput is bounded again?
Because if you did the same thing with a million threads, you’d have all those same problems and more, because (a) threads take up more RAM, (b) threads are more expensive to spawn and tear down, and (c) threads have more registers to save/restore when context switching than coroutines.
The real answer is that you shouldn't have a million coroutines, and you shouldn't have a million threads. Doing such a thing is only useful if you have other useful work you can do while you wait on I/O. This is true for web servers (which want to serve other requests, so… maybe one coroutine per active request) and UIs (which should have a dedicated thread for the UI event loop), but for other applications there's shockingly little need to actually do async programming in the first place. Parallelism is a well-understood problem and can be used in key places (i.e. doing a lot of math that you can easily split across multiple cores), but concurrency is IMO a net negative for most applications that don't look like a web server.
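On the "one coroutine per active request" point and the size-10 DB pool complaint upthread: the usual fix is to bound in-flight work explicitly instead of spawning unbounded coroutines. A rough sketch assuming tokio; handle_request and the numbers are made up:

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Semaphore;

    // Hypothetical stand-in for "talk to the database"; a real handler would
    // borrow a connection from whatever client owns the size-10 pool.
    async fn handle_request(id: u32) {
        tokio::time::sleep(Duration::from_millis(50)).await;
        println!("finished request {id}");
    }

    #[tokio::main]
    async fn main() {
        // Cap in-flight work at the size of the downstream resource instead of
        // letting a million coroutines pile up behind it.
        let permits = Arc::new(Semaphore::new(10));
        let mut tasks = Vec::new();

        for id in 0..100u32 {
            let permits = Arc::clone(&permits);
            tasks.push(tokio::spawn(async move {
                let _permit = permits.acquire().await.expect("semaphore closed");
                handle_request(id).await;
                // The permit drops here, letting the next request proceed.
            }));
        }

        for t in tasks {
            t.await.expect("task panicked");
        }
    }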
Just morning bathroom musings based on your posts (yep /g), and this got me thinking that the robust solution (once and for all, for all languages) may require a rethink at the hardware level. The CPU-bound issue comes down to systemic interrupt/resume, I think; if this can be done fairly for n in-progress threads of execution with efficient queued context swaps (say, a CPU with n in-flight contexts), then the problem becomes a resource allocation issue. Your thoughts?
That's what "hyper-threading" is. Beyond 2 hyper-threads, enough hardware has to be duplicated that it seems to be more effective to just add another CPU. If anybody ever built a 4-hyperthread CPU, it didn't become a major product.
It's been tried a few times in the past, back when CPUs were slow relative to memory. There was a National Semiconductor microprocessor where the state of the CPU was stored in main memory, and, by changing one register, control switched to another thread. Going way back, the CDC 6600, which was said to have 10 peripheral processors for I/O, really had only one, with ten copies of the state hardware.
Today, memory is more of a bottleneck than the CPU, so this is not a win.
The UltraSPARC T1 had 4-way SMT, and its successors bumped that to 8-way. Modern GPU compute is also highly based on hardware multi-threading as a way of compensating for memory latency, while also having wide execution units that can extract fine-grained parallelism within individual threads.
Missed that. That's part of IBM mainframe technology, where you can have "logical partitions", a cluster on a chip, and assign various resources to each. IBM POWER10 apparently allows up to 8-way hyperthreading if configured that way.
(This has been a very low priority background thread in my head this morning so cut me some slack on hand waving.)
Historically, the H/W folks addressed memory-related architectural changes, such as when multicore came around and we got multi-level caches. Imagine if we had to deal at the software level with memory coherence across cores [down to the fundamental level of invalidating Lx bytes]. There would be NUMA-like libraries and various hacks to make it happen.
Arguably you could say "all that is in principle the OS's responsibility, even memory coherence across cores" and be done with it. Or you could agree that "thank God the H/W people took care of this" and ask whether they can do the same for processing.
The CPU model afaik hasn't changed that much in terms of the granularity of execution steps, whereas the H/W people could realize that, d'oh, a defined execution granularity combined with a hot context-switching mechanism could really help the poor unwashed coders efficiently execute multiple competing sequences of code (which is all the hardware knows about anyway).
If your CPU's architecture specs n±e clock ticks per context iteration, then you compile for that and design languages around that. Being CPU-bound then just means heavy CPU usage, not a disaster for every other process sharing the machine with you. It becomes a matter of provisioning instead of ad-hoc provisioning in the program.
If our implementations are bad because of preemption, then I’m not sure why the natural conclusion isn’t “maybe there should be less preemption” instead of “[even] more of the operating system should be moved into the hardware”.
I don't know how the challenges with cooperative (non-preemptive) multitasking keep needing to get rediscovered. Even golang, which I consider a very responsibly designed language, went with cooperative scheduling at first, until they were forced to switch to preemptive. I'm not saying cooperative multitasking doesn't have its place, just that it's gotta have a warning sticker or, even better, statically disallow certain types of code from executing.
Also a great time to plug a related post, "What Color Is Your Function?".
I keep having to repeat it on this unfortunate website: it's an implementation detail. A multi-threaded async/await executor can cope with starvation perfectly well, as demonstrated by .NET's implementation, which can shrug off some really badly written code that interleaves blocking calls with asynchronous ones.
You would be surprised. It ultimately regresses to "thread per request but with extra steps". I remember truly atrocious codebases that were spamming task.Result everywhere and yet were performing tolerably even back on .NET Framework 4.6.1. Performance has improved (often literally) ten-fold since then, with the thread pool being rewritten, its hill-climbing algorithm receiving further tuning, and proactive blocked-worker detection being added that can inject threads immediately without going through hill-climbing.
If you expect to color your functions async by default, it's really easy to turn a sync function into a near-zero-cost async one: a Future that has already been resolved at construction time by calling the sync function.
This way, JS / TS becomes pretty comfortable. Except for stacktraces, of course.
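For comparison outside JS (my aside, not the commenter's): Rust's standard library has the same trick in std::future::ready, which builds a future that is already complete at construction, so awaiting it never suspends. A small sketch, with tokio used only to drive it and the lookup functions made up:

    use std::future::{ready, Ready};

    // An ordinary synchronous function.
    fn lookup_sync(key: &str) -> Option<u32> {
        if key == "answer" { Some(42) } else { None }
    }

    // The async-colored wrapper: the future is already resolved when it is
    // constructed, so awaiting it never yields to the executor.
    fn lookup(key: &str) -> Ready<Option<u32>> {
        ready(lookup_sync(key))
    }

    #[tokio::main]
    async fn main() {
        let value = lookup("answer").await;
        println!("{value:?}");
    }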
Fair enough. The context is that years ago, my team was evaluating Go and Node.js as options for a server requiring concurrency, and back then Node.js didn't provide any stacktrace for async callbacks. I know it has been improved since then, but I don't use Node.js.
> Async/await with one thread is simple and well-understood. That's the Javascript model.
Bit of nuance (I'm not an authority on this and I don't know the current up-to-date answer across 'all' of the Javascript runtimes these days):
Isn't it technically "the developer is exposed to the concept of just one thread, without having to worry about order of execution (other than callbacks/that sort of pre-emption)", but under the hood it can actually be using as many threads as it wants (and often does)? It's just "abstracted" away from the user.
I mean, yeah, if you go deep enough your OS may decide to schedule your browser thread to a different core as well. I don’t think it has any relevance here - semantically, it is executed on a single thread, which is very different from multi-threading.
It is executed by your runtime, which may or may not be using a single thread behind the scenes for your code and/or for the underlying I/O and eventing going on underneath, no?
Most JS runtimes are multi-threaded behind the scenes. If you start a node process:
node -e "setTimeout(()=>{}, 10_000)" &
Then wait a second and run:
ps -o thcount $!
Or on macOS:
ps -M $!
You'll see that there are multiple threads running, but like others have said, it's completely opaque to you as a programmer. It's basically just an implementation detail.
If you need true parallelism, of course, you can opt in to Web Workers (called Worker Threads in Node), the Node-specific child_process.fork, or the cluster module.
I'm not sure what you mean by "multi-threaded async/await"... Isn't the article considering async/await as an alternative to threads (i.e. coroutines vs threads)?
I'm a C++ programmer, and still using C++17 at work, so no coroutines, but don't futures provide a similar API? Useful for writing async code in serialized fashion that may be easier (vs threads) to think about and debug.
Of course there are still all the potential pitfalls that you enumerate, so it's no magic bullet for sure, but still a useful style of programming on occasion.
They mean async/await running over multiple OS threads compared to over one OS thread.
You can also have threads that effectively run one at a time (Python, because of the GIL) or genuinely in parallel on multiple OS threads (pretty much everything else).
Every language's concurrency model is determined by both a concurrency interface (callbacks, promises, async/await, threads, etc.) and an implementation (single-threaded, multiple OS threads, multiple OS processes).
async/await tasks can be run in parallel on multiple threads, usually no more threads than there are hardware threads. This allows using the full capabilities of the machine, not just one core's worth. In a server environment with languages that support async/await but don't have the ability to execute on multiple cores like Node.js and Python, this is usually done by spawning many duplicate processes and distributing incoming connections round-robin between them.
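To make the interface-vs-implementation split concrete, a sketch in Rust (assuming tokio with the rt-multi-thread feature): the same async code can be driven by a current-thread runtime or a multi-threaded one, and only the runtime construction changes.

    use tokio::runtime::Builder;

    async fn work() {
        // Spawn a handful of tasks and note which OS thread each runs on.
        let handles: Vec<_> = (0..4)
            .map(|i| {
                tokio::spawn(async move {
                    println!("task {i} on {:?}", std::thread::current().id());
                })
            })
            .collect();
        for h in handles {
            h.await.unwrap();
        }
    }

    fn main() {
        // Same future, two executors: the interface (async/await) is identical,
        // only the implementation underneath changes.
        let single = Builder::new_current_thread().enable_all().build().unwrap();
        single.block_on(work()); // every task reports the same thread id

        let multi = Builder::new_multi_thread()
            .worker_threads(4)
            .enable_all()
            .build()
            .unwrap();
        multi.block_on(work()); // tasks may land on different worker threads
    }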
IIRC glibc default malloc doesn't use per-thread arenas as they would waste too much memory on programs spawning tens of thousands of threads, and glibc can't really make too many workload assumptions. Instead I think it uses a fixed pool of arenas and tries to minimize contention.
These days on Linux, with restartable sequences, you can have true per-CPU arenas with zero contention. Not sure which allocators use them, though.
> As pressure from thread collisions increases, additional arenas are created via mmap to relieve the pressure. The number of arenas is capped at eight times the number of CPUs in the system (unless the user specifies otherwise, see mallopt), which means a heavily threaded application will still see some contention, but the trade-off is that there will be less fragmentation.
The glibc allocator is also notoriously prone to not giving back memory that could be returned to the OS when running several threads over a long time period. Reducing the arena pool size helps a bit, but not enough to keep memory usage stable.
I think that's Kotlin's model. The language doesn't have a default coroutine executor set; you need to provide your own or spawn one with the standard library. It can be thread-pooled, single-threaded, or custom, but there isn't a default one set.
If you use a single-threaded executor, then data races won't happen. If you choose a pooled one, then you obviously should realize there can be races. It's all about the choice you made.
If you are IO bound, consider threads. This is almost the same as async / await.
What was missing above, and the problem with how most computing education is these days: if you are compute-bound, you need to think about processes.
If you were dealing with Python's concurrent.futures, you would need to consider ProcessPoolExecutor vs. ThreadPoolExecutor.
ThreadPoolExecutor gives you the same as the above.
With ProcessPoolExecutor, you will have multiple processes executing independently, but each worker has to copy the memory space, which people don't consider. In Python DS work, multiprocess workloads need to account for those memory-space considerations.
It's kinda f'd up how JS doesn't have engineers think about their workloads.
I think you are coming at this from a particular Python mindset, driven by the limitations imposed on Python threading by the GIL. This is a peculiarity specific to Python rather than a general purpose concept about threads vs processes.
> [...] if you are compute-bound you need to think about processes.
How would that help? Running several processes instead of several threads will not speed anything up [1] and might actually slow you down because of additional inter-process communication overhead.
[1] Unless we are talking about running processes across multiple machines to make use of additional processors.
I think you need to clarify what you mean by "thread". For example, they are different things when we compare Python and Java threads, or OS threads and green threads.
I think the GP was relating to OS threads.
I was also referring to kernel threads. If we are talking about non-kernel threads, then sure, a given implementation might have limitations and there might be something to be gained by running several processes, but that would be a workaround for those limitations. But for kernel threads there will generally be no gain by spreading them across several processes.
Right; a process is just a thread (or set of threads) and some associated resources - like file descriptors and virtual memory allocations. As I understand it, the scheduler doesn’t really care if you’re running 1000 processes with 1 thread each or 1 process with 1000 threads.
But I suspect it’s faster to swap threads within a process than swap processes, because it avoids expensive TLB flushes. And of course, that way there’s no need for IPC.
All things being equal, you should get more performance out of a single process with a lot of threads than a lot of individual processes.
> a process is just a thread (or set of threads) and some associated resources - like file descriptors and virtual memory allocations
Or rather the reverse, in Linux terminology. Only processes exist, some just happen to share the same virtual address space.
> the scheduler doesn’t really care if you’re running 1000 processes with 1 thread each or 1 process with 1000 threads
Not just the scheduler, the whole kernel really. The concept of thread vs process is mainly a userspace detail for Linux. We arbitrarily decided that the set of clone() parameters from fork() create a process, while the set of clone() parameters through pthread_create() create a thread. If you start tweaking the clone() parameters yourself, then the two become indistinguishable.
> it’s faster to swap threads within a process than swap processes, because it avoids expensive TLB flushes
Right, though this is more of a theoretical concern than a practical one. If you are sensitive to a marginal TLB flush, then you may as well use "isolcpus" and set affinities to avoid any context switches at all.
> that way there’s no need for IPC
If you have your processes mmap a shared memory region, you effectively share address space between processes, just like threads share their address space.
For most intents and purposes, really, I do find multiprocessing just better than multithreading. The two are pretty much indistinguishable, but separate processes give you the flexibility of arbitrarily spawning new workers just like any other process, while with multithreading you need to bake in some form of pool manager and hope to get it right.
> The concept of thread vs process is mainly a userspace detail for Linux. We arbitrarily decided that the set of clone() parameters from fork() create a process, while the set of clone() parameters through pthread_create() create a thread. If you start tweaking the clone() parameters yourself, then the two become indistinguishable.
That's from Plan 9. There, you can fork, with various calls, sharing or not sharing code, data, stack, environment variables, and file descriptors.[1] Now that's in Linux. It leads to a model where programs are divided into connected processes with some shared memory. Android does things that way, I think.
> Or rather the reverse, in Linux terminology. Only processes exist, some just happen to share the same virtual address space.
Threads and processes are semantically quite different in standard computer science terminology. A thread has an execution state, i.e. its set of processor register values. A process, on the other hand, is a management and isolation unit for resources like memory and handles.
Tbf, a lot of the time you're running it in a container with 1 vCPU allocated, so the only downside is maybe a little extra memory overhead. And most lambdas, I think, are suited to being single-threaded (imo).
[1] https://users.rust-lang.org/t/mutex-starvation/89080