Async/await with one thread is simple and well-understood. That's the Javascript model. Threads let you get all those CPUs working on the problem, and Rust helps you manage the locking. Plus, you can have threads at different priorities, which may be necessary if you're compute-bound.
Multi-threaded async/await gets ugly. If you have serious compute-bound sections, the model tends to break down, because you're effectively blocking a thread that you share with others.
Compute-bound multi-threaded code does not work as well in Rust as it should. Problems include:
- Futex congestion collapse. This tends to be a problem with some storage allocators. Many threads are hitting the same locks. In particular, growing a buffer can get very expensive in allocators where the recopying takes place with the entire storage allocator locked.
I've mentioned before that Wine's library allocator, in a .DLL that's emulating a Microsoft library, is badly prone to this problem. Performance drops by two orders of magnitude with all the CPU time going into spinlocks. Microsoft's own implementation does not have this problem.
- Starvation of unfair mutexes. Both the standard Mutex and crossbeam-channel channels are unfair. If you have multiple threads locking a resource, doing something, unlocking the resource, and repeating that cycle, one thread will win repeatedly and the others will get locked out.[1]
If you need fair mutexes, there's parking_lot. But you don't get the poisoning safety on thread panic that the standard mutexes give you (see the sketch below).
If you're not I/O bound, this gets much more complicated.
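A minimal sketch of that starvation pattern, in case anyone wants to reproduce it (the thread count, timing, and the parking_lot suggestion are mine, not from the linked thread): four threads hammer the same std::sync::Mutex in a tight lock/bump/unlock loop, and the per-thread counts frequently come out very lopsided because the releasing thread tends to win the re-acquire.

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Duration;

    fn main() {
        let stop = Arc::new(AtomicBool::new(false));
        let counts = Arc::new(Mutex::new(vec![0u64; 4]));

        let handles: Vec<_> = (0..4usize)
            .map(|i| {
                let stop = Arc::clone(&stop);
                let counts = Arc::clone(&counts);
                thread::spawn(move || {
                    while !stop.load(Ordering::Relaxed) {
                        // Lock, do a tiny bit of work, unlock, immediately re-lock.
                        // With an unfair mutex, the thread that just released the
                        // lock often re-acquires it before the others even wake up.
                        counts.lock().unwrap()[i] += 1;
                    }
                })
            })
            .collect();

        thread::sleep(Duration::from_secs(1));
        stop.store(true, Ordering::Relaxed);
        for h in handles {
            h.join().unwrap();
        }
        // Frequently very uneven. Swapping in parking_lot's fair mutex (FIFO
        // handoff, but no poisoning) evens the counts out at some throughput cost.
        println!("{:?}", *counts.lock().unwrap());
    }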
I've mostly dealt only with IO-bound computations, but the contention issues arise there as well. What's the point of having a million coroutines when IO throughput is the bound anyway? How will coroutines save me when I immediately exhaust my size-10 DB connection pool? They won't; they just make debugging and working around the issues harder, and the whole thing more difficult to reason about.
The async/await model in particular tends to end up with micro-tasks hanging in some random place in the code, and they're very hard to trace back to the cause because they can be dispersed practically anywhere.
The degree of concurrency is also rather undefined, as are priorities most of the time.
This can be partly fixed by labelling, which adds more complexity but is at least explicit. Then the programmer needs to know what to label... which they won't, and Rust has no training wheels to help with concurrency.
With threads, you have well-defined ingress and egress. Priorities are handled by the OS, and some degree of fairness is usually ensured.
> What's the point of having a million coroutines when the IO throughput is bounded again?
Because if you did the same thing with a million threads, you’d have all those same problems and more, because (a) threads take up more RAM, (b) threads are more expensive to spawn and tear down, and (c) threads have more registers to save/restore when context switching than coroutines.
The real answer is that you shouldn't have a million coroutines, and you shouldn't have a million threads. Doing such a thing is only useful if you have other useful work you can do while you wait on I/O. This is true for web servers (which want to serve other requests, so… maybe one coroutine per active request) and UIs (which should have a dedicated thread for the UI event loop), but for other applications there's shockingly little need to actually do async programming in the first place. Parallelism is a well-understood problem and can be used in key places (i.e. doing a lot of math that you can easily split across multiple cores), but concurrency is IMO a net negative for most applications that don't look like a web server.
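On the "one coroutine per active request" point and the size-10 DB pool complaint upthread: the usual fix is to bound in-flight work explicitly instead of spawning unbounded coroutines. A rough sketch assuming tokio; handle_request and the numbers are made up:

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Semaphore;

    // Hypothetical stand-in for "talk to the database"; a real handler would
    // borrow a connection from whatever client owns the size-10 pool.
    async fn handle_request(id: u32) {
        tokio::time::sleep(Duration::from_millis(50)).await;
        println!("finished request {id}");
    }

    #[tokio::main]
    async fn main() {
        // Cap in-flight work at the size of the downstream resource instead of
        // letting a million coroutines pile up behind it.
        let permits = Arc::new(Semaphore::new(10));
        let mut tasks = Vec::new();

        for id in 0..100u32 {
            let permits = Arc::clone(&permits);
            tasks.push(tokio::spawn(async move {
                let _permit = permits.acquire().await.expect("semaphore closed");
                handle_request(id).await;
                // The permit drops here, letting the next request proceed.
            }));
        }

        for t in tasks {
            t.await.expect("task panicked");
        }
    }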
Just morning bathroom musings based on your posts (yep /g), and this got me thinking that the robust solution (once and for all, for all languages) may require a rethink at the hardware level. The CPU-bound issue comes down to systemic interrupt/resume, I think; if this can be done fairly for n in-progress threads of execution with efficient queued context swaps (say, a CPU with n in-flight contexts), then the problem becomes a resource allocation issue. Your thoughts?
That's what "hyper-threading" is. Beyond 2 hyper-threads, enough hardware has to be duplicated that it seems to be more effective to just add another CPU. If anybody ever built a 4-hyperthread CPU, it didn't become a major product.
It's been tried a few times in the past, back when CPUs were slow relative to memory. There was a National Semiconductor microprocessor where the state of the CPU was stored in main memory, and, by changing one register, control switched to another thread. Going way back, the CDC 6600, which was said to have 10 peripheral processors for I/O, really had only one, with ten copies of the state hardware.
Today, memory is more of a bottleneck than the CPU, so this is not a win.
The UltraSPARC T1 had 4-way SMT, and its successors bumped that to 8-way. Modern GPU compute is also highly based on hardware multi-threading as a way of compensating for memory latency, while also having wide execution units that can extract fine-grained parallelism within individual threads.
Missed that. That's part of IBM mainframe technology, where you can have "logical partitions", a cluster on a chip, and assign various resources to each. IBM POWER10 apparently allows up to 8-way hyperthreading if configured that way.
(This has been a very low priority background thread in my head this morning so cut me some slack on hand waving.)
Historically, the H/W folks addressed memory-related architectural changes, such as when multicore came around and we got multi-level caches. Imagine if we had to deal at the software level with memory coherence across cores [down to the fundamental level of invalidating Lx bytes]. There would be NUMA-like libraries and various hacks to make it happen.
Arguably you could say "all that is in principle the OS's responsibility, even memory coherence across cores" and be done with it. Or you could agree that "thank God the H/W people took care of this" and ask whether they can do the same for processing.
The CPU model afaik hasn't changed that much in terms of the granularity of execution steps, whereas the H/W people could realize that, d'oh, a defined execution granularity combined with a hot context-switching mechanism could really help the poor unwashed coders efficiently execute multiple competing sequences of code (which is all the hardware knows about anyway).
If your CPU's architecture specs n±e clock ticks per context iteration, then you compile for that and design languages around that. Being CPU-bound then just means heavy CPU usage, not a disaster for every other process sharing the machine with you. It becomes a matter of provisioning instead of ad-hoc provisioning in the program.
If our implementations are bad because of preemption, then I’m not sure why the natural conclusion isn’t “maybe there should be less preemption” instead of “[even] more of the operating system should be moved into the hardware”.
I don't know how the challenges with cooperative (non-preemptive) multitasking keep needing to get rediscovered. Even golang, which I consider a very responsibly designed language, went with cooperative scheduling at first, until they were forced to switch to preemptive. I'm not saying cooperative multitasking doesn't have its place, just that it's gotta have a warning sticker or, even better, statically disallow certain types of code from executing.
Also a great time to plug a related post, "What Color Is Your Function?".
I keep having to repeat it on this unfortunate website: it's an implementation detail. A multi-threaded async/await executor can cope with starvation perfectly well, as demonstrated by .NET's implementation, which can shrug off some really badly written code that interleaves blocking calls with asynchronous ones.
You would be surprised. It ultimately regresses to "thread per request but with extra steps". I remember truly atrocious codebases that were spamming task.Result everywhere and yet were performing tolerably even back on .NET Framework 4.6.1. Performance has improved (often literally) ten-fold since then, with the thread pool being rewritten, its hill-climbing algorithm receiving further tuning, and proactive blocked-worker detection being added that can inject threads immediately without going through hill-climbing.
If you expect to color your functions async by default, it's really easy to turn a sync function into a near-zero-cost async one: a Future that has already been resolved at construction time by calling the sync function.
This way, JS / TS becomes pretty comfortable. Except for stacktraces, of course.
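For comparison outside JS (my aside, not the commenter's): Rust's standard library has the same trick in std::future::ready, which builds a future that is already complete at construction, so awaiting it never suspends. A small sketch, with tokio used only to drive it and the lookup functions made up:

    use std::future::{ready, Ready};

    // An ordinary synchronous function.
    fn lookup_sync(key: &str) -> Option<u32> {
        if key == "answer" { Some(42) } else { None }
    }

    // The async-colored wrapper: the future is already resolved when it is
    // constructed, so awaiting it never yields to the executor.
    fn lookup(key: &str) -> Ready<Option<u32>> {
        ready(lookup_sync(key))
    }

    #[tokio::main]
    async fn main() {
        let value = lookup("answer").await;
        println!("{value:?}");
    }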
Fair enough. The context is that years ago, my team was evaluating Go and Node.js as options for a server requiring concurrency, and back then Node.js didn't provide any stacktrace for async callbacks. I know it has been improved since then, but I don't use Node.js.
> Async/await with one thread is simple and well-understood. That's the Javascript model.
Bit of nuance (I'm not an authority on this and I don't know the current up-to-date answer across 'all' of the Javascript runtimes these days):
Isn't it technically "the developer is exposed to the concept of just one thread, without having to worry about order of execution (other than callbacks/that sort of pre-emption)", but under the hood it can actually be using as many threads as it wants (and often does)? It's just "abstracted" away from the user.
I mean, yeah, if you go deep enough your OS may decide to schedule your browser thread to a different core as well. I don’t think it has any relevance here - semantically, it is executed on a single thread, which is very different from multi-threading.
It is executed by your runtime, which may or may not be using a single thread behind the scenes for your code and/or for the underlying I/O and eventing going on underneath, no?
Most JS runtimes are multi-threaded behind the scenes. If you start a node process:
node -e "setTimeout(()=>{}, 10_000)" &
Then wait a second and run:
ps -o thcount $!
Or on macOS:
ps -M $!
You'll see that there are multiple threads running, but like others have said, it's completely opaque to you as a programmer. It's basically just an implementation detail.
If you need true parallelism, of course, you can opt in to Web Workers (called Worker Threads in Node), the Node-specific child_process.fork, or the cluster module.
I'm not sure what you mean by "multi-threaded async/await"... Isn't the article considering async/await as an alternative to threads (i.e. coroutines vs threads)?
I'm a C++ programmer, and still using C++17 at work, so no coroutines, but don't futures provide a similar API? Useful for writing async code in serialized fashion that may be easier (vs threads) to think about and debug.
Of course there are still all the potential pitfalls that you enumerate, so it's no magic bullet for sure, but still a useful style of programming on occasion.
They mean async/await running over multiple OS threads compared to over one OS thread.
You can also have threads that effectively run one at a time (Python, because of the GIL) or genuinely in parallel on multiple OS threads (pretty much everything else).
Every language's concurrency model is determined by both a concurrency interface (callbacks, promises, async/await, threads, etc.) and an implementation (single-threaded, multiple OS threads, multiple OS processes).
async/await tasks can be run in parallel on multiple threads, usually no more threads than there are hardware threads. This allows using the full capabilities of the machine, not just one core's worth. In a server environment with languages that support async/await but don't have the ability to execute on multiple cores like Node.js and Python, this is usually done by spawning many duplicate processes and distributing incoming connections round-robin between them.
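To make the interface-vs-implementation split concrete, a sketch in Rust (assuming tokio with the rt-multi-thread feature): the same async code can be driven by a current-thread runtime or a multi-threaded one, and only the runtime construction changes.

    use tokio::runtime::Builder;

    async fn work() {
        // Spawn a handful of tasks and note which OS thread each runs on.
        let handles: Vec<_> = (0..4)
            .map(|i| {
                tokio::spawn(async move {
                    println!("task {i} on {:?}", std::thread::current().id());
                })
            })
            .collect();
        for h in handles {
            h.await.unwrap();
        }
    }

    fn main() {
        // Same future, two executors: the interface (async/await) is identical,
        // only the implementation underneath changes.
        let single = Builder::new_current_thread().enable_all().build().unwrap();
        single.block_on(work()); // every task reports the same thread id

        let multi = Builder::new_multi_thread()
            .worker_threads(4)
            .enable_all()
            .build()
            .unwrap();
        multi.block_on(work()); // tasks may land on different worker threads
    }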
IIRC glibc default malloc doesn't use per-thread arenas as they would waste too much memory on programs spawning tens of thousands of threads, and glibc can't really make too many workload assumptions. Instead I think it uses a fixed pool of arenas and tries to minimize contention.
These days on Linux, with restartable sequences, you can have true per-CPU arenas with zero contention. Not sure which allocators use them, though.
> As pressure from thread collisions increases, additional arenas are created via mmap to relieve the pressure. The number of arenas is capped at eight times the number of CPUs in the system (unless the user specifies otherwise, see mallopt), which means a heavily threaded application will still see some contention, but the trade-off is that there will be less fragmentation.
The glibc allocator is also notoriously prone to not giving back memory that could be returned to the OS when running several threads over a long time period. Reducing the arena pool size helps a bit, but not enough to keep memory usage stable.
I think that's Kotlin's model. The language doesn't have a default coroutine executor set; you need to provide your own or spawn one with the standard library. It can be thread-pooled, single-threaded, or custom, but there isn't a default one set.
If you use a single-threaded executor, then data races won't happen. If you choose a pooled one, then you obviously should realize there can be races. It's all about the choice you made.
If you are IO bound, consider threads. This is almost the same as async / await.
What was missing above, and the problem with how most computing education is these days: if you are compute-bound, you need to think about processes.
If you were dealing with Python's concurrent.futures, you would need to consider ProcessPoolExecutor vs. ThreadPoolExecutor.
ThreadPoolExecutor gives you the same as the above.
With ProcessPoolExecutor, you will have multiple processes executing independently, but each worker has to copy the memory space, which people don't consider. In Python DS work, multiprocess workloads need to account for those memory-space considerations.
It's kinda f'd up how JS doesn't have engineers think about their workloads.
I think you are coming at this from a particular Python mindset, driven by the limitations imposed on Python threading by the GIL. This is a peculiarity specific to Python rather than a general purpose concept about threads vs processes.
> [...] if you are compute-bound you need to think about processes.
How would that help? Running several processes instead of several threads will not speed anything up [1] and might actually slow you down because of additional inter-process communication overhead.
[1] Unless we are talking about running processes across multiple machines to make use of additional processors.
I think you need to clarify what you mean by "thread". For example, they are different things when we compare Python and Java threads, or OS threads and green threads.
I think the GP was relating to OS threads.
I was also referring to kernel threads. If we are talking about non-kernel threads, then sure, a given implementation might have limitations and there might be something to be gained by running several processes, but that would be a workaround for those limitations. But for kernel threads there will generally be no gain by spreading them across several processes.
Right; a process is just a thread (or set of threads) and some associated resources - like file descriptors and virtual memory allocations. As I understand it, the scheduler doesn’t really care if you’re running 1000 processes with 1 thread each or 1 process with 1000 threads.
But I suspect it’s faster to swap threads within a process than swap processes, because it avoids expensive TLB flushes. And of course, that way there’s no need for IPC.
All things being equal, you should get more performance out of a single process with a lot of threads than a lot of individual processes.
> a process is just a thread (or set of threads) and some associated resources - like file descriptors and virtual memory allocations
Or rather the reverse, in Linux terminology. Only processes exist, some just happen to share the same virtual address space.
> the scheduler doesn’t really care if you’re running 1000 processes with 1 thread each or 1 process with 1000 threads
Not just the scheduler, the whole kernel really. The concept of thread vs process is mainly a userspace detail for Linux. We arbitrarily decided that the set of clone() parameters from fork() create a process, while the set of clone() parameters through pthread_create() create a thread. If you start tweaking the clone() parameters yourself, then the two become indistinguishable.
> it’s faster to swap threads within a process than swap processes, because it avoids expensive TLB flushes
Right, though this is more of a theoretical concern than a practical one. If you are sensitive to a marginal TLB flush, then you may as well use "isolcpus" and set affinities to avoid any context switches at all.
> that way there’s no need for IPC
If you have your processes mmap a shared memory region, you effectively share address space between processes, just like threads share their address space.
For most intents and purposes, really, I do find multiprocessing just better than multithreading. The two are pretty much indistinguishable, but separate processes give you the flexibility of arbitrarily spawning new workers just like any other process, while with multithreading you need to bake in some form of pool manager and hope to get it right.
> The concept of thread vs process is mainly a userspace detail for Linux. We arbitrarily decided that the set of clone() parameters from fork() create a process, while the set of clone() parameters through pthread_create() create a thread. If you start tweaking the clone() parameters yourself, then the two become indistinguishable.
That's from Plan 9. There, you can fork, with various calls, sharing or not sharing code, data, stack, environment variables, and file descriptors.[1] Now that's in Linux. It leads to a model where programs are divided into connected processes with some shared memory. Android does things that way, I think.
> Or rather the reverse, in Linux terminology. Only processes exist, some just happen to share the same virtual address space.
Threads and processes are semantically quite different in standard computer science terminology. A thread has an execution state, i.e. its set of processor register values. A process, on the other hand, is a management and isolation unit for resources like memory and handles.
Tbf, a lot of the time you're running it in a container with 1 vCPU allocated, so the only downside is maybe a little extra memory overhead. And most lambdas, I think, are suited to being single-threaded (imo).
[1] https://users.rust-lang.org/t/mutex-starvation/89080