I worked with a fairly large C++ code base that leveraged threads to avoid callbacks. This let the engineers use the more familiar linear programming style: a blocking network call would stall one thread while the others were free to proceed as responses came in. The threads interoperated with each other using a similar blocking RPC mechanism.
But what ended up happening was that the system blocked unnecessarily across all threads, particularly in cross-thread communication. Many external network calls were performed sequentially when they could have been performed in parallel, and cross-thread calls likewise blocked constantly because they too were issued one after another. The end result was a system whose performance was fundamentally hobbled by threads constantly waiting on other threads or on chains of external calls, and it was very difficult to untangle the Gordian knot of blocking behavior causing these issues.
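To make the sequential-versus-parallel point concrete, here is a minimal sketch using Python's asyncio rather than the original C++ system; `fetch` and the service names are hypothetical stand-ins for the external network calls:

    import asyncio

    async def fetch(service: str) -> str:
        # Stand-in for an external network call; pretend each takes ~1 second.
        await asyncio.sleep(1)
        return f"response from {service}"

    async def sequential() -> list[str]:
        # Each await blocks until the previous call finishes:
        # three 1-second calls take ~3 seconds total.
        return [await fetch("users"), await fetch("orders"), await fetch("billing")]

    async def parallel() -> list[str]:
        # All three calls are in flight at once on a single thread:
        # total time is roughly the slowest call, ~1 second.
        return list(await asyncio.gather(
            fetch("users"), fetch("orders"), fetch("billing")))

    if __name__ == "__main__":
        print(asyncio.run(parallel()))

The blocking-thread design described above behaves like `sequential` everywhere unless someone deliberately spins up and coordinates extra threads; the async style makes the `parallel` shape the cheap, natural one.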
The main problem with threads is that the moment you introduce another CPU into the mix, you need synchronization primitives around your shared state. In our case that complexity pushed engineers to use fewer threads than the workload warranted, just to keep the synchronization work manageable, which in turn threw away the advantage of asynchronous parallelism. Or at least that is what happened in this particular case.
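As a small illustration of the kind of synchronization that creeps in (a sketch, not code from the system described; the shared cache is hypothetical):

    import threading

    class SharedCache:
        """Once two threads can touch this, every access needs the lock."""
        def __init__(self) -> None:
            self._lock = threading.Lock()
            self._data: dict[str, str] = {}

        def put(self, key: str, value: str) -> None:
            with self._lock:
                self._data[key] = value

        def get(self, key: str) -> str | None:
            with self._lock:
                return self._data.get(key)

In a single-threaded event-loop design, the lock, and the reasoning about who holds it and when, simply disappears.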
The engineering cost of synchronization under async/await is essentially zero, because the concurrency happens on a single thread. Since the CPU work involved in async I/O is relatively small, this argues for single-threaded, callback-style (or async/await) designs: you maximize the amount of I/O in flight, reduce the opportunities for blocking, and keep thread-synchronization complexity to a minimum. And in cases where you do want to use as many CPUs as possible, you can often get better CPU parallelism by simply forking your process across multiple cores.
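A sketch of the fork-per-core approach, again in Python rather than C++; the request handler is hypothetical:

    import multiprocessing as mp
    import os

    def handle_requests(worker_id: int) -> None:
        # Each worker would run its own single-threaded event loop; no shared
        # mutable state between workers, so no locks. Incoming work is
        # partitioned by the OS (e.g. a shared listening socket) or by a
        # front-end router rather than by cross-thread handoffs.
        print(f"worker {worker_id} running in pid {os.getpid()}")

    if __name__ == "__main__":
        workers = [mp.Process(target=handle_requests, args=(i,))
                   for i in range(os.cpu_count() or 1)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()

You get one event loop per core, all the I/O parallelism of the async model inside each process, and none of the cross-thread locking that hobbled the original system.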