Was Kay thinking of local code or distributed code? For local code I can see his point. But once we start talking to networks, or even just hard drives, you need scatter-gather semantics. Otherwise you're either running lots of threads, which undercuts his point, or writing your own multitasking state machine, which is just a poor implementation of concurrency where you're responsible for all of the bugs and deadlocks instead of only some of them.
(I should caveat what I'm about to say by noting that I'm primarily concerned with writing robust, highly performant programs, and while I believe that should be a broad focus, in practice it's a niche.)
That's the thing, though: it's arguably even more important for distributed code. If we abstract away the state machine too much, the code becomes difficult to reason about precisely because of that abstraction. The complexity that was explicit in the state machine just resurfaces as confusing behavior in the abstracted version. Lightweight threads, or another high-level abstraction that approximates blocking code, will get a program out the door faster, but at lower quality. Two examples to illustrate the point: first, you mention scatter-gather, but that concept is orthogonal to sync/async; I/O is characteristically async, though, so the underlying mechanisms are async anyway. Second, io_uring is showing that async can be good for performance without being a difficult interface.
Sync code makes things easier to reason about, but only up to a point. I think the big issues with async are that operating systems haven't done a good job allocating responsibility for async interfaces, and that async is fundamentally difficult. The former makes async seem less efficient than it could be; it's true that a sync-over-async interface beats an async-over-sync-over-async one, but we should have the async interfaces accessible in the first place. The latter probably feeds a bias against touching async at all, even where mixing async and sync would be the best blend of performance and programmability.