Not here. You don't trade off latency, because you cannot reduce it below a cache miss per context switch anyway, unless your working set is tiny. The point is that if you have lots of tasks, your latency has a lower bound (due to hardware limitations) regardless of the design.
In other words, if your server serves an amount of data that is larger than the CPU cache and is accessed at random, there is some latency you have to pay, and many micro-optimisations are simply ineffective even if you want the lowest latency possible. Incurring a cache miss plus allocating memory (if your allocator is fast) plus even copying some data around isn't significantly slower than incurring the cache miss alone. Those costs matter only when you don't incur a cache miss, and that happens when you have a very small number of tasks whose data fits in the cache (i.e. a generator use case, not so much a server use case).
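To make that concrete, here's a rough sketch in C (not a rigorous benchmark; the buffer sizes, the 64-byte node layout, and the malloc/memcpy per step are all illustrative assumptions). It chases pointers at random through a working set far larger than the cache, once bare and once with a small allocation and copy added per hop; on typical hardware the per-step times come out much closer than the extra work would suggest, because the miss dominates both loops. Compile with -O2 on a POSIX system:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NODES ((size_t)1 << 24)   /* 16M 64-byte nodes = 1 GiB, far larger than cache */
    #define STEPS ((long)1 << 22)

    typedef struct node { struct node *next; char pad[56]; } node; /* one cache line */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        node *pool = malloc(sizeof(node) * NODES);
        size_t *perm = malloc(sizeof(size_t) * NODES);
        if (!pool || !perm) return 1;

        /* Build one random cycle through all nodes so every hop is an
           unpredictable cache miss. rand() is crude but fine for illustration. */
        for (size_t i = 0; i < NODES; i++) perm[i] = i;
        for (size_t i = NODES - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < NODES; i++)
            pool[perm[i]].next = &pool[perm[(i + 1) % NODES]];
        free(perm);

        /* Loop 1: just chase pointers -- pay the cache miss and nothing else. */
        node *p = &pool[0];
        double t0 = now_sec();
        for (long i = 0; i < STEPS; i++) p = p->next;
        double bare = now_sec() - t0;

        /* Loop 2: same walk, plus a small allocation and copy per step.
           The freshly freed block stays hot, so the extra work is cheap
           next to the miss we pay either way. */
        p = &pool[0];
        t0 = now_sec();
        for (long i = 0; i < STEPS; i++) {
            char *buf = malloc(64);
            memcpy(buf, p->pad, 56);   /* touches the line we already missed on */
            free(buf);
            p = p->next;
        }
        double extra = now_sec() - t0;

        printf("miss only: %.0f ns/step, miss+alloc+copy: %.0f ns/step (p=%p)\n",
               bare * 1e9 / STEPS, extra * 1e9 / STEPS, (void *)p);
        return 0;
    }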
Put yet another way, some considerations only matter when the workload doesn't incur many cache misses, but a server workload virtually always incurs a cache miss when serving a new request, even in servers that care mostly about latency. In general, in servers you're working in the microsecond range anyway, so optimisations that operate at the nanosecond range are not useful.
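As a back-of-envelope illustration (the figures are typical ballpark numbers, not measurements): with main memory costing roughly 100 ns per random access, a request that touches, say, 20 cache lines scattered across a large heap pays

    20 lines x ~100 ns/miss = ~2,000 ns = ~2 us

before doing any computation at all, so an optimisation that shaves 5 ns off a context switch or an allocation changes the total by well under one percent.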