CPU limitations relative to IO became a much bigger issue again with 10Gbps+ networking and NVMe. But even setting that aside, there is the entirely separate issue of IPC, and of anything else that causes the userspace/kernel boundary to be crossed at a high rate relative to the time spent on external IO; that can be absolutely brutal for throughput.
The biggest speedups I've made on systems I've worked on over the last 20 years have usually come from spotting patterns people don't think about that cause unnecessary context switches, because context switches are far more expensive than people realize.
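To get a rough feel for just the user/kernel crossing (not even a full context switch to another process, which costs considerably more), a quick micro-benchmark along these lines can help. This is only a sketch: SYS_getpid is picked as a near-no-op syscall, and the numbers vary a lot by CPU and mitigation settings.

    /* Rough micro-benchmark: cost of a user/kernel crossing vs. a plain
     * function call. Build with: cc -O2 syscall_cost.c */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define N 1000000

    static long now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000L + ts.tv_nsec;
    }

    __attribute__((noinline)) static long plain_call(long x) { return x + 1; }

    int main(void) {
        long t0, acc = 0;

        t0 = now_ns();
        for (int i = 0; i < N; i++)
            acc += syscall(SYS_getpid);   /* real syscall every iteration, no vDSO or libc caching */
        printf("syscall:       %ld ns/iter\n", (now_ns() - t0) / N);

        t0 = now_ns();
        for (int i = 0; i < N; i++)
            acc += plain_call(i);         /* baseline: an ordinary function call */
        printf("function call: %ld ns/iter (acc=%ld)\n", (now_ns() - t0) / N, acc);
        return 0;
    }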
Now, that is not going to be fixed by "just" JIT'ing the write syscall, but you have to start somewhere, and I think the more promising approach would be to look at which parts of the kernel code you could actually JIT into user space.
E.g. you obviously can't elide any security-related checks, but consider the case where the kernel knows you're writing to a pipe that only a single other process reads from. Could you turn the write()/read() pairs into a buffer mapped directly into both processes, replace the read() syscalls with purely userspace reads whenever data is available, and only require a context switch when one side outpaces the other (e.g. using futexes when the buffer is potentially full or empty)?
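As a rough illustration of the shape this could take (not how any existing kernel does it), here is a minimal single-producer/single-consumer sketch, assuming a MAP_SHARED region visible to both processes. The ring layout, the nonempty flag and the helper names are made up for the example; setup, error handling and the full-buffer path on the writer side are omitted.

    /* Sketch: SPSC ring buffer in shared memory. The hot path is pure
     * userspace loads/stores; the futex syscall only happens when the
     * reader has to sleep or the writer has to wake it. */
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define RING_SIZE 65536   /* power of two */

    struct ring {
        _Atomic uint32_t head;     /* written by producer */
        _Atomic uint32_t tail;     /* written by consumer */
        _Atomic uint32_t nonempty; /* futex word the consumer sleeps on */
        char data[RING_SIZE];
    };

    static long futex(_Atomic uint32_t *addr, int op, uint32_t val) {
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
    }

    /* Producer: copy into the ring; syscall only if the reader may be asleep.
     * Assumes the writer never gets more than RING_SIZE ahead (full-buffer
     * handling would mirror the reader side). */
    static void ring_write(struct ring *r, const char *buf, uint32_t len) {
        uint32_t head = atomic_load(&r->head);
        for (uint32_t i = 0; i < len; i++)
            r->data[(head + i) % RING_SIZE] = buf[i];
        atomic_store(&r->head, head + len);
        if (atomic_exchange(&r->nonempty, 1) == 0)
            futex(&r->nonempty, FUTEX_WAKE, 1);   /* wake a sleeping reader */
    }

    /* Consumer: stay in userspace while there is data; fall back to the
     * futex syscall only when the ring is observed empty. */
    static uint32_t ring_read(struct ring *r, char *buf, uint32_t max) {
        uint32_t tail = atomic_load(&r->tail);
        uint32_t head = atomic_load(&r->head);

        while (head == tail) {
            atomic_store(&r->nonempty, 0);        /* announce: reader may sleep */
            head = atomic_load(&r->head);         /* re-check to avoid a lost wakeup */
            if (head != tail)
                break;
            futex(&r->nonempty, FUTEX_WAIT, 0);   /* sleep only while still empty */
            head = atomic_load(&r->head);
        }

        uint32_t avail = head - tail;
        if (avail > max) avail = max;
        for (uint32_t i = 0; i < avail; i++)
            buf[i] = r->data[(tail + i) % RING_SIZE];
        atomic_store(&r->tail, tail + avail);
        return avail;
    }

The point being that once the mapping exists, the common case is just loads and stores on the shared buffer, and the syscall only reappears when one side actually has to sleep or be woken.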
Another case is things like user-space networking. If the user has sufficient privilege to write directly to the network anyway, why not make their networking code run dynamically in user space when it makes sense?
Yes, it won't be free, but the point of targeting IO-intensive code for this kind of thing is that a lot of the code paths are quite simple yet get executed over and over again at high frequency, so you can in fact often afford to do a lot of work even for relatively modest speedups. Even more so in modern systems, where distributing the work over enough cores is often a challenge - any time you have a core idling, you have time to spend JIT'ing existing code paths with more expensive optimizations.
Any must-reads for getting into kernel optimization? I'd like to be able to diagnose performance issues better, and you seem to know the flow very well. Any advice on material to read or places to hang around would be appreciated!
Specifically kernel optimization isn't really my area. I've mostly approached it from the userspace side in that I've worked a lot on optimizing apps that often turn out to abuse syscalls.
From that side, I'd say looking at tracing tools like systrace is worthwhile. But also just plain strace (system calls) and ltrace (dynamic library calls). "strace -c -p [pid]", wait a bit and then ctrl-c, or just "strace -p [pid]" for a running process on Linux (or drop the "-p" and add the command you want to trace) is hugely illuminating and will often tell you a lot about the code you're running.
You can usually do the same with "proper" profilers (e.g. gprof), but strace has the benefit of focusing specifically on the syscall boundary, and of being very easy to pull out. You can use it both for profiling (with -c) and to see the flow of syscalls (without -c). The latter is useful for code that gives opaque error messages, or otherwise to find out e.g. which files an application tries to open, and is invaluable for troubleshooting.
Specifically for performance issues, there is one thing that is very valuable to remember: syscalls are expensive. Very expensive. Code that does lots of small read() calls, for example, is almost certainly doing something stupid (like not doing userspace buffering). But in general, a brief run of strace -c, followed by figuring out why the top calls occur so often and/or take so long, very often leads to obvious fixes.
E.g. a far too common pattern, though it's getting better, is read()'ing a length indicator in an incoming packet of data, followed by a second read() to read the contents, instead of using a non-blocking socket and doing larger reads into a buffer. When the message flow is slow, it doesn't matter. When it's fast, the difference in throughput can be massive.
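For illustration, here's a sketch of the buffered alternative, assuming a 4-byte big-endian length prefix; handle_message(), the buffer size and the single-fd static state are placeholders for the example.

    /* Sketch: one large read() per wakeup, then parse as many complete
     * [4-byte length][payload] messages as the buffer holds, instead of
     * two small read()s per message. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>   /* ntohl */

    #define BUF_SIZE (64 * 1024)

    void handle_message(const char *payload, uint32_t len);  /* app-specific */

    int read_messages(int fd) {
        static char buf[BUF_SIZE];
        static size_t used = 0;

        /* One syscall pulls in as many messages as the kernel has buffered. */
        ssize_t n = read(fd, buf + used, sizeof(buf) - used);
        if (n <= 0)
            return -1;        /* EOF, error, or EAGAIN on a non-blocking socket */
        used += (size_t)n;

        size_t off = 0;
        while (used - off >= 4) {
            uint32_t len;
            memcpy(&len, buf + off, 4);
            len = ntohl(len);
            if (len > sizeof(buf) - 4)
                return -1;                   /* oversized/corrupt message */
            if (used - off < 4 + len)
                break;                       /* partial message: wait for more data */
            handle_message(buf + off + 4, len);
            off += 4 + len;
        }
        /* keep any trailing partial message for the next call */
        memmove(buf, buf + off, used - off);
        used -= off;
        return 0;
    }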
As an additional point, even redundant security checks will be plenty fast as long as the data being checked is mostly unchanging and colocated with the code (e.g. struct task_struct in Linux).
JIT'ing the write syscall is not even close to what can be achieved when data gets copied on average 5 times before it reaches a driver. Mailbox-style IPC everywhere would help with that immensely, but devices tend to skip DMA and require sleeps or slow port writes to do anything, due to hardware issues.
It's worth at least exploring.