Honestly, seems like a big waste of time to me. Generating the code wouldn't be free, so you pay for that. IO, generally, is fairly expensive, so you are now weighing the costs of IO against the savings in instructions/reduced branches. So where are the savings? OS level mutexes?
Basically all modern CPUs are going to do a really good job at optimizing away most of the checks cooked into the code. Modern CPUs are really good at seeing "99% of the time, you take this branch" and "these 8 instructions can be run at the same time".
What they can't optimize away is accessing data off CPU. The speed of light is a killer here. In the best case, you are talking about ~1000 cycles if you hit main memory and way more if you talk to the disk (even an SSD). You can run down a ton of branches and run a bunch of code in the period of time you sit around waiting for data to arrive.
This wasn't true in 1992 when CPUs ran in the MHz range. But we are long past that phase.
About the only place this would seem to be applicable is in embedded systems looking to squeeze out performance while decreasing power usage. Otherwise, it feels like a waste for anything with a CPU at least as capable as a Raspberry Pi's.
I read the thesis, and it is incredible. Massalin developed Synthesis OS to run on Sun machines (using a 68030 if I'm not mistaken), and was able to run Sun OS executables on her OS, faster than Sun OS could run its own executables. Part of that was that the specialization kept code and data close together, keeping cache misses down. I just wish the code to Synthesis was available for study.
The partial specialization here is all about optimizing data off of the CPU. Your drivers and the protocol stacks above them carry tons of data to make the drivers and stacks more generic and configurable. By JITing out optimized code you keep the generic source and get optimized hot paths, by embedding some of those constants in the instruction stream and simply not reading others. This also helps the I$, skipping over all those if(bug that doesn't apply to my hardware) sections. But I think the biggest win is defining the user/kernel boundary in ways that are specific to your application. We've started to see some of that with neat bpf uses.
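To make that concrete, here's a rough C sketch with a made-up struct nic and made-up function names (nothing from Synthesis or any real driver), showing the difference between a generic transmit path and what a specializer could emit once one particular device's configuration is known:

    #include <stddef.h>

    /* Hypothetical generic NIC description, purely for illustration. */
    struct nic_ops { int (*xmit)(void *hw, const void *buf, size_t len); };
    struct nic {
        int            quirk_needs_flush;  /* workaround flag for buggy hardware revisions */
        size_t         mtu;                /* configured at runtime */
        struct nic_ops *ops;
        void           *hw;
    };

    static void flush_fifo(void *hw) { (void)hw; }   /* stand-in for a quirk workaround */

    /* Generic path: every call re-reads configuration and re-checks quirks. */
    int tx_generic(struct nic *dev, const void *buf, size_t len)
    {
        if (dev->quirk_needs_flush)               /* the "if (bug that doesn't apply)" branch */
            flush_fifo(dev->hw);
        if (dev->mtu && len > dev->mtu)
            return -1;
        return dev->ops->xmit(dev->hw, buf, len); /* indirect call, extra cache lines touched */
    }

    /* What specialization could emit for one known, configured device: the MTU is
     * an immediate in the instruction stream, the dead quirk branch is gone, and
     * the indirect call becomes a direct one. */
    static int my_nic_xmit(void *hw, const void *buf, size_t len)
    {
        (void)hw; (void)buf;                      /* stub so the sketch stands alone */
        return (int)len;
    }

    int tx_specialized(struct nic *dev, const void *buf, size_t len)
    {
        if (len > 1500)                           /* MTU constant folded into the code */
            return -1;
        return my_nic_xmit(dev->hw, buf, len);    /* direct call, no ops table read */
    }

The generic version touches the quirk flag, the MTU field and the ops table on every call; the specialized one reads none of that data, which is exactly the "stay on the CPU" win being described.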
> The speed of light is a killer here. In the best case, you are talking about ~1000 cycles if you hit main memory
Nitpick: this isn't really a speed-of-light issue. The time taken for an electrical signal to make a round-trip between CPU and RAM on the same motherboard is closer to 10 clock cycles than 1000. Any latency beyond that is a result of delays within the CPU or memory modules themselves, not the time spent traveling between them.
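Back-of-the-envelope, assuming roughly 10 cm of trace between the CPU and a DIMM and a 3 GHz clock:

    signal speed in a PCB trace  ~ 0.5-0.7 c  ~ 1.5-2e8 m/s
    round-trip distance          ~ 0.2 m
    round-trip time              ~ 0.2 / 1.8e8 ~ 1.1 ns
    at 3 GHz                     ~ 3-4 cycles

The rest of the roughly 60-100 ns a main-memory access takes is row activation, CAS latency, and queuing in the memory controller, not flight time.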
CPU limitations relative to IO became a much bigger issue again with 10Gbps+ networking and NVMe. But even ignoring that, there is the entirely separate issue of IPC, and of anything else that causes frequent traversals of the userspace/kernel boundary relative to the time spent on external IO; that can be absolutely brutal on throughput.
The biggest speedups I've made on systems I've worked on over the last 20 years have usually come down to spotting patterns people don't think about that cause unnecessary context switches, because context switches are far more expensive than most people realize.
Now, that is not going to be fixed by "just" JIT'ing the write syscall, but you have to start somewhere, and I think the more promising approach would be to look at which parts of the kernel code you could actually JIT into user-space.
E.g. you obviously can't elide anything that checks for security against userspace, but consider when the kernel knows that you're writing to a pipe that only a single other process is reading from. Could you turn the write()/read() pairs into a buffer mapped directly into the reading process, replace the read() syscalls with working purely on the userspace buffer when there is data there, and only require a context switch when one side outpaces the other (e.g. using futexes when the buffer is potentially full or empty)?
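As a userspace-only sketch of what that could look like (single writer, single reader, Linux futexes via syscall(2); a hypothetical layout, not how any kernel actually does it, and the struct would live in a page mapped into both processes):

    #define _GNU_SOURCE
    #include <stdatomic.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>
    #include <limits.h>

    #define RING_SIZE 4096   /* must be a power of two */

    struct ring {
        _Atomic uint32_t head;   /* advanced by the writer */
        _Atomic uint32_t tail;   /* advanced by the reader */
        char data[RING_SIZE];
    };

    static void futex_wait(_Atomic uint32_t *addr, uint32_t expected)
    {
        syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }

    static void futex_wake(_Atomic uint32_t *addr)
    {
        syscall(SYS_futex, addr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
    }

    /* "write()" that only enters the kernel when the buffer is full. */
    size_t ring_write(struct ring *r, const char *buf, size_t len)
    {
        uint32_t head = atomic_load(&r->head);
        uint32_t tail = atomic_load(&r->tail);

        while (head - tail == RING_SIZE) {       /* full: block, like write() would */
            futex_wait(&r->tail, tail);
            tail = atomic_load(&r->tail);
        }

        size_t n = RING_SIZE - (head - tail);
        if (n > len) n = len;
        for (size_t i = 0; i < n; i++)
            r->data[(head + i) & (RING_SIZE - 1)] = buf[i];

        atomic_store(&r->head, head + (uint32_t)n);
        futex_wake(&r->head);                    /* wake a reader waiting on "empty" */
        return n;
    }

    /* "read()" that stays in userspace whenever data is already there. */
    size_t ring_read(struct ring *r, char *buf, size_t len)
    {
        uint32_t tail = atomic_load(&r->tail);
        uint32_t head = atomic_load(&r->head);

        while (head == tail) {                   /* empty: block, like read() would */
            futex_wait(&r->head, head);
            head = atomic_load(&r->head);
        }

        size_t n = head - tail;
        if (n > len) n = len;
        for (size_t i = 0; i < n; i++)
            buf[i] = r->data[(tail + i) & (RING_SIZE - 1)];

        atomic_store(&r->tail, tail + (uint32_t)n);
        futex_wake(&r->tail);                    /* wake a writer waiting on "full" */
        return n;
    }

As written it still issues a FUTEX_WAKE per call; a real version would track whether the other side is actually sleeping and skip that syscall on the fast path, which is where the context-switch savings would come from.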
Another case is things like user-space networking. If the user has sufficient privilege to write directly to the network anyway, why not make their networking code dynamically run in user space when it makes sense?
Yes, it won't be free, but the point of targeting IO-intensive code for this kind of thing is that a lot of the code paths are quite simple but tend to get executed at high frequency over and over again, so you can in fact often afford to do a lot of work even for relatively modest speedups. Even more so in modern systems where distributing the work over enough cores is often a challenge - any time you have a core idling, you have time to spend JITing existing code paths with more expensive optimizations.
Any must reads for getting into kernel optimization? I'd like to be able to diagnose performance issues better and you seem to know the flow very well. Any advice on material to read or places to hang around would be appreciated!
Specifically kernel optimization isn't really my area. I've mostly approached it from the userspace side in that I've worked a lot on optimizing apps that often turn out to abuse syscalls.
From that side, I'd say looking at tracing tools like systrace is worthwhile. But also just plain strace (system calls) and ltrace (dynamic library calls). "strace -c -p [pid]", wait a bit and then ctrl-c, or just "strace -p [pid]" for a running process on Linux (or drop the "-p" and add the command you want to trace), is hugely illuminating and will often tell you a lot about the code you're running.
You can usually do the same with "proper" profilers (like e.g. gprof), but strace has the benefit of focusing specifically on the syscall boundary and being very easy to pull out. You can use it either for profiling (with -c) or to see the flow of syscalls (without -c). The latter is useful for code that gives opaque error messages, or to find out e.g. which files an application tries to open, and is invaluable for troubleshooting.
Specifically for performance issues, there is one thing that is very valuable to remember: syscalls are expensive. Very expensive. Code that does lots of small read() calls, for example, is almost certainly doing something stupid (like not doing userspace buffering). But in general, a brief run of strace -c, plus figuring out why the top calls occur so often and/or take so long, very often leads to obvious fixes.
E.g. a far too common pattern, though it's getting better, is read()'ing a length indicator in an incoming packet of data, followed by a second read() to read the contents, instead of using a non-blocking socket and doing larger reads into a buffer. When the message flow is slow, it doesn't matter. When it's fast, the difference in throughput can be massive.
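For the sake of illustration, a minimal C sketch of the buffered version, assuming a made-up wire format of a 4-byte length prefix followed by a payload (not any particular protocol):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE (64 * 1024)

    /* Instead of read(fd, &len, 4) followed by read(fd, body, len) per message,
     * pull whatever the socket has in one read() and parse messages out of the
     * buffer in userspace. */
    static char   buf[BUF_SIZE];
    static size_t buf_used;

    void drain_socket(int fd, void (*handle)(const char *msg, uint32_t len))
    {
        ssize_t n = read(fd, buf + buf_used, sizeof(buf) - buf_used);
        if (n <= 0)
            return;                               /* EOF, or EAGAIN on a non-blocking socket */
        buf_used += (size_t)n;

        size_t off = 0;
        while (buf_used - off >= 4) {             /* enough bytes for a length prefix? */
            uint32_t len;
            memcpy(&len, buf + off, 4);           /* a real version would sanity-check len */
            if (buf_used - off - 4 < len)         /* payload not fully received yet */
                break;
            handle(buf + off + 4, len);           /* one message, no extra syscalls */
            off += 4 + len;
        }
        memmove(buf, buf + off, buf_used - off);  /* keep any partial trailing message */
        buf_used -= off;
    }

Two syscalls per message becomes one read() per batch of messages; when the peer is fast, that alone can make the kind of massive throughput difference described above.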
As an additional thing, even redundant security checks will be plenty fast as long as the data being checked is mostly unchanging and colocated with the code (e.g. struct task in Linux).
JITing the write syscall isn't even close to what can be achieved when data has to be copied on average 5 times before it reaches a driver. Mailbox-style IPC everywhere would help with that immensely, but devices tend to skip DMA and require sleeps or slow port writes to do anything, due to hardware issues.
This will be more like the kernel as a JIT compiler instead of the current almost-FSM (finite state machine) model. Improved performance at the expense of (probably) correctness. Expect more of the current Intel-style bugs that are hard to debug and devastating. But implemented well, this could be revolutionary for Ops etc. Self-tuning DBs exist and stuff like the JVM has been around for decades; this simply does that at a lower level.