CPU limitations relative to IO became a much bigger issue again with 10Gbps+ networking and NVMe. But even setting that aside, there is the entirely separate issue of IPC, and of anything else that causes the userspace/kernel boundary to be crossed at a high rate relative to the time spent on external IO; that can be absolutely brutal for throughput.
The biggest speedups I've made on systems I've worked on over the last 20 years have usually come from spotting patterns people don't think about that cause unnecessary context switches, because context switches are far more expensive than people realize.
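To get a rough feel for just the user/kernel crossing (not even a full context switch to another process, which costs considerably more), a quick micro-benchmark along these lines can help. This is only a sketch: SYS_getpid is picked as a near-no-op syscall, and the numbers vary a lot by CPU and mitigation settings.

    /* Rough micro-benchmark: cost of a user/kernel crossing vs. a plain
     * function call. Build with: cc -O2 syscall_cost.c */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define N 1000000

    static long now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000L + ts.tv_nsec;
    }

    __attribute__((noinline)) static long plain_call(long x) { return x + 1; }

    int main(void) {
        long t0, acc = 0;

        t0 = now_ns();
        for (int i = 0; i < N; i++)
            acc += syscall(SYS_getpid);   /* real syscall every iteration, no vDSO or libc caching */
        printf("syscall:       %ld ns/iter\n", (now_ns() - t0) / N);

        t0 = now_ns();
        for (int i = 0; i < N; i++)
            acc += plain_call(i);         /* baseline: an ordinary function call */
        printf("function call: %ld ns/iter (acc=%ld)\n", (now_ns() - t0) / N, acc);
        return 0;
    }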
Now, that is not going to be fixed by "just" JIT'ing the write syscall, but you have to start somewhere, and I think the more promising approach would be to look at which parts of the kernel code you could actually JIT into user space.
E.g. you obviously can't elide any security-related checks, but consider the case where the kernel knows you're writing to a pipe that only a single other process reads from. Could you turn the write()/read() pairs into a buffer mapped directly into both processes, replace the read() syscalls with purely userspace reads whenever data is available, and only require a context switch when one side outpaces the other (e.g. using futexes when the buffer is potentially full or empty)?
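As a rough illustration of the shape this could take (not how any existing kernel does it), here is a minimal single-producer/single-consumer sketch, assuming a MAP_SHARED region visible to both processes. The ring layout, the nonempty flag and the helper names are made up for the example; setup, error handling and the full-buffer path on the writer side are omitted.

    /* Sketch: SPSC ring buffer in shared memory. The hot path is pure
     * userspace loads/stores; the futex syscall only happens when the
     * reader has to sleep or the writer has to wake it. */
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define RING_SIZE 65536   /* power of two */

    struct ring {
        _Atomic uint32_t head;     /* written by producer */
        _Atomic uint32_t tail;     /* written by consumer */
        _Atomic uint32_t nonempty; /* futex word the consumer sleeps on */
        char data[RING_SIZE];
    };

    static long futex(_Atomic uint32_t *addr, int op, uint32_t val) {
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
    }

    /* Producer: copy into the ring; syscall only if the reader may be asleep.
     * Assumes the writer never gets more than RING_SIZE ahead (full-buffer
     * handling would mirror the reader side). */
    static void ring_write(struct ring *r, const char *buf, uint32_t len) {
        uint32_t head = atomic_load(&r->head);
        for (uint32_t i = 0; i < len; i++)
            r->data[(head + i) % RING_SIZE] = buf[i];
        atomic_store(&r->head, head + len);
        if (atomic_exchange(&r->nonempty, 1) == 0)
            futex(&r->nonempty, FUTEX_WAKE, 1);   /* wake a sleeping reader */
    }

    /* Consumer: stay in userspace while there is data; fall back to the
     * futex syscall only when the ring is observed empty. */
    static uint32_t ring_read(struct ring *r, char *buf, uint32_t max) {
        uint32_t tail = atomic_load(&r->tail);
        uint32_t head = atomic_load(&r->head);

        while (head == tail) {
            atomic_store(&r->nonempty, 0);        /* announce: reader may sleep */
            head = atomic_load(&r->head);         /* re-check to avoid a lost wakeup */
            if (head != tail)
                break;
            futex(&r->nonempty, FUTEX_WAIT, 0);   /* sleep only while still empty */
            head = atomic_load(&r->head);
        }

        uint32_t avail = head - tail;
        if (avail > max) avail = max;
        for (uint32_t i = 0; i < avail; i++)
            buf[i] = r->data[(tail + i) % RING_SIZE];
        atomic_store(&r->tail, tail + avail);
        return avail;
    }

The point being that once the mapping exists, the common case is just loads and stores on the shared buffer, and the syscall only reappears when one side actually has to sleep or be woken.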
Another case is things like user-space networking. If the user has sufficient privilege to write directly to the network anyway, why not make their networking code run dynamically in user space when it makes sense?
Yes, it won't be free, but the point of targeting IO-intensive code for this kind of thing is that a lot of the code paths are quite simple yet get executed over and over again at high frequency, so you can in fact often afford to do a lot of work even for relatively modest speedups. Even more so in modern systems, where distributing the work over enough cores is often a challenge - any time you have a core idling, you have time to spend JIT'ing existing code paths with more expensive optimizations.
Any must-reads for getting into kernel optimization? I'd like to be able to diagnose performance issues better, and you seem to know the flow very well. Any advice on material to read or places to hang around would be appreciated!
Specifically kernel optimization isn't really my area. I've mostly approached it from the userspace side in that I've worked a lot on optimizing apps that often turn out to abuse syscalls.
From that side, I'd say looking at tracing tools like systrace is worthwhile. But also just plain strace (system calls) and ltrace (dynamic library calls). "strace -c -p [pid]", wait a bit and then ctrl-c, or just "strace -p [pid]" for a running process on Linux (or drop the "-p" and add the command you want to trace) is hugely illuminating and will often tell you a lot about the code you're running.
You can usually do the same with "proper" profilers (e.g. gprof), but strace has the benefit of focusing specifically on the syscall boundary, and of being very easy to pull out. You can use it both for profiling (with -c) and to see the flow of syscalls (without -c). The latter is useful for code that gives opaque error messages, or otherwise to find out e.g. which files an application tries to open, and is invaluable for troubleshooting.
Specifically for performance issues, there is one thing that is very valuable to remember: syscalls are expensive. Very expensive. Code that does lots of small read() calls, for example, is almost certainly doing something stupid (like not doing userspace buffering). But in general, a brief run of strace -c, followed by figuring out why the top calls occur so often and/or take so long, very often leads to obvious fixes.
E.g. a far too common pattern, though it's getting better, is read()'ing a length indicator in an incoming packet of data, followed by a second read() to read the contents, instead of using a non-blocking socket and doing larger reads into a buffer. When the message flow is slow, it doesn't matter. When it's fast, the difference in throughput can be massive.
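For illustration, here's a sketch of the buffered alternative, assuming a 4-byte big-endian length prefix; handle_message(), the buffer size and the single-fd static state are placeholders for the example.

    /* Sketch: one large read() per wakeup, then parse as many complete
     * [4-byte length][payload] messages as the buffer holds, instead of
     * two small read()s per message. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>   /* ntohl */

    #define BUF_SIZE (64 * 1024)

    void handle_message(const char *payload, uint32_t len);  /* app-specific */

    int read_messages(int fd) {
        static char buf[BUF_SIZE];
        static size_t used = 0;

        /* One syscall pulls in as many messages as the kernel has buffered. */
        ssize_t n = read(fd, buf + used, sizeof(buf) - used);
        if (n <= 0)
            return -1;        /* EOF, error, or EAGAIN on a non-blocking socket */
        used += (size_t)n;

        size_t off = 0;
        while (used - off >= 4) {
            uint32_t len;
            memcpy(&len, buf + off, 4);
            len = ntohl(len);
            if (len > sizeof(buf) - 4)
                return -1;                   /* oversized/corrupt message */
            if (used - off < 4 + len)
                break;                       /* partial message: wait for more data */
            handle_message(buf + off + 4, len);
            off += 4 + len;
        }
        /* keep any trailing partial message for the next call */
        memmove(buf, buf + off, used - off);
        used -= off;
        return 0;
    }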
As an additional point, even redundant security checks will be plenty fast as long as the data being checked is mostly unchanging and colocated with the code (e.g. struct task_struct in Linux).
JIT'ing the write syscall is not even close to what can be achieved when data gets copied on average 5 times before it reaches a driver. Mailbox-style IPC everywhere would help with that immensely, but devices tend to skip DMA and require sleeps or slow port writes to do anything, due to hardware issues.
It's worth at least exploring.