Once the API is high-level enough, it becomes unusable by its major users: high-end networking, GPU and graphics libraries, and low-latency audio.
Nobody else truly needs to bypass the kernel. Even low-latency workloads can get by with good real-time task handling, which leaves exactly those two classes of users, both of which already have special DMA handling in hardware. If the argument is for introducing special-case kernel bypasses for high-scale computing, that's already done, and the low-level APIs just get wrapped.
And the Achilles' heel is security.
If the argument is for making all hardware a kernel-free fabric, it's essentially a move of everything into firmware. Worst case, we get zero memory protection and unfixable bugs.
You would likely be mistaken. Real-time kernels (like Red Hat's MRG product, or the -rt patchset from Thomas Gleixner) have less jitter, i.e. more predictable timing. However, in virtually every average case, they have slightly higher latency than the stock Linux kernel.
The reality of 100G+ or low-latency networking is that the kernel can't keep up with the interrupts from the hardware, so you turn per-packet interrupts off (adaptive coalescing / `ethtool -C` for Ethernet). Userspace TCP/IP stacks such as Intel / Linux Foundation's DPDK[1], Solarflare's OpenOnload[2], Mellanox's VMA[3], and Chelsio's Wire Direct[4] exist to fill this need. Heck, even the BBC wrote their own kernel-bypass[5] networking layer! Note that Solarflare, Mellanox, and Chelsio are all heavily used in High Performance Computing supercomputers as well as in finance, such as electronic trading. If there were no need, the market wouldn't have produced so many options.
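For context, the interrupt-throttling knob mentioned above looks roughly like this. A minimal sketch: the interface name `eth0` and the specific values are placeholders, and the supported coalescing parameters vary by NIC driver.

```shell
# Turn on adaptive RX interrupt coalescing: the driver batches
# received packets per interrupt instead of firing one IRQ per packet.
ethtool -C eth0 adaptive-rx on

# Or set fixed limits instead: interrupt after 50 microseconds or
# 64 received frames, whichever comes first.
ethtool -C eth0 adaptive-rx off rx-usecs 50 rx-frames 64
```

At 100G+ line rates even aggressive coalescing leaves the kernel's per-packet path too expensive, which is where the bypass stacks above come in.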
High-level means not exposing hardware limitations to the application. The primary target applications are datacenter services, which spend much of their time processing network I/O. As network latencies drop to a few microseconds, datacenter applications like Redis will need kernel-bypass because the kernel will become too expensive for them. In our experiments with a 25Gb network, the Linux kernel and POSIX interface cost Redis 60% of its latency.
Network I/O is a major bottleneck for Redis, but they leave a lot on the table by being single-threaded. I can speak from experience because I maintain a multithreaded fork: https://github.com/JohnSully/KeyDB
KeyDB can easily get 2-3x the QPS with half the latency.
This is IMHO a wrong analysis. Redis, while single-threaded, can be scaled by running multiple processes: then if you remove the overhead of the network stack, each process can deliver more QPS, not just better latency. By using threads (which Redis now also does in part, and gets 2x performance by making just 0.01% of the code threaded, that is, a single function) you continue to incur the I/O penalty, just amortized across more threads, but it remains a waste. Also, the latency you measure as reduced with threads is an illusion: it appears only during benchmarks, because the instance is saturated more when running on a single thread. If you measure single-request latencies, they are dominated by the network stack latency.
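A minimal sketch of the multi-process scaling described above (the ports and core numbers are arbitrary placeholders; a real deployment would put Redis Cluster or client-side sharding in front of the instances):

```shell
# One single-threaded redis-server per core, each pinned to its
# own CPU with taskset and listening on its own port.
taskset -c 0 redis-server --port 6379 &
taskset -c 1 redis-server --port 6380 &
taskset -c 2 redis-server --port 6381 &
taskset -c 3 redis-server --port 6382 &
```

Each process then keeps the shared-nothing simplicity of single-threaded Redis while the machine's cores are all put to work.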
The lower latency is not an illusion, it is indeed lower latency for servers with high load. If you don't have high load then I agree the need for threads is eliminated - but people using Redis for real work have traffic where this becomes an issue. Multiple processes require clustering or sharding each with its own set of overheads (both in CPU and human terms).
You and I disagree vehemently on this (hence the fork), but I really think you're optimizing for your own simplicity, not the user's. It should be the opposite, since the developer has the most insight into the software.
I don't think you understood my comment. What I mean is that regardless of what you think of Redis and threads the fact that doing IO is so wasteful and adds latency and CPU time remains and is a constant.
> High-level means not exposing hardware limitations to the application.
This seems counter-intuitive.
Exposing hardware limitations would mean a different abstraction than OS-level APIs, which are what applications actually see.
Even POSIX does not expose hardware limitations.
Rather, "high-level" in the paper means something more like an interface suitable for a wide range of applications, i.e., high-level in that it's targeted to be used directly by applications as a portable interface.
You said you don't want to make users deal with flow control and hardware details... does that imply a userspace bypass library which does that stuff for us? Does it look POSIXy?
Solarflare's OpenOnload and Mellanox's VMA both ship as LD_PRELOAD libraries that intercept traditional socket calls, unless you want to code your apps against their APIs directly.
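For illustration, the LD_PRELOAD mechanism looks roughly like this. `my_server` is a stand-in for any unmodified socket application, and the exact library names and paths may differ by version and install:

```shell
# Mellanox VMA: preload its user-space stack so it intercepts the
# standard socket calls of an otherwise unmodified binary.
LD_PRELOAD=libvma.so ./my_server

# Solarflare OpenOnload ships a wrapper command that sets up the
# same interposition.
onload ./my_server
```

The appeal is that existing socket programs get kernel-bypass for free; the cost is that you inherit whatever POSIX semantics the interposing stack chooses to emulate.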
It looks POSIX-like but uses high-level queues and fixes some issues with epoll. The lack of an atomic data unit and the overhead of the poor epoll interface cost too much to retain for kernel-bypass. Take a look at the paper for more details.
> we found that 30% of the cost of the Linux kernel comes from its interface. This overhead is just too much to carry around while using kernel-bypass devices.
Nearly one third of the total cost is indeed expensive!
Also, the ScyllaDB NoSQL database (a C++ clone of Cassandra) uses the Seastar framework to achieve high I/O throughput.
I am surprised and disappointed that the original paper and the blog post have zero references to unikernel research, despite the fact that unikernels pretty much embody the same overall idea.
I am wondering whether this reflects a missing or simply a different understanding of the concept.
Edit:
Sorry, I didn't initially appreciate the difference between a library OS and a unikernel.
Still, given how closely connected they are, the lack of a reference stands out.
The Demikernel is not a unikernel. It is a library OS compiled as a series of shared libraries. It is not compiled together with the application and doesn’t take into account what features the application uses. It is designed to work with kernel-bypass hardware, like DPDK.
It depends on the interface for the drivers to the application. However, UIO doesn't seem to support DMA, which is a non-starter.
RDMA and DPDK both use user-space drivers, which is necessary for kernel-bypass. I'm not advocating for a particular kernel-bypass solution. I'm arguing that if we use kernel-bypass for I/O, we should have a common, efficient, high-level interface for it.