Let’s start this 9th edition of Kernel Recipes!

We are happy to welcome you after three years of waiting. Jens Axboe is the godfather for this edition and will also be our first speaker in the morning.

This is the live stream for those who prefer video over text.

What’s new with io_uring – Jens Axboe

First of all, what’s io_uring? It’s a way for an application to talk with the kernel. It’s composed of two ring buffers, each carrying communication in a single direction.

Underneath, it uses three io_uring_* syscalls. The main goal of io_uring was to replace AIO for async I/O. Its key features are being async, zero-copy and lockless. It’s intended to be used through liburing instead of the kernel interface directly.
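The two single-direction rings can be pictured with a toy model. The sketch below is not the kernel's implementation and does not use liburing: `Ring` and `toy_kernel` are hypothetical names, and the real rings live in memory shared with the kernel and rely on memory barriers that a Python model ignores. It only shows why one producer and one consumer per ring can cooperate without a lock: each side advances its own index.

```python
class Ring:
    """Single-producer, single-consumer ring with free-running indices."""

    def __init__(self, entries):
        # Power-of-two sizes let the index wrap with a cheap bit mask.
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1  # publish the entry to the consumer
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1  # hand the slot back to the producer
        return item


def toy_kernel(sq, cq):
    """Stand-in for the kernel: consume submissions, post completions."""
    while (request := sq.pop()) is not None:
        user_data, payload = request
        cq.push((user_data, len(payload)))  # (user_data, result)


sq, cq = Ring(4), Ring(4)           # submission and completion rings
sq.push((1, b"hello"))              # the application only writes to sq...
sq.push((2, b"io_uring"))
toy_kernel(sq, cq)
completions = [cq.pop(), cq.pop()]  # ...and only reads from cq
print(completions)                  # [(1, 5), (2, 8)]
```

The `user_data` value is what lets the application match a completion to the request it submitted, since completions may arrive in any order.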

Jens Axboe at Kernel Recipes 2022

What’s new?

Native workers are now really native: the kernel workers doing tasks on behalf of the application now run as io-threads instead of dangerously trying to assume the app’s credentials. This fixed a few security issues and corner cases at the same time.

TIF_NOTIFY_SIGNAL is also a new feature to better handle the way signals are sent to io_uring tasks. It was a lot of work because it is architecture-dependent.

Direct descriptors are a new type of file descriptor that only exists within a ring: they are faster than regular file descriptors (less locking), and enable having multiple operations on a file. A more recent change, upcoming in 5.19, is to let the kernel manage the direct descriptors instead of having the application do it.

What about bypassing fget/fput locking on opening the ring itself? It’s now possible as well, albeit a bit dangerous if used improperly.

io_uring can also now manage a buffer pool provided by the application, which avoids passing a new buffer for each read/write call: a buffer is picked from the provided ones. Also in 5.19, there are now ring provided buffers: the application hands them to the kernel through a shared ring instead of through separate requests, for better performance. They cannot be used at the same time as classic provided buffers; liburing usually takes care of those details.
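The buffer-pool idea can be sketched with a small model. This is only an illustration of the flow, not io_uring's API: `BufferPool`, `provide`, `pick` and `toy_recv` are hypothetical names, and the buffer sizes are arbitrary. The application hands over buffers in advance, the "kernel" side picks one when data arrives, and the completion reports which buffer ID was used so the application knows where to look.

```python
class BufferPool:
    """Toy buffer group: the application provides buffers up front."""

    def __init__(self):
        self.free = {}  # buffer ID -> bytearray, currently owned by the "kernel"

    def provide(self, buf_id, size):
        self.free[buf_id] = bytearray(size)

    def pick(self):
        # The "kernel" chooses any available buffer when data arrives.
        if not self.free:
            return None
        buf_id = next(iter(self.free))
        return buf_id, self.free.pop(buf_id)


def toy_recv(pool, incoming):
    """Fill a picked buffer; return (buf_id, length, buffer) or None."""
    picked = pool.pick()
    if picked is None:
        return None  # no buffer available: the request cannot complete
    buf_id, buf = picked
    n = min(len(buf), len(incoming))
    buf[:n] = incoming[:n]
    return buf_id, n, buf


pool = BufferPool()
pool.provide(0, 8)
pool.provide(1, 8)

# The request itself names no buffer; the completion says which one was used.
buf_id, n, buf = toy_recv(pool, b"ping")
print(buf_id, bytes(buf[:n]))  # 0 b'ping'

pool.provide(buf_id, 8)  # give the buffer back once the data is processed
```

The last line is the step the application must not forget: a consumed buffer leaves the pool until it is provided again.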

It’s now possible to do custom asynchronous commands in drivers thanks to the ->uring_cmd() addition. It is used for example for NVMe vendor commands to replace what would usually be done in synchronous ioctls.

Cooperative completion scheduling has been added and helps a bit for a few network tasks. For 5.19, cancellations have been improved to match more conditions. Multishot accept requests were also added for accepting many connections asynchronously.

Apart from this, there were many performance optimizations over the last releases.

liburing 2.2 will be synchronized with 5.19 and have all the new features. It now has 80 manpages instead of 8, and more regression tests.

Microsoft added support for “I/O Rings”, a design very similar to io_uring. It should open the road to write cross-platform applications with ring I/O.

Of course, it does not stop here. Many improvements are queued for 5.20 or planned for the future: true async buffered writes, level-triggered poll support, faster io-wq, incrementally consumed provided buffers, and splitting up the 13k+ line fs/io_uring.c file in the kernel.

Jens says that moving to the completion-based model of io_uring might take a long time: it is a long-term, 10-year project.

Make Linux developers fix your kernel bug – Thorsten Leemhuis

The Linux kernel is made by volunteers. It does not necessarily mean hobbyists; even if they are paid, it means that you can’t really force anyone to work on a bug.

Most people want to work on building what they care about, but sometimes Linus Torvalds might need to step in if things are badly broken.

In addition, most developers will gladly address issues in their code; but life sometimes gets in the way. You can help by writing decent bug reports: it makes fixing bugs much easier.

Thorsten Leemhuis

How to create a decent report?

The first thing is to ensure that your kernel is vanilla. This is not the case for most kernels in the wild: they are built by distros, which makes them unsuitable for reporting issues to Linux maintainers. In this case you should report the issue to your distribution. Or better yet, install a vanilla kernel: there are often pre-built ones, and Thorsten maintains the one for Fedora. You can also compile a kernel yourself.

Once you have a vanilla kernel, verify that you still have the issue, and then report it directly upstream, not through your distro.

The second thing is to ensure your kernel is fresh. What does that mean? First test the latest mainline kernel; most of the time, it means -rc releases. Do not use longterm (LTS) or stable kernel series for a report. The only exception is when you have a regression within a stable or longterm series.

Then you need to ensure your kernel’s integrity: verify that it’s not tainted by checking that /proc/sys/kernel/tainted contains 0. You should also make sure you’re not using out-of-tree drivers, even if your kernel is not tainted. Before doing the report, disable all such drivers, and reboot without them to verify that you can still reproduce the issue.
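The taint check can be scripted. The value in /proc/sys/kernel/tainted is a bitmask, so a non-zero value can be decoded to see why the kernel is tainted. The sketch below is a minimal helper, not an official tool: `decode_taint` and `read_taint` are hypothetical names, and `TAINT_FLAGS` only lists a subset of the bits described in the kernel's "Tainted kernels" documentation; other bits are reported by number.

```python
from pathlib import Path

# Subset of the taint bits from the kernel's "Tainted kernels" documentation.
TAINT_FLAGS = {
    0: "P: proprietary module loaded",
    1: "F: module force-loaded",
    7: "D: kernel oops or BUG occurred",
    9: "W: kernel warning issued",
    10: "C: staging driver loaded",
    12: "O: out-of-tree module loaded",
    13: "E: unsigned module loaded",
}


def decode_taint(value):
    """Return one description per bit set in the taint bitmask."""
    reasons = []
    bit = 0
    while value >> bit:
        if value & (1 << bit):
            reasons.append(TAINT_FLAGS.get(bit, f"bit {bit} set"))
        bit += 1
    return reasons


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask (Linux only)."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    try:
        value = read_taint()
    except (OSError, ValueError):
        print("could not read the taint value (not running on Linux?)")
    else:
        if value == 0:
            print("kernel is not tainted")
        else:
            print(f"kernel is tainted ({value}):")
            for reason in decode_taint(value):
                print("  " + reason)
```

For example, `decode_taint(4097)` reports both a proprietary module (bit 0) and an out-of-tree module (bit 12), which is a typical result with a vendor graphics driver loaded.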

All of this is described in the official kernel documentation on how to report issues, which Thorsten wrote.

Don’t forget to verify your hardware’s integrity (e.g. no overclocking), and check your dmesg log for errors.

Before sending: check that you are submitting your bug report to the right place. And this is not a simple task: the official bugzilla.kernel.org is most of the time the wrong place. You should check the MAINTAINERS file to find the proper place, most often a mailing list.

Finally, make your report as clear, short, and simple as possible.

Kinds of issues

Security issues are usually the ones that developers are obliged to address. The same goes for data loss or devastating bugs. Regressions are also very important to the Linux kernel development process, and should be fixed very quickly.

Thorsten is working on tracking regressions, and has a dedicated bot for this; the EU funded his work in the past, and now Meta has stepped in to continue. If you see a regression, don’t forget to copy the regressions mailing list when sending your report. Note that the no-regression rule only covers regressions visible from userspace, and of course you should be using a similar kernel config.

During the reporting process, you will most likely be asked to find the culprit yourself, because you are the one who can reproduce the issue. This is done with git bisect. Once the commit introducing the issue is found, a fix is pretty much guaranteed; and reverting is always a possibility.
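Bisection is a binary search over the commit history, which is why it stays practical even across thousands of commits. The sketch below illustrates the idea only; git does the real work, including handling non-linear history. The function name, the fake commit names and the culprit index are all hypothetical, and the model assumes that every commit after the culprit is also bad.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first bad commit, number of kernels tested).

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad.
    """
    good, bad = 0, len(commits) - 1
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1  # in real life: build, boot and test this kernel
        if is_bad(commits[mid]):
            bad = mid   # the culprit is at mid or earlier
        else:
            good = mid  # the culprit is after mid
    return commits[bad], tested


# Hypothetical linear history of 1000 commits with the bug introduced at 618.
history = [f"commit-{i}" for i in range(1000)]
culprit_index = 618


def reproduces(commit):
    return int(commit.split("-")[1]) >= culprit_index


first_bad, tested = bisect_first_bad(history, reproduces)
print(first_bad, "found after testing", tested, "kernels")
```

With 1000 commits between the good and bad versions, at most 10 kernels need to be built and tested, since each test halves the remaining range.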

Some issues, on the other hand, are likely to be ignored. For example, Linux contains a few incomplete drivers. Or some drivers might face real-world issues, like the nouveau driver, whose developers might not have the necessary knowledge to use some hardware features.

Sometimes code does not have an active maintainer. It remains in the kernel because it is useful to people, and removing it might break the no-regression rule. Other times, the maintainer might document that they are only doing Odd fixes, but a report should still be sent. When code is fully Orphan, a report should be sent as well, but it’s usually expected that no outside fix will happen.

To conclude, Thorsten says you should definitely look at the step-by-step guide to report issues, which is now part of the official documentation.

io_uring: path to zerocopy – Pavel Begunkov

The goal with modern I/O is to have zerocopy for maximum performance, and even peer-to-peer DMA where possible. Pavel has been working on networking with io_uring.

An example with network send requests: zerocopy is supported when using MSG_ZEROCOPY, but the buffer stays in use by the kernel until the data has been sent, so the application has to wait for a notification before it can touch that memory again. The io_uring model is different.

Pavel Begunkov

It exposes two different models: storage-like and two-step. Each has its own pros and cons, and depending on the type of network requests (TCP or UDP), one or the other would be better. The goal is to choose a model for networking with io_uring.

In the v1 patches, the storage-like model was chosen, but network developers weren’t really pleased with the proposal. v2 used a notification registration model; it had much better performance. It managed to use io_uring registered buffers, bringing io_uring performance advantages. During testing with UDP, the zerocopy patches brought a nice advantage with bigger (4k) payloads.

What’s being worked on: peer-to-peer DMA with dmabuf. It’s still in discussion, and can feel hacky for now; p2pdma would need to be supported in the networking layer.

In the future, zerocopy receive might be worked on as well. Nothing new compared to the current solutions: with mmap, it would be like TCP_ZEROCOPY_RECEIVE. With provided buffers, a zctap/AF_XDP approach could be taken. But it requires hardware support.

That’s it for this morning! Continue reading with the afternoon liveblog.