Continued from this morning’s live blog.

Kernel and security, or CVEs are alive, do not panic! by Greg KH

Five years ago at Kernel Recipes, Greg was firmly against CVEs. He starts by admitting he was wrong.

Things changed with CNAs (CVE Numbering Authorities), an approach pushed by the Python project, allowing projects to be in charge of their own CVEs. Linux is now a CNA, too.

To be a CNA, kernel.org is required to document every vulnerability.

Everything happens in public, Greg says, with an open mailing list, process documentation, a git repo for the tooling, and a list to announce the CVEs.

But what is a vulnerability? There is an official definition that is very broad. The kernel project refined it to a few classes of bugs that are known vulnerabilities (e.g. overflows). This includes, for example, triggering a warning with WARN_ON; those get a CVE because billions of Android devices are configured with panic_on_warn. Data corruption or data loss is not considered a vulnerability under cve.org policy, and neither are performance issues.

CVEs are assigned after the fact, usually with a 1-2 week delay, in order to allow systems to be updated before public announcements; only the specific fix (not its dependencies) is referenced, and the fixes are not tested independently.

Linux does not take care of hardware CVEs, such as Spectre-like bugs.

CVEs are assigned by a team of three people, each analyzing every patch and bugfix. External developers, such as distro maintainers, often help and propose CVEs. CVE rejections are also considered, with discussions when needed.

Greg KH

On average since the beginning of the new approach, there have been 55 CVEs per week. Other CNAs and projects sometimes issue more CVEs per week, sometimes a comparable number.

Kernel CVEs are descriptive: all the metadata is published in machine-readable JSON files; every CVE says which files and which versions are affected. Each user is only affected by a very small subset.
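To make the "machine-readable" point concrete, here is a minimal Python sketch of consuming such a record. The record below is invented for illustration and only loosely modeled on the CVE JSON schema; the field names are assumptions, not the kernel CNA's actual layout.

```python
import json

# A trimmed, hypothetical CVE record; field names are illustrative only,
# not the actual schema published by the kernel CNA.
record_json = """
{
  "cveMetadata": {"cveId": "CVE-2024-99999"},
  "affected": [
    {"file": "drivers/net/example.c",
     "versions": [{"introduced": "6.1", "fixed": "6.6.30"}]}
  ]
}
"""

def affected_files(record):
    """Return the list of source files a CVE record says are affected."""
    return [entry["file"] for entry in record["affected"]]

record = json.loads(record_json)
# A user only cares if these files are actually built into their kernel.
print(affected_files(record))
```

This is why per-user triage is feasible at all: a script can drop every CVE whose affected files are not in your kernel config.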

But some CVEs are likely to affect you, Greg says, and the solution is to run a stable kernel with all patches. Triaging all of this manually is a lot of work, so whether or not you take all stable patches, it's better to have a process in place to update regularly.

Greg gives the example of the Android project: Android syncs automatically and regularly with the stable kernel, so it gets all fixes, and the team also reviews fixed CVEs at the same time.

To sum up, if you do not take stable updates, there are two options: the first is to triage 55 CVEs and their associated patches every week; the second is to fix your process so that you can take stable updates after all, Greg says.

One of the tools used by the kernel.org CVE team is dyad, built to find vulnerable/fixed git commit ranges. What Greg found is that Linux git commits are a mess; they contain invalid Fixes: tags, for example.
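dyad itself is written in bash, but the core of the "vulnerable/fixed range" idea is mapping a fix back to the commit that introduced the bug via its Fixes: tag. Here is a hedged Python sketch of that extraction step; the commit message is invented, and real tags also carry the quoted subject line.

```python
import re

# Extract "Fixes:" tags from a commit message, the way a tool like dyad
# must in order to pair a fixing commit with the commit that introduced
# the bug. Real tags look like: Fixes: <12+ hex chars> ("subject").
FIXES_RE = re.compile(r'^Fixes:\s*([0-9a-f]{8,40})\b', re.MULTILINE | re.IGNORECASE)

commit_message = """\
net: example: avoid out-of-bounds read

Check the length before copying.

Fixes: 1234abcd5678 ("net: example: add fast path")
Cc: stable@vger.kernel.org
"""

def fixes_tags(message):
    """Return the commit ids referenced by Fixes: tags, if any."""
    return FIXES_RE.findall(message)

print(fixes_tags(commit_message))
```

The "commits are a mess" observation follows directly: a tag with a truncated or mistyped hash silently breaks this mapping, which is why dyad has to cope with invalid tags.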

Another tool is bippy, which creates a CVE record for a git commit, in JSON and mbox formats for emailing.

strak is a tool that shows how vulnerable a given kernel release is, by listing all CVEs assigned to it. All these tools are written in bash, and abuse the filesystem and git as a database.

Kees compiled a list, and everyone agrees that CVEs are an imperfect mapping of flaws, but this is the current system. Before, there were a lot of false negatives. Now there are fewer false negatives, at the cost of a small number (1 to 2%) of false positives, Greg says.

Giving Rust a chance for in-kernel codecs by Daniel Almeida

This talk is about hardware codec accelerators: specialized hardware that speeds up decoding and encoding of video codecs. They are usually faster and generate less heat, but kernel drivers are needed to use them.

But how would Rust help here? Inside a video bitstream, there is metadata and data. The metadata controls the decoding process: a change in one parameter changes how the hardware interprets the rest of the bitstream. And the metadata is parsed from untrusted user input.
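To illustrate why parsing untrusted metadata is dangerous, here is a tiny Python sketch with an invented bitstream "format" (one length byte followed by payload): a single header field controls how the rest of the buffer is interpreted, so it must be validated before use.

```python
# Invented toy format for illustration: [1-byte payload length][payload].
# The length field comes straight from the untrusted bitstream.
def parse_frame(buf: bytes) -> bytes:
    if len(buf) < 1:
        raise ValueError("truncated header")
    declared_len = buf[0]          # untrusted metadata
    payload = buf[1:1 + declared_len]
    if len(payload) != declared_len:
        # Without this check, the decoder would trust a length that the
        # buffer cannot back -- the class of bug memory-safe code rules out.
        raise ValueError("declared length exceeds buffer")
    return payload

print(parse_frame(b"\x03abc"))
```

In C, forgetting that second check is an out-of-bounds read; in safe Rust, the slice access would panic rather than read past the buffer.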

Currently, to use accelerators, userspace programs may do some checks, but those are untrusted. The kernel has its own checks. And if something goes wrong, the hope is that the device just “hangs”; in that case, you have to reboot the machine.

Not long ago, a PhD thesis by Willy R. Vasquez found many issues with H.264 decoding in VLC, Firefox, Android, Windows software, etc.

Last year, Daniel proposed writing a media codec driver entirely in Rust. It has many safety advantages.

But the first step for such a driver is to add a layer of bindings, or abstractions. This did not please the media maintainers, so the approach was given up. The common remark about the abstractions: who will maintain them?

So Daniel thought: what if we could write Rust code without bindings? Instead of writing a full driver in Rust, only some library functions of the driver would be in Rust.

The goal is to write Rust, then generate machine code that can be called from C. The function names must not be mangled, so they are annotated with the Rust attribute #[no_mangle]; this loses some Rust features, but it’s fine for the purpose of interfacing with C.

To do this, the header files for the bindings could be written by hand, but this does not scale. There is a tool for that, called cbindgen, which processes Rust code and auto-generates header files for all pub extern "C" functions. For this to work, Rust structs must be annotated with #[repr(C)] so that their layout matches what a C compiler would produce.
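As a rough illustration of what a "C-compatible layout" means, here is a Python ctypes sketch: ctypes reproduces the layout a C compiler would use (field order, sizes, alignment padding), which is exactly what #[repr(C)] guarantees on the Rust side. The struct and its fields are invented, not an actual V4L2 type.

```python
import ctypes

# Hypothetical parameter struct as C would lay it out. A #[repr(C)]
# Rust struct with the same fields must match this byte-for-byte.
class FrameParams(ctypes.Structure):
    _fields_ = [
        ("width", ctypes.c_uint32),    # offset 0
        ("height", ctypes.c_uint32),   # offset 4
        ("profile", ctypes.c_uint8),   # offset 8
        # the compiler adds 3 bytes of tail padding so the struct size
        # is a multiple of its 4-byte alignment
    ]

print(ctypes.sizeof(FrameParams))
```

Without #[repr(C)], the Rust compiler is free to reorder fields, and a struct passed across the C boundary would be silently misread.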

This strategy works best when converting self-contained components. For video4linux, the target was codec libraries and codec parsers, which are used to get the data from a bitstream into a form usable by the hardware decoders.

So Daniel worked on rewriting the VP9 library; two drivers were converted. With the testing tool, the Rust version got the same score as the C version: no regressions.

While the approach is much less ambitious than last year’s proposal, it works, and is less inconvenient for kernel maintainers. In the future, drivers fully written in Rust can also use this Rust code, without going through the C binding layer.

The media people gave different feedback on this work: they asked about fixed bugs and performance impact, and are generally open to the idea of merging it. In terms of performance, there should be no impact from using a Rust binding; doing more checks might have some impact, but this is not the performance-critical path for programming hardware codecs.

Tracking down sources of kernel errors with retsnoop by Andrii Nakryiko

Why does retsnoop exist? Mostly to understand what triggers errors. It’s a mass-tracer of kernel functions, using eBPF under the hood. An important goal is a high signal-to-noise ratio, which is why it captures only errors by default and has filtering capabilities.

Andrii talks about the concept of a “session” to limit the scope of watched functions. To do that, retsnoop distinguishes entry and non-entry functions, to be able to start and stop the watching process. Andrii gives an example of calling retsnoop to watch for errors in the call stack of a bpf syscall.

When using the tool, the first step is to specify the set of entry functions. Then non-entry and denied functions can be added. The tool supports globs in function name filters, as well as source code and module paths; it relies on DWARF information for the latter.

Kernel tracing has a few gotchas: the compiler may add suffixes to function names, and inlined functions might not be traceable. retsnoop’s --dry-run option, coupled with the verbose -v mode, helps prepare before writing a filter.

By default, retsnoop only records error stack traces, so the entry function has to return an error-like result: NULL for functions returning a pointer, or a negative -Exxx value for functions returning an integer.
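A sketch of what such an "error-like result" test looks like, in Python for illustration: the kernel signals errors from int-returning functions with small negative -Exxx values, and from pointer-returning functions with NULL (or ERR_PTR encodings in the last page of the address space). The MAX_ERRNO constant mirrors the kernel's; how retsnoop implements this internally is an assumption here.

```python
# Mirrors the kernel's MAX_ERRNO: error values live in [-4095, -1].
MAX_ERRNO = 4095

def int_return_is_error(ret: int) -> bool:
    """Error-like integer return: a small negative -Exxx value."""
    return -MAX_ERRNO <= ret <= -1

def ptr_return_is_error(ptr: int) -> bool:
    """Error-like pointer return: NULL, or an ERR_PTR value, i.e. a
    -Exxx value cast to an unsigned 64-bit pointer (last 4095 bytes
    of the address space)."""
    return ptr == 0 or ptr >= (1 << 64) - MAX_ERRNO

print(int_return_is_error(-13))  # -EACCES is error-like
```

This range-based convention is why retsnoop can decide "error or not" generically, without per-function knowledge.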

retsnoop supports filtering by thread name, pid, or duration of the entry function, and all of those can be combined.

If the CPU supports LBR (Last Branch Record), retsnoop can take advantage of it to get more insight into inlined code. When passing --lbr=any, more information becomes available (all code jumps), which helps find out what’s happening by looking at source line numbers.

Andrii Nakryiko

In general, retsnoop won’t solve your problems immediately, but it helps as part of an iterative, refining process.

retsnoop can also work in function call trace mode with -T, which is very similar to ftrace’s function_graph tracer.

Recently, support for capturing function arguments was added as well, to help debug a given error condition. Andrii gives the example of cap_capable returning -EPERM: but why? With argument decoding, it’s easier to understand which capability was needed. Argument decoding relies on BTF type information; retsnoop will also dereference struct pointers and capture their data, to better understand structures passed by pointer.

Support for injected kprobes, kretprobes and tracepoints has been added to retsnoop. In verbose mode, it’s possible to get very detailed information, including full CPU register status at syscall entry points for example.

retsnoop is available at https://github.com/anakryiko/retsnoop with links to documentation. Also check the --help and --config-help, Andrii says. There are automated x86_64 and arm64 builds, and Arch and Fedora have official packages.

Efficient zero-copy networking with io_uring by Pavel Begunkov and David Wei

Pavel started by saying it’s not the first time he has talked about io_uring and zerocopy, but things have progressed again in that area.

But why do we need zerocopy? Pavel showed a graph of the evolution of network speeds on a logarithmic scale. As networks get faster, they use more CPU, and copying the data can be very slow.

Linux has supported some zerocopy modes for a while: the TCP_ZEROCOPY_RECEIVE sockopt, DPDK for kernel bypass, AF_XDP, and protocols like InfiniBand. Why are there so many different solutions, each with so many caveats? The problem is that the userspace application chooses the buffer. When traffic arrives, the application must not receive any traffic other than its own; that would be a security issue. The proper queue must be selected as well, with adequate steering, which is luckily supported by network cards.

Pavel Begunkov speaking with David Wei in the background.

Another issue is that the kernel code used to take a struct page, but this represents a kernel page. So the solution is to pass a wrapper instead, called memtcp. Once the application has finished processing the data, it can free up the buffer, which goes to a refill queue before rejoining the page pool used by the RX queue.
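The buffer cycle just described can be sketched as a toy model, here in Python: the application returns a finished buffer to a refill queue, from which it rejoins the page pool that feeds the RX queue. The names and the recycle-on-empty policy are illustrative assumptions, not the kernel's actual data structures.

```python
from collections import deque

# Toy model of the zerocopy RX buffer cycle: page pool -> RX queue ->
# application -> refill queue -> back to the page pool.
refill_queue = deque()
page_pool = deque(f"buf{i}" for i in range(4))

def rx_fill():
    """NIC takes a buffer from the page pool to receive into."""
    if not page_pool:                 # pool empty: recycle refilled buffers
        page_pool.extend(refill_queue)
        refill_queue.clear()
    return page_pool.popleft()

def app_done(buf):
    """Application is done with the data; the buffer goes to refill."""
    refill_queue.append(buf)

b = rx_fill()
app_done(b)
print(b, len(page_pool), len(refill_queue))
```

The point of the cycle is that buffers are never copied, only ownership moves between the NIC, the application, and the pool.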

David then takes the mic to present the benchmark setup and results. They use kperf, a more sophisticated iperf. The first comparison is between epoll and io_uring zerocopy at 1500 MTU: a 31.4% increase in bandwidth, from 68.8 Gbps to 90.4 Gbps. Then, with the memcmp of the data disabled, the bandwidth improvement is 43.4%, pushing performance even higher. At 4096 MTU, the gains are 37.8% and 41% respectively.
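As a quick sanity check on the quoted 1500 MTU numbers, the relative gain works out as reported:

```python
# Relative bandwidth gain of io_uring zerocopy over epoll at 1500 MTU,
# using the figures quoted in the talk.
baseline_gbps = 68.8
zerocopy_gbps = 90.4
gain = (zerocopy_gbps / baseline_gbps - 1) * 100
print(f"{gain:.1f}%")  # matches the quoted 31.4%
```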

In order to support zerocopy RX, it needs support in the NIC, its firmware, and the drivers.

Since userspace memory now needs to be handled by the network stack, another type of memory had to be added: struct page represents host kernel memory, while a net_iov can hold either host userspace memory (for io_uring zerocopy) or host device memory. All of these are abstracted with netmem.

A new netdev queue API is needed: reconfiguring network queues currently requires bringing the whole device down, which is not very efficient. For zerocopy, specific queues have to be configured, and that should not take the device down. The new netdev queue API solves this, by allowing individual queues to be reconfigured.

Currently, support is implemented in Broadcom bnxt and Google gve. Mellanox mlx5 support is a WIP by NVIDIA developers.

Queues are configured with ethtool: first setting combined mode, then a queue for “normal traffic”, then using flow steering for a given queue. In the application using io_uring, the if_index and queue number are set in the io_uring_zcrx_inq_req structure. The refill ring also needs to be set up, by passing io_uring_zcrx_rq to io_uring_zcrx_mmap. Requests are then prepared and submitted to the ring, and completions are processed from the completion ring (cqe). The buffer area is encoded in the upper 16 bits of the offset. Once the data has been processed, the buffer is passed to the refill queue using a process similar to the submission step.
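The "upper 16 bits of the offset" encoding can be sketched as a bit-manipulation helper, in Python for illustration. The 64-bit width of the offset field and the names below are assumptions based on the description in the talk, not the actual io_uring uapi.

```python
# Decode a zerocopy RX completion offset as described in the talk:
# the upper 16 bits select the buffer area, the rest is the offset
# within that area. 64-bit field width is an assumption.
AREA_SHIFT = 48  # upper 16 bits of a 64-bit offset

def decode_zcrx_offset(off: int) -> tuple[int, int]:
    area_id = off >> AREA_SHIFT
    offset = off & ((1 << AREA_SHIFT) - 1)
    return area_id, offset

# Example: area 2, byte offset 0x1000 within that area.
off = (2 << AREA_SHIFT) | 0x1000
print(decode_zcrx_offset(off))
```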

Often, zerocopy can be combined with another operation, for example in-kernel TLS decryption with kTLS, removing the need for userspace decryption. But in this case, there will still be a “copy” of the data during decryption, so no memory bandwidth is saved. PSP can provide a solution, by decrypting in the NIC so that the host receives plaintext directly.

Another issue is that sometimes userspace needs to copy the data anyway, for alignment purposes in block operations.

In the future, optimizations like refcounting and support for huge pages and larger chunks will be worked on. The full patchset has been in-progress for a while, and is nearing completion.

That’s it for the first day’s live blog! See you tomorrow morning!