Welcome back for the final day of Kernel Recipes. You can read yesterday's morning and afternoon live blogs as well.
We start the day with the traditional charity auction. All proceeds will go to Lar Sao Domingos, a Brazilian charity teaching and supporting poor children. 2570€ was raised for this auction with the various items sold, and the conference decided to double the amount.
Making the Linux Kernel by Steven Rostedt
Steven started with his usual selfie, taken with a dedicated compact camera. The original title was “Making the linux kernel suck less“, but something changed a few days ago: Thomas Gleixner recently gave Linus the first ever physical pull request, to merge PREEMPT_RT; so the new title is “PREEMPT_RT Making the linux kernel suck less“.
So now real-time is upstream in 6.12. Why did it take so long to be merged? The merge has been going on for 20 years, Steven says, and over the years many features have been merged, improving the kernel over time. It started with very simple things, such as the mutex: those did not exist before, only semaphores.
Another feature was NO_HZ: before, the kernel was ticking regularly all the time, even when it had nothing to do. Thomas Gleixner had to pull a trick to package high resolution timers with NO_HZ at the time, and then became timer maintainer as well.
Lockdep was another. Before it, Steven and other real-time developers kept finding deadlocks in the kernel that were latent upstream, but blew up in RT. They kept reporting those, but new deadlocks kept being added. So they worked on adding lockdep to the kernel, which can compute dependencies between locks at runtime. And developers shouldn’t silence lockdep: doing so usually just papers over bugs, Steven says. Lockdep can show hidden dependencies between locks taken by different tasks and by interrupts, for example. It can also show dependencies hidden in memory reclaim when locks are held.
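To make the idea of computing lock dependencies at runtime more concrete, here is a toy userspace sketch. It is not lockdep's actual implementation: the names `record_acquire`, `reset_deps` and `MAX_LOCKS` are made up for illustration, and locks are reduced to small integers. It only shows the core principle, which is to record which lock was taken while another was held, and to flag an ordering that closes a cycle even if the deadlock never actually triggered.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8   /* hypothetical limit, for illustration only */

/* dep[a][b] is true when lock b was ever taken while lock a was held. */
static bool dep[MAX_LOCKS][MAX_LOCKS];

/* Depth-first search: is there a recorded dependency chain from -> to? */
static bool reaches(int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < MAX_LOCKS; next++)
        if (dep[from][next] && !seen[next] && reaches(next, to, seen))
            return true;
    return false;
}

/*
 * Record "taking `lock` while `held` is already held".
 * Returns false when that ordering would close a cycle (a possible
 * deadlock, including taking the same lock twice); in that case the
 * dependency is not recorded.
 */
bool record_acquire(int held, int lock)
{
    bool seen[MAX_LOCKS] = { false };

    if (reaches(lock, held, seen))
        return false;   /* an existing chain already orders lock before held */
    dep[held][lock] = true;
    return true;
}

/* Forget every recorded dependency. */
void reset_deps(void)
{
    memset(dep, 0, sizeof(dep));
}
```

With locks A=0, B=1 and C=2, recording A then B, and B then C, succeeds; a later attempt to take A while holding C is rejected, because the chain A, B, C already exists. No two tasks ever had to race for the report to appear, which is why such checks can catch latent deadlocks.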
Another feature coming from Linux RT was thread priority inheritance: it can remove cases of unbounded priority inversion (with userspace), by analyzing priority inheritance.
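From userspace, priority inheritance is typically requested through the POSIX mutex protocol attribute. The sketch below is a minimal illustration and was not shown in the talk; the helper name `init_pi_mutex` is made up, while the pthread calls are the standard POSIX ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

/*
 * Initialise `m` as a priority-inheritance mutex.
 * Returns 0 on success, or the error code of the failing pthread call.
 */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);

    if (err)
        return err;

    /*
     * With PTHREAD_PRIO_INHERIT, a thread holding the mutex is boosted
     * to the priority of the highest-priority thread waiting for it.
     */
    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!err)
        err = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return err;
}
```

A mutex initialised this way is then used with the usual `pthread_mutex_lock` and `pthread_mutex_unlock` calls. The boost only matters when threads run under a real-time scheduling policy with different priorities.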
Interrupt kthreads to handle the work of slow interrupt handling (like hard disks) also come from RT.
ftrace came from RT function tracing. Steven initially said “give me three months, I’ll push it upstream”; it has been ongoing ever since.
printk, the function, used to have (in 6.11) the largest amount of code dating back to the first Linux versions. Steven showed an example of a system having a max latency of 68us with cyclictest; by using the same benchmark, but triggering printk, the max latency jumped to 34ms. printk was the last blocker for PREEMPT_RT: the old printk serialized the output; the new one merged for 6.12 is threaded and allows all consoles to be printed at once. It can now be called in any context, such as a Non Maskable Interrupt (NMI) or the scheduler code (which would previously deadlock when calling printk).
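For readers unfamiliar with this kind of benchmark, here is a rough sketch of what a wakeup latency measurement in the spirit of cyclictest looks like: sleep until an absolute deadline, then check how late the wakeup was. This is an illustration under simplifying assumptions, not cyclictest's code. The function name `max_wakeup_latency_ns` is made up, the interval is assumed to be below one second, and the sketch does not set a real-time scheduling policy or lock memory as a real measurement would.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Difference a - b in nanoseconds. */
static long long ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (long long)(a->tv_sec - b->tv_sec) * NSEC_PER_SEC
         + (a->tv_nsec - b->tv_nsec);
}

/*
 * Sleep until an absolute deadline `loops` times and return the worst
 * observed wakeup delay in nanoseconds, or -1 on error.
 */
long long max_wakeup_latency_ns(int loops, long interval_ns)
{
    struct timespec next, now;
    long long worst = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (int i = 0; i < loops; i++) {
        /* Advance the absolute deadline by one interval. */
        next.tv_nsec += interval_ns;
        while (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }

        if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) != 0)
            return -1;
        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;

        /* How long after the deadline did we actually run? */
        long long late = ts_diff_ns(&now, &next);
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Calling `max_wakeup_latency_ns(1000, 1000000)` would sample a thousand 1ms periods. To put the figures from the talk in perspective, going from 68us to 34ms is a 500-fold increase in worst-case latency.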
In 2008, users were complaining about build speed: they couldn’t bisect issues, because they only had the huge distro configs. Configs needed to be stripped down, without breaking the machine. But Steven had already had a script for that since 2004; so they asked him to merge it, and this became make localmodconfig.
Steven also talked about CONFIG_JUMP_LABEL, which was added a long time ago in order to have static branch conditions that would have no impact at runtime. It’s disabled in some configs, but can yield up to 5% performance improvement.
Many people sprinkle cond_resched in their kernel code because they don’t use CONFIG_PREEMPT kernels. But that’s not a good approach, Thomas Gleixner said (it was presented at last year’s Kernel Recipes). Thomas came up with LAZY_NEED_RESCHED to fix this issue in RT kernels: lock contention increasing latency on CONFIG_PREEMPT kernels. CONFIG_PREEMPT_NONE no longer gives any performance advantage in RT kernels. And soon, the kernel will have CONFIG_PREEMPT_AUTO, which takes the lessons learned in RT with LAZY_NEED_RESCHED and brings them upstream.
Working towards upstream first by Anel Orazgaliyeva
Anel works on EC2 Nitro at AWS, and her team owns various parts of the stack. The Nitro Hypervisor uses the Linux kernel and KVM. Explicit goals of the hypervisor kernel include staying secure and compliant, having the newest features, and delivering fast, all without regressions.
In order to achieve those, the first step is picking up stable updates quickly, and updating to new major kernel revisions. This means rebasing downstream patches quickly. There is a test and deploy phase that takes 1 to 4 weeks. But the rebase to a major version can take over a year with multiple engineers. Stable updates are usually much quicker.
The hard lesson learned here is that rebase is a hard, slow, and error-prone process that takes away engineering resources that could be spent on new features and debugging customer issues. Why does it take so long? The downstream patch count grew over the years, from 200 to over 1000. So Anel inherited this “very easy” patch reduction project. The goal is to halve it by 2025.
So Anel analyzed all the patches, and found they fell into multiple categories: workarounds for hardware or for bugs in non-updatable hardware components; patches that were submitted upstream but got rejected; patches initially developed downstream without understanding what it would take to upstream them; patches that were only meant to be “temporary”; and finally patches built on top of other downstream patches. A big chunk are backports, but those usually aren’t counted because they go away on the next rebase.
The first step has been to look at the patch list, and drop patches for unused features, re-squash disorganized patch series fixups, and move to features that have been upstreamed. The last one includes security mitigations for hardware bugs, where the fixes had to be developed in parallel to upstream solutions. Often the upstream solutions are done more gracefully. Sometimes it would include retiring known-buggy first-generation hardware from production, so that patches could be dropped.
Many of Anel’s colleagues are working on upstreaming memory persistence over kexec: patches were sent and discussions happened at LPC on features like Kernel HandOver (KHO) by Alex Graf, or Guest memfs by James Gowans.
This work includes splitting downstream patches that were doing everything at once to have a full featureset into smaller pieces.
Another one was upstreaming the Hyper-V VSM support; it had to be completely redesigned to be upstreamed, so that it’s less intrusive. This was discussed at this year’s KVM forum. In this case, the userspace API will change so apps have to be rewritten as well. But Anel says that all the lessons learned from the first implementation are very valuable, so it’s not clear if it could have been done upstream-first.
AWS engineers also sent patches for Secret hiding of VM guest data from the hypervisor (without confidential computing).
All downstream patches have been categorized, and multiple upstreaming projects are ongoing.
A success story is the merging of Xen-on-Nitro; and it’s going to be used as a template of how to keep upstreaming downstream patches, and new projects. Fully upstream-first might not work, because projects need to be shipped, but the goal is to keep the spirit of upstream-first.
That’s it for this morning! Continue reading this afternoon’s live blog.