Welcome back to the live blog. You can also follow the live stream.
Demystifying the Linux kernel security process – Greg Kroah-Hartman
The upcoming EU regulation: the Cyber Resilience Act (CRA). The main idea behind this regulation is good, but the way it is currently written creates huge problems for open source.
Companies often ask to join the Linux security team, but it doesn’t work that way.
Linux is big, and everyone runs only about 5% to 10% of the code base. The current release model has been stable since 2004; the userspace ABI is stable, and won’t ever be broken on purpose. Version numbers no longer have any special meaning (the major number or its parity, for example).
The stable kernel rules haven’t changed: it’s Linus’s tree first. Longterm kernels are maintained for at least 2 years. Patching older kernels is more work. The kernel releases are considered stable, and people should not fear upgrades.
The world has changed, Greg says; 80% of the world’s servers run non-commercial distribution kernels. Android is so big that it dwarfs everything else in number of running Linux kernels. The “community” does not sign NDAs.
The kernel security team is reactive, not proactive; proactive work is left to other groups. It was started in 2005. By its own declaration, it is not a formal body and cannot enter NDAs. It uses the security@kernel.org alias. The individuals behind it don’t represent any company (not even their own); they mostly do triage, bringing in the responsible developers when a security issue is reported and working on getting a fix merged into Linus’s tree. If people are brought in enough times, they get added to the alias.
The goal of the team is to fix issues as soon as possible. Embargoes can’t be longer than 7 days, and no announcement is made for fixes; they aren’t even tagged specially.
Linux kernel developers don’t tag security issues specially because they don’t want to alert anyone, and can’t be confident whether a bug is or isn’t a security issue: any bug can have a security impact, so all fixes should be applied. Greg invites contributors to do the analysis themselves; many people have tried to triage patches into security and non-security fixes, and have failed: it’s too much work, and it’s easy to miss patches. Better to take them all than to miss a security fix that is lurking.
The kernel developers don’t know the details of every user’s use case, or what code is used in which context, which is why taking all fixes is the better idea: even a known security bug might have a very different impact depending on the context.
Hardware bugs are a bit of an exception to the kernel security policy. Greg says the hardware vendors aren’t moving fast enough, and embargoes might last 15 months. Embargoes have been tolerated, but such a long period is no longer tolerable, Greg says. The plan might be to only give 14 days to fix them.
When fixing issues, there is no early announcement, even to a limited audience. Greg considers all “early notice” lists to just be leaks, and that they should be considered public; otherwise, why would your government allow them to exist? Greg thinks the linux-distros list might not be allowed to exist much longer. He does not want to play this game.
There is at least one security fix that is known about, and probably even more that aren’t known. MITRE agrees that CVEs don’t work for the kernel.
If you are not using an up-to-date stable / longterm kernel, your system is insecure, Greg says.
Panic Attack: Finding some order in the panic chaos – Guilherme G. Piccoli
SteamOS, a Linux distribution, has an interest in a tool for collecting kernel panic logs.
The goal is to collect as much information as possible, while being careful with size. kdump is an existing tool to collect crash information at kexec time. pstore can be used to store dmesg or other information after a panic in one of its backends (RAM, UEFI, etc.), which survives reboots.
Both need a userspace component. kdumpst, for example, is packaged in Arch Linux and uses pstore for kdumps; it has been used by default on the Steam Deck, submitting logs to Valve.
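For illustration, a minimal userspace collector could simply copy whatever pstore exposes after the reboot. The sketch below assumes the usual /sys/fs/pstore mount point; it is not how kdumpst is actually implemented.

```c
/* Minimal sketch of a pstore log collector: dump every record found
 * under /sys/fs/pstore (the usual mount point of the pstore filesystem)
 * to stdout, so it can be archived or submitted somewhere. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *dir = opendir("/sys/fs/pstore");
	struct dirent *entry;
	char path[4096], buf[4096];

	if (!dir) {
		perror("opendir /sys/fs/pstore");
		return 1;
	}

	while ((entry = readdir(dir)) != NULL) {
		FILE *f;
		size_t n;

		if (entry->d_name[0] == '.')
			continue;

		snprintf(path, sizeof(path), "/sys/fs/pstore/%s", entry->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;

		printf("=== %s ===\n", entry->d_name);
		while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
			fwrite(buf, 1, n, stdout);
		fclose(f);
	}
	closedir(dir);
	return 0;
}
```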
An issue was that some information was missing, because panic_print() was called too late, after kdump. Guilherme proposed moving it earlier, so that the logs contain the appropriate information; but that causes other issues. From the audience, Thomas remarked that calling printk causes calls into serial drivers, and often framebuffer ones for the virtual console. It might even touch panic pointers.
Panic notifiers are a mechanism to call multiple handlers (with priorities) in the kernel when there is a panic. A proposal was to filter those notifiers, but doing that papers over the main issue. A better proposal was to split the panic notifier list into multiple lists, according to use case.
Currently (6.6-rc2), there are 47 notifiers in the kernel (18 in arch code). Before splitting into lists, analyzing those uses is necessary: decoupling multi-purpose ones, verifying priorities, fixing locks, etc. An example was given with pvpanic, which used a spin_lock; it was changed to a spin_trylock.
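To make the discussion concrete, here is a rough sketch of what a panic notifier looks like today: a callback registered on the single panic_notifier_list, with a priority, that must not sleep or block in panic context (hence the spin_trylock pattern mentioned for pvpanic). All the my_* names below are invented.

```c
/* Sketch of a panic notifier as they exist today (names are made up).
 * The callback runs in panic context, so it must not sleep and should
 * not spin forever on a lock someone else may hold: hence spin_trylock. */
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/panic_notifier.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);

static int my_panic_handler(struct notifier_block *nb,
			    unsigned long event, void *buf)
{
	/* If the lock is already held, give up instead of deadlocking. */
	if (!spin_trylock(&my_lock))
		return NOTIFY_DONE;

	/* ... write some state out to firmware, a device, etc. ... */

	spin_unlock(&my_lock);
	return NOTIFY_DONE;
}

static struct notifier_block my_panic_nb = {
	.notifier_call = my_panic_handler,
	.priority = INT_MAX, /* run early in the (single, for now) list */
};

static int __init my_init(void)
{
	atomic_notifier_chain_register(&panic_notifier_list, &my_panic_nb);
	return 0;
}

static void __exit my_exit(void)
{
	atomic_notifier_chain_unregister(&panic_notifier_list, &my_panic_nb);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```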
The plan is to split the panic notifiers into four lists (up from three in the initial proposal): hypervisor, informational, pre-reboot, and post-reboot. A further proposal was to turn panic_print() into a notifier itself.
An extensive discussion on the proposal exposed plenty of conflicting views. Consensus is at least to make the system as simple as possible. Many fixes will be integrated in the next iteration.
Guilherme gave a real-life example of an issue where a NIC would hang the system by creating an interrupt storm; running kdump didn’t work in the middle of that interrupt storm. A workaround was to clear all PCI MSIs at boot. Thomas added that, in general, the IOMMU configuration should always be cleared at boot.
There is ongoing work on graphically displaying panics, but none of the proposals are fully ready yet.
How not to submit a patchset – Frederic Weisbecker
Frederic said this was a tour of his failed patchsets, starting with the lazy RCU patchset. Read-Copy-Update (RCU) is a synchronization mechanism in the kernel that lets readers and writers work in parallel, using copies. An example use case is to free an object with an RCU callback once the old copy no longer has any users.
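As a rough illustration of that use case (struct foo, foo_lock and global_foo are invented names), a writer publishes a new copy with rcu_assign_pointer() and hands the old one to call_rcu(), which frees it only after a grace period, once no reader can still see it:

```c
/* Sketch of the usual "update by copy, free the old one later" RCU
 * pattern; the struct, lock and global pointer are made up. */
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
	int data;
	struct rcu_head rcu;
};

static struct foo __rcu *global_foo;
static DEFINE_SPINLOCK(foo_lock);

/* The RCU callback: runs after a grace period, once no reader can
 * still hold a reference to the old copy. */
static void foo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void foo_update(int new_data)
{
	struct foo *new_fp, *old_fp;

	new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
	if (!new_fp)
		return;
	new_fp->data = new_data;

	spin_lock(&foo_lock);
	old_fp = rcu_dereference_protected(global_foo,
					   lockdep_is_held(&foo_lock));
	rcu_assign_pointer(global_foo, new_fp);
	spin_unlock(&foo_lock);

	/* Readers may still be using the old copy; free it later. */
	if (old_fp)
		call_rcu(&old_fp->rcu, foo_free_rcu);
}
```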
The callbacks could be offloaded to other CPUs. This has CPU isolation and powersaving advantages. The issue is that it has poorer performance. But not all callbacks are performance sensitive. Some, like memory release, can wait and be batched together.
Frederic thought that lazy callbacks could benefit non-offloaded RCU too, bringing even more power savings. And it worked; until it was discovered that the performance measurements were buggy. So it didn’t work: it showed only minor improvements in rare cases (<1%), and made things worse otherwise. After 3 weeks of work and a v1 sent, Frederic dropped the idea. He still learned a lot about RCU internals in the process.
The big softirq lock is the next failed patchset. It starts with the bottom-half (softirq) code: a running vector blocks all the others on the same CPU, in a non-preemptible context. This is implemented with the “big hammer” local_bh_disable(), which disables all the softirq vectors. Frederic had the idea of softening this, with a variant of local_bh_disable() that only works on a subset of vectors. But the issue is with the vectors themselves. So in another proposal, Frederic proposed re-enabling softirqs from vectors that are known to be safe against other vectors. But this had other issues.
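For reference, the current “big hammer” looks like the sketch below: process-context code that shares data with a softirq handler disables every softirq vector on the local CPU, not just the one it cares about (my_counter is an invented per-CPU variable).

```c
/* Sketch of the existing "big hammer": process-context code protecting
 * data it shares with a softirq handler.  local_bh_disable() blocks
 * *all* softirq vectors on this CPU, not only the relevant one. */
#include <linux/bottom_half.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, my_counter);

static void bump_counter(void)
{
	local_bh_disable();		/* no softirq can run on this CPU */
	this_cpu_inc(my_counter);	/* safe against the softirq handler */
	local_bh_enable();		/* pending softirqs may run again */
}
```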
After 2 months of work, the design still needed to be justified. He still learned a lot about softirq and lockdep internals in the process, and triggered an RT debate.
The nohz_full cpusets interface is the next failure. Frederic says he has been working on nohz for close to 10 years. nohz_full stops the timer tick completely on a given CPU; what is done in the timer callback is moved to another CPU. This is a tradeoff for extreme workloads that need full use of a given CPU. The only way to enable nohz_full is at boot time, with the nohz_full= command-line switch. The plan was to add a way to enable it at runtime. After multiple years working on this, Frederic realized that there is no real user need for runtime enabling of nohz_full. So this work is now postponed permanently.
Why do those failed patchsets happen? Frederic asks. The first reason is that kernel code has become very complex (audience: and the hardware as well). Use cases have broadened a lot; subsystems have grown in stability.
Those failed patchsets are still a very good and efficient way to start a discussion on a mailing list: there is actual code to discuss, and a better approach can be found. Frederic says reviews should happen in “RW” mode: hacking on a subsystem while reviewing it.
7 years of cgroup v2: the future of Linux resource control – Chris Down
Chris is a systemd maintainer and works on the kernel at Meta. He started by saying that resource control is a critical thing at Meta. cgroup v2 was declared stable a few years ago and brought the unified hierarchy.
In cgroup v1, each resource type (blkio, memory, pids…) maintained its own hierarchy. By contrast, cgroup v2 has a unified hierarchy, and resource controllers are enabled on a given part of that hierarchy.
Chris says that without this unified hierarchy, it’s impossible to do resource control at scale. If memory starts to run out, this causes reclaim, and reclaim costs a lot more if the hierarchies are split. The distinction between reclaimable and unreclaimable memory is important, but not guaranteed. RSS has been the focus, while it provides a very unreliable (but easy to measure) view of process memory usage. Chris gave an example of a workload that was thought to use 150MB of memory but was in fact closer to 2GB, something that was found thanks to cgroup v2. In cgroup v2, all types of memory are tracked together.
When a cgroup v2 memory.max limit is reached, reclaim is triggered for the cgroup. A change in paradigm is to use memory.low, which is a different kind of limit: if a cgroup is below it, no reclaim will happen. Chris says that cgroup memory.current tells the truth, but the truth might sometimes be complicated: for example, applications will grow to the limit (with kernel caches, for example) if the system is not under pressure.
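As an illustration of how these knobs are used (the cgroup path and values below are hypothetical), setting them is just a matter of writing to the cgroup’s control files:

```c
/* Sketch: give a (hypothetical) cgroup a hard limit of 2G and a
 * protected floor of 512M.  Paths and values are illustrative; a real
 * setup would also enable the memory controller through
 * cgroup.subtree_control in the parent cgroup. */
#include <stdio.h>

static int write_file(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", value);
	fclose(f);
	return 0;
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/my-service";
	char path[256];

	/* Above this, the cgroup is reclaimed (and eventually OOM-killed). */
	snprintf(path, sizeof(path), "%s/memory.max", cg);
	write_file(path, "2G");

	/* Below this, the cgroup's memory is left alone by reclaim. */
	snprintf(path, sizeof(path), "%s/memory.low", cg);
	write_file(path, "512M");

	return 0;
}
```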
The issue is having wrong, unactionable metrics. cgroup v2 added PSI pressure metrics, like memory.pressure, to solve this problem and understand how to better tune applications.
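memory.pressure is a plain text file; a minimal sketch (again with a hypothetical cgroup path) that extracts the 10-second “some” average could look like this:

```c
/* Sketch: read the PSI memory pressure of a (hypothetical) cgroup and
 * print the 10-second "some" average, i.e. the share of time at least
 * one task was stalled waiting on memory. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/my-service/memory.pressure", "r");
	char line[256];
	float avg10;

	if (!f) {
		perror("memory.pressure");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "some avg10=%f", &avg10) == 1)
			printf("some memory pressure (10s avg): %.2f%%\n", avg10);
	}
	fclose(f);
	return 0;
}
```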
In order to understand exactly how much memory an application needs to run without loss of performance, Meta wrote a tool called senpai, which will reduce the memory.high limit of a repeatable task until it finds a stabilized minimum amount of memory.
The memory hierarchy is a way to offload memory to various types of storage, with various levels of latency. zswap (compressed in-RAM swap) can help a lot here, but the swapping strategy in the kernel was tuned for rotating disks, using swap only as a last resort, even with swappiness set to 100. A new swap algorithm was introduced in Linux 5.8. The effect was a decrease in heap memory, an increase in cache memory, and lower disk I/O activity. Meta calls this Transparent Memory Offloading.
Applying resource control to memory is not enough; I/O needs to be looked at as well, otherwise issues might compound.
Nowadays, cgroup v2 is well deployed on most container runtimes and distributions. This can be verified by booting with cgroup_no_v1=all on the kernel command line. Android also uses metrics from the PSI project.
That’s it for today! You can continue on to the last day!