Welcome back to the live blog. You can also read this morning’s live blog.

All Kernel Recipes speakers on stage.

Enhancing spatial safety: fixing thousands of flex array warnings by Gustavo A.R. Silva

In the C language, a flexible array member (FAM) is the last member of a struct, and it can expand the struct’s size depending on the number of elements it contains. It is therefore usually accompanied in the struct by a counter member. A rule of the C language is that structs containing flex arrays may not be members of another struct.
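To make the counter-plus-FAM pattern concrete, here is a minimal userspace sketch. The `struct samples` type and the `samples_alloc()` helper are made up for illustration; kernel code would size the allocation with overflow-checking helpers rather than the bare arithmetic used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical example type: a counter followed by a C99 flexible array member. */
struct samples {
    size_t count;   /* number of elements stored in values[] */
    int values[];   /* flexible array member: must be the last member */
};

/*
 * Allocate a struct samples with room for n elements.
 * Sketch only: the size computation is not checked for overflow.
 */
static struct samples *samples_alloc(size_t n)
{
    struct samples *s = malloc(sizeof(*s) + n * sizeof(s->values[0]));

    if (!s)
        return NULL;
    s->count = n;
    for (size_t i = 0; i < n; i++)
        s->values[i] = 0;
    return s;
}
```

The struct itself only accounts for the counter; the array elements live in the extra space requested at allocation time, which is why the counter is needed to know where the array ends.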

In GCC 14, the new -Wflex-array-member-not-at-end (famnae) warning was added by Qing Zhao. It warns about FAMs in the middle of a composite struct. The fact that structs with FAMs can be composed is an extension of the C language. With this extension, they can be at the end or in the middle; but putting them in the middle is being deprecated, because compilers do not handle that case consistently.

Before looking at FAMs in the middle, Gustavo worked on the flexible array transformation (see his previous talk at Kernel Recipes), and it took 5 years to fix all call sites in the kernel to use standard C99 syntax. Then, enabling memcpy fortification did not work, so compiler builtin helpers had to be fixed, because compilers did not want to break bad code.

In the next GCC version, the counted_by attribute will enable fortified memcpy by linking the counter member with the FAM. Annotating the kernel is in progress.
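As a rough illustration of what such an annotation looks like, here is a small sketch. The `COUNTED_BY()` wrapper, `struct buffer` and `buffer_alloc()` are hypothetical names; the wrapper expands to nothing on compilers without the attribute, so the code builds either way (the kernel has its own wrapper for this).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fall back gracefully on compilers without __has_attribute. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Hypothetical wrapper: only emit the attribute when the compiler knows it. */
#if __has_attribute(counted_by)
#define COUNTED_BY(member) __attribute__((counted_by(member)))
#else
#define COUNTED_BY(member)
#endif

/* Hypothetical example type: data[] holds exactly len bytes. */
struct buffer {
    size_t len;
    unsigned char data[] COUNTED_BY(len);
};

static struct buffer *buffer_alloc(size_t len)
{
    struct buffer *b = calloc(1, sizeof(*b) + len);

    if (!b)
        return NULL;
    b->len = len;   /* set the counter before touching data[] */
    return b;
}
```

With the annotation in place, the compiler has a way to know the array bound at run time, which is what bounds checking of copies into `data[]` needs.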

The new famnae warning triggered in over 60000 instances in the kernel. Luckily, only 650 of those were unique. Fixing even those was not trivial, but Gustavo was able to classify them into multiple categories.

The first case was a FAM that was not used at all. The fix is simple: remove it, and that’s it.

Another case was when the FAM of the sub structure was never accessed through the composite structure. To fix this, a possible approach is to split the structure in two, with a “header” struct (without the FAM) that is included in the composite struct. The issue is that this duplicates code, which is error prone, because one would have to maintain two independent structs with the same data. Another approach is to use struct_group_tagged() in the original structure, which prevents the duplication. struct_group_tagged() is a helper that creates a union in a parent struct, containing an anonymous and a named structure, both with the same members. The anonymous structure’s members can be accessed directly from the parent struct, so there is no need to update all the sites where members are accessed. The named and tagged struct can be used independently, and does not contain the FAM from the parent struct.
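The union trick can be sketched in userspace as follows. The macro below is a simplified stand-in for the kernel helper (no attribute support), and `struct packet`, `struct packet_hdr` and `struct request` are hypothetical types used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified sketch of the idea: the same member list is emitted twice,
 * once as an anonymous struct and once as a tagged, named struct.
 */
#define struct_group_tagged(TAG, NAME, ...)      \
    union {                                      \
        struct { __VA_ARGS__ };                  \
        struct TAG { __VA_ARGS__ } NAME;         \
    }

/* Hypothetical flexible structure: header members, then the FAM. */
struct packet {
    struct_group_tagged(packet_hdr, hdr,
        unsigned int type;
        unsigned int count;
    );
    unsigned char data[];
};

/* The composite embeds only the header, so no FAM ends up in the middle. */
struct request {
    struct packet_hdr pkt;
    int status;
};

static int header_is_aliased(void)
{
    struct packet p;
    struct request req;

    p.type = 1;        /* direct access through the anonymous struct */
    p.hdr.count = 2;   /* same storage, through the named header */
    req.pkt = p.hdr;   /* copy only the header into the composite */
    req.status = 0;
    return req.pkt.type == 1 && p.count == 2 && req.pkt.count == 2;
}
```

Existing code that uses `p.type` keeps compiling unchanged, while `struct packet_hdr` gives the composite struct a FAM-free type to embed.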

The last case is when there are implicit unions between FAMs and fixed-size arrays; in a packed composite struct, there is a FAM and a fixed-size array. The fixed-size array members can be accessed from the flex array. In the best case, Gustavo says, this is an implicit union. But he found cases where alignment rules created holes, and it did not work as intended. To fix this, Gustavo used __struct_group to create a header struct and remove the FAM from the embedded struct. But it was not sufficient in itself when code would access the FAM; so container_of tricks had to be added on top.
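A minimal sketch of the container_of step, under simplifying assumptions: the header struct is written by hand here instead of being generated by __struct_group, `struct msg` and `struct fixed_msg` are hypothetical, and the cast relies on the kind of aliasing the kernel permits in its build. The static assertion stands in for the check that alignment has not introduced a hole between the header and the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace equivalent of the kernel's container_of() (no type checking). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical header, shared by the flexible and the fixed-size layouts. */
struct msg_hdr {
    uint16_t len;
};

/* Flexible view: header followed by a FAM. */
struct msg {
    struct msg_hdr hdr;
    uint8_t payload[];
};

/* Composite: embeds only the header, then a fixed-size array. */
struct fixed_msg {
    struct msg_hdr hdr;
    uint8_t payload[8];
};

/* The trick is only valid if both layouts put the array at the same offset. */
_Static_assert(offsetof(struct msg, payload) ==
               offsetof(struct fixed_msg, payload),
               "hole between header and payload");

/* Recover the flexible view from a header pointer to reach the FAM. */
static uint8_t *msg_payload(struct msg_hdr *hdr)
{
    return container_of(hdr, struct msg, hdr)->payload;
}
```

Code that used to reach the FAM through the embedded struct can call something like `msg_payload(&fm.hdr)` instead, and ends up pointing at the fixed-size array of the composite.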

Answering questions from the audience about the proposed solution feeling a bit hacky, Gustavo reminded them that the ideal solution is to move the FAM to the end; but he also wants a general solution for the rest of the kernel. Some maintainers might choose to refactor the code instead, and that’s the best solution, Gustavo says, but it’s not always possible.

Another case is also an implicit union, between the FAM and an array of the same element type, but on the stack. There is a helper for that: DEFINE_RAW_FLEX(), which is designed to avoid using heap space when the size of the FAM is known at compile time. DEFINE_FLEX() can also be used to initialize the counter.
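The on-stack idea can be sketched like this. `STACK_FLEX()` is a hypothetical macro, only loosely modelled on the kernel helpers, and `struct report` is an example type: a union reserves enough bytes for the struct plus a compile-time number of elements, and a pointer of the struct type is declared on top of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flexible structure. */
struct report {
    size_t count;
    int values[];
};

/*
 * Hypothetical macro: reserve zeroed stack space for `type` plus `n`
 * elements of its flexible array `member`, and declare `name` as a
 * pointer to it. `n` must be a compile-time constant.
 */
#define STACK_FLEX(type, name, member, n)                               \
    union {                                                             \
        unsigned char bytes[sizeof(type) +                              \
                            (n) * sizeof(((type *)0)->member[0])];      \
        type obj;                                                       \
    } name##_u = { 0 };                                                 \
    type *name = &name##_u.obj

static int sum_three(void)
{
    int sum = 0;

    STACK_FLEX(struct report, r, values, 3);

    r->count = 3;
    for (size_t i = 0; i < r->count; i++)
        r->values[i] = (int)(i + 1);
    for (size_t i = 0; i < r->count; i++)
        sum += r->values[i];
    return sum;
}
```

The explicit union replaces the implicit one formed by a flexible struct followed by a same-typed array on the stack, so the compiler sees a single object of known size.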

In conclusion, there are multiple solutions depending on the case. At the moment, the number of warnings in upstream is down to ~300 from ~650. Those fixes were done manually, Gustavo says. Someone asked if Coccinelle was used, and Gustavo says it might be useful before the compiler warnings are developed, but for the actual fixes, reading and understanding the code is necessary, which is why everything is done manually.

From left to right: acme, this edition's godfather, Anne, Kernel Recipes founder, and Paul McKenney, next year's godfather.

Paul E. McKenney is going to be the next godfather of Kernel Recipes.

Scheduling with superpowers: Using sched_ext to get big perf gains by David Vernet

Last year, David presented sched_ext at Kernel Recipes, and he recommends re-watching the presentation (and we recommend reading the live blog for it!).

Schedulers multiplex threads to CPUs. While the concept is simple, things get complicated very quickly, David says, with multiple angles to take into account. sched_ext allows writing schedulers in eBPF, for faster iteration, without reboot, and safely.

At Meta, David’s employer, sched_ext has been running on millions of hosts, bringing from 2.5% to 10% performance improvement depending on the use case.

Offloading complexity in userspace with sched_ext is also a big advantage.

When implementing the sched_ext interface, there is “just a set of callbacks” to implement, as the built-in scheduler does.

Since last year, a lot of kernel-side improvements have landed in sched_ext: cpufreq integration, dispatch queue iterators (and consumers), dispatch to remote CPUs. Many schedulers have been written since last year: scx_rusty, scx_lavd for gaming, and scx_rustland.

For Linux gaming, sched_ext is also very promising, because the workloads need a lot of interactivity, and are very cyclic. David showed an example with the game Terraria, and then its performance profile: we can see the cycles for each frame very clearly. Using perfetto, a web interface which can be used to view scheduler traces, David analyzed what happens during a frame. The takeaways are that the workloads are predictable (periodic), but still have lots of context switches and pipelines from the game, to the compositor, to Xwayland, etc.

David Vernet

But when the system is overcommitted (other running tasks), what happens? The game deadlines for frames stay the same. David showed an example with a parallel stress-ng running many threads: the game becomes very stuttery, when using the EEVDF scheduler. 60fps is not happening.

EEVDF means Earliest Eligible Virtual Deadline First; it’s the fairness algorithm used in Linux since 6.6. It is implemented using something called vruntime in Linux, which is a weighted apportioning of CPU time. It is OK in general, but has shortcomings, David says.
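To give an intuition of how vruntime apportions CPU time by weight, here is a toy userspace model. It is not the kernel's implementation: it always picks the task with the smallest vruntime and ignores the eligibility and deadline parts of EEVDF. All names (`struct task`, `account()`, `simulate()`, `NICE_0_WEIGHT`) are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Reference weight: a task with this weight sees vruntime advance at wall speed. */
#define NICE_0_WEIGHT 1024u

struct task {
    uint64_t vruntime;   /* weighted virtual runtime, in ns */
    uint64_t runtime;    /* real CPU time received, in ns */
    uint32_t weight;     /* higher weight means a larger CPU share */
};

/* Charge delta_ns of real time: heavier tasks accumulate vruntime more slowly. */
static void account(struct task *t, uint64_t delta_ns)
{
    t->runtime += delta_ns;
    t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
}

/* Toy policy: run whichever task has received the least virtual time so far. */
static struct task *pick_min_vruntime(struct task *tasks, int n)
{
    struct task *best = &tasks[0];

    for (int i = 1; i < n; i++)
        if (tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}

/* Run `slices` time slices of slice_ns each on a single simulated CPU. */
static void simulate(struct task *tasks, int n, int slices, uint64_t slice_ns)
{
    for (int i = 0; i < slices; i++)
        account(pick_min_vruntime(tasks, n), slice_ns);
}
```

With two tasks of weight 1024 and 2048, the second one ends up with twice the CPU time of the first, which is the "weighted" part; nothing in this model knows about frame deadlines, which is where a game under load suffers.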

How to do better? By building a better deadline-based scheduler, with no user input necessary. Changwoo from Igalia (scx_rusty) came up with a way to do that using runtime statistics. This helps to speed up work chains, which happen in pipelined workloads. David showed a demo of the game, making it stutter with stress-ng, dynamically switching to sched_ext with scx_rusty, and the stuttering went away; after stopping scx_rusty, it appeared again.

Even in this case, looking at the scheduling traces, it’s not perfect and can still be improved, David says.

David has a few ideas to improve this. One is to use cooperative scheduling, with userspace giving its QoS needs to the scheduler. Another would be to group work chains in cgroups.

sched_ext was merged in Linux 6.12. And Kernel Recipes helped, David says, thanks to the conversations that happened last year.

PREEMPT_RT over the years by Sebastian A. Siewior

Sebastian has been working on PREEMPT_RT for a few years.

cyclictest is a program that is used to measure the wake-up latency. Sebastian showed the same workload running for 8 hours, with a cyclictest measurement. For an upstream 6.11 kernel, the latency distribution can be very wide, even going over 2ms a few times. With PREEMPT_RT, the latency never goes above 40us.
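The measurement principle can be sketched in a few lines: sleep until an absolute deadline, then see how late the wake-up actually was. This is only a simplified illustration of the idea, not cyclictest itself, which also takes care of things like real-time priority and memory locking; the function names are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

static int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/*
 * Sleep until absolute deadlines spaced interval_ns apart (interval_ns
 * below one second) and return the worst observed wake-up latency in
 * nanoseconds, or -1 on error.
 */
static int64_t max_wakeup_latency_ns(int loops, int64_t interval_ns)
{
    struct timespec next, now;
    int64_t max = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &next))
        return -1;

    for (int i = 0; i < loops; i++) {
        /* Compute the next absolute deadline. */
        next.tv_nsec += interval_ns;
        while (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }

        if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL))
            return -1;
        if (clock_gettime(CLOCK_MONOTONIC, &now))
            return -1;

        /* Latency: how long after the deadline we were actually running. */
        int64_t latency = ts_to_ns(&now) - ts_to_ns(&next);
        if (latency > max)
            max = latency;
    }
    return max;
}
```

Calling, say, `max_wakeup_latency_ns(1000, 1000000)` gives the worst wake-up delay over a thousand 1ms periods; the value depends entirely on the kernel, the load and the scheduling policy of the calling thread.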

There were many requirements to having PREEMPT_RT in the kernel. In 2.6.0, none of those were present. In 2004, the Linux 2.6 Real Time project was announced against v2.6.9-rc3; and the debate began.

One of the requirements was lockdep, merged as early as 2.6.18. The modular scheduler core and CFS were merged in v2.6.23. High resolution timers (hrtimer) were also needed, and were merged in v2.6.16.

The generic IRQ infrastructure (genirq) was merged in v2.6.18. Clockevents making use of highres timers arrived in 2.6.21.

Preemptible RCU first appeared in 2005, but was merged later, in v2.6.23.

Steven’s ftrace infrastructure was merged in v2.6.27. It caused issues with the e1000e driver erasing its firmware; the code in RT was good, Sebastian said, but safeties were removed when merged to mainline.

Sebastian A. Siewior

Threaded interrupts were merged in 2009 in 2.6.30, and the raw_spinlock in v2.6.33. In 4.6, the CPU hotplug rework was merged, and mysteriously fixed some bugs.

Pagefault disable decoupling from preempt_disable() was merged in v4.8, and it was extracted from RT by mm maintainers. Other features followed, like seqcount_t.

migrate_disable was merged in 5.11 to fix a deadlock during CPU hotplug. The printk ringbuffer was merged in 5.10.

Who is using PREEMPT_RT? Sebastian asked the room, and people gave a few of their use cases as examples. Sebastian then showed industrial examples from Keba, an injection molding company, with 150us latency constraints on some models. Engel Victory makes machines that make Lego bricks. Durr robots can do industrial car painting with Linux RT. Trumpf builds laser control systems to do welding with PREEMPT_RT; they communicate over the network with 2ms deadlines. L-Acoustics’ L-ISA is used for live spatial audio in concerts. Ellips is doing potato sorting in real time from pictures taken on a sorting lane; this uses machine learning on Nvidia GPUs to analyze all the data received from 10G Ethernet via XDP; in case of a missed deadline, a rotten potato could make an entire warehouse go bad.

On Tuesday last week, printk was merged, then Thomas gave a physical pull request at a ceremony. But it’s not over, Sebastian says, there are still a few things in the pipeline. Arm32 and PowerPC are still out of tree, for example.

Consoles, printk, nested-NMI!? oh my! by Derek Barbosa

Derek wants to share a bug hunting story he encountered in kernel-rt, through his lens.

In 2023, a partner found a bug and opened a ticket, bisected to a given commit. In this crash, kdump did not start, so no core or dmesg logs were saved. An NMI is a Non Maskable Interrupt.

In order to reproduce the issue, the acpi_ipmi driver had to be used in conjunction with ipmitool in order to trigger an NMI. It used a printk spammer module as well. The reproducer was a simple script combining all of those. It takes about 30s to reproduce, as Derek showed in a video of the crash. The hardware lockup detector is triggered when the kernel hangs. Eventually, the kernel reboots, but with no logs or core.

The printk spamming prints a lot of messages (4000 per second), but it’s intentionally exaggerated. On NMI, the panic code should take over, and Derek showed how it’s implemented.

In the panic handler, printk code is called to flush all messages, which calls into console code, which needs to synchronize too.

There are a lot of messages to flush, and it can be quite slow, but it works. Except that, when handling an NMI, the flushing is still ongoing during the panic, and it triggers the hardware lockup detector. In this case, a deadlock occurs, because the panic handler still calls printk, which wants to flush the consoles.

But on x86, a new NMI should never be delivered while a previous NMI has not completed, unless the iret instruction is used to return from an interrupt or exception in the meantime. Page faults do use iret, which is why the implementation of kdump does no allocation, for example. A solution for that is to set a special variable at the beginning of the first NMI; when the next one happens, this variable is checked and the code jumps to nested NMI handling instead.
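The special-variable idea can be modelled in plain C. This is a toy simulation, not the real entry code (which is architecture-specific assembly): the struct, its field names, and the way a nested NMI is "injected" by a recursive call are all assumptions made for the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the state shared between NMI entries. */
struct nmi_state {
    bool in_progress;   /* the special variable, set when the first NMI starts */
    bool repeat;        /* set by a nested NMI to request another pass */
    int handled;        /* how many times the handler body ran */
};

/*
 * Simulated NMI entry point. `nested_to_inject` is how many extra NMIs
 * this simulation delivers while the handler body is running.
 */
static void nmi_entry(struct nmi_state *s, int nested_to_inject)
{
    if (s->in_progress) {
        /* Nested NMI: do not re-enter the handler, just ask for a rerun. */
        s->repeat = true;
        return;
    }

    s->in_progress = true;
    do {
        s->repeat = false;
        s->handled++;   /* stands in for the real handler work */

        /* Simulate an NMI arriving in the middle of the handler. */
        if (nested_to_inject > 0) {
            nested_to_inject--;
            nmi_entry(s, 0);
        }
    } while (s->repeat);
    s->in_progress = false;
}
```

In this model, the nested NMI never runs the handler on top of the first one; it only leaves a note, and the first NMI runs the handler once more before returning.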

What was happening is that kexec did work, but in the middle of the latched NMI. The next kernel then triggered an NMI on page faults, before its NMI handler was set up, so it used the previous NMI vector. Derek fixed this by disabling NMIs during early boot.

But the printk deadlock was still present. So Derek dove again into its implementation. He recommends watching John Ogness’ 2019 LPC talk “Printk: Why is it so Complicated?” for context. Luckily, the latest (6.6) linux-rt kernel had patches to rework the way printk interacted with NMIs. So the next step was to backport those, which took a long time, but finally fixed the issue.

To conclude, Derek says “don’t make assumptions”, as this will just slow you down, and “read the source code” to understand what it does.

That’s it for the 11th edition of Kernel Recipes! See you in 2025!