This is the last live blog for this Kernel Recipes edition.
Hardware and its Concurrency Habits – Paul E. McKenney
Hardware has evolved a lot in the last 40 years: CPUs have more cache, deeper pipelines, are out of order, and mostly unpredictable. Paul showed the simple block diagram of the 386 ALU. With the help of a logic analyzer, it taught him concurrency.
At the time, instructions took multiple cycles. Pipelined and superscalar execution changed that. Where does this hardware complexity come from? The laws of physics first: atoms are too big, and light is too slow. Transistor size, which is bounded by the size of atoms, determines switching speed. Light travels the width of one A4 sheet per nanosecond. In silicon, this is 100 times less. As a result, data moves even slower.
This accounts for a part of the CPU complexity. But people want to write portable code as well, and this impacts performance. In addition, there are systems with a lot of CPUs.
In order to get the maximum performance, one needs perfect branch prediction. Branch misses are an obstacle. Memory references are another: anything not in cache loses hundreds of clock cycles. Atomic operations require locking cache lines and buses, although their cost has gone down recently. Memory barriers are another obstacle. More recently, thermal throttling has become another issue when the CPU is used efficiently enough.
Which obstacles to focus on? Paul wants to focus on cache misses, memory barriers and atomic operations.
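To make the memory barrier obstacle concrete, here is a minimal sketch in portable C11, not code from the talk. It publishes a value with a release store and reads it with an acquire load; the names `payload`, `ready`, `publish` and `try_consume` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: plain data guarded by an atomic flag. */
static int payload;
static atomic_bool ready; /* zero-initialized, so it starts out false */

/* Writer: store the data, then set the flag with release ordering so the
 * data store cannot be reordered after the flag store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: check the flag with acquire ordering; if it is set, the data
 * written before the matching release store is guaranteed to be visible. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

On weakly ordered CPUs such as arm64, the release and acquire operations are what compile down to ordering instructions; that ordering is the cost the talk refers to.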
Paul showed the time it takes to do a compare-and-swap on a Xeon CPU. Depending on the use case, it can take from 14 to 2000 cycles (across the cross-connect in multi-socket configurations). To get a sense of scale, Paul went into the audience with multiple rolls of toilet paper. A single sheet represented a CPU cycle; 2000 cycles would be at least 4 fully unrolled paper rolls.
Can hardware help? There is research to improve latency through tighter integration, with stacked transistors for example.
Hardware accelerators also show promise, depending on the use case. They have been helping for quite some time.
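For readers who have not used the operation being measured, this is a minimal sketch of a compare-and-swap retry loop in C11. It is an illustration, not Paul's benchmark, and the helper name `cas_increment` is hypothetical.

```c
#include <stdatomic.h>

/* Increment *c using a compare-and-swap loop and return the new value.
 * Each failed attempt means another CPU changed the value first, so the
 * cache line had to travel between CPUs before we could retry. */
long cas_increment(atomic_long *c)
{
    long old = atomic_load_explicit(c, memory_order_relaxed);

    /* On failure, 'old' is refreshed with the current value, so the loop
     * simply retries with up-to-date input. */
    while (!atomic_compare_exchange_weak_explicit(c, &old, old + 1,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
        ;
    return old + 1;
}
```

The loop body is trivial; the cost reported in the talk comes from where the cache line holding the counter currently lives, not from the instruction count.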
Memory hierarchies help. L3s have been growing a lot recently. Hardware is not afraid to throw transistors at the problem.
To summarize, Paul says that modern hardware is highly optimized, most of the time. Incremental improvements have compounded.
Gaining bounds-checking on trailing arrays in the upstream Linux kernel – Gustavo A. R. Silva
The work presented comes from multiple contributors, Gustavo says.
Arrays in the C language can be declared simply, but the boundaries aren’t enforced in the language. Trailing arrays are arrays found at the end of a structure. Flexible arrays are trailing arrays whose size isn’t known at compile-time, but determined at run-time.
There are two types of fake flexible arrays: they can be 1-element or 0-length arrays. Both are used as flexible arrays, but don’t use the “modern” C99 flexible array syntax.
1-element arrays are prone to off-by-one problems, and extra memory might be allocated if one is not careful. They trigger -Warray-bounds false positives.
0-length arrays are a GNU extension; they don’t contribute to the size of the struct, but they also trigger -Warray-bounds false positives. Another issue is that undefined behaviour might be introduced if someone adds a struct field after the 0-length array.
Another undefined behaviour can be triggered if the structure with a fake flexible array is embedded in another structure, and has fields after it. It should always be at the bottom; and even the embedding structure should be at the bottom if embedded in another structure. Luckily, there’s a new warning in development for GCC 14 to detect those, called -Wflex-array-member-not-at-end.
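The difference between the styles is easiest to see in the allocation-size arithmetic. The following is a minimal sketch with hypothetical struct and helper names; the 0-length variant is left out because it is a GNU extension and behaves like the C99 form for `sizeof`.

```c
#include <stddef.h>

/* C99 flexible array member: contributes nothing to sizeof. */
struct msg_flex {
    size_t count;
    int items[];
};

/* Fake flexible array, 1-element style: sizeof already includes one item. */
struct msg_one {
    size_t count;
    int items[1];
};

/* Bytes needed for n items with a real flexible array. */
size_t msg_flex_bytes(size_t n)
{
    return sizeof(struct msg_flex) + n * sizeof(int);
}

/* Bytes needed for n items (n >= 1) with a 1-element array: the element
 * already counted by sizeof must be subtracted. */
size_t msg_one_bytes(size_t n)
{
    return sizeof(struct msg_one) + (n - 1) * sizeof(int);
}

/* The easy mistake: forgetting the "- 1" over-allocates by one element. */
size_t msg_one_bytes_naive(size_t n)
{
    return sizeof(struct msg_one) + n * sizeof(int);
}
```

Depending on structure padding, even the careful 1-element computation may reserve more bytes than the flexible array version, which is one more reason to prefer the C99 form.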
After testing this warning, Gustavo found 650 unique occurrences of it in the kernel.
Another issue is that sizeof() of a (C99) flexible array member is a compile error, and that brought problems in the kernel. For the fortified memcpy, this meant that __builtin_object_size returned -1 for flexible arrays, fake or not, which is inconsistent with sizeof. It even returned -1 for any type of trailing array. This is for historical reasons: in struct sockaddr, the trailing 14-element array in fact behaved like a flexible array. And this broke bounds-checking for any type of trailing array.
There’s a new compiler option in GCC 13 and clang 16, -fstrict-flex-arrays=<n>, to control that behaviour. If n is 3, all types of trailing arrays gain bounds-checking.
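A small sketch of the situation described above, under stated assumptions: `struct sockaddr_like` is a hypothetical stand-in that mirrors the 14-element trailing array, and `trailing_array_size` is an illustrative helper. What the builtin reports depends on the compiler, its version, and whether -fstrict-flex-arrays=3 is passed, so no particular output is claimed.

```c
#include <stddef.h>

/* Hypothetical stand-in for a structure with a fixed-size trailing array. */
struct sockaddr_like {
    unsigned short family;
    char data[14];
};

/* Ask the compiler how many bytes it believes sa->data can hold.
 * Mode 1 means "closest enclosing subobject". When trailing arrays are
 * treated as flexible, the compiler may answer (size_t)-1, meaning
 * "unknown"; with strict flex arrays it should be able to answer 14. */
size_t trailing_array_size(struct sockaddr_like *sa)
{
    return __builtin_object_size(sa->data, 1);
}
```

A fortified memcpy can only reject an overflow when this answer is a real size, which is why an "unknown" answer for every trailing array defeated the bounds check.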
Starting with Linux 6.5, both fortified memcpy and -fstrict-flex-arrays=3 are enabled globally.
In order to properly sanity check, an attribute is being added in GCC 14 and clang 18 to point to the element count of the flexible array inside the structure. For bounds-checking, __builtin_dynamic_object_size replaced __builtin_object_size in the fortified memcpy; it’s able to use the hint from the attribute.
In order not to break UAPI, the whole struct was duplicated in a union, with a 1-element array for userspace and a flexible array for the kernel. There are now helpers to simplify this.
This work even ended up improving user-space.
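As an illustration of what such an annotation looks like, here is a minimal sketch. The summary above does not name the attribute; the sketch assumes it is spelled `counted_by`, as in recent compiler documentation, and falls back to nothing on compilers that lack it. `COUNTED_BY`, `struct packet`, `packet_alloc` and `packet_set` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumption: the attribute is spelled counted_by. Use it only when the
 * compiler reports support, otherwise expand to nothing. */
#if defined(__has_attribute)
#  if __has_attribute(counted_by)
#    define COUNTED_BY(member) __attribute__((counted_by(member)))
#  endif
#endif
#ifndef COUNTED_BY
#  define COUNTED_BY(member)
#endif

struct packet {
    size_t len;                              /* number of payload bytes */
    unsigned char payload[] COUNTED_BY(len); /* bounded by len */
};

/* Allocate a packet and set the count before the array is ever touched. */
struct packet *packet_alloc(size_t len)
{
    struct packet *p = calloc(1, sizeof(*p) + len);

    if (p)
        p->len = len;
    return p;
}

/* Manual version of the check the annotation lets tooling do for us. */
int packet_set(struct packet *p, size_t idx, unsigned char value)
{
    if (idx >= p->len)
        return -1;
    p->payload[idx] = value;
    return 0;
}
```

With the annotation in place, a compiler that supports it has enough information for __builtin_dynamic_object_size to compute the real size of `payload` at run-time.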
Getting the RK3588 SoC support upstream – Sebastian Reichel
The RK3588 is an SoC from Rockchip. Sebastian was asked to look at support for it last year; a few days later he had the evaluation board on his desk.
He had a look at the available source code from the vendor, and tried to create a minimal device tree from that. His goal was to only have the serial port and u-boot. He tried it, and got nothing. After adding interrupts, he was able to get some kernel error messages on the serial port.
Slowly extending the device tree step-by-step, he was able to try and fix the error messages: first missing interrupts, then CPU properties, then other devices. There were many third party devices. Most of them depended on clocks, so he started with that.
The goal was to boot a Debian userspace. First step was clocks and resets, then pinctrl, eMMC and normal console.
Clocks and resets are similar to previous generations. He needed to add support for lookup tables for different register offsets. Then he worked on the clock gates; the clock framework only supports one active parent, which didn’t match the hardware reality of linked clocks. The initial solution was to mark linked clocks as critical, wasting power but making the device work; this is being fixed now.
The V2.1 GPIO controller is very similar to the V2.0 version already upstream. eMMC needed two changes compared to the previous generation; and there even was a regression in 6.4 breaking the controller on the RK3588; this is now fixed.
Once Debian booted far enough, Sebastian started to upstream the incomplete Device Tree. While this was being reviewed, he continued working on adding more features.
Network and power management domains were next. They worked until a colleague received another Radxa board, where they did not, so he fixed that. But the next Rockchip board was done even more differently, using PCIe ethernet.
It needed GIC-ITS (Interrupt Translation Service) enabling, but just enabling it broke the boot completely. The RK3588 has a design flaw, making the cache non-coherent. The previous generation did not use the ITS either. Rockchip sent an erratum working around the issue, and Lorenzo Pieralisi started working on a generic solution, proposing a change to the ACPI standard.
The PMIC (Power Management IC) was new for the RK3588, with two different configurations using different chipsets. Having both work required fixing RK808 MFD subdrivers in a lot of subsystems.
Multiple hardware blocks were supported by simply adding a compatible string in the device tree, so not a lot of changes. For some devices, DT bindings were broken and had to be fixed. AV1 codec work started early.
Most of the basics are supported. Many people are working on HDMI Out, In, USB3, GPU, Crypto and DFI. DisplayPort, DSI, CSI, ISP, Video Codecs, SPDIF, CAN, RNG are still on the TODO list; the audience said that someone had a demo for DSI and CSI.
In u-boot, Ethernet support was needed for Kernel CI support. But it didn’t work in the downstream u-boot; fixing this was as simple as disabling an option.
Upstream u-boot support is being worked on by multiple people. It follows a multi-step plan. DFU support won’t be worked on by Sebastian, he said.
Evolving ftrace on arm64 – Mark Rutland
ftrace is a framework to attach tracers to kernel functions at runtime. It is used for tracing, but also for live-patching and fault-injection. It is used in production environments thanks to its minimal overhead. It needs architecture-specific code.
In order to hook a function entry or return point, ftrace needs architecture-specific magic. This “magic” is related to how functions are called. On arm64 this happens with bl, the branch-with-link instruction, which jumps to an address and stores the return address in lr, the link register. At the function entry point, lr might be stored on the stack, and restored at the end. The ret instruction then consumes the link register.
Another “magic” that ftrace relies on is a compiler feature called mcount. At the beginning of every function, it inserts a call to an _mcount function. The _mcount function is used as a trampoline to hook the entry point. But what about the return? To do this, hooking the frame record (the part where the link register is placed on the stack) is required: the return address is modified to point to a return tracer instead, which can then restore the link register value that was saved in the entry point hook.
But when tracing is not being used, the call to mcount shouldn’t stay in every function of the kernel. Another bit of magic is to replace the calls to mcount with nop instructions; this way the overhead is minimal.
That was all simple, until Pointer Authentication came into the picture. It’s a new arm64 CPU feature that protects against ROP/JOP attacks: at the beginning and end of every function, the compiler inserts two new instructions, paciasp and autiasp. And this breaks the mcount-based approach.
GCC 8+ added support for -fpatchable-function-entry=N, which adds nops at the beginning of every function, but before the paciasp pointer authentication instruction, so the link register isn’t signed yet. The net result of this new patchable-function-entry is also much better code generation, with fewer instructions.
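The same padding can be requested for a single function through a function attribute, which makes it easy to experiment with outside the kernel. This is a minimal sketch, assuming a compiler and target that support the `patchable_function_entry` attribute; the `PATCHABLE` macro and `add_one` are hypothetical names.

```c
/* Use the attribute only when the compiler reports support for it. */
#if defined(__has_attribute)
#  if __has_attribute(patchable_function_entry)
#    define PATCHABLE(total, before) \
        __attribute__((patchable_function_entry(total, before)))
#  endif
#endif
#ifndef PATCHABLE
#  define PATCHABLE(total, before)
#endif

/* Request two nops, none of them placed before the function's entry
 * label. The function behaves exactly as without the attribute; the nops
 * are only room for a tracer to patch in a branch later. */
PATCHABLE(2, 0)
int add_one(int x)
{
    return x + 1;
}
```

Disassembling the resulting object file (for example with objdump -d) should show the nop instructions at the start of `add_one` on targets that honour the attribute.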
The current approach only allows calling a single tracer; for tracing this might not be enough, so ftrace has common code to multiplex tracers. Some architectures can dynamically add multiple tracer trampolines to a callsite, but this isn’t feasible cheaply enough on arm64, Mark says.
In order to make that work cheaply enough, a creative solution had to be found. Before each function entry, 64 bytes are added for custom trampolines. Combining that with patchable function entry allows having per-callsite ops. The logic is simple enough to be maintainable and enable other features.
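The idea of per-callsite ops can be illustrated in plain C with function pointers. This is a userspace analogy under stated assumptions, not the arm64 implementation, which patches instructions; `tracer_ops`, `callsite`, `callsite_enter` and `count_hits` are all hypothetical names.

```c
#include <stddef.h>

/* Hypothetical description of one tracer: a callback plus private data. */
struct tracer_ops {
    void (*func)(const char *site, void *data);
    void *data;
};

/* Hypothetical callsite record. A NULL ops pointer plays the role of the
 * nop: nothing is attached, so almost nothing is executed. */
struct callsite {
    const char *name;
    const struct tracer_ops *ops;
};

/* Stand-in for the common entry trampoline: it finds the ops that belong
 * to this particular callsite and calls them directly, with no walk over
 * a global list of every registered tracer. */
void callsite_enter(const struct callsite *site)
{
    const struct tracer_ops *ops = site->ops;

    if (ops)
        ops->func(site->name, ops->data);
}

/* Example tracer: count how many times traced callsites were entered. */
void count_hits(const char *site, void *data)
{
    (void)site;
    ++*(unsigned int *)data;
}
```

Attaching a tracer to one callsite then only means storing a pointer next to that callsite, and other callsites are unaffected.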
That’s it for this edition of Kernel Recipes! See you next year!