Welcome to the Kernel Recipes 2024 live blog.
Anne is doing the usual introduction: what is Kernel Recipes, what is special about this edition? The theme is the Olympic Games, and this can be seen in this edition’s logo, the penguin mascot, and the upcoming charity auction!
This edition’s godfather is Arnaldo, who is giving the first talk. Before starting, Arnaldo remembers the life of the late Daniel Bristot de Oliveira, a fellow kernel hacker who passed away this year at a young age.
Assisted reorganization of data structures by Arnaldo Carvalho de Melo
Arnaldo maintains the Linux perf tools, used for performance evaluation. This includes analyzing data structures and their impact on kernel caches.
Why should programmers care about data structure layout? Isn’t that the compiler’s role? Changing the layout poses a few ABI (Application Binary Interface) issues: between applications and syscalls, with tracepoints, and between programs and their libraries.
A tool like pahole, which Arnaldo maintains, helps understand structure layouts.
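For illustration only (not an example from the talk), here is a made-up struct of the kind pahole flags, with the alignment holes annotated by hand for a typical 64-bit target:

```c
#include <stdint.h>

/* Hypothetical example: on a 64-bit target, pahole would point out
 * a 7-byte hole after 'flag' (the pointer needs 8-byte alignment)
 * and 4 bytes of tail padding, for a total of 24 bytes holding only
 * 13 bytes of actual data. */
struct example {
	char      flag;   /* offset 0, then a 7-byte hole */
	void     *ptr;    /* offset 8 */
	uint32_t  count;  /* offset 16, then 4 bytes of padding */
};
```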
In the kernel, non-exported structs can be changed freely: since only the kernel uses them, it poses no issue; unless one works for an enterprise distro, Arnaldo says, which tries to maintain a stable kABI.
Over the years, the kernel community has been moving things around (manually) in order to optimize for cache effects: grouping fields written by the same workloads, for example, reducing false sharing, and avoiding spreading hot data across too many cache lines with wasted space. He gave an example with patches by network developers like Eric Dumazet grouping fields and optimizing TCP data structures like tcp_sock: tx- and rx-related fields are grouped in their own categories, for example.
To prevent future changes from breaking these cacheline layouts, there are kernel macros like CACHELINE_ASSERT_* and __cacheline_group attributes that do the checking at compile time. Those changes can have a tremendous effect on high-speed networking performance, often a double-digit percentage improvement.
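A minimal sketch of how these annotations are used, assuming the __cacheline_group_begin()/__cacheline_group_end() markers and the CACHELINE_ASSERT_GROUP_MEMBER() macro from include/linux/cache.h; the struct, group names and fields below are invented, and the real tcp_sock groups are much larger:

```c
#include <linux/cache.h>
#include <linux/compiler.h>
#include <linux/types.h>

/* Hypothetical socket-like struct: fields written by the tx path and
 * fields read by the rx path are kept in separate cacheline groups. */
struct foo_sock {
	__cacheline_group_begin(foo_sock_write_tx);
	u32	snd_nxt;
	u32	packets_out;
	__cacheline_group_end(foo_sock_write_tx);

	__cacheline_group_begin(foo_sock_read_rx);
	u32	rcv_nxt;
	u32	copied_seq;
	__cacheline_group_end(foo_sock_read_rx);
};

/* Compile-time checks, so that a later patch cannot silently move a
 * field out of its group (the kernel does the same for tcp_sock). */
static void __maybe_unused foo_sock_struct_check(void)
{
	CACHELINE_ASSERT_GROUP_MEMBER(struct foo_sock, foo_sock_write_tx, snd_nxt);
	CACHELINE_ASSERT_GROUP_MEMBER(struct foo_sock, foo_sock_write_tx, packets_out);
	CACHELINE_ASSERT_GROUP_MEMBER(struct foo_sock, foo_sock_read_rx, rcv_nxt);
	CACHELINE_ASSERT_GROUP_MEMBER(struct foo_sock, foo_sock_read_rx, copied_seq);
}
```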
Is it enough to do this manually, though? Not necessarily. pahole --reorganize can re-pack a struct by moving fields around, while also respecting alignment rules. But the current algorithm is naive and has bitrotted, Arnaldo says.
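As a rough, hand-made illustration of the kind of repacking this aims for (not actual pahole output; types and field names are invented):

```c
#include <stdint.h>

/* Original layout: 24 bytes on a 64-bit target because of alignment holes. */
struct sample {
	uint8_t   a;   /* 1 byte, then a 3-byte hole */
	uint32_t  b;   /* 4 bytes */
	uint8_t   c;   /* 1 byte, then a 7-byte hole */
	uint64_t  d;   /* 8 bytes */
};

/* Repacked layout: the same fields sorted by decreasing alignment,
 * 16 bytes total with only 2 bytes of tail padding left. */
struct sample_reorganized {
	uint64_t  d;
	uint32_t  b;
	uint8_t   a;
	uint8_t   c;
};
```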
So a new approach was designed, this time based on real-life usage, with perf mem; it uses PEBS counters on Intel to sample memory accesses and analyze struct usage. It is combined with perf mem report to analyze what was recorded.
perf c2c is another tool which has record/report, but is cacheline oriented. The goal is to find data structures that are often evicted from the cache.
perf annotate is also being improved; from a performance standpoint it is now faster, since it no longer relies on objdump and uses capstone and libllvm instead (with a fallback to objdump). These improvements also help with memory and data structure analysis. perf report also understands data structures better, and can sort its output by struct type, for example. With perf mem report, the types can be combined with the cache level being accessed. With perf report, it is possible to show the cacheline access distribution next to the struct fields.
The next step, Arnaldo says, is to find sibling fields automatically based on the collected samples. Accesses of the same type (read or write) are analyzed, and the corresponding fields should be grouped into the same cacheline. The goal is to add a sort order for siblings to the report commands, and to have this data consumed by pahole; after reorganization by pahole, the code should be rebuildable.
The end goal, Arnaldo says, is to have the tools automatically do the work that was done by Eric Dumazet and his team: reorder struct fields, group them by cacheline, and verify the resulting layout at build time.
Answering a question from the audience, Arnaldo says pahole already supports Rust. But there are some limitations, due to namespaces and very long symbol names. clang can do LTO across C and Rust, and that confuses pahole for now, which Arnaldo would like to fix. perf can already demangle symbols for Rust (as well as many other languages).
Another person asks: why not get a full view of the cache access ordering, akin to what cachegrind does? Arnaldo says that while that is great for development (and should be done), the goal with perf is to look at code running in production, like ftrace does.
Interfacing Kernel C APIs from Rust by Andreas Hindborg
Andreas has worked on interfacing the block layer with Rust.
The general mood in the kernel community about Rust has changed; instead of saying “I don’t want Rust in my subsystem”, Andreas has heard “What should I do if people come with Rust in my subsystem”.
A common question that Andreas hears, and wants to answer, is “Why is the code so complicated?” or “Why does it take so many lines to do something I can do in a few lines of C?”.
Andreas starts by introducing the advantages of Rust around memory safety and productivity. Memory safety issues cost a lot of money. In Rust, there is a subset of the language that is entirely memory safe.
In the block subsystem, the blk-mq layer sends requests to drivers and their hardware queues. The interface between the two is the subject of Andreas’s work, in order to write drivers in Rust. The block layer sends requests as struct request; it also has a request cache to avoid too many allocations.
When implementing a driver, one has to implement a “vtable”, an ops struct with many callbacks; a simple example of that is the null_blk driver. In order to do this from Rust, the best way is to use a Trait, which describes the interface. If the ops structure were initialized just like in C, by writing raw pointers, it would be unsafe: a reference in Rust is always initialized, unlike an unsafe raw pointer. Another complication is that Rust values are movable by default, which we do not want to happen for these structures; this is handled with Pin.
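For reference, this is roughly what the C side of such a vtable looks like; the sketch below uses hypothetical names and implements only one callback, whereas a real driver like null_blk fills in many more blk_mq_ops hooks:

```c
#include <linux/blk-mq.h>

/* Hypothetical queue_rq callback: called by blk-mq for each request. */
static blk_status_t mydrv_queue_rq(struct blk_mq_hw_ctx *hctx,
				   const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);
	/* ... a real driver would hand the request to the hardware ... */
	blk_mq_end_request(rq, BLK_STS_OK);
	return BLK_STS_OK;
}

/* The "vtable": raw function pointers, with unused hooks left NULL.
 * This is what the Rust Operations trait describes safely. */
static const struct blk_mq_ops mydrv_mq_ops = {
	.queue_rq = mydrv_queue_rq,
};
```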
There is an abstraction in the kernel called PinInit which helps with this pinned initialization; it is about 1k lines of code plus 700 lines of documentation, a bit complex to understand but easy to use. Andreas showed an example implementing a wrapper to an ops structure with the pin_init! macro.
The goal is for the user implementing a block device driver to just implement the Operations Trait. The goal of the bindings is to provide a safe and easy-to-use interface for driver writers; the drivers should be written entirely in safe Rust.
(As an aside, the Rust null_blk driver used for this work has been merged in Linux 6.11, and Fedora is considering enabling it for its next release, F41.)
Making this safe interface for driver authors requires some amount of unsafe code. The wrapping layer does all the necessary checks around the unsafe code, and then calls into the safe driver code that implements the Operations trait. The wrapping code in the kernel initializes the ops struct used from the C side; it takes care of optional fields while instantiating the vtable. The code is a bit complex, but driver authors are shielded from this complexity.
Another challenge is blk_mq_end_request, a function that should only be called once per request. In C, the rule is simple: “don’t call it twice”; but in Rust, it must be impossible to call it twice.
One solution would be to use a reference counter to handle request ownership between the block layer and drivers. But that did not work, because iostat can access requests. So the solution was to store Rust private data in the request; it is then used for an atomic verification that there is only one Rust reference to a request. This makes ending the request fallible, and an error can be returned in case the code is buggy.
This wrapping with a check has a cost: around a 2% degradation in IOPS for null_blk workloads. But there is no need to have this check enabled in production, Andreas says, only in debug builds. Mathieu Desnoyers, in the audience, says that his Hazard Pointers with Reference Counters patchset, sent two days ago, could be a solution to this issue.
To conclude, Rust kernel abstractions can be somewhat complex, but users (driver authors) should be spared from this complexity.
To answer a question from the audience, Andreas says that compile times are generally not an issue for his kernel development needs.
How CERN serves 1EB of data via FUSE by Guilherme Amadio
Guilherme works at CERN, in the IT group responsible for data storage, which makes heavy use of FUSE.
CERN is responsible for the Large Hadron Collider (LHC), built around CERN on the border between France and Switzerland. Particles are accelerated inside the LHC, collisions are measured by detectors, and all the raw data is stored by computing systems.
Detectors can be thought of as “big 3D cameras”, performing measurements about 40 million times per second. Each experiment at the LHC has its own detector: ATLAS, CMS, ALICE and LHCb.
Particles interact in complex ways inside the detector. This creates a deluge of data, around 51 Tbit/s, which is challenging to store.
For the ALICE experiment, online processing with 250 nodes and 2000 GPUs reduces this to 280 GByte/s, which is more manageable. Once stored, the data is accessed by multiple systems, for example the tape backup system, or the jobs that process the data, which is the focus of the work of Guilherme’s team.
This job-processing system has 180PB of raw storage and 150PB of usable space, across around 12000 HDDs.
During the 2024 proton-proton run, the monitoring showed a peak transfer speed of 1.3TB/s, with an average of 480GB/s.
The system storing the data uses a mixture of disk generations: 6TB, 12TB, 14TB and 18TB drives are the most common. The low-capacity disks are not an issue but an advantage: they provide more bandwidth per stored byte, since HDD bandwidth does not increase linearly with capacity.
The FUSE traffic reached a peak of 292 GB/s. The rest is handled by XRootD, an open source project maintained by Guilherme for scalable cluster data access. It can be thought of as curl+nginx+varnish; it supports its own stateful root:// protocol, as well as HTTP. XRootD is not a file system, Guilherme says; while it works with file systems, it can do a bit more.
The cluster architecture is organized in 4 layers, from the root, through two supervisor layers, down to the leaf data servers. The goal is to have a unified namespace across the cluster. Plugins can be written; the Vera C. Rubin Observatory wrote a plugin to build a big MySQL cluster on top of XRootD.
EOS is the CERN storage cluster built on top of XRootD; it is a very efficient and cost-effective storage system. It has a FUSE interface, which is more familiar for users, even from their laptops; there is also an OwnCloud integration.
File storage servers are based on cheap JBODs, with a RAIN layout; there are around 150 storage nodes.
EOS supports throttling/QoS per user, group, or application, to prevent disruption of ongoing experiments, or between users. It handles file placement policies: layout, hardware to use (HDD or SSD), and geographical location.
The FUSE clients for EOS have evolved over the years, from being path-based to inode-based, and from using libfuse2 to libfuse3. On average, around 30k active clients are connected at any given time.
While they encountered a few issues over time, in general FUSE performance is not an issue for EOS.
At CERN, there is another user of FUSE, which is CernVM File System, a read-only filesystem built on top of HTTP, used for CDN and package distribution in HPC. It is built on top of Ceph by other people at CERN, Guilherme says.
That’s it for this morning! Continue to the afternoon live blog!