Welcome back, this is continued from the morning liveblog.
Filesystems & Memory Reclaim — Matthew Wilcox
The main subject of this talk is not just filesystems & memory, but essentially the challenges of working across subsystems: when subjects collide, experts on either side don’t necessarily have the best idea of how to fix problems at the boundary between filesystems and memory management.
Memory reclaim is the process by which the Linux mm claims back memory that was allocated. The exception to this is caches: those don’t free memory, but hold on to it. So they must implement “shrinkers”, callbacks that free memory on demand. In some parts of the kernel those are complicated to implement, in filesystems for example.
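As a reminder of what that looks like, here is a rough, simplified sketch of a shrinker as a cache might register one; the my_cache_*() helpers are hypothetical, and the registration API has changed over kernel versions (register_shrinker() vs the newer shrinker_alloc()/shrinker_register()):

```c
#include <linux/shrinker.h>

/* Rough sketch of a shrinker: the pair of callbacks a cache registers so
 * that memory reclaim can ask it how much it could free, then ask it to
 * actually free some of it. my_cache_*() helpers are hypothetical. */
static unsigned long my_cache_count(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	return my_cache_nr_objects();		/* how much could we free? */
}

static unsigned long my_cache_scan(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	return my_cache_evict(sc->nr_to_scan);	/* free up to nr_to_scan objects */
}

static struct shrinker my_cache_shrinker = {
	.count_objects	= my_cache_count,
	.scan_objects	= my_cache_scan,
	.seeks		= DEFAULT_SEEKS,
};
```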
Filesystems are complicated things to write. Matthew gave a simple example of an “already solved” problem: an allocation done in a filesystem while holding a lock can trigger reclaim, which ends up calling a shrinker in the same filesystem that takes the same lock. This can be caught by lockdep.
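The usual fix for that particular trap is to tell the allocator not to recurse into filesystem reclaim while the problematic lock is held, either by passing GFP_NOFS or by using the scoped API. A minimal sketch, with a hypothetical lock and object:

```c
#include <linux/sched/mm.h>
#include <linux/slab.h>
#include <linux/mutex.h>

/* Minimal sketch: mark a section so that any allocation inside it behaves
 * as GFP_NOFS, i.e. reclaim will not call back into filesystem shrinkers
 * while we hold a lock those shrinkers also take. The lock and object
 * types are hypothetical. */
static struct my_obj *alloc_under_fs_lock(struct my_fs_info *fsi)
{
	struct my_obj *obj;
	unsigned int nofs_flags;

	mutex_lock(&fsi->lock);
	nofs_flags = memalloc_nofs_save();
	obj = kmalloc(sizeof(*obj), GFP_KERNEL);	/* implicitly GFP_NOFS here */
	memalloc_nofs_restore(nofs_flags);
	mutex_unlock(&fsi->lock);

	return obj;
}
```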
Recently, three bug reports came in in quick succession against both fat and ext4, involving direct and background reclaim. Matthew goes over one of those stack traces, from bottom to top. It’s triggered by a WARN_ON, so the reporters thought it might have a security impact. Matthew did not think so, but it still needs fixing, he says. The gist of the issue is that the filesystems needed a data structure during reclaim that wasn’t in memory, so they needed to allocate, during a reclaim that was itself triggered by an allocation.
Matthew has been exploring various solutions to fix this type of issue: using mempools, pinning memory in advance, only allowing reclaim in the background, allowing inode eviction to fail. None of those solutions work.
XFS does not have this problem: it has different inode lifetime rules from other filesystems. One solution could be to convert the VFS to use the same rules as XFS, but “he’s not old enough to do that” Matthew says.
Matthew is proposing to only evict clean inodes in PF_MEMALLOC, but he hasn’t asked anyone about this idea yet.
Even though there have been disagreements between mm and fs developers over the years, they usually lead to improvements. For example, writeback of dirty file pages has been removed from the reclaim paths. Another example: kmalloc should return 512-byte-aligned memory (for allocations of 512 bytes and more) in order to help with I/O performance; slab wasn’t willing to guarantee that because it would break red zones, but an agreement was reached to only do it for allocations of 512 bytes and above. In the audience, Vlastimil says that power-of-2 sizes are now aligned to the same power of 2, and that red zones weren’t the blocker; rather, it was the slub allocator that couldn’t provide the same guarantee.
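To illustrate what that guarantee means for callers, here is a tiny sketch relying on power-of-2 kmalloc sizes being naturally aligned (illustrative only, not code from the talk):

```c
#include <linux/slab.h>

/* Illustrative only: power-of-2 kmalloc sizes are naturally aligned to
 * their size, so a 512-byte allocation is 512-byte aligned and can be
 * handed to I/O paths that expect sector alignment. */
static void *alloc_sector_buffer(void)
{
	void *buf = kmalloc(512, GFP_KERNEL);

	WARN_ON(buf && !IS_ALIGNED((unsigned long)buf, 512));
	return buf;
}
```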
Stack Tracing, Simplified: the SFrame Story — Indu Bhagat
SFrame was first released in 2023. SFrame can be thought of as “generalized ORC for userspace”. When compiling, the compiler can skip emitting frame pointers for performance reasons, but this leads to unreliable stack traces: 5-7% of them are missing frames. EH Frame is a powerful but complex way to get full stacks, but its implementation is bulky and not suited to constrained environments. That’s where SFrame comes in: it bridges the gap by bringing the reliability of EH Frame while reducing the complexity of the code.
SFrame is different from EH Frame in multiple ways: EH Frame is for unwinding, SFrame for stack tracing. SFrame does not encode the information in DWARF opcodes, but encodes stack offsets directly, in a compact way. And SFrame only tracks a few entities, not all registers like EH Frame does.
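To give an idea of the kind of information that means in practice, here is a deliberately simplified, conceptual sketch; this is not the actual on-disk encoding, which is packed much more compactly:

```c
#include <stdint.h>

/* Conceptual view of a Frame Row Entry: for a range of instructions, how
 * to compute the CFA, and where the return address and frame pointer are
 * saved relative to it. NOT the real SFrame layout, just the idea. */
struct sframe_fre_concept {
	uint32_t start_ip_offset;	/* where this row starts within the function */
	int32_t  cfa_offset;		/* CFA = SP (or FP) + this offset */
	int32_t  ra_offset;		/* return address saved at CFA + this offset */
	int32_t  fp_offset;		/* frame pointer saved at CFA + this offset */
};
```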
SFrame is supported in the GNU assembler. The current format of the ELF section .sframe is at v2. Its goal is to only encode the minimal information required for stack tracing. In the upcoming GNU Binutils (2.46), SFrame version 3 will add support for bigger text sections (> 2GiB).
There are many pieces to SFrame: toolchain support (done in the GNU tools, upcoming in LLVM), a kernel stack tracer, and a userspace stack tracer. In the kernel, livepatching on arm64 is evaluating SFrame thanks to the reliable stacks it provides.
In binutils 2.45, support for s390x was added. It posed a challenge because the s390x architecture has a “flexible” ABI, which conflicted with what SFrame required. It was fixed in binutils 2.45, thanks to SFrame encoding FRE blobs in an ABI-specific fashion. Indu recommends making space in the ABI for stack tracing if one wants it to be performant; the fine print of the s390x ABI specification was updated for this.
In the future, distro-wide builds with SFrame support will be done, thanks to a new binutils option that should help with that.
Userspace stack tracing in the Linux kernel is being worked on. It would allow deferring stack tracing in NMI context for example, which perf can then use.
SFrame V3 will address robustness issues, including big text sections greater than 2 GiB. Another challenge is to mark the outermost frames explicitly, to help stack tracers disambiguate between incomplete and complete stack traces.
Access to Frame Row Entries can be unaligned, leading to performance issues; it was decided not to fix this at the moment. Re-aligning the structure was also evaluated, trading compactness for performance, but it was found not to be worth it.
To summarize, SFrame is the “Simple Frame” stack trace format, with a few iterations to be able to have fast, reliable stack tracing, and implementations in multiple projects.
In answer to a question from the audience, Steven Rostedt said that SFrame might also replace ORC in the kernel for x86 since it’s built into toolchains.
Da Xue from Libre Computer is offering attendees a little present <3
nolibc: a userspace libc in the kernel tree — Thomas Weißschuh and Willy Tarreau
nolibc started as a really tiny C library, and is now maintained as part of the kernel tree. Willy started out using diet libc, but found it did not support arm64, so he started nolibc for that. Paul E. McKenney at some point wanted a small libc, and Willy proposed integrating nolibc into the kernel source tree. It was merged in Linux 5.0. Thomas contributed many fixes and became a co-maintainer.
Today nolibc is around 5000 lines of code, and has seen 22 contributors.
To recap, nolibc is a header-only C standard library: the whole libc is rebuilt every time the application is rebuilt. It is dual-licensed LGPL-2.1/MIT. The main goal is a small binary size, letting the compiler do the optimizations, which it can do easily since all the objects are in the same compilation unit, generating static binaries. It intends to stay close to a “real” libc in case one wants to switch to another. Multiple architectures are supported in the same codebase. The library is made to be easily vendored.
Thomas says nolibc is tested with GCC and clang. It can work with a toolchain that ships no libc at all, like the ones Arnd Bergmann’s kernel.org crosstools provide. Kernel UAPI headers for the correct architecture are used directly from the kernel sources. The applications should be simple enough.
An example hello world is shown: the compiler should be given the -nostdinc -nostdlib -static options, with the include paths specified.
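As a rough idea of what such a program and its build invocation look like (the flags and paths below are illustrative, not the exact ones from the slide; $KSRC stands in for a kernel source tree):

```c
/* hello.c — built against nolibc rather than a system libc.
 * Illustrative build command (paths are placeholders):
 *   gcc -static -nostdlib -nostdinc -I$KSRC/usr/include \
 *       -include $KSRC/tools/include/nolibc/nolibc.h \
 *       -o hello hello.c -lgcc
 * nolibc.h is force-included, so no #include is needed here. */
int main(void)
{
	printf("hello from nolibc\n");
	return 0;
}
```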
It supports many APIs around low-level I/O and high-level POSIX file streams. Adding support for a new architecture requires ~200 lines of code, and 10 major architecture families are supported (more with the variants). In order to add a new architecture, one needs to implement macros for the syscalls and the entry point _start. Syscall wrappers then call the architecture-specific macros.
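A rough sketch of how those layers fit together; this is simplified, the real code in tools/include/nolibc follows this shape but differs in details:

```c
/* Per-architecture headers provide my_syscallN() macros that load the
 * syscall number and arguments into the right registers and issue the
 * trap; generic code then builds libc-style wrappers on top. */
static __attribute__((unused))
ssize_t sys_write(int fd, const void *buf, size_t count)
{
	return my_syscall3(__NR_write, fd, buf, count);
}

static __attribute__((unused))
ssize_t write(int fd, const void *buf, size_t count)
{
	ssize_t ret = sys_write(fd, buf, count);

	if (ret < 0) {
		SET_ERRNO(-ret);	/* nolibc's errno helper */
		ret = -1;
	}
	return ret;
}
```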
In-tree, there are multiple users: rcutorture, the riscv/arm64/vDSO selftests, the Kexec HandOver (KHO) tests. It works with kselftest.h, so all self-contained kselftests could be migrated; but kselftest_harness.h might require libgcc.
Of course, nolibc has many limitations: it does not support pthreads or TLS, longjmp, signals (yet), or networking, and it is not y2038-safe on all architectures. Syscall wrappers are not complete, but patches are welcome if there is a need. The code is generally optimized for size over performance, and libraries are not supported. Some Linux architectures are not supported yet either.
Thomas showed how small the resulting binaries can be compared to glibc and musl, and it can be a sizeable advantage.
nolibc has a test suite that is run on different architectures with a QEMU runner. It tests validity against the system libc.
Using nolibc, Thomas wants to integrate kselftests into kunit, to be able to run both in the same unified pipeline: he showed a demo of this work. It’s not mainline yet; LWN has written about it.
How to tame a Panthor — Boris Brezillon
Boris has been contributing to Panfrost for a while, and initiated Panthor and Panvk drivers for Mali GPUs.
A graphics pipeline to render a triangle has multiple stages, which Boris simplified into the geometry stage and the rasterizer stage. Each stage maps to simple concepts in the kernel; most of them map to buffers.
Both CPUs and GPUs are pretty fast, Boris says. But relatively speaking, the communication between the two is slow. So in general, there needs to be a way to communicate asynchronously: submitting jobs and checking for completion without waiting.
Usually the User Mode Driver (UMD), in Mesa, does the biggest part of the job: shader compilation, pipeline configuration, etc. The Kernel Mode Driver (KMD) does memory management, synchronization, power management and bridging between the UMD and hardware.
A GPU is just a different kind of processor, Boris says: it also needs an MMU to isolate workloads. In the “old world”, the KMD handles virtual address management; in the “new world”, the UMD is in charge of this part, driven by Vulkan’s needs, and handles VM contexts explicitly.
The main synchronization primitive being used is dma_fence. Some fences might be hardware-backed. Container fences collect multiple fences. drm_syncobj and sync_file are wrappers that let fences be passed to and from userspace. dma_resv is the last synchronization primitive to be aware of, used for implicit synchronization.
In the old world, most synchronization was implicit. In the new world, all synchronization is explicit, and the UMD needs to make implicit fencing explicit, reconciling the two when needed.
To submit a job, dependencies must be extracted, the job fence armed and then propagated, all before passing the job to the scheduler, at which point it can no longer fail, or the GPU will crash. Boris asks: isn’t it simple?
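Roughly, with the drm_gpu_scheduler helpers that flow looks like the sketch below; the exact signatures vary between kernel versions, so treat it as an outline rather than copy-paste code:

```c
#include <drm/gpu_scheduler.h>

struct my_job {
	struct drm_sched_job base;	/* embedded scheduler job */
	/* driver-specific command stream, buffers, etc. */
};

static int my_submit(struct my_job *job, struct drm_sched_entity *entity,
		     struct dma_fence *wait_fence)
{
	int ret;

	ret = drm_sched_job_init(&job->base, entity, 1, NULL);
	if (ret)
		return ret;

	/* everything that can fail happens here: collect explicit wait
	 * fences and implicit dma_resv fences as dependencies */
	ret = drm_sched_job_add_dependency(&job->base, wait_fence);
	if (ret)
		goto err_cleanup;

	drm_sched_job_arm(&job->base);		/* creates the job's "finished" fence */
	/* propagate that fence here: drm_syncobj, sync_file, dma_resv */

	drm_sched_entity_push_job(&job->base);	/* past this point, failing means a GPU hang */
	return 0;

err_cleanup:
	drm_sched_job_cleanup(&job->base);
	return ret;
}
```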
What goes into a KMD? A buffer object manager and a VM manager, both related to memory management; and last, the job scheduler, which among other things waits on dependencies between jobs. The user-facing API is relatively simple, Boris says: it has memory management, to manage buffer objects (BOs) and VM contexts, and job submission. It has a few additional ioctls, to get device information for example.
The DRM subsystem supplies many libraries to driver authors for each task: drm_gem or drm_ttm for BO management, drm_gpuvm for VM management, and drm_gpu_scheduler for job scheduling.
In practice, Buffer and VM management require implementing a few operations, but many defaults are provided.
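For instance, on the BO side a driver mostly fills in a table of hooks and can lean on the shmem helpers for most of them. A hedged sketch, assuming the shmem-backed helpers; exact helper names differ between kernel versions, and my_gem_free() is a hypothetical driver-specific hook:

```c
#include <drm/drm_gem.h>
#include <drm/drm_gem_shmem_helper.h>

/* hypothetical driver-specific teardown before handing off to the helper */
static void my_gem_free(struct drm_gem_object *obj)
{
	drm_gem_shmem_free(to_drm_gem_shmem_obj(obj));
}

/* Per-driver BO hooks: one custom operation, defaults for the rest. */
static const struct drm_gem_object_funcs my_gem_funcs = {
	.free		= my_gem_free,
	.pin		= drm_gem_shmem_object_pin,
	.unpin		= drm_gem_shmem_object_unpin,
	.get_sg_table	= drm_gem_shmem_object_get_sg_table,
	.vmap		= drm_gem_shmem_object_vmap,
	.vunmap		= drm_gem_shmem_object_vunmap,
	.mmap		= drm_gem_shmem_object_mmap,
	.vm_ops		= &drm_gem_shmem_vm_ops,
};
```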
Boris says that all major vendors have an upstream KMD, even if some might be incomplete. And Rust is “around the corner” in DRM, already present in several drivers: AGX, Nova and Tyr.
That’s it for the 12th edition of Kernel Recipes, see you next year!