Test-driven kernel releases – Guillaume Tucker
The Linux kernel is heavily tested, because it’s used all over the place. But this testing effort is both hidden and duplicated, and isn’t tracked upstream.
There’s already some testing available in the open for mainline, though. syzbot is a syscall fuzzing tool with automated bisection, reproducer generation and a public web UI. KernelCI is a tailored CI system: it’s distributed, built on Kubernetes, and also has automated bisection plus a public web UI and API. regzbot is another; it focuses on regressions and manual submissions.
The goal of Guillaume’s project is to aggregate all those test results for each release, and ideally provide them in-tree. But it’s challenging, because the testing needs to happen before the release is tagged.
This started with the results reproducible on any hardware: the tests included in the kernel tree. Does the kernel even build with reference toolchains? Then there are builds with sparse, coccicheck, KUnit, and device tree validation.
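The checks listed above can all be driven from a kernel source tree with standard make targets; a minimal sketch (run from the top of a checkout, with toolchains assumed to be installed):

```shell
# Hardware-independent checks from the kernel tree (illustrative sequence).
make defconfig
make -j"$(nproc)"                    # does it even build?
make C=1                             # run sparse on files being recompiled
make coccicheck MODE=report          # Coccinelle semantic checks
./tools/testing/kunit/kunit.py run   # KUnit tests (runs under UML by default)
make dt_binding_check                # device tree binding validation
```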
The first idea (RFC) was to put the results in-tree, and rely on git history for previous results. The second idea was to use a link in release commits. The third proposal was to use git metadata. The goal is always to stay compatible with the infamous email-based workflow Linux kernel developers use.
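One way the “git metadata” proposal could work, purely as a hypothetical sketch, is git notes: results are attached to a release commit without touching the tree itself (the commit message and note content below are made up):

```shell
# Attach test results to a release commit via git notes (illustrative).
git init -q demo && cd demo
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m "Linux v6.0"
git -c user.name=t -c user.email=t@t notes add -m "KernelCI: build pass, boot pass" HEAD
git notes show HEAD
```

Notes live in a separate ref (`refs/notes/commits`), so they can be pushed, fetched, and amended after the fact without rewriting the release tag.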
Guillaume is looking for feedback on this concept, and on how to move the idea forward. A good part of the audience agreed it would be a good idea, but there was some back and forth on implementation details, which might have an impact on everyone’s workflow. For example, once you have the data in-tree, how do you decide what to do when tests don’t pass?
Some preferred the second proposal, with a central link referenced in commits but the results kept out-of-tree. What matters to companies is the paper trail showing that a given release was tested, not necessarily what is tested and how. Others think it might need to be actionable, not just a simple paper trail. There was also some discussion on whether test results should be stored forever: if it’s free to store, why not? It was also added that this type of information is very useful for pushing companies, and customers in general, to update their kernel.
What’s new in ftrace – Steven Rostedt
ftrace was originally the “function tracer”. It was designed for embedded systems, and is very portable. You can watch previous Kernel Recipes talks for an introduction.
ftrace has so many features that even Steven, their author, sometimes forgets about them. That is one reason for this talk.
trace-cmd is a wrapper around the tracefs directory, and is much more powerful than poking at it directly. Steven says you should use it.
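As a quick taste of trace-cmd (needs root and a kernel with tracefs; the event is just an example):

```shell
# Record all sched_switch events while a command runs, then view them.
trace-cmd record -e sched_switch sleep 1
trace-cmd report | head
```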
kprobe tracing is kind of new (2009), and very powerful; for example, it supports tracing function arguments. Steven showed an example with ftrace and kprobes: using a kprobe argument to fetch a given function argument, dereferencing a pointer to get a specific field, like the interface name.
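A sketch of this kind of kprobe, using the `$argN` fetch syntax from the kernel’s kprobetrace documentation (the probed function, the dereference offset and the probe name are illustrative, not Steven’s exact example; needs root):

```shell
cd /sys/kernel/tracing
# Probe the openat(2) entry path; $arg2 is a user pointer to the filename,
# and +0($arg2):ustring dereferences it to read the string.
echo 'p:myopen do_sys_openat2 dfd=$arg1 file=+0($arg2):ustring' >> kprobe_events
echo 1 > events/kprobes/myopen/enable
head trace_pipe
```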
uprobe tracing is a bit newer (2012), and it’s also possible to use it with ftrace. Steven showed how to trace, system-wide, all the calls to malloc, and print the size argument in the ftrace buffer.
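System-wide malloc tracing with a uprobe might look like this; a uprobe needs the symbol’s file offset in the binary, so it is looked up first (the libc path is an assumption, and `%di` assumes x86-64, where the first argument arrives in that register):

```shell
cd /sys/kernel/tracing
LIBC=/lib/x86_64-linux-gnu/libc.so.6              # path is an assumption
OFF=$(nm -D "$LIBC" | awk '$NF == "malloc" {print "0x"$1; exit}')
# Probe malloc() in every process mapping this libc; size is its first arg.
echo "p:mymalloc $LIBC:$OFF size=%di" >> uprobe_events
echo 1 > events/uprobes/mymalloc/enable
head trace_pipe
```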
Histograms were added in 2016, and are attached with an event “trigger”. They are designed to count events. An example with syscalls was given. Syscalls only have two tracepoints, sys_enter and sys_exit, not one per syscall. A histogram was created on the sys_enter tracepoint, with the mapping from syscall number to symbol done directly in ftrace. Another example used histograms with the malloc uprobes, with per-process total allocation size, and a mapping from pid to comm.
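The syscall histogram might look like this; the `.syscall` modifier is what maps the raw syscall id to its name inside ftrace (needs root and CONFIG_HIST_TRIGGERS):

```shell
cd /sys/kernel/tracing
# Count syscalls system-wide, keyed by syscall name rather than number.
echo 'hist:keys=id.syscall' >> events/raw_syscalls/sys_enter/trigger
sleep 1
head events/raw_syscalls/sys_enter/hist
```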
In 2018, synthetic events were added: they connect two different events into one. For example, it’s possible to show the latency between the two events. Steven showed an example with wakeup latency: first, a synthetic event is created, with a name, a pid and a latency. Then a histogram is created on sched_waking with the pid as the key, storing the timestamp (from the ftrace ring buffer) as a variable. Then a second histogram is created on sched_switch, keyed on the next pid, computing the time difference between the sched_waking and sched_switch events. It’s then possible to create new histograms from the synthetic event. It’s quite simple to use, and Steven wonders: why is nobody using it?
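The steps above closely follow the wakeup-latency example in the kernel’s histogram documentation; a sketch (needs root and CONFIG_HIST_TRIGGERS):

```shell
cd /sys/kernel/tracing
# 1) Declare the synthetic event: a latency plus the woken pid.
echo 'wakeup_latency u64 lat; pid_t pid' >> synthetic_events
# 2) On sched_waking, save a per-pid timestamp into variable ts0.
echo 'hist:keys=pid:ts0=common_timestamp.usecs' \
    >> events/sched/sched_waking/trigger
# 3) On sched_switch, compute the delta and fire the synthetic event.
echo 'hist:keys=next_pid:lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($lat,next_pid)' \
    >> events/sched/sched_switch/trigger
# 4) Histogram on the synthetic event itself.
echo 'hist:keys=pid:vals=lat' >> events/synthetic/wakeup_latency/trigger
```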
In 2020, libtracefs was released: it exposes almost all features of the tracefs system. It was extracted from trace-cmd. It contains an example program called sqlhist that helps with synthetic events: you write an SQL query, and the program parses it and converts it into ftrace synthetic events and histograms. It’s much more intuitive.
Another nice example with sqlhist showed how to create synthetic events to measure how long tasks are blocked, and by which syscall. The first synthetic event got the name of the syscall that blocked (followed by a sched_switch with the task uninterruptible). The second one counted the time spent off-CPU (unscheduled).
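An sqlhist query has the shape shown in the libtracefs sample; this one reproduces the wakeup-latency case rather than Steven’s exact off-CPU query (treat the grammar and flags as a sketch):

```shell
# -e: execute the generated tracefs commands; -n: synthetic event name.
sqlhist -e -n wakeup_lat \
  'SELECT end.next_pid AS pid,
          (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) AS lat
   FROM sched_waking AS start JOIN sched_switch AS end
   ON start.pid = end.next_pid'
```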
Event probes are from 2021: they extend trace events with kprobe-style argument fetching. Steven realized he had forgotten to write the documentation for them (it is now done). He showed an example eprobe adding the interface name (like the first example). Another example showed all the files being opened with an eprobe on openat. But it wasn’t as simple, because the filename coming from userspace might not have been mapped yet at syscall entry. So he used a synthetic event built with sqlhist to join with sys_exit_openat, getting both the filename (available by then) and the return code.
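An eprobe attaches kprobe-style fetch arguments to an existing trace event through the dynamic_events interface; a sketch of grabbing openat’s return code (probe name and exact syntax per the kernel docs; illustrative, needs root):

```shell
cd /sys/kernel/tracing
# Attach an event probe to the openat exit tracepoint and fetch the
# event's own 'ret' field as a signed 64-bit value.
echo 'e:myopen syscalls/sys_exit_openat retval=$ret:s64' >> dynamic_events
echo 1 > events/eprobes/myopen/enable
head trace_pipe
```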
bootconfig was added in 2020, and allows loading a configuration file along with the kernel to enable a set of ftrace options during boot.
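A bootconfig fragment enabling tracing at boot might look like this (the event and filter are examples taken from the style of the kernel’s boot-time tracing documentation):

```
# trace.bconf -- ftrace boot-time configuration (example values)
ftrace.event.task.task_newtask {
        filter = "pid < 128"
        enable
}
```

The fragment is appended to the initrd with `bootconfig -a trace.bconf initrd.img`, and activated by adding `bootconfig` to the kernel command line.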
In 2022, the CUSTOM_TRACE_EVENT macro was added, making it possible to select only a subset of the information in an event, saving ring-buffer memory used by events.
Idmapped Mounts – Christian Brauner
VFS ownership is usually stored in the filesystem, but not necessarily. Internally, at the filesystem level, it is accessed with helpers like i_uid_read and i_uid_write, which translate between struct inode raw uids and kuids. That’s where idmappings come into play. They map userspace ids to kernel-space ids, by ranges, usually in a given user_namespace. This is written with the u:k:r notation: userspace-id:kernel-id:range.
When a user namespace is created, it has a default idmapping; the from_kuid family of functions is used to do the mapping.
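Idmappings are visible from userspace in /proc; each line of uid_map is exactly the u:k:r triple described above:

```shell
# Each line reads: id-inside-namespace  id-in-parent  range, i.e. u:k:r.
# In the initial namespace this is typically a single identity mapping.
cat /proc/self/uid_map
```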
When opening an existing file, a lookup is done if the inode is not in the icache. Depending on whether or not the filesystem is mounted with an idmapping, the kernel id might be different from the on-disk user id. The same is done when creating a file: current_fsuid() is fetched from the current user namespace and converted through the idmapping into the filesystem’s on-disk id.
To alter ownership filesystem-wide, for example for a filesystem mounted by an unprivileged user, the relevant idmapping is determined at mount time.
In systemd, it’s now possible to have portable home directories: a home directory that can work on multiple machines. This works by allowing a pseudo-random uid at login time. For this use case, the permissions are recursively changed with chown() if the login uid/gid changed.
For simple process isolation, unprivileged containers are very useful, but they make filesystem interactions difficult if sharing outside the container is needed. They also need recursive ownership changes, which wastes space and makes container startup expensive.
Idmapped mounts fix the issues in these two use cases. The file ownership can be changed locally (per mount) and temporarily (for the lifetime of the mount). Christian showed an example of a bind mount with an idmapping, doing exactly this, using a patched mount binary whose changes should soon be upstream.
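The demo can be reproduced with Christian’s mount-idmapped helper tool; a sketch (the paths and mapping values are examples; needs root and a supported filesystem):

```shell
# Bind-mount /srv/data at /mnt/data with an idmapping that shifts
# uids/gids 0-65535 in the mount to 10000-75535 on disk
# ('b' maps both uids and gids).
mount-idmapped --map-mount b:0:10000:65536 /srv/data /mnt/data
```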
The internals are all documented in the idmappings official doc.
The idmapping cannot be changed after the mount is created. The feature currently needs per-filesystem support: it started in Linux 5.12 with ext4, FAT and XFS; Linux 5.15 added btrfs and ntfs3; 5.18 added f2fs; and 5.19 will have both erofs and overlayfs support. The feature is heavily tested in xfstests.
Userspace support has already reached many projects: systemd, containerd, crun, runC, LXC, LXD, podman, the OCI spec and mount in util-linux.
Rethinking the kernel camera framework – Ricardo Ribalda
With his team, Ricardo has been working on improving the way cameras are used, with a new user/kernel framework: CAM.
Unlike simpler input devices such as joysticks, cameras are special. For example, there is now support for more than 200 video formats in the Linux kernel. These can’t really be converted on the fly in the kernel: it would be too costly. Cameras also have complex input parameters, and sometimes multiple ways to reach the same result. In addition, the video bandwidth can be huge, and latency needs to be low; this leads to a lot being done in hardware. Cameras are also driving the consumer market.
In video4linux, the model has changed over the years: from simple cameras where the driver did all the work, to the internals of modern cameras being exposed and needing a lot of software to configure all of it. This is where libcamera comes into play (see the presentation by Laurent Pinchart at Embedded Recipes).
And Ricardo says that we have been living a lie: raw sensor data, as it comes, already needs a lot of software transformation before yielding a useful image.
CAM is a new kernel subsystem (KCAM), which provides much simpler abstractions on top of hardware components: Entities and Operations. Entities are organized in a tree, have a single register set, and can emit events. Operations are the way an entity is modified: for example, reading or writing a regmap, or transferring a parameter buffer. An operation can depend on other operations or events; it can create a fence for synchronization.
In practice, a list of operations is sent by an app to the kernel via an ioctl (soon to be replaced with io_uring), and the kernel runs it.
This new prototype framework is heavily tested: with KUnit, libkc, vcam, and error injection. It is tested on hardware with the ChromeOS test infrastructure.
The main goal of CAM is to get smaller drivers, moving more work into the userspace app (using libcamera, too). It’s a full blank-slate departure from video4linux, changing the model from streams to operations.
The governance model will be similar to DRM: any driver must have an open source userspace stack before being merged.
The goal is to test this stack in ChromeOS first to benchmark it, and then propose it upstream; this presentation is a first step toward the latter, Ricardo says.
That’s it for today! Read the day 2 morning live blog.