Welcome back for day 2 of the Kernel Recipes live blog! You can also follow the live stream.

The unique way to maintain Ftrace Linux subsystem – Steven Rostedt

This is the third talk about maintainership, but that's just a coincidence. Great minds think alike.

As an introduction, Steven said this is only his personal approach. He uses it because he finds it efficient, and he doesn't like change for the sake of change, only change that actually works.

Steven showed pictures and the configuration of his personal server, his workstation and his office desk. His workflow starts on the workstation, where he does most of the development. The server handles email and web in dedicated VMs, sitting behind a firewall.

Email clients aren't the best tools for managing patches, Steven says. He might miss patches and only find them 6 months later. Most of the time, maintainers ignoring patches is not intentional; they probably just missed them.

He set up a patchwork instance on his server. He showed his procmail config: some patches are dropped outright (if they don't Cc the mailing list), some go to the inbox, and others go to patchwork.

Steven Rostedt taking his traditional selfie with a dedicated camera.

Steven also subscribes to the git-commits-head mailing list, which sends an email for every commit accepted into Linus' tree. He uses it to remove already-merged patches from his database of pending changes.

The workflow for looking at a patch starts from the patchwork instance: a link to a series is copied and passed to a script that downloads it on the server, then another script adds lore lkml links. If applying the patch with git am fails, Steven switches to quilt. Conflict resolution is done in emacs, by editing the diff manually, and the resulting patch is then double-checked in git.

With git, he ends up with too many branches, which makes it hard to find things; this is why he created a custom tool, git-ls, to list them all. For small changes he uses quilt: the patch files are generated with git diff, then managed in quilt.

In order to build and test the kernel, Steven uses ktest.pl, an in-tree tool that automates building, installing and testing a kernel. It is driven by a config file, and example configs are provided, for instance to use VMs as test targets.

Steven Rostedt

To test, Steven uses two VMs, one 64-bit and one 32-bit. To build both from the same repository, he uses git worktree. The test run takes a few hours for each config, and no tree is sent to Linus Torvalds without passing this test suite. Steven will often find bugs in other parts of the code that hold back sending his tree. The configs and test suites are both public on Steven's GitHub.

To avoid testing the same thing twice and waiting 13 hours for nothing, Steven's ktest config checks the previously tested tags and fails if the full test suite has already been run for that git commit.

Sending the emails to pull the for-linus and for-next branches is automated with scripts as well.

sched_ext: pluggable scheduling in the Linux kernel – David Vernet

A CPU scheduler is used to share tasks across CPUs. Things get complicated pretty quickly though, with many constraints added depending on the OS, the tasks, the CPUs, etc.

CFS (Completely Fair Scheduler), the default Linux scheduler, aims for a fair sharing policy. It's great, David says, but it was built in simpler times: smaller CPUs, homogeneous topologies, closer caches. Nowadays the hardware is more complex, with CCDs (Core Complex Dies) aggregating CCXs (Core CompleXes). Heterogeneity is becoming the norm, with CPUs that might have 4 cores per CCX and 2 CCXs per CCD, plus different cache hierarchies, shared L3s, etc.

CFS has some other drawbacks as well: experimentation is difficult, because one needs to recompile and reboot for each test. Onboarding a scheduler developer can take years, David says. And since the scheduler is for everyone, its general approach makes it hard to please everyone at the same time.

sched_ext enables dynamically-loaded BPF programs to define the scheduling policy. It works as a separate scheduling class, in parallel to CFS but at a lower priority. The interface is not stable; it is considered kernel-only, and GPLv2-only.

With BPF, experimentation is much simpler: no reboot needed, and it's impossible to crash the host since the eBPF verifier rejects bad programs. The API is simpler to use too, David says. And for safety, a new sysrq key has been added to fall back to the default scheduler.

At Meta, HHVM has seen up to +3% requests per second with a custom scheduler, and up to +10% for ads ranking. At Google, they have also seen strong results on search, VM scheduling and ghOSt.

One of the goals is to move complexity to userspace.

sched_ext does not aim to replace CFS; there is always going to be a need for a general-purpose scheduler. David hopes that quick experimentation will even help improve CFS. The GPLv2 constraint is checked at load time by the BPF verifier, and some example BPF schedulers are provided.

Currently, a goal is to keep the sched_ext API unstable, and not impose UAPI-like constraints on the kernel scheduler.

In order to build a scheduler, there is a basic set of callbacks that the BPF scheduler program implements, plus helpers it can call. One of the building blocks for that are the Dispatch Queues (DSQs). Every core has its own DSQ, and they are similar to the kernel's runqueues.

David Vernet

David showed an example global FIFO scheduler that is very simple, combining a global DSQ with the local DSQs, yet works very well on single-CCX machines.
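
As a rough sketch of what that looks like (modeled on the out-of-tree scx_simple example; since the API is explicitly unstable, the header path, helper and constant names below may differ between versions of the patch series), the enqueue path of such a global FIFO scheduler boils down to a few lines:

```c
/* Sketch of a global-FIFO BPF scheduler in the style of scx_simple.
 * Helper and constant names follow the out-of-tree sched_ext series
 * and may have changed; treat this as an illustration, not a reference. */
#include "scx_common.bpf.h"

char _license[] SEC("license") = "GPL";	/* GPLv2 is enforced at load time */

/* Every runnable task is appended to the shared global DSQ with the
 * default time slice; CPUs then pull from it in FIFO order. */
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops")
struct sched_ext_ops simple_ops = {
	.enqueue	= (void *)simple_enqueue,
	.name		= "simple",
};
```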

The strategy for selecting a core is also relatively simple, even though it takes SMT into account.

Another example scheduler uses per-CCX DSQs, which improve L3 cache locality (cores in the same CCX usually share the L3 cache). This one is a bit more complex, and needs a work-stealing strategy once a CCX's DSQ runs empty.
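
A hedged sketch of that idea, continuing in the same style as the snippet above: create one DSQ per CCX at init time, enqueue tasks onto the DSQ of their CCX, and have the dispatch callback steal from the other CCXs when its own queue is empty. The scx_bpf_* kfuncs follow the out-of-tree series; NR_CCX and ccx_of_cpu() are made up for the illustration.

```c
/* Per-CCX DSQs for L3 locality, with naive work stealing.
 * NR_CCX and ccx_of_cpu() are hypothetical; the scx_bpf_* kfuncs
 * follow the out-of-tree sched_ext series and may have changed. */
#define NR_CCX 4

static u32 ccx_of_cpu(s32 cpu);		/* hypothetical topology lookup */

s32 BPF_STRUCT_OPS_SLEEPABLE(ccx_init)
{
	for (u32 i = 0; i < NR_CCX; i++) {
		s32 ret = scx_bpf_create_dsq(i, -1);	/* one DSQ per CCX */
		if (ret)
			return ret;
	}
	return 0;
}

void BPF_STRUCT_OPS(ccx_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Queue the task on the DSQ of the CCX it last ran on. */
	scx_bpf_dispatch(p, ccx_of_cpu(scx_bpf_task_cpu(p)), SCX_SLICE_DFL,
			 enq_flags);
}

void BPF_STRUCT_OPS(ccx_dispatch, s32 cpu, struct task_struct *prev)
{
	u32 my_ccx = ccx_of_cpu(cpu);

	/* Prefer our own CCX's queue, then steal from the others. */
	if (scx_bpf_consume(my_ccx))
		return;
	for (u32 i = 0; i < NR_CCX; i++)
		if (i != my_ccx && scx_bpf_consume(i))
			return;
}
```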

David says that the first three example schedulers shipped as part of sched_ext are all considered production ready. scx_rusty is a multi-domain BPF / userspace hybrid scheduler. scx_simple is the global FIFO one shown earlier. scx_flatcg flattens the cgroup hierarchy.

Then there are four other sample schedulers that have very different and creative strategies, but aren’t considered production ready.

sched_ext is still out of tree (the patchset is at v4). It relies on clang for now, but gcc support for BPF programs is upcoming. Meta is following an upstream-first philosophy, and the company's kernel team's top priority is upstreaming sched_ext.

The current version is considered complete enough to be merged and tested; upcoming features could include power awareness or latency nice.

stress-ng: finding kernel bugs through stress testing – Colin Ian King

stress-ng is a tool written by Colin to do stress testing. There can be different reasons to stress test: finding breakage points (in the kernel…), checking that the system behaves well under load and scales well, verifying failure modes, or even testing that the hardware works properly.

Stress-ng has already been used to find 60+ kernel bugs, and it has led to many kernel performance improvements. Many people use it for performance testing, including some silicon vendors that use it for bring-up of new hardware. Stress-ng has also been cited in 80+ academic papers as a synthetic stress-testing tool.

About 10 years ago, Colin was working on laptops and found that in some cases the CPU could get quite hot, even leading to shutdowns. He started with the “stress” tool to verify that the Intel thermal daemon was working properly. Then he needed to improve on it, and that's how stress-ng was born.

Nowadays, stress-ng has more than 300 “stressors”, which are used to exercise various parts of the CPU, the kernel, memory, and even GPUs. They each have different goals.

A stressor has a very simple architecture: first an initialization phase, then a loop that does the work and increments a counter, exits when a stop condition is reached, and finally runs the clean-up phase.
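
As an illustration of that structure (the names below are invented for the example; this is not stress-ng's actual internal API), a stressor skeleton looks roughly like this:

```c
/* Hypothetical stressor skeleton mirroring the structure described above;
 * names are invented for illustration, not stress-ng's real internals. */
#include <stdbool.h>
#include <stdint.h>

struct stressor_args {
	uint64_t counter;	/* bumped once per loop iteration ("bogo-ops") */
	volatile bool *stop;	/* set by the framework on timeout or op limit */
};

static int example_cpu_stressor(struct stressor_args *args)
{
	volatile uint64_t sink = 0;	/* 1. initialization phase */

	/* 2. work loop: do one unit of stress, bump the counter,
	 *    exit when the stop condition is reached */
	while (!*args->stop) {
		for (int i = 0; i < 10000; i++)
			sink += (uint64_t)i * i;	/* the actual stress work */
		args->counter++;
	}

	/* 3. clean-up phase (nothing to free in this trivial example) */
	return 0;
}
```

In the real tool, the framework runs many instances of such loops, derives “bogo-ops per second” metrics from the counters, and enforces the run duration.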

When running stress-ng, there are global options (the run duration, for example) and stressor-specific options (the number of instances, for example).

Metrics may change between versions, so comparisons should always be made with the same version of stress-ng.

Colin showed many examples, starting with running different stressors in parallel with a common timeout.

Memory stressing is done from userspace, and it’s possible to test the memory behaviour, its bandwidth, or even how the kernel behaves in OOM scenarios.

Network stressing has many modes: UDP, sockets, zero-copy, etc. When testing filesystems, there are multiple workloads, and it's possible to check whether any disk SMART errors are reported during the test.

Kernel interfaces like sysfs and procfs have their own dedicated stressors. Syscalls, both existing and non-existent ones, can be exercised, and this has found real-world bugs.
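
The idea behind exercising non-existent syscalls can be sketched in a few lines of C (just the concept, not stress-ng's code): the kernel is expected to fail cleanly with ENOSYS, and anything else is suspicious.

```c
/* Sketch of the "invalid syscall" idea: the kernel should return ENOSYS
 * for syscall numbers it doesn't implement. Not stress-ng's code. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	/* A deliberately bogus syscall number; a well-behaved kernel
	 * should reject it with ENOSYS rather than misbehave. */
	long ret = syscall(0x7fffffff);

	if (ret == -1 && errno == ENOSYS)
		printf("kernel correctly returned ENOSYS\n");
	else
		printf("unexpected result: ret=%ld errno=%d\n", ret, errno);
	return 0;
}
```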

Stressors are tagged with classes, so it's possible to run all the stressors in a given class in a single run, with various running modes: sequential, in parallel, or with permutations. Stressors can have methods, which are different sub-stresses; by default they are all used in a round-robin fashion, but it's possible to select a specific one.

Stress-ng can run perf directly to analyze counters during a test.

The update cycle is very fast, and the program is very simple to compile. Colin says he really likes testing systems, and that is what drives the development.

Before a release, stress-ng is tested on many different architectures, compilers and operating systems, with over 100 VMs used. Portability is a big goal. Different static analyzers are used as well.

That’s it for this morning. Continue with the afternoon live blog.