Welcome back for the last day of this 10th edition of the Kernel Recipes live blog! You can also follow the video live stream.
Netconf 2023 Workshop – David Miller
This year, Netconf, the invitation-only Linux conference for network kernel developers was held in parallel of the first two days of Kernel Recipes. The agenda was packed.
During the first day, Alexei discussed the location for netconf; David is comfortable having it co-located with Kernel Recipes. Toke talked about XDP; it’s a very important technology, David says.
Jesper discussed page pool, a buffer management tool. It has become more important to have a common mechanism like this for drivers.
Alexandre discussed GRO (Generic Receive Offload) overhead; decreasing the per-packet overhead is quite important for general performance.
Jesper also talked about the impact of ksoftirqd on network processing. Networking would welcome improvements in this area of the kernel.
Paolo discussed evolving some terminology in the kernel to be more inclusive.
Jiri talked about devlink; it allows abstracting a network device with multiple ports so that only a subset is used.
Florian F talked about mDNS offload.
Jakub talked about the development process, including the pw-bot. It has reduced toil, since it detects common issues in patch submissions. The impact of new tools on CI is quite important, and helps a lot. When they don’t work, the developers miss them, which Dave compared to an addiction.
Willem talked about refactoring complex code. It’s important, but impacts both rebases (making them harder) and backporting patches to older, pre-refactor versions.
Florian W did a workshop on IPsec; PF_KEY deprecation took a good part of the discussion, as the API is getting old, but not all applications have migrated to the new netlink-based API.
Daniel Borkmann discussed header/data split which is very important for zero copy of packets, since the data can be put in a separate page and sent to userspace. He also discussed Big TCP and zerocopy, as well as bpf_mprog to have dependencies and ordering between bpf programs, allowing more complex pipelines.
Eric Dumazet is trying to reorganize struct file to make sure the fast path works: the most important fields should be in one cacheline, and the rest in the next ones. He also discussed deferred wakeups to improve performance by having fewer wakeups. The idea of using accept() on UDP sockets was introduced recently: it makes more sense than it seems for userspace applications, which often have stateful protocols on top of UDP.
Florian W also talked about fixing CVEs in nftables. It’s often related to the accumulation of technical debt, and tackling this is important, but the limited resources of the networking developers make this very hard.
David finished by thanking the Kernel Recipes conference organizers, because it stayed true to its roots and is a great environment to discuss technical ideas.
Faster and fewer page faults – Matthew Wilcox
This talk is about four projects. Matthew had a hard time deciding what he should work on; he picked ambitious projects that made sense and came together.
Linked lists are an immoral data structure, Matthew says. If they are used for anything where you care about performance, it does not work. Your CPU is attempting to extract parallelism from sequential code: his laptop is able to execute up to 6 instructions per clock cycle, while an L3 cache miss costs about 1680 instructions’ worth of work, which is a lot. Linked lists bottleneck on fetching the next entry in the list. Arrays don’t have this issue, and they can be prefetched. This compounds a lot in the kernel.
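A minimal userspace sketch (not from the talk) of why this matters: in a linked list, each node’s address is only known after the previous node’s load completes, so the loads serialize; array element addresses are known in advance, so the hardware prefetcher can fetch ahead.

```c
#include <assert.h>
#include <stddef.h>

struct node { long val; struct node *next; };

/* Serialized: each iteration must wait for n->next to arrive from memory. */
long sum_list(const struct node *n)
{
	long sum = 0;
	for (; n; n = n->next)
		sum += n->val;
	return sum;
}

/* Prefetchable: addresses a[0], a[1], ... are known in advance. */
long sum_array(const long *a, size_t n)
{
	long sum = 0;
	for (size_t i = 0; i < n; i++)
		sum += a[i];
	return sum;
}

/* Small demo: same result, very different memory-level parallelism. */
long demo_list_sum(void)
{
	struct node n3 = { 4, NULL }, n2 = { 3, &n3 };
	struct node n1 = { 2, &n2 }, n0 = { 1, &n1 };
	return sum_list(&n0);
}
```

Both traversals compute the same sum; the difference only shows up in how many loads the CPU can have in flight at once.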
A page fault starts with a VMA (Virtual Memory Area) lookup for virtual address. Then the page tables are walked down. If the VMA is anonymous memory, a page needs to be allocated, zeroed, put in the page table, and returned to userspace. Otherwise, for a file-backed page, the VMA page fault handler is called to return a page or populate the page table directly.
VMAs were originally stored in a singly-linked list (in 1992); this was replaced by an AVL tree in 1995, then a red-black tree in 2001. The RB tree was a bad idea, Matthew says, because it is less well balanced than the AVL tree: lookups are penalized to favor insertions. In 2022, a new generic data structure called the Maple Tree was introduced by Liam Howlett to fix this issue, with Matthew advising.
A Maple Tree is an in-memory, RCU-safe B-tree. It changed the tradeoff: lookups are faster, but modifications slower. With RCU, readers no longer need to take a lock.
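The query the maple tree serves on a page fault is a range lookup: find the region whose [start, end) contains the faulting address. A simplified userspace model (hypothetical names, not the kernel API) with a sorted array and a binary search shows the shape of that lookup:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a VMA as a half-open address range [start, end). */
struct vma_model { unsigned long start, end; };

/* Binary search over ranges sorted by start address.
 * Returns the region containing addr, or NULL if addr is unmapped. */
const struct vma_model *find_vma_model(const struct vma_model *v, size_t n,
				       unsigned long addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr >= v[mid].end)
			lo = mid + 1;	/* addr is past this region */
		else if (addr < v[mid].start)
			hi = mid;	/* addr is before this region */
		else
			return &v[mid];	/* start <= addr < end: hit */
	}
	return NULL;
}
```

The real maple tree answers the same question as a B-tree with RCU-safe node replacement, so concurrent readers never see a half-updated tree.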
The next project is per-VMA locking. VMAs used to be protected by a semaphore (1996); this was changed to an rw-semaphore in 2001, which introduced its own set of problems around whether readers or writers get priority. In 2023, this was replaced by per-VMA RCU read locks, combined with a seqcount. The implementation is complex, but the idea is conceptually simple, Matthew says.
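The seqcount half of that idea can be sketched in userspace (hypothetical names, and with the memory barriers the real kernel seqcount needs omitted for brevity): a reader samples a sequence counter, reads without a lock, then checks the counter again; an odd or changed value means a writer was active and the read must retry.

```c
#include <assert.h>

struct seq_data {
	unsigned int seq;	/* even: stable; odd: write in progress */
	long value;
};

static unsigned int read_begin(const struct seq_data *s)
{
	return s->seq;
}

static int read_retry(const struct seq_data *s, unsigned int start)
{
	/* Retry if a write was in flight, or completed meanwhile.
	 * (Real kernel code also needs smp_rmb()/smp_wmb() barriers.) */
	return (start & 1) || s->seq != start;
}

void write_value(struct seq_data *s, long v)
{
	s->seq++;		/* now odd: concurrent readers will retry */
	s->value = v;
	s->seq++;		/* even again: value is stable */
}

long read_value(struct seq_data *s)
{
	unsigned int start;
	long v;

	do {
		start = read_begin(s);
		v = s->value;
	} while (read_retry(s, start));
	return v;
}

/* Demo: write then read back through the lockless path. */
long demo_seq(void)
{
	struct seq_data s = { 0, 0 };

	write_value(&s, 42);
	return read_value(&s);
}
```

Combined with RCU keeping the VMA object itself alive, this lets the page fault path validate a VMA without touching the mmap lock at all.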
Support for per-VMA locking needs architecture support; it was first added for anonymous VMAs (the most common) on arm64, powerpc, s390, x86 and riscv. Other VMA types like swap, userfaultfd, page cache and DAX were added later. There is still more support to add, Matthew says.
The past two projects brought faster page faults. To get fewer of them, large folios come into play. Some filesystems like XFS, AFS and EROFS already added support for buffer chunks larger than PAGE_SIZE. Matthew says he won’t do any other filesystems himself, but is extending help to any filesystem developer who wants to add folio support.
Larger folios are now possible on file write(). Folio support for anonymous memory is being worked on by other people.
Another page-fault reduction strategy comes from a new set of PTE (Page Table Entry) manipulation APIs. For example, set_pte_at() can only insert one PTE at a time; set_ptes() was added to insert multiple consecutive PTEs pointing to consecutive pages. This allows exploiting arm64 and AMD hardware features for contiguous PTE entries. Other APIs were added to be able to flush multiple pages at a time.
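A toy userspace model (hypothetical names, nothing like the real kernel signatures) of the batching idea: instead of one call per entry, fill `nr` consecutive slots with entries for consecutive page frames in one call, which is what lets the kernel hand the hardware a contiguous mapping it can optimize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table entry: just the page frame number. */
typedef unsigned long pte_model;

/* One entry at a time — the old set_pte_at()-style shape. */
void set_pte_model(pte_model *ptep, unsigned long pfn)
{
	*ptep = pfn;
}

/* Batched — fill nr consecutive entries for consecutive page frames,
 * the set_ptes()-style shape. A real implementation can mark the run
 * as contiguous for the hardware. */
void set_ptes_model(pte_model *ptep, unsigned long pfn, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++)
		ptep[i] = pfn + i;
}

/* Demo: batch-fill 8 entries starting at frame 100, return the i-th. */
unsigned long demo_nth_pte(unsigned int i)
{
	pte_model ptes[8];

	set_ptes_model(ptes, 100, 8);
	return ptes[i];
}
```

Mapping a large folio this way takes one call and one TLB-friendly run of entries, instead of a per-page loop of individual insertions.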
Matthew says there are a lot of other projects he is thinking about or working on. Matthew is happy to work as an adviser to anyone who might be interested.
Coming soon – Thomas Gleixner
The placeholder title of this talk became the actual one.
It looks like there are four preemption models in the kernel. Or are there? Thomas said that in practice these can be reduced to two.
The first of the four is PREEMPT_NONE. While userspace can be preempted at any point, the kernel uses cooperative multitasking via schedule(). This can cause latency and starvation issues, because the NEED_RESCHED flag is only acted upon at the syscall boundary. The workaround is the manual placement of rescheduling points, using cond_resched() at points in a kernel task where preemption can happen. It’s a manual and error-prone task, and it needs careful placement, for example to avoid rescheduling with locks held.
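The placement pattern can be sketched in userspace with a stubbed cond_resched() (this is an illustration, not kernel code): a long-running loop offers a preemption point periodically, placed where no lock is held.

```c
#include <assert.h>
#include <sched.h>

/* Count how often the loop offered to reschedule (for the demo). */
long resched_points;

/* Userspace stand-in for cond_resched(): just yield the CPU. */
static void cond_resched_stub(void)
{
	resched_points++;
	sched_yield();
}

/* A long-running "kernel" loop with manually placed preemption points.
 * The point sits outside any critical section, every 1024 items. */
long process_items(long n)
{
	long done = 0;

	for (long i = 0; i < n; i++) {
		done++;				/* ... the actual work ... */
		if ((i & 1023) == 1023)
			cond_resched_stub();	/* safe: no locks held here */
	}
	return done;
}
```

Choosing the interval and the safe spot by hand is exactly the toil and the error-proneness Thomas is arguing against.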
The next is PREEMPT_VOLUNTARY. It’s mostly the same as NONE, with additional scheduling opportunities at might_sleep() calls. might_sleep() was a debug mechanism, and cond_resched() was glued onto it, which bothers Thomas. It can result in redundant task switching and even lock contention. It does provide better latencies than NONE, but has the same issues, with even more contention possible.
The FULL preemption model is for multitasking, based on timeslicing and priority. There are still non-preemptible sections in the kernel: spin and rw locks, any interrupt disabling or per-cpu accessors, or even simply calling preempt_disable(). In those sections, migration is prevented and blocking operations are forbidden. Migration prevention can be made preemptible with migrate_disable(). In this model, the scheduler knows when preemption is safe. But latencies can increase, and aggressive preemption can cause contention.
The RT preemption model builds on FULL. RT further reduces the non-preemptible sections, by changing some locks into sleeping locks. There are also further restrictions in RT’s non-preemptible sections: no memory allocation, and no call to any function that might itself use a now-sleepable lock. The tradeoffs are similar to FULL, but with smaller worst-case latencies in exchange for less throughput.
This throughput tradeoff for non-RT tasks can be mitigated with LAZY preemption.
On x86, REP MOVS/STOS can bring huge memcpy/memset performance improvements. These instructions can be interrupted, but not in the NONE and VOLUNTARY preemption modes, so in those cases large copies and clears will cause latencies. Going back to a chunk-based loop (instead of a single instruction) loses the hardware benefit.
Bringing new semantics to NONE and VOLUNTARY is error-prone, which is why there’s a need to take a step back. The explicit preemption points are moved regularly in those cooperative scheduling modes, and they are hacks. The preemption counter should always be enabled, Thomas says, since it’s no longer as expensive (even if not free). This would allow all preemption models to know when it is safe to preempt.
This would also allow reducing the number of preemption models. NONE, VOLUNTARY and FULL could be merged, giving more control to the scheduler, which would be aware of the voluntary semantics. cond_resched() could then be removed completely. Scheduler hints for lazy preemption can be added for the normal case.
All this work is still at the PoC level, but it works and looks promising. RT would still be separate and selected at compile time. “Museum” architecture support is a challenge, but this can be worked on.
In response to a question from the audience, Thomas said that the mechanisms in the kernel shouldn’t be convoluted with the policy, which comes from elsewhere.
That’s it for this morning! Continue with the afternoon live blog.