Live Blog Day 3: morning

Sep 24, 2025

Welcome back to the liveblog, continued from yesterday.

Modernizing the Virtio GPU: A Rust-powered Approach with vhost-device-gpu — Dorinda Bassey

Virtio-GPU is seeing increasing use in QEMU and crosvm, which has generated demand for better memory safety.

A paravirtualized GPU device interacts with the host via virtio-gpu, and is exposed to the guest as a PCI device. The virtio interface is defined in the VIRTIO specification.

Dorinda says that the main component vhost-device-gpu replaces is QEMU's software backend. vhost-device-gpu runs as a separate process and should provide better performance. In the guest, the virtio-gpu driver sends commands through the virtio-gpu-pci device in the hypervisor to the host's virglrenderer.

virtio-gpu is being upstreamed in multiple projects: QEMU, libkrun, rust-vmm, and crosvm. A new virtio-GPU device has been added to rust-vmm, based on the virtio 1.2 specification.

Displays are handled via D-Bus to provide modularity. The renderer backend is also modular, and can be selected between virglrenderer (OpenGL) and gfxstream (Vulkan).


On the host side, the device uses the vhost-user protocol, which splits responsibilities between a frontend (the device in the VMM) and a backend (the process that provides it).
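
To give an idea of what that split looks like in practice, here is a rough sketch of the QEMU side of a vhost-user device (socket path and memory size are made up, and the backend process must already be listening on the socket; the shared memory backend is required so the backend process can access guest memory):

```
qemu-system-x86_64 \
  -object memory-backend-memfd,id=mem,size=4G,share=on \
  -machine q35,memory-backend=mem \
  -chardev socket,id=vgpu,path=/tmp/vhost-gpu.sock \
  -device vhost-user-gpu-pci,chardev=vgpu \
  ...
```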

Integration in Rust used the rutabaga crate. Dorinda says they encountered a few challenges integrating it, whether thread-safety issues (!Send constraints) or extracting rutabaga and its dependencies out of the crosvm source tree. After a while, Dorinda worked on the virglrenderer crate to integrate directly with virgl instead of going through rutabaga. It led to a simpler integration and fewer API changes.

The virtio-gpu device has started to be tested with IGT (Intel Graphics Test Suite); IGT was extended with virtio-GPU ioctl tests to cover the key interfaces.

Upstream support for shared memory regions was a recent development, which helped with gfxstream Vulkan integration and reduces CPU overhead. A new blob path was added as well to allocate GPU-visible memory in a zero-copy fashion. Shared resources can be assigned a UUID so they can be referenced across the host and VMs; support for this was added to QEMU recently, to be able to pass dmabuf file descriptors.
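
For reference, the UUID assignment command is a small structure in the virtio-gpu uapi; the definitions below follow include/uapi/linux/virtio_gpu.h (reproduced from memory here, so treat the header as authoritative):

```c
/* VIRTIO_GPU_CMD_RESOURCE_ASSIGN_UUID: ask the device to attach a
 * UUID to a resource so it can be referenced elsewhere. */
struct virtio_gpu_resource_assign_uuid {
	struct virtio_gpu_ctrl_hdr hdr;
	__le32 resource_id;
	__le32 padding;
};

/* VIRTIO_GPU_RESP_OK_RESOURCE_UUID: the device answers with the UUID. */
struct virtio_gpu_resp_resource_uuid {
	struct virtio_gpu_ctrl_hdr hdr;
	__u8 uuid[16];
};
```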

Red Hat’s use case is automotive devices, with RHIVOS (Red Hat In-Vehicle Operating System) able to run multiple guest OSes, including Android. This leads to cost savings, because a single GPU can be shared across VMs. Android support as a guest was also the primary driver for gfxstream’s Vulkan renderer.

Dorinda showed benchmark results showing that the new rust-vmm-based vhost-user-gpu performs as well as or better than QEMU’s previous implementation.

Recently, Vulkan support has been added to virglrenderer via Venus, a Mesa (userspace) driver that is not yet widely enabled; Dorinda showed a demo of virglrenderer+Venus running the vkmark test program, and then llama.cpp (via ramalama) with the Vulkan backend.

For future work, a WebGPU backend is being considered, in addition to snapshotting support and a gfxstream crate.

Observing the memory mills running — Vlastimil Babka

Vlastimil is a Linux kernel memory management (mm) maintainer. Last year at Kernel Recipes he explored /proc/meminfo, and this talk is a followup.

To summarize: MemFree isn’t useful; one should use MemAvailable, or “used” as reported by the free command. In general, during workload operation the exact values aren’t important. Following kernel/glibc updates, those values can change because of algorithmic changes (e.g. different reclaim timing), so they are not necessarily regressions. Ditto for memcg limits, as different kernel objects might start being accounted after updates.

So what matters more than the exact values, Vlastimil asks? One can look at: is there a memory leak? Do I run into OOM? Is it due to a kernel bug or a workload problem? Vlastimil recommends /proc/vmstat, which can be much more useful. There are 193 lines in vmstat, though, versus only 58 in meminfo, so one should be ready for this talk.

While meminfo is designed for humans, /proc/vmstat is more suitable for script consumption: a dump of counter names and their raw values, with no attempt to convert units. Some are gauges (describing state), prefixed with nr_, which can increase and decrease; others are counters (for events), which can only increase (confusingly, a few of those also start with nr_). Some values are tracked internally per NUMA node or per zone and summed up in /proc/vmstat, and there are other, finer-grained per-node values.
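
Because of that simple “name value” format, consuming it from a program is trivial; here is a minimal C sketch that dumps all the counters (no assumptions beyond the format described above):

```c
/* Dump /proc/vmstat, which is a list of "name value" lines. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long value;

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", name, &value) == 2)
		printf("%-32s %llu\n", name, value);
	fclose(f);
	return 0;
}
```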

The first set of counters is per-zone, the second relates to per-NUMA-node page allocations, and the third duplicates counters per node. Vlastimil continues with a global overview of the vmstat values, then focuses on the more interesting specific counters.

Many counters are related to page allocations: successes, per-zone, per-NUMA-node, or freeing events. All of those are in numbers of pages. Vlastimil talks about what happens when memory is low. Kswapd works with “watermark” levels that determine what to do. When free memory drops below the min watermark, allocations perform direct reclaim, which increases a counter depending on the type of allocation.

Some events are related to page faults, counted in number of faults regardless of the size of the folio. There is a special counter, pgmajfault, for when the page fault triggers disk I/O to finish the processing. Usually those are not directly useful because the kernel does readahead, so following pages/folios might not fault but already have the data available. There is a specific counter, pswpin, for reads from swap on fault (again, there is readahead), and a counter for reads via the block layer (pgpgin).
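
A per-process view of the same distinction is available to userspace via getrusage(2), whose ru_minflt/ru_majflt fields mirror the minor/major fault split; a minimal sketch:

```c
/* Fault in some anonymous memory, then print this process's
 * minor and major page-fault counts from getrusage(2). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
	size_t sz = 1 << 20;
	char *buf = malloc(sz);

	if (!buf)
		return 1;
	memset(buf, 0, sz);	/* touching the pages faults them in */

	struct rusage ru;
	getrusage(RUSAGE_SELF, &ru);
	printf("minor faults: %ld, major faults: %ld\n",
	       ru.ru_minflt, ru.ru_majflt);
	free(buf);
	return 0;
}
```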

Some events are related to memory reclaim, and are very specific to the reclaim algorithm: when a page is put on the inactive list, when it is isolated for inspection, when it moves to the active list, or when it is freed; and a few others.

Other counters are related to writeback: pswpout counts pages written to swap during reclaim. When a folio’s writeback finishes and it has the reclaim flag set, pgrotated is incremented. All writes via the block layer (not just writeback) increment pgpgout.

Working-set detection has its own set of events. If a folio was reclaimed recently, that fact can be used to make better decisions on future reclaims, for example; this is tracked with shadow nodes.

To conclude, Vlastimil says that /proc/vmstat is useful to provide insights into what the mm subsystem is doing, and is helpful for dealing with reports against it. Again, the absolute values themselves are not useful; how they evolve over time is what is interesting. Vlastimil does not want to share exact debugging recipes, because everything depends on context; understanding the counters is the first step, because they help answer questions about the behaviour.
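
Since trends matter more than snapshots, a practical pattern is to sample a counter twice and look at the delta; a minimal sketch (pgfault is used as an example, any counter name works):

```c
/* Sample one /proc/vmstat counter twice and print the delta. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long read_counter(const char *name)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char key[64];
	unsigned long long val;

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, name)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return 0;
}

int main(void)
{
	unsigned long long before = read_counter("pgfault");
	sleep(5);
	unsigned long long after = read_counter("pgfault");

	printf("pgfault: +%llu over 5 seconds\n", after - before);
	return 0;
}
```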

Many events couldn’t be covered in this presentation, Vlastimil says. For some, there might not even be enough detail in the counters, so one will have to turn to tracing, or even custom-built drgn debugger scripts.

BPFize your Kernel Subsystem: the fanotify Experience — Song Liu

Song Liu has been working on BPF at Meta for 7+ years, and maintaining the BPF LSM side specifically.

Based on the Chinese proverb, Liu provides a kernel version: “There are a thousand BPFs on a thousand kernel hackers’ keyboards”.

eBPF allows modifying the kernel with more control than mere tunables: it brings its own instruction set, is safe, cannot crash the kernel, and will not leak memory or kernel object references. It is safe thanks to the in-kernel verifier. BPF instructions are JIT-compiled into native code. The BPF “sandbox” runs in privileged context, and it interacts with the rest of the kernel through building blocks like maps and helpers/kfuncs, which also need to be designed safely.

Writing verifiable BPF programs is not a simple task, Liu says. And creating reusable building blocks for more than one use case is also difficult.

To integrate BPF into a subsystem, one needs a hook point using struct_ops. For building blocks, Liu recommends using kfuncs now instead of helpers. And one should reach out to the BPF community early, presenting the use case and being prepared to modify the solution.
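
To give an idea of the kfunc side, exposing a function to BPF programs follows a small in-tree pattern; the following is a sketch modeled on existing users (bpf_my_subsys_check is a placeholder name, and details like the exact registration macros vary across kernel versions):

```c
/* Kernel-side sketch: expose a kfunc to BPF LSM programs.
 * bpf_my_subsys_check is a placeholder name. */
#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/init.h>
#include <linux/module.h>

__bpf_kfunc int bpf_my_subsys_check(int arg)
{
	/* Subsystem-specific logic callable from BPF programs. */
	return arg > 0;
}

BTF_KFUNCS_START(my_subsys_kfunc_ids)
BTF_ID_FLAGS(func, bpf_my_subsys_check)
BTF_KFUNCS_END(my_subsys_kfunc_ids)

static const struct btf_kfunc_id_set my_subsys_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &my_subsys_kfunc_ids,
};

static int __init my_subsys_bpf_init(void)
{
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_LSM,
					 &my_subsys_kfunc_set);
}
late_initcall(my_subsys_bpf_init);
```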

Liu’s use case is BPF LSM. Linux Security Modules (LSM) is Linux’s Mandatory Access Control (MAC) solution. Hooks can be just “notifiers” that let the LSM know something happened (void hooks), while int hooks allow the LSM to deny an operation. Hooks are called for every registered LSM.

BPF LSM allows writing LSM logic in BPF instead of in-kernel code, which allows a lot of flexibility. Writing BPF LSM hooks means declaring simple “empty” functions that just return 0; these are later used to generate direct calls to BPF programs at runtime, via an ftrace trampoline, which can also modify the returned value. BPF LSM kfuncs help access the LSM’s private data.
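
A minimal BPF LSM program, in the libbpf style used by the kernel selftests, looks like this; a sketch that attaches to the file_open hook and simply allows everything (real policy logic would replace the final return):

```c
// SPDX-License-Identifier: GPL-2.0
/* Minimal BPF LSM sketch. file_open is an "int" hook: returning a
 * negative errno denies the operation, 0 allows it. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/file_open")
int BPF_PROG(restrict_open, struct file *file, int ret)
{
	/* Respect a denial from an earlier program/LSM in the chain. */
	if (ret != 0)
		return ret;

	/* Policy checks on `file` would go here. */
	return 0;
}
```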

There are many users of BPF LSM already, including systemd and the Tetragon open-source solution, as well as internal use cases at big companies.

A limitation of LSMs is that the hooks are global, and don’t allow per-file (or per-tree) rules. fanotify is a kernel mechanism that lets userspace be notified of per-tree filesystem events. So Liu combines BPF LSM and BPF fanotify hooks for the use case of applying security rules to a subset of files. To do that, the fanotify events aren’t sent to userspace, but to in-kernel handlers that then call the attached BPF programs.
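
For context, this is what the classic userspace side of fanotify looks like; a minimal sketch (requires CAP_SYS_ADMIN, and /tmp is an arbitrary example path) that watches open events on a mount and prints them. It is exactly this round-trip to userspace that the fanotify-bpf work short-circuits:

```c
/* Watch open events on the mount containing /tmp and print them. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

	if (fd < 0) {
		perror("fanotify_init");
		return 1;
	}
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
			  FAN_OPEN, AT_FDCWD, "/tmp") < 0) {
		perror("fanotify_mark");
		return 1;
	}

	ssize_t len = read(fd, buf, sizeof(buf));
	struct fanotify_event_metadata *ev = (void *)buf;

	while (FAN_EVENT_OK(ev, len)) {
		printf("mask 0x%llx from pid %d\n",
		       (unsigned long long)ev->mask, (int)ev->pid);
		if (ev->fd >= 0)
			close(ev->fd);	/* event carries an open fd */
		ev = FAN_EVENT_NEXT(ev, len);
	}
	return 0;
}
```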

fanotify implements a struct_ops, like sched_ext or tcp-bpf, which allows implementations in BPF or in in-kernel modules. Liu added an is_subdir kfunc to help fanotify-bpf programs.

fanotify-bpf has not landed yet, Liu says. There are a few challenges, especially around subtree monitoring: it requires global rules (at the superblock level), so it’s not really local, and there was a bit of pushback.

In the future, Liu wants to find a better solution for subtree monitoring, and to let the use cases evolve as well.

Continue on the afternoon live blog.
