This is the continuation of this morning’s live blog.
Diving into the kernel mitigations by Breno Leitao
When Spectre happened, many people thought the bug would be fixed and that would be the end of that. But CPU bugs keep coming out, and new mitigations for them keep being added to the kernel.
After Breno changed jobs, he thought he would be done with handling Spectre. But those bugs and their mitigations keep nagging him.
CPU bugs have always existed, Breno shows. What’s new is the far-reaching security impact. There are many variants of Spectre and other hardware vulnerabilities; how many depends on how you count: Wikipedia lists 42 different CVEs, for example.
A simple explanation of a side-channel attack like Spectre: speculatively executed instructions fetch data from memory and populate the cache; by timing accesses, an attacker can probe the cache and guess what is in it. And it’s not only the CPU caches: the leaky state can also live in the store and load buffers, the line fill buffer, a load port, etc.
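To make the timing part concrete, here is a minimal sketch (mine, not from the talk) of the cache probe that FLUSH+RELOAD-style attacks build on: flush a line, let the victim run, then time a load; a fast load means some (possibly speculative) code touched that line. x86-64 with GCC or Clang assumed.

    /* Minimal cache-timing probe sketch (x86-64, GCC/Clang). */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    static uint8_t target[4096];

    static uint64_t time_load(volatile uint8_t *p)
    {
        unsigned int aux;
        uint64_t t0 = __rdtscp(&aux);   /* serialized timestamp */
        (void)*p;                       /* the timed load */
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;
    }

    int main(void)
    {
        _mm_clflush(target);            /* evict the line from the caches */
        _mm_mfence();
        uint64_t miss = time_load(target);  /* cold: slow */
        uint64_t hit  = time_load(target);  /* now cached: fast */
        printf("miss=%llu cycles, hit=%llu cycles\n",
               (unsigned long long)miss, (unsigned long long)hit);
        return 0;
    }

In a real attack, the attacker times an array of such lines and infers secrets from which one became a cache hit.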
There are different types of mitigations: hardware, software, and workload-specific.
Hardware mitigations can take the form of CPU instructions or MSRs that flush predictor state. IBPB, for example, flushes the indirect branch prediction. IBRS is a CPU mode, not a one-shot command, and it has a huge performance impact. New CPUs have an enhanced mode, eIBRS, which can be always-on and is more performant, but it does not mitigate all issues. STIBP is another mode that disables cross-sibling-thread branch prediction.
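As a concrete example, an IBPB is issued by writing a command bit to the IA32_PRED_CMD MSR. A sketch (the constants match the architectural definitions; wrmsr is privileged, so this only runs in kernel context):

    /* Sketch: issuing an IBPB by hand. MSR 0x49 (IA32_PRED_CMD),
     * bit 0 = IBPB. Privileged: only works in ring 0. */
    #define MSR_IA32_PRED_CMD  0x00000049
    #define PRED_CMD_IBPB      (1ULL << 0)

    static inline void wrmsr64(unsigned int msr, unsigned long long val)
    {
        unsigned int lo = (unsigned int)val, hi = (unsigned int)(val >> 32);
        __asm__ volatile("wrmsr" : : "c"(msr), "a"(lo), "d"(hi));
    }

    static inline void issue_ibpb(void)
    {
        wrmsr64(MSR_IA32_PRED_CMD, PRED_CMD_IBPB); /* flush branch predictor */
    }

(The kernel’s own helper for this is wrmsrl(), with the same constants from arch/x86/include/asm/msr-index.h.)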
There are three main software mitigations. Userspace sanitization prevents userspace-controlled array indexing from influencing speculation. Kernel Page Table Isolation (KPTI) switches page tables when crossing from userspace to kernel space and back. Retpoline, the last one, handles indirect jumps by transforming a jmp into a ret with pause and lfence instructions in the middle: this tricks the branch prediction.
The original retpoline could still be subverted by controlling the Return Stack Buffer (RSB); that was mitigated with call depth tracking.
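For the curious, the transformation looks roughly like this well-known x86-64 retpoline thunk (written here as top-level inline asm in a C file; the kernel’s real thunks live in arch/x86/lib/retpoline.S, and the thunk name below is made up). The indirect branch target is assumed to be in %rax:

    /* Sketch of the classic retpoline thunk for x86-64. */
    __asm__(
        ".text\n"
        "my_retpoline_rax:\n"        /* hypothetical name */
        "   call 1f\n"               /* push address of the capture loop  */
        "2: pause\n"                 /* speculation is trapped here...    */
        "   lfence\n"                /* ...and serialized                 */
        "   jmp 2b\n"
        "1: mov %rax, (%rsp)\n"      /* overwrite return addr with target */
        "   ret\n"                   /* architecturally jumps to target   */
    );

The ret is predicted (via the RSB) to return into the pause/lfence loop, so speculation spins harmlessly while the architectural execution jumps to the real target.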
If your server is running only trusted code, if the service is not security-sensitive, or if the threat model does not include user input, mitigations are not needed. For virtual machines, one might pay the cost only on VM exit (when the guest calls into host code). But those options are not enough, Breno says; advanced workload users want more options, and that is where he is trying to help.
A first issue Breno found is that terminology, whether in hardware or in the kernel, is not consistent. He gives multiple examples of such terminology issues, which make it harder for sysadmins and workload owners to understand what is happening.
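Despite the terminology mess, one place where the kernel does summarize its view, vulnerability by vulnerability, is sysfs. A small sketch to dump it (the directory is a stable, documented interface):

    /* Print per-vulnerability mitigation status from sysfs. */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        const char *dir = "/sys/devices/system/cpu/vulnerabilities";
        DIR *d = opendir(dir);
        struct dirent *e;
        char path[512], line[256];

        if (!d) { perror(dir); return 1; }
        while ((e = readdir(d)) != NULL) {
            if (e->d_name[0] == '.')
                continue;
            snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
            FILE *f = fopen(path, "r");
            if (f && fgets(line, sizeof(line), f))
                printf("%-28s %s", e->d_name, line); /* e.g. "Mitigation: ..." */
            if (f)
                fclose(f);
        }
        closedir(d);
        return 0;
    }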
And when looking at virtualization, things get even more complex, Breno says. He gives two examples: one of a VMM (hypervisor) not passing MSRs through properly, preventing the guest from enabling mitigations; in other cases, things get mitigated twice, paying the cost twice as well.
At scale, every CPU performance gain matters and can save money. Kernel upgrades can cause churn because of new mitigations, or maybe from something else; it’s hard to know in advance what kind of performance penalty a mitigation will have. Advanced users want to (and should) fine-tune mitigations for their threat model.
In the future, those bugs will keep coming; they are not one-offs, Breno says. The kernel should adapt and become more flexible. Breno started with the terminology: config options were renamed and grouped in kernel 6.9 for easier configuration. There are also in-flight proposals for per-cgroup mitigations.
For distros, which are very generic, all mitigations should be enabled. But users should have more granular control to adapt mitigations to their workloads, and that’s what Breno is calling for.
virtme-ng by Andrea Righi
virtme-ng is a tool to build and run kernels in a virtualized snapshot of the host system. It wraps QEMU, and also gives you (read-only) access to your host system. It is derived from virtme by Andy Lutomirski.
virtme-ng is not a virtualization manager for production workloads; it is specifically designed for kernel development. When Andrea started, he had his own QEMU wrapper script, since bare-metal testing is dangerous for the host system if it is not separated from the development system. The lack of a fast edit/compile/test cycle motivated Andrea to invest in this project.
“It’s not rocket science, it’s just a Python script”, Andrea says, humbly. It wraps QEMU, and uses virtiofs and overlayfs for copy-on-write snapshot access to the host filesystem. Recently, support for QEMU microvms was added, as well as a new init written in Rust: virtme-ng-init.
virtiofs uses FUSE to expose the host filesystem to guest VMs; overlayfs sits on top to provide write access without modifying the host system.
microvm is a QEMU machine type made for virtualization with a smaller memory and CPU footprint; it has no PCI, for example, but it does have networking, graphics and sound. virtme-ng-init replaces the original shell init script and is faster: it does not boot systemd in the VM, just a shell.
All of this combined brought booting a kernel, running a command, and exiting from 8.5s with virtme to less than 1s with virtme-ng. virtiofs has a huge performance impact: a simple git diff is 167 times faster than when using 9pfs to expose the host filesystem.
What can be done with virtme-ng? A first example is building a kernel and booting it, all in less than a minute and a half. Another example is running kselftests, the kernel test suite, in a guest in the same time it would take on the host, Andrea says. To test an LKML patch, Andrea shows that automatically downloading and applying the patch with b4 shazam takes about as long as booting the VM with the patched kernel (with a build in the middle).
virtme-ng can simulate different CPU and NUMA topologies very quickly. It supports standard UNIX pipelines, so commands inside and outside the VM can be connected with pipes. It can also download pre-built kernels from Ubuntu’s mainline build packages, and use those to boot any arbitrary kernel version. And it can be used with drgn, the kernel debugger.
virtme-ng can run graphical applications; Andrea showed an example with glxgears. He is still working on PCI passthrough to be able to debug GPU drivers, though. Since sound is supported, it’s possible to run graphical applications with audio; Andrea even showed Steam running a game inside a VM.
A clear goal of Andrea’s is to move the tool towards being a standard for CI/CD kernel testing. sched_ext, Linux netdev and Mutter all have CI/CD based on virtme-ng, for example. This is very powerful both for kernels and for applications themselves. Some people even use virtme-ng with USB passthrough to test webcams, enabling a regression test suite for the kernel drivers.
To conclude, the goal of virtme-ng is to lower the barrier to kernel development. In the future, Andrea wants to support the vsock console; adding systemd support to boot a full OS is on the radar, but non-trivial, and booting full qcow2 images could help with that. GPU (or PCI) passthrough, and testing confidential computing and secure boot, are also on the TODO list.
Hazard Pointers with Reference Counter by Mathieu Desnoyers
Mathieu completely changed his presentation topic, as he wanted to focus on his new work: Hazard Pointers with Reference Counter (hpref).
Existence guarantees are guarantees that a given pointer points to actual, valid data. Immutable data, RCU, hazard pointers and reference counters are all different existence-guarantee mechanisms.
RCU has a publication guarantee (stores are ordered before publishing the object), and a grace period guarantee.
Hazard Pointers (HP) have a similar publication guarantee, but work differently, by tracking in-use pointers. Readers use memory barriers and a retry dance; the core is based on Dekker’s algorithm.
For RCU, readers are fast and scale well; the tradeoff is that read-side critical sections have a higher memory footprint, since reclaim is delayed. For a linked list, all nodes have to be protected in the same critical section.
With HP, readers are also fast, and memory reclaim can be very fast. But for data traversal (e.g. a linked list), HP can cause issues, since a hazard pointer only protects a single node.
The refcount tradeoff is cache-line bouncing when readers update the refcount concurrently; it is more memory-efficient, though.
It’s possible to combine these mechanisms while preserving existence guarantees, which is what Mathieu did with hpref, combining HP and refcount. The HP is used as a fastpath, with a fallback to refcount when none of the 8 fixed per-CPU HP slots is available; readers can also promote their HP to a refcount if they intend to keep the reference for a long time.
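Mathieu’s prototype has its own API; the following is just a simplified sketch of the idea (all names invented, memory ordering and reclamation glossed over): try to grab one of the per-CPU HP slots with a Dekker-style re-check, fall back to a refcount when all slots are busy, and let long-term readers promote to a refcount to free the scarce slot.

    /* Simplified hpref sketch -- NOT the real API. C11 atomics. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define HPREF_SLOTS 8                  /* fixed HP slots per CPU */

    struct object {
        atomic_long refcount;              /* fallback / promotion target */
        /* ... payload ... */
    };

    /* One slot array per CPU in the real design; one here for brevity. */
    static _Atomic(struct object *) hp_slot[HPREF_SLOTS];

    struct hpref_ref {
        struct object *obj;
        _Atomic(struct object *) *slot;    /* NULL => holding a refcount */
    };

    static bool hpref_get(_Atomic(struct object *) *src, struct hpref_ref *ref)
    {
        struct object *obj = atomic_load(src);

        if (!obj)
            return false;
        for (int i = 0; i < HPREF_SLOTS; i++) {
            struct object *expect = NULL;
            if (!atomic_compare_exchange_strong(&hp_slot[i], &expect, obj))
                continue;                  /* slot busy, try the next one */
            /* Dekker-style re-check: either the updater sees our HP,
             * or we see the pointer change and retry. */
            if (atomic_load(src) == obj) {
                ref->obj = obj;
                ref->slot = &hp_slot[i];
                return true;               /* fastpath: protected by HP */
            }
            atomic_store(&hp_slot[i], NULL);
            return false;                  /* pointer moved: caller retries */
        }
        /* Slowpath: all 8 slots taken, fall back to the refcount. */
        atomic_fetch_add(&obj->refcount, 1);
        ref->obj = obj;
        ref->slot = NULL;
        return true;
    }

    /* Long-term readers promote HP -> refcount, freeing the slot. */
    static void hpref_promote(struct hpref_ref *ref)
    {
        if (ref->slot) {
            atomic_fetch_add(&ref->obj->refcount, 1);
            atomic_store(ref->slot, NULL);
            ref->slot = NULL;
        }
    }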
Mathieu made the initial prototype in userspace; its design looks a bit like RCU, he says: writers can synchronize against all hprefs, waiting for readers to pass through a quiescent state. Mathieu showed benchmarks comparing userspace implementations of RCU, hpref, mutex and rwlock. Read performance is very promising, and writes are already quite fast. And it will get even better on bigger machines, Mathieu says.
The next step is to port it to the kernel and make use of kernel facilities for better performance: the ability to disable preemption, hooks into the scheduler mechanisms, etc. It’s also not clear how well hpref applies to data traversal (linked lists), and this will be worked on. In fact, Mathieu had already started from his hotel room the previous evening, and showed a handwritten schema that might solve this issue.
All your memory are belong to… whom? by Vlastimil Babka
Vlastimil has talked before about deleting the SLAB and SLOB allocators in favor of the current (SLUB) allocator. But this talk is on a different memory-related subject.
The most basic command to find out what is happening on a system, memory-wise, is free. When running it, the sum of used and free memory is not equal to the total. That’s because a principle of Linux memory management is that “unused memory is wasted memory”, so there is a page cache of easily-droppable memory; used + available, however, is almost equal to total.
The next step is to look at /proc/meminfo, which lists many categories of memory areas. To dig through it, Vlastimil first drops a few categories that are uninteresting for the purpose of this presentation: HugeTlb, Cma, HardwareCorrupt, DirectMap.
So Vlastimil classified the remaining categories into Kernel, Userspace and Redundant (a subset of another category).
MemTotal is not really the total memory, but the total available to the kernel page allocator (buddy) after bootmem (the boot allocator) has reserved some pages. It changes with memory hotplug/hot-remove and memory ballooning.
In the kernel category, Zswap is the compressed userspace data, and Zswapped the size before compression.
Slab is the memory used by the slab allocator’s pages, with more details available in /proc/slabinfo.
KReclaimable used to be SReclaimable plus other users like the ION allocator (now gone from the kernel), but the field can’t be removed since it’s part of the ABI.
KernelStack is the total space used by all kernel threads’ stacks.
PageTables is all the page tables for userspace processes; if it’s unusually high, a process might be fragmenting its memory. SecPageTables is the secondary page tables, e.g. for VMs.
Committed_AS is the sum of the sizes of all accountable private mmap() areas of all processes. CommitLimit is the maximum theoretical memory usable by processes (roughly swap plus RAM scaled by vm.overcommit_ratio). In Vlastimil’s example it is lower than Committed_AS, but his kernel has overcommit enabled; the limit would only be enforced with overcommit disabled.
VmallocTotal is the total address space where vmalloc() allocations can be placed. VmallocUsed is more useful because it shows actual usage.
VmallocChunk and NFS_Unstable are also historical fields, hardcoded to 0, which can’t be removed.
For userspace: anonymous pages are not backed by files. The page cache represents pages backed by files on a FS or block device. Shared memory is a hybrid between the two.
Counters for shared memory are a bit inconsistent, with only a subset of the information being exposed by ShmemPmdMapped.
Active and Inactive represent a split of the LRU list by hotness, for reclaim-optimization purposes.
Summing fields up to MemTotal does not work: in Vlastimil’s example, 572MB is left unaccounted for. A new Unaccounted field was proposed in 2016, but got mixed feedback.
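One way to see the gap for yourself is to sum the big top-level categories and compare against MemTotal; a quick sketch (the field list is deliberately rough and incomplete, which is exactly why the numbers won’t add up):

    /* Sum major /proc/meminfo categories, print what's unaccounted. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        static const char *fields[] = { "MemFree:", "Buffers:", "Cached:",
            "AnonPages:", "Slab:", "KernelStack:", "PageTables:", NULL };
        char line[256], name[64];
        long kb, total = 0, accounted = 0;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f) { perror("/proc/meminfo"); return 1; }
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "%63s %ld", name, &kb) != 2)
                continue;
            if (!strcmp(name, "MemTotal:"))
                total = kb;
            for (int i = 0; fields[i]; i++)
                if (!strcmp(name, fields[i]))
                    accounted += kb;
        }
        fclose(f);
        printf("MemTotal %ld kB, summed %ld kB, unaccounted %ld kB\n",
               total, accounted, total - accounted);
        return 0;
    }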
It’s also impossible to sum fields up to MemAvailable; in the example, 460MB is left unaccounted for because of reclaimable memory reserves.
When looking at process memory with ps, what do the various fields mean? VSZ (or VIRT) is the total of mmap-ed areas, and may add up to Committed_AS.
RSS can be misleading, and Vlastimil shows a simple example: mmap 1GB, and RSS does not grow; write the first byte of every page, and RSS grows. After forking 10 child processes, the sum of RSS across all processes is 11GB, but total used memory did not grow. Only after each child writes to the memory does the used counter in the free output increase.
This is because mmap does not populate the page tables, so RSS does not increase; writing to the pages populates them with anonymous pages, so RSS does. fork is copy-on-write: each child’s RSS counter stays the same, but the memory is not duplicated until a child process writes to its pages; only then does the system memory used increase.
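Here is a reconstruction of that experiment (my sketch, not Vlastimil’s exact demo); run it and watch RSS from another terminal, e.g. with ps -o rss= -p <pid>, at each pause:

    /* mmap alone does not grow RSS; writing does; fork shares CoW. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SZ (1UL << 30)                 /* 1 GiB */

    static void checkpoint(const char *msg)
    {
        printf("%s -- press Enter to continue\n", msg);
        getchar();
    }

    int main(void)
    {
        char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        checkpoint("mapped 1 GiB: RSS unchanged (no page tables yet)");

        for (size_t i = 0; i < SZ; i += 4096)
            p[i] = 1;                      /* fault in every page */
        checkpoint("wrote every page: RSS grew by ~1 GiB");

        for (int i = 0; i < 10; i++)       /* 10 CoW-sharing children */
            if (fork() == 0) {
                pause();                   /* child keeps the mapping */
                _exit(0);
            }
        checkpoint("forked 10 children: RSS sums to ~11 GiB, but "
                   "'used' in free only grows once children write");
        return 0;
    }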
Maintaining more accurate counters all the time would be too expensive, Vlastimil says. Unfortunately, the OOM killer uses these same counters, relying on RSS, to make its decisions. Two more accurate counters exist: PSS (proportional set size, where each page is divided by the number of processes sharing it: the 1GB region mapped by 11 processes adds about 93MB to each one’s PSS) and USS (unique set size). But what if we paid the price for accuracy not all the time, but only on demand? The kernel can do that with /proc/$PID/smaps (and its smaps_rollup summary), though reading these files can be quite slow since it holds mmap_lock in the kernel.
The smem tool reads those files and exposes a friendlier interface; in the initial example, it reports accurate memory consumption at each of the steps.
How to debug missing kernel memory? There is a debug facility in the kernel to track page owners, CONFIG_PAGE_OWNER, which can help find leaks. For slab, there is also a helper called kmemleak, but it has some overhead; a new feature, CONFIG_MEM_ALLOC_PROFILING, was added to attempt to reduce that overhead. It’s also possible to use tracing to debug an ongoing memory leak: capture memory allocation and freeing events, and use the bcc-tools script memleak.py to analyze them. Vlastimil demoed a kernel module that leaks memory, and how to find the leak with memleak.py.
Finally, Vlastimil wants to disprove a common misconception: non-zero swap usage is not a bug. Since 5.9, the kernel may use swap even without memory pressure, because it got better at tracking which pages are actually used: it swaps out those that are never touched.
The charity auction has been pushed back to tomorrow morning. That’s it for today’s live blog; you can continue with tomorrow’s live blog!