Welcome back to the live blog for day 2 of Kernel Recipes 2024. You can read the live blog from yesterday morning and afternoon.
Emma, the artist behind the Kernel Recipes penguin mascot, has signed two original drawings that will be put up for sale at the charity auction later today.
Case Study: Concurrent Counting by Paul E. McKenney
Last year, Paul introduced the laws of physics and hardware basics as the first step of the recipe for concurrent counting. In today’s talk, he introduces the requirements, the next step for concurrent counting.
As he reminds people of the physics limitations for transmitting information, some helpful people are distributing a small number of candies to the audience.
Sometimes, complexity creeps in because of those physical limitations, and CPUs are getting much more complex. Applications need a lot of adapting to match that reality, unless they use an “ideal” view of the CPU and settle for an approximation of the actual hardware. For example, caches introduce tiering to work around memory speed limitations; they behave like simple hardware hash tables of values keyed by memory location.
Paul showed a small example of C code doing basic concurrent counting, with threads incrementing a shared variable and reading the count. Paul says that with 6 CPUs, this simplistic code will lose 87% of the count. This is due to caches and the way concurrent CPUs will simply overwrite an already incremented value with the same value.
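To make the failure mode concrete, here is a minimal userspace sketch (not the code from Paul’s slides) of the naive approach: several threads incrementing a plain shared variable, with most increments lost to racy read-modify-write sequences.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 6
#define NLOOPS   1000000

static unsigned long counter;              /* shared, not atomic */

static void *incrementer(void *arg)
{
	(void)arg;
	for (int i = 0; i < NLOOPS; i++)
		counter++;                 /* racy load/add/store */
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, incrementer, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	/* The printed count is typically far below NTHREADS * NLOOPS. */
	printf("counted %lu, expected %lu\n", counter,
	       (unsigned long)NTHREADS * NLOOPS);
	return 0;
}
```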
Another example is shown, using atomic_inc and atomic_read; but this example is slow, serialized, and still does not scale to many CPUs, Paul shows. What’s happening is that the cache line thrashes among CPUs, since only one CPU can modify the memory location at a time (and have it in its cache).
Nevertheless, with a small number of CPU threads (less than ~10) incrementing, it can work well enough.
And even with atomic reading, by the time the value is used, it’s already stale if the counter keeps being incremented by other CPUs.
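A hedged sketch of that atomic variant, using C11 atomics rather than the kernel’s atomic_inc/atomic_read: the count is now exact, but every increment fights over the same cache line, and the value read may already be stale when it is used.

```c
#include <stdatomic.h>

static atomic_ulong counter;

static inline void count_inc(void)
{
	/* Correct, but serializes: the cache line bounces between CPUs. */
	atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

static inline unsigned long count_read(void)
{
	/* Exact at the instant of the load, but possibly stale by the
	 * time the caller looks at it if increments keep arriving. */
	return atomic_load_explicit(&counter, memory_order_relaxed);
}
```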
One way to count better is to use per-CPU variables, where the increment is local to each CPU, and reading sums every CPU’s variable. In this case, updates are fast and accurate, but reads are very slow because they need to sum across all CPUs. How to improve on this?
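A minimal userspace sketch of this per-CPU idea, with one padded counter per thread (NTHREADS and the padding are assumptions of the sketch; a production version would use relaxed atomics or READ_ONCE()-style accesses for the cross-thread reads):

```c
#define NTHREADS 64

struct padded_counter {
	unsigned long count;
	char pad[64 - sizeof(unsigned long)];  /* one cache line per counter */
};

static struct padded_counter counters[NTHREADS];

/* Fast, uncontended: each thread touches only its own cache line. */
static inline void count_inc(int self)        /* self = calling thread's index */
{
	counters[self].count++;
}

/* Slow: the reader has to walk and sum every thread's counter. */
static unsigned long count_read(void)
{
	unsigned long sum = 0;

	for (int i = 0; i < NTHREADS; i++)
		sum += counters[i].count;
	return sum;
}
```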
A solution is to still have per-CPU variables, and add a global counter that is updated periodically. This way, the global value is a good approximation of the value from the last time period, and can be read instantly. The tradeoff is the need for an extra thread to update the counter: since it wakes up periodically, it’s not good for batteries. This solution is only useful when the global counter needs to be read often enough.
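Building on the per-thread sketch above, a rough illustration of that tradeoff: a housekeeping thread periodically folds the per-thread counters into a single global value (read_per_thread_sum() stands in for the slow summation from the previous sketch), so readers get an instant but slightly stale answer.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

unsigned long read_per_thread_sum(void);   /* the slow sum from the sketch above */

static atomic_ulong global_count;

/* Wakes up every 10 ms, which is what makes this approach battery-unfriendly. */
static void *aggregator(void *arg)
{
	(void)arg;
	for (;;) {
		atomic_store_explicit(&global_count, read_per_thread_sum(),
				      memory_order_relaxed);
		usleep(10 * 1000);
	}
	return NULL;
}

/* Readers just load the aggregate: instant, accurate to within one period. */
static inline unsigned long count_read_fast(void)
{
	return atomic_load_explicit(&global_count, memory_order_relaxed);
}
```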
Another approach is to use “statistical counters”; it’s widely used in the kernel with __this_cpu_inc.
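A kernel-style sketch of such a statistical counter (the variable and function names are illustrative; __this_cpu_inc() is the variant for contexts where preemption is already disabled):

```c
#include <linux/percpu.h>
#include <linux/cpumask.h>

static DEFINE_PER_CPU(unsigned long, my_events);

static inline void my_event(void)
{
	this_cpu_inc(my_events);        /* cheap, preempt-safe per-CPU increment */
}

static unsigned long my_event_count(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)      /* slow, approximate read-out */
		sum += per_cpu(my_events, cpu);
	return sum;
}
```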
Per-CPU refcounts can also be quite useful, to immediately know when the global sum of a per-CPU variable reaches zero.
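In the kernel this is provided by the percpu_ref API; a minimal sketch of its use (the release callback fires once the count drops to zero after the ref has been killed):

```c
#include <linux/percpu-refcount.h>

static struct percpu_ref my_ref;

static void my_release(struct percpu_ref *ref)
{
	/* Called once all references are gone after percpu_ref_kill(). */
}

static int my_init(void)
{
	return percpu_ref_init(&my_ref, my_release, 0, GFP_KERNEL);
}

/* Hot path: per-CPU increments/decrements, no shared cache line. */
static void my_use(void)
{
	percpu_ref_get(&my_ref);
	/* ... */
	percpu_ref_put(&my_ref);
}

/* Teardown: switch to atomic mode so reaching zero can be detected. */
static void my_teardown(void)
{
	percpu_ref_kill(&my_ref);
}
```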
A third approach used in the kernel is SRCU’s count of readers, used with srcu_read_lock. Paul recommends looking at the code to understand the implementation details.
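For reference, the reader-side pattern looks like this (a minimal sketch; the per-CPU bookkeeping is hidden behind the srcu_read_lock()/srcu_read_unlock() pair):

```c
#include <linux/srcu.h>

DEFINE_SRCU(my_srcu);

static void my_reader(void)
{
	int idx = srcu_read_lock(&my_srcu);   /* per-CPU count of readers */
	/* ... read-side critical section ... */
	srcu_read_unlock(&my_srcu, idx);
}

static void my_updater(void)
{
	/* Wait for all pre-existing readers to finish. */
	synchronize_srcu(&my_srcu);
}
```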
In conclusion, Paul says it’s “Engineering, not science”: it’s all about tradeoffs, and the way to count depends on the use case.
To end the talk, Paul had everyone do a practical application: counting every candy in the room, organizing however they saw fit. After a few minutes of counting, Paul conceded it might not be a fair question. But if one is only able to answer fair questions, there are a lot of important questions that would be left unanswered.
Reader-Writer Semaphores by Neeraj Upadhyay
A reader-writer semaphore (rwsem) can have multiple concurrent readers, but only one writer at a time. Neeraj starts by showing the metadata of the semaphore inside the 64 bits used to store its value. The three lower bits are boolean values used for the state machine: entering the critical section, locking, etc.
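For orientation, the count word’s layout looks roughly like this (paraphrased from kernel/locking/rwsem.c; exact names and values may differ between kernel versions):

```c
/* Low bits are flags, the reader count lives in the upper bits. */
#define RWSEM_WRITER_LOCKED	(1UL << 0)	/* a writer holds the lock */
#define RWSEM_FLAG_WAITERS	(1UL << 1)	/* waiters are queued */
#define RWSEM_FLAG_HANDOFF	(1UL << 2)	/* lock handoff requested */
#define RWSEM_READER_SHIFT	8
#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)	/* one reader */
```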
There is other metadata, like whether the rwsem is SPINNABLE. If it is, writers waiting for the lock will use optimistic spinning with an Optimistic Spin Queue (osq) lock.
For reading, the flow has multiple steps: a reader attempts to take the lock and, if that succeeds, enters the critical section. If it fails, it enters a wait-lock state. Then a lock-stealing mechanism kicks in: if the first waiter in the queue is not a writer, all waiting readers will be woken up and enter their critical sections.
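The API that drives this flow is the familiar down_read()/up_read() pair (and down_write()/up_write() for writers); a minimal usage sketch:

```c
#include <linux/rwsem.h>

static DECLARE_RWSEM(my_rwsem);

static void my_read_path(void)
{
	down_read(&my_rwsem);      /* many readers may hold the lock at once */
	/* ... read shared data ... */
	up_read(&my_rwsem);
}

static void my_write_path(void)
{
	down_write(&my_rwsem);     /* exclusive: one writer, no readers */
	/* ... modify shared data ... */
	up_write(&my_rwsem);
}
```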
There is then a whole mechanism to wake-mark waiters, and we invite you to look at the stream to follow the complex state machines.
The up-read flow is then described; it goes through multiple steps to clear the bits, wake up waiters, and release the appropriate locks.
There are a few conditions that need to be fulfilled for optimistic spinning; multiple flags are checked, depending on whether the lock can spin, whether the owner is a writer that is not running, and whether it has already spun for 1ms, after which it stops spinning. The full state machine of the optimistic spin lock is then shown, with a step where the code attempts to prevent RT livelocks. Neeraj followed with the full state machine for the writer wait flow.
The flows for per-CPU rwsems are then shown. For the per-CPU down-read flow, there is a fast path (thanks to RCU) and a slow path. The up-read path is quite simple.
The per-CPU down-write and up-write flows also use RCU sync, with its enter and exit operations respectively. Neeraj describes how RCU sync works, with its various grace-period states: idle, passed, exit, replay and exit, depending on which state of the lock we are at. Again, the sequence diagram and state machine are shown and are best followed on the stream to understand the talk.
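A minimal usage sketch of the per-CPU variant (the cheap reader-side fast path is what the RCU machinery above makes possible):

```c
#include <linux/percpu-rwsem.h>

static DEFINE_STATIC_PERCPU_RWSEM(my_percpu_rwsem);

static void my_read_path(void)
{
	percpu_down_read(&my_percpu_rwsem);   /* fast path: per-CPU count, no shared line */
	/* ... */
	percpu_up_read(&my_percpu_rwsem);
}

static void my_write_path(void)
{
	/* Forces readers onto the slow path via rcu_sync, then excludes them. */
	percpu_down_write(&my_percpu_rwsem);
	/* ... */
	percpu_up_write(&my_percpu_rwsem);
}
```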
Neeraj then describes issues found related to rwsem, starting with KernFS contention. KernFS uses a global rwsem, and it causes contention issues. One approach was to use a per-fs percpu rwsem, but it caused issues during boot.
To study the issue, Neeraj created a simple test-case module for the read case, and showed that it does not scale, particularly on recent CPUs which can have 128 cores and 256 threads on a single socket. For the write case, he designed another test, comparing percpu rwsem, rwsem, hazptr scanning, and SRCU, each with configuration variants.
Dual level of task scheduling for VM workloads by Himadri Chhaya-Shailesh
The premise of this talk is the split between guest schedulers in VMs scheduling apps, and the host scheduler scheduling vCPUs: a “semantic gap” appears because they both have an incomplete view of the system.
There are many consequences of vCPUs being interrupted because of host scheduling, well described in the literature. The goal is to achieve bare-metal-like performance for VM workloads; but the scheduler is optimized for running on bare metal using hardware information, and VM workloads don’t have all this information, as not all of it is exposed by the hosts.
A first, very obvious solution to this issue is to not over-provision: pin vCPUs to CPU cores; but it’s an expensive solution for customers and cloud providers. And the additional hardware information is still not exposed.
In the kernel, KVM exposes “steal time” to the guest, to tell the VM how much time it spent off-CPU. Some CPUs can also detect excessive spinning in a guest and the host can use that information to unschedule them.
But all of these are partial solutions.
Multiple people have attempted to address this, like the CPS paper, or the Dynamic vcpu patches sent to LKML.
Himadri says we could take it a step further by taking the point of view of userspace. Apps can compute how many actual CPUs they get by measuring parallel compute performance. But the host could also aggregate how many physical CPUs (pCPUs) are actually used at a given time, and tell that to the guest: this way, userspace apps can reduce the number of parallel threads to make better use of the available hardware.
In order to analyze this, Himadri designed a test setup, with a guest running an app that uses OpenMP for parallel compute. First, it exposes the problem by showing that performance degrades as the number of pCPUs lowers with regard to vCPUs. With OpenMP, there is an OMP_DYNAMIC variable that can adjust the number of threads automatically to optimize system resource usage.
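For illustration, a small OpenMP program where the runtime is allowed to pick the team size dynamically (calling omp_set_dynamic(1) in code has the same effect as exporting OMP_DYNAMIC=true):

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
	omp_set_dynamic(1);    /* let the runtime shrink the team under load */

	#pragma omp parallel
	{
		#pragma omp single
		printf("runtime chose %d threads\n", omp_get_num_threads());
	}
	return 0;
}
```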
Then, the libgomp (OpenMP) library was patched to use a new syscall that reads the number of actual pCPUs. The host kernel was similarly patched to do sample-based measurements of scheduling decisions: when a vCPU becomes idle, and when it becomes busy again. A per-VM global variable tracking the number of pCPUs is added. This information is exposed to guests using a prototype IVSHMEM virtual shared-memory device, which the guest kernel then exposes to the guest apps.
The results are promising, Himadri says; this provides a new approach of cooperative para-virtualized scheduling to address the semantic gap. Himadri calls for standardizing the interface to share information between hosts and guests with shared memory.
In the audience, Mathieu Desnoyers proposes using the Linux scheduler’s concurrency id and exposing that to kvm. David Vernet proposes cooperation between sched_ext and kvm.
That’s it for this morning’s live blog! Continue over to this afternoon’s live blog.