Fast by Friday: Why kernel superpowers are essential – Brendan Gregg
What would it take to fix any performance issue in 5 days?
Improving performance is good for the environment, companies, etc. Brendan’s motto is that any performance issue reported on Monday should be solved by Friday. Here, “solved” means that the root cause has been identified. It’s both a vision and a way of thinking.
The focus here is on finding the root cause of the performance issue. For many people, this is the hardest part, and the one that matters.
Often, working on performance is an issue of time. Performance regressions aren’t tracked down, and hardware and software are getting more and more complex.
A common scenario at product vendors is that there’s often a tunable or configuration that might prevent the product from being the fastest.
The first step has to be done in the prior weeks: preparation. Everything must already be installed and working. Performance “crisis tools” should already be installed (sysperf, bpftrace, sysstat, ftrace etc.), and stack traces should work.
On Monday, the problem should be quantified using the problem statement method. It should be understood by the end of the day.
On Tuesday, checklists should help eliminate the subsystems the issue is not coming from. Most new observability tools need kernel superpowers such as ftrace and eBPF. eBPF tools are very helpful; future eBPF tools should live in the kernel tree itself, and be reviewed and ideally written by the developers themselves. Brendan gave the example of a ZFS L2ARC health tool, which he wrote. A health tool doesn’t have to be written by the developers, but if they at least share a document explaining how one would assess the health of their subsystem, others can use it as a guide to write one.
On Wednesday, profiling: CPU flame graphs, CPI flame graphs and off-CPU analysis.
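As a rough illustration of what feeds a CPU flame graph, the sketch below collapses stack samples into the "folded" text form (frames joined by semicolons, followed by a sample count) that flame graph tooling consumes. The sample data and the `fold_stacks` helper are invented for this example; in practice the samples would come from a profiler such as perf.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples into 'root;...;leaf count' lines.

    Each sample is a list of frame names ordered from root to leaf.
    """
    counts = Counter(";".join(frames) for frames in samples)
    # Sorted only to make the output stable and easy to read.
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples, invented for illustration.
samples = [
    ["main", "parse", "read_file"],
    ["main", "parse", "read_file"],
    ["main", "parse", "tokenize"],
    ["main", "render"],
]

folded = fold_stacks(samples)
for line in folded:
    print(line)
```

Identical stacks are merged and counted, so the widest boxes in the resulting graph correspond to the code paths that were sampled most often.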
On Thursday, latency analysis, logs (events…) and hardware counters with perf.
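Latency analysis usually looks at distributions rather than averages. The sketch below groups latencies into power-of-two buckets, similar in spirit to the histograms printed by tracing tools. The latency values and function names are made up for the example.

```python
from collections import Counter


def log2_bucket(value):
    """Return the [low, high) power-of-two bucket for a non-negative integer."""
    if value < 1:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, low * 2)


def latency_histogram(latencies_us):
    """Count latencies per power-of-two bucket, ordered by bucket."""
    counts = Counter(log2_bucket(v) for v in latencies_us)
    return dict(sorted(counts.items()))


# Hypothetical latencies in microseconds, invented for illustration.
latencies_us = [3, 5, 6, 7, 9, 120, 130, 4100]

hist = latency_histogram(latencies_us)
for (low, high), n in hist.items():
    print(f"[{low}, {high})".ljust(16), "#" * n)
```

With this data, the mean (about 547 µs) describes none of the requests, while the histogram shows a fast cluster, a slower group, and one outlier.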
Finally, on Friday: efficiency and algorithms. It’s the hardest part, Brendan says. This means checking the Big O complexity of the algorithms. From the audience, Matthew Wilcox says that even Big O notation is not sufficient: one should also look at constant overhead and cache performance, for example.
If the performance issue isn’t root-caused by Friday, it might be time to move on and accept that the code is efficient enough.
Installing the crisis tools by default is the easiest part. Stack walking should also work by default. Should frame pointers be disabled for performance reasons? Brendan says there’s probably a middle ground. Steven Rostedt, in the audience, says that SFrames can also help. For higher-level languages like Java, the stack walkers should be shipped as open source with the runtime code.
Without eBPF, Fast by Friday would not be possible: it would take longer.
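To make the point about constant overhead concrete, here is a minimal sketch that times an O(n²) insertion sort against an O(n log n) merge sort at several input sizes. Both implementations and the `compare` helper are written for this example. On small inputs the simpler algorithm often holds its own or wins, although the exact numbers depend on the machine and the interpreter.

```python
import random
import timeit


def insertion_sort(items):
    """O(n^2) in the worst case, but very little work per step."""
    out = list(items)
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def merge_sort(items):
    """O(n log n), but allocates and copies lists at every level."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def compare(n, repeats=20):
    """Return (insertion_seconds, merge_seconds) for a random input of size n."""
    data = [random.random() for _ in range(n)]
    t_ins = timeit.timeit(lambda: insertion_sort(data), number=repeats)
    t_mrg = timeit.timeit(lambda: merge_sort(data), number=repeats)
    return t_ins, t_mrg


if __name__ == "__main__":
    for n in (8, 64, 512):
        t_ins, t_mrg = compare(n)
        print(f"n={n:4d} insertion={t_ins:.5f}s merge={t_mrg:.5f}s")
```

Measuring at realistic input sizes, rather than reasoning from the asymptotic class alone, is the takeaway here.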
io_uring meets network – Pavel Begunkov
io_uring is not only for storage; it also supports networking, including sendmsg/recvmsg operations. In io_uring, submissions and completions are asynchronous.
In the early days, this relied on a worker pool, which can be slow. This is why one might need to do polling instead, with the appropriate io_uring operation. It was a first step for applications that wanted to convert an epoll workload to io_uring. The best option is instead to batch syscalls; otherwise, the app won’t get the best performance from io_uring.
Pavel says one should dig a little bit deeper into the application architecture in order to be able to use io_uring. As a first step, coupling sending the request and polling is simple, and makes for a simple API. io_uring supports MSG_WAITALL to work around short reads or writes.
Memory consumption can quickly grow because slow connections might lock buffers and consume too much memory. A solution is to use provided buffers to have a kernel buffer pool. This way the buffers are only locked when the data becomes available.
The first version of buffer pools is unofficially deprecated for performance reasons; the second one is the recommended one.
To improve polling, one can use multishot polling combined with multishot accept and recv with a buffer pool. Multishot requests can be cancelled, and can fail. The Completion Queue is finite, so having it overflow will create performance issues (allocations).
Fixed files optimize per-request file refcounting; it mostly makes sense for send requests.
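The memory argument can be sketched with a toy model. This is not the io_uring API: `BufferPool`, its methods and all the numbers below are invented to illustrate the idea that a buffer is taken from a shared pool only when data actually arrives, instead of one buffer being pinned per pending receive.

```python
class BufferPool:
    """Toy stand-in for a provided-buffer pool (not the real io_uring API)."""

    def __init__(self, count, size):
        self.size = size
        self.free = list(range(count))  # buffer IDs available for selection
        self.in_use = {}                # buffer ID -> received data

    def pick(self, payload):
        """Called when data arrives: a buffer is selected only now."""
        if not self.free:
            # Real io_uring completes the request with -ENOBUFS in this case.
            return None
        bid = self.free.pop()
        self.in_use[bid] = bytes(payload[: self.size])
        return bid

    def recycle(self, bid):
        """Hand the buffer back once the application has consumed the data."""
        del self.in_use[bid]
        self.free.append(bid)


# Hypothetical workload, invented for illustration.
BUF_SIZE = 4096
CONNECTIONS = 10_000   # mostly slow or idle connections
ACTIVE = 50            # connections that have data ready at the same time
POOL_BUFFERS = 256

# One buffer pinned per pending recv, whether or not data ever shows up.
per_request_bytes = CONNECTIONS * BUF_SIZE

# Shared pool: only connections with data consume a buffer.
pool = BufferPool(count=POOL_BUFFERS, size=BUF_SIZE)
picked = [pool.pick(b"x" * 100) for _ in range(ACTIVE)]
pooled_bytes = POOL_BUFFERS * BUF_SIZE

print(f"per-request buffers: {per_request_bytes} bytes")
print(f"shared pool:         {pooled_bytes} bytes, {len(pool.in_use)} in use")
```

In this model, the pool size is driven by how many connections are active at once rather than by how many are open, which is where the memory savings come from.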
Close can be done with io_uring as well. But for in-flight requests, it is recommended to use shutdown first to ensure cleanup of resources before the recv is done.
Zerocopy send is already available, and there are patches for multishot zerocopy recv.
Choosing the appropriate task run mode is important.
Using all those tips can help improve performance if assembled correctly.
Update on Landlock: Audit, Debugging and Metrics – Mickaël Salaün
Landlock has been in mainline since Linux 5.13, and is now in multiple Linux distributions. It’s a sandboxing solution for unprivileged processes.
Landlock is designed to protect against untrusted applications and exploitable bugs in trusted applications.
Landlock allows putting restrictions on filesystem access. Mickaël gave an example use case, where limiting the access of an application can limit the damage it can do in case of issues.
Landlock explicitly does not want to track access requests. It is a Linux Security Module (LSM), and LSMs can be stacked: once an action is denied, the following stacked LSMs won’t see any event.
But Landlock wants its own denials to be logged, along with the reason why. This helps app developers write a sandboxing policy, and helps power users, sysadmins, and security experts detect attacks.
When there are multiple, dynamic, nested and unprivileged security policies, tracking denials in logs might be an issue. Logs need to track the domain hierarchy of denied requests and follow the lifetime of rulesets. Those logs should not be available to unprivileged users. The audit framework is the perfect tool for this.
Mickaël did a demo with sandboxer, a tool from the Linux kernel samples directory, showing how denials would appear in the audit log.
In an upcoming patch series, new syscall flags will be added to opt in or opt out of logging, and to enable a permissive mode that logs all actions that would be denied.
In the future, a new filesystem to get a view of Landlock domain information could be added, once a few challenges are overcome, like unique and global IDs that currently leak information about layering. Checkpoint-Restore In Userspace (CRIU, a snapshotting solution) support will be worked on in the future.
On the roadmap, there are new access-control types for ioctl and TCP.
Linux and gaming: the road to performance – Andre Almeida
Linux has no roadmap, Andre says. It is driven by users. Different companies are pushing for different things. Cloud providers might care about storage performance, Android about responsiveness, and embedded about resource usage. The result is a kernel that is versatile.
Gaming can bring other things to the table.
Most games were developed for Windows; Valve changed the landscape as it brought Steam to Linux, invested in Proton (Wine), and shipped a Linux portable game console: Steam Deck. Valve sponsors the development of all the open source stack needed for gaming, from the kernel to Mesa, and many others.
Proton is a bundle of projects and patches, and relies mainly on Wine, which is a way to translate Win32 API calls to Linux.
Nowadays, the Linux kernel has seen a lot of gaming workloads, leading to a few patches, done by multiple companies (including Igalia, Andre’s employer).
The first controversial one was the addition of case-insensitive filesystems. This works with normalization at lookup time. It was initially done for ext4, then reused for F2FS. There is work in progress on case-insensitivity in tmpfs and bcachefs.
futex was the next one. A Win32 API for waiting on multiple locks was used a lot. futex2 came out of this as a completely new API, addressing this issue but also many others, like NUMA-awareness.
Some games might issue syscalls directly in assembly, bypassing the Win32 API. To emulate those, syscall user dispatch was added. It took the form of a new prctl, and is now even used by CRIU as well.
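As a loose illustration of normalization at lookup time, the sketch below stores names exactly as they were created but looks them up through a normalized, case-folded key. It uses Python's `unicodedata` as a stand-in; the kernel relies on its own UTF-8 tables, and the `fold_name` and `CaseInsensitiveDir` names are invented for this example.

```python
import unicodedata


def fold_name(name):
    """Build a lookup key: normalize, casefold, then normalize again."""
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)


class CaseInsensitiveDir:
    """Toy directory: case-preserving on create, case-insensitive on lookup."""

    def __init__(self):
        self._entries = {}  # folded key -> name as originally created

    def create(self, name):
        key = fold_name(name)
        if key in self._entries:
            raise FileExistsError(self._entries[key])
        self._entries[key] = name

    def lookup(self, name):
        """Return the stored name matching `name`, or None."""
        return self._entries.get(fold_name(name))


d = CaseInsensitiveDir()
d.create("Saves")
d.create("Caf\u00e9.txt")  # precomposed e-acute

print(d.lookup("SAVES"))
print(d.lookup("CAFE\u0301.TXT") == "Caf\u00e9.txt")  # E + combining acute
print(d.lookup("missing"))
```

Two names that differ only by case or by Unicode composition map to the same key, so a game that asks for `SAVES` finds the directory that was created as `Saves`.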
For VR gaming, many input peripherals might generate a lot of events in parallel. So HID throughput was a bottleneck with a single mutex used to protect the HID device table. This was replaced by a rwlock, leading to a 4x perf improvement.
Split locks across cache lines are possible on x86, and might be used to DoS the system, so Linux added a delay to mitigate potential attacks. But games might use this… so a new sysctl option was added to disable the mitigation.
A work in progress to improve performance is adaptive spinlocks, which would allow using spinlocks in userspace by relying on rseq (restartable sequences); the lock would only spin if the process is scheduled.
Pagemap scanning was added to help anti-cheat support that needs to know about recently modified memory pages; the dirty page mechanism isn’t as accurate as Windows’.
On Steam Deck, HDR work was done for AMD GPUs, and all device drivers are upstream or being upstreamed.
GPUs are complex and can crash; there is no standard way on Linux to recover and report a reset to userspace.
That’s it for today! See you tomorrow for the day 2 live blog!