pidfds: Process file descriptors on Linux — Christian Brauner

A pidfd is a file descriptor that refers to a process. Unlike a PID, it is a stable, private handle that is guaranteed to keep referring to the same process.

Inside the kernel, pidfds build on a pre-existing stable process handle, struct pid. They do not reference task_struct because it takes up too much memory.

The main goal was to avoid the pitfalls of PID recycling on high-pressure systems, which has caused a few security issues over the years. PID numbers get reused once the pool (usually 32k) has been exhausted, so a PID held by one program may silently end up referring to a different process, which causes race conditions.

Another reason is linked to the way shared libraries want to handle subprocesses without interfering with the process management (SIGCHLD) of the programs that use them. pidfds are also helpful for handing process management over to a non-parent through fd-passing. FDs are well known, and many userspace programs know how to handle them.

There are already many userspace programs that care about pidfds: dbus, qt, systemd, criu, etc. This isn’t the first time an OS has implemented file descriptors for processes: Illumos has a userspace emulation of this feature, and FreeBSD has pdfork(), pdgetpid() and pdkill(). There were also a few earlier proposals for Linux that weren’t merged, forkfd and CLONE_FD, both of which Christian looked at to understand why they didn’t land.

In Linux 5.1, signal support with pidfds was added for reliably sending signals to processes. Lots of people had opinions on this; in particular, many wanted to be able to use pidfds with files from /proc. This is now possible, but not completely race-free, Christian says. It’s not an ideal solution, but it works.

In Linux 5.2, the CLONE_PIDFD flag was added to obtain race-free pidfds at process creation time. The implementation was controversial at first because Linus wanted fds from /proc, while Christian was pushing for anonymous inodes. After both were implemented, it was obvious that the latter would be simpler and easier to maintain. A nice bonus of pidfds is that they are O_CLOEXEC by default, meaning they are closed automatically when the process calls an exec()-family syscall.

In Linux 5.3, the clone3 syscall was added, and it has a dedicated argument for pidfd instead of abusing other arguments like clone2. Polling support was also merged to get process exit notification for non-parents that have a pidfd.

In 5.3, pidfd_open() was also added to obtain a pidfd for a process that already exists (one created without CLONE_PIDFD). In 5.4, a new P_PIDFD flag was added to waitid() to wait for a process through a pidfd.
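The pieces described above (pidfd_open(), polling for exit, and waitid() with P_PIDFD) can be combined roughly as follows. This is a minimal sketch, not code from the talk: the helper name wait_child_via_pidfd is hypothetical, the raw syscall is used because older C libraries have no wrapper, and the numeric fallbacks are assumptions taken from the Linux UAPI headers. On kernels without these features the helper simply returns -1.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older headers; values assumed from the Linux UAPI headers. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#define SKETCH_P_PIDFD 3 /* P_PIDFD in <linux/wait.h> */

/*
 * Hypothetical helper: fork a child that exits with `code`, then wait for it
 * through a pidfd. Returns the child's exit code, or -1 if the running
 * kernel lacks pidfd_open() or P_PIDFD support.
 */
int wait_child_via_pidfd(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code); /* child: terminate immediately */

    /* We are the parent and have not reaped the child yet, so its PID
     * cannot have been recycled when we turn it into a pidfd. */
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        waitpid(pid, NULL, 0); /* no pidfd support: reap the classic way */
        return -1;
    }

    int result = -1;
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

    /* The pidfd becomes readable once the process has exited. */
    if (poll(&pfd, 1, -1) == 1 && (pfd.revents & POLLIN)) {
        siginfo_t info = { 0 };
        if (waitid((idtype_t)SKETCH_P_PIDFD, (id_t)pidfd, &info, WEXITED) == 0 &&
            info.si_code == CLD_EXITED)
            result = info.si_status;
    }

    if (result < 0)
        waitpid(pid, NULL, 0); /* P_PIDFD unavailable: fall back to the PID */
    close(pidfd);
    return result;
}
```

Calling wait_child_via_pidfd(7) should yield 7 on a kernel that supports both features. A non-parent holding the same pidfd could use the poll() step in the same way, but only the parent can collect the exit status.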

Current work in progress includes kill-on-close semantics, which send a SIGKILL to a process when the last fd referencing it is closed. Another idea being considered is exclusive waiting, which would hide a process from generic wait requests; it would be a flag at clone time. Christian is also thinking about a way to use pidfds for some namespace management tasks, but the scope and security impact aren’t clear enough yet.

To conclude, Christian says that resilience is important: understanding what reviewers want, what is important, and what is bikeshedding is a critical skill.

Keeping the kernel relevant with BPF — David Miller

The kernel has changed over the years, David says. It used to be that known breaking changes could be merged quite quickly, but this doesn’t happen anymore.

Nowadays, one should always explain the use cases and think hard about the API impact and its extensibility. When writing a kernel change, it always takes time to write the tests, take reviews into account, and iterate.

David argues that in order to propose syscalls and design them properly, you must be arrogant. You are putting the users inside the boundaries of your design, and making choices for them. This isn’t necessarily what users want: they always want maximum flexibility, fast iteration, or to have an arbitrary policy for example.

Describing BPF is complex, because it pushes this maximum complexity to users. BPF provides a mechanism by which users can solve their problem more freely. BPF also seems contradictory because it gives users maximum freedom, but it also provides containment and safety.

That’s why, David says, people understanding BPF should go and speak about it so that everyone understands its impact on kernel development and user freedom.

Hunting and fixing bugs all over the Linux kernel — Gustavo A. R. Silva

Gustavo started contributing to the kernel in 2013, which is quite recent, he says. He has since been fixing bugs all over the kernel.

Part of his work starts with analyzing the reports from the proprietary static analyzer Coverity. It produces a lot of false positives on the kernel, which take Gustavo a lot of time to review. He has still committed more than 500 fixes thanks to Coverity over the years. A few high-impact issues were found with it, like uninitialized memory use or out-of-bounds accesses.

He says he looks at every issue; the scan used to run on weekly tagged -rc releases, but now he has access to daily scans, including from linux-next, so that bugs are caught before reaching mainline.

He has many examples of fixed bugs. The first one is a wrong variable type: a counter’s maximum value was raised from 100 to 1000, but the counter was a uint8_t, which cannot hold values above 255. The fix simply changed the type to uint16_t.
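The effect of that type mismatch can be reproduced in a few lines. The sketch below is illustrative only and is not the actual kernel code: the function names are hypothetical, and each function reports whether a counter of the given type can count up to the requested limit without wrapping around.

```c
#include <stdint.h>

/* Returns 1 if a uint8_t counter can count up to `limit`, 0 if it wraps
 * around to zero first (which happens for any limit above 255). */
int u8_counter_reaches(unsigned limit)
{
    uint8_t count = 0;
    for (unsigned step = 0; step < limit; step++) {
        count++;
        if (count == 0)
            return 0; /* wrapped: the limit is unreachable */
    }
    return count == limit;
}

/* Same logic with the wider type used by the fix. */
int u16_counter_reaches(unsigned limit)
{
    uint16_t count = 0;
    for (unsigned step = 0; step < limit; step++) {
        count++;
        if (count == 0)
            return 0;
    }
    return count == limit;
}
```

With a limit of 100 both variants succeed, which is why the bug only appeared once the limit was raised to 1000.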

Many fixes look trivial in retrospect, but they catch real bugs. Another one was an inconsistent pairing of IS_ERR and PTR_ERR, where the two were applied to different variables. It was caught with Coverity, but Coccinelle can also catch this type of issue.
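To illustrate this class of bug, here is a small userspace sketch. The three helpers are simplified stand-ins for the kernel ones in <linux/err.h>, and the setup functions and their parameter names are hypothetical, not taken from any of the patches mentioned in the talk.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel helpers: an error code is
 * encoded as a pointer value in the last 4095 addresses. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Buggy pattern: the check is done on one pointer, but the error code is
 * extracted from another, so a valid address is returned as an "error". */
long setup_buggy(void *clk, void *regs)
{
    if (IS_ERR(clk))
        return PTR_ERR(regs); /* wrong variable */
    return 0;
}

/* Fixed pattern: IS_ERR and PTR_ERR are applied to the same pointer. */
long setup_fixed(void *clk, void *regs)
{
    (void)regs;
    if (IS_ERR(clk))
        return PTR_ERR(clk);
    return 0;
}
```

Passing ERR_PTR(-ENOMEM) as clk makes setup_fixed return -ENOMEM, while setup_buggy returns whatever address regs happens to hold.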

Gustavo has found issues in linux-next before they hit mainline, such as a wrong use of bitwise operators. While investigating missing fallthrough annotations, he found missing returns. He has found resource leaks (a missing goto). He also found a seven-year-old bug in a perf test and an eight-year-old bug in a USB gadget driver.

Gustavo has collaborated with the kernel self-protection project to help remove Variable Length Arrays. He also helped introduce the struct_size helper, which computes, at allocation time, the size of a structure that ends with a variable-size tail array.

This struct_size helper macro helped not only simplify the code, but also catch buffer overflows or undefined behaviors in some places.
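The idea behind such a helper can be sketched in userspace as follows. This is written in the spirit of struct_size and is not the kernel macro itself: struct packet, flex_size and packet_alloc are hypothetical names, and the overflow checks rely on the GCC/Clang builtins __builtin_mul_overflow and __builtin_add_overflow. On overflow the size saturates to SIZE_MAX, so the allocation fails instead of returning a buffer that is too small.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structure with a variable-size tail array. */
struct packet {
    size_t count;
    uint32_t items[];
};

/* Computes header + elem * n, saturating to SIZE_MAX on overflow. */
size_t flex_size(size_t header, size_t elem, size_t n)
{
    size_t bytes;
    if (__builtin_mul_overflow(elem, n, &bytes) ||
        __builtin_add_overflow(bytes, header, &bytes))
        return SIZE_MAX;
    return bytes;
}

/* Allocates a packet with room for n items; returns NULL on failure. */
struct packet *packet_alloc(size_t n)
{
    size_t size = flex_size(sizeof(struct packet), sizeof(uint32_t), n);
    struct packet *p = malloc(size);
    if (p)
        p->count = n;
    return p;
}
```

An open-coded sizeof(*p) + n * sizeof(uint32_t) would silently wrap for a huge n, which is the kind of overflow the saturating computation avoids.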

A big piece of work Gustavo did this year was fixing the -Wimplicit-fallthrough warnings. This warning helps find places where a break has been forgotten in a switch. If the fallthrough is intentional, a comment should mark it to silence the warning. Gustavo reviewed thousands of cases (2300 initially just for x86) throughout the code and sent numerous patches, sometimes for quite old bugs. He initially encountered a lot of resistance, but once he started having successes, more maintainers understood the need to fix the warnings, and his patches went in more easily.

He spent part of his time working on this for a long time, but he says it was all worth it in the end: the warning has been enabled by default in Linux 5.3, and people have already started catching bugs with it.

Gustavo says he needed to have his own tree to fix the last 10% of issues, where patches were being ignored, sometimes deliberately.
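As an illustration of what the warning catches, consider the sketch below. The role and permission names are hypothetical. Compiled with GCC 7 or later and -Wimplicit-fallthrough, the first function should trigger the warning at the unmarked fallthrough, while the second one, where the intentional fallthrough carries a marker comment, should not.

```c
enum role { ROLE_GUEST, ROLE_USER, ROLE_ADMIN };

#define CAN_READ  1
#define CAN_WRITE 2
#define CAN_ADMIN 4

/* Buggy: the break after the guest case was forgotten, so guests
 * accidentally receive write access as well. */
int perms_buggy(enum role r)
{
    int perms = 0;
    switch (r) {
    case ROLE_GUEST:
        perms = CAN_READ;
        /* BUG: forgotten break */
    case ROLE_USER:
        perms |= CAN_READ | CAN_WRITE;
        break;
    case ROLE_ADMIN:
        perms = CAN_READ | CAN_WRITE | CAN_ADMIN;
        break;
    }
    return perms;
}

/* Fixed: the guest case ends with a break, and the one intentional
 * fallthrough (admins also get the user permissions) is marked. */
int perms_fixed(enum role r)
{
    int perms = 0;
    switch (r) {
    case ROLE_GUEST:
        perms = CAN_READ;
        break;
    case ROLE_ADMIN:
        perms = CAN_ADMIN;
        /* fall through */
    case ROLE_USER:
        perms |= CAN_READ | CAN_WRITE;
        break;
    }
    return perms;
}
```

Both functions compile and run, which is what makes this kind of bug easy to miss without the warning.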

That’s it for this morning! Continue with the afternoon report.