This is the live stream for this morning.
Single board computers made possible by the community – Da Xue
In 2016, Da had a crazy dream: he wanted to invest in a better future, with an ecosystem built on upstream development. In 2017, he found that what Neil from BayLibre was doing was awesome, and decided to partner with them for the LePotato board. In 2018, the upstreaming of the video side of Amlogic SoCs with the Bootlin Kickstarter was very interesting.
The goal is to get an upstream stack from the bottom up, starting with the lower layers: ARM Trusted Firmware (ATF), edk2, OP-TEE, U-Boot and Linux. Da showed a demo of the Renegade Elite board booting a standard openSUSE distribution.
Once upon an API – Michael Kerrisk
Michael cares about APIs, he says, and he wants them done properly. It all started at the beginning of time(), when SIGCHLD was sent to a parent process when a child process terminated. Then in 1997, someone decided that it would be nice to have the reverse, a signal sent to the child when its parent terminates, so they added a prctl flag to do this. And they did add documentation for this feature.
But there were missing pieces; for example, it wasn’t possible to discover whether this option was set. So another prctl flag was added to get the value of the option. And that’s where inconsistencies started. PR_GET_PDEATHSIG was a getter that returned its result differently from PR_GET_DUMPABLE: as a value stored through the second argument instead of as the function result. Nowadays, these inconsistencies have become the norm.
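The difference between the two getters can be sketched from user space. This is a minimal, Linux-only sketch that calls prctl through ctypes; the helper names (`set_pdeathsig`, `get_pdeathsig`, `get_dumpable`) are made up for illustration, and the option numbers are the ones from `<linux/prctl.h>`.

```python
import ctypes
import signal

# Option numbers as defined in <linux/prctl.h>.
PR_SET_PDEATHSIG = 1
PR_GET_PDEATHSIG = 2
PR_GET_DUMPABLE = 3

# Linux-only: CDLL(None) exposes the C library already loaded in the process.
libc = ctypes.CDLL(None, use_errno=True)
ZERO = ctypes.c_ulong(0)


def set_pdeathsig(signum):
    """Ask the kernel to send `signum` to this process when its parent dies."""
    arg = ctypes.c_ulong(int(signum))
    if libc.prctl(PR_SET_PDEATHSIG, arg, ZERO, ZERO, ZERO) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def get_pdeathsig():
    """Getter style 1: the value comes back through a pointer in arg2."""
    value = ctypes.c_int(0)
    if libc.prctl(PR_GET_PDEATHSIG, ctypes.byref(value), ZERO, ZERO, ZERO) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_PDEATHSIG) failed")
    return value.value


def get_dumpable():
    """Getter style 2: the value is the return value of prctl itself."""
    result = libc.prctl(PR_GET_DUMPABLE, ZERO, ZERO, ZERO, ZERO)
    if result < 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_DUMPABLE) failed")
    return result


if __name__ == "__main__":
    set_pdeathsig(signal.SIGTERM)
    print("parent-death signal:", get_pdeathsig())
    print("dumpable flag:", get_dumpable())
    set_pdeathsig(0)  # 0 clears the setting again
```

Two getters on the same system call need two different calling conventions, which is exactly the kind of inconsistency a wrapper author has to discover and work around.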
Another missing piece: the documentation that was added for the flag mentioned the option being cleared on fork(); but what about execve? It wasn’t mentioned until the documentation was updated in 2012. Would anyone have noticed the associated security vulnerability if this had been documented earlier? Indeed, if a suid binary was execve-ed, it was then possible to send it signals, which wouldn’t be allowed otherwise. This was fixed in 2007, 10 years after the feature was introduced.
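The documented fork() behavior is easy to observe. The following is a small sketch, assuming Linux and the same ctypes approach to prctl; the function name `pdeathsig_across_fork` is hypothetical. It sets a parent-death signal, forks, and reports the value seen on each side of the fork.

```python
import ctypes
import os
import signal

# Option numbers as defined in <linux/prctl.h>.
PR_SET_PDEATHSIG = 1
PR_GET_PDEATHSIG = 2

libc = ctypes.CDLL(None, use_errno=True)  # Linux-only
ZERO = ctypes.c_ulong(0)


def _set(signum):
    arg = ctypes.c_ulong(int(signum))
    if libc.prctl(PR_SET_PDEATHSIG, arg, ZERO, ZERO, ZERO) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def _get():
    value = ctypes.c_int(0)
    if libc.prctl(PR_GET_PDEATHSIG, ctypes.byref(value), ZERO, ZERO, ZERO) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_PDEATHSIG) failed")
    return value.value


def pdeathsig_across_fork(signum=signal.SIGTERM):
    """Return (value in the parent, value in a freshly forked child)."""
    _set(signum)
    parent_value = _get()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report our own setting through the pipe, then exit at once.
        try:
            os.close(read_fd)
            os.write(write_fd, bytes([_get()]))
        finally:
            os._exit(0)
    os.close(write_fd)
    data = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    _set(0)  # restore the default in the calling process
    child_value = data[0] if data else None
    return parent_value, child_value


if __name__ == "__main__":
    print(pdeathsig_across_fork())
```

The sketch only covers fork(). What happens across execve, in particular for set-user-ID binaries, is the part that went undocumented for years.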
In another example, a design flaw in an API was reported 15 years after it was introduced. Since it was part of the uAPI, it could not be fixed, so the quirk was simply added to the documentation.
Back to the child signal: what happens after the subreaper feature is added? A subreaper is a parent that wants its descendants to be re-parented to it. But then a child can have a series of parents, if there are multiple subreapers in a hierarchy. This was properly documented later.
What now if the subreaper process is multi-threaded? What happens when a thread of the subreaper parent terminates? Child processes are usually parented by individual threads. Reparenting can therefore also happen inside a given subreaper process, when a parent thread terminates.
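The basic subreaper behavior can be observed with a short experiment. This sketch assumes Linux 3.4 or later and a single-threaded caller; the function name `orphan_is_reparented_to_subreaper` is made up for illustration. The calling process marks itself as a subreaper, creates a child that immediately exits, and checks which parent the orphaned grandchild ends up with.

```python
import ctypes
import os
import time

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>

libc = ctypes.CDLL(None, use_errno=True)  # Linux-only
ZERO = ctypes.c_ulong(0)


def _set_subreaper(enabled):
    arg = ctypes.c_ulong(1 if enabled else 0)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, arg, ZERO, ZERO, ZERO) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_CHILD_SUBREAPER) failed")


def orphan_is_reparented_to_subreaper(timeout=5.0):
    """Return True if an orphaned grandchild is re-parented to this process."""
    me = os.getpid()
    _set_subreaper(True)
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        try:
            os.close(read_fd)
            middle = os.getpid()
            if os.fork() == 0:
                # Grandchild: wait until the middle process is gone and the
                # kernel has picked a new parent, then report it.
                deadline = time.monotonic() + timeout
                while os.getppid() == middle and time.monotonic() < deadline:
                    time.sleep(0.01)
                os.write(write_fd, f"{os.getppid()} {os.getpid()}".encode())
        finally:
            # The middle process exits right away, orphaning the grandchild.
            os._exit(0)
    os.close(write_fd)
    os.waitpid(child, 0)  # reap the middle process
    data = b""
    while True:
        chunk = os.read(read_fd, 64)
        if not chunk:
            break
        data += chunk
    os.close(read_fd)
    new_parent, grandchild = (int(field) for field in data.split())
    if new_parent == me:
        os.waitpid(grandchild, 0)  # the orphan is now ours to reap
    _set_subreaper(False)
    return new_parent == me


if __name__ == "__main__":
    print(orphan_is_reparented_to_subreaper())
```

The experiment deliberately stays single-threaded. The multi-threaded cases discussed in the talk are where the observable behavior starts to depend on kernel internals.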
Should these intricacies actually be documented? Michael thinks so, because users will eventually depend on a feature, whether or not it is documented.
In this case, the feature accidentally exposes internal Linux kernel behavior: a child gets multiple signals if the parent process is multi-threaded.
What went wrong in the end? Multiple things. There was no single owner of the API, or interface. There was not enough documentation, and the interface evolved over time as new features (subreapers) were added.
Who owns the interface contract? Should it be the kernel developers, since they own the code? But sometimes, behavior might deviate from the actual intention (it’s a bug). Should it be the glibc developers, since they write the actual wrappers around it? But those are pretty thin. Should it be the documentation, man-pages developers (Michael)? But the documentation might be incomplete, or non-existent. Or should it be the user-space developers themselves? Given enough time, they will find intricacies of the API, and invent new use-cases based on them.
So, Michael wonders, if things had been documented upfront, would things have been different? He thinks so. Whether it comes in the form of a man-page patch or a very well written commit message, documentation is a time multiplier: it saves everyone time.
Most of the time, the issue is the lack of a big-picture view: de-centralized design does not work well, and this example is a perfect illustration. Michael thinks that there should be one or more paid kernel user-space API maintainers.
We won’t live forever, what does that mean? – David Miller
As a maintainer, you need to ensure continuity of the project.
In October 2019, Dave suffered a stroke, and it was a wake-up call, he says. You have to be able to delegate.
But maintainers don’t grow on trees, and you can’t find them overnight. You need to groom them; this needs to be thought through well in advance. And it brings other advantages: once you delegate, you get more breathing room.
Plan for succession, David says. “A goal without a plan is just a wish”. Find people you trust, and delegate to them.
After a question from the room, David says he’s very happy with his co-maintainers now, and trusts Jakub could become maintainer with little change to the workflows. Co-maintainers also need to be recognized in the community, have the ability to push back and be respected when making decisions.
On grooming, there are multiple steps; one can become a trusted reviewer without being a maintainer. There were also discussions in the room on how to find out whether younger people are interested in this field. Some thought there weren’t enough, but others pointed out that there were quite a few younger people in the room.
The story of BPF – Alexei Starovoitov
BPF is an instruction set designed 30 years ago. In 2011, a startup Alexei worked for wanted to revolutionize Software Defined Networking (SDN). SDN is the equivalent of VMs, but for networking.
The traditional approach for VMs was to use different kernel modules (kvm) for each feature; but at Alexei’s startup, Plumgrid, they decided to do it differently, with a single iovisor.ko module for switches, routers, etc. that would load native binary code dynamically from userspace.
But after a critical, hard-to-debug, x86-dependent issue, Alexei decided that the injected binary code needed to be verified. There were multiple x86-based iterations for an instruction set. They were always designed to be JIT-first, not interpreted. Then they wanted to have their solution upstream. So Alexei started talking with people, but this new instruction set looked scary to compiler developers, and even scarier to kernel maintainers.
To work around this, they decided to make it familiar. That’s how the new instruction set was designed to be as close as possible to the original BPF, and called “extended” BPF (eBPF). In reality, there is not much in common with BPF, apart from the opcodes.
Before submitting anything, Alexei subscribed to the netdev@ mailing list and read all the messages for 6 months, in order to identify all the key people. His first kernel patch wasn’t even related to eBPF: he moved the bpf module free into a worker. He continued for a while with a few patches fixing issues, to “build his reputation”.
Finally, he posted the eBPF patchset. It was rejected. Why? Mostly because of the UAPI. So a plan B was needed: getting it into the kernel without adding a new UAPI. They decided to find something to make faster first. So Alexei rewrote the existing BPF interpreter, using eBPF opcodes and implementation, but called “internal BPF”. The term “classic BPF” was coined by Daniel Borkmann for the historical one.
In 2014, BPF code was converted to iBPF (the internal one), and JIT-ed. Neither eBPF nor the verifier technically existed upstream at the time. There were some arguments at the time about applying iBPF to networking.
So there was another pivot, to try to apply it to tracing instead of networking. It started with filtering. Again, the strategy of making the existing code faster was applied. The tree walker filter was much slower than the BPF tracing classifier filter. Finally, by September 2014, the verifier landed: eBPF was born, and the team celebrated.
It only became useful later in the year, when eBPF programs could be attached to sockets; a few months later, they could be attached to kprobes for tracing. But this was only the beginning. They had an instruction set, but no compiler.
The LLVM project had very different rules and ways of working than the kernel. The fact that the instruction set was in the kernel did not really influence anything: tests had to be written, Alexei went to in-person meetups, etc. The backend monster patch was submitted at the end of 2014, and merged soon after, following many acks, but only as an experimental backend. It was in-tree, but could be removed or reverted at any time.
To graduate from experimental status, it had to have users, more than one developer, and participate in tree-wide changes. A build-bot had to be contributed as well.
For GCC, things were harder since the backend emitted BPF code directly (which was a blocker), so nothing was done because it was too much work. It was finally done by a separate team in 2019.
Being present at conferences doesn’t necessarily make a difference: it is useful, but it won’t help improve the patches. The code matters the most.
To summarize the strategies to land the patches: build reputation, make things look familiar, improve performance, and be ready to compromise.
That’s it for this morning!