So it's getting 20 past four, so maybe time to start. As you can see from our slides, we are setting a number of world records here: the first is for the most speakers on a stage at a Linux conference, and the most authors of a paper at a Linux conference. This is actually a collaboration of four people, from Siemens, the Technical University of Applied Sciences Regensburg, a hobbyist from the ELISA project who also happens to work at BMW, and the University of Erlangen-Nürnberg. Unfortunately, only three of us could make it here, so we will miss the world record for four speakers on a stage. But we are still setting out for another record, and that is learning more about the behavior of kernel developers, from their behavior on mailing lists and in Git, than the NSA already knows. And this is what we will hopefully be presenting to you in this talk.

Actually, this is both a talk and a request for comments. We have been observing the kernel community for quite a while now and have tried to learn from you, and once you have seen what we are doing, we are very interested in what questions come up from your side and what we could look into that would benefit the Linux kernel. We have had some very good suggestions from Plumbers already that are integrated into this talk, but we are more than happy to hear input from you in the discussion afterwards. And with that, Ralf Ramsauer will continue with the technical details.

Thank you, Wolfgang, for setting the world record for the longest introduction. So what we want to do is formalize and assess the Linux kernel development process. Why would someone want to do that? We have different motivations from inside and from outside the community. First of all, my personal motivation is to write papers and finish my PhD thesis. But there is also interest from, for example, safety-critical development, where certification is needed: if you look at the automotive industry or at industrial applications, certification authorities demand an assessment of the development process. Needless to say, companies cannot simply modify the Linux kernel development process to their needs, so all they can do is a post-hoc analysis of the kernel and an assessment of the process that is being used. Nevertheless, the results we produce are not only valuable for certification efforts; they are also interesting for software monitoring or software health analysis.

First and foremost, a preliminary requirement for an assessment of the kernel's development process is to understand the process and its peculiarities, and to be able to trace the development from the initial submission of a patch to its final integration into the repository. Tracing the development process is an issue that has gained interest within the community. Recently, there were discussions on how software changes can be tracked and how traceability can be improved by adjusting the submission process. There were proposals for how things could be improved, for example by adding message IDs to commit messages when a patch is integrated by a maintainer. Yet this alone would still lack links to earlier revisions of a patch, so it was also proposed to add links to previous revisions of a patch in the cover letter of the latest series, for example.
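To make the proposal concrete: a Link: trailer pointing at a message's lore.kernel.org URL is a convention some maintainers already use. Here is a made-up example of what the tail of such a commit message looks like; the subject, author, and message ID are invented for illustration:

```
    serial: fix a hypothetical locking bug

    [...]

    Signed-off-by: Jane Developer <jane@example.com>
    Link: https://lore.kernel.org/r/20190909123456.42-1-jane@example.com
```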
Some subsystems already follow these guidelines, which may be inconvenient to some extent, but there is no common opinion yet on how these issues should be solved. So the community is already aware that there is a huge gap between the kernel repositories and the development that happens on the lists.

First of all, let's look at how the development process itself works. In the beginning, a patch is created by a developer in private, on their local machine, and sent to one or more of the kernel's mailing lists. Reviewers send their comments and start a discussion of the patch, and in the end the maintainers decide whether to accept the patch or whether, for example, further changes are required. The developer will then amend and rework the patch. This happens in private again; it is not publicly visible to us. They resend the patch to the mailing list and the discussion starts again. So we have an iterative process until the patch is in an acceptable state. Everything within the blue box is visible to us, but everything that happens outside the blue box, which includes the rework of the patches and the integration step, is not visible to us, because we simply lack the trace between the patches.

To make clear why this happens, let's look at the Linux kernel development workflow. On the one side, we have the mailing lists where developers send their patches, and on the other side there is the repository that contains the commits at the end of the day. A patch on a mailing list is identified by a unique message ID, and a commit in the repository is identified by a unique commit hash. Typically, if there are several revisions of a patch, there is not just one message ID that belongs to the patch; there are many message IDs that belong to the same functional change, so to say. And there may even be more than one commit hash in the repository that belongs to the same patch, as several maintainers may pick up the same patch from a mailing list and apply it to their local branches; later, when those branches are merged, the patch may appear twice in the repository. And we are missing the link between all those message IDs and commit hashes.
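To illustrate what the relation we need to recover looks like, here is a minimal sketch in Python. The names are ours for illustration, not PaStA's actual data model: one "cluster" groups everything that belongs to the same functional change.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    message_ids: set = field(default_factory=set)    # patches on the lists
    commit_hashes: set = field(default_factory=set)  # commits in repositories

# Two revisions of the same patch, picked up independently by two
# maintainers, so it ends up in the repository twice (hypothetical IDs):
cluster = Cluster(
    message_ids={"<v1.20190901@example.com>", "<v2.20190908@example.com>"},
    commit_hashes={"a1b2c3d", "9f8e7d6"},
)
```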
So why is this important for tracing the development process? The commit in the repository is just the outcome of a process that happened before, on the mailing list. All those steps on the left side, the patches, the discussions, the remarks, the tags that are being added, the Acked-bys, the Signed-off-bys, the Tested-bys: all of this is lost when the patch is finally integrated. So, as long as there are no guidelines for adding those traces to the commit, we need a tool to recover the relations between patches on mailing lists and commits in repositories.

And we do have a tool for that. We call it PaStA, for Patch Stack Analysis. I initially developed the tool a few years ago. What we wanted to do back then was detect similar patches on different branches of a Git repository, and we did that to quantify the mainlining efforts of massive out-of-tree developments such as the PREEMPT_RT patch stack or vendor kernels. We extended the tool to work with mailing lists, and now we are able to map messages on mailing lists to commits in repositories.

So how does the algorithm work? Here is an example of two patches. On the left side, you can see an early revision of a patch that first appeared on the PREEMPT_RT side and was later integrated into mainline Linux. It is obvious that the commit message slightly changed over time, and the diff changed as well. What we do to connect those patches is calculate two scores, one for the commit message and one for the diff: we calculate the similarity of the commit messages and the similarity of the diffs, and combine those two scores into one global score. If you take a closer look at the diff, you see that within the same file, the same lines got removed, but different lines got inserted. Let's look at the insertions. At first glance, it looks like the patches would actually be dissimilar, but on closer inspection you can see that these lines introduce the same functional change. We detect that by simply tokenizing the two parts, and we see that they use pretty much the same keywords. Again, we use string distance metrics to calculate the difference between the two parts, and for this patch, for instance, we get a similarity of 87.5%. Given two patches, if their similarity exceeds a certain threshold, we consider the patches similar (a toy sketch of this scoring follows below).

What we then do is a pairwise comparison of all messages on all mailing lists against all messages on all mailing lists, and then compare those against the repository, to reconstruct a graph of connections between different patches. At the top, you can see the similarity network where all similarities are included, and below it you can see partitions of similar patches, where we remove the connections between patches that do not exceed certain thresholds. What we have then are clusters of similar patches. At the top right here, whoops, where is it? Here, for example, we have a cluster with only patches but no commit hash. And there are also clusters with several patches on mailing lists and one or two commit hashes. The green nodes are the patches on the mailing lists, and they may denote, for example, several revisions of the same patch. But I don't want to bother you with the technicalities; if you want to dive into the details, you can look them up in one of the papers that we have written.
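Here is a toy sketch, in Python, of the kind of scoring just described. It is not PaStA's actual implementation: the tokenizer, the weights, and the threshold are invented, and standard-library sequence matching stands in for the real string distance metrics.

```python
import re
from difflib import SequenceMatcher

def token_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1], computed over tokens rather than
    characters, so reflowed lines and whitespace changes matter less."""
    tokens_a = re.findall(r"\w+", a)
    tokens_b = re.findall(r"\w+", b)
    return SequenceMatcher(None, tokens_a, tokens_b).ratio()

def patch_similarity(msg_a, diff_a, msg_b, diff_b,
                     w_msg=0.4, w_diff=0.6):
    # Combine the two partial scores into one global score.
    return (w_msg * token_similarity(msg_a, msg_b)
            + w_diff * token_similarity(diff_a, diff_b))

THRESHOLD = 0.8  # invented for this sketch; PaStA's thresholds are tunable

def similar(patch_a, patch_b) -> bool:
    # Each patch is a (commit_message, diff) pair.
    return patch_similarity(*patch_a, *patch_b) >= THRESHOLD
```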
That's about the technique. But where do we get our data from? The repository is clear, we can simply clone it, but where do we get the mailing lists from? A few years ago, we still used gmane.org, but as many of you may have noticed, they went down from one day to the next, and Gmane is not up any longer. Luckily, lore.kernel.org hosts public inboxes. Those are Git repositories that contain mailing lists as archives, which is pretty nice, because the repositories available there contain prehistoric data going back as far as 1996. Unfortunately, they are limited to just a few lists. And some lists include data from gmane.org dumps, and unfortunately gmane.org cut off some email headers, which makes the analysis a bit hard. So we started to create our own collection, and we subscribed to all Linux kernel mailing lists, which are more than 200 lists. So: we are watching you. You can find those lists, and your emails, at github.com/linux-mailinglist-archives. Those archives only reach back to May of this year, though.

So actually, all the data we need would be ready, if it weren't email. And email is broken. We find all kinds of weird stuff inside emails: broken encodings, Base64 encodings, emails claiming to be UTF-8 while they are in fact ISO 8859. It was really hard to normalize all this data. Just to give you a few examples: we find headers like this one, which starts like a sane Message-ID but ends with a date. Happy parsing. Dates like this one, where I don't know which time zone this is, and my parser doesn't know either. Things like that, where I don't know how they can even happen. Maybe this is the reason. But once all those things are figured out, we can start with our analysis, and this is where I would like to hand over to Sebastian, who did most of the analysis part.

So now that we are able to parse all the emails, we can start our analysis. We want to analyze the Linux kernel from release v2.6.39, so roughly from the beginning of kernel version 3.0 until the present day, which is above 600,000 commits. On the other side, we collected the emails, as said before, from approximately the same date, May of 2011, until the end of last year, which is above 3 million emails. But keep in mind that this is not all email in Linux kernel development; these are only the emails that are available as public inboxes. Here you have a list of the mailboxes we have available; keep the bold ones in mind, because we will talk about them later.

So now we have the 3 million emails, and as you know, not all of them are patches; there are discussions and so on, and those we can exclude. In the next step, we noticed that not all patches on the Linux kernel lists actually patch the Linux kernel: for example, there are patches for the userland tools and so on, and we exclude them as well, because we want to analyze the kernel. Next, we noticed there is a lot of bot traffic: linux-next mails, git pull requests, stable reviews, and we exclude those as well, because we want to analyze the patches sent by people, by developers, for the new kernel. That leaves around 900,000 patches. And we still exclude some more, because there are patches inside replies, like code snippets, that we don't want to analyze, so we have 800,000 remaining emails. We parsed all of them, stored them, and mapped them upstream, so we did all the things Ralf explained earlier. And what can we do with them now?

We asked the research question: what are the specific characteristics of ignored patches? Why do we ask this? Because we hoped to determine unmaintained subsystems based on the ignored-patch data. And to say it right away: we were not able to, because the data does not correlate. But let's go on with the talk. So what is the definition of an ignored patch? We have three requirements. First, the patch itself must not have a response from any person other than the author. Second, the patch must not be upstream, because then somebody worked with the patch. And third, both requirements must be met by all related patches; this includes revisions of the patch and the other patches of the same series. So, for example, if I send a patch now and it is ignored, and I send a revision later and that one is answered, then neither patch counts as ignored, because one of them was answered.
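This three-part definition translates directly into a small predicate. Here is a sketch with our own, hypothetical field names, not the actual analysis code:

```python
from dataclasses import dataclass

@dataclass
class Patch:
    author: str
    responders: set       # everyone who replied to the patch mail
    is_upstream: bool     # was it mapped to a commit in the repository?

def patch_ignored(p: Patch) -> bool:
    # (1) no response from anyone but the author, (2) not upstream
    return not (p.responders - {p.author}) and not p.is_upstream

def cluster_ignored(related: list) -> bool:
    # (3) all related patches (revisions, series siblings) must qualify
    return all(patch_ignored(p) for p in related)
```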
When we did this, we found that 2.5% of all patches on the lore.kernel.org mailing lists in our time frame were ignored. Then we looked at the individual years, and we found that in 2011, about 4% were ignored; in 2018, only 1.5%. So you can see the trend: the ratio of ignored patches is decreasing.

So we thought, why not plot this? And we got this graph: the petrol-colored line is the total number of patches per week, and the red line is the ignored patches per week. You see this one spike here: there was a guy who sent a patch series with 1,300 patches at once, and it was kind of messy, because the series was not linked in a single thread; each patch of the series was sent as its own thread, and of course not every one of those patches was answered. So we have this spike, and because it distorts all the other graphs, we will just ignore it.

Now we have this graph, which looks much prettier, and if you look at the total line, once a year there is a spike downwards. That is maybe Christmas, I don't know. And each year you see five to six spikes downwards, which might correspond to the kernel releases. If you look at the ignored patches, we get this graph, and you see that most of the time, 30 to 50 mails are ignored each week. There are some spikes up and down, but it is really steady: 30 to 50 ignored patches per week. And then we thought: is this much, is it not much? We need to compare it. So we plotted the ratio between the total patches and the ignored patches, and we got this graph, and you see it is decreasing. That is trivial: the number of ignored patches stays the same while the total number of patches increases, so the ratio decreases.

In a Linux Plumbers talk earlier, a number of maintainers asked: how many mails am I ignoring? To answer such questions, we implemented these statistics per mailing list, and we selected four mailing lists: two architecture-related lists and two network-related lists. These are the graphs; again, the blue line is the total number of patches and the red line is the number of ignored patches. We can see three things. First, on all of these mailing lists, the number of ignored patches is quite steady: there are no new spikes, no increase, no decrease. Second, the number of ignored patches on each mailing list, regardless of the list's size, is quite similar. And third, if you look at ARM and netdev, the total number of patches is increasing, but the number of ignored patches stays the same. And now back to you.

Thank you. That is pretty interesting, because if you look at the beginning of our analysis in 2011, the ARM mailing list received about 150 patches per week, and now it receives up to 700 patches per week, but the number of ignored patches always stayed the same. The same applies to the netdev mailing list. We have some mailing lists that look a bit different, but for most of the mailing lists, this is how they evolved over the years.
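For anyone who wants to reproduce this kind of weekly plot, here is a minimal sketch of the aggregation, assuming a hypothetical two-column table of patches; the real analysis pipeline differs.

```python
import pandas as pd

# Hypothetical input: one row per patch, with its submission date and
# whether the definition above classifies it as ignored.
patches = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-02", "2018-01-03", "2018-01-10"]),
    "ignored": [False, True, False],
}).set_index("date")

total_per_week = patches["ignored"].resample("W").size()
ignored_per_week = patches["ignored"].resample("W").sum()
ratio = ignored_per_week / total_per_week  # the decreasing curve
```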
We took the ARM mailing list because it is a list that gained interest over the years, so there are more and more patches on it, while MIPS, another architecture, had a steady amount of total patches; that is why we chose those two. And we chose netdev because its total amount of patches also increased, while for linux-wireless the total amount of patches decreased a bit over the years. Now, if we look at the ratio of ignored patches: for linux-arm-kernel, for instance, even though the total number of patches increased, the fraction of ignored patches significantly decreased over the years. We cannot tell from these data whether some of those subsystems are in an overload situation, but at least we can say that the process somehow still scales. That looks good to us, though of course there may be problems within a subsystem that we cannot see.

Another analysis we did: we wanted to know whether it matters when a patch is sent. Is a patch more likely to be ignored if it is sent during a merge window, or if it is sent during the stabilization phase? It turns out that it is largely independent of when the patch is sent, but there is a slightly higher chance that a patch is ignored if it is sent during the merge window.

So with this, we were able to show some first results of our analysis: we map things that happen on mailing lists to the repositories, and we introduced some first metrics to quantify the development process. During that work, we observed another interesting thing that I would like to present to you: off-list patches. An off-list patch is a patch that was included in Linus's repository but has never been seen on a public mailing list. That does not mean there is no mailing list where the patch has been seen, only that there is no public one. Since we are able to map messages on mailing lists to commits in repositories, we can also determine the commits we miss, the ones for which we can find no email. We analyzed the stabilization phase of version 5.1 of the Linux kernel, which includes about 1,800 commits, and we found 60 commits for which we had no patch on any mailing list. Some of them were errors where our tool failed, but 24 of them were genuine off-list patches.

The obvious category is reverts: the patch that gets reverted appeared on the mailing list, but the reverting patch itself is not sent to the list. That is obvious. Some off-list patches are patches from maintainers that fix up some tiny little thing and are not sent to the list; every subsystem behaves a bit differently there. The less obvious finding is that some subsystems tend to send off-list patches rather often. And of course, the most interesting thing: some of those off-list patches are indeed security-related issues. So we can automatically find security-related patches before they are, maybe, publicly announced.
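The detection itself falls out of the mapping almost for free: off-list commits are simply the commits in the analyzed range for which the mapping recovered no message. A sketch with hypothetical helper names:

```python
def offlist_commits(repo_commits, commit_to_messages):
    """Commits in the analyzed range for which no message was found on
    any public mailing list.

    repo_commits:       set of commit hashes, e.g. all commits in the
                        v5.1 stabilization phase
    commit_to_messages: dict mapping a commit hash to the set of message
                        IDs recovered by the similarity analysis
    """
    return {c for c in repo_commits
            if not commit_to_messages.get(c)}
```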
For example, we found a patch during that period from Greg Kroah-Hartman where he marked some Siemens-related code as broken. We asked Greg whether it was really the case that this had never been seen on a public mailing list, and he said: yes, this was discussed on a list, but not on a public one. So indeed, we can find this type of commit automatically with our tool. That was a side effect of our work, but it is still pretty interesting.

So, we were able to show a kind of formalization of the development process. We are now able to map messages on mailing lists to commits in repositories. We showed which patches are actually relevant for our type of analysis, and we showed you some results on the patches that are being ignored on mailing lists, together with an analysis of some specific lists. And with this, I can again hand over.

Yeah, thank you very much for presenting all these details. I mentioned that we would like to have community input from you. So if you now come to us as companies and tell us: oh, we want to give you money, and you give us patches for the Linux kernel that have not been discussed on public mailing lists, we will not yet be doing that. But as an academic, I would also like to point out a certain problem with doing this type of research: it is extremely hard to get public funding for results like these, which may be interesting for the kernel community, but which are not really recognized by any institutionalized academic funding system. Because, interestingly, if you present results at conferences like this one, it has no impact from the scientific point of view; if you show them to five people in a workshop who have never contributed a single line to the Linux kernel, then it has impact. So we will be happy not only to take your questions, but of course also to take the money you would otherwise be spending on Outlook. And with that, thank you for your attention, and fire away with your suggestions and questions, if there are any.

Sorry, which subsystems were off-list? You cannot map it to one specific subsystem, so it depends. Sorry, which subsystems tend to be off-list? I would not like to say that in front of this audience. Unless you give us money. Because I do not want to blame specific subsystems or persons. That is really not our intention. We will be happy to discuss with subsystem maintainers how their subsystem is doing, but we really do not want to publish that in the sense of: oh, this subsystem is doing badly, that one is doing well, you suck, you don't, and so on. Oh, and there is this thing called the code of conduct from the Linux Foundation; if you had asked us last year, then maybe that is what we would have done.

Our next step is a more fine-grained analysis: we want to look at the actual patch, run get_maintainer.pl, map the patch to a subsystem, and then do the analysis at that granularity, because one mailing list may be used by several subsystems, like the linux-arm-kernel mailing list or the netdev mailing list. Actually, a nice anecdote: we are interacting with the get_maintainer.pl maintainer, and one of the questions we got was: oh, is this academic research, or will you be doing any real work? So the interest is huge. That's illegal. That's illegal. You could do that, but we didn't recommend it. I want to state, on record, that we didn't recommend that. Frankly.
There could be. We did look into that as part of other research: basically, we were trying to work out how Linux kernel developers form groups, who works with whom, and so on, and we did a partial analysis that extends the data from mailing lists and the Git version control system with IRC logs and other data sources. And the results you get in terms of collaborative teams do not really change much if you include this kind of information. So of course it is a threat to validity: it could be that there are some people who do not use mailing lists, for religious reasons or whatnot, and who only use IRC, but it would be extremely improbable that this makes a big impact on the other results that we have.

So, we have not found any patterns regarding this question yet, because we have not looked into it in detail, but of course we will be happy to do that. Yeah, not yet. There is one pattern we found: during that window, there was a new subsystem with a new maintainer, and it looked like these were bootstrapping issues.

So, the question was: did we look at the latency of patch integration, that is, how long it takes for a patch from first submission to ending up in the Git repository? Yes, we have data on that, but in another presentation. We have the graphs and the data, but unfortunately not in this presentation; if you look into the references that we give in the talk, it is all there. And if you want to know, again, this is an offer from us: if you want to know about a specific subsystem or a specific aspect, we will be happy to investigate that for you. Yeah, I think two years ago he did that kind of analysis on I2C, as far as I remember. So, any more legal or illegal suggestions? And thanks very much for your attention. And just in time.