Great. So hello everyone. Let me welcome you to my talk about open source in research and reality at this year's Open Source Summit, supposedly in North America, but actually spread all across the globe. As you all know, it's a bit sad not to see any of you in person this year. OSS NA and ELC NA are always events that I look forward to very much, to discuss things with colleagues, friends and competitors. But let's try to make the best of what's going on this year.

For those who don't know me, I'm Wolfgang Mauerer. I have two affiliations. For one, I work with Siemens Corporate Research and Technology in the Corporate Competence Center Embedded Linux, and I'm also at the Faculty of Computer Science and Mathematics of the University of Applied Sciences in Regensburg, where I'm head of the Digitalization Laboratory. And that also explains the title of my talk, Open Source in Research and Reality. I've experienced open source from both angles: from actively producing open source software, bringing it into products, extending it, making it more functional, but also from doing scientific work about how open source works, how communities work, how software development works or should work. And once you do that for a couple of years, you see that there's quite some disconnect between these two communities: the communities that produce actual software and the communities that research software, that think about how software could be improved. These two should actually complement and assist each other, but that's not the case in reality, and that's why I came up with the idea of discussing these issues in my talk today.
Since many will probably be watching this talk as a video recording, I will try to be brief and not use all the time that's available, because I also know from a semester of online teaching and from interactions with students that these online formats can be very tiresome and much more stressful than the non-virtual meetings that we usually have.

I suppose most of you have an academic background but are not actively engaged in academic communities anymore. That's the typical impression I get at open source conventions: it's about 98% people doing open source development, management, integration and so on, and perhaps 2% of researchers mixed in. So what are current trends in research? One big thing is empirical, quantitative, evidence-based software engineering, so we want to quantify how things work.

And maybe let me interrupt right here, because I'm getting the question: are there any slides being shared right now? There should be, but we've had some issues when setting up the call, so can any of the webcast engineers double check that again? So Avni, you're not seeing any slides; can you answer that in the chat? Okay, I need to screen share. Then here we go, that could have been my fault. Okay, so now people should see a slide that's not the cover slide. I assume that now works, so let me continue with that slide.

So current research is much about empirical, quantitative, evidence-based software engineering. You may have heard... oh, I still get messages about no slides, but this time I'm certain I'm sharing some. Okay, so 50% of the chat messages say they can see slides, some say they cannot. Okay, Bob thinks most people can see the slides. Very good. So I'm assuming you can see slides, and let me get back to these keywords I mentioned.
You may have heard these from the medical profession and so on: you don't just want to take anything that is sold to you as a drug that may work or not. In the same way, we'd really like to have proof that things that are suggested in software engineering do really work. Automated software engineering and construction is another big current research trend, which is obviously about going away from people sitting in front of their laptops and writing software to actually using the machines themselves to produce the software.

And why is open source of such great interest in current research? For the simple reason that, as the OSS community, we are producing some of the largest engineering artifacts that have ever been produced by mankind. And we provide lots of open and public development data: of course the source code, but also lots of communication data, lots of bug tracking data, lots of data on how we socially interact and so on. This is obviously a data source that is of great interest to researchers to analyze, also reaching back into the past. And so quite a lot of contemporary software engineering research is solely based on open source software.

To give you an idea of what I'm working on when I'm not doing products or open source software development, I'm listing a few papers. These are usually not really relevant in detail to people producing software. That's my academic hat. Basically, I'm interested firstly in understanding how the socio-technical factors of software engineering influence open source development. Of course, we need our technical capabilities: we need our skills in dealing with programming languages, with build systems, with all the nitty-gritty details of how modern technology works.
But as you all know, open source development is also much about communicating with people, talking to people, interacting with them on mailing lists, remotely, at virtual events and so on. So there are a lot of social things going on. And basically, the focus of my research is to quantify these social aspects of software development as reliably and as quantitatively as we can right now do with the technical aspects.

From the industrial point of view, my interest is mainly in two things. It's firstly in applying Linux to real-time and safety-critical domains. Many industrial appliances, ranging from trains to planes to medical devices, require real-time capable base systems. Of course, there are commercial ones. I don't think I need to argue in this community why we're not too interested in the commercial solutions. We want open source solutions, and of course we want them to be based on Linux. So we are using various real-time systems and integrate them in products. And we also have to take care that these systems always work as expected, so we have safety concerns that we need to satisfy. Again, this list of talks should just give you an impression of what I've done in this regard in the last couple of years in the open source communities.

So I've seen the topic from both sides. And that actually means that when I now start criticizing what is missing in each of these two communities and what could go better in them, it's kind of a lose-lose situation for me, because whatever I say, I have the possibility of annoying about 50% of my friends or colleagues. There's just no pleasing anyone if you start to compare how things work from two different perspectives. But please keep in mind that I of course don't intend to do any harm or to blame anyone for anything.
I'm just trying to point out things that should work better, and that would actually help both communities to improve what they're doing. Or, as I called it in my talk description, I will now try to unsplit my split personality between academia and commercial development and look at things from both sides.

When I was preparing the slides for this talk, I thought about which projects we use most or probably contribute to most, and where science and research could not really provide any input of interest that would help us get our work done better than we would without reading scientific papers and looking at these studies. And I came up with these three projects. There's Xenomai and there's Jailhouse. These are two open source software projects that you may know if you work in the real-time and safety-critical communities. Xenomai is an extension of the Linux kernel that enables it to do real-time processing. Jailhouse is a so-called partitioning hypervisor that's used in safety-critical contexts and that can partition a system into different, totally isolated components, so that these cannot interact, or hopefully cannot interact, in any harmful ways.

Why are these two projects special? At Siemens we've been using Xenomai for a long while. It's the backbone of one of our flagship industrial products, magnetic resonance tomographs. These are really large machines that are quite ubiquitously deployed in any part of the world. So it's something that society really relies on, and Xenomai is actually the main workhorse piece of software that underlies the whole system. If Xenomai fails, then the whole system really badly fails. It's much worse than if, say, any portions of the image reconstruction code or any parts of the visualization fail.
If Xenomai fails, the whole system is down and you can leave a multimillion-dollar machine just standing around doing nothing. The problem here is that we rely on this system very much, but when you look at it closely, it has a community of, yeah, it's not exactly two active developers, but it's quite close. And that obviously brings some challenges to companies. How should we best invest our money into such a project to bring it forward, to create a community around a niche system? It will always remain a niche system. Are there any measures by which we can predict how maintainable the system is going to be should anything happen to these two or three core developers? Is there perhaps any scientific advice on merging systems like that upstream?

It turns out that these questions have not been investigated in any single academic project that I could find, although it's a very crucial thing for commercial companies. And many of these questions, like merging Xenomai into the mainline Linux kernel, have been discussed over and over in the open source community. So they are of interest in the real world. But it seems they are not of much interest to the scientific community, because such a project is considered by them as just some, say, minor irrelevant project that doesn't generalize much. If you publish about a project of two, then you cannot claim generalization. You cannot claim huge impact on the real world like you can when you publish about the Linux kernel and so on. But this total disregard by the scientific community does not really match the importance that the project has for a number of really large commercial undertakings, albeit with limited public visibility.
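The dependence on two or three core developers described above can at least be made visible with a very crude number: how few authors account for a given share of all commits. What follows is only a minimal sketch of that idea, assuming the commit author names have already been extracted (for example with `git log --format=%an`). The function name `contributor_concentration` and the 50% default threshold are illustrative assumptions; this is not a validated maintainability measure, and it does not answer the prediction question raised in the talk.

```python
from collections import Counter


def contributor_concentration(commit_authors, threshold=0.5):
    """Return (core_size, shares) for a list of commit author names.

    core_size is the smallest number of top authors whose commits together
    reach at least `threshold` of all commits (a crude concentration proxy).
    shares maps each author to their fraction of all commits.
    Hypothetical helper for illustration only.
    """
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        # No commits at all: nothing to measure.
        return 0, {}

    # Fraction of all commits per author.
    shares = {author: n / total for author, n in counts.items()}

    covered = 0
    core_size = 0
    # Walk authors from most to least active until the threshold is reached.
    for _author, n in counts.most_common():
        covered += n
        core_size += 1
        if covered >= threshold * total:
            break
    return core_size, shares


if __name__ == "__main__":
    # Made-up example data: one author wrote 6 of 10 commits.
    authors = ["alice"] * 6 + ["bob"] * 3 + ["carol"]
    core, shares = contributor_concentration(authors)
    print(core, shares)
```

A small `core_size` on a project that a product depends on is the kind of situation described for Xenomai; the number itself says nothing about what to do, which is exactly where interpretation by practitioners would be needed.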
I promised I wanted to go fast, so: basically the same questions apply to Jailhouse. Let me push the discussion of that one a little behind two other topics and focus more on the social aspects of the CIP project. This is the Civil Infrastructure Platform. Our goal here is to provide a Linux kernel and selected parts of a base system that we maintain for very long amounts of time. We are initially aiming at 10 years, probably even longer. And that is also a thing that has been discussed in the open source communities quite extensively. Greg Kroah-Hartman, for instance, has given lots of thought to maintaining older kernel versions over multiple years: two, three, four years. And it's a problem that every company that produces software products in the medical sector or in the safety-critical sector needs to face, because whenever we do an update of our base kernel, we more or less need to recertify the system from the ground up, which is a massive undertaking that usually causes much more effort than the actual kernel update; the costs usually very much outweigh any benefits of kernel updates.

Of course, the question is: when will the cost pressure, the development pain caused by running older kernels, become too large? Because, as you all know, the kernel from 10 years ago is of course much, much worse than the kernel that we have now. On the other side, we have savings from running old software by avoiding recertification and so on. Okay, it's already half past 11 in Germany, so I'm not the fastest at thinking anymore. When do the costs associated with running old software?
So, the technical pains we have: when do these costs grow larger than the costs we would have with a prospective update? You would think that this is a question that could be very well studied in science. You could come up with models on how cumulative costs for technical debt sum up; technical debt is an issue that's very often investigated in software engineering research. But again, it's impossible to find any results that would address such a topic. What's the cost of backporting? What's the cost of maintaining old software in community-related environments? And how can we quantify the respective aspects?

Skipping that. So instead, what kind of results do you find when you look into the scientific literature that specifically deals with open source software? Of course, the selection I did is profoundly unfair, because I took some of the papers that I found to be most relevant for the scientific communities, that receive a lot of citations and that are generally very well regarded in science. It's of course not a scientific methodology how I picked these papers. Nonetheless, I think it's as fair a representation as it gets of typical questions that people in science deal with.

One big topic, as already mentioned in the beginning, is analyzing communication in open source development. And for instance, there are three results that have been found in quite a few research papers; it's not just one paper I'm talking about, it's a selection of papers. These are the three I'm listing here. When you think of mailing list communication, there are results like: people in email discussions who are addressed in To and CC are much more likely to reply in a communication thread than people who are not.
The second is that people who commit to the same areas of source code are more likely to reply to questions that address these areas of source code. And the third: maintainers or committers are more likely to reply to questions; incidentally, they are less likely to be replied to when they post messages to the mailing list. Now you can ask yourself: are these three findings really something that gives you any new information or not? I would say it's something that is pretty much obvious for people who have worked in open source for years, and it's not anything that would gain you a lot of additional insight into the development processes as such. Perhaps it's interesting to find that maintainers are less likely to be replied to when they post emails than regular contributors. But unfortunately there is not much interpretation in the literature of why that is so, just the observation that it is so.

Upstreaming is also quite a widely discussed issue: how to upstream changes, how to make sure that the changes get upstreamed without placing any undue effort on reviewers, on the processes, on the projects and so on. How to do that technically well is an important issue in open source. When you look into the scientific literature, you find, again summarizing and reducing the arguments a little bit, three main arguments on why you should upstream: it reduces your overhead, it provides benefits to the community, and it's the right thing to do. I think everyone in all the communities would agree with these three observations. The scientific community may well be astonished that these hold if you're not used to open source development, but is that really something that open source developers would take as much of a surprise? I don't think so, and I also don't think that this adds much value to our understanding of how open source processes and
community behavior work.

One more topic that is recurring in the scientific literature is how to optimize patches for the upstreaming process. The two questions that are addressed quite a lot in the literature are how fast a patch will be applied upstream and, more or less the same, how long it will take to get changes upstream. This is done by analyzing various properties of the changes: to how many people did the patches go on mailing lists, how large were the patches, how many subsystems of a project did the patches concern, and so on. Then you can build nice mathematical models to predict from the properties of a patch how long it will take until it's upstreamed into a project. This is a mathematically very interesting question and operationally a very interesting question, and you can publish quite nicely about that. But then in reality, when you ask any open source developers: is it really relevant how fast the patch will be applied upstream? Does it matter if it takes a week or two weeks or half a day? It's a fair question to ask, but is it the right question? I don't think so, because usually in open source you would say it takes as long as it takes until your patches are applied, until you satisfy the needs and the quality requirements of the project. And it's actually not even a feature if projects apply patches very quickly. It's usually considered a much more important feature to apply the best possible version of a patch, regardless of how long it takes.

Good. So again I'm skipping a slide with more observations, because you get the gist. The questions that seem relevant for a scientific audience, and that have one important common characteristic, namely that you can answer them by measuring data and that you
can build mathematical models on these data, which is very important for scientific work, very often do not really address those questions that are relevant to the open source communities. So I've come up with a list of questions here from the open source point of view that would, from my point of view, be very interesting to address in the scientific literature, but that have not found any answers in this literature so far.

For instance, backporting and long-term stability versus updating; I've mentioned this in the CIP example. Suppose someone could give us criteria, given the structure of a system, given a certain type of change, and probably given some company-internal changes that people don't want to upstream for one reason or another, because they're not of sufficient interest for the community or because they contain, which is hopefully only rarely the case, some proprietary knowledge that people don't want out in open source projects. If we had a model that, given any of these influence factors, could give us quantitative guidelines on when we should keep backporting changes from an upstream open source project versus when we should do a complete update of the system, considering not just cost issues but also stability issues (backporting brings in risks, and updating the complete kernel of a system brings in stability risks), and if we could quantitatively compare these and then come to an optimal decision, that would actually be really helpful, not just for companies using open source but also for open source projects, like the Debian project, that try to maintain large amounts of software over a long amount of time.

Speaking of Debian, they've invested a lot of effort in a topic that's also very important for certifiable systems, namely the reproducibility of builds. When you work with, say, software that's three years old and you rebuild it, then for very
many reasons you're likely to get slightly different binaries than you used to get three years ago, because compiler details may have subtly changed, you get timestamps in your builds that change the binary, and so on. That makes it very hard, even if you haven't changed anything in the source code but want to benefit, say, from bug fixes in compilers, to come up with an exact same copy of the software, or with a copy that has only changed in places where you actually know how the change can be attributed. So it's a massive unsolved problem in the open source community, but it's seemingly not of much interest for the research community, because it's usually considered a detail problem. It's just a build system problem, just a problem of making binaries, and you would likely hear that it's been solved in, like, the 1970s and is not of much interest to research these days.

Some of the other points (I'm not going to read them out verbatim, since you're all seeing them on the slides) concern the questions of how community structures and other non-technical observable measures quantitatively influence the quality of the outcome. Say, the Linux kernel has certain approaches to its release cycles, to how subsystems are formed, to the maintenance structure, to the governance structure and so on. Do any of these structures actually influence the outcome: the quality in terms of the number of bugs, the reliability of releases, the bug fixes that are needed after a release, and so on? I think it would be a very important question to quantitatively study these influence factors plus the effect sizes. So if there's an effect, how large is it? Is it just something that I can barely observe, or is it really something that brings groundbreaking changes to the outcome of the community process? That is something that
would require a lot of investigation, and that opens the possibility for a lot of mathematical model building and for giving actionable and accurate guidance to the open source communities from the research communities. This is also partly being done, but firstly, the results are these days mostly ignored by the open source communities. Just think of, say, a Linux kernel gathering when people discuss processes, changes to processes, possible benefits and drawbacks: have you ever heard anyone mentioning any research on that, although quite a bit exists in that direction? I haven't. And that's typically because the questions that are addressed are more tailored towards being nicely analyzable than towards really addressing the problems that people in open source development face.

From this list, let me finally (you should see when I highlight something) address the last question, because some facets of that question have recently come up in some discussions, for instance about naming Git branches and so on. We have codes of conduct, and we incrementally try to be nice to people. If you compare the communities from 15 years ago to the communities now, I think we've made big steps forward in being welcoming and in not treating people badly. But still, people are humans, and you will not get to a stage of universal happiness no matter how hard we try. So there will always be conflicts; there will always be persons that probably slightly misbehave, or that have different ideas about how things should go than other persons in communities. And an interesting question to open source communities as such, but also in the interest of companies relying on open source
software, relying on the stability and the trustworthiness of communities, is, to phrase it a little directly: how many bad or evil or misbehaving persons can a project tolerate until it starts to run into substantial problems? And how could we detect such substantial structural problems? For instance, when does communication start to digress from the technical issues to just issues of policy and politics? Does that influence the technical outcome of communities? Are there any measures to handle that, and so on? So that's also a question, trust in ecosystem stability and the influence of, say, point-wise bad influences on communities and how to best deal with that, that I think could be answered from a scientific point of view but still hasn't received any satisfying answers that would address the needs of the communities.

So how could academia come up with research that better benefits the communities and the software that they're looking at than it does now? You may know Jesús González-Barahona, who is one of the few researchers who are actually also quite active in the open source community. His research group provides a lot of open source software to analyze, say, the diversity aspects of communities and the ways people interact in communities. They provide dashboards and so on to measure project progress, to monitor the health of projects and things like that. There was a talk from him a year ago, and he said that the main thing for him, and that is something I can only underline, is that academic research is basically producing many fine pieces of insight and many fine models, but what's lacking, and what's essential, is connecting the result of a model with its impact on things. And that is, I think, the main thing that we would need to
solve to better connect scientific research to open source communities: not just come up with the model, but really also do that extra step, which often cannot be nicely published, but which would make the bridge between research and the communities. That means not just showing that we have this and that model, that it's mathematically nice and works nicely for us, but also researching, explaining and understanding what impact the results of these models have on the software that we actually develop.

And how could industry help with that endeavor? When you look at the problem from a researcher's point of view, how to select the software and the problems that should best be addressed in your research, then you typically lack the experience that open source developers have in dealing with real-world issues and in finding the spots that really hurt. If you haven't done substantial actual development, and if you haven't worked in groups that need to live with the realities of doing this development, of doing software integration, of combining components, of fixing bugs and so on, then basically you cannot know where the pain points are. So industry would need to provide a lot more guidance to research than it does, to turn the course of the research into something that actually benefits industry and communities.

What's likewise important is to interpret any numbers that come out of research. I've often experienced that researchers measure things like: we have so-and-so many bugs per developer, we have a merge time of this and that, our patch series range from so-and-so many commits to so-and-so many commits. But typically these numbers are very specific to projects and don't tell much to the people measuring them, while they could say quite a lot to
the people doing the actual development. And interpreting these numbers, providing guidance on how to interpret what people have measured, which is actually not much effort for seasoned practitioners, would be quite a lot of help to the research community to better understand what we are doing and what their research could actually tell us.

Good. So I've not kept my promise to not fully use the time I have for this session. I'm at least not over time, but I didn't make it shorter than the full time. I'm also very much interested in your comments and your questions, so I would like to leave some time for that. Let me quickly summarize the things I've talked about on how to close the gap from both sides. Actually, it's not very work-intensive to do that, and for many things it doesn't require very much time. Companies can easily support research by just looking more into what science is coming up with, perhaps by also sending people that they usually send to open source conferences like OSS NA to a scientific conference or another. Researchers are very friendly people, especially to industry, because they obviously have the hope of receiving money from them, and that makes them doubly friendly. If companies come and say, okay, we have this and that problem and this and that concrete scenario, then it's actually very likely that you will get people interested and that you will get people to look into your problems. But of course you need to provide that initiative on the company side. And the second big thing for companies is when they look at
research, and I've experienced that quite often myself, there is a lot of black-and-white expectation going on. People read a research paper and then they want one definitive answer to a problem: organize your process like that, build communities of that size, make patch series that comprise seven plus or minus two commits. But that's unfortunately not how reality works, and the answer from scientific advice will typically be "it depends". Just don't dismiss that upfront. Don't expect anything too simplistic and too universal, but really accept that scientific advice will never be of that kind and will, quite by definition, always need to be of the "it depends" kind.

As for science, I think providing actionable statements is the crucial thing to increase the interest of industry. For that, it would be essential to apply the knowledge they get to real projects. That's typically done for the Linux kernel, but only very rarely have academic findings been applied to open source software beyond one or two representative examples. The experience that can be gained by using what researchers have found to actually improve projects, and by working with these projects to apply the findings, would increase the acceptance of scientific research and also the validity of scientific research quite a lot.

And maybe as one final point that I'm mentioning in this talk: what's also very much missing is this. When I see scientific papers, they all claim substantial improvement on some very specific issue, but it's very hard to then weigh these specific improvements against the general impact of a project
or on a development effort, and to see how big the impact is on the project as such, on the development as such, and not just on this tiny isolated aspect. And also to get a more reliable quantification of what the drawbacks, the negative aspects of applying results to projects are, because these can very often outweigh the actual benefits but are typically not really addressed in the research as such, which is focused on one very specific aspect of the project.
Okay, so now the lights in my office have gone out, and I guess that's the sign for me to stop the talk. Actually, the lights go out every day at 12 o'clock midnight, but I will take that as a sign to stop my talk and to stop the screen share, as I've been told by the webcast engineers, and to see if you have any questions or comments on what I've said so far.
Actually, there is one: where can a junior researcher with no coding background start with open source? That is actually a thing that I experience quite often with students in my research group. When they come to write their theses and do their first little research projects, they often ask this question. What I tell them is: I don't think it matters much to which project you contribute, but just follow the guidelines. It's not specific to researchers or, say, general developers; just follow the guidelines that the projects give you. Most projects have programs to welcome new developers. For researchers, the Google Summer of Code initiative is not necessarily the best place to go, but the Linux Foundation is building up CommunityBridge, many projects have mentorship programs, and so on. And it really pays off to engage in such mentoring opportunities, because you will gain very important insights on how development works by doing it, by
acquiring first-hand knowledge, and it will also very much improve your ability to find the right research questions, even by just discussing with the people in the projects and saying: hey, what are interesting spots that I could work on? What are some low-hanging fruits that I could contribute to? Because starting with these, I have the experience that students and junior researchers very quickly pick up how to interact with the projects and are very quickly drawn into the projects to contribute sustainably on a long-term basis. And that, as a side effect, will get you lots of ideas on areas that you could address in your research.
Cool. So are there any more questions, or do you have any comments on areas where you find that there are some gaps in knowledge in open source projects, where people don't know what the best course of action is, and where science could provide solutions? Those I would of course happily take, because that gives me some more opportunities and ideas for my own research.
So if not, I guess you can catch me after the event anytime on the virtual platform. I thank you very much for your attention. If you're in Europe, I thank you especially, because it's very, very late at this time of the day. If you're in Japan, I also thank you especially, because it's very early in the day. And of course thanks also to everyone else on this talk who's in a more comfortable time zone. So goodbye, and I hope to see you next time again in person and not just via webcast.