Hello everyone. Good morning, good day, good evening, depending on where you are. My name is Wolfgang Maurer. I'm greeting you live from Regensburg, Germany, which is close to Munich, and where it's now more "good evening" than "good day". I welcome you to my talk on safety, security, and quality: artificial intelligence versus common sense.

Right at the beginning, let me make a promise. I've experienced that it takes quite a bit of concentration to listen to non-live webcasts; I guess we'd all rather be spending our time right now in a conference room somewhere in North America, talking to each other in person, which is much more relaxed. So I will try to keep my talk quite short and instead focus on the Q&A session at the end. That is also an interesting thing for me, because I'm going to talk about research results that people in academia have come up with over, say, the last 10 to 15 years to apply methods of artificial intelligence or machine learning (I don't really make any philosophical distinction between these two names). I will introduce these to you and discuss a little what I think could reasonably be applied in open source systems, but I'm also eager to learn from you as industrial practitioners where the areas of applicability for such methods are, and where the problems that could benefit from them lie.

For those who don't know me, I have a dual role. For one, I work with Siemens Corporate Technology, and that's maybe a good time to switch from the title slide to my actual slide deck. Just a second, I'm solving a window-shifting problem. You should hopefully be seeing my slide deck in full-screen mode now; if not, please ping me on the chat channel, but it seems to work. As I was saying, I have a dual role: I work at Siemens Corporate Technology in the embedded Linux department, where I do practical, applied work, and I also have a research group at the Technical University of Applied Sciences Regensburg, where I think about the more fundamental issues of software engineering.

So what are the main possibilities to apply artificial intelligence and machine learning in software engineering? If you look at the literature, it turns out there are three classes of problems where machine learning techniques can support our work as software engineers, but also our work as integrators.

The first category is classification, learning and prediction. That's naturally quite a good match, and it's the category that has been investigated longest in software engineering. People have been using case-based reasoning, rule induction and similar techniques for software project prediction, for the various properties of software you'd like to predict: be that quality, be that the time it takes to finish a project, making the notoriously hard predictions of when things like software (or, in our German case, perhaps airports) will be ready. We all know this is very easy to get wrong, and the hope is of course that machine learning can help us get it less wrong. People have also tried defect prediction and many more things in this regard; a minimal sketch of such a prediction model follows below.
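To make this first category a bit more tangible, here is a minimal sketch of what such a defect-prediction model might look like in Python. Everything specific in it is an assumption for illustration: the metric columns, the toy numbers and the choice of a random forest are hypothetical, not taken from any particular paper.

```python
# Minimal sketch of a defect-prediction classifier; the metrics and labels
# are invented toy data standing in for what you would extract from a real
# repository (e.g., via git log), and the feature set is hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.DataFrame({
    "commits_last_year": [120, 3, 45, 8, 210, 15, 60, 2],
    "distinct_authors":  [14, 1, 6, 2, 25, 3, 9, 1],
    "lines_changed":     [9000, 40, 2100, 300, 15000, 500, 3300, 25],
    "defect_found":      [1, 0, 1, 0, 1, 0, 1, 0],  # defect after release?
})
features = data[["commits_last_year", "distinct_authors", "lines_changed"]]
labels = data["defect_found"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, features, labels, cv=4, scoring="roc_auc")
print("cross-validated AUC: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```

On real data you would of course validate such a model much more carefully, but the shape of the approach, per-module features in, defect-proneness out, is essentially this.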
The second big topic for machine learning in software engineering is probabilistic reasoning, with basically similar tasks as I mentioned before, but in a more theory-based setting: we don't just benefit from statistical insights, we do proper logical reasoning, augmented with probabilistic components for dealing with uncertainty. That can, for instance, help to model how users interact with software, which helps us generate realistic test cases, generate workloads that properly stress schedulers, and simulate real-world user behavior in general (I'll show a small sketch of such a usage model in a moment).

Thirdly, we have the newest of the three research strands in this area, computational search and optimization. That is a technique to reformulate many software engineering problems that arise in programming as optimization problems, which can then be dealt with by going through large search spaces and finding optimal solutions. Many papers in this strand deal with requirements, software design, maintenance and testing. I think it's the least good match for open source software, because the problems studied are often very specific, but the first two have quite some potential to be used in open source software.

Research in these three areas is very often based on Linux as a data source or, if you want to call it that, as a guinea pig, because it is of course one of the largest engineering undertakings that mankind has produced so far. We have lots of open code, we have lots of open development data that researchers can use to feed machine learning models, and many projects of the Linux ecosystem, first and foremost the Linux kernel, are considered role models of good (for whatever value of "good") software engineering, so people naturally try to follow that success.

Why am I interested in applying machine learning to these open source development problems? For two reasons. Firstly, I am, for various reasons, interested in safety-critical systems. I've seen some of the experts in the area in the talks, so I don't need to tell you much, or probably cannot tell you much, about the details of safety if you're an expert in that area. For everyone else, including me: there are some very hard to read, very complicated standards that describe how safety-critical software is supposed to be built, and only a few experts in the world can actually read these standards, understand them and act accordingly. On the other hand, we do not just want to use software built with such highly specialized expert knowledge; we want to take software like the Linux kernel, software that comes from the Linux ecosystem, and use it in safety-critical contexts. That sounds much like a contradiction.
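Here is the promised small sketch of a probabilistic usage model, in this case a simple Markov chain that generates synthetic user sessions for testing. The states and transition probabilities are invented for illustration; in a real setting you would estimate them from recorded sessions or logs.

```python
# Minimal sketch of a Markov-chain usage model; states and transition
# probabilities are invented, standing in for values estimated from logs.
import random

transitions = {
    "start":    [("login", 1.0)],
    "login":    [("browse", 0.7), ("search", 0.3)],
    "browse":   [("browse", 0.4), ("search", 0.2),
                 ("checkout", 0.2), ("logout", 0.2)],
    "search":   [("browse", 0.6), ("logout", 0.4)],
    "checkout": [("logout", 1.0)],
}

def generate_session(rng, max_steps=20):
    """Walk the usage model to produce one synthetic user session."""
    state, session = "start", []
    while state != "logout" and len(session) < max_steps:
        session.append(state)
        states, weights = zip(*transitions[state])
        state = rng.choices(states, weights=weights, k=1)[0]
    session.append(state)
    return session

rng = random.Random(42)
for _ in range(3):
    print(" -> ".join(generate_session(rng)))
```

Feeding many such generated sessions into a test harness gives you test traffic whose statistics resemble real usage rather than hand-picked scenarios.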
But if you ask the people who understand the standards, they will tell you there are actually three ways of achieving certification for safety-critical software, three routes to safety. The first one is standards-compliant development, obviously how it was intended in the beginning. Then you have assessment of non-compliant development, whatever that means. And then you have proven-in-use argumentation, and that is one aspect of safety-critical systems where we place much hope in machine learning: we are actively trying to use machine learning techniques to understand the processes that drive development of projects in the Linux ecosystem, and to learn these processes in a way that lets us guarantee the properties, the reliability properties, the quality properties and so on, that are mandated by the safety standards.

The second reason, or the second project, where we are partly trying to use methods from machine learning is the Civil Infrastructure Platform. Here we're trying to support Linux systems for more than ten years, more than a decade, hopefully two decades in the end, so that a given system configuration can be supported over time. I'm not going to go into detail on why this is important for industrial applications, but I guess you can easily imagine that when industry builds things like power plants, airplanes or trains, they don't want to update the software every three weeks, like you are perhaps used to on your mobile phone. In nuclear power plants, I think everyone agrees, even if we're eventually going to shut them down somewhere, it's wise not to update your software every year, but it's also not wise to update it only every five years; we need to keep it running in a safe way for as long as possible.

That also requires us to obtain knowledge about many aspects that can benefit from machine learning. The first is the software we employ in these systems: once we make a decision to focus on, or bet on, a given component, we are stuck with it for the next decade or so, and so we want to quantify our choice, quantify our trust, as well as possible. That we can do by getting quantitative knowledge about the development process, by using machine learning to extract all the data we can from public sources and then making objective quality statements about these data. Automated software engineering also comes into play here. Backporting patches, security patches, functional patches, from one kernel release to another is day-to-day business in the kernel community over, say, two or three year time frames; but if you go over ten years and have to support five or six kernels in parallel, it becomes a hard problem to even choose which patches to backport. You can do that if you have an unlimited supply of engineers, like perhaps Google or the Linux Foundation have for the newest kernel, but we certainly won't have this effectively unlimited supply for kernels that are five, six, seven years old. So one immediate application of machine learning would be to automatically identify patches that are worth backporting, that need to be backported, by automated classification techniques (a minimal sketch of the idea follows below). But that's not important now; I said I didn't want to give too long a talk, so let me skip that slide and come right to the core of what I initially intended to do.
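Before moving on, here is the promised minimal sketch of how such a backport classifier could look. It is only an illustration of the idea: the commit messages are toy examples, and the labels stand in for ground truth you would in practice derive from history, for instance from which mainline commits ended up in stable trees.

```python
# Minimal sketch of classifying patches as backport candidates from their
# commit messages; the messages and labels below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = [
    "fix NULL pointer dereference in error path",
    "add support for new sensor model",
    "fix use-after-free when device is unplugged",
    "cleanup: rename variables for readability",
]
labels = [1, 0, 1, 0]  # 1 = was backported to a stable kernel, 0 = was not

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(messages, labels)
print(clf.predict(["fix memory leak in probe failure path"]))
```

A production version would obviously use far richer features, the diff itself, touched subsystems, author history, but the classification framing stays the same.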
Software engineering papers, and machine-learning-based software engineering papers, are published at about the same speed as patches to the Linux kernel appear, so it's quite impossible for anyone to read them all and make a fair selection. I have a fairly good overview of what is happening in the software engineering community, based on my own contributions to conferences, journals and research, but I wanted to be as fair and objective as possible in presenting approaches that may be of interest to the Linux ecosystem communities, so I tried to avoid making a strictly personal selection. The method I chose was to think of appropriate keywords (it actually grew into quite a long list of keywords that I found relevant for research that should be considered in the open source world), do a Web of Science search for these keywords, and then again select a subset, because there were too many papers to discuss here. So at the end of the day I ended up with a process that is still unfair and subjective, but, I hope, a little less unfair and a little less subjective than just using my personal judgment. Actually, I was half expecting someone from the Linux Foundation to be here to check for code of conduct violations, and that my session would be immediately cancelled after I admitted in public that I used a very unfair method, but obviously I'm still good.

What I ended up with is about 50 top-rated, representative papers. I guess I need to apologize for not covering them all in detail; of course, I can provide you with the interesting references after the session if you tell me what topics you're interested in, and that's an active offer: I can give you guidance on what results to look into and probably whom to approach. So I'm not going to cover any specific papers, and I also didn't want to highlight some research groups and not others; I'm just going to give you the impression I got from these papers of the tasks for which machine learning methods could work well in your open source project, or in your project that integrates and extends open source software.

The categories I came up with that are potentially suited to machine learning techniques are five-fold: quality and reliability; communities, cooperation and processes, so the more social topics; testing and analyzing code at large scale; understanding licenses and code sharing; and finally effort estimation. I will focus on three of these in particular. Oops, it turns out it's not so easy to advance slides in a multi-screen scenario with this software, but I finally did it.

Quality and reliability: what can we say about this, what are people doing here? Again, I'm not mentioning any specific papers, that would be boring, but if you're interested in any aspect of the quality and reliability work I'm mentioning now, please address me after the session; I can give you exhaustive lists of references, including references particularly suited to your specific problem. The basic thing people consider in this domain is ex-post analysis: application diagnostics, bug reporting, analysis of system crashes and so on. The Linux community is using that partly for kernel development; for instance, there are repositories that collect kernel crashes from the internet, collect the accompanying crash messages, try to classify these into categories, assign the bugs to people who might be responsible for handling them, and so on (a toy sketch of such crash bucketing follows below). The techniques that people have developed, however, can be quite useful for improving your own applications that you have combined from open source components, or the components as such.
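Here is the promised toy sketch of crash bucketing: grouping crash reports by the innermost frames of their backtraces so that recurring crashes can be counted and routed. The traces and the signature depth are invented for illustration.

```python
# Minimal sketch of bucketing crash reports by the top frames of their
# backtraces; the traces below are invented examples.
from collections import defaultdict

def bucket_key(backtrace, depth=2):
    """Use the innermost frames as a coarse crash signature."""
    return tuple(backtrace[:depth])

reports = [
    ["ext4_writepages", "do_writepages", "wb_writeback"],
    ["ext4_writepages", "do_writepages", "writeback_sb_inodes"],
    ["tcp_sendmsg", "sock_sendmsg", "__sys_sendto"],
]

buckets = defaultdict(list)
for trace in reports:
    buckets[bucket_key(trace)].append(trace)

for key, traces in buckets.items():
    print(f"{len(traces)} report(s) with signature {key}")
```

Real systems refine the signature (normalizing inlined frames, ignoring generic helpers), but the bucket-then-assign principle is the same.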
One approach is finding the most critical portions of your software. You can of course do that by simply counting which bugs appeared in which source code module or file, but it turns out this is not a very accurate classification; many more factors come into play. By properly considering these factors, like the frequency with which people interact with files, or higher-level dependencies between components that go beyond being placed in the same file or module, it's possible to predict quite accurately things like where bugs are most likely to appear after changes have been made to certain portions of the code, and where I should direct which testing efforts on my code. That can help to improve the general quality and reliability of the software.

Quality and reliability is also a big topic in safety analysis, and people are actively trying, for the reasons I mentioned at the beginning of the talk, to derive safety arguments from measured properties of software development processes, for instance the count of bugs or the count of bug-fixing commits brought into a repository. You can then see how these factors change over time, how the number of bug fixes goes down. The interesting question, and there are a number of research approaches here, is that it is of course not a good idea to just count how many bugs appear in a given time frame for a given kernel release, because that number will go down simply because people lose interest once later releases are out. So it's essential to identify the confounding factors that influence the direct measurements we can make: what fraction of a decreasing number of bugs can be attributed to decreasing interest in a given piece of software, and how far does a decreasing number of bugs really indicate growing software quality? That's not a question that can be answered straightforwardly, but a lot of effort has gone into finding these confounding factors, building proper statistical models, and doing proper learning on these models.

Two more things I'd like to mention under the umbrella of quality and reliability. The first is analysis and simulation of user behavior, which I mentioned before: most projects still use very straightforward approaches to testing under realistic conditions, conditions that could stem from real user interactions. There are a number of quite sophisticated methods available that could very well improve the situation, but there is currently a big gap between the refinement of the methods available in research and the methods actually applied in real projects, a gap that could use some bridging, so there are lots of opportunities to try out techniques for simulating proper user behavior.

Performance is the other thing that is fairly obvious in terms of quality and reliability and should, in my opinion, also be considered more in open source projects, especially with the highly complex cloud deployments we are facing, where a large number of software components interact with each other in a very hard to predict and very hard to understand way. It's getting very hard to do proper performance tuning based on the traditional engineering approach, where I know which knobs I can turn and what effect each knob has, and then find optimal performance by including a priori knowledge, or by setting parameters to values that are known to work in this or that scenario. Nowadays we have so many tunable parameters, even for a simple stack consisting of a kernel, a database system and a web application, that it's hardly possible to optimize performance by hand. But a very large number of very sophisticated exploration techniques for these large search spaces, using AI and ML, are available; they could and should be used in anything but the most straightforward deployments (a minimal sketch of the idea follows below).
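Here is the promised minimal sketch of exploring such a configuration space, using plain random search as the simplest possible stand-in for the more sophisticated techniques from the literature. The knob names, ranges and the synthetic scoring function are all invented; in reality, measure() would run a benchmark against the deployed stack.

```python
# Minimal sketch of random search over a performance-tuning space; knobs,
# ranges and measure() are invented stand-ins for a real benchmark setup.
import random

space = {
    "db_cache_mb":    (64, 4096),
    "worker_threads": (1, 64),
    "tcp_backlog":    (128, 8192),
}

def measure(cfg):
    # Synthetic score standing in for, say, requests per second; it peaks at
    # db_cache_mb = 1024 and worker_threads = 16 purely for demonstration.
    return -abs(cfg["db_cache_mb"] - 1024) - 50 * abs(cfg["worker_threads"] - 16)

rng = random.Random(1)
best_cfg, best_score = None, float("-inf")
for _ in range(200):
    cfg = {k: rng.randint(lo, hi) for k, (lo, hi) in space.items()}
    score = measure(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score
print("best configuration found:", best_cfg)
```

The research techniques replace the blind sampling with model-guided search (Bayesian optimization, evolutionary methods and so on), but the framing, configuration in, measured score out, is exactly this.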
So, the second big topic I mentioned is communities, cooperation and processes. Open source is very much about communities, about how people interact on international scales, in virtual and real environments, and how they cooperate. One would usually think these are factors that are very hard to quantify and very hard to understand with mathematical modeling and related techniques, but again, it turns out this has been a topic very dear to the research communities, and a large number of software systems has been developed that can record interactions between people and infer how they communicate with each other: directly via mailing lists or chat systems, or indirectly via, say, commits to revision control systems where people work on similar or related parts of a system. A number of quite interesting conclusions can be drawn from such data. For instance, people have shown that if you want to predict bugs, it's much less effective to look at the traditional quality metrics of software: cyclomatic complexity, how many loops I used in a given portion of code, how many global variables I have. All these traditional quality indicators, and I wouldn't even call them quality indicators because most of them are just linearly related to source code size, are basically just measurements you can easily obtain from a system, but they are not very apt for predicting the things that actually interest you quality-wise. It's much easier to predict build system failures, defective components, and areas in the source code where bugs will appear with high probability by looking at the communication structure between developers than by looking at these more traditional factors (a toy sketch of extracting such a structure from commit data follows below).

Of course, it's not the easiest thing to do, and the problem is that although these techniques have very high potential, it's still quite a challenge to apply them to realistic systems, where you stumble across all the nitty-gritty technical details you all know. The typical experience when you try one of the tools that research provides in this area is: you install the tool, it will not run; you beat the tool into running and apply it to your software, and it crashes for some reason or another; and then you lose interest in that tool. That's a drawback, but once you have mastered this initial stage, especially when you can focus on a solution that promises to be useful instead of facing an endless choice between tools, each requiring a considerable up-front time investment (and again, I'm offering to give you guidance on that after the talk), then it's something that is really worth trying out, with sometimes astonishingly accurate predictions. Of course, sometimes the predictions are complete nonsense, but well, that's the risk of life.
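As a toy illustration of extracting such a communication structure, here is a sketch that links two developers whenever they changed the same file. The commit records are invented; in practice you would parse something like `git log --name-only`.

```python
# Minimal sketch of a developer interaction structure inferred from commits:
# two developers are linked when they changed the same file.
from collections import defaultdict
from itertools import combinations

commits = [  # (author, files touched) -- invented examples
    ("alice", ["sched/core.c"]),
    ("bob",   ["sched/core.c", "sched/fair.c"]),
    ("carol", ["mm/slab.c"]),
    ("alice", ["sched/fair.c"]),
]

touched = defaultdict(set)  # file -> developers who changed it
for author, files in commits:
    for f in files:
        touched[f].add(author)

edges = defaultdict(int)    # developer pair -> number of shared files
for devs in touched.values():
    for a, b in combinations(sorted(devs), 2):
        edges[(a, b)] += 1

# Heavily connected pairs hint at coordination hot spots in the code base.
for (a, b), weight in sorted(edges.items(), key=lambda e: -e[1]):
    print(f"{a} <-> {b}: {weight} shared file(s)")
```

The research results I mentioned then feed graph properties of such networks (centrality, clustering and so on) into the prediction models instead of, or alongside, code metrics.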
One more thing I'd like to mention with regard to communities, cooperation and processes is the notion of adherence to processes in open source software at all. While 15 or 20 years ago open source software may have been the wild west in terms of processes, and people just did what they thought was best, most large projects these days have, or think they have, quite elaborate processes that should be followed. For instance, the Linux kernel community thinks a lot about how their development processes work and are organized, and has spent lots of effort on streamlining these processes, trying to arrive at optimal ones. When you use these quantitative analysis methods, you have the possibility to extract from your measurements the effective processes that people actually follow, and it's quite astonishing to compare these effective processes with what people think they are using as a process. Knowing the difference can of course bring many insights: you can think about why you are acting differently from what you document in writing, or whether there are perhaps reasons why these process violations occur. Think about security patches in the Linux kernel. The usual kernel submission process is that a person writes a patch, sends it to a mailing list for discussion, gets criticism, improves the patch, sends it back to the mailing list, and at some point it's picked up by the respective maintainer and travels upstream. If this doesn't happen, and you see lots of patches appearing in the repositories that were not covered by this process, one obvious explanation is security-related patches that developers intentionally do not want to discuss on mailing lists before the vulnerabilities are disclosed. But it could also be developers misusing the process by bringing in deliberately ill-crafted patches that introduce security risks or bugs into the project, which you of course want to detect (a toy version of such a conformance check follows below).
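Here is a deliberately tiny version of such a process-conformance check: flag commits whose patches never showed up on the review mailing list. Both data sets are invented, and real matching would have to be done on patch content rather than subject lines, but the principle is the same.

```python
# Minimal sketch of a process-conformance check: which commits bypassed the
# mailing-list review step? (both data sources are invented examples)
committed = {
    "a1f3": "fix oops in foo_probe",
    "b27c": "harden bar_ioctl input checks",
}
posted_subjects = {"fix oops in foo_probe"}  # subjects seen on the list

for sha, subject in committed.items():
    if subject not in posted_subjects:
        print(f"commit {sha} ('{subject}') bypassed list review, worth a look")
```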
Moving on to effort estimation, the third big topic I'd like to discuss with you. Here we have a bit of a mixed picture when it comes to research, at least in my subjective classification scheme. What people obsess a lot about in this area is how to measure patch acceptance times and how to optimize patches for quick acceptance into repositories. There is a very large number of papers discussing a very large number of measures, or qualities, that patches could have in order to be quickly accepted upstream, but at the end of the day I think most of these measures are trivial: you can say that patches with trivial fixes go upstream quicker than patches that introduce large and complicated features, which probably does not cause the greatest astonishment in the community, but is something that can be very nicely modeled mathematically. What's more interesting in this field is the ex-post consideration of how much effort it took to develop and upstream a patch, determined after it has been accepted into the repositories, because that can give you quite interesting insights into implementation costs.

Now, perhaps you would think it doesn't matter, that the information of how much money and effort it took to implement a given change to a system is not very interesting after the fact. But it actually turns out, and everyone who has done software effort estimation knows this, that estimating the effort required to develop software up front is about as hard as predicting airport opening times in Germany. If you're not from Germany or Europe, you may want to Google for "Berlin airport" to see why I keep obsessing about airport opening times in this talk; it's one of our German engineering anti-features that we can always use as an example of how hard it is to predict things. Coming back to effort estimation: it turns out that when you have a large number of patches, you can infer ex post the costs it took to write them. This is very well possible with open source software, and especially on a per-project basis, because many projects these days have a very large history going back years, perhaps even decades. It's possible to come up with specific models for a specific development situation, models that are close to a given situation in an open source project, and these can help you predict, in my experience much more accurately than with all the traditional guesswork-and-magic methods out there, how much effort will be required for your internal software projects, be that contributing to open source, integrating open source, or even developing proprietary in-house software.

I promised I wouldn't use up too much of your time and would leave room for Q&A afterwards, and I see I'm about to break this promise, so let me discuss the challenges quickly. Of course, there are a lot of challenges in applying research results to open source or custom projects. It turns out that data validity is usually one of the major issues. It seems obvious that open source development leaves a lot of publicly accessible traces that can be fed right into machine learning algorithms and software; in reality, this is absolutely not the case. It's really unbelievable how many broken encodings, how many ignored standards for emails, how many broken things in every aspect of data communication you find on the internet. But again, with some experience of having suffered through this process of cleaning the data, making sure it's consistent across versions, and making sure it's mostly free of contradictions, you can overcome this step (a small taste of the defensive style this requires follows below), and again, I'm happy to offer some guidance on specific problems here after the talk.
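As that promised small taste of defensive data cleaning, here is a sketch of charset handling for mailing list archives. The example bytes are constructed, but the pattern of trying the declared encoding and then falling back is typical of what such mining code ends up doing.

```python
# Minimal sketch of defensive decoding for mined mailing-list data: try the
# declared charset, then common fallbacks, instead of trusting the headers.
def robust_decode(raw, declared=None):
    for charset in filter(None, [declared, "utf-8", "latin-1"]):
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("utf-8", errors="replace")  # last resort: keep going

# A latin-1 body wrongly declared as UTF-8, as happens in real archives.
print(robust_decode(b"Gr\xfc\xdfe aus Regensburg", declared="utf-8"))
```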
Another challenge in just testing and trying the many promising research results in open source communities is that both communities are quite disconnected at the moment. If you have the ability to travel back in time, you can go back 23 hours from now and see my talk on open source communities and scientific research, where I took the liberty of ranting a bit about this disconnect and how to resolve it; if you cannot travel back in time, perhaps the video recording of that talk will be of interest to you. Two more things are challenging in the adoption of these solutions. Research often uses commercial tools that are not available to the average open source developer, either because they're proprietary and people deliberately don't want to use them, or because they're simply in cost ranges that don't lend themselves to just trying things out. And unfortunately, many interesting results are focused on quite restricted settings with regard to the technologies used; language-wise, academia really favors Java and C#, and that's one obvious restriction that makes it hard to apply these approaches in open source systems. And don't even get me started on paywalls. I work in academia, so I thought I would have, or I think I do have, really good access to all the services behind paywalls, but even I often cannot access papers that promise substantial reductions in engineering effort or substantial improvements in software quality. I really don't know why people think it makes sense to hide such papers behind paywalls, because if no one can read about the research, that's a guarantee it will never see any application in practice. But that's a long story.

Good, coming to the end and leaving some time for discussion: the conclusion. One general caveat in applying scientific results, and in aligning expectations of what you can realistically expect from a machine learning analysis of open source development processes, is the same thing that can be said about recent events, for instance the Covid-19 pandemic. Now that everyone has become a hobby virologist and hobby epidemiologist, people have gained a lot of interest in statistics, and one particularly interesting comment on why so many forecasts of the behavior of the Covid-19 pandemic were wrong came from one of the leading researchers in that field, Nassim Taleb, who wrote a comment on single-point forecasts for fat-tailed variables. At least the first three points of this comment I can directly reuse to draw the conclusion and set expectations for what such scientific methods, applied to open source projects, can give you. His finding is that forecasting single variables in fat-tailed domains is in violation of both common sense and probability theory. Without discussing exactly what fat tails mean in statistics: basically, you are observing parameters that are not uniformly, evenly or Gaussianly distributed, but exhibit distributions that are very skewed, very asymmetric. That's the same in measuring pandemics and in measuring many properties of software and open source systems (a tiny simulation of the effect follows below). What you cannot get, although people would usually like it, are statements like: I have this source code base, and the next bug will appear in precisely this spot. Or, if you consider measuring latencies in real-time systems, people would like statements like: a latency of 27 milliseconds will appear, caused by this and that combination of input events and code in the system. That is not what any realistic statistical model or machine learning approach can give you, because, quite like pandemics, most data we encounter in software engineering is also extremely fat-tailed, and has other properties that make this kind of point-based forecasting impossible.

Let me close with point number three, which summarizes the expectations very well: science is not about making single-point predictions, but about understanding properties, which can sometimes be tested by single points. So what machine learning methods can give you is a better understanding of what you are doing, how your processes are doing, how your interaction and communication in the community are doing; not of individual artifacts, be that files, be that persons, be that whatever. But it can give you a much better understanding of what's going on in your project and your code overall, and that will almost certainly lead to thoughts about how to improve these processes, and to really good ideas about how to improve the code and the structures in your projects.
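As promised, here is a tiny simulation of what fat tails mean for point estimates, using a Pareto distribution as a stand-in for, say, patch sizes or latencies; the parameters are arbitrary and only chosen to make the effect visible.

```python
# Minimal sketch of why point forecasts fail for fat-tailed data: in a
# Pareto sample with alpha close to 1, single observations can dominate the
# total, so the sample mean is extremely unstable.
import random

rng = random.Random(0)
sample = [rng.paretovariate(1.1) for _ in range(100_000)]
print("largest value / total: %.2f" % (max(sample) / sum(sample)))
print("sample mean: %.2f (theoretical mean is 11; convergence is very slow)"
      % (sum(sample) / len(sample)))
```

Run it a few times with different seeds and watch the mean jump around; that instability is exactly why single-point forecasts on such data are meaningless, while statements about the distribution's properties remain useful.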
But of course you need the curiosity and, say, the willingness to apply the corresponding techniques, even if it takes a bit of effort and patience. With that, let me come to the end. I'm listing again the five domains that I personally think are best suited for looking into when it comes to applying machine learning and AI, and the final tip I want to give you is really to try to approach a researcher. If you find something in the literature that's interesting to you, or if I suggest something that's interesting to you, please really don't hesitate to contact these people, to tell them: I'm from open source project XYZ, or I'm from company so-and-so, and I'm interested in your techniques. This is something researchers are really looking forward to, but, as one hears at scientific conferences, it happens far too rarely. People are really keen to hear back from you as practitioners, and are usually also willing to spend quite a lot of effort, within reason of course, on adapting their approaches to your specific problems and helping you gain insights using their methods. So approaching researchers with your problems is some effort that is really well spent and usually pays off very well.

With that, let me thank you for your attention so far. While I was, as usual, much slower than I anticipated and have used up more of your patience than I wanted to, we still have time to hear your comments and questions. I already see one, so let me start with that one, and let me stop the screen share first. The question is: can you tell us what are some of the most meaningful metrics when attempting to build a model for effort estimation? With effort estimation, I assume you are referring to developer time; since I'm not hearing anything to the contrary from the person asking, I assume yes. That question actually has two aspects. For one, I said in the talk that using all the classical predictors to estimate effort usually doesn't work so well. However, when you try to train models that predict the effort, the monetary effort, the personnel effort and so on, to implement features, the main problem is of course that you do not have a very formal description of what the features will look like; you only have a rough understanding of where you're heading. It turns out that when you run machine learning approaches on this kind of problem, one very effective approach is to first identify similar problems in other projects. How to identify these similar problems is itself a non-trivial problem; I'll give some references on that after the talk. Once you have done that, it astonishingly turns out that you then have two good predictors for the effort, and this contradicts a bit what I said before. One very good predictor is the number of lines of code, so again a very elementary measure that is often linearly correlated with the effort it takes to write the code, given that you have found an appropriate comparison scenario. What works even better is the number of commits, because that excludes some problems that appear when, for instance, you merge in code from other projects to solve a problem. What happens quite often in development projects is that you include external libraries in your code, which of course makes the number of lines of code unreliable. The number of commits eliminates all these variation factors, provided you can find suitably similar areas within systems (a toy calibration of such a model follows below).
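To close the loop on this answer, here is a toy calibration of such a commit-based effort model using ordinary least squares; the numbers are invented and only stand in for what you would extract from your own project history of comparable, finished work.

```python
# Minimal sketch of calibrating an effort model on similar past work and
# using it to estimate a new task; all numbers are invented examples.
past_commits = [12, 40, 7, 55, 23]    # commits per finished feature
past_effort  = [30, 95, 20, 140, 60]  # person-days actually spent

# Ordinary least squares for effort = a * commits + b, done by hand to
# avoid external dependencies.
n = len(past_commits)
mean_c = sum(past_commits) / n
mean_e = sum(past_effort) / n
a = (sum((c - mean_c) * (e - mean_e) for c, e in zip(past_commits, past_effort))
     / sum((c - mean_c) ** 2 for c in past_commits))
b = mean_e - a * mean_c

estimated_commits = 30  # expected size of the new, similar task
print("estimated effort: %.0f person-days" % (a * estimated_commits + b))
```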
Are there any more questions or comments from the audience? Perhaps approaches you've heard of that work well? Or, what I'd be even more interested in: perhaps someone is actively using machine learning or AI in their projects? No? Then I'll still be around if you want to ask specific questions about specific problems or get references for a given scenario; I'll be in the Slack channel today and of course also tomorrow, and I would love to hear back from you there. With no more open questions pending, let me thank you for your attention, and I hope to see you again in person next year, after we've hopefully finally beaten the Covid pandemic. Everyone stay safe; hopefully see you next year. Enjoy the conference, goodbye.