Hello, okay everyone, we're about to start our next talk. You can see the title slide is here, so I'm going to hand it over to you, Andreas.

Thank you. Hello, everyone. My name is Andreas Platschek, I'm working with OpenTech, and most of my working time currently goes into a project called SIL2LinuxMP. The title already reveals a lot about the project: we are planning to build a Linux-based platform that shall be certified to SIL 2, Safety Integrity Level 2, according to IEC 61508. And we're talking about MP, multi-processor systems; we're not interested in single-core anymore.

One big part of this project, and the part I'm working on right now, is to analyze the development process of the Linux kernel and of our other elements. For that we're using data-mining techniques, and that's what I'm here to talk about today.

First I want to give you a little introduction to the SIL2LinuxMP project. The actual goal, I just said, is that we want to have a certified platform. But the most important goal is not to certify one platform; the most important goal is to have a framework that allows us to certify all future platforms: future kernel versions, different configurations, and so on. We want to verify our framework by doing the certification for one platform, for one use case. It should be suitable for up to SIL 2, that's the target, and multi-core systems, as already mentioned.

The open-source elements that we want to use on the platform are the Linux kernel, glibc, BusyBox, and some smaller tools. Today I'll mostly focus on the Linux kernel, because our assumption is: if we can do this kind of analysis for the Linux kernel, then it will be doable for the other components as well. And even if it's not doable for, say, BusyBox, which has a different development life cycle, then we still have other options there, because it's just a lot smaller.

The methods that I'm presenting today are basically suitable for all kinds of pre-existing
software; they're not restricted to the Linux kernel. What we need as a prerequisite is the source code, of course, and its history, preferably in git. The target is software-intensive systems; you probably already gathered that from the list of elements.

When you want to certify software according to IEC 61508, you basically have three options. They're called Route 1s, Route 2s, and Route 3s. Route 1s is called compliant development: you basically develop the software from scratch and, from the very beginning, follow the design life cycle as defined in the standard. So that's not an option here. Route 2s is called, sorry, no, it's proof by previous usage, proven in use. Yeah, thank you, Lucas. So that's only suitable for smaller software, not for a big chunk like the Linux kernel. And the third one, that's the one that we are using: it's called assessment of non-compliant development, Route 3s.

For this route we have some assumptions. The first assumption is that we actually have a process in place that is defined. Of course, this process also has to have been followed. It's not a problem if there are discrepancies between the actual process and the process that is defined, but we have to be able to assess the discrepancies, and if there are procedural defects, we need to assign mitigations to counteract them.

Some examples of the assurance that we want to give: the qualification of involved people. In a company setup it's very easy to say, okay, these people are working on the project, they have this training, they have this experience from previous projects, and to just write this small summary. But for the thousands of developers of the Linux kernel, that's not possible in this form. So what we're doing is to actually look at the development history of the developers and assign them an experience level, so to speak. Then there are structural aspects of the organization.
We're doing analysis of the reviewing process, of how patches get into mainline, of the integration, how many clashes there were during integration into linux-next, and so on. Then the methods and techniques used: there are a number of methods and tools that shall be used before submitting a patch, and we're also doing an analysis of whether this has actually been done, or to what extent it has been done. For example: are the patches following the coding style, or is the coding style just for decoration? And we want to present the results in a quantitative manner. We don't want to make qualitative assertions; we really want to put numbers on basically everything.

So our concept for Route 3s is, in the first place, to select those components that have no procedural defects. If it's possible, and there are multiple components that provide the same or similar functionality, then we try to select the one where we have better proof. For example, if you have a file system, and during the analysis it turns out that they didn't follow the process, or part of the process, then it is easier to just select another file system than trying to do,
I don't know, lots of extra review or whatever.

So, assessment of the process: as already said, we're trying to assess all the things, from how patches get into the kernel, how many integration clashes happen, how reviews are done and to what extent, and so on. Then we do an assessment of the consistency of the results, and as I said, we want to quantify everything; we also want to quantify the residual risks. More on this quantification will be presented in tomorrow's talk by Nicholas. So I am kind of laying the foundation for how we are gathering the data.

The first point I had before was that a process has to be in place, and the good news is: there is a development life cycle in place in the Linux kernel. It's documented in the source code, in the git repository, in Documentation/process. Some examples of the content are: how patches should be formatted, how the subject line of the patch should look, what the body should contain, how patches should be signed off, and so on. It also defines the usage of the various *-by tags, it gives checklists on how and what you should test before submitting a patch, and it also helps you to find out where to send patches. So the good news is that we have a development life cycle in place. The next question is: is it followed, and how well is it followed?

But before we go into that question, I have one more thing that I want to show you. I'm not sure how many of you know how a patch actually goes into the mainline kernel, so I tried to picture this a little bit. At the very top,
we have the mailing lists, the Linux kernel mailing list and the mailing lists for the subsystems, and this is where you send your patch. Usually, most patches are just sent per email to the mailing lists and to the subsystem maintainers. From the emails, the subsystem maintainers integrate them into the subsystem trees if they're okay; if not, they send back an email and say, hey, you have to fix this up and send another version of your patch so I can take it. So now the patch is in the subsystem trees.

From the subsystem trees we have a daily integration into linux-next, the integration repository; that's done by Stephen Rothwell. Every day, about 240 subsystem trees are pulled and integrated into this integration repository, linux-next. Based on this integration kernel, a number of automated tests are run: there are build bots, there's kernelci, and some others. They all have their specific purposes, their targets, the defects they want to find; they do automated testing and send warnings back to the mailing list if they have findings. So this is happening more or less continuously.

The question now is: where do the versions come in? You may have heard the term that Linus Torvalds "opens the merge window"; we are on the very left now. The merge window is a period of about two weeks during which Linus Torvalds pulls the commits from linux-next, the integration kernel, into his own git repository, and at the end of this two-week merge window he produces the first release candidate of the new kernel version 4.n. That would be the leftmost green box, 4.n-rc1.

Now we are in a stabilization phase, where patches are pulled into further release candidates, usually about six to nine of them. It's important to note that the number of patches pulled in decreases radically between rc1 and the others. We looked at 4.4:
I think rc1 got almost 13,000 new patches, while each of the other release candidates just got a few hundred, three hundred, two hundred, something like this. At some release candidate X, Linus decides: okay, now we have a stable kernel, and he releases the new stable version 4.n.

Into this stable version 4.n, only bug fixes are integrated from this point on. From the mailing lists, only bug fixes go into the dot releases: 4.n.1, 4.n.2, up to 4.n.y. And y depends: if it's a long-term stable, this can be something really big, like 80 or 100; if it's a non-long-term stable, this can be just something like 10, and then the version is not continued anymore. After some time, I think about two to two and a half, three months, a new merge window is opened and the whole cycle starts over again.

I'm not sure if there's anyone here who never saw a git commit, but I thought I would spend two slides on this. Basically, this is what probably everyone has seen before: we have a commit that is identified by a 40-character SHA-1 hash; the author is listed, as well as the date the patch was authored. The first line is the short description, kind of a subject; then we have a body. And what maybe not all of you know, at least if you just know git, because not every project uses these, are the last four lines, those tags.

I'll start with the bottom two, the Signed-off-by tags. I said the author has to sign off on his commit, so the author signs off to say: I'm aware that I'm now contributing to a project that is licensed under GPLv2, and I'm okay with this. So the sign-off is on the license, basically, and of course it says that the work is okay. The bottom sign-off is by the subsystem maintainer, saying that everything is okay with the patch.

Then the Acked-by: someone probably reviewed it on the mailing list and said, okay, I reviewed this, this is okay with me.
Or it might also be some other subsystem maintainer, whose subsystem is impacted by the patch, saying: okay, I'm okay with this.

Also very interesting is the first one of those four, the Fixes tag. This Fixes tag basically says: okay, this is a bug fix, and this bug fix fixes a bug that was introduced in the commit identified by this SHA-1 hash, or by the first 12 characters of the SHA-1 hash. This allows us to go back and find out where and when the bug was introduced. And of course, from the second part of the commit, the patch itself, the actual changes, we can find out which files were touched, which lines in the files were touched, and so on.

(For the stable commits, that's done by the stable maintainers, for the long-terms also by the stable maintainers, and they get the fixes from the mailing lists. Usually they are just CC'd on the mailing list if someone thinks a patch is a stable bug fix, and then they decide.)

Now, when we started our adventure of looking at this development life cycle, some problems popped up. Our first explorations were simple command-line one-liners and Python scripts, trying to parse out data just to find out: what's there, what can we do? But at some point we decided we have to do this differently, because we are a team of developers, and we also have partner companies in our SIL2LinuxMP project. We want to make this data available, we want to distribute the data to the team members, and we also want to keep it up to date. Ideally, we don't want everyone to have to take care of keeping their own data up to date, so we want to do this in one central place. We also want to support exploratory analysis: if someone decides to still do the command-line one-liners first, that's still fine, but we want to present the data in a way that gives you a quick impression of what's there, or what might be missing.
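The trailer tags described above (Signed-off-by, Acked-by, Fixes, and friends) are plain text lines at the end of the commit message, so pulling them out is a few lines of Python. This is only a minimal sketch, not the project's actual tooling; the tag list and the sample commit message are made up for illustration:

```python
import re

# Trailer tags we care about; the real kernel uses a few more.
KNOWN_TAGS = {'Signed-off-by', 'Acked-by', 'Reviewed-by', 'Tested-by',
              'Reported-by', 'Fixes', 'Cc'}
TRAILER_RE = re.compile(r'^(?P<tag>[A-Za-z-]+):\s*(?P<value>.+)$')

def parse_trailers(message):
    """Return (tag, value) pairs for the known trailer tags of a commit message."""
    trailers = []
    for line in message.splitlines():
        m = TRAILER_RE.match(line.strip())
        if m and m.group('tag') in KNOWN_TAGS:
            trailers.append((m.group('tag'), m.group('value')))
    return trailers

# A made-up commit message in the usual kernel shape.
msg = '''ext4: fix an off-by-one in the journal code

A short explanation of the bug and of the fix.

Fixes: 1234567890ab ("ext4: add the journal code")
Acked-by: Some Maintainer <maintainer@example.org>
Signed-off-by: Jane Developer <jane@example.org>
Signed-off-by: Some Maintainer <maintainer@example.org>
'''

trailers = parse_trailers(msg)
print(trailers)
```

From the parsed pairs, the two sign-offs (author and maintainer), the ack, and the introducing commit named in the Fixes tag fall out directly.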
Cleaning data, that's an important point, and I have some examples of how and in what ways we are cleaning data. It's important to also do this in one place, because otherwise people could do it differently and the datasets would drift into different directions. We want to keep it consistent for all team members, so it doesn't happen that someone posts a problem and it turns out it's just because their dataset looks a little bit different. And of course we want to eliminate the processing overhead between the different analysis scripts: we have different scripts that use the same data, but we don't want to grep and parse it out of the git logs every time, over and over again.

A few points on data cleaning. We use the developer name as a unique identifier for the developer, as in real life, and here are some examples that we found in the kernel. While most of the things we're cleaning up don't have the entertainment factor of these examples, we still want to attribute all the commits of one developer to the same developer. So if there is a typo in the name, if there are different lower and upper cases in the name, something like this, then we want to catch that and attribute all the commits to the same developer. Another thing is the subdomains.
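Both kinds of cleaning, folding name variants together and mapping company subdomains to one affiliation, can be sketched in a few lines. The names, domains, and alias table below are invented for illustration; the project's real cleaning rules are more involved:

```python
import unicodedata

# Made-up mapping: different subdomains and country domains of one company
# all resolve to a single canonical affiliation.
DOMAIN_ALIASES = {
    'de.example.com': 'example.com',   # German branch
    'us.example.com': 'example.com',   # US branch
    'example.co.uk': 'example.com',    # UK country domain
}

def canonical_name(name):
    """Fold case, accents and stray whitespace so name variants compare equal."""
    name = unicodedata.normalize('NFKC', name)
    return ' '.join(name.split()).casefold()

def affiliation(email):
    """Map an email address to a canonical company domain."""
    domain = email.rsplit('@', 1)[-1].lower()
    return DOMAIN_ALIASES.get(domain, domain)

print(canonical_name('Jane  DOE'), affiliation('jane@de.example.com'))
```

With normalization like this, `Jane  DOE` and `jane Doe` count as one developer, and commits from `de.example.com` and `us.example.com` count toward the same company.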
We're using the email addresses to find out where people are working. The problem is, as most of you know, some companies just use different subdomains, or different domains for different branches on different continents or in different countries, or whatever. So we also want to clean that up, so that the affiliation can be determined properly. We're doing this to find out if there might be some dependence between the developers.

Now we're coming back to the Fixes tag. Before, I just showed it to you and told you what it's about; here is the actual definition of how a Fixes tag should look in a patch: use the Fixes tag with the first 12 characters of the SHA-1 ID and the one-line summary. For example: "Fixes:", then the 12 characters, then the summary.

The problem is that, of course, also here there are some examples that look a little bit different. We have references to "Bug 14662", I don't know, probably some internal bug tracking system. Or the next one, where "NB" seems to stand for Nokia bug tracking, because it was submitted by some guy from Nokia; but I have no idea, so if you know, please let me know. The third one is basically just text. Then "Fixes: version 1.0"; we're already joking that we should be using version 1.0 of the Linux kernel for SIL2LinuxMP, because it has been fixed. And the last one, which actually seems to be a good idea, is a reference to a bug tracking system, in this example bugzilla.kernel.org. The problem is that we found a bunch of those, and some links are just dead. And these are just the bug trackers that are referenced by Fixes tags; there may well be others that are used somewhere in the body and not in the Fixes tag.

Then the length of the hash that was used; I tried to depict this a little bit.
It varies from somewhere around 4 to 40 characters, but at least the proper value of 12 is the most common one. This is really just for demonstration purposes; actually, everything that's seven-plus characters is not a big problem, as long as we can resolve it in git. Usually it should be at least seven, I think. Basically, we go back anyway and try to find out which commit it is and get the 40-character hash, so that's not too big of a problem.

Now, while these examples make it look like a bad situation, it's really not that bad. In total we have a little bit more than 24,000 patches with Fixes tags, and only 744 of them could not be resolved; that's about 3%. A big part of these are URLs, and 18 are CVEs, but there are also, for example, hashes that we can't resolve because it seems the referenced hash was in some subsystem tree and just changed when it came to mainline. There are many more fixes that don't have a Fixes tag at all; we'll come to this a little bit later.

Okay, so at this point I want to show you what our system actually looks like and how we distribute the data. This is still in a prototyping phase; we want to send it out to our partners within the next weeks. So what are we actually doing? We have a web interface for browsing the data to get an overview, but for convenient use in R or in Python we provide comma-separated-values files as download. These are generated on demand; we keep everything automatically updated, it's basically just a cron job; and it's extended as needed. Last week, for example, one of our team members came up with a new idea of what he wants to analyze, but some data was missing;
so we just extend it as we need it.

I'd like to give you a short tour. This is our web interface, not very exciting. Basically we just have a list of our projects, and if you look, for example, into linux-stable, we get a long list of all the commits.

Maybe one example where data cleaning is still in progress: we're working on the dates. It's very improbable that this patch was authored in 2037. These, and of course the ones from 1970, are easy to spot; it's harder if it's just an offset of two years, and that was five years ago. But we're trying to find ways to put numbers on how many patches actually have wrong timestamps.

So we get the long list of the commits, and if you go to one of these commits you get more detailed information: the number of lines added and removed, files changed, directories changed, whether it's a merge or not, who signed off on it, the commit message. And what's missing here, because this is a development server running on my laptop and I just realized this morning that I forgot to enable it: we're also running checkpatch on all the patches, so usually you would also get the number of errors and warnings from checkpatch here. For those who don't know: checkpatch is a tool that helps you find out whether a patch is compliant with the coding style.

We can also go to the developers list.
Here we already get some numbers: for each developer we get the number of commits this developer submitted, or rather the number of commits that got into mainline, the number of Signed-off-bys this developer gave, the number of Reviewed-bys, Acked-bys, and so on. Going to one developer, we get all the email addresses that were ever used, not only to commit but also to sign off, to review, and so on, resolved into companies where possible, plus all the commits in this project that were done by this developer, and the Fixes tags.

So basically what we have here is: in the left column, the commit where the bug was fixed; then the information we got from the Fixes tag, and we'll see this is important when we get a URL; then the resolved commit where the bug was introduced; and the time to fix, in days. If we have a URL, for example, we try to resolve it. I don't have an internet connection now, but if this is not a dead link, we can just go there.

Okay, so what's the general idea behind this approach? We have our input data; currently this is git log and git blame. We also take the GIMPLE output from GCC, I'll explain later what exactly for, and we also use the cyclomatic complexity of all the functions; this is still a prototype that I don't have in the web interface yet. We put all this data into our database, which is presented to the developers through the web interface and in the form of the comma-separated-values files, and the developers, actually, it's the statistics guys, can use this data conveniently in their R scripts and produce some results, some statistical analysis.

Let's look at an example of how that looks. Can you read this? It's very hard to read. Basically, all I did was fire up R and do a read.csv of the URL of the fixes-tag data. Now we have the Fixes tags in an R data frame, and we can, for example, calculate the mean time to fix in days, or we can do a histogram of the time to fix. It's really convenient.
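The same kind of analysis works just as well from Python, which the project also targets. This is a sketch, not the project's actual code: the CSV layout with a `time_to_fix_days` column is invented here, and the real exports may use different column names:

```python
import csv
import io
import statistics

# Stand-in for downloading one of the generated CSV files; in practice this
# would come from the project's web interface via its URL.
data = '''fixed_commit,introducing_commit,time_to_fix_days
aaaa,bbbb,10
cccc,dddd,200
eeee,ffff,33
'''

rows = list(csv.DictReader(io.StringIO(data)))
days = [int(r['time_to_fix_days']) for r in rows]

mean_ttf = statistics.mean(days)      # mean time to fix, in days
median_ttf = statistics.median(days)  # median is less sensitive to outliers
print(mean_ttf, median_ttf)
```

As with the R one-liners, getting from the downloaded CSV to a first number is only a handful of lines.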
It's very simple to get from the data to the actual analysis.

The second example I want to show you is a little bit more complicated. Here I download a bunch of CSV files, and what I do is fetch, for version 4.4 to 4.4.1, all the stable bug fixes; that's n, the number of stable bug fixes, and "fixes" is the number of bug fixes that have a Fixes tag. (So now we come to your question.) And then the mean time those fixes with a Fixes tag took to fix. I'm doing this for all versions from 4.4.1 to 4.4.44, and I get a nice image: the blue line is all the stable bug fixes from 4.4.1 to 4.4.44, and the orange one is the stable bug fixes that actually have a Fixes tag. I think my colleague put a number on it; it's about one third of those stable bug fixes that have a Fixes tag. We have this convenient situation where we know that all those patches going into the dot releases are bug fixes, so we can compare and find out how many bug fixes actually carry a Fixes tag, at least for that part. And the other thing we can do, of course, is a scatter plot, stable bug fixes over stable bug fixes that have a Fixes tag, fit a linear regression model, for example, and put the fitted line into the plot. I put all this code on the slides, so it is available for download on the FOSDEM web page.

One other thing I want to talk about is where the GIMPLE output comes in. We want to do patch impact analysis: what we want to find out is which patches actually have an impact on our configuration of the Linux kernel. All the data I looked at up to now was really for all commits of the kernel, but we are only interested in those that have an impact on our configuration. The way we do it is that we build the source code and let GCC dump the GIMPLE output. From the GIMPLE output,
we get all the files and the line ranges that are actually in our configuration, and we use this information to get, from the git repository, using git blame and git log -L, the history for each line that is actually in our configuration. This way we can find only those patches that actually have an impact on our configuration.

Okay, so that's already it from my side. A little bit fast, sorry.

(Audience question, inaudible.) Sorry, the GIMPLE output? Basically, it contains the line information for each line that was used in your build. It just creates an output where you can see each line; for example, we know that fs/inode.c line 1947 was used. So from this kind of information we can find out which lines were actually used at compile time.

(Audience question, inaudible.) No, compiled. This is done at compile time; this is really done while the binary is built.

(Audience question, inaudible.) Sorry? Not yet. I know my colleague is in contact with people upstream and probably also from the Linux Foundation, so we're planning to do this.

(Audience question, inaudible.) Yeah, that's one of the things we are afraid of. Before, I said that we're trying to find out the experience of the developers. Of course, you can only do this within this one project, and a developer may be very experienced in other projects, so there are some limits to what we can do. But one thing that we're joking about is that people could start trying to get higher scores just for fun, or something like that. So yeah, maybe it's a problem, maybe not. For example, yeah, you found out: there are actually some examples in tomorrow's talk, given by Nicholas.

(Audience question, inaudible.) Sorry, the question was whether I can give some examples of how this will be used in the actual certification process, and of some defects that we have found so far. Actually, my colleague has some data on this tomorrow; he's doing the analysis not only over the full kernel, but also on different subsystems.
I think one big problem we found was with one of the file systems; that is a truthful and relevant representation of a typical bug in the kernel.

(Audience question, inaudible.) The adjusted R² there is about 0.96. And now we can look at the average age of each line that is in our specific configuration, and from that we can estimate which ones are new files or new lines, and therefore have a higher probability of containing bugs, and which ones are very old lines; some of them are 12 years old. Okay, but we can have a longer talk about this later on.

Okay, thank you.
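The patch-impact filtering described in the talk, intersecting the line ranges from the GIMPLE dump with the per-line history from git, can be sketched as follows. All the data here is hand-written sample data: in the real pipeline the line ranges come from GCC's GIMPLE output and the per-line commits from `git blame` / `git log -L`:

```python
# File -> line range that the GIMPLE dump says is part of our configuration.
used_lines = {
    'fs/inode.c': range(1940, 1960),   # e.g. the range around line 1947
}

# (file, line) -> commit that last touched it, as git blame would report.
blame = {
    ('fs/inode.c', 1947): 'abc123',
    ('fs/inode.c', 10): 'def456',      # line not used in our configuration
    ('mm/slab.c', 5): '789abc',        # file not used in our configuration
}

def impacting_commits(used, blame):
    """Commits that touched at least one line used in our configuration."""
    return {commit for (path, line), commit in blame.items()
            if path in used and line in used[path]}

print(impacting_commits(used_lines, blame))
```

Only the commit that touched a line inside a used range survives the filter; everything else is, for this configuration, irrelevant history.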