Seemingly not, yes. So welcome to the final session of the day. My name is Wolfgang Maurer, and I simultaneously work for two companies: one is a university, the Technical University of Applied Sciences in Regensburg, and the other one, which you may have heard of, is Siemens Corporate Research, where I'm active in the embedded Linux team and have been for about 10 years by now. I'm going to talk to you in this late session; I guess you're all very exhausted already from a day full of learning and knowledge, so I'm all the happier that you came here to be tortured with a little statistics at the end of the day. The talk is about embedded Linux quality assurance, and I decided to give it the catchy subtitle "how to not lie with statistics"; I may say a few words about that later on.

I already said a little something about me, Siemens Corporate Technology and the university in Regensburg. What I'm assuming about you is that you are, in some way or another, in the Linux system building business: that you're a software architect or a system architect. Statistical methods are increasingly applied in safety-critical domains and in real-time domains, but I'm not assuming much real-time knowledge, besides familiarity with some elementary data sets maybe, nor knowledge about safety-critical processes; I spoke a lot about that the last time I was here at ELC. Most importantly, I'm also not assuming much statistical knowledge. That is deliberate. If any statisticians should be in the audience, I'm pretty sure you will have lots of reasons to complain, because this is of course not a statistics lecture; it's not supposed to be a statistics lecture. So the statisticians (oh my God, that's such a complicated word) should take what I say with a grain of salt, and the rest of you should take it as simplified recipes.

I already mentioned "how to not lie with statistics". This whole "lying with statistics" phrase, or "don't trust any statistics that you haven't made up yourself", belongs to the most stupid sayings in science, because statistics is actually what drives our world in an increasing fashion: think of machine learning, think of all the mathematical optimization techniques we apply, think of all the stochastic algorithms we apply. But still, I wanted to get people into my session, so I chose this catchy title. I actually should not have called it "how to not lie with statistics", especially because I don't want to insult anyone who has used statistical methods in a way that I'm not recommending. This session is more, to the point, about how to avoid accidentally over-interpreting measured data, presenting data in not-so-useful ways, and/or drawing unsupported conclusions from statistical analysis. That would have been scientifically precise, but I guess it would have made the program committee a bit angry, because they would have needed an extra-wide website just to accommodate the title.

Okay, but let me get right into the talk. Where do we need statistical methods in embedded Linux quality assurance? These days there are actually quite a lot of topics where statistical methods come into play. We want to ascertain several functional properties of our systems, like speed and throughput: we want to measure how fast our systems compute, we want to measure how much data we can process per time unit.
In real-time systems we are concerned with response latencies and things like that, but we also need to deal with things like build consistency: if you think of long-term supported systems, then we should make sure that a system that is built now can be reproduced in exactly the same way in 10 years' time, which requires some statistical methods. We are of course also interested in non-functional properties like stability, availability, scalability, correctness, you name it: all these "-ilities" that typically cannot be measured directly, but can only be determined via some indirect measurement, by determining other properties and then statistically inferring how far we've come in these qualities or "-ilities".

Testing is of course one field where statistics plays an increasing role. We collect lots of data with continuous integration tests and with load testing, and we use statistics to, for instance, check whether our process efficiency is good: whether we are not checking in too many bugs, whether we are actually reducing the number of bugs compared to the number of new bugs that we bring into the system. That's a statistical problem. Statistics can also be used to detect gradual changes in the process, or to detect whether changes to the process have actually had good or bad outcomes. The same thing goes for reviewing patches. We don't necessarily use statistics there; we should use statistics more often. We could use statistics to determine whether the review is actually sufficient: whether it prevents bugs from going into the system, whether the relevant areas are covered, and so on.

Certification of systems is another area where statistics is becoming more and more important. In the good olden days, which may have come before my own career in computing, people used to build systems that were completely deterministic. You wrote code, you proved that the code was complete and working as expected, and that was more or less it, disregarding some formal details of certification. These days, if you think of Linux systems that contain millions and millions of lines of code, we of course cannot go with these approaches anymore, but need to statistically ascertain that we satisfy the properties required for safety and other certification criteria. And even if we could still apply the aforementioned formal methods to ensuring system quality, and there are people who still do that, who use formal verification methods to, for instance, do a schedulability analysis on real-time systems and so on, these people these days also implicitly rely on statistical techniques. Most techniques for schedulability analysis require worst-case execution times, either calculated or measured, and calculations are, considering modern processor complexity, not really feasible these days. So effectively, at the end of the day, when you're doing any schedulability analysis or any proofs on real-time systems, you're also relying on statistical results that gave you estimates for parameters like worst-case execution times.

Good, so that is enough reason to think about statistics in embedded Linux development. Why is deploying statistical techniques hard in our domain? For the obvious reasons: we're dealing with very rapidly changing systems, so statements that we make need to be computed over and over again to stay accurate.
We're dealing with large volumes of code, which doesn't necessarily make it easier to analyze systems. We have non-overlapping communities that talk about things in very different ways, and that makes an analysis of the processes and of related things quite difficult, because data that you get from one community means something very different from data that you get from another community, and so on.

Good, but moving on to the main point beyond the motivation: dealing with data. How do we deal with data that can be statistically analyzed and that comes up in our daily work? I actually have to admit that when I was planning this talk, I was planning for a much wider scope. I didn't even prepare all the slides about everything that I wanted to say; then I gave this talk to myself one time, and after about one and a half hours I stopped talking to myself because I wasn't even halfway through, so I had to cut quite a lot of what I was initially planning. Maybe some things will seem quite obvious to the statisticians and to those who have already done statistics, but on the other hand, the examples of doing things incorrectly that I've chosen are all examples that I've met in my industrial job, so there seems to be some need to go over the simple stuff.

Before we can do statistics, we of course need to measure data in some way or another, and that is already one of the very elementary problems. It's a problem that seems very simple: you record some data that you get from the system, then you store the data, and later on you process the data. But actually there are three things you need to consider that often go completely unnoticed.

The first thing is reproducibility. Of course, when someone is doing a measurement, he or she knows exactly what he or she measured, but that may change completely in a week, or in a month, and it is very often the case that you get data from someone who doesn't really remember anymore how the data were acquired in the first place, and that is obviously a bad start to learning insights from these data. So reproducibility: can others reproduce and/or interpret the results you've been measuring? That's one of the very fundamental things we need to take care of when doing statistics, and I'll show some very simple yet effective recipes for achieving it on the next slide.

A second question that's relevant for statistical data is duration: when can certainty about what I want to infer from the data be achieved? That's particularly important when you test, for example, for real-time properties: for how long do you need to inspect a system before you can draw the conclusion that it will really never, or with high certainty never, produce any bad results in production runs? Is that five minutes? Is that ten minutes? Is that more like ten days? And so on.

And we need to ensure traceability: that we do not only tell others how to interpret the results, but that we also tell others what exactly we measured in the system, where we put, say, trace points into the kernel (which are often non-standard ones), where we hooked into the system, what resolution, what timer resolution we used when we recorded our data, and so on. And actually the answer to at least two of these questions, traceability and reproducibility, is extremely simple.
A very good way to achieve these two properties is to record data in a form that is called tidy data. When I talk about the three properties you need to ensure so that your data are tidy, they look so obvious that people usually say: how would I ever do it any differently? However, when you tell someone to record data, when you tell students to record data, when you tell engineers to record data, it never comes out in this tidy form, so these three rules are really one of the gems of statistics. It has also taken an astonishingly long while until these three rules were formally adopted by the statistics community; in the GNU R world, R being a statistics language that I'm going to introduce later on, the first papers that introduced this tidy data form really only appeared something like 15 years ago. Of course you may have seen this form before, either in your work or in lectures: it is essentially one of Codd's normal forms for databases, the third normal form. In that world it has been known for a while; in statistics and in statistical software engineering it has not completely caught on yet.

So how do you produce reproducible data, data in tidy form? Three simple rules. Whatever variables you measure, you put your measurement results into some matrix, and then: rule number one, each variable forms a column, one single column; rule number two, each observation forms one row; and rule number three, each type of observational unit forms a table, or, in less database lingo, goes into one file.

I said it's not so obvious that data should be recorded in that form. What you usually find is data in messy form; by definition, everything that's not in tidy form is in messy form. A messy form could be something like this: you're doing latency measurements on different systems, on ARM systems, on x86 systems, and you're measuring the latencies under different scenarios, under a low-load scenario, given some network load, and so on. A typical way engineers organize such data is to put all the measurement results into one wide table and consecutively write down the numbers for the measurements. That's not the tidy form outlined before. It's quite simple to bring it into tidy form; when you do that, the table gets less wide but much, much taller. As I said, one measured variable forms one column, and each observation that is taken forms a row. So the main observation we take here is the value of the latency that we record, and there's only one single value in every row, as compared to the messy format where we had multiple values, multiple observations, in a single row.

That may seem like a total detail change when you see it for the first time, but the secret intention of this talk is of course to convince you that languages like GNU R are very good for visualizing and analyzing statistical data, and we'll see some examples later on of how just bringing data into this tidy form essentially reduces plotting operations to one-liners. Those of you who have worked with data in other forms, who have used extensive sed scripts, awk scripts, gnuplot scripts, whatnot, and know how much effort it often takes to massage data into a presentable form, will appreciate that; the others will just see that if you have tidy data, you only need one-liners, provided you have a proper statistical environment.
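To make this concrete, here is a minimal sketch in R of how such a messy, wide table could be brought into tidy form. The data frame and column names (scenario, arm, x86, system, latency_us) are made up for illustration and are not taken from the measurements shown in the talk:

    library(tidyr)

    # A hypothetical "messy" table: one row per scenario, one column per system
    messy <- data.frame(
      scenario = c("idle", "network_load"),
      arm      = c(48, 95),   # made-up latencies in microseconds
      x86      = c(23, 61)
    )

    # Tidy form: one observation (scenario, system, latency) per row
    tidy <- pivot_longer(messy, cols = c(arm, x86),
                         names_to = "system", values_to = "latency_us")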
And a proper statistical environment is, of course, the R language and the R ecosystem, in my opinion. It's open source, it's actually one of the oldest open source code bases, and it's astonishingly little used in the Linux community, but it's the de facto standard, so to speak, in the statistics world. All the examples that I will be showing are based on the R language. I promised a hands-on approach; I'm not sure yet whether I will really be so mad as to do a live example, if time permits, because live examples go wrong, as all statisticians know, with 99.9% probability. We'll see if there is some time left, but I have all the R commands that you need to reproduce the plots I'm showing.

The plotting mechanism that I'm using is also based on something that's very common in the statistical world but, again, astonishingly little used in the embedded Linux community, and that's the grammar of graphics. It's a language for describing not how to plot stuff, like you do in gnuplot or other such programs where you say "I want this color for that, I want this chart type for that", but for describing how the data should be mapped onto the plot; the language then takes care of the rest, which also accounts for why the plotting examples I'll be showing will be so short and to the point.

Good. Regardless of the technical details of how we do statistical analysis, there are basically three ways to understand data. First, descriptive analysis: working with numerical summaries, showing things like mean values or standard deviations that you're all familiar with from school. The second level is exploratory (or explorative) analysis, where you use visualization techniques to get a better understanding of the data than is possible with simple numerical summaries. And the third stage is confirmatory analysis, where you use formal statistical testing to ascertain specific properties. I'm going to focus on point number two. Of course, if you come from the old world, a world where formal proofs are required, where you do things like schedulability analysis and so on, then you want this confirmatory analysis as a final step; but as I will be arguing, in most cases it really doesn't add much value beyond what you see from the exploratory analysis, while the exploratory analysis gives you quite a head start compared to the simple descriptive analysis that we often see.

Since time is flying, I'm going to go over this slide very quickly. When we talk about data, it's also clear that there are different types of data; just to make sure we're speaking about the same things, we can basically distinguish between two types: categorical data and quantitative data. You all know intuitively what this means. Categorical data is something like binary values, say dead or alive, or system is up, system is down, system is broken, system is operational. We can refine that a little further, for instance by extending it to colors: you have red, blue, green, and so on. These are of course different colors, but it doesn't make sense to assign numerical values to these colors; you cannot say blue must be one and red must be three, and likewise you cannot say blue is bigger than red or red is better than green. But there's a third type of categorical value, namely ordinal values, which cannot really be associated with numbers either, like military ranks: you cannot say a general is a 27 and a private is a 3 (or maybe I'm pretty sure armies can), but you can still order them: a general is higher up in the hierarchy than a private, and a private is maybe higher up in the hierarchy than, I don't know what, I'm pretty sure there's a lower rank.
These types of values have already appeared in the table shown before. The type of system, for instance, is a categorical value that we cannot order; we cannot say a Tegra is better than a Raspberry Pi (perhaps we can in some sense or another, but not in a statistical sense). And we also have numbers that are comparable, like the latencies, the values we are actually measuring; those are quantitative values. Formally, we also differentiate between discrete values and continuous values, but that's not so important for the rest of the talk.

Good. The data set that I'm going to play around with a little bit is a very simple one, and it's what you typically get from a latency analysis, for example when you're running cyclictest on a PREEMPT_RT system. I've done that on a system with multiple CPUs; I guess it had four CPUs, numbered from zero to three. I'm recording an identifier with every measurement that I take, and I'm of course recording the measured value, which is the latency that I'm observing in the real-time system. A very simple data set, yet you will see that it already contains quite a lot of information that we can get out with some proper exploratory visual analysis.

The typical way to plot such distributions is of course a histogram, and, as I may have mentioned, R and the grammar of graphics are a very efficient plotting mechanism: you can produce it with one very simple line of R. You specify that you want a grammar-of-graphics plot using data from a specific variable, and you say that the interesting measure you want to look at is contained in "latency", which is a column in the data. That doesn't say anything about how we want the data to be plotted; it just says what subset, or what aspect, of the data we'd like to plot. You specify how it looks with so-called geoms, and here I specify that I want a histogram, and, miracles of modern technology, I do indeed get a histogram of the kind you've all seen before.

Now, that already seems to be a point where 50% of programmers are happy, but of course, coming back to the statisticians: statisticians know that this method requires a parameter to be chosen, namely how wide the individual bars should be. In this example the system cannot know what we want to convey or what the measurement resolution is, for instance, and it has chosen a fairly large bin size. That's one of the occasions when you need to manually tell the computer something about the data so that it can visualize it, and here I've given an explicit parameter to the plotting system that specifies the bin width. I've measured with a resolution of one microsecond, and you see that if I make the bins only one microsecond wide, I already get more detail in my graph.

Of course, I've not measured on a single-core system; that's boring, and single-core systems are hardly available anymore these days. I've measured on a multi-core system with four CPUs, and I would in some way or another like to see how the different CPUs in the system compare to each other. That's very simple in the grammar of graphics: I tell it that I want to see another aspect of the data, and that aspect is another column in the tidy data, if you recall.
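A hedged sketch of what those histogram one-liners could look like, assuming a tidy data frame called lat with columns cpu and latency_us (hypothetical names, not the talk's actual code):

    library(ggplot2)

    # Histogram with the default, automatically chosen bin width
    ggplot(lat, aes(x = latency_us)) + geom_histogram()

    # The same histogram with an explicit bin width matching the
    # one-microsecond measurement resolution
    ggplot(lat, aes(x = latency_us)) + geom_histogram(binwidth = 1)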
Of course, we've already spent our two-dimensional diagram on showing the histogram, so we need to add another dimension to the graph, which can for instance be color, as I've specified here. I've said: do not just take the latency aspect from the data, also consider the CPU column, and adapt the fill of the histogram to the CPU. It chooses four ugly colors, which then gives some extra information about how the CPUs were used. That's still not a very good arrangement: we can see the shape of the distribution for CPU 0, the red CPU, but the other CPUs are just stacked on top of the measurement for CPU 0, which merely reproduces the system-global latency measurement, so to speak. So this is just a colored version of the previous plot.

What we would actually like to have is a visualization per CPU, and we can do that in several ways in the grammar of graphics. I could, for instance, specify that I do not want to stack the measured results but want to dodge them next to each other, so we get something like this, and then you can better see the individual recordings for the CPUs. A slightly preferable way, in my opinion, to deal with this second aspect of the data is to facet the data into different subplots. Anyone who has done that with a mechanism like gnuplot knows that this can be very painful; with the grammar of graphics it's really easy. I tell the system that I want to facet my data by the measured variable CPU, and then I get the same plot that I specified in the geom, repeated for each distinct value of CPU present in the data, and so we get four nice plots that immediately give more clarity about what we've measured.

Good. Those of you who do real-time work may have noticed something strange in this graph: that we don't seem to have any values here, that we have a very sharp initial peak, and that we also don't seem to have any values over there. That's a problem that often occurs in plots you see from such analyses, and it's because people tend to measure for a very long time. I didn't even measure for too long here; it's just something like 60,000 data points in one bin, which is not too much, but it's already way too much for the visual presentation, and it makes some details that are in the data invisible. That can be fixed, as you all know, by transforming this axis. One typical transformation is the log transformation you're all familiar with from school, but that's not the cleverest transformation for data containing zeros, because, as you know, taking the logarithm of zero is somewhat challenging. A transformation that works well in this case is the square-root transformation: it has no problems with zero occurrences, but still gives more attention to the small values in the data. And that in fact uncovers some features that were simply not visible in the presentation before: we have some very small latencies before the big spike hits, and we actually have quite a tail, which doesn't really end at 75 microseconds, I've just cut it off here. That is of course important if you do not just care about the maximum latency but really care about temporal precision; then you need to somehow deal with these early results that don't appear very frequently but do appear, and they should be visible in the way you present the data.

So I had to make one manual choice when I plotted the data, and that was choosing the bin width that the system uses to present the data.
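Sketches of the per-CPU variants just described, under the same assumptions about the data frame and column names:

    library(ggplot2)

    # Color the bars by CPU; by default the per-CPU counts are stacked
    ggplot(lat, aes(x = latency_us, fill = factor(cpu))) +
      geom_histogram(binwidth = 1)

    # Dodge the per-CPU bars next to each other instead of stacking them
    ggplot(lat, aes(x = latency_us, fill = factor(cpu))) +
      geom_histogram(binwidth = 1, position = "dodge")

    # One sub-plot per CPU, with a square-root count axis so that rare
    # latencies stay visible next to the large peaks
    ggplot(lat, aes(x = latency_us)) +
      geom_histogram(binwidth = 1) +
      facet_wrap(~ cpu) +
      scale_y_sqrt()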
Making a manual choice is of course bad, so let me show you another method that can be used to visualize such latency data and that doesn't need any presentational choice at all, a so-called non-parametric method: the cumulative distribution function, shown here. This graph contains the very same information as the previous one, in a slightly different form. We write down the latencies on the x axis, and what's contained on the y axis, a fraction from 0 to 1, is how many of the data points fall into the range up to 50 microseconds, up to 100 microseconds, up to 150 microseconds, and so on. So you see the data span latencies from 0 to about 200 microseconds: at 200 microseconds we have covered 100% of all the observed values, and at smaller latencies we have covered smaller fractions of the data. But we didn't need to choose any bin width or any other manual parameter; this plot can always be produced without human intervention. It of course requires some familiarity to be interpreted directly, but when you look at it, it's not so hard: the first increase here is around 55 or so microseconds, which corresponds to this spike; then we have another sharp increase around 100 microseconds, which corresponds to this second spike; and then it basically flattens out and we have covered the majority of all values.

Okay, I guess I still have too many slides even after reducing the number of slides, so let me go over this quickly. I said in the beginning that numerical summary statistics are quite often a fairly inaccurate way of characterizing data, yet you still observe quite often that data like scheduling latencies are described by two simple numerical summaries, namely the mean value and the standard deviation. That maybe comes from the fact that we are all told in school that everything in statistics is a Gaussian process in some way or another, and, as you all know, you can describe a Gaussian distribution by exactly these two summary values: mean and standard deviation suffice to reconstruct the whole Gaussian. Unfortunately, this whole statement about everything in nature being Gaussian may be true for nature in the sense of physics, or for nature in the sense of biology; it is unfortunately not true for nature in the sense of computer science. Here you very rarely get to work with Gaussian distributions, which is immediately obvious from a visual inspection of the latency data. I have actually constructed this Gaussian function here; let me switch to the square-root axis scale so that we see the details. I have taken the mean value and the standard deviation from the measured values and then computed this Gaussian function, and it is really obvious from these two graphs that these are two very, very different distributions: the Gaussian is much wider and less tall. This graph should make it clear to you that just using these two summary values is really not sufficient to describe data with some structure.
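Such an empirical cumulative distribution plot is itself a one-liner in the grammar of graphics; a minimal sketch, under the same assumed data frame and column names:

    library(ggplot2)
    ggplot(lat, aes(x = latency_us)) + stat_ecdf()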
As an example of how to use such statistics, I'm showing the corresponding grammar-of-graphics commands to produce the cumulative distribution function, or rather the empirical cumulative distribution function, because it's based on measured values. Again, it's a very short statement: I'm telling the system that I want to use some data where I'm interested in analyzing the latency. Now I have two different statistical distributions, the Gaussian and the properly measured distribution, which I distinguish by a type column; I'm using a color to tell the Gaussian and the measured distribution apart, and I say that I want a cumulative distribution function. That simple command line gives me this graph, and from this graph it is again pretty obvious that the distributions are not identical.

Good. Playing around with this simple data set has already consumed quite a lot of the time allocated to this talk, but I really wanted to go over it in great detail, because even just plotting the data is all too often done in a way that doesn't really do justice to the data, despite the fact that, as I have shown you with the R commands, you could do it with very little effort in a very precise and apt manner.

What I've basically done on this last slide is already start the next topic I would like to discuss, which is comparing data sets. This is the second standard problem that occurs in the statistical analysis of systems. It occurs, for instance, when you want to track behavioral changes of a system after doing an update: say you're updating from kernel 4.0-something to 4.0-something plus X and want to observe whether there are any variations. Then you take a measurement set from the old kernel and a measurement set from the new kernel, and you somehow need to decide from these measurements, going beyond the mean value and the standard deviation, whether the system has gotten better, whether it has gotten worse, or whether it hasn't changed; and you may need to prove that to your customer. You may also want to use this to evaluate alternative choices, for instance which of several libraries performs better, and there are lots of other use cases.

As we've seen, one way to do that is by visual inspection, by explorative analysis; if you do that properly, it is very apt for the purpose. Comparing summaries, as I have now outlined for the 27th time, I guess, is not really a sufficient way to do it, so just comparing mean value and standard deviation: don't do that. Visual exploration is as simple as computing summaries but much more effective. There are also some formal methods and tests that you can employ. Can they make sense? Partly, yes, but in most cases they don't give you any extra information beyond what you get from the visual inspection, besides the joy of being able to say "I've used some formal method like a t-test or a rank-sum test" or anything else that sounds fancy and may impress your customers. With the cumulative distribution function I've already given you the appropriate plot type and command line that you may want to use to compare such distributions, so I'm not going to go into any further detail with the three examples that I had prepared for this purpose. That also means I'm going to skip the live demo, because I need the minutes I have left for some other stuff; at least that makes sure the live demo doesn't fail: you just don't do it, and it doesn't fail.

Let me end this part of my considerations with what can go wrong when you're doing this elementary analysis of statistical data, like scheduling data, performance data, and so on. Again, all the points I'm making here really sound extremely obvious, but if you look at how things are done in the real world, then these extremely simple points, or these extremely easy-to-make mistakes, are being made very often.
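A sketch of what such a comparison could look like in R. The combined data frame comp with a type column, and the way the Gaussian reference sample is generated from the measured mean and standard deviation, are assumptions for illustration, not the talk's actual commands:

    library(ggplot2)

    mu   <- mean(lat$latency_us)
    sdev <- sd(lat$latency_us)

    # Stack the measured values and a Gaussian sample with the same mean and
    # standard deviation into one tidy frame, distinguished by a 'type' column
    comp <- rbind(
      data.frame(type = "measured", latency_us = lat$latency_us),
      data.frame(type = "gaussian",
                 latency_us = rnorm(nrow(lat), mean = mu, sd = sdev))
    )

    # One empirical CDF per type, distinguished by color
    ggplot(comp, aes(x = latency_us, colour = type)) + stat_ecdf()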
From the list, I'm going to pick three. First, people use inappropriate summary statistics; I may have already mentioned that. Second, people use wrong or inadequate bin sizes; that's the second most frequently occurring thing you see. So if you're using a parametric method, make sure that you choose your parameters correctly, or use a non-parametric method like the CDF. And, third, what also happens quite often is that people don't specify the sample sizes. Without the sample size (I didn't go into the formal examples that would make use of this information) you basically don't know whether specific statistical tests will work. I've said that data in computer science are usually not Gaussian distributed, and many statistical tests rely on the data being Gaussian distributed, except if you have lots of data. Statisticians, please close your eyes: if you have millions and millions of data points, then it doesn't really matter whether the data are Gaussian distributed or not; most statistical tests will work fairly well regardless of the non-normality of the data. But to know that, you of course need to know the sample size, so the sample size needs to be reported.

Good, coming to the final part of the talk, with only 15 or so slides left for the next 10 minutes. One thing that people are trying more and more often these days is not just to describe data and learn about system behavior from data, but to make predictions from measured data: how a system will behave in the future, how a system will behave in corner cases that have not been explored yet, or using predictions to satisfy some certification authority that the system they have built really satisfies the criteria it demands. Making predictions from statistics is essentially a very simple process, just two easy steps: you find a mathematical model that describes the data, and then you just extend the model outside the currently measured range. That's all you need; there you go, you can predict the future. Of course there are some detail problems, like: is there such a mathematical model that describes the system? If I have found a model, does it really, in fact, describe my system, or does it just describe what I would like the system to be? And so on. When you look at how modeling and prediction techniques are currently used in our field, you really need to be aware of the first rule of predictions: if things sound too good to be true, they probably are not true. There are lots of ideas these days claiming that you can measure this or that aspect of a development process and then prove that the system is completely apt for safety-critical deployments and so on, or that you can measure for five minutes and then know that the system will satisfy all latency requirements for the next ten years or so. Sounds very good, but is likely not true.

How do you go about it when you model, when you try to find a mathematical model for your data? That's of course a problem that has been considered for a couple of decades, and I suppose that many of you will have heard about linear regression. Right, so most of you have; okay, great, so I can be quick about that. I'm not showing any real data set here, I'm just showing some made-up data set that connects kilo-hydro cells, whatever those are, to the number of unicorns, and we want to find a functional relationship between the number of kilo-hydro cells I have and the number of unicorns I get.
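A minimal sketch of such a scatterplot, with made-up variable names matching the made-up example (a hypothetical data frame d with columns kilo_hydro_cells and unicorns):

    library(ggplot2)
    ggplot(d, aes(x = kilo_hydro_cells, y = unicorns)) + geom_point()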
Again, from a visual inspection of the data (I've done a scatterplot of the available measured data) it seems quite clear that there is such a relationship: the more hydro cells, the more unicorns, as things are in life. The question is just how we can mathematically ascertain this relationship and how we can find the best possible slope for the linear relationship. How you do that is, of course, by formulating a model. I'm guessing that there is a linear relationship: the number of unicorns y is related in a linear fashion to the number of kilo-hydro cells, so we have some intercept term that shifts the line up and down, and we have a slope, how steep the line is. The statistical task at hand is then to estimate the coefficients beta 0 and beta 1, intercept and slope, and to find out how my errors are distributed. The beauty of this is that it's a very simple model that describes quite a lot of the processes we see very accurately. The bad thing about it is that you can apply linear regression to everything, you will always get a result, and it will always look halfway decent, but in most cases it will describe the data very inaccurately, and in particular you won't be able to make any predictions from such mis-specified models.

I just realized I haven't shown the GNU R commands necessary to produce such a model; it's another one-liner, and I will correct this deficiency in the published version of the slides. For now, let me just show you the result: the data set and the line that describes the functional relationship. And, again without going into any details, let me give the warning that there are very specific mathematical tests, applied after you've built the model, that you need to use to ascertain that the model actually does fit your data. Nobody ever does that, of course, but it's actually a very simple thing to do. You need to test four assumptions that you are implicitly making when you come up with such a model. The first is that your errors are normally distributed, so here we come back to the standard Gaussian process: we have this line, and of course the measured points are not all exactly on the line, they scatter around it, and the way they scatter around it needs to be Gaussian; you can ascertain that with this kind of diagnostic result. Second, the errors need to be uncorrelated; that's a more mathematical thing, but it can also be quantitatively ascertained. Third, the variance of the errors needs to be constant. And for the fourth thing, the design matrix has to have full rank; I really haven't found any image that could explain what this means, so I'm just stating it without qualifying it further, but rest assured that you would need to test it if you want to make sure you have a good result. For this model, all these conditions are actually fulfilled, and unicorns can be quite well predicted from kilo-hydro cells.

When we look at data that actually occur in computer science, and that people are actually using to make predictions, we see that life is not this kind. So, as a final example of how you should not do things, let me show you this data set of bug-fix commits, which has really been used in some safety certification efforts to prove to authorities that the development process of certain pieces of software is good enough to trust your life to it.
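Since the fitting command itself was not shown, here is a hedged sketch of what it could look like in R, again with the made-up variable names; this is an illustration, not the slide deck's actual code:

    # Fit the linear model unicorns = beta0 + beta1 * kilo_hydro_cells + error
    fit <- lm(unicorns ~ kilo_hydro_cells, data = d)
    summary(fit)    # estimated intercept, slope and their uncertainty

    # Scatterplot with the fitted regression line overlaid
    library(ggplot2)
    ggplot(d, aes(x = kilo_hydro_cells, y = unicorns)) +
      geom_point() +
      geom_smooth(method = "lm")

    # Standard diagnostic plots: residuals vs. fitted values, normal Q-Q plot
    # of the residuals, scale-location and leverage; these relate to the
    # assumptions listed above
    plot(fit)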
What people have measured here is the number of bug-fix commits, because bug-fix commits, as we all know, fix bugs, so they don't introduce any more problems; they never have. And they've recorded the time, the point at which these bug-fix commits were introduced into the system. What you can see from the data, of course, is that the number of bug-fix commits went down over time, which is a natural thing, because either the bugs run out or people start losing interest in fixing bugs because there are too many of them. The natural thing to do with this kind of data is of course to find a model that describes the relationship between the time after a piece of software has been released and the number of bug-fix commits that go into the software. You do that not with linear regression but with a more generalized form of it; I'm not going to get into the details, but what you end up with at the end of the day is a graph like this: a functional relationship between time and bug-fix commits. And again, that looks as nice as the relationship between kilo-hydro cells and unicorns, and so the natural thing to do is to make predictions with it, because if we know how bug fixes happened in the first 60 or whatever time units of the product, we can just extend that and predict the future, which is easy to do with statistical software. And then you can say: okay, at this point we will essentially have reached a situation where we don't get any more bug-fix commits into the system, and that means the system is error-free, and so we can use it in safety-critical environments. Now, who of you is going to buy this kind of argumentation from me? No one; I'm really glad you don't. The problem is that certification authorities tend to buy this kind of argumentation these days.

Of course, the problem with this approach is clear. People did the regression, computed a model, and predicted the future from the model, but the one detail they overlooked is that this model doesn't satisfy any of the requirements that I mentioned before, any of the four conditions that need to be fulfilled for a model to be correct, or at least to appropriately describe the data. So I'm listing the issues this model had. The errors are all completely wrong: we don't have a normal distribution of the errors. That alone wouldn't be so bad, but the accuracy of the base data is also not a given. And the functional relationship between time and the number of bug-fix commits is, if you look at the statistical model in detail and actually read the diagnostics you get from the statistical software, not supported: the software tells you that the variables you are trying to use to predict the number of bug-fix commits are not sufficient to give an accurate model. But of course, if you don't look at the diagnostics, this problem won't bother you. So those are the three things that are wrong with this model. If people had followed the process that I outlined very briefly in this talk, find an appropriate technique, fit and plot the model, and most importantly check the diagnostic data, they would have realized that this model is not really appropriate for the task they were about to take on.

Maybe let that be the major thing you take home from this talk: if you apply statistics, then please make sure that you've done your homework, that you have not just computed models but have checked that your model is really accurate.
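The talk does not say which generalized model was used; purely as one plausible illustration, here is a sketch of a Poisson generalized linear model fitted to made-up weekly bug-fix counts, including the diagnostic and prediction steps discussed above. Everything here (the variable names, the Poisson family, the toy data) is an assumption, not the certification data from the example:

    # Made-up toy data: bug-fix commits per week after release
    weeks   <- 1:60
    commits <- rpois(60, lambda = pmax(30 - 0.4 * weeks, 1))

    # A generalized linear model for count data
    fit <- glm(commits ~ weeks, family = poisson)

    summary(fit)   # coefficients and deviance
    plot(fit)      # residual diagnostics: the checks that were skipped above

    # Extrapolating beyond the measured range; exactly the step that is
    # meaningless if the diagnostics above do not hold
    predict(fit, newdata = data.frame(weeks = 61:120), type = "response")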
Then I'm already quite happy, because that means the statistical chances have improved that I'm not going to die because of mis-predictions from software quality data. Good, thank you very much for your interest in this final session. I know that I had way too many slides and still didn't even get to the final, deeper points of statistics, but again, thank you very much for suffering through the simple examples and the many slides.