Yeah, well then, welcome to my talk, which I was intending to give together with Daniel Wagner, who is with SUSE Software Solutions these days and unfortunately couldn't make it here — so all the talking is mine now, and of course all the mistakes I make are mine too. What I'm going to discuss is two things: firstly, a new tool, the jitterdebugger — does anyone see my laser pointer? It doesn't really work — which is a drop-in replacement for the venerable cyclictest, and secondly some of the statistical considerations and statistical analysis you can do with the new tool, especially when it comes to analyzing safety-critical systems.

About me: I have two roles. I work for Siemens Corporate Technology, a company you may have heard of; we basically use Linux in a wide range of products, from healthcare to infrastructure. Daniel Wagner was also at Siemens when he did the jitterdebugger work; these days he has switched to SUSE in Germany.

I'll be discussing three things. The first is some general remarks about measuring real-time systems. The second part is about jitterdebugger: how it differs from cyclictest, what you can do with it, and what our intention was when we developed the new tool. And the third part is going to be some analysis examples — things you can do now that, we think, you couldn't do before.

So why do we actually measure real-time systems? Basically everyone in the room, I guess, knows the reasons — is anyone not actively involved with real-time systems here? OK, so the majority is. The reason why we need measurement-based approaches for Linux-based real-time systems is this: traditional real-time systems run on very simple processors and very simple systems, where you can do a cycle-accurate analysis — you can count how many instructions are needed to execute a given sequence produced by the compiler and then make quite precise estimates of how long a certain operation will need to execute. That is of course impossible with modern CPUs that do out-of-order execution, that have caches, translation lookaside buffers and so on, on the one hand, and that run complex operating systems like Linux on the other hand, where so many things can happen simultaneously and interact in so many ways that it's basically hopeless to come up with precise analytical estimates of how long a given task will take. Nonetheless, Linux is being deployed in more and more real-time-critical scenarios and — what's more important for us — in safety-critical scenarios, where the functionality of the system really matters much more than on the average mobile phone or the average desktop computer, where things can really start to hurt when they go wrong, and where you would like to have some extra assurance that stuff works as expected. And that's something we can only do by properly measuring the systems and by statistically, stochastically, making sure that, at least with a given level of confidence, systems work as we expect them to work.

Additionally, we need to distinguish two different modes of analyzing real-time systems by measurement. One is debugging and development: of course, when you build your systems, when you integrate devices, when you deploy your device drivers on your appliances, everyone knows that a lot of debugging and a lot of improvement is required before you can ship products to the field.
You want to eliminate the worst possible bugs: making sure that locking is correct and that you don't get any excessive latencies from locking, detecting abuses of system functionality — for instance, engineers like to open up TCP connections in real-time context, with the obvious results — eliminating all those kinds of obvious bugs and mistakes that can go into your system and that cause the really heavy latencies; sometimes we're talking about seconds and even more. That's one use case for measuring systems, and it is not the one I'm interested in in this talk.

The scenario I'm interested in is that you have a system that has already seen lots of love, that works as it's expected to work when you ship it, and that contains no obvious bugs and nothing that is obviously totally off. Still, we want to make sure that the systems work as expected all the time and that there are also no minor variations in their behavior. So we want to characterize their behavior in unknown environments, in field deployments; we want to compare the behavior of systems after we've done minor updates — say we're updating from kernel 4.x.y to 4.x.y+1 and want to make sure that things still work as we intended them to work — by comparing the behavior of systems with reference distributions. And one use case that's gaining more and more traction is to satisfy criteria for certain safety certifications that require a specific understanding of the system. For instance, car manufacturers have recently become interested in deploying Linux in some aspects of car control, and that requires them to satisfy certain standards like ASIL, the Automotive Safety Integrity Levels, and so on — standards that can be fulfilled by properly measuring systems and by proving, through measurements, that systems work as expected. So that's the portion I'm going to focus on.

Actually measuring the latencies — measuring maximum latencies of systems — is quite simple, and we've been doing it for ages with the venerable cyclictest tool. Basically, it sets up a timer, knows when the timer is supposed to expire, measures when the timer actually expires — very roughly speaking — and then compares the expected expiration time with the actually observed expiration time. From the difference you can see how much overhead, how much uncertainty, the system introduces into a given workload.
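To make that principle concrete, here is a minimal sketch — purely illustrative, not how cyclictest or jitterdebugger are actually implemented (both are C tools), and the 1 ms period is an arbitrary example value — of the measurement loop in Python:

```python
import time

PERIOD_NS = 1_000_000  # hypothetical 1 ms measurement period

def measure(n_samples):
    """Sketch of the principle: sleep until an intended expiry time and
    record by how much the actual wakeup overshoots it."""
    latencies_ns = []
    next_expiry = time.monotonic_ns() + PERIOD_NS
    for _ in range(n_samples):
        delay = next_expiry - time.monotonic_ns()
        if delay > 0:
            time.sleep(delay / 1e9)  # wait for the expected expiration time
        # observed expiration time minus expected expiration time
        latencies_ns.append(time.monotonic_ns() - next_expiry)
        next_expiry += PERIOD_NS
    return latencies_ns

print("worst observed wakeup latency:", max(measure(1000)), "ns")
```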
So why do we want a new tool? Why did Daniel set out to write jitterdebugger as a replacement? Firstly, cyclictest has been around for quite a while, so the code base, although it has been continuously maintained, is not what people who are very much into data analysis, big data, machine learning and so on would expect. It's still written in C, which is not the friendliest language to non-kernel-hackers, as you know; it has some historical structure in it that makes certain changes hard; and it has grown into a tool that is, when you look at it in detail, actually quite hard to use — it has very many knobs that you can tune and switch, and where you can actually get things wrong. The intention with jitterdebugger was to provide a solution that is written in languages that are digestible to the young people coming from academia into industry, and that has very few tunable knobs, because the less you can tune, the less you can do wrong. Of course, that then implies that Daniel made all the right choices, but we can trust him on that one.

It's also designed with post-processing in mind. Cyclictest, the way it's usually used, is run, it does a measurement, and then it gives you a result. The jitterdebugger philosophy is: you run it, you record data, and without post-processing you cannot tell anything from the data. So we rely on post-processing, and for that post-processing we can use all the tools we like, especially the modern ones people have come up with in the big-data and machine-learning age. A third point is that jitterdebugger also includes facilities to control the stress, the load generation. That may look like a minor point, but it actually causes quite a lot of practical trouble when you don't run one single load profile during a measurement but want to switch load profiles — you want to know what's going on when you switch between load profiles, so you need to know when the load has changed, and so on. That's of course all doable with traditional cyclictest measurements, but jitterdebugger makes it very easy, basically at the literal push of a button. We're also focusing on mass deployment: not just on measuring one single system, but on measuring a lot of different systems with, say, the same kernel release — think of regression testing, think of parallel testing — and then collecting the results on one centralized host, in one central database, on one central storage, and doing an analysis of all the systems at the same time, comparing results and so on.

The basic structure is very simple; we follow the classical Unix philosophy of "do one thing and do that one thing well". The basic pipeline is: you run jitterdebugger; jitterdebugger produces an output file, a binary output file for efficiency; once that file is produced, you run the jittersamples tool, and jittersamples gives you the data in a format that's digestible for whatever statistical software you prefer — of course, if you prefer R, that's the most reasonable choice you can make, and that's an objective statement. The format we're writing out is also extremely simple: basically we record the CPU a trace was taken on, we record a timestamp value, and we record a latency value — that's all. Actually, things are a little more involved than "do one thing and do it well": in the useful scenarios jitterdebugger does two things, hopefully still well. Those two things are recording the samples plus, as I already mentioned, controlling the load generation, the stress the system is subjected to. The rest is identical: jitterdebugger writes out a binary file, you run that through the jittersamples tool, and that gives you output in a format for your statistical software.

We've spent quite a lot of thinking on coming up with good formats to permanently store the results we've obtained. That also seems like a very straightforward problem, but you will come to appreciate it once you've done a measurement that you want to re-analyze in half a year's or a year's time — because if you don't plan that well, you will never be able to recapitulate which exact versions you used, what the circumstances of the measurement were, what the parameter settings were, and so on.
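To give an idea of what that post-processing can look like: assuming the jittersamples output has been exported to a CSV-like text file with exactly those three columns — the file name, column names and the absence of a header line are my assumptions here, not the tool's fixed interface — a per-CPU summary is a few lines of pandas:

```python
import pandas as pd

# Assumed export: one row per sample, columns cpu/timestamp/latency, no header line.
df = pd.read_csv("samples.csv", names=["cpu", "timestamp", "latency"])

# Per-CPU summary: sample count, 99.99th percentile and observed maximum latency.
summary = df.groupby("cpu")["latency"].agg(
    samples="count",
    p9999=lambda s: s.quantile(0.9999),
    max="max",
)
print(summary)
```

The same handful of lines would work equally well in R; the point is simply that the raw samples, not pre-digested statistics, are what gets fed into the analysis.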
The last use case I already mentioned: jitterdebugger has been designed for network deployments. You deploy jitterdebugger on one or more targets, including the stress generation, but you do not record the data on the device — especially for embedded devices, think of Raspberry Pi class devices, you usually have neither a fast enough nor a large enough storage medium to store all the recorded data. So what we do is send the data over the network, with a very simple protocol, to some more capable host that then does all the archiving and the post-processing.

OK, but that's basically it. To summarize the most important points: what jitterdebugger is supposed to excel at is enabling a reproducible and systematic approach, which is important when it comes to certifications — certification authorities want results to be reproducible. Say you grant a car permission to enter the road, you grant it roadworthiness, road legality — I don't know what the right technical term is — so you're allowed to drive the car on the road; a certification authority may, in five years' time, become suspicious that you've maybe tampered with the engine and want to re-inspect the evaluation data again, and then you'd better be able to actually do that, which is only possible if you've followed a reproducible and systematic approach. (Pardon me? We're Germans, we're honest people, we'd never do such a thing... I think.) So that's one important aspect.

The other important aspect — which may seem like a very tiny detail but which is also very important in practice — is that we've completely decoupled the measurement from any statistical processing. Cyclictest does some statistical processing, it does some binning of the data, and already that binning makes quite a lot of analyses impossible: if you don't have time resolution, you cannot apply quite a large number of the statistical analyses that you actually want. With time resolution — by not just recording maximal latencies or pre-canned distributions of latencies — you can then apply a lot of techniques from machine learning, from artificial intelligence, however you want to call it, and from classical statistics. I'll be showing you one example that we've worked particularly hard on: it allows you to make worst-case execution time estimates — to perform worst-case execution time measurements — not just based on a simple observed-maximum-latency approach, but equipping the results with a credibility level (how credible is the value we've measured?) and, especially, with error ranges (how trustworthy is the result we've obtained?).

Before I come to specific analysis examples, let me make some remarks on how to store data reproducibly. I've already introduced the very simple format we're using, which is basically child's play, but it follows three principles that should govern all reproducible measurements. Again, this looks extremely straightforward, and it actually is extremely straightforward, but if you task some people with doing measurements, then with one hundred percent certainty they will not follow these simple rules and will come up with some other ideas that will make it extremely hard to re-analyze any measurements in a couple of months, or maybe even a week after they've forgotten what they actually did. These three principles come from Hadley Wickham — I don't know if you've heard of him; he is with a company built around reproducible data analysis in the R ecosystem. If you want your data to be reproducible, then store it in such a way that each variable you observe forms a column — so the CPU the measurement has been running on forms a column, the timestamp forms a column, the latency forms a column, and so on; each observation, each measurement value that you take, forms a row; and each type of observational unit forms a table — which can either be a table in a database, a file, or something that is called an entity in a format called HDF5.
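As a minimal illustration of what those three principles mean for the latency samples (the values below are placeholders, not real measurements):

```python
import pandas as pd

# Each variable forms a column, each observed sample forms a row, and the whole
# set of samples of one run forms one table (the observational unit), which can
# later be stored as a database table, a file, or an HDF5 entity.
samples = pd.DataFrame({
    "cpu":       [0, 1, 0],          # CPU the trace was taken on
    "timestamp": [100, 200, 300],    # when the sample was taken (placeholder values)
    "latency":   [12, 9, 15],        # observed latency (placeholder values)
})
```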
Has anyone already heard of HDF5 before? We did that on purpose, so that you learn about HDF5 — well, probably that may also be a reason. No, seriously: it's a format that has been around since 1987, and it comes from an American initiative, the NCSA, to build an all-encompassing, hierarchical data format — HDF, the Hierarchical Data Format. It has its origins in the big-data world, but it was designed for precisely the purpose of recording data reproducibly, because it does not just allow you to store datasets in the reproducible way I've described; you can also store additional things like the kernel configuration, microcode files, explanatory text — so you have one package, and that package contains everything. You file away the HDF5 file, and then you have everything needed to reproduce your analysis in months, years, decades. It embeds structure in a single file, and even if it's not well known in the Linux kernel community, it's very well known in the data analytics community: it's supported by R, by Octave, by Python, by Mathematica, by Julia — basically by everything you can think of that does numerical work. And if that doesn't convince you that HDF5 is a reasonable choice, worth installing a dependency or two on your system, then observe that it was supported natively in the Mosaic browser — the fathers of the early web found the format so important that they included browser support directly. I don't think it's supported in Firefox these days anymore, but well, the web has changed.

So how do we organize our measurements? Best practices that we use: identical file names for each measurement, which simplifies some of the post-processing — again, just a detail, but if you need to change file names every time you run an analysis, that makes things hard to automate. Active parameters, that is, parameters that we can and do set during an analysis — for instance the amount of load or the type of load that we produce — we encode in directories. And any derived parameters, like the kernel version we're using — parameters that we cannot set actively between different measurement runs — are stored in files. That's just a suggestion, but it is something that has turned out to work very well in practice; we've been out doing systematic analysis of latencies for a year or a year and a half, and that is what we've converged on.

Reproducibility, as I may have already stressed a number of times, is a very important thing, and it implies that you are not just supposed to save the parameters you've set, active and passive, but also any microcode binaries you've been using. For instance, if you're doing comparative measurements for different mitigations against Spectre, Meltdown and similar CPU attacks, it's extremely important to keep the microcode binaries that were used to produce a given sample, because they tend to disappear from vendors' websites, they tend to be moved around, filenames tend to change — and that's a recipe for not being able to reproduce any measurements. Include non-upstream patches as well. And, as I said, HDF5 makes it really easy to just attach files of these kinds to the measurements; then, whether you end up needing them or not, you have them around.
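To make that concrete, here is a hedged sketch — the file, group and dataset names are a hypothetical convention of this example, not something jitterdebugger or jittersamples prescribe — of how one measurement, together with its attachments, could be packaged into a single HDF5 file with h5py:

```python
import numpy as np
import h5py
from pathlib import Path

# Placeholder sample arrays; in practice these come from the jittersamples export.
cpu       = np.array([0, 1, 0], dtype=np.uint16)
timestamp = np.array([100, 200, 300], dtype=np.uint64)
latency   = np.array([12, 9, 15], dtype=np.uint64)

with h5py.File("measurement.h5", "w") as f:
    g = f.create_group("samples")              # one observational unit ...
    g.create_dataset("cpu", data=cpu)           # ... one dataset per variable
    g.create_dataset("timestamp", data=timestamp)
    g.create_dataset("latency", data=latency)

    f.attrs["kernel_version"] = "4.x.y"         # derived parameters as attributes

    # Opaque attachments: kernel config, microcode, notes, non-upstream patches.
    for name in ("config.gz", "microcode.bin", "notes.txt"):
        path = Path(name)
        if path.exists():
            blob = np.frombuffer(path.read_bytes(), dtype=np.uint8)
            f.create_dataset(f"attachments/{name}", data=blob)
```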
OK, but that's more the mechanics of what we're doing and of how we work; let me come to the analysis proper. I think I have until 3 o'clock, so I'm going to skip the first part here, which is basically about how to properly visualize measurement results — I'm giving some cookbook recipes for plotting in the slides, and you can look at them after the talk — because the thing that's more important comes after you've taken the measurements. You've all seen measurements of, say, this kind: this is a system with 10 CPUs, you record the typical latency diagrams, you see that you have so-and-so many hits at this latency, so-and-so many at that latency, and so on, and for each CPU you observe a maximum latency. The typical course of action would be to find the highest value of the latency and say: OK, that's the worst case that can happen in my system. That is an approach you can take, but it is not very reassuring, because the number you get doesn't include any information about how long you measured, how certain that number is, what the amount of uncertainty in the number is, and so on. And that's exactly the problem we ran into when we showed systems of that kind to certification authorities, because they ask precisely these questions: how trustworthy are these numbers? How can you make sure that you've really captured all the bad events, the bad data points, that could lead to larger latencies?

The problem with such measurement-based approaches is that the interesting events are all in the tails of the distribution — the interesting events, in the sense of high latencies, are very rare events — and that means determining them properly is a complicated statistical problem. As I said, the typical statement people make is: "the highest latency we've seen is x-y-z microseconds", and then you can trust them or not. What's better is to say: "after so-and-so many hours of measuring, the highest latency we've seen is so-and-so high" — because if you say "we've measured for 10 minutes and seen a maximum of 50 microseconds", that is of course less trustworthy than saying "we've measured for seven days and seen the same maximum of 50 microseconds". You can try to improve on that in various ways — this is the portion I can adapt to the audience — so you can say, "after seven hours of measuring we've seen 50 microseconds of latency while Steven Rostedt was standing in front of the system making sure that everything goes fine". That may increase the social credibility of the results, but, I'm sorry to say, it doesn't increase the credibility statistically. And that is actually how people try to argue: if you tell them, "I'm not sure your measurement is really correct", they say, "oh, but Thomas Gleixner did that measurement, so it must be correct". That, of course, is not what we'd like to have. What we would like to have is a statement like: at a given confidence level, the probability to exceed this or that threshold is so-and-so high — because you cannot make absolute statements anyway.
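One way to write down the kind of statement we are after — this formalization is my own paraphrase, not a formula from the slides — is to report a threshold $x_p$, together with an uncertainty interval for it, such that

$$
P\left(X > x_p\right) \le p \quad \text{at confidence level } 1 - \alpha,
$$

where $X$ is the latency, $p$ is the acceptable exceedance probability, and $1 - \alpha$ expresses how much the estimate of $x_p$ itself can be trusted given the finite amount of data.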
Now, how do we achieve that? We do it by switching from a two-stage approach — measure, then determine the maximum — to a three-stage approach: we measure, we collect data that characterizes the system; then we try to come up with a model that describes the system well, where the model should of course be based on solid statistical considerations; and then we infer estimates of the worst-case execution time, plus the associated uncertainties, from this model.

How do we come up with such a model? Ideally, we'd like the model to satisfy three criteria. One is generality: it should apply not just to one situation but to multiple relevant situations — we should be able to use it not just, say, in the automotive context, but also in healthcare and in other contexts. It should be precise: the results we obtain from the model should differ little from what's actually seen on real-world systems. And the model as such should be realistic: it should accurately represent real-world phenomena, it should describe the systems we are looking at well. Satisfying all three conditions simultaneously turns out to be impossible, for statistical reasons that I'm happy to discuss after the talk — that would lead very deeply into statistics. It turns out we need to drop one of these criteria, and the one that is actually best to drop is realism. We don't benefit much from the model being very close to the actual system; it suffices if the model has the same properties as the system. It does not need to be in a one-to-one correspondence with, say, how the system is composed of functional units and so on, as long as the results are the same. If you focus on generality and precision, then you can come up with proper models.

Now, what does the model we use look like? If you look at what a latency measurement tells you from a statistical point of view, a latency is a random variable: you don't know the value you're going to measure; you know that the value is within some range, you know that the value has some distribution, but you cannot tell on a shot-by-shot basis what the actual measurement will look like. Effectively you are sampling a number of observations x1, ..., xn from this random variable X. Now, the good thing is that "random variable" does not mean it has random or unspecified properties. You all know, for instance, the outcome of throwing a die: you get result one with probability one in six, result two with probability one in six, and so on — you don't know which value you'll get, but you know the probability distribution. And such a random variable has some general properties: for instance, if you sum up two dice, three dice, four dice, five dice and so on, you know that the sum will converge to some Gaussian distribution, regardless of the distribution of the original random variable.

The sum of latencies is of course not very interesting to us. What is very interesting to us is the maximum latency — from a given measurement, or from ten given measurements, or from a block of fifty measurements. And it turns out there is a very interesting statistical statement that connects the block maxima of random-variable measurements, and that is the so-called generalized extreme value (GEV) distribution. I'm not going to go through the formula in detail; the meaning behind it is this: if you take a block of, say, five minutes of latency measurements, then five more minutes, then five more minutes, and you compute the maximum of each of these blocks, then the maxima need to behave in a way that satisfies this generalized extreme value distribution. The distribution is governed by three parameters, and you can use maximum-likelihood estimation to find the three parameters that best match the measured block maxima. From the shape of the fitted distribution you can then infer not just the expected maximum value that you will encounter, but also a range of values that, at a given significance level, the maximum is supposed to fall into. And that's exactly what we wanted: not just the worst-case execution time estimate as such, but an estimate that also includes a quantification of the uncertainty of the measurement.
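The formula on the slide is not reproduced in the recording; what is meant is presumably the standard form of the generalized extreme value distribution,

$$
G(x) = \exp\!\left( - \left[\, 1 + \xi \, \frac{x - \mu}{\sigma} \right]^{-1/\xi} \right),
\qquad 1 + \xi \, \frac{x - \mu}{\sigma} > 0,
$$

with location $\mu$, scale $\sigma > 0$ and shape $\xi$; in the limit $\xi \to 0$ this becomes the Gumbel form $\exp\!\left(-e^{-(x-\mu)/\sigma}\right)$. As an illustration of the block-maxima idea just described — a minimal sketch, not the authors' actual analysis pipeline; the block size and quantile are arbitrary example values — the fit can be done with scipy:

```python
import numpy as np
from scipy.stats import genextreme

def wcet_estimate(latencies_ns, block_size=10_000, quantile=0.999):
    """Block-maxima GEV fit: estimate a latency value that the maximum of a
    future block should only exceed with probability 1 - quantile."""
    lat = np.asarray(latencies_ns, dtype=float)
    n_blocks = len(lat) // block_size
    maxima = lat[:n_blocks * block_size].reshape(n_blocks, block_size).max(axis=1)

    # Maximum-likelihood estimation of the three GEV parameters from the block maxima.
    # Note: scipy's shape parameter c corresponds to -xi in the convention above.
    c, loc, scale = genextreme.fit(maxima)
    return genextreme.ppf(quantile, c, loc=loc, scale=scale)
```

In practice one would additionally derive a confidence range for the returned threshold, for example by bootstrapping the block maxima and repeating the fit — that is what turns a bare number into an estimate with a stated uncertainty.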
Now we're seeing the things we expect to see: when we measure for longer, the uncertainty gets smaller; we can quantify large uncertainties; and so on. That's already a first step towards satisfying the certification criteria I mentioned up front. Of course, this may sound somewhat miraculous, and to some extent it is, because the method relies on two statistical properties that are not fully satisfied by real systems at the moment. One is the assumption that all the measurement values we take are IID — independent and identically distributed; the other is that they come from a continuous distribution. Neither criterion is fully satisfied, so we are still making some mistakes. We have made sure that we err on the side of caution — we can show that our estimates are too large rather than too small — which is already a good thing, but of course not optimal, and there is still some work required to address these two issues.

Good, and that is basically all I wanted to tell you about jitterdebugger and how we proceed with the statistical analysis. I guess there's a minute or so left for questions, if you have any; if not, thank you for your attention.