OK. So, hello everyone: I'm Dario, and I work on virtualization at SUSE.

Let's say that you want to avoid performance regressions occurring when, for example, you release a new version of your project. In the case of SUSE, that would be operating systems. So, as a simple example, we want to avoid performance regressions when releasing a new version of SUSE Linux Enterprise, which is our operating system, and the way we do that is through benchmarks. We run benchmarks on the current release, then we run benchmarks on what will be the next release, and then we compare the results. If there are differences, namely regressions, it's likely that the causes are among the things that changed between the two releases. And in this very, very simple example, it would be these two components. Of course, there's much more in a real system, but again, it's just an example.

Now, the concept stays the same when virtualization is involved. The only difference is that, in that case, the benchmarks typically run inside VMs. What's already a little bit different, as we can see, is that the list of components that could potentially introduce performance regressions is already quite a bit longer, even in these silly examples that I made up. Moreover, with virtualization it's possible, as we know very well, to run a version of the operating system inside the VMs which is not the same as the one you are running on the host. If we want to be really sure that we are checking all the cases, we must still run benchmarks, but we must run them in a few more cases and combinations, like, again, exemplified here.

What about VM sizes? Meaning, what if there are performance bugs that only show up, for example, if a VM has more RAM than a certain threshold, or less RAM than a certain threshold, for that matter? Well, that means that, again, we need to consider more combinations, meaning we need to run the benchmarks in VMs of different sizes: different numbers of virtual CPUs, different amounts of memory assigned to them.

It's not over yet. It must also be said that there are configurations that you can apply to the VMs at the host level, like any kind of tuning, for example, of the host itself or of the VMs, by pinning the virtual CPUs, defining a virtual topology, this kind of stuff. Which means, and I think you are starting to see how this goes, that in theory we should run even more benchmarks, because we have even more combinations: we should check, for example, multiple different configurations for our VMs and run benchmarks inside of them in all these cases.

Finally, how cool is the fact that, as we also know very well, given a server, given a host, we can run anything from just one VM to hundreds of them? And how not so cool is the fact that this means we should also, if we want to be sure, run benchmarks in scenarios comprised of different numbers of VMs? So, multiple VMs involved, and a varying number of them.
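To give a rough idea of how quickly these combinations add up, here is a toy sketch. It is not MMTests code, and every value in it is a made-up example of the dimensions just mentioned:

```python
# Toy sketch only (not MMTests code): how quickly the combinations add up.
# All the dimension values below are made-up examples.
from itertools import product

host_releases  = ["current", "next"]
guest_releases = ["current", "next"]
vm_sizes       = [(2, "4G"), (8, "32G"), (32, "128G")]       # (vCPUs, RAM)
host_tunings   = ["default", "vcpu-pinning", "virtual-topology"]
vm_counts      = [1, 4, 16]

matrix = list(product(host_releases, guest_releases, vm_sizes,
                      host_tunings, vm_counts))
print(f"{len(matrix)} scenarios to benchmark, before even picking the benchmarks")
```

And every extra dimension multiplies that number again.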
Yeah, I lied, that wasn't the last thing. The last thing that I could come up with, though there are probably more, is the following: when I say "run a benchmark in multiple VMs", I think it's quite common to imagine that we run the same benchmark in all of the VMs, which is fine; but it's actually quite possible, depending on the use case the VMs are going to serve, that if you have several of them they run different workloads inside each one. And so, if we want to be complete and precise, we should also consider this situation. Which means that, if you try to put all these things together, our test matrix for making sure that we are not introducing regressions has basically just exploded.

And this brings us to the actual points, because there are two of them, of this presentation, which I'm going to try to make in the rest of it. The first is that, for managing this complex situation, we need tools; that's pretty easy to imagine, I guess. And the second is that we also need to make choices.

I have some examples. I am going to go through them very, very quickly; I put them there just to show that the list of scenarios and use cases I mentioned before wasn't made up. So, I mentioned that there could be performance issues that only manifest themselves when the size, the memory that you assign to a VM, is higher than a certain threshold, and this is one: the slowdown in the startup time of the VM depended on the size of the VM itself, and it was most severe when the VM was quite big. Then we have faced problems like this one up here, where, depending on whether or not the VM had a virtual topology, and on how the virtual CPUs were pinned to the physical CPUs, the behavior inside the guest kernel was different, and different in a way that had performance implications. And down here instead, again related to virtual topology, we see an example where whether or not the VM had an L3 cache defined as part of its virtual topology also changed the in-guest behavior; this time it was glibc that was behaving differently, causing performance anomalies. I'm not going into the details of all of these examples, but there is this talk from two years ago, I think, which does exactly that, if you are interested; otherwise, ask me later. And this is the last one. I mentioned that we want to consider the case where there are multiple VMs, and here we have an example where there was a problem with the host scheduler behaving in a way that wasn't fair to all the VMs. And of course, this is something that you only notice if you actually run benchmarks with multiple VMs.

Right, I mentioned that we need tools, in my opinion at least. And the tool that I want to talk about today is this benchmarking suite called MMTests. There are others, many of them actually; this is the one that we use and that I chose to talk to you about. It's a benchmarking suite that has been in use internally at SUSE, but also on the Linux kernel mailing lists in the upstream Linux kernel community, for quite some time actually, and it has evolved over time. Again, I will not describe it in much detail; I will give you some hints about it. If you are curious, find me or check out this list of materials. Basically, I call MMTests a benchmarking suite because it's a piece of software, a collection of scripts basically, that allows you to run benchmarks, quite a wide set of benchmarks, in a repeatable and automated way.
It also automatically collects results, runs some fancy statistics on top of them, and lets you see those results and the statistical analysis in a bunch of different ways, which includes plots. It can monitor, and also trace and profile, the system while the benchmarks are running, to try to figure out, and have more clues about, where a problem could be hiding, if there is any.

It works through configuration files. Benchmarks have configuration files, which are basically bash scripts themselves with a lot of exported variables. They can be parametric, so in the configuration file you can fetch information about the characteristics of the host or of the VM inside which you are running the benchmarks, so that you don't have to hard-code details like that. And it can be used, of course, otherwise I wouldn't be talking about it here, for running benchmarks in virtualized scenarios and inside virtual machines. It also supports a host configuration file where you can define, for example, what kind of monitoring and what kind of tuning you want, this time at the host level, and also which VMs you want to use for the benchmarking campaign.

Yeah, as I mentioned at the beginning of the presentation, it's possible to define the characteristics of the VMs in the host configuration file. So you can use MMTests to run benchmarks inside multiple VMs at the same time, and also construct scenarios where the VMs have different characteristics. And MMTests will manage the life cycle of the VMs all by itself: it will start them and stop them for running the benchmarks inside of them, it will define and create them and install an OS inside of them if they don't exist already, and all this kind of stuff. Containers: we are close to having support inside MMTests also for running benchmarks inside containers, defining the containers in the host config file and doing pretty much the same things that you can do with VMs, running benchmarks inside containers and maybe comparing the two solutions if you want. And it can work in an even more generic way, which we don't care too much about here; I just wanted to mention it because it's useful, for example, for benchmarking setups that we don't fully support in MMTests yet. For example, say we want to run benchmarks in KubeVirt VMIs: we are working on supporting these scenarios in MMTests properly, but until that happens you can at least run benchmarks inside the VMs managed by KubeVirt by just specifying their IP addresses in the config file. MMTests won't be able to start and stop the VMs through KubeVirt until we have implemented that support, but the benchmarks will run there at least, and you will have the results and everything.

Now, I mentioned a few times that I want to be able to run benchmarks inside VMs, and inside multiple VMs. What happens if I do just that? So, what happens if I start a CPU benchmark inside two VMs, at the same time in both of them, and just let it run? Something happens that I would call quite bad, because if the benchmark has phases or steps, or if you run multiple iterations of it, nothing guarantees that the various iterations, like in this example, are synchronized. So you have different iterations running against other things, or against nothing, in the two VMs, and the result is not going to be something that is really useful.
What about, instead, something like this, where the various iterations of the benchmark run, let's say, in lock step inside the two VMs? This is much better, and this is something that gives you results that you can use for actually making assumptions and drawing conclusions about the performance of the system. And MMTests has support for that. There are other benchmarking suites that do, but not many, at least to the best of my knowledge. And yeah, it does that by implementing a barrier protocol at the host level, so the VMs and the host communicate in such a way that the benchmarking proceeds in a synchronized way, in lock step. In the source code there is even an ASCII diagram; I know you cannot see it from here, so check out the code, it's awesome. It took me more time to draw the ASCII diagram than to implement the protocol, but I think it was worth it. So, in conclusion, apparently I hadn't drawn enough and decided to do some more.

And this is what happens in practice, which... now I'm not sure, is it my fault? Maybe it is. Okay, fine. So basically, what we see here is the CPU load inside of the VMs. These are three examples where there are two VMs, and from the fact that the curves are basically identical we can infer that what I just said is actually happening: the benchmarks inside of the VMs are synchronized and running in lock step. And this is a larger example, with 16 VMs.

Documentation: there is of course some documentation for MMTests, but of course it's not complete. Sorry about that; we are working on improving it. If you decide to try it, which I really encourage you to do, it may happen that you end up in situations where you don't know how to do something because it's not part of the documentation yet, at which point you can, for example, write an email to me; feel free to do that, actually.

Now, MMTests is a tool that can be used, and that we use, for day-to-day development: you make a change in QEMU, let's say, that you suspect will have an impact on performance, and you use MMTests to run the benchmarks and verify whether that is the case or not, and whether the impact is a good one or not. However, there's more to it. For example, the performance team inside SUSE, which is not my team, has implemented CI on top of MMTests; they use it, as I said already, for testing our product kernels, and they also submit the results to the Linux kernel mailing list. What I am doing is pretty much the same thing, but, let's say, for virtualization. I am reusing part of what they have, but not all of it, and that's because for now we are running it internally and I have to run it in a lab which is slightly different from the one they use; those are internal details that I don't think are really interesting. In fact, the code for this set of scripts that implements a virtualization CI around MMTests is here. I really recommend not going to check it out, not right now at least: it's very, very rough, and also, as I said, it's a little bit tied to some aspects of our internal labs. However, it can work in other environments; I also use it on my test boxes at my place, which is not such a lab, and I intend to develop it further and make it more general. It's just not there yet, although it's already working and running on these two boxes in our labs.
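To make the lock-step idea mentioned above a bit more concrete, here is a minimal sketch of the kind of host-side barrier that could keep iterations synchronized across VMs. It is purely illustrative: it is not the actual MMTests implementation or its protocol, and the port number and the "go" message are made-up choices.

```python
# Illustrative sketch only: NOT the real MMTests barrier protocol, just the
# general idea of keeping benchmark iterations in lock step across VMs.
# The port number and the "go" message are arbitrary choices for this example.
import socket

def host_barrier(num_vms, iterations, port=5555):
    """Run on the host: release all VMs together at the start of each iteration."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(num_vms)
    for _ in range(iterations):
        # Wait until every VM has checked in for this iteration...
        conns = [srv.accept()[0] for _ in range(num_vms)]
        # ...then release them all at (almost) the same time.
        for conn in conns:
            conn.sendall(b"go\n")
            conn.close()
    srv.close()

def guest_wait_for_go(host_ip, port=5555):
    """Called by the benchmark wrapper inside a VM before starting an iteration."""
    with socket.create_connection((host_ip, port)) as conn:
        conn.recv(16)  # blocks until the host sends "go"
```

A real implementation also has to cope with VMs that fail or finish early, which this sketch ignores.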
The technical details of how it works are not terribly interesting, partly because, as I said multiple times, it is still in its beginnings, and much more so because I want to use the time I have left for talking about and discussing how we are using it and how we could use it even further.

So, right now we are using it for checking for performance regressions in our virtualization products, which is basically what I said at the very beginning of the presentation: we take a particular version of our supported OS, we install it on a server, and we run benchmarks on it. Of course, we don't do that only when a release is finalized; we also do it during development and during maintenance. So, as soon as there are changes, because some packages receive maintenance updates or whatever, we rerun the benchmarks, we check the results, and if there are regressions we try to hunt them down. Which benchmarks do we run? Well, that's the interesting part; for now, let's only say that we run bare-metal benchmarks, even if only to use them as a reference, then benchmarks in just one VM, but with multiple sizes and multiple configurations, and also in multiple VMs, again with multiple sizes and multiple configurations.

Now, what we want to start doing is the same thing, but on upstream QEMU and for the wider QEMU community. So, for example, we can take openSUSE Tumbleweed, which is a rolling distribution, so it always tracks the latest development of basically everything: kernel, system libraries and so on, and we can use that as a base for building and testing, not really the QEMU packages that we have there, but, as I said, upstream QEMU: the released versions of QEMU, like, I don't know, 7.1, which is the latest, and also the previous ones, and then report the results and all these things. We can also try to use it for testing the latest developments of QEMU, fetching the latest commits from Git and testing them.

One important part, when one thinks about doing something like that, is how to deal with the results and the data that you get out of this activity. Ideally we would want to, I don't know, connect this to some official CI and automatically send reports, but this won't happen right away, not until we are sure that the whole thing works sufficiently well. For now it will probably be us taking care of monitoring the results, and maybe publishing them in dashboards, without spamming anyone too much. Then we will have to triage the problems, figure out the commit that is introducing each of them and try to hunt it down, or at that point alert the maintainers and ask for help fixing it.

Now, questions. But not questions that I am taking from you; questions that I am asking, which are a little bit what I was mentioning before. Actually, more than anything else on these slides, this is something that I feel is important. We want to run benchmarks, we want to run a lot of benchmarks, in a lot of different configurations and setups, numbers of VMs and sizes of VMs, but how many of them should we run, and which ones? Of course one might ask "why not all of them?", which is a very good answer, but if we run, not even all of them, but just a lot of them, then each run will take a lot of time, and since we have limited man and machine power for this activity we risk making it not very useful, because if a lot of time passes between two runs it's going to be harder to identify the change or the component that is introducing a regression.
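To make the "rerun and compare" step concrete, here is a rough sketch of that kind of comparison. It is only an illustration: the benchmark names, the numbers and the 5% threshold are invented, and this is not the statistical analysis that MMTests itself performs.

```python
# Rough illustration only: compare per-iteration results from a baseline run
# and a newer run, and flag anything that got noticeably worse. The benchmark
# names, numbers and 5% threshold are made up; MMTests' own statistics are
# more sophisticated than this.
from statistics import mean, stdev

def flag_regressions(baseline, candidate, threshold=0.05):
    """baseline/candidate map a benchmark name to its per-iteration scores
    (here, higher is assumed to be better)."""
    for bench, base_scores in baseline.items():
        new_scores = candidate[bench]
        change = (mean(new_scores) - mean(base_scores)) / mean(base_scores)
        if change < -threshold:
            print(f"{bench}: {change:+.1%} "
                  f"(stdev {stdev(new_scores):.1f}) -- possible regression")

flag_regressions(
    baseline={"cpu-bench": [100, 102, 99], "io-bench": [240, 238, 242]},
    candidate={"cpu-bench": [101, 100, 103], "io-bench": [210, 214, 209]},
)
# -> io-bench: -12.1% (stdev 2.6) -- possible regression
```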
I did put in the slides some ideas and suggestions about what I planned to start with: for example, which benchmarks; as another example, what type of VMs, so which sizes, and based on experience we were planning to focus on small VMs but also keep an eye out for larger ones; and the number of VMs, which will of course depend on the capacity of the server in terms of memory, CPUs and things like that. But for all these things, input from the QEMU community would be highly appreciated, especially since I am saying that we want to run this on released and upstream QEMU to help discover and track performance issues there; it makes sense that we decide together what the most interesting and most important configurations and workloads to consider are.

Yeah, I think I'm almost done. Another interesting thing, and it's the last thing that I'm going to say, is that when you run a benchmark in multiple VMs, so for example you run a benchmark in four VMs, you don't get one result, you get four; but the actual result that you want is one number, so how do you aggregate these results? Right now I am just averaging them, computing the standard deviation and this kind of thing; it's what felt most natural to me, but maybe there are other ways, and if that's the case, don't hesitate to come forward and provide suggestions, because they are very much welcome. And now I am really done, and this time the questions are the ones that I am happy to take, not to ask.

I have a virtual question: "How do you see Intel LKP tests versus MMTests in terms of Linux kernel testing? Interested to know things which MMTests can support which LKP cannot." Yes, that's an interesting question. I don't know enough about LKP to compare it with MMTests, especially for kernel testing, so bare-metal testing. The reason why I focused on MMTests was that I knew it, and I saw an opportunity for implementing, let's say, virtualization support in it, which is what I described when I said that it's now possible to run benchmarks with it in multiple VMs, synchronized, and so on. There are many benchmarking suites, LKP is one, but there are also others, and it's one of those things where there are many of them and there is always room for comparing them; we try to use one, but then another one always gets developed, and that's a little bit how it goes. I hope that answers it. Was there another one there? But I don't think there's time. Sure, apparently time is up, so feel free... I'm also around for the rest of the conference, so feel free to find me and ask anything else that you might want to know about this. Thank you.