So, CPU pinning, and then I'm done? It's not... that's the bulk of it actually, but there are a couple more things, at least in my opinion. Let's start by having a very quick look at this class of processors, of CPUs, which are the CPUs from AMD, the so-called EPYC, or EPYC 2 because it's the second generation, the EPYC 7002 series. These processors are built out of multiple dies: one is dedicated to all the I/O, the I/O die, and the other eight are the actual compute dies. Then there is the concept of core complex, CCX, which is basically a combination of four cores, which also means eight threads, because these processors have hyperthreading; and we will look into this a little bit more later. Each core complex has its own L1, L2 and L3 cache hierarchy. But, yeah, we will see about this later. Let's also introduce the concept of core complex die, or CCD, which is basically two CCXs, and so eight cores, 16 threads. Each CCD, and this is the important thing about CCDs, has a dedicated Infinity Fabric link (that's the name of the technology) to the I/O die. Right. And so these are processors that can have up to 64 cores, which means 128 threads, and you can have them in two-socket arrangements. And each socket has eight memory channels of DDR4 memory. So, yeah, there are links, if you download the slides and navigate them, to fetch more information, but you can easily find a lot more on Google.
So, I said that this talk is going to be about tuning virtualization, and some workloads at least, running inside virtual machines. I will say a few things, most of which are going to be general enough, but we will use a case study throughout the talk. So I will speak about this effort, this work that we did together, us, SUSE, with AMD as a partner, on coming up with a set of tuning advice for optimizing the performance of one of our SUSE products, SUSE Linux Enterprise Server 15 SP1, which is pretty much the same as openSUSE Leap 15.1, on this class of AMD processors. I will use this as a case study; that's important to say and to remember. Right. So, this is another way to look at one specific instance of the series of processors that I introduced before, the EPYC 7742. It's a big one, the biggest setup, the one that I said has 64 cores, which means 128 threads, and comes in two sockets. And this is the one that we used for this guide and the one that I'm going to refer to for this talk. Close up on one CCX: this is what I was saying. Each CCX has its own L3 cache, and of course each core also has a dedicated L1 and L2, as usual. But the fact that there is an L3 cache per CCX and, for example, not per NUMA node is, well, not weird, but it's something specific about this architecture, something which is at least very different from many other architectures that you find around, at least in the x86 world. And, yeah, tuning the performance basically means that, if you really want to try to achieve performance inside VMs which matches the one that you would get on bare metal, then it also means static partitioning; you cannot avoid doing at least some of that. And we will see whether it still makes sense to speak about virtualization.
So, if we have to partition resources statically, does it still make sense to speak about virtualization? Well, yes, according to me at least, especially on such a large platform, because you can still use it for server consolidation, because it's so huge that you can put a lot of VMs on it. And then you have the argument about flexibility and high availability and other stuff. So what resources are we talking about partitioning? Well, all the relevant resources: CPU, memory, and I/O. This talk will be focusing on CPU and memory; I/O we'll leave for another one. So the first kind of partitioning is going to be between host and guest, or guests, meaning that most of the time you want to leave some of the resources, namely some CPUs and some memory, to the host, because you have to connect with SSH or whatever to the host to do monitoring or management. And then oftentimes, depending on the configuration, but most of the time, the host has to carry out some activity on behalf of, or to help, let's say, the VMs, for example for doing I/O, running the QEMU I/O threads, whatever. The recommendation, well, it depends on what your actual goals are. One good rule of thumb is to leave at least one core per socket to host activities. Also, on this particular architecture, it would be better if you manage not to break, as we said before, a CCX, because otherwise you will have the VMs, or some of the VMs, and the host sharing the L3 caches, which is generally not something that you want for good performance. If possible, you should also try not to break a CCD, but then that would mean leaving eight cores, 16 threads, for the host, which you may or may not want to do. And the RAM, how much memory to leave to the host, it really depends; let's say 50 gigabytes, and be done with it.
So, another thing: huge pages, so whether or not to use huge pages and how to use them. Typically, and this is one of the really general things about virtualization, not specific to this platform, if possible you always want to use huge pages for the virtual machine, but you don't want to use them in the transparent huge pages way; let's say you want to preallocate the huge pages at boot time on the host and then use them as the backing of the memory of the VMs. And you don't want to have automatic NUMA balancing at the host level, because you are going to do static partitioning anyway. In the guest, it depends, it depends on the workload that you run in the guest; it's no different from tuning a workload on bare metal from this point of view. Once you have tuned the host, then inside the VM you just treat the problem like you would on a bare metal machine similar to the VM that you are focusing on. And one word about power management at the host level: of course, again, it depends. In general, it's good to do at least some benchmarks limiting, for example, the deep sleep states and using performance as the cpufreq governor, because it will help you get a first set of results which are consistent and don't vary too much. Then it depends whether it is okay for you and for your actual goals to keep these settings, or if saving a little bit more power is important; and if it is, you have to reassess the tuning and rerun the benchmarks, and so on and so forth, with the proper power management configuration that you want to have, let's say, in production. Then, as I said, pinning the vCPUs: we want to do that, and we do that in, for example, libvirt, like this.
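A minimal libvirt sketch of what this can look like; the VM name, the sizes, the vCPU count and the host CPU numbers below are all made up for illustration, and you would pick host CPUs that actually share a CCX or CCD on your machine (check with lstopo or lscpu):

```xml
<domain type='kvm'>
  <name>vm1</name>
  <memory unit='GiB'>16</memory>
  <!-- back the guest memory with the 1 GiB huge pages preallocated on the host -->
  <memoryBacking>
    <hugepages>
      <page size='1' unit='GiB'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <!-- pin each vCPU 1:1 to a host CPU; here 4-7 are assumed to be one CCX -->
    <vcpupin vcpu='0' cpuset='4'/>
    <vcpupin vcpu='1' cpuset='5'/>
    <vcpupin vcpu='2' cpuset='6'/>
    <vcpupin vcpu='3' cpuset='7'/>
  </cputune>
</domain>
```

The huge pages themselves would be reserved on the host at boot, for instance with kernel parameters along the lines of `default_hugepagesz=1G hugepagesz=1G hugepages=16` (the count here is hypothetical and depends on how much VM memory you plan to back).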
And, if possible, I was already touching on this before, you want to pin the vCPUs of the VMs in such a way that you pin to the CCDs, because in such a way you won't have two different VMs which will have to share the bandwidth of the Infinity Fabric link from the CCD to the I/O die. This means that if you do that, you will be able to configure up to either 14 or 16 VMs like that, depending on how many CPUs you leave to the host, on an EPYC 2 platform like the one I showed at the beginning. And if it's not possible to pin at the CCD level, then you may consider pinning at the CCX level, because then, yes, the VMs will share the bandwidth of the Infinity Fabric link to the I/O die, but at least they don't share the L3 caches. And at worst, at least pin to cores, and don't make VMs share cores, i.e., execute on sibling hyperthreads and also share the L1 and L2 caches, unless you really want to ask for big trouble. Memory placement: similar to vCPUs, but even simpler, probably, because if the VM that you want to use is big enough to span both the NUMA nodes, then you put half of the memory of the VM on one NUMA node and the other half on the other, and then you also, I guess, yeah, I have this in the next slide, so let's, yeah, sorry. And in the other case, which is when the VM is not large enough to span both NUMA nodes and it fits in just one of them, then you put all its memory in that NUMA node, as simple as that. And then, as I wanted to try to say before, if the VM spans both of the NUMA nodes, yes, you put one half of its memory on one node and the other half on the other, but you also have to provide the VM a suitable and meaningful virtual topology, a virtual NUMA topology, actually.
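In libvirt terms, this memory placement can be expressed with numatune; a sketch with illustrative node numbers, and note that the two-cell variant assumes the guest has virtual NUMA cells defined in its CPU topology:

```xml
<!-- Small VM: keep all of its memory on one host NUMA node -->
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>

<!-- Big VM spanning both host nodes: split the memory half and half;
     cellid refers to the guest's virtual NUMA cells -->
<numatune>
  <memory mode='strict' nodeset='0-1'/>
  <memnode cellid='0' mode='strict' nodeset='0'/>
  <memnode cellid='1' mode='strict' nodeset='1'/>
</numatune>
```

The two `<numatune>` blocks are alternatives for the two cases; a domain definition would contain only one of them.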
If it doesn't, you are fine, you just enforce that the memory of the VM stays on one NUMA node, but you still have to provide, in both cases, a meaningful CPU topology, so virtual sockets, threads, cores, stuff like that, and also a good, let's say, CPU model. What does it mean, good? We will see in a few slides. Yeah, then secure virtualization, Secure Encrypted Virtualization: AMD, on these processors, also provides this feature, which basically allows you to encrypt the memory of the virtual machines, and it's transparent to the VMs. It's very efficient, it's very cool. There are instructions to set it up; I'm not going to cover these in detail. And security, so the hardware vulnerabilities which are well known these days: the good thing about these processors is that AMD processors in general, and these in particular, are only vulnerable to a subset of them, and in particular to the nastiest one for virtualization they are not vulnerable, so we are happy about that. Benchmarks: the benchmarks that I ran, I said I wanted to focus on CPU and memory, so I will show results of running STREAM, which is a memory benchmark. What I will show right now is going to be the results of running STREAM on bare metal and then inside one or more VMs, so that we can compare results. And you can configure STREAM: in our case we used OpenMP for parallelization of the STREAM jobs, and so we used different numbers of threads, which was either 16 or 32 on bare metal. In general the rule of thumb is to use as many threads as there are memory channels, but this is not information that is easily available via software, that you can easily figure out via software, and so you can kind of approximate it by using one thread per LLC. And this applies to both the host and the virtual machine. So what do I have here?
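As a sketch of what such a virtual topology can look like in the libvirt CPU element; the counts and memory sizes are made up, and the idea is that sockets/cores/threads mirror the partition the VM is pinned to, with virtual NUMA cells only when the VM spans both host nodes:

```xml
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>EPYC</model>  <!-- more on the model choice later -->
  <!-- 16 vCPUs presented as 2 sockets x 4 cores x 2 threads -->
  <topology sockets='2' cores='4' threads='2'/>
  <numa>
    <cell id='0' cpus='0-7'  memory='32' unit='GiB'/>
    <cell id='1' cpus='8-15' memory='32' unit='GiB'/>
  </numa>
</cpu>
```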
Here I have, in the purple bars, the results of running STREAM on bare metal, and then in green STREAM run inside the VM without any kind of tuning, so the performance doesn't match, and you see it very well. Then I applied a little bit of tuning, so the VM had a virtual topology but without any pinning of CPUs or memory, and that's the light blue bar, and again the performance doesn't match. And then, magic: you apply the tuning that I described, and you see in the last bar that now the performance on bare metal and inside VMs, just one VM, basically matches, so that's what we wanted. And this is when running STREAM just in single-threaded mode; the same when using 32 threads for STREAM, as you can see: we are able to reach very good performance, because inside the VM we achieve pretty much the same level as on the host. Here I used two VMs instead of one, so there are a lot of elements in the plot. You would want to focus, again, on the first one, which is bare metal; the red and black ones are, I'm now using two VMs, the scores, the results, the performance that you get from STREAM run inside these two VMs. So it's okay that it's slower, that it's less than bare metal, because now you have partitioned the CPU in two, and basically you have assigned each part to a different VM. And the important part, the nice part, is that, as you can see, the performance of the VMs is quite consistent, because they are basically performing the same on all the four STREAM operations. And then this last bar is basically the sum of the two, which again is pretty much bare metal, and so we are happy again. Now, I mentioned Secure Encrypted Virtualization; I said that the memory of the VM is encrypted. What's the overhead that comes with that? As a matter of fact, at least for this benchmark, it is very, very low: in some papers you find that it always stays within 3%; in these cases, at least as far as I can measure, it stayed within 1%.
And yeah, another benchmark: this one is called NPB, the NAS Parallel Benchmarks; it's a very CPU-intensive benchmark this time, which is what I said, memory and CPU. It also uses a parallelization framework, OpenMPI this time, not OpenMP, and this time lower is better; before, I probably forgot to say, higher was better. And yeah, the same stuff: basically, first bar bare metal, last bar VM with tuning, and we want them to match and to be very similar, and that's actually the case, with tuning applied, in all the various variants of the NPB benchmark. And again, I also benchmarked with encrypted virtualization enabled or disabled with this CPU-intensive benchmark, and again, less than 1% performance impact, which is very good. Now, the CPU model: it is QEMU that builds the virtual CPU for the virtual machine, what flags it has, how it is presented, what kind of virtual CPU QEMU presents to your virtual machine. In theory, if you want to achieve the best possible performance, you find in various pieces of documentation that you should use this thing, host-passthrough, but it depends, for example, on the version of your software, in this case QEMU and libvirt. As we said, this effort was about doing this tuning on these particular distributions, this distribution, sorry. And as a matter of fact, the distribution went out before the EPYC 7002 series of processors was available. And so, if you use host-passthrough, it turns out that in this particular case it doesn't do a good job. And the detail of why is here: basically, the threads are not exposed correctly. So, as a matter of fact, there is a CPU model called EPYC, which is there because it represents the previous generation of EPYC processors.
And if you use that one, yeah, or EPYC-IBPB, which is basically the same thing, but that's, well, just what they called it, that would have worked too, it provides the VM a better, sorry for that, a better virtual topology. And in fact, this is what happens if you use host-passthrough: it's this one. So it's, again, lower is better, so it's tuning applied, but using host-passthrough as the CPU model: very, very bad, because we wanted to be here; using EPYC, it's here. Of course, if you use a more updated distribution, a new version of openSUSE or SLE or whatever other distribution, or just code from upstream, you will find the EPYC 2 CPU model there and you can use it. But I put this part in here because I wanted to stress the fact that, yes, there is all this tuning advice, but you really should always double check, because host-passthrough was the natural choice and it wasn't performing well. Now I have other STREAM benchmarks, but I'd rather try to leave some time for questions, so, yeah, let's see. So, yeah, basically the conclusions are that achieving very good performance, even performance that actually matches the one of the host, inside either one or more VMs, is possible, at least for certain workloads, and it happens mostly via resource partitioning. If you use KVM, QEMU, libvirt in that particular product, or even better if you use them from upstream, you have all the tools and the capabilities to achieve this very good resource partitioning.
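In libvirt terms, the contrast above can be sketched like this (the exact elements are illustrative; the point is using the named model on that software version):

```xml
<!-- Tempting, and usually recommended, but on this SLE 15 SP1 stack it
     did not expose the threads of the new EPYC processors correctly: -->
<!-- <cpu mode='host-passthrough'/> -->

<!-- The named model from the previous EPYC generation performed well here: -->
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>EPYC</model>
</cpu>
```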
We at SUSE also support Xen, and you can do pretty much the same with Xen, although the performance won't be as good as this, because Xen still lacks the capability of properly exposing a virtual topology to the guest. And the EPYC 2 platform turned out to be quite a good platform from this point of view, because it offers great scalability, offers memory encryption with exceptionally low overhead, as we saw, and because it is only affected by a subset of the vulnerability flaws related to speculative execution. So with that, yeah, in the slides you will find a little bit more information about myself, and while taking questions, let me, as we did this morning, say one more time farewell to my very good friend Lars Kurth, with this picture taken at FOSDEM a few years ago. And yeah, but really, questions. Yeah. I see three hands, I guess. So we're... Sure. Yeah, sorry. Ah, perfect. Yeah, I always forget about it. The question is whether any of the benchmarks that I showed were run in a scenario where a VM was spanning multiple NUMA nodes. So when I showed these results, these ones, for, no, these ones, one VM, okay? If you use one VM, one very big VM, then yeah, I have another slide that I didn't show, but let's use it for that. This was the VM that was used in that benchmark. So it was spanning both of the NUMA nodes. It basically had pretty much as many vCPUs as there are CPUs, with the exception of the ones that I decided to leave to the host. But this was spanning both the NUMA nodes, and so it had a virtual topology exposed to it. The question now is about what huge page size was chosen: it was one gigabyte. The other questions, let's go there. So this, the question was about, since I said that, if possible, it's better to configure a VM so that it stays inside a CCD, inside a CCX, and if it goes outside, stuff like that, whether I have numbers for that.
Not yet; again, this was in the slides that I decided to skip, but, you see, this is an ongoing effort. We are running more benchmarks, continuing our evaluation, and so I have ongoing investigations with multiple VMs, in cases where I actually fulfilled my own recommendations, and so I don't split CCDs and stuff, but also in cases where I violate them and I put VMs across CCDs. Just as a hint: this, for example, is a case where six VMs were used, and you see that the absolute level of the performance is the correct one, if you do the math this is fine, and the performance is also actually quite consistent in this case, this case, this case; but you see some strange behavior here. And, again, this is an ongoing investigation, so this is just a little bit of speculation, but what we are seeing is that when you start not respecting these recommendations, pinning VMs in such a way that they share too many resources, then what happens is that you get this not so consistent behavior in the results. And see, here is another example where the recommendations were not really respected, and you have performance which is not exactly the same in all the VMs. Yeah, there were other questions, but I think we are out of time. We are, so, I'm happy to answer, I mean. It's just the last presentation, so. I mean, I can. It's not recorded, but if you want them. I mean, I'm fine. Go ahead. I'm good. Sure. First of all, thank you for your talk, I very much enjoyed it. I was wondering, have these optimizations and tuning things been implemented in, for example, OpenStack already? Well, I guess I repeat the question as well. The question was about whether these optimizations are implemented in OpenStack or similar software. I have no idea. I have never played with OpenStack, and I don't plan to in the foreseeable future, to be honest.
I am aware of very few efforts and very few capabilities similar to the ones that you are describing, so doing resource partitioning and optimization at this level automatically, either in OpenStack or in other software. There are solutions, but achieving this level of detail in the tuning is quite hard, because, after all, it's a matter of the interface that you present to the user for letting him or her achieve this; in the end it turns out to be rather similar to the libvirt XML itself, because it's a very detailed level of configuration. So I'm not saying it's not possible, and I would really hope that the situation were better, but I'm not aware of anything that reaches this level of detail. Yeah, go ahead. Yes. Sorry, the last part. Yes, there are. I haven't monitored that part, but the fact is that, at least according to me and to my experience in running similar evaluations, even on other platforms, consistency of results like this is something which is quite good and that you don't find very often. But apparently, as soon as you mix things in a not necessarily super ideal way, then these very nice properties start to, like, fade away. So, yeah, I haven't checked whether what you said was also happening in these cases. But yeah, I have a scenario with 30 VMs, where I am violating the recommendations by using basically too many threads for STREAM, if you count all of them running inside all the VMs. And if you look at the actual throughput that is achieved, it's actually quite good, but it's all unbalanced: if you sum all of them up, it matches or even exceeds the one that you achieve on the host, but then it's all like that, ups and downs.