Hello everyone, and welcome to this talk, which will be about virtual topology for virtual machines. Someone once said that in theory there is no difference between theory and practice, but in practice there is. I personally like this sentence a lot, and I think it applies quite well to virtualization, and particularly so to the topic at hand, which is whether providing a virtual machine with a virtual topology is good, bad, or something in between. In some more detail, this talk will be about a function in the Linux kernel scheduler code called try_to_wake_up(), it will be about this graph, and it will be about these log entries, reporting an unmitigated vulnerability, that I recently saw inside a virtual machine.

For some background, the word "topology" comes from mathematics and geometry, where it refers to properties of an object that remain invariant under some specific transformations, such as stretching, twisting, crumpling and bending. In electronics and informatics, instead, it is used to describe the interconnections and the relationships between the various components of a computer system. Now that I think about it, I have definitely been in the mood of wanting to stretch, twist, crumple and bend a CPU, and let's see if after that it keeps showing that bug... but anyway.

When there is no virtualization, the physical topology of a system describes how many CPUs, caches, memory and I/O controllers we have, and how they are organized: in threads, cores, sockets, caches, NUMA nodes, I/O buses and bridges. On a standard Linux system there are multiple ways of analyzing the topology. For instance, just using lscpu will already reveal a lot of useful information. numactl --hardware is another way to inspect how the CPUs and the memory are arranged in NUMA nodes. And last but not least, Linux offers a really convenient set of interfaces for checking which CPUs are part of which core and package, what other CPUs they share resources with, and a whole lot of other information like that. They are all available within the sysfs special file system, which means they can also be queried from programs and scripts. If, on the other hand, a graphical representation is preferable, a tool called lstopo from the hwloc project can be used to produce diagrams like the one in the slide. These provide a very immediate and clear view of the system's components, their relationships and their interconnections, which is in fact, by definition, the system topology.
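Since the sysfs interface was just mentioned, here is a minimal sketch (my addition, not something from the slides) of how those per-CPU topology files can be consumed from a program. The paths are the standard Linux sysfs ones; error handling is kept to a bare minimum, and the loop simply stops at the first CPU directory that is not there.

```c
/* Sketch: walk the sysfs topology interface and print, for each CPU,
 * its package, its core and the CPUs it shares the L3 cache with. */
#include <stdio.h>
#include <string.h>

static int read_str(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

int main(void)
{
    char path[256], pkg[64], core[64], llc[256];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
        if (read_str(path, pkg, sizeof(pkg)))
            break;  /* no such CPU: stop the walk */

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        read_str(path, core, sizeof(core));

        /* index3 is usually the unified L3; it may be absent. */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list", cpu);
        if (read_str(path, llc, sizeof(llc)))
            snprintf(llc, sizeof(llc), "n/a");

        printf("cpu%d: package %s, core %s, shares L3 with CPUs %s\n",
               cpu, pkg, core, llc);
    }
    return 0;
}
```

Run on an SMT machine, it will for instance print the same core_id for two sibling hardware threads, and the same L3 sharers list for every CPU of a socket.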
Now, if you have ever tried using any of these tools inside a virtual machine, you already know that it is very much possible to do so. In fact they work, and the output they produce has the same format and look as what we see on bare metal. But of course, what you are looking at in that case is the virtual topology. It is the job of the hypervisor and of the other components of the virtualization stack to put together the necessary pieces, like CPUID bits and virtual ACPI tables, in such a way that when the operating system running inside a VM reads them, it configures itself as if it were running on a piece of hardware that has the topology that the user requested.

Let's just look quickly at some examples and at how they can be configured. Here, for instance, we see how to put together a virtual topology where all the eight vCPUs are seen as separate sockets, and how the result looks from inside the VM; this is, by the way, the topology that QEMU builds by default. This is instead how we can make the vCPUs look like they are arranged in cores and threads, or even in virtual NUMA nodes. Note that recent enough versions of QEMU and libvirt even allow us to specify the distances between the virtual NUMA nodes, exactly like on real hardware.

In general, querying the details of the physical system topology and then taking them into account when making decisions, mainly at the OS kernel level, is a good thing. The scheduler is probably the most obvious example: if the scheduler knows about symmetric multithreading, it can avoid scheduling tasks on two threads of the same core if there is a full core available, which improves performance; or it can try to run a task on the NUMA node where the memory that it uses the most resides, which improves scalability. And in fact these and other features are all there inside the Linux kernel scheduler.

But what about virtual topology? I mean, can the scheduler running inside a VM that has a virtual topology which includes SMT behave the same as it would on bare metal, and then assume to have done equally well at improving the performance of the workload running inside the VM? Well, if the VM has an SMT virtual topology, the scheduler will behave like on bare metal, that is for sure, and it will probably run the task t2 from the example in this slide on either vCPU v2 or v3. But is this good, or is it bad, or maybe we just can't tell with only this information? The honest answer is that we can't tell: in order to know whether it would be good or bad to run the task t2 on any of the vCPUs, say v2, we would need to know things like on which physical CPUs the vCPUs v1 and v2 actually run, and maybe also what other host tasks run on the system's physical CPUs besides the vCPUs of the VM. So, is there even any point in having a virtual topology? Because if it's like that, the scheduler inside the VM will never know all these things, so how is this all supposed to work?

Well, one idea is to configure the system at the host level so that the scheduler inside the VM does not need to know them. In fact, if each particular vCPU of each VM only and always runs on a specific pCPU on the host, then all the uncertainty about the virtual-to-physical CPU relationships and mappings just goes away. Therefore, if we know that, for example, vCPUs 0 and 1 of VM1 will always run on pCPUs 0 and 1 of the host, and that p0 and p1 are two threads of a core, then it sure makes sense to have the scheduler in VM1 work in the same way as if v0 and v1 were threads of the same physical core. And yes, if we provide it with this information, by means of defining a virtual topology for the VM, the decisions made according to such a topology will likely have a positive effect on performance and scalability inside VM1 itself.

Let's look at the host again a little bit better. If we also configure the system in such a way that the vCPUs of each of our VMs run in disjoint subsets of physical CPUs, then we are doing what is called dedicated resource partitioning for the VMs. What about host tasks that are not vCPUs? Well, ideally you would isolate your vCPUs from any interference, including the one coming from host tasks. However, if the system is only devoted to running virtual machines, the interference from regular host tasks will hopefully be small, or maybe it is enough to shield the vCPUs from just a few of them; in that case, such a configuration can still be referred to as dedicated resource partitioning.

So this is how, in my experience at least, the topic of whether or not a VM should have a virtual topology is typically dealt with: basically, we use, and we recommend, giving a VM a virtual topology if and only if we have dedicated resource partitioning. And by the way, dedicated resource partitioning is implemented by means of vCPU pinning, and also memory pinning; they can both be configured from libvirt, as shown for example in this slide.
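Just to make the mechanism concrete: underneath, vCPU pinning boils down to restricting which host CPUs each vCPU thread is allowed to run on. Libvirt takes care of that for you through its own machinery; the snippet below is only a hand-rolled sketch of the idea, with made-up vCPU thread IDs, using sched_setaffinity() directly.

```c
/* Illustration only: pin each vCPU thread to one physical CPU, the way a
 * one-to-one "dedicated resource partitioning" setup would. The thread IDs
 * in vcpu_tids[] are placeholders. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    pid_t vcpu_tids[] = { 12345, 12346 };   /* hypothetical vCPU thread IDs */
    int   pcpus[]     = { 0, 1 };           /* e.g. p0 and p1, two SMT siblings */

    for (int i = 0; i < 2; i++) {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(pcpus[i], &set);
        if (sched_setaffinity(vcpu_tids[i], sizeof(set), &set))
            perror("sched_setaffinity");
        else
            printf("vCPU thread %d pinned to pCPU %d\n",
                   (int)vcpu_tids[i], pcpus[i]);
    }
    return 0;
}
```

For a QEMU guest, the vCPU thread IDs would of course have to be discovered first, for instance by looking at the threads of the QEMU process.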
In case you want to implement vCPU pinning in such a way that the virtual topology of the VM matches the physical topology of the host, this might be slightly tricky at times. Doable, of course, but I suggest having some pills against headache at hand. In fact, QEMU picks up the vCPUs in order and puts them in place inside the VM topology; the physical CPU ID numbering, on the other hand, not only is interesting, but it is also different from one system to another. As an example, you can see here the physical CPU numbering scheme of an AMD EPYC server, and on the right what we had to do to achieve a properly matching virtual topology in a VM.

Okay, yes, as already said, this is the QEMU and libvirt default virtual topology for your VM, if you don't define one yourself. As you can see, it is a flat one, with no dependencies and no sharing of computational resources or caches. Well, after all, something has to be used as the default, I guess, and it is a good idea to make it flat, as we do not know in advance the characteristics of the hardware where one would want to run the VM, whether or not there will be resource partitioning, and so on and so forth. Maybe one thing to note is that a configuration like this, where all CPUs are full sockets, is not really so common in hardware, but we will get to this later in the presentation.

Okay, enough introduction. Didn't I say that I wanted to talk about this try_to_wake_up() function? Well, let's do it. It is, surprise surprise, one of the functions that in the Linux kernel are responsible for doing the work of waking up a task. When such an event happens, it happens on a CPU, which we will call the wake-up CPU. The waking task needs to go into a scheduler runqueue; more precisely, as soon as the CPU where the task needs to run is identified (we will call it the target CPU), the waking task needs to be put in its runqueue. Then the target CPU must somehow be notified that there is a new task for it. If that CPU was idle, the task will probably just run immediately, as shown in this example. If, on the other hand, the target CPU was not idle, then the newly woken task may or may not run immediately, depending on whether or not it preempts the other task running there, as shown in this other example.

Now, in order to highlight the impact of the topology on the task wake-up path, let's ask for the help of ftrace and trace-cmd. If we trace the wake-up path, we first of all find the event about task t1, which is waking up; we also see that the event itself happened on CPU p0, and that the task t1 wants to run on CPU p2. Now, do you see that function called cpus_share_cache()? Yes, here is where our beloved topology comes into play. Basically, we check whether the wake-up CPU and the target CPU share the LLC, typically the L3, and if they do, like in this case, then p0 takes the spinlock of p2's runqueue and puts t1 directly inside it. This is reasonable, as probably both the runqueue and the spinlock of p2's runqueue will be present in the shared L3 cache already; and even if they are not there, they do belong there, in the sense that bringing them in most likely does not mean fetching them from another L3 cache and having to deal with coherence and all that is related to that. It is also still p0, so the wake-up CPU, that can check whether either a preemption is necessary or p2 needs to be woken up. For instance, in this case p2 was idle, and since it is monitoring the scheduling flag with, literally, the monitor/mwait instructions, waking it up from idle only requires p0 to set that flag.

Let's now look at what happens if there is no shared L3 cache between the wake-up and the target CPU. Well, t1 still wakes up on p0, and it now wants to go run on p3. Accessing the runqueue of p3 directly from p0 is out of the question: in fact, it would mean bringing into p0's L3 cache data from another L3 cache, and having the CPU deal with cache coherency and all that, which works, of course, but is not as efficient and as convenient as before. Therefore, we basically stash t1 in a wake list; p3 wakes up, goes picking up t1 from the wake list itself, and starts running it. So why are we talking about all this? Well, I just think that this is a rather nice example of how the system topology might actually have effects which go beyond just the ones that one would expect.
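To make the two paths we just traced easier to compare, here is a very condensed, user-space mock-up of mine; it is not the actual try_to_wake_up()/ttwu_queue() kernel code, and the runqueue, the wake list and the IPIs are only simulated with prints, but it shows the shape of the decision.

```c
/* Simplified sketch of the wake-up placement decision described above;
 * cpus_share_cache() and everything else is mocked up. */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Pretend CPUs 0-1 share one L3 and CPUs 2-3 another one. */
static int llc_id[NR_CPUS] = { 0, 0, 1, 1 };

static bool cpus_share_cache(int a, int b)
{
    return llc_id[a] == llc_id[b];
}

static void queue_wakeup(const char *task, int wake_cpu, int target_cpu)
{
    if (cpus_share_cache(wake_cpu, target_cpu)) {
        /* Shared LLC: the wake-up CPU grabs the target's runqueue lock,
         * enqueues the task there and does the preemption check itself;
         * the runqueue data most likely already lives in the shared L3. */
        printf("cpu%d: enqueue %s directly on cpu%d's runqueue\n",
               wake_cpu, task, target_cpu);
    } else {
        /* No shared LLC: don't drag the remote runqueue's cachelines over
         * here. Stash the task on the target's wake list and notify it
         * (an IPI, or just a flag write if it is idle in monitor/mwait);
         * the target CPU finishes the wake-up on its own. */
        printf("cpu%d: put %s on cpu%d's wake list and notify it\n",
               wake_cpu, task, target_cpu);
    }
}

int main(void)
{
    queue_wakeup("t1", 0, 1);   /* p0 and p1 share the L3 */
    queue_wakeup("t1", 0, 3);   /* p0 and p3 do not */
    return 0;
}
```

The real decision in the kernel is of course more involved, but the cpus_share_cache() test is the hinge point that the rest of this talk keeps coming back to.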
Okay, this was all about bare metal, but we are here to talk about KVM and virtualization, so shall we look at the virtualization case? As we have seen, it is all about that function called cpus_share_cache(). Do the virtual CPUs of our VMs share an L3 cache? A few years ago the answer would have been an outright no, because back then QEMU was not including any L3 cache in the description of the virtual topology. Now it does, and interestingly enough, the reason why it was added was exactly that behavior of try_to_wake_up() that I just showed. But the fact that there is an L3 cache in there does not mean that it is shared between all the vCPUs; that depends on the virtual topology. For instance, with the default virtual topology no virtual CPU shares any cache with any other; on the other hand, with a different virtual topology that one may define, like one where all the vCPUs are defined as cores of the same socket, then yes, they do.

So let's redo what we did on bare metal, which is tracing the wake-up of a task, but this time inside a VM, and let's do it first for a VM with the default topology. That basically means that, no matter which one is the wake-up vCPU and which other one is the target vCPU, they do not share an L3 cache, so the trace is in fact similar to the bare metal one without a shared cache, and we indeed see the wake list being used there. The inter-vCPU notification mechanism is different, though, as in this case we do need an IPI. Note that even in the case when p2, so the target vCPU, is busy, things are pretty much the same, with the only exception that t1 may or may not run immediately there; that depends on the result of the preemption check, which needs to happen on the target vCPU.

If we switch to another virtual topology, so that now the wake-up vCPU and the target vCPU share a virtual L3 cache, we see that, as was happening on bare metal already, the wake-up vCPU is now putting the task directly into the runqueue of the target. Pretty much everything is the same even if the target vCPU is currently busy. Note, however, that the fact that it is the wake-up vCPU doing the work matters when we check what happens if that preemption check tells us that the newly woken task t1 should not preempt t2, which is currently running. In fact, in that case, thanks to the fact that we were able to do the check from the wake-up vCPU, and since we have already put t1 in the proper runqueue, there is nothing else that we need to do. So basically, with respect to the case where the default virtual topology was used, we are saving an IPI, which is always a good thing, especially considering that it would have been a kind of useless IPI, so a VMEXIT, and disturbing the target vCPU for nothing, as no rescheduling is really needed.
So, "can we please always have a virtual topology with a shared L3 cache?", one may think. Well, remember that the kernel in the VM was behaving like we saw in the traces because we told it that the involved virtual CPUs were sharing a virtual L3 cache. But where does the actual data that was accessed reside? For example, when the wake-up vCPU put the task directly inside the target vCPU's runqueue, where was that runqueue located in the real system? Inside which L3 cache was it, really? Inside the L3 cache of the physical CPUs where both the wake-up and the target vCPU were running? Well, yes, in the example shown in this slide. Yes also in this other example, where the wake-up and target vCPUs still shared a physical L3. In this other example, however, no, they may not.

So the whole point that I am trying to make is this one. We are used to considering providing a VM with a virtual topology only when we are able to do dedicated resource partitioning, with a one-to-one pinning of vCPUs to pCPUs. However, what if, for example, we managed to guarantee for each of our VMs that, even if the host moves the vCPUs around, they always stay inside an LLC domain, which most typically means staying within one socket or NUMA node? Would they benefit, in such a situation, from having a virtual topology defined differently from the default one? And if yes, which one? These are all open questions, and one would need some experimental analysis and some benchmarking in order to try to answer them.

Actually, I have done some benchmarks. I used a Xeon Platinum platform for that, with 96 CPUs arranged in two nodes, 24 cores per node and two threads per core. The benchmarks were run, at the same time, inside one, then four, then 12 and then 18 VMs. The VMs had 8 vCPUs each, meaning that with 12 VMs we had as many vCPUs as the host has pCPUs, while with 18 VMs we were in overload. I used MMTests as the benchmarking suite and, some really big shameless self-advertising here, it is growing some really nice virtualization benchmarking capabilities. I have run, and am still running, several benchmarks; these are just the three that I will be talking about in this presentation: hackbench with a varying number of thread groups, the pipe benchmark from perf, where two tasks exchange messages in a tight loop using a pipe, and schbench, again with a varying number of threads.

To check the performance with different configurations in terms of how the resource partitioning is implemented, we basically want to check the effectiveness of various ways of trying to keep each VM running on a specific subset of physical CPUs, namely ones that share an L3 cache, so that we can hopefully benefit from the more cache-sharing-friendly wake-up path. So far I have done that at the granularity of NUMA nodes. I have therefore run the benchmarks with the default pinning configuration, which is no pinning at all; then I have enabled numad, which in theory should work as a means of at least keeping each VM on one NUMA node; then I have added some pinning, but only pinning the VMs to NUMA nodes, so all the vCPUs of each VM were pinned to all the pCPUs of a specific NUMA node; and then I have also looked at the dedicated resource partitioning case, with one-to-one pinning, even if just for reference.
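As an aside, to make the "per-node" idea concrete: restricting a VM to one NUMA node simply means letting all of its vCPU threads run on any CPU of that node, rather than fixing each vCPU to one CPU. The sketch below is my illustration, not how libvirt implements it; the thread IDs are made up as before, and it uses libnuma (build with -lnuma).

```c
/* Illustration of "per-node pinning": confine a VM's vCPU threads to all
 * the CPUs of one NUMA node, without fixing each vCPU to a specific CPU. */
#include <numa.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    pid_t vcpu_tids[] = { 12345, 12346, 12347, 12348 };  /* hypothetical */
    int node = 0;                                         /* target NUMA node */

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    struct bitmask *cpus = numa_allocate_cpumask();

    if (numa_node_to_cpus(node, cpus) < 0) {
        perror("numa_node_to_cpus");
        return 1;
    }

    /* Every vCPU thread may run on any CPU of the chosen node. */
    for (size_t i = 0; i < sizeof(vcpu_tids) / sizeof(vcpu_tids[0]); i++) {
        if (numa_sched_setaffinity(vcpu_tids[i], cpus) < 0)
            perror("numa_sched_setaffinity");
    }

    numa_free_cpumask(cpus);
    return 0;
}
```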
As far as VM topologies go, of course I benchmarked the default one, which we can see here for a VM with eight vCPUs; then I checked an all-cores topology, so that all the vCPUs do share an L3 cache, but not much else; and I finally also looked at a topology with both cores and threads. Theoretically speaking, if we really manage to reliably keep a VM within one NUMA node, these two topologies here, which define a shared L3 cache, should be able to take advantage of that. Also, in the one-to-one pinning case, this last topology here is the one that lets us achieve the perfect one-to-one mapping between physical and virtual topology.

Now, to the results. For hackbench, this is a comparison of all the possible combinations of all the tested configurations and topologies when running only one VM on the host. On the x axis there is the number of thread groups of hackbench, on the y axis the average result. It seems that the red line with full dots is almost always the best, and it corresponds to per-NUMA-node pinning with the eight-cores-in-one-socket topology. There are also other combinations of pinning and topology that are able to achieve good performance, but it appears quite evident that the default topology, with its eight sockets, is pretty much always the worst.

When moving to 12 VMs, which means that we are saturating the CPU capacity of the host, things do not change much. Now it is the one-to-one pinning configurations that provide the best results, but only if paired with a topology that has a shared L3 cache among the vCPUs; the per-node results follow closely and show the same patterns and trends. In fact, the lines that correspond to one-to-one and per-node pinning but with the default topology are the yellow one with empty squares and the blue one with crosses, and they are not looking good.

We can also look at the per-node configuration in some more detail, still for the 12 VMs case. From this view it is quite easy to see that, among the three tested topologies, the one with eight cores and the one with four cores and two threads are pretty much always largely better than the default one. Changing angle one more time, we see the confirmation that, if we fix the topology to one socket and eight cores, the one-to-one pinning configuration is the best for this benchmark, but the significantly less stringent per-node pinning configuration is really, really close. It is also interesting to note that the numad configuration is instead not doing that well: it could have been expected that numad would put in place the same effects as the per-node pinning configuration and then achieve similarly good results, but this is not happening, and this, I guess, probably deserves further investigation. Finally, if we check what happens when we oversubscribe the host with 18 VMs, which means a total of 144 virtual CPUs over 96 physical CPUs, the trend is confirmed for the default topology, especially at the higher thread group counts.

schbench, however, shows us a slightly different picture. In fact, if we look directly at the behavior of the per-node pinning configuration, this time with the various topologies, we see that in this case the default topology is actually the one performing the best, both with a moderate load of four VMs and at full capacity with 12 VMs. And this is also true if we check the one-to-one pinning. Interestingly, pinning the vCPUs one-to-one to the pCPUs and also making the VM topology exactly match the host one is not only performing worse than, for example, the default topology; it is actually performing the worst of all configurations. This is a little bit strange, and I think it also deserves to be investigated, at least a little bit.

As for the perf pipe results, here the performance seems to be driven entirely by the topology, as we can see from how the data points group up. In fact, both with one and with four VMs, the eight-cores topology leaves basically everyone else behind, whatever the pinning strategy is.
Interestingly enough, the default topology is not the worst performing one this time; it is the four-cores-and-two-threads one that actually is. And the trend is confirmed here for the 12 VMs case as well.

Hey, I hope you have not had enough of graphs at this point, because I have another one; the one that I promised at the beginning of the talk I wanted to also talk about. This is the STREAM benchmark, run inside a VM that was tuned to have the same performance as the host, and as you can see from the graph, it does that very well for three out of the four operations of the benchmark. For the copy operation, we apparently have tuned the VM too well, because it goes noticeably faster than the host. This caused us, and especially a former colleague of mine, some serious head scratching back then, but after a while we found out the following. Basically, while performing the copy inside the VM, non-temporal prefetch and store instructions were being used; on the host, on the other hand, that optimization was not kicking in. Our configuration for STREAM was a little bit special, I have to admit: we were using some large arrays in memory, which is nothing special for STREAM, but also quite a lot of threads; and this behavior was observed on a system running a specific glibc version. In that version of the library there is a threshold that is computed in order to decide whether or not to use non-temporal prefetches and stores, and the way the threshold is computed takes into account the size of the L3 cache. Therefore, what was happening was that on bare metal the size of the memory copies was just below the threshold, while in the VM it was above it. The size of the copies was the same; it was the threshold that was smaller in the VM. In fact, the physical CPU had 64 megabytes of L3 cache in total; in the VM, on the other hand, we were using the QEMU default emulated cache, which is reported to the VM as being 16 megabytes. And having so many virtual CPUs with such a small L3 cache resulted in an unrealistically low threshold in the VM, which allowed the VM to use the optimization even where it should not have been used.

So the solution to an issue like this is to make sure that the VM is provided with a meaningful cache topology, also in terms of cache sizes. In this case, it was easy enough to fix our situation by just using the cache passthrough mode. As I said, the specific heuristics that compute the threshold are not even there any longer in more recent versions of glibc; the issue, however, is general, meaning that having only the 16 megabytes emulated cache or full cache passthrough available in QEMU means that we can, for example, end up creating a VM with just a few CPUs and a giant cache, which is not something that really exists in actual hardware. Some other threshold, either in glibc or somewhere else, which is instead designed with what exists in actual hardware in mind, may then cause us other problems. Therefore, I am now wondering whether this might mean that we need to put in place mechanisms for describing a virtual cache hierarchy, with sizes and all its other details, and make it part of the virtual topology, I guess.
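Just to put numbers on that copy story, here is a tiny illustration of mine. This is not glibc's actual formula: the per-thread copy size and the number of CPUs sharing the cache are made-up values, and only the 64 MiB and 16 MiB cache sizes come from the story above; the point is simply that deriving a non-temporal-store threshold from a too-small virtual L3 flips the decision.

```c
/* Back-of-the-envelope illustration of the threshold mechanism described
 * above; NOT glibc's real heuristic. */
#include <stdio.h>

static long long nt_threshold(long long l3_bytes, int cpus_sharing)
{
    /* Made-up heuristic: "my share of the L3". */
    return l3_bytes / cpus_sharing;
}

int main(void)
{
    long long copy_size = 3LL << 20;     /* hypothetical per-thread copy size */
    long long host_l3   = 64LL << 20;    /* 64 MiB, as on the physical CPU */
    long long guest_l3  = 16LL << 20;    /* QEMU's default emulated L3 */
    int cpus = 16;                       /* hypothetical sharers of the cache */

    printf("host : threshold %lld MiB -> %s\n",
           nt_threshold(host_l3, cpus) >> 20,
           copy_size > nt_threshold(host_l3, cpus) ? "non-temporal stores"
                                                   : "regular stores");
    printf("guest: threshold %lld MiB -> %s\n",
           nt_threshold(guest_l3, cpus) >> 20,
           copy_size > nt_threshold(guest_l3, cpus) ? "non-temporal stores"
                                                    : "regular stores");
    return 0;
}
```

With the same copy size, the host side stays below its threshold and the guest side ends up above it, which is exactly the asymmetry that made the copy numbers in the VM look too good.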
And then there are the log entries I mentioned at the beginning: I have recently seen L1TF being reported as vulnerable inside a big VM. Not the crazily scary virtualization variant of L1TF; this is the variant that is trivially mitigated. So how come the VM reports it? The answer lies in checking lscpu inside the VM, in particular the last entries of its output, and again it is somewhat related to topology, I would say. In fact, while the host has 46 bits available for physical addresses, which is fine because the address of the last portion of its RAM is representable in 42 bits even after you take one bit out for implementing the mitigation, QEMU exposes a smaller number of physical address bits to the guest; inside a big VM, once that one bit is taken out, the RAM no longer fits below the limit, so the guest considers the mitigation not effective and reports itself vulnerable. So this would be yet another of those cases where we would benefit from a richer and more flexible way of expressing the virtual topology; to be fair, QEMU does have some of the needed knobs already, and it is rather the support around them that is currently missing, forcing people to resort to workarounds.

If I had to draw some early conclusions from this ongoing activity of mine of investigating virtual topologies, I would say that the physical topology has more effects on the behavior and on the performance of a computer system than one may think, and in virtualization, when it comes to virtual topology, this is actually even more the case. Also, it is worth, at least according to me, enhancing KVM, QEMU and libvirt in such a way that they become able to more accurately describe and represent all the aspects of the physical topology. It is, sure, more code to maintain, and maybe, I don't know, more parameters to QEMU, which definitely is not in need of such a thing; but at the end of the day, software running inside the virtual machine, both at the kernel and at the user level, may depend on the topology, and may really change its behavior quite a lot if we misrepresent it. And while more benchmarks are definitely necessary in order to be conclusive, it is probably the case that virtual topology is not really only relevant when access to the host's resources is exclusive. Instead, when configuring a system, especially if some resource partitioning is being applied, but even if not in a fully exclusive way, we should probably think about whether this allows us to define a virtual topology for the VMs different from the default one, and check whether that lets us maybe reach better performance.

And this concludes my talk. Here in this slide you can find some information about me and my contacts, in case you want to reach out. Thanks again for attending.