What we do in the virtualization space at Red Hat: we started a while ago with Xen, and we shipped Xen as our own virtualization solution in RHEL 5. At some point we realized there was a better solution on the market, and with the acquisition of Qumranet we switched to KVM, and now we do lots of cool stuff in the KVM world. We support it on different architectures, we support it in data center and cloud solutions for many customers, and we truly believe it's the best technology out there; moreover, it's open source.

At the same time we truly believe in the hybrid world around us. Our customers want to run RHEL in all sorts of environments: of course they run RHEL on RHEL, on KVM, and on OpenStack, but they also run RHEL instances on public and private clouds and on all sorts of other virtualization solutions. They run them on AWS, and of course they also want to run them on Microsoft's platforms; Azure is the second biggest cloud provider in the world nowadays, so of course they want to run there. As you might have heard, we had a big partnership announcement at the end of last year, and now we officially support RHEL on Azure. We supported RHEL on Hyper-V before, but now we officially support it on Azure as well.

So what is Microsoft Hyper-V? Hyper-V is Microsoft's virtualization solution, first introduced with Windows Server 2008, and it's the core of their Azure cloud; Azure, as I said, is the second biggest cloud in the world, and it actually runs Hyper-V. Hyper-V is a type 1 hypervisor, which means it runs on bare hardware. It's not like KVM, where you have a kernel module and run guests inside your operating system as processes; with a type 1 hypervisor you run the hypervisor on bare hardware and then run your management operating system just as a guest of this hypervisor. In Hyper-V guests are called partitions, and this first guest is called the root partition; it's similar to Dom0 in Xen.

Hyper-V requires hardware virtualization, but all modern x86 processors support that, and currently it can present two different platforms. The first is the so-called Generation 1 VM: a legacy platform with a BIOS and all sorts of emulated devices, including emulated I/O, so you can actually run an operating system there which doesn't know anything about Hyper-V. It's not going to be super fast, but it runs. Since Windows Server 2012 Hyper-V also supports the so-called Generation 2 VM. A Generation 2 VM is a UEFI system and it has no emulated devices, so to run something there, that something has to be aware of the fact that it is running on top of Hyper-V and has to support the Hyper-V-specific devices.

So how is it done? Hyper-V emulates a standard x86 platform, and at the same time it provides so-called enlightened I/O paths and special paravirtual services. As I said, for input/output you have emulated paths for Generation 1 VMs and you don't have them for Generation 2, so the enlightened I/O paths are mandatory for Generation 2 VMs. Hyper-V also has a set of paravirtual services: a heartbeat, so your host knows that your guest is still running; utility drivers, which I'll say more about later; special services for timekeeping and time synchronization; and guests are able to report crashes to the host.
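Since a Generation 2 guest has to know that it is running on Hyper-V, the very first step is usually detecting the hypervisor. Here is a minimal userspace sketch of that detection (x86 only, built with GCC or Clang); the file name and messages are mine, but the mechanism is the standard one: CPUID leaf 1 advertises a hypervisor in ECX bit 31, and leaf 0x40000000 returns the vendor signature, which is "Microsoft Hv" for Hyper-V:

```c
/* hvdetect.c - minimal sketch: detect Hyper-V via the hypervisor CPUID leaves.
 * Build: gcc -o hvdetect hvdetect.c   (x86/x86_64 only). */
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char sig[13] = { 0 };

    /* Leaf 1, ECX bit 31: "a hypervisor is present". */
    __cpuid(1, eax, ebx, ecx, edx);
    if (!(ecx & (1u << 31))) {
        puts("no hypervisor present");
        return 1;
    }

    /* Leaf 0x40000000 is reserved for hypervisors; EBX:ECX:EDX hold a
     * 12-byte vendor signature when running as a guest. */
    __cpuid(0x40000000, eax, ebx, ecx, edx);
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);

    printf("hypervisor signature: \"%s\"\n", sig);
    if (strcmp(sig, "Microsoft Hv") == 0)
        puts("running on Hyper-V (or an Azure instance)");
    else
        puts("not Hyper-V");
    return 0;
}
```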
So, Hyper-V and Linux. Of course Microsoft was interested in running not only Windows on Hyper-V, so they started developing their drivers for Linux around the time they released Hyper-V in 2008, and in 2009 these drivers were added to staging; at that time the work was mostly done by Novell. In 2011 the drivers left staging, and now they are part of the normal Linux driver set. They are present by default in all current Linux distributions on the market, and they are completely open source and GPL licensed, so there are no legal issues with them. Currently about 25% of all VMs on Azure run Linux; I got this number from Microsoft, so we should trust them, this should be true. So Linux is important for them; it's not a tiny portion of their business.

Who develops the drivers? Of course this effort is mainly driven by Microsoft themselves, because they are interested in Linux working well as a guest on Hyper-V. At the same time, community involvement in the development process is growing: last year only half of all commits came from Microsoft, the other half came from the community, and roughly one third of all commits in the Hyper-V space came from Red Hat. So we are doing something there.

Which drivers do we have in the kernel? The first two are the storage driver and the network driver, and these two are crucially important for performance; they are the performance-critical part, because that's what you actually do with your VMs: you use some storage and some networking and you want it to be fast. The other drivers are not that performance critical, but they also exist. There is a framebuffer device, so you can draw something on your screen. There is a keyboard driver, and the funny thing is that in a Generation 2 VM this driver is mandatory: unless you have it, you won't be able to type anything into your VM, so don't forget to put it into your initramfs if you want to debug something. There is an emulated mouse, there is a ballooning and memory hot-plug driver which is very specific to Hyper-V and which I will describe later, and there are the utility drivers.

About the storage driver: as I said, it's really crucial for performance. It supports all types of storage with a single driver, so you can connect SCSI, IDE, or Fibre Channel devices through this one driver; you don't need different drivers for that. It is SPC-3 compliant since Windows Server 2016, which is not yet released but which we expect to be released later this year; before that it didn't claim the compliance, but it was already implementing SPC-3 features such as block discard. The driver supports multi-queue, because it is a high-performance driver.

The netvsc network driver: again, crucial for performance. It supports multi-queue and scaling on both the transmit and receive paths. On the receive path they call it virtual receive-side scaling, and it means the host decides when it needs to start spreading incoming traffic and starts pushing packets to different queues in your guest. On the transmit path they call it virtual multi-queue, which is also host driven, and it can be static or dynamic. With static multi-queue you just get a number of queues up front and you can use all of them to send your packets; with dynamic multi-queue the host waits until your guest really needs to scale, measures your CPU load and network load, and at some point it sends you a signal that you need to use more queues, and your driver starts scaling.
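All of these guest drivers attach to Hyper-V's paravirtual bus, VMBus, which I describe in more detail later in the talk. Just to make the structure concrete, here is a rough, heavily trimmed sketch of how such a driver registers with the in-kernel VMBus infrastructure; the "example" names, the GUID value and the ring sizes are placeholders (real drivers use the class GUID published by the host), and the exact callback signatures and GUID initializer helper have shifted a little between kernel versions, so treat this as the shape of the API rather than any one release:

```c
// Sketch of a VMBus guest driver skeleton (not a real in-tree driver).
// Device classes on VMBus are identified by GUID; the GUID below is made up.
#include <linux/module.h>
#include <linux/hyperv.h>

#define EXAMPLE_RING_SIZE (16 * PAGE_SIZE)   /* per-direction ring buffer */

static void example_channel_cb(void *context)
{
	/* Called when the host signals that our channel has data to process;
	 * a real driver would drain the inbound ring buffer here. */
}

static int example_probe(struct hv_device *dev,
			 const struct hv_vmbus_device_id *id)
{
	/* Open the channel: one ring buffer for sending, one for receiving. */
	return vmbus_open(dev->channel, EXAMPLE_RING_SIZE, EXAMPLE_RING_SIZE,
			  NULL, 0, example_channel_cb, dev);
}

static void example_remove(struct hv_device *dev)
{
	vmbus_close(dev->channel);
}

static const struct hv_vmbus_device_id example_id_table[] = {
	/* Placeholder class GUID, purely for illustration. */
	{ .guid = GUID_INIT(0x12345678, 0x1234, 0x1234, 0x12, 0x34,
			    0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc) },
	{ },
};

static struct hv_driver example_drv = {
	.name = "example_vmbus_drv",
	.id_table = example_id_table,
	.probe = example_probe,
	.remove = example_remove,
};

static int __init example_init(void)
{
	return vmbus_driver_register(&example_drv);
}

static void __exit example_exit(void)
{
	vmbus_driver_unregister(&example_drv);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
```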
Since it's a network driver written for Windows, we need to decorate each packet with an RNDIS header, which is very similar to what some wireless cards did a while back when they only had Windows support. And for those familiar with the network stack in the Linux kernel: there is no NAPI support in the driver yet, so we still haven't implemented the polling mode, and I hope it's going to be implemented later.

Microsoft encouraged me to show you these performance numbers, so again, data from them, we should trust them; they're on the dark side, they have cookies. Here you can see performance data for the netvsc driver on a single Hyper-V host. Measurements were done on a single-NUMA-node, 8-CPU system, and I think this data is from the transmit path; they were measuring it with a standard throughput test. As you can see, Microsoft claims that for some connection counts below 1000, such as 64 and 256 connections, the Linux driver performs better than Windows on Hyper-V; for other connection counts the situation is slightly different, but the difference is not really big, so we are on par with the Windows drivers on Hyper-V. I forgot to mention that a 40 gigabit per second network adapter was used for the test, and we got about 31 gigabits per second from the guest, which is pretty okay. These numbers are from the PV drivers; no device pass-through was used, so it's not SR-IOV, it's the pure PV path. On Azure, for some reason, we perform slightly worse, but again the difference is not that big; maybe it depends on how the Azure hosts are optimized, and in that particular instance type there were two NUMA nodes and 32 vCPUs total. We did well.

I have already mentioned utility drivers several times, so what are these utility drivers about? We have a number of drivers which are not visible to the user. These are mainly the clocksource and clockevents drivers, because for virtual machines it's better to use paravirtual paths for these devices. We have time synchronization: when our guest starts, it receives the time from its host and can adjust its own time accordingly, and it also receives this synchronization on migration, so when we are migrated to another host we can get the time counter from that host. We have the heartbeat, as I told you. We also have devices specific to Hyper-V which are paired with user-space daemons, and all the source code is in the Linux kernel, including these daemons, in the tools/hv subdirectory.

We have three utility daemons. The first one is KVP, key-value pair, and it's essentially a simple key-value server: from the host side you can store a value in your guest and you can retrieve a value. That wouldn't be very interesting if it weren't also used for network settings: through the KVP daemon you can get the network settings from your guests and you can push network settings to your guests, for example adjust their DNS server.

The second daemon is called VSS, and the name comes from the Volume Shadow Copy Service on Windows. We don't have such a thing in Linux, but we need it for consistent backups: when you back up your guest live, you want the data to be consistent, and to do that you need to freeze your file systems before you start doing the backup and actually take the snapshot. When you run on Windows, your storage can be snapshotted, so Windows does the snapshot for you, and afterwards you thaw the file systems and continue using them. For this purpose we have a special driver and a special daemon in Linux.
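The guest side of that freeze/thaw dance is just the regular Linux filesystem freeze ioctls, which is essentially what the VSS daemon does for each mounted filesystem. A minimal standalone sketch (the program name and messages are mine, error handling is trimmed, and the argument is whatever mount point you want to freeze):

```c
/* freeze_demo.c - freeze and thaw one mounted filesystem, the same primitive
 * the Hyper-V VSS daemon uses to make data consistent for a host snapshot.
 * Usage: ./freeze_demo /mnt/data   (needs root) */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>    /* FIFREEZE, FITHAW */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mount point>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Block new writes and flush everything out: after this returns, the
     * on-disk state is consistent and safe to snapshot from the host. */
    if (ioctl(fd, FIFREEZE, 0) < 0) {
        perror("FIFREEZE");
        close(fd);
        return 1;
    }
    printf("%s frozen; the host would take its snapshot now\n", argv[1]);

    /* ... snapshot happens on the Windows/Hyper-V side ... */

    /* Let writes continue; the guest is only paused for a second or two. */
    if (ioctl(fd, FITHAW, 0) < 0)
        perror("FITHAW");

    close(fd);
    return 0;
}
```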
There is also a file copy daemon, which can be used to copy files into guests without any networking involved; everything happens through the host-guest protocol, which I will describe later.

The memory balloon: as I told you, this ballooning mechanism is specific to Hyper-V. The idea is that on your host you may want to assign more memory to your guests than you actually have, under the assumption that not all of the memory is going to be used at once; guests will scale and adjust their memory needs accordingly. For that purpose we report our memory pressure, meaning how much memory is used in the guest, every second or so; we just send this report to the host, and the Hyper-V host decides whether we need to balloon up or balloon down. Ballooning up means we are giving away some memory: we allocate memory pages which we are not going to use and we report their frame numbers to the host. Once the host receives these frame numbers, it can detach the physical pages behind them and assign them to some other guest, and from that point we cannot use those pages; if we try to access them, we get a general protection fault. At some point we may want our memory back, and there is no direct way to get it: we can only report higher memory pressure to the host and expect the host to return the memory. The host comes back to us with these frames and gives them back, and at that point in Linux we can free the pages and start using them as normal memory again.

The same driver is used for memory hot plug. Before Windows Server 2016 this was only possible with so-called dynamic memory, which was a bit slow; what the real implementation of dynamic memory looks like we don't really know, we've never seen the Hyper-V sources. But since Windows Server 2016 you can basically add memory to your guest at any time and it will just arrive. We have some odd issues with that in Linux: our memory hot-plug granularity on x86 is 128 MB, while in Windows it's 2 MB, so there is special handling for that in our driver.

Timekeeping: as I told you, in virtual machines we would rather use paravirtual paths to get the time. So what do we have? We can still use the TSC, just the RDTSC instruction, but the Hyper-V specification says it is not stable, so you can get huge jumps in the TSC value, it can even jump backwards, so it's not a good time source. There are two additional clock sources. The first is the MSR-based Hyper-V clocksource, which is really simple but slow: there is a special MSR from which you read the time. Why is it slow? Because every time you read from this MSR you actually exit to the hypervisor: the hypervisor traps you, puts a value there for you, and you come back. Good, but slow. The other clock source is called the TSC page, and it's a memory page shared between you as a guest and the hypervisor, and there is a special protocol for getting a time value from this page: you read a sequence number, then you read the values on the page, then you do your calculation, and then you check the sequence number again to see that it hasn't changed since you accessed it the first time, because if it has changed, it means the host was updating the time while you were reading and you need to start over from the beginning. At the same time it's super fast, because there is no exit to the hypervisor; you're just reading from memory.
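That read loop is the classic sequence-counter pattern. Here is a rough sketch of how the guest side looks; the field names are modeled on the kernel's ms_hyperv_tsc_page structure, the helper name read_hv_reference_time is mine, and memory barriers plus the fall-back path to the slow MSR are simplified away:

```c
/* Sketch of the guest-side TSC page read. Field names follow the kernel's
 * struct ms_hyperv_tsc_page; treat this as an illustration, not the driver. */
#include <stdint.h>

struct tsc_page {
    volatile uint32_t tsc_sequence;   /* 0 means "page invalid, use the MSR" */
    uint32_t reserved;
    volatile uint64_t tsc_scale;
    volatile int64_t  tsc_offset;
};

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Returns the reference time, or 0 if the page is not usable and the caller
 * should fall back to the (slow) MSR-based clocksource. */
uint64_t read_hv_reference_time(const struct tsc_page *tp)
{
    uint32_t seq_before, seq_after;
    uint64_t scale, tsc;
    int64_t offset;

    do {
        seq_before = tp->tsc_sequence;
        if (seq_before == 0)
            return 0;                  /* host says: don't use the page */

        scale  = tp->tsc_scale;
        offset = tp->tsc_offset;
        tsc    = rdtsc();

        /* If the sequence changed, the host updated the page mid-read:
         * throw the values away and try again from the beginning. */
        seq_after = tp->tsc_sequence;
    } while (seq_before != seq_after);

    /* time = ((tsc * scale) >> 64) + offset, with a 128-bit intermediate. */
    return (uint64_t)(((unsigned __int128)tsc * scale) >> 64) + offset;
}
```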
We also have a bunch of Hyper-V drivers in development; again, the effort is mostly driven by Microsoft. There are three drivers I am aware of. The first is Hyper-V sockets, which is similar to what we have for VMware, vSockets: it is a way for two applications, one on your host and one in a guest, or in two different guests, to communicate with each other without a real network; they communicate through the host-guest protocol, which in Hyper-V is named VMBus. The second driver is PCI pass-through, designed for the obvious purpose of passing a PCI device through to your guest; there is no PCI pass-through at this time. And the last one is RDMA, for one very specific Mellanox card: they are able to do RDMA in the guest, with a special so-called front-end driver for this Mellanox card sitting in the host, so you can do direct RDMA from your guest. I expect they will upstream these drivers later this year. The code is actually open source already, you can see the sources on GitHub, but they never submitted it to the kernel, so we at Red Hat don't support it for now; we are waiting for them to start upstreaming it into the Linux kernel.

I promised a deep dive when I submitted this talk, so now the deep dive starts. How do all these drivers work? First of all, why would we need these drivers; why can't we use emulated devices for everything? Well, emulated devices emulate hardware protocols, and those hardware protocols were never designed for host-guest communication, so there is no way to make them super fast. All the other virtualization solutions out there have similar PV drivers: KVM has virtio, Xen has its front-end/back-end driver pairs, and there are PV devices for VMware. And some of the devices we have simply don't have hardware counterparts, so there is nothing to emulate; as I said, for the memory ballooning device there is nothing similar in hardware that can balloon your memory. Microsoft calls these drivers "enlightened"; that's the term from their side.

These drivers implement the so-called VMBus protocol, and VMBus is really a set of things you are supposed to implement to interact with your host. It is based on the concept of channels. A channel is a single entity for communication; channels are bound to particular vCPUs, and you can have one or more channels per device. We have a way to transmit data, and ring buffers are used for that, but since we are using ring buffers we also need a way of signaling in both directions: the host should somehow be able to signal the guest that something is going on, and the guest should somehow be able to signal the host that something needs handling on the host side.

So how do we signal something to our host when we are in the guest? The concept is called hypercalls, and it is super simple. You allocate a memory page, you create a virtual mapping for it, and then you write its physical address into a special MSR for the host; after that you treat the page as a function, because there is code in it. When you want to perform a hypercall, you put the hypercall ID and the input and output addresses into registers and you just call the page. The hypervisor traps you there, performs the function you asked for, and returns, and then you can look at your output: usually you put some address in a register, and the hypervisor writes the result there.
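To make the hypercall setup concrete, here is a condensed kernel-context sketch modeled on what the guest does at boot. The MSR numbers and the register convention (control code in RCX, input GPA in RDX, output GPA in R8, status in RAX) come from the Hyper-V specification, the guest OS ID value is just a placeholder, and the three-argument __vmalloc form shown here is the one from older kernels; treat this as a sketch, not loadable code:

```c
/* Sketch: set up the Hyper-V hypercall page and issue a hypercall (x86_64). */
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <asm/msr.h>

#define HV_X64_MSR_GUEST_OS_ID      0x40000000
#define HV_X64_MSR_HYPERCALL        0x40000001
#define HV_X64_MSR_HYPERCALL_ENABLE 0x1

static void *hypercall_pg;

static int hv_hypercall_setup(void)
{
	u64 hc_msr;

	/* The hypervisor ignores hypercall setup until the guest has
	 * identified itself, so write a guest OS ID first (placeholder). */
	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0x1ULL << 48);

	/* Executable page; the hypervisor fills it with the real hypercall
	 * trampoline once we tell it where the page lives. */
	hypercall_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL_EXEC);
	if (!hypercall_pg)
		return -ENOMEM;

	/* Hand the guest physical address of the page to the hypervisor and
	 * set the enable bit. */
	hc_msr = ((u64)vmalloc_to_pfn(hypercall_pg) << PAGE_SHIFT) |
		 HV_X64_MSR_HYPERCALL_ENABLE;
	wrmsrl(HV_X64_MSR_HYPERCALL, hc_msr);
	return 0;
}

/* One hypercall: control code in RCX, input GPA in RDX, output GPA in R8,
 * status comes back in RAX. */
static u64 hv_do_hypercall(u64 control, u64 input_gpa, u64 output_gpa)
{
	register u64 r8 asm("r8") = output_gpa;
	u64 status;

	asm volatile("call *%[pg]"
		     : "=a" (status), "+r" (r8), "+c" (control), "+d" (input_gpa)
		     : [pg] "m" (hypercall_pg)
		     : "cc", "memory", "r9", "r10", "r11");
	return status;
}
```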
So how does the host signal something to us? There are two concepts in Hyper-V, called messages and events. Messages are used for non-performance-critical communication, and they usually mean that something is going on with a channel: we want to open a new channel, we want to close a channel, we want to restart the whole communication and unload everything. It is implemented as a single page per vCPU where the actual data lives: when you receive an interrupt from your host, a real physical interrupt, you are supposed to look at that page, and if there is data there, you have got a message from the hypervisor. But a message is limited to about 240 bytes, and as you can imagine, you cannot build any performance-critical communication on top of messages, because you would be delivering one message per call and you would need an interrupt for each of them, so it would be slow.

Then we have events. The event page is a different page, and when we receive the interrupt from the host, and it is actually the same interrupt which signals messages, we are supposed to check this page, and it indicates which channels have data to process. There is an index there, so it's reasonably fast: we don't scan the whole page bit by bit. We get all the channels which require processing, we go to those channels, and we process the data.

Where is the data? The data is in ring buffers. A ring buffer is a very simple concept: there is no such thing as a circular data structure in memory, so you take some amount of memory and it logically wraps around, you have two positions, a read index and a write index, and the data between them is the unprocessed data. The writer writes, the reader reads, and neither blocks the other. For each Hyper-V channel you have two ring buffers, one for sending data and the other for receiving data, and these rings can be quite different in size: for the network driver we use 128 pages, for the storage driver 256 pages, and for the non-performance-critical devices we use much smaller rings.

We need to somehow signal that there is data, and this is not symmetrical. On the receive side we get the physical interrupt when there is data, we check the event page, figure out which channel it is, go there, and we are supposed to read all the data which is in the channel, because if there is more than one packet there, we won't receive new interrupts for the remainder, so we need to read them all. At the same time, if the whole ring was filled with data, we need to signal the host back: okay, we processed the data, there is space now, you can continue.

On the transmit ring, where we put data for the host, we have two guarantees from the host: first, once the host starts reading from the ring, it will read all the data which is in the ring; second, it will indicate that reading is in progress. With these two guarantees we only need to signal the host occasionally, and since we signal with hypercalls, and hypercalls are expensive, we try to make as few hypercalls as possible: we only signal when our ring goes from the empty state to the non-empty state. In all other cases either we have already signalled the host and it will process the data, or the host is actually reading the data right now, so we don't need to signal.
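To make that last rule concrete, here is a toy producer for such a ring. It is not the VMBus ring-buffer code, just a generic sketch of the "signal only on the empty to non-empty transition" logic; the names (ring_write, signal_host) are hypothetical, a single writer is assumed, and the memory barriers a real implementation needs are left out:

```c
/* Toy guest-to-host ring: illustrates signaling only when the ring goes from
 * empty to non-empty. Names and layout are illustrative, not the VMBus format. */
#include <stdint.h>
#include <string.h>

#define RING_SIZE 4096               /* bytes of payload space */

struct ring {
    volatile uint32_t read_index;    /* advanced by the host (consumer) */
    volatile uint32_t write_index;   /* advanced by the guest (producer) */
    uint8_t data[RING_SIZE];
};

/* Stand-in for the expensive hypercall that pokes the host. */
void signal_host(void);

static uint32_t ring_bytes_used(const struct ring *r)
{
    return (r->write_index - r->read_index + RING_SIZE) % RING_SIZE;
}

/* Returns 0 on success, -1 if there is not enough free space. */
int ring_write(struct ring *r, const void *buf, uint32_t len)
{
    uint32_t used = ring_bytes_used(r);
    int was_empty = (used == 0);

    /* Keep one byte free so full and empty can be told apart. */
    if (len >= RING_SIZE - used)
        return -1;

    /* Copy the payload, possibly wrapping around the end of the buffer. */
    uint32_t pos = r->write_index;
    uint32_t first = RING_SIZE - pos;
    if (first > len)
        first = len;
    memcpy(&r->data[pos], buf, first);
    memcpy(&r->data[0], (const uint8_t *)buf + first, len - first);

    /* Publish the new write index (a real implementation needs a barrier
     * between the copy and this store). */
    r->write_index = (pos + len) % RING_SIZE;

    /* Only poke the host when the ring just became non-empty; otherwise the
     * host either was already signalled or is reading right now. */
    if (was_empty)
        signal_host();
    return 0;
}
```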
That was it, actually; you probably have some questions for me. Yes?

[Audience question about crash reporting.] There are special MSRs which Hyper-V emulates through which the guest can report that it has crashed. There is not much data you can put there, but you can store your instruction pointer and a few of your CPU registers. Why is this useful? Because if you have a huge cluster of Hyper-V hosts and you have crashes, you want to be able to identify quickly whether they are identical crashes or different crashes, and you can do that just by looking at the instruction pointer: if they are all crashing at the same instruction pointer, it's probably the same crash all over the cluster. You can see these events in the Windows event log. The protocol is that you put your instruction pointer and registers into the parameter MSRs, and then you write to the control MSR, which means: report the event to the Windows host (a rough sketch of this sequence follows at the end of this Q&A section). Nothing happens to your guest after that, so you can actually report several crashes; it won't kill your guest or anything like that. Then, looking at the Windows event logs, you will see that your guest crashed with this specific event, 18590, and it tells you where your guest crashed. You can take your Linux kernel image, run addr2line to map the address to a source line, and you will see where your kernel crashed, even if you didn't get a kdump and you have no console output, nothing. Just out of this data you can get something. Sometimes. Sometimes not; sometimes it's just a random instruction pointer and there is nothing like it in your kernel.

[Audience question: does this feature exist, or is it in development?] It is an existing feature in Hyper-V, and it is an existing feature in the Linux kernel since last year, I think the beginning of last year; it was implemented upstream.

[Audience question about controlling the utility daemons.] Actually there is no control for them; you either start them or not. There are two parts in Linux: the first is the kernel driver, the second is the user-space part, and nowadays they communicate through a character device. The network piece does what it is designed to do by calling scripts, so you can actually alter them, for example to set an IP address; it is a script, you can put whatever you want there.

[Audience question about doing this on Azure.] You are not able to access the host, so I'm not sure what's going to happen; there is probably some management engine around it. On plain Hyper-V you'll just get an error: you can use PowerShell, and there is a command along the lines of "assign this IP to this VM", and if the daemon is not running in your Linux guest, you will just get an error or a timeout, the same as if the module weren't there at all. More questions?

[Audience question about VSS.] We don't actually need to do more, because everything happens on the Windows side: if our volume is on Windows, then Windows can do a real VSS snapshot of the whole volume. The only thing we need from Linux is to make the data consistent, so we freeze all the file systems, the snapshot is taken, and then we thaw them; it takes a second or two, so your guest is not blocked for minutes while the backup runs.

[Audience question about performance comparisons.] There are way too many comparisons like that; for different workloads you will get different results, and it's not that one particular solution is simply better.
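As promised above, here is roughly what that crash-notification sequence looks like from kernel context. The MSR numbers are the crash-parameter MSRs from the Hyper-V specification, as used by the upstream panic notifier; the function name hv_report_crash is mine, so treat this as a sketch rather than the in-tree implementation:

```c
/* Sketch: report a guest crash to the Hyper-V host via the crash MSRs.
 * The guest keeps running afterwards; the host just logs an event. */
#include <linux/kernel.h>
#include <asm/msr.h>
#include <asm/ptrace.h>

#define HV_X64_MSR_CRASH_P0   0x40000100
#define HV_X64_MSR_CRASH_P1   0x40000101
#define HV_X64_MSR_CRASH_P2   0x40000102
#define HV_X64_MSR_CRASH_P3   0x40000103
#define HV_X64_MSR_CRASH_P4   0x40000104
#define HV_X64_MSR_CRASH_CTL  0x40000105
#define HV_CRASH_CTL_NOTIFY   BIT_ULL(63)

static void hv_report_crash(struct pt_regs *regs, unsigned long panic_code)
{
	/* Stuff a handful of registers plus the instruction pointer into the
	 * parameter MSRs; that is all the data the host gets. */
	wrmsrl(HV_X64_MSR_CRASH_P0, panic_code);
	wrmsrl(HV_X64_MSR_CRASH_P1, regs->ip);
	wrmsrl(HV_X64_MSR_CRASH_P2, regs->ax);
	wrmsrl(HV_X64_MSR_CRASH_P3, regs->cx);
	wrmsrl(HV_X64_MSR_CRASH_P4, regs->dx);

	/* Writing the notify bit tells the host to log the event; it shows up
	 * in the Windows event log, and the guest itself keeps running. */
	wrmsrl(HV_X64_MSR_CRASH_CTL, HV_CRASH_CTL_NOTIFY);
}
```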
We are just trying to do our best to make all of these workloads run as well as possible; we are not slowing down one workload just to make another look better, even though we could. More questions? Thank you.