lecture in the course Design and Engineering of Computer Systems. In the last couple of weeks we have seen how memory management and the IO subsystem work in operating systems. In this lecture I will briefly touch upon how memory and IO are managed inside virtual machines when you have virtualization, and that will wrap up our discussion of memory and IO in operating systems. So let us get started. To recap, we have seen that there are two ways of doing virtualization: one is containers and the other is VMs. So what are containers? Containers are a way to do lightweight virtualization, where multiple containers share the same underlying OS but still provide different, isolated views to processes. Processes in one container will only see the processes in that container; they will not see the processes in other containers. Similarly, you can isolate the file system and network resources. For example, if somebody opens a socket in one container at a certain port number, another container can also have a socket at the same port number, because the two are isolated from each other. All such mechanisms are provided by containers, and resource limits are also enforced: how much CPU, how much memory, how much bandwidth, all of that can be controlled. Linux has mechanisms like namespaces and cgroups to provide this isolation and these resource limits, and a lot of container frameworks like LXC and Docker (you may have heard these names) use this OS functionality in order to provide the container abstraction.
And we also have a lot of container orchestration frameworks, like Kubernetes for example. These are very popular when you are running a large application, because the application has multiple components, and these components are put in different containers and spread across multiple machines. Orchestration frameworks like Kubernetes make it easier for us to handle these different components that are spread across multiple machines. They provide ways to handle the life cycle (when a component crashes, restart it), to distribute components across machines, and so on. All of this functionality is provided by these container orchestration frameworks, which is why a lot of real, large computer systems are typically built on top of them, so that you do not have to individually manage each component and each container on your own. The next way to do virtualization is using virtual machines. A virtual machine is actually more heavyweight than a container: each VM has its own OS and its own applications, and all of these run on another host OS, or a virtual machine monitor, also called a hypervisor. The VM is your guest, and underneath is your host, which runs on top of your CPU and other hardware. The VMM basically virtualizes the entire hardware for each of these operating systems, so every guest OS thinks it is running on the underlying hardware fully on its own, even though there are multiple such guests. This is a very important building block for cloud computing: in the cloud, when you give multiple users access to the same cloud server, you need the level of isolation that is provided by virtualization. We have also seen the concept of how VMMs work: they work on the concept of trap and emulate.
Just like multiple user space processes cannot access hardware directly (they trap to the OS, and the OS accesses the hardware on their behalf), these multiple guest operating systems also run at a lower privilege level, and whenever they have to access the hardware they trap to the VMM, and the VMM accesses the hardware on their behalf. That is the basic idea of trap-and-emulate VMMs: just like the OS virtualizes the CPU for processes, the VMM virtualizes all the hardware for the guest OS. But this simple concept is not so easy to implement, because existing operating systems and CPUs were not built with virtualization in mind. For example, your guest OS may not run correctly if you run it at a lower privilege level; it may expect to always run at the highest privilege level. In such cases we need some extra techniques beyond the simple trap-and-emulate idea, which is why there are many different VMMs today based on different ideas. Some VMMs use paravirtualization, which means they modify the guest OS source code so that it works correctly at a lower privilege level. Some VMMs use full virtualization: you cannot modify the OS code, but you translate the instructions in the OS binary so that it works correctly at a lower privilege level. And finally you have hardware-assisted virtualization, the most recent idea, where you change the underlying CPU so that no such issues come up, and existing operating systems can run without any changes, without modifying the code or translating the binary. So let us look in a little more detail at hardware-assisted virtualization. We have looked at it briefly and seen how processes run with hardware-assisted virtualization; in this lecture we want to dig a little deeper and understand how memory is managed, how IO happens, and so on. Here is a recap of hardware-assisted virtualization.
Modern CPUs support this; for example, x86 has VMX mode, a special mode in which to run virtual machines. So what is this VMX mode? Normally in x86 you have your regular mode, which is called root mode, with privilege levels (rings) 0, 1, 2, 3; ring 0 is the most privileged level, where the operating system runs, and ring 3 is where user applications run. With VMX, the CPU adds a separate non-root mode, which also has rings 0, 1, 2, 3. In VMX ring 0 you run your guest OS, and in VMX ring 3 you run the guest applications as usual; in root mode you run your host OS and your VMM in ring 0, and other processes in ring 3, as in a regular system. These extra CPU modes have been added to x86 under the name VMX, and other architectures have similar support. These extra modes let you run existing operating systems without any modification. How does this VMX mode work? When the guest OS is running in VMX ring 0, it is less powerful than regular ring 0. The guest OS cannot be given complete control over the hardware; why? Because there are many guests, and you do not trust them. So even though the guest OS is running in ring 0, this ring 0 is actually less powerful, and the VMM can configure the guest to exit if something happens: if certain problematic instructions are run, or if hardware is accessed, or anything of that sort, the guest exits into the regular root mode where the VMM is running, and the VMM can handle the exit and trap-and-emulate on behalf of the guest OS, even though the guest OS is not aware that this is happening. A popular example of this kind of hardware-assisted virtualization is the QEMU/KVM hypervisor in Linux. This hypervisor has two parts; QEMU is the user space part, because a lot of things can be done at the user level and you do not want to do everything inside the kernel.
The QEMU part does whatever can be done at the user space level: it allocates memory, copies in the guest OS code, creates a kind of RAM image for the guest OS, and so on. And when you have to actually switch to VMX mode, QEMU talks to the KVM kernel module. So your hypervisor has two parts: the non-privileged part is done in user space by QEMU, which runs as a user space process, and QEMU talks to KVM, which runs inside the kernel, is privileged, and does the switch into VMX mode. So QEMU has copied the guest OS code into its memory, and then QEMU tells KVM to switch to VMX mode. What happens at this point? The CPU core switches to running the guest OS code that is sitting in QEMU's memory. You are no longer running the regular user space code of QEMU; you are running this guest OS code in ring 0. When KVM switches to VMX mode, it is as if everything at the CPU level is reconfigured: your host OS stops running, all the processes on the host OS stop running, and your guest OS directly starts running in privileged mode, in VMX ring 0. And where is the guest OS code located? In QEMU itself: QEMU has set all of this up, copied the guest OS code, and created the guest OS memory image. When you switch from host mode to guest mode, you still have to save and restore context: all the host OS context has to be saved and the guest OS context has to be restored. This is like a context switch happening at the machine level: your entire host OS state is saved and your guest OS starts running. This is called a machine switch, and when it is done, all the values of the CPU registers and so on are stored in a special data structure called the VMCS, or VM Control Structure.
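The machine switch described above can be sketched as a toy simulation. This is not the real VMCS, which is a hardware-defined, opaque structure manipulated via VMX instructions; all the names and register values below are made up purely to illustrate the save/restore idea.

```python
# Toy model of the "machine switch" on VM entry/exit. The real VMCS is a
# hardware-defined structure; here it is just an object holding the saved
# register state of the host and the guest.

class VMCS:
    def __init__(self):
        self.host_state = {}                              # host context saved on VM entry
        self.guest_state = {"pc": 0x1000, "sp": 0x8000}   # guest's initial context

class CPU:
    def __init__(self):
        self.regs = {"pc": 0x400000, "sp": 0x7fff0000}    # currently running host code
        self.mode = "root"

def vm_entry(cpu, vmcs):
    """Save the host context into the VMCS and load the guest context."""
    vmcs.host_state = dict(cpu.regs)
    cpu.regs = dict(vmcs.guest_state)
    cpu.mode = "vmx"                  # guest now runs in VMX ring 0

def vm_exit(cpu, vmcs):
    """Save the guest context and restore the host exactly where it left off."""
    vmcs.guest_state = dict(cpu.regs)
    cpu.regs = dict(vmcs.host_state)
    cpu.mode = "root"                 # back to the host OS / KVM

cpu, vmcs = CPU(), VMCS()
host_before = dict(cpu.regs)
vm_entry(cpu, vmcs)
cpu.regs["pc"] += 4                   # guest executes an instruction
vm_exit(cpu, vmcs)
assert cpu.regs == host_before        # the host never notices it was paused
```

The key point the sketch captures is the last assertion: because the host context is saved on entry and restored on exit, the host OS resumes exactly where it stopped and is unaware that another OS ran in between.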
To summarize: QEMU allocates memory for the guest and copies the guest OS into it, just like the OS is normally copied into RAM at boot. Then, with the help of the kernel module, it switches the CPU to VMX mode; in VMX mode the host OS stops running and the guest OS directly starts running. And when you exit back, you restore the host OS again, restoring everything that was happening in it, because you saved its context. Another thing to note is that QEMU can have multiple threads. If you want your guest OS code to run on multiple CPUs, how will you do it? QEMU will have multiple threads, and on each thread you go into VMX mode and run the guest OS code. So it is almost as if your guest OS code is running on multiple CPUs in parallel: if your VM has four CPUs, QEMU creates four threads, and each of these four threads runs on a different CPU core of the host, so that the same guest OS is running on four cores in parallel. Let us visualize this a little. Your CPU can run either in root mode or in VMX mode; in root mode you have the unprivileged ring 3 and the privileged ring 0, and similarly in VMX mode. Normally, user processes run in ring 3 of root mode on your host, and QEMU is one such user process; to the host OS it is nothing different, it looks like any other process. What does QEMU do? It allocates a large amount of memory, and into that memory it copies all the guest VM code. As far as the host OS is concerned, this looks like the memory of any other process; the host OS does not see it as anything different. QEMU also creates multiple threads to execute this code in parallel on multiple CPU cores.
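The vCPU-per-thread model can be sketched as below. This is a rough, hypothetical illustration only: `run_vcpu` stands in for the loop in which a real hypervisor thread would repeatedly enter VMX mode and run guest code until the next exit; none of these names are QEMU's actual code.

```python
import threading

# Each vCPU of the guest is backed by one ordinary host thread inside the
# hypervisor's user space process. The host scheduler places these threads on
# different cores, so the guest appears to run on several CPUs in parallel.

NUM_VCPUS = 4
progress = [0] * NUM_VCPUS    # work completed by each simulated vCPU

def run_vcpu(vcpu_id):
    # Placeholder for "enter VMX mode, run guest code, handle the exit, repeat".
    for _ in range(1000):
        progress[vcpu_id] += 1

threads = [threading.Thread(target=run_vcpu, args=(i,)) for i in range(NUM_VCPUS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert progress == [1000] * NUM_VCPUS   # every "vCPU" made progress
```

To the host OS, these are just four threads of one ordinary process, which is exactly the point: the guest's parallelism is built out of plain host threads.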
Now, when QEMU has set all of this up, it tells KVM to switch into VMX mode. Note that this is a privileged operation; you cannot do it in user space, which is why the hypervisor is split into a kernel part and a user space part: whatever is easy you do in user space, and the privileged part you do inside the kernel module. So one particular QEMU thread that will run this guest OS code tells KVM: switch into VMX mode. At this point your CPU stops running the host OS; it stops running this QEMU process (the host OS had scheduled the QEMU process, and now it stops), and you switch to VMX mode and start running the guest OS code. KVM shifts the CPU into VMX mode, whatever context there was (the register values and so on) is saved into the special area of memory called the VMCS, and now your CPU is in VMX ring 0, and the guest OS code that QEMU set up in memory starts running. When the guest OS runs, of course it will boot up, create its own user applications, create a shell, the shell will spawn other processes (we have seen all of this), and the guest keeps going back and forth between its OS and its applications. All of this goes on until something happens, some privileged access that the guest OS is not supposed to do; until then you stay in VMX mode. Note that this ring 0 is not all-powerful like regular ring 0: if the guest does something it should not, the hardware stops it and the guest exits back into KVM. Once again, whatever host context was saved is restored; the guest stops and the host OS starts running. KVM now sees that this guest has exited and asks why. Maybe it needs something; KVM goes back to QEMU, QEMU sees what to do about the guest, handles the event, and then you may go back into the guest again. For example, if the guest needed some IO, you come back out, QEMU does the IO, and then you go back to the guest. In this way you keep switching, and this happens independently on every CPU core: on every core where a QEMU thread is running, it can pause the host OS, switch to the guest OS, run it, and come back after some time. Note that the host OS is not aware of this at all; its context is saved and restored, so the host OS does not even realize it was paused while some other OS ran. All the host OS sees is: I am running this QEMU process, that is all. The next thing that comes up is: how is memory managed inside a VM? Say your guest virtual machine is running some processes, which have their own virtual addresses; we have seen this. The guest thinks it has a lot of RAM, and from this RAM it assigns physical frames to the various guest processes for their code, data, stack, heap, and so on. The guest OS is doing this, but whatever RAM the guest OS thinks it has is not actual physical RAM. When the guest sees a physical address of 0, it is not the actual physical address 0. Why? Because there are multiple such guests running; this is not the only OS on the machine with access to all the RAM. In fact, this guest's memory is part of a process: part of the QEMU user space process. Therefore, what the guest thinks is a certain physical address is not the actual physical address in RAM; there is another layer of indirection mapping from these guest physical addresses to host physical addresses.
So whatever page tables the guest OS has, they keep track of the mapping from guest virtual addresses (the virtual addresses of guest processes) to guest physical addresses, which is all the guest knows about. But where the memory given to a particular guest actually lives in RAM, only the host, the VMM, knows, because no guest has been given full access to the RAM; the guest's "physical memory" is just part of the memory of a regular process, the QEMU process, and so it is mapped to some other set of host physical addresses. Therefore, when you translate addresses, you go from a guest virtual address, using the guest page table, to a guest physical address, and then the VMM translates this guest physical address into a host physical address; only then can you access the actual RAM on your machine. The problem of memory virtualization, of how you manage memory inside a VM, has become more complicated due to this one extra layer of indirection: the guest page table holds the guest virtual to guest physical mapping, and the VMM holds the guest physical to host physical mapping, because the VMM knows where it has put each guest's memory, where what the guest thinks of as its RAM actually sits in host physical memory. Only the host and the VMM know this. So there is one more set of translations like this.
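The two-step translation above can be sketched in a few lines. This is a toy model assuming single-level page tables and a 4 KiB page size; the table contents and addresses are made up for illustration.

```python
# Toy two-level address translation:
#   guest virtual -> guest physical  (guest's own page table)
#   guest physical -> host physical  (VMM's mapping)

PAGE = 4096

guest_page_table = {0x1: 0x5, 0x2: 0x7}    # guest virtual page -> guest physical frame
vmm_table        = {0x5: 0x93, 0x7: 0x41}  # guest physical frame -> host physical frame

def translate(gva):
    """Walk both tables to turn a guest virtual address into a host physical one."""
    vpn, offset = divmod(gva, PAGE)
    gpa_frame = guest_page_table[vpn]      # first level: guest page table
    hpa_frame = vmm_table[gpa_frame]       # second level: VMM's table
    return hpa_frame * PAGE + offset

# Guest virtual page 0x1, offset 0x2c lands in host physical frame 0x93:
assert translate(0x1 * PAGE + 0x2c) == 0x93 * PAGE + 0x2c
```

Every memory access by a guest process conceptually needs both lookups; the next section is about who performs them, the VMM in software or the MMU in hardware.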
So now you have two different page tables; which page table will the MMU use? There are two ways. One thing the VMM can do is create combined mappings: for each guest virtual address, look up the guest page table to get the guest physical address, then look up its own table, and record the final mapping from guest virtual address directly to host physical address. These combined mappings are put in what are called shadow page tables: extra page tables, separate from the guest page table, which the VMM creates and keeps updating as the guest updates its own page table. The shadow page table is what the MMU actually uses, so whenever some virtual address is accessed, this translation turns it into an actual RAM location. That is one idea. The other idea is that, instead of the VMM combining these page tables, you tell the MMU: here are two different page tables, combine them yourself. That is, the MMU hardware is made aware of virtualization. This concept is called extended page tables: the MMU takes pointers to the two separate page tables and walks both of them during address translation. Of course this is more efficient, because the hardware is doing the work, but it also requires hardware support. Next we are going to understand how IO virtualization happens, that is, what exactly happens when a guest does IO. Once again, just as a guest cannot be given full access to RAM, a guest OS cannot be given full access to IO devices either, because multiple VMs are sharing the same server and there are security issues involved.
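The shadow page table idea can be sketched as a precomputation: the VMM folds the two mappings into one table so that the MMU needs only a single lookup per access. Again, this is a toy model with made-up single-level tables and a 4 KiB page size.

```python
# Toy shadow page table: the VMM precomputes the combined
# guest-virtual -> host-physical mapping.

PAGE = 4096

guest_page_table = {0x1: 0x5, 0x2: 0x7}    # guest virtual page -> guest physical frame
vmm_table        = {0x5: 0x93, 0x7: 0x41}  # guest physical frame -> host physical frame

# Built (and rebuilt whenever the guest edits its page table) by the VMM:
shadow = {vpn: vmm_table[gpf] for vpn, gpf in guest_page_table.items()}

def mmu_lookup(gva):
    """With a shadow table, the MMU does an ordinary one-level walk."""
    vpn, offset = divmod(gva, PAGE)
    return shadow[vpn] * PAGE + offset

# Same final answer as walking both tables, but in a single step:
assert mmu_lookup(0x2 * PAGE + 0x10) == 0x41 * PAGE + 0x10
```

The cost of this scheme is visible in the dictionary comprehension: the VMM must intercept every guest page table update to keep `shadow` in sync, which is exactly the overhead that extended page tables eliminate by letting the hardware walk both tables itself.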
So when the guest needs to access IO, it has to go through the VMM. How do we do this? There are many ways. The simplest is what is called IO emulation: whenever the guest OS tries to do any IO operation in VMX ring 0, it traps into the VMM; there is a VM exit. Then the VMM, the QEMU user process for example, emulates the operation: if the guest wants to open a file, QEMU can open a file; if the guest wants to read, QEMU can read the file. You just emulate whatever the guest wants to do inside the VMM. Similarly, every time an interrupt occurs, the VMM looks at it first, and if it is for the guest, it injects the interrupt: when the guest next runs, the VMM tells it, look, you have an interrupt. So every time an IO command is given to the device or an interrupt occurs, you exit to the VMM and the VMM handles it; and when IO data comes from the device, it is first DMAed into the VMM, and then the VMM gives it to the guest. Everything goes through the VMM. This simple idea works reasonably well for slow IO devices, but once you have a very fast IO device, there will be so many VM exits that it may slow down your VM significantly, and you are also moving data twice, DMAing it into the host and then copying it into the guest; this overhead can get too high for high-speed IO devices. So for high-speed IO devices, today you have special device drivers available called virtio device drivers: device drivers that are optimized for virtualization. Inside your guest you do not use the regular device drivers that come with your operating system; you install these new drivers instead. What do virtio device drivers do? Instead of exiting for every IO operation, they batch IO requests. A regular device driver, any time it has to give a command to the disk or whatever device, just issues the command. But virtio drivers know they are not accessing real hardware, so they collect multiple IO requests and exit into the VMM once per batch, avoiding the frequent back-and-forth between VMX mode and root mode. Similarly, when IO data has to be shared between the guest and the host, you set up a shared memory region to which both the VMM and the guest have access, and they read the shared data there. So these are extra changes to device drivers, optimized for virtualization. The other technique: note that even with virtio you are still going through the VMM and the host, just in an optimized way. Another approach is to give each guest VM its own slice of the IO device, if the device supports this, so that the VMs do not interfere with each other. An example: modern network cards come with a feature called SR-IOV, or Single Root IO Virtualization. In simple terms, this means you can make one network card look like multiple different network cards which are isolated from each other, and you can give each virtual network card to a separate VM. Traffic arriving at one virtual card goes only to its VM, so the VMs do not interfere with each other, and at the same time you do not have to hand the data to the host VMM first and then copy it into the VM. What these SR-IOV network cards do is present separate virtual NICs; these slices are configured by the VMM, but once configured, a VM has exclusive access to its slice. For example, packets are DMAed directly into the VM's memory; you no longer DMA into host memory first and then copy into VM memory. But this is not straightforward. Why? Because the guest OS can only provide the addresses of its DMA buffers in the form of guest physical addresses. For the NIC to do a DMA into memory, it needs to know the actual physical address of the DMA buffer; only then can it copy into it. If the guest OS gives it guest physical addresses, which are not the real host physical addresses, the DMA into RAM cannot happen. The solution is that SR-IOV capable systems also have something like an MMU for devices, an IOMMU, which translates these guest physical addresses into actual host physical addresses. That is, in addition to doing DMA, the hardware is also doing the job of an MMU: it takes a pointer to a translation table and translates addresses so that DMA lands correctly in physical memory. With this IOMMU, data can be DMAed directly into the guest OS's DMA buffers without going through the host OS. This is a fairly advanced technology that requires a lot of hardware support. These kinds of ideas are called device passthrough techniques: you bypass the VMM, and the device is assigned directly to the VM. Such techniques are efficient, but they require more hardware support. On the other hand, techniques like virtio are just device driver upgrades: you do not change anything in the hardware, you just install the virtio drivers in your VM and get better performance. So there is a trade-off: are you willing to change the hardware, or will you stick to software-level changes? Based on these parameters, you can decide, for the application you are running inside a VM, which technique to use for IO virtualization. So in this lecture we have done a recap of the containers and VMs that we studied in an earlier lecture, and we have added a few more concepts to what we studied before; specifically, we have
seen how memory and IO are virtualized. So now you have all the techniques that are needed to virtualize the CPU, memory, and IO. Different VMMs use different combinations of these techniques: some may use full virtualization, paravirtualization, or hardware-assisted virtualization; some may use extended page tables or shadow page tables, SR-IOV or virtio. It really depends on the needs of the application and the technology in the VMM; different techniques can be used. As a small exercise, install a VMM, run a few VMs, and try to understand, by reading the documentation of the VMM, what techniques it uses for CPU, memory, and IO virtualization. For example, you may notice some special virtio device drivers: are they running, are you using them? Does your hardware have SR-IOV capability? Are you using hardware-assisted virtualization? Understand these things for whichever VMM you are using, in order to see an application of these concepts in your real life. That is all I have for this lecture. Thank you everyone, and we will see you next week, when we cover a completely new topic: how networking works, how computer networks, specifically the internet, work. That is what we are going to see starting next week. The past three weeks have completed our discussion of operating systems, and we will move on to the next set of topics in the course. Thank you all, and see you next week.
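As a starting point for the exercise above, here is a small sketch for checking whether a Linux guest is using virtio devices. It relies on the standard sysfs location for the virtio bus; on a non-virtualized machine, a guest without virtio drivers, or a non-Linux system, the directory simply will not exist and the function returns an empty list.

```python
import os

# On a Linux guest, registered virtio devices appear as entries under
# /sys/bus/virtio/devices (e.g. the virtio network or block devices).

def list_virtio_devices(sysfs_root="/sys/bus/virtio/devices"):
    if not os.path.isdir(sysfs_root):
        return []                        # no virtio bus on this system
    return sorted(os.listdir(sysfs_root))

devices = list_virtio_devices()
print(devices if devices else "no virtio devices found")
```

You can cross-check the result against your VMM's documentation and configuration, and similarly look up how to query your host for SR-IOV capability and hardware-assisted virtualization support.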