All right, folks, it's probably time to start. So I'm Greg Elkinbard. I work for Mirantis. We are an OpenStack distribution provider and OpenStack services company, and this talk covers a few of our experiences deploying OpenStack in various environments, specifically focusing on the various hypervisor technologies in OpenStack. So does the hypervisor matter in OpenStack? Well, it does, but let's see how.

In this talk I'll cover a brief history, at least from our point of view, of the hypervisors that we've seen, cover the trends that are emerging in the various segments where we deploy OpenStack, and discuss the opportunities and challenges of multi-hypervisor environments.

So let's start at the beginning. Probably not the very beginning, but close enough. In 2011 we primarily did Xen-based deployments with a little bit of KVM. Xen was a very popular choice, mostly due to its selection by Rackspace and Amazon; people just looked around and said, OK, let's do that too. But in 2012 the picture changed, and KVM overtook Xen in the number of deployments we did. A lot of it was political: Citrix's support for an alternative cloud technology, and its waffling around OpenStack, didn't endear it to many people, so they were confused about whether Citrix would or would not go forward with it. And a lot of it was simply that KVM was maturing and evolving a lot faster than the non-commercial version of Xen.

Coming forward to 2013, this is the picture for this year. KVM is still a very large percentage of our OpenStack deployments and requests for OpenStack deployments. VMware has, perhaps surprisingly, emerged as number two in the hypervisor race. Well, perhaps not so surprising: OpenStack is now penetrating the enterprise market, where VMware has a very heavy presence, and people are thinking about how to utilize their existing infrastructure. By the way, folks, these slides have been uploaded, to save you the trouble of taking a picture of every single slide. You can just go to the website, but feel free to take pictures.

Not yet in the deployment picture, but a very popular request item, is containers. There are various container technologies we've been asked about: LXC, Parallels, Docker, and a few others. They're primarily coming in from the web hosting and SaaS segments. Not surprisingly, in those segments you either need to run hundreds or thousands of individual guests, something that regular heavyweight hypervisors cannot really do, or you're SaaS-focused, where the application itself has the multi-tenancy hooks and does not need the infrastructure to provide multi-tenancy.

The only request for Hyper-V we got this year was because of Microsoft licensing issues, and once we explained to people how to work around them, the request went away. Essentially, Microsoft will support Microsoft workloads on commercial hypervisors, such as the commercial version of KVM on RHEL and things like that, and you can use host aggregates to confine the Microsoft deployments to a subset of hosts so you don't have to overpay for licenses. So that's essentially how it works. Hyper-V is still there, and it will probably grow as the OpenStack user base broadens out, but right now it's not that large a market for us.
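Since the licensing question comes up a lot, here is a minimal sketch of the host-aggregate approach just described, using the python-novaclient of that era. The aggregate name, metadata key, host name, and flavor are illustrative assumptions; the only requirement is that the key match between the aggregate metadata and the flavor extra spec, with the AggregateInstanceExtraSpecsFilter enabled in nova-scheduler.

```python
# Sketch: confine Windows instances to a licensed subset of hosts
# using a host aggregate plus a matching flavor extra spec.
# Assumes AggregateInstanceExtraSpecsFilter is enabled in nova-scheduler;
# names like "windows-licensed" and "compute-01" are placeholders.
from novaclient import client

nova = client.Client('2', 'admin', 'secret', 'admin',
                     'http://keystone.example.com:5000/v2.0')

# Create the aggregate, tag it with a licensing key, and add only the
# hosts you actually pay Microsoft licenses for.
agg = nova.aggregates.create('windows-licensed', None)
nova.aggregates.set_metadata(agg, {'windows_license': 'true'})
nova.aggregates.add_host(agg, 'compute-01')

# Create a flavor whose extra spec matches the aggregate metadata, so the
# scheduler only places these instances on the licensed hosts.
flavor = nova.flavors.create('m1.windows', ram=4096, vcpus=2, disk=40)
flavor.set_keys({'aggregate_instance_extra_specs:windows_license': 'true'})
```

You then expose that flavor only to the projects that run Windows, and everything else schedules onto the unlicensed hosts.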
All right, since I've been doing a lot of these, I've noticed a few trends emerging, and I just wanted to document them. So the telco and ISP segment is usually single-hypervisor. They want a relatively simple infrastructure, and KVM is emerging there as the hypervisor technology of choice. Internet-focused companies, mostly internet startups, a lot of the social companies and so on, are also single-hypervisor, and also KVM. A sub-segment of that is the web hosting and SaaS guys, and there we do see some multi-hypervisor deployments: essentially they want to run KVM and Docker, or KVM and LXC. In the web hosting segment, LXC is still the leading container choice. Some of them don't bother with virtualization at all and just use Apache virtual hosting, but a lot of others are running LXC. On the enterprise side it's definitely multi-hypervisor; almost everybody is asking us to deploy multiple hypervisor technologies. They have an existing vCenter/ESXi deployment, and KVM is coming in as an addition to their existing enterprise virtualization infrastructure.

All right, so the opportunities and challenges of the multi-hypervisor use case. We'll talk about the use cases, and the advantages and issues, of a few of the hypervisors. So, use cases. Well, you have an existing virtualization infrastructure, and what you would like to do is extend it rather than replace it completely. That means you may have a VMware or Xen-based deployment, or even a KVM deployment, and you want to add a second hypervisor, but you have to provide a common set of APIs so that your internally developed applications or your portals do not have to be modified to support a multi-hypervisor environment. And, perhaps surprisingly, some of the portal technologies are now writing to the OpenStack APIs rather than to the VMware APIs, so we've been asked to deploy OpenStack just so people can control their VMware environment from their portal.

To some degree it's also to hedge your bets against bugs; hypervisors do have bugs, surprisingly enough. And it's to hedge your bets against vendor pricing: if you have both hypervisors running in your environment and you've developed a strategy for migrating workloads between them, then if one of them becomes a little too expensive, you can go to the vendor and say, listen, it'll be a five-minute operation for us to move the workload, let's negotiate. And to some degree it's to take advantage of additional features. For example, if you deploy using vCenter, you can still enjoy DRS and storage vMotion, a few of the things OpenStack still doesn't support.

All right, so let's cover the hypervisors in a bit more detail. KVM, like I mentioned before, is a very large base for us; it's about 90% of the requests and 95% of active deployments this year. It's a type-2 hypervisor, which means it relies on a distro to provide a lot of its services. It's relatively easy to add new devices to support new features, because of the way it's structured, and it's relatively easy to tune it to get decent performance numbers. So we focus on a few areas; both I and the chief architect at Mirantis are working heavily in the HPC and network function virtualization arenas, where the networking has to be relatively low-latency and fast.
So we've done a bit of research on what technologies exist there, and we'll cover that over the next few slides. Go ahead. You do not get a tool, but as part of our deployment practice we'll provide a set of recommendations for how it should be tuned, and our Fuel deployment tool will do most of the basic tuning out of the box. Fuel can do most of it out of the box, but the list of possible tunings is fairly large, so we'll supply a wider list of what you can do for specific use cases.

Sure. So Fuel doesn't really support application types. What it does is let you tune for specific network performance: if you need high-performance networking with large packets, it can do that; if you need to conserve memory, we'll give you a recommendation for how to do that, and so on. We have not bothered to develop application profiles, because there are probably one or two hundred different applications that people have brought to us, so the universe is a little large. What we have is: OK, here's the HPC profile, here's what you do to get your storage, your compute, and your networking working correctly for the HPC case. Here's the NFV profile: say you want to virtualize network functions, you need to tune extensively for the network; you can afford to spend a bit more CPU and a bit more memory, but your networking has to be dead on. So we'll give you that profile, and so forth. We developed this as part of our practice. Like I said, Fuel will do just the basics, so if you install it and you have a generic multi-purpose use case, you don't need anything else. But if you have specialized use cases, we've done enough of them, we've captured recommendations from the various vendors, and we'll put them together and present them to you. Or, if we do the deployment, we'll just do it for you: we figure out your use case, document it, and tune for it.

All right, so getting back to the slide. The flexibility of KVM allows you to do interesting things with it. One example is the Mellanox eSwitch, where Mellanox decided to replace the virtual switch with a piece of silicon, realizing substantial speedups; you'll see it on the next page. Another is the work Intel has done to marry its DPDK toolkit with the Open vSwitch infrastructure to gain substantial networking performance. There are a few issues with KVM, which we'll cover in the following pages; most of them are related to specific distro issues.

We do use KVM for both HPC and network function virtualization use cases, and it works well in those. There is an extensive set of tunings available, probably about three or four pages' worth; I didn't want to cram them all onto the slides, but the basic ones are here. Set the BIOS to maximum performance: do not try to save energy by letting the machine manage your CPU speed, or you'll wind up with a fair amount of jitter. So yes, enable turbo, but shut off just about everything else. Enable huge pages, and some recommendations require you to wire them down. On RHEL, when you set up the box, tune the server itself for virtualization, or we will do it for you when we do the setup. In libvirt, configure CPU pass-through; you should not use emulated CPU models. And increase the TCP buffers, the processing queues, and so on.
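To make the host-level part of that concrete, here is a minimal sketch of a few of the basic tunings just listed (huge pages, larger TCP buffers, deeper queues), plus the jumbo-frame and offload items that come up in a moment. The specific values and the interface name are illustrative assumptions, not our recommended numbers for any particular workload; the congestion-control item follows right below.

```python
#!/usr/bin/env python
# Sketch: apply a few basic KVM compute-host tunings. Run as root.
# Values and the interface name "eth0" are illustrative placeholders.
import subprocess

SYSCTLS = {
    'vm.nr_hugepages': '2048',                # reserve huge pages for guests
    'net.core.rmem_max': '16777216',          # larger TCP receive buffers
    'net.core.wmem_max': '16777216',          # larger TCP send buffers
    'net.ipv4.tcp_rmem': '4096 87380 16777216',
    'net.ipv4.tcp_wmem': '4096 65536 16777216',
    'net.core.netdev_max_backlog': '30000',   # deeper per-CPU packet queue
}

for key, value in SYSCTLS.items():
    subprocess.check_call(['sysctl', '-w', '{}={}'.format(key, value)])

IFACE = 'eth0'
# Jumbo frames for the data network, plus checksum/segmentation offloads
# so the NIC rather than the CPU does the per-packet work.
subprocess.check_call(['ip', 'link', 'set', 'dev', IFACE, 'mtu', '9000'])
subprocess.check_call(['ethtool', '-K', IFACE,
                       'tso', 'on', 'gso', 'on', 'gro', 'on',
                       'rx', 'on', 'tx', 'on'])
```

To make the settings persistent you would put the same keys in /etc/sysctl.conf and the interface configuration; as noted above, Fuel handles the basic ones during deployment. Jumbo frames also need to be enabled end to end on the switches in between.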
The default TCP congestion control mechanism on most distros should be replaced with H-TCP. And please enable jumbo frames, especially for HPC workloads; they will benefit you a lot.

All right, some performance results for KVM. Without anything special, just with those basic tunings and jumbo frames, we get roughly 7 to 8 gigabits per second. That's your typical HPC workload, where fairly large data sets have to be exchanged. Obviously, with small packets the performance drops down to a few gigabits per second. This is on a 10-gig interface; we do very little one-gig work nowadays, most of our customers run 10 gig. Now, the interesting thing: just dropping in the Mellanox card, nothing else, gets you to about 23 gigabits per second. You'll see that in the VM-to-VM case on one box you're already starting to saturate the host: the CPU on the host, what the kernel can process, how the system can time-slice things. But between two different machines you get 23 gigabits per second, and we didn't even try to tune anything; we just dropped the card in, activated it based on the default Mellanox instructions, and off we went into multi-10-gigabit territory.

Interesting work has also been done by Intel. Intel has taken its DPDK, the Data Plane Development Kit, which essentially gives you lockless, mostly zero-copy I/O, and integrated it with Open vSwitch. They've also done some hooks in the guest, either shared memory or a replacement for traditional socket applications. This is an alternative if you don't want to replace your hardware with, say, a Mellanox or some other on-NIC switch. With very small packets, which would normally get you one to two gigabits per second, you can get up to seven, and once you get to average packet sizes you get pretty much wire speed. So that's great work. Intel is working with Wind River, Ericsson, and a few others; these performance numbers have been shamelessly cribbed from the Ericsson paper. This work is ongoing: the switch is currently at the prototype stage, but Intel and its various partners are working on making it a production reality.

Now, these performance numbers are without DPDK-based guests. With DPDK-based guests you can increase this even further, but then you have to use the shared memory path, and right now the shared memory in Intel's DPDK switch is insecure: the shared memory segments are shared across multiple VMs. So this is mostly for the service provider case where you want to run virtual switches or virtual routers and all of your VMs belong to a single tenant. It's not so much about security segregation; it's about workload isolation between your VMs. In that case you can use this switch.

All right, KVM features and issues. It has one of the widest sets of OpenStack features; you can go read the support matrix, I'm not going to rehash it here. I think the only thing it doesn't support is set administrative password, which I believe only Xen supports at this moment. Some of the issues we have run into: there is a bit of difficulty transferring images from other hypervisors, because you have to redo the drivers, so migrating from VMware or from Hyper-V is somewhat painful. We've developed a process for it; Red Hat has a relatively nice tool chain for migrating from VMware, and we've identified a set of service partners, so if somebody needs to migrate a workload from an alternative hypervisor, we can enable that.
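Before getting back to the driver issue, here is a minimal sketch of the generic conversion step involved in bringing a VMware image across to KVM: convert the exported VMDK to qcow2 with qemu-img and register it in Glance. This is a generic illustration, not the Red Hat tool chain mentioned above, and it assumes the guest already has virtio drivers installed (for Windows guests you would inject them first); file and image names are placeholders.

```python
#!/usr/bin/env python
# Sketch: convert a VMware disk image to qcow2 and register it in Glance.
# Assumes qemu-img and the glance client are installed and OpenStack
# credentials are in the environment; file names are placeholders.
import subprocess

SRC = 'exported-guest.vmdk'
DST = 'exported-guest.qcow2'

# Convert the disk format; the guest still needs virtio drivers installed
# inside the image to boot cleanly under KVM.
subprocess.check_call(['qemu-img', 'convert',
                       '-f', 'vmdk', '-O', 'qcow2', SRC, DST])

# Upload the converted image to Glance.
subprocess.check_call(['glance', 'image-create',
                       '--name', 'migrated-guest',
                       '--disk-format', 'qcow2',
                       '--container-format', 'bare',
                       '--file', DST])
```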
Now, back to the driver issue. This is mostly distro-related: it works pretty well on Ubuntu, but on the older RHEL and CentOS distributions the version of QEMU is a little old. It doesn't support the SCSI drivers, only the virtio drivers, which means you have to replace your existing drivers with the virtio package, so you need to update your images when moving between hypervisors. We've worked around that issue for people who need it: in the Mirantis distro we ship an upstream version of QEMU that bypasses the problem.

OK, so VMware ESXi. ESXi is actually gaining steam as a relatively popular platform for deploying OpenStack. Both the vCenter-based APIs and the plain single-hypervisor ESXi APIs are supported, but all of our requests to date have been vCenter. We've done a test deployment on plain ESXi, but all of the commercial environments simply want to take their existing vCenter clusters, wire them into OpenStack, combine them with KVM and maybe something else, run it all through a portal, and off they go.

Right, so ESXi is a type-1 hypervisor; it does not need a distro to operate. But that means VMware controls all the bits and pieces, and the code has to be signed, so nobody can just drop in a random vSwitch that VMware doesn't like, or anything else. You have to go talk to VMware to extend ESXi. Like I mentioned before, VMware supports both the ESXi and the vCenter APIs in their driver, and they're actively maintaining it. A brief bit of funny history: Citrix was actually the first to put VMware support into OpenStack, but VMware realized this was a good thing, took it over, and, as you'll see on the following pages, is very actively driving it forward.

All right, OpenStack compatibility. It has fairly decent compatibility, missing a few light things like pause, unpause, and resize, but otherwise it's pretty good. Our deployments up to this point have been on Grizzly, although future ones will be done on Havana. Here are the issues we've identified. If you want to deploy with Nova Network, there's no security group integration, which means you won't be able to use your security groups. That's annoying, and most commercial deployments want them, so most of them end up buying Nicira; it's one way for VMware to co-sell their products, and a reasonable choice for a commercial company. So if you plan to deploy a mixed VMware and KVM environment, you need to plan on buying Nicira licenses: you'll buy Nicira NVP or the NSX product and deploy it alongside your OpenStack infrastructure. While Nicira NVP is not yet part of the automated Fuel deployment, we can deploy it relatively easily afterwards: we deploy everything else using Fuel with the OVS plugin, then switch it out for Nicira NVP before handing the environment to the end user.

Glance integration is relatively inefficient. It does not integrate with VMware's template mechanism, which means you still have to copy your images in and out of Glance. It will get more efficient: VMware has committed to fix this for Icehouse, and there's an outstanding blueprint for integrating the two.
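On the Glance point: the vCenter driver boots from VMDK images stored in Glance, so here is a minimal sketch of registering one, with the vmware_disktype and vmware_adaptertype properties that, as I understand it, the driver uses to interpret the disk. File and image names are placeholders.

```python
#!/usr/bin/env python
# Sketch: register a VMDK image in Glance for use with the vCenter driver.
# The property values depend on how the disk was exported; names here
# are placeholders.
import subprocess

subprocess.check_call(['glance', 'image-create',
                       '--name', 'centos-6-vmdk',
                       '--disk-format', 'vmdk',
                       '--container-format', 'bare',
                       '--property', 'vmware_disktype=sparse',
                       '--property', 'vmware_adaptertype=lsiLogic',
                       '--file', 'centos-6.vmdk'])
```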
Also, in Grizzly only a single datastore was supported, just the first one, so if you had multiple datastores defined, you could not use the others. With Cinder, only iSCSI-type volumes were supported, so you couldn't do VMDK-based datastores and things like that. And only linked clones were supported, not full clones.

All right, so like I said before, VMware is working very actively on its driver in Havana and has extended it with lots and lots of features; I've captured just a few of them, since I didn't want to bore you with pages of it. Both linked and full clones are now supported, and you can specify which one you want. Multiple vCenter clusters can now be managed by a single driver: within a single vCenter you can set up multiple pools, and each pool can be independently managed by the OpenStack driver. It now understands that vCenter is not a single hypervisor but a collection of resources. That's a very important fix that VMware has put in; it makes the driver much, much more useful.

There's now support for config drives, so if you don't have your metadata service working, you can push an ISO up there with the instance properties. There's finally Cinder support for VMDK-based volumes, so you can do file-based storage. The interesting thing is that you can now take, say, some SAN-based storage that is not yet supported by OpenStack, because OpenStack is mostly iSCSI and the FC support is a little laggy, wire it into your VMware environment, use it for your VMDK datastore, and then create Cinder volumes on that VMDK store using whichever favorite SAN vendor of yours hasn't bothered to put drivers into OpenStack yet. VMware's commercial storage support is a bit more extensive than Cinder's is right now. You can also use the vShield Edge device with NVP, so you have support for firewall as a service and load balancer as a service using VMware products. That's kind of nice; not that many people have gotten the LBaaS drivers working except F5, so it's good to see these guys jump in and get their stuff to work.

All right, networking. I like networking, that's my focus, so I put in a slide about it. Obviously the default choice is the Nicira NVP and NSX products; NSX is just the extension that supports some VMware-specific stuff, so you can use either NSX or NVP. And there's the Cisco Nexus 1000V coming: the multi-hypervisor support is going to be released sometime in the future, and it's already in public beta, so you can ask Cisco and they'll enable it for you. The 1000V for the ESXi platform has been available for quite a long time, so it's an alternative if you have an existing Cisco infrastructure, letting you extend your virtual switching all the way to the server edge.

OK, accelerated options. Obviously, since VMware controls all the bits and pieces, you can't just drop in a Mellanox driver. But what you can do is use NVP with STT. STT uses TCP-based encapsulation, so it can use the LSO engine in most server NIC drivers to get wire speed; you can get 10-gig wire speed relatively easily using Nicira with STT-based encapsulation. Cisco has also developed and published a white paper on how you can use Cisco hardware to accelerate your networking. The key to it is Cisco's VN-Tag protocol, which is supported by a number of players.
In this case, what you do is use the combination of VN-Tag with SR-IOV: you present virtual NICs to your VM guests and you let your physical switches manage all of the switching. That means all of the packets are hairpinned through the switch, but that's actually not that bad a deal, because most of the folks we've seen run anti-affinity; very few VMs talk to fellow VMs on the same host. We've had a few people who wanted to optimize for extremely low-latency traffic between co-located VMs, but 90% of the use cases out there are anti-affinity: they want to run on different hosts for application resiliency and other reasons. So the amount of traffic that actually has to hairpin is relatively minimal. The combination with Cisco switches, Cisco calls this technology VM-FEX, the VM fabric extender, essentially lets you replace your virtual switch with Cisco gear and get 10-gig wire speed.

All right, containers, the last hypervisor technology we'll talk about here. Containers are a growing topic for us; they show up in a lot of the requests, but honestly we haven't done any container deployments yet, for various reasons. Number one, the technology is a little iffy, and for us to finish it up would carry a relatively large price tag, and most people who run containers aren't willing to pay that to get the technology to the level they need. So where do you use containers? You use them where you have hundreds or thousands of guests that need to run as very lightweight things, and where all of the apps belong to a single tenant, so you don't need secure segregation between your applications. Although there is work in progress to make containers more secure with SELinux and AppArmor-style tools, so they are going to become more and more secure in the future.

The space itself is a little fragmented. We've had requests for LXC, which is probably the most popular container technology out there. We've had requests for Parallels, which unfortunately lacks an OpenStack driver at this moment. And we've had requests for Docker, the new application container distribution that is loosely based on LXC. So there's limited OpenStack support, but a lot of interest, and Mirantis is going to invest in the space a bit; for Icehouse you'll see some development in this area. The Docker guys also got a bunch of funding and are actively driving their driver to accelerate the acceptance of containers in the OpenStack community.

I'm going to cover LXC rather than Docker since, like I said, LXC is the most popular request we get right now. On the VM side, only launch, reboot, and terminate are supported; no other operations. Networking: it's possible to get basic VLANs working, and we have the instructions necessary to get Neutron with OVS working, but it's a little hard, it's not supported out of the box. And officially there is no Cinder support, although the xCloud guys have gotten the Cinder driver to work, so the recipe is available; if you're willing to hack around, you can get volume support with LXC as well.
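If you do mix KVM and LXC hosts in one cloud, as described earlier, one hedged sketch of how you might steer guests to the right hosts is the hypervisor_type image property together with the scheduler's ImagePropertiesFilter. This assumes the filter is enabled in nova-scheduler and that your LXC compute nodes report 'lxc' as their hypervisor type; image names below are placeholders.

```python
#!/usr/bin/env python
# Sketch: tag images so the scheduler routes them to matching hypervisors
# in a mixed KVM/LXC cloud. Assumes ImagePropertiesFilter is enabled and
# hosts report the hypervisor types used below; names are placeholders.
import subprocess

# Container-style rootfs image: only schedulable onto LXC compute nodes.
subprocess.check_call(['glance', 'image-update', 'web-container-rootfs',
                       '--property', 'hypervisor_type=lxc'])

# Full VM image: only schedulable onto KVM (libvirt/QEMU) compute nodes.
subprocess.check_call(['glance', 'image-update', 'ubuntu-12.04',
                       '--property', 'hypervisor_type=qemu'])
```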
All right, so that's about it for the presentation. Questions? Well, when you run Linux containers, the container technology runs on a single shared kernel, so if you hack the kernel, you've now penetrated your fellow guests and you can start looking at their memory. And that's what people usually do, because it's a lot easier: there are thousands of Linux kernel bugs that allow you to exploit and hack the kernel, and only maybe 50 or 60 hypervisor-based bugs, so it's substantially harder to hack the hypervisors. Although my hacker friends tell me they can break any hypervisor in five minutes, and I've bought them enough beers that I actually trust them on that, but I haven't followed up on it. Now, I do know that in the last few years there have been roughly 45 to 50 penetration bugs recorded against KVM and QEMU, a bit fewer against Xen, and about half that number against ESXi. ESXi is not open source, so people find it a little harder to hack. Most of the KVM exploits were based around QEMU and shared I/O processing. About 70% of those exploits were denial-of-service attacks, let's crash the machine, and 30% of them were, oh, I'm actually able to start messing around with the memory pages, remap somebody else's memory into mine, and start viewing data. Those are dangerous. The vendors are relatively proactive about closing them, but you never know if there's an old unpatched version running around somewhere. So hypervisor exploits do exist.

Now, on KVM, what we do is, if people ask us for security, we routinely deploy SELinux to separate the QEMU processes from each other, to make sure that if the process for one guest gets broken, you at least can't start messing around with the general Linux distribution. It's possible to apply the same technology, SELinux and AppArmor, to containers, but it's a lot harder to contain the damage: the Linux kernel is relatively easily hacked, it doesn't require much intelligence, script kiddies can go hack it, especially since most distros use an older version of the kernel and some people don't religiously apply their patches and things like that. So it's just a lot easier to penetrate containers. There's also less processor and memory isolation, so there is less isolation between the guests in the workload as well.

Do we see any issues with Xen? We simply don't see Xen anymore. Sorry, yes, we got one request for it this year. One, just one. It's a great hypervisor, and Citrix is investing in the technology; it's just that in the markets we're going into, it's a duopoly between Red Hat KVM and ESXi-based VMware hypervisors, and Xen just isn't in the picture. A lot of the places where it's used are the desktop virtualization technologies, which are still relatively immature on KVM; Citrix desktop virtualization and VMware desktop virtualization are very, very superior, and that's the only place where we see Xen nowadays. I'm sure the Citrix guys can give you references with Xen. Actually, Xen has a lot of support; it's a tier-one hypervisor, which means that absolutely everything in OpenStack works. But Citrix would like you to buy the commercial Xen rather than the open source one. We actually built our first pre-OpenStack cloud in 2010, our own IaaS, around Xen and Open vSwitch. But that was in 2010, and this is 2013.

Go ahead. Yes, you do. We, by default, use in-kernel I/O acceleration and virtio to improve performance, so those seven-to-eight-gigabit non-accelerated figures were using virtio. Yep, that's OK. So the deal there is that this was a relatively low-powered host, and the CPUs and the kernel locking on both of the machines were beginning to saturate a little bit. Yes, it's 23.
To get to 23, your VM is going to become CPU-bound processing that much traffic. Remember that this is a user-level application in a VM talking to another user-level application in another VM, so you need a lot of CPU to push 23 gigabits' worth of packets. In that use case we gave the VM all eight cores of the machine; in the other use case we had to split the cores between the two VMs, so performance went down. With Mellanox we didn't bother optimizing anything or running larger machines; this is just what we got while preparing for the summit. Mellanox actually has a use case showing significantly better performance, both locally and remotely, and they have a large number of optimizations written up in their manual that we did not apply here. We just threw a quick test together: we had this Mellanox card and a couple of servers and said, OK, let's do some performance testing.

My apologies, there's a little background noise, if you could speak up. So, yes, correct. Our routine recommendation is to enable all possible offloads that your driver supports. Most of this is done with checksum offload and large segment offload. Like I said before, LSO actually accelerates STT, and you can use the 10-gig TOEs to get the maximum performance possible. OK, so you ran into some weird edge-case bug in your driver. Interesting. We haven't had any problems with the LRO, sorry, LSO, right? Oh, large what? Oh, LRO, no, we use LSO. We activate LSO, large segment offload, not LRO, my apologies. Right, it sounds like there is just a bug in the Intel driver for that particular chipset. Like I said before, we only really enable the LSO engine, not the TOEs or a full TOE; if you have a full Chimney-compliant TOE and it's activated, that should actually work. We haven't had any particular issues with either the Broadcoms or the Intels; I think both of them work reasonably well. Obviously the Intel DPDK only comes with drivers for the Intel parts, so that may limit your choices if you want to use the Intel technology. Now, it's easy enough to add drivers to DPDK; we actually added Broadcom drivers to the DPDK toolkit for something else we're doing. But yes, you would have to do it; it doesn't come out of the box.

OK, so with VMware itself? What I would do is set up boot from volume. Cinder does have the ability to specify a storage class; it's a simple tag, not hierarchical, but you should be able to request which volume type your boot volume should be located on. So when you create the volume, you specify the appropriate volume type; then you'll be able to boot off that volume and get the advantage of your fast storage. Yeah, yes? No, that does not introduce latency. Cinder is just a control path; Cinder is not involved in the data path. You're simply using iSCSI to get your packets in and out, and Cinder just sets up the volume. In the KVM implementation, the volume is actually mounted into the kernel using your iSCSI driver and exposed to the hypervisor as a block device; I'm sure the ESXi implementation works similarly. So Cinder is not a data path thing, it's only control path. Now, obviously, booting from volume across a network link may have some latency associated with it, just from booting over the network.
But most of the VMware environments are built around blades and want you to use external storage anyway, so you've already paid the penalty for going to external storage; using Cinder adds no penalty in that environment. Go ahead, please. OK, well, most of the time when you boot from volume, you're not going to be booting from some local disk; you're going to be booting off a SAN, or off a dedicated storage cluster built on commodity hardware. Now, I know Nutanix and a few other companies are going hyperconverged, and we have actually done environments where each node was a Cinder volume host exporting its local storage as Cinder volumes, and environments where each node was a Ceph host and also ran compute resources. But those were one-offs; they're not very popular. Most of the VMware deployments, and most of the others that boot from volume, will be going to NetApp or EMC or some other large commercial frame.
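To close that answer out, here is a minimal sketch of the boot-from-volume flow with a volume type, using the python-cinderclient and python-novaclient of that era. The 'fast' volume type, image, and flavor names are illustrative assumptions; the volume type would have been mapped to your fast backend by the operator in cinder.conf.

```python
# Sketch: create a bootable volume on a "fast" volume type and boot from it.
# Volume type, image, and flavor names are placeholders; the 'fast' type is
# assumed to be wired to the fast storage backend by the operator.
from cinderclient import client as cinder_client
from novaclient import client as nova_client

cinder = cinder_client.Client('1', 'admin', 'secret', 'demo',
                              'http://keystone.example.com:5000/v2.0')
nova = nova_client.Client('2', 'admin', 'secret', 'demo',
                          'http://keystone.example.com:5000/v2.0')

image = nova.images.find(name='ubuntu-12.04')

# Create a bootable volume from the image, placed on the fast backend.
vol = cinder.volumes.create(size=20, display_name='boot-vol',
                            volume_type='fast', imageRef=image.id)
# (In practice you would wait for the volume to reach 'available' here.)

# Boot the instance from that volume instead of an ephemeral disk.
server = nova.servers.create(
    name='db01',
    image=None,
    flavor=nova.flavors.find(name='m1.medium'),
    block_device_mapping={'vda': '{}:::0'.format(vol.id)},
)
```

Cinder only orchestrates the attachment; as discussed above, the data path is straight iSCSI (or whatever the backend speaks) between the hypervisor and the storage.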