I've been taking a nap every day, not today. OK, today's talk is the life and times of an OpenStack virtual machine instance, quite a long title. So just to introduce myself first, I'm Mark McLoughlin. I'm from Red Hat. I'm Irish, in case you don't know from my accent. I've been working on open source for maybe 15 years now. I've been at Red Hat for 12 years. I've been working in the general area of virtualization and cloud for nine years now, which is pretty crazy. And I've been working on OpenStack for the past four years. So I've worked on everything from the Nova project, I helped create the Oslo project, I was on the technical committee, and I'm currently on the OpenStack Foundation board. So I've been around this space for quite a while.

And so the point of today's talk, I think, was to give a kind of high-level overview of how Nova works, of how Nova creates a virtual machine, but to not just do your typical high-level overview; to also pick some deep-dive, interesting areas we can really dig into and see how it works. And I guess, from my perspective, what I'm trying to share here is that I've learned a ton of really deep, interesting stuff that OpenStack touches on. Give it a second, I think we're going to have to get this sorted. Thank you. OK? Cool, thanks.

OK, so OpenStack touches on a whole ton of really, really interesting technologies, and I've had the chance to learn about a bunch of different stuff throughout the years. So I kind of feel when people are looking at OpenStack and learning about OpenStack, they get this high-level overview, but never really get to go down deep into some of the areas. I think that's missing an opportunity, right? There's lots of interesting stuff to learn. So I'm going to walk through OpenStack, I'm going to walk through how Nova works, and pick some areas that I find interesting and maybe highlight for you that there's plenty of other things you can learn when you're working on OpenStack.

OK, so to start with... what's going on here? I am going to be plagued by demons, aren't I? OK, so to start with, what's OpenStack all about? I always start with the mission statement, because I actually think, unusually, this is a really good mission statement. It's about ubiquity. It's about open source. It's about a cloud platform beyond just simple infrastructure. It's about public and private clouds, large-scale and small-scale, simple to implement. It really captures everything that OpenStack is all about. But I think maybe the way we should always start with OpenStack is the abstractions we're trying to provide to users of OpenStack clouds, right? Am I plagued with more? See, this is because I'm not using VGA. Can I switch to VGA? Sorry about this. Do we have VGA? OK, cool. Maybe next time I could give a talk on display technologies and VGA. I kind of feel like Kyle yesterday, with a beautifully prepared demo that went horribly wrong.

OK, so we talked about the OpenStack mission, and I guess what I'm saying is the next thing I think about when I think about OpenStack is users of OpenStack clouds: the abstractions they're presented with, and what those abstractions do, because that's ultimately what we're building here. So imagine you've got an OpenStack cloud, an administrator creates an account and gives that account to a user. What do they now have access to? They have access to all these abstractions, right? They can create storage volumes.
They can create networks. They can list images. They can manage storage objects, and they can create virtual machine instances. And so imagine a typical simple use case for OpenStack: you've got an IRC bot, you want a virtual machine, and you want to run your IRC bot there. This is what we're allowing users to do. And as most of you probably know, we implement those abstractions with the following services, all pretty straightforward.

The marketing view, I guess, of OpenStack is we've got a bunch of technologies and they all layer on top of each other nicely. OpenStack is an abstraction over technologies like Ceph and KVM and Open vSwitch. All of those run on a Linux platform, and all of those run on these lovely companies' hardware platforms. That doesn't do much for me. In reality... this is bizarre, OK, we're going to have to do lots of switching here. In reality, what's going on is you've got a big pile of noisy tin sitting there in a data center somewhere. This is what we're actually presenting to a user: we're giving the user the opportunity to run their IRC bot on some little virtual machine running in there. So the abstractions we're creating allow us to take a bunch of physical resources, this big pile of tin, carve up those physical resources, and make them available to users.

So that was a photo of three racks of gear, and I thought it might be interesting just to start with, in a very practical sense, what's going on in those three racks. So what I did was, I don't get out much, right? I'm a developer, a technologist, whatever it is, so I don't actually get to touch this kind of real gear very often. So I went to Red Hat's reference architecture with Dell, and I looked through the reference architecture that we currently recommend for running OpenStack on Red Hat and Dell gear. And so if you had three racks, you'd have three 42U racks. You would have a mixture of Dell R630 servers and 2U Dell R730 servers. You'd have three controller nodes, and in total you'd end up with something like 40 compute nodes and 12 storage nodes. You'd use the R630s for your controller nodes and your compute nodes, and you'd use the R730s for your storage nodes. The reason you use the R730s is because they're nice beefy 2U boxes; you can fill them up with a whole bunch of disks, and that makes them really good storage nodes.

And you end up with 11 switches in this configuration, which seems a bit crazy. But the reason for that is, let me see if I can even add this up correctly: you end up with a management switch per rack, that's three, and you end up with two top of rack switches per rack, that's an additional six, which gives you nine. And then you also end up with two aggregation switches, and these are the switches that aggregate your top of rack switches and feed you out into a core network.

So in an even more practical sense, just to dig into it a bit deeper, you've got three 42U racks. Three 42U racks is 1.8 meters wide, 1.2 meters deep, and 2 meters tall, so it's a pretty small space. And in that, you fit all of this gear, which adds up to about half a million dollars, as best I could tell. In there, you've got 1,600 cores. You've got 5.4 terabytes of RAM. You could run something like 2,000 4-gigabyte virtual machines. You'd have 16 terabytes of storage.
And just as one other little data point, I decided to figure out how much power this would draw. And so it turns out the servers themselves would probably draw somewhere in the order of 25 kilowatts. But you also have to take into account the rest of the infrastructure in your data center, right? So researching this a bit more, I found out that data centers have a metric called power usage effectiveness, and this is the ratio of the power spent powering the entire facility to the power spent powering the actual servers. That's typically about 2.0, so that means you're drawing about 50 kilowatts of power to keep these three racks of gear running. And I did another little bit of calculation, and it turns out 50 kilowatts is enough to boil 500 liters of water every hour. And that's probably going to work out as something like 35,000 US dollars per year. So I think it's just really interesting to put a really practical view on what we're talking about here. It's a bunch of APIs sitting in front of three racks of gear, half a million dollars worth of stuff, and you can run something like 2,000 VMs in there. It certainly opened my eyes to what's going on.

So if you want to learn more about some of the stuff going on there, the OpenStack Architecture Design Guide is quite good. It goes into a bunch of different technical considerations you should think about when you're designing an OpenStack cloud. And the Dell and Red Hat reference architecture I talked about is available there. And actually, what I'm going to do is post these slides to Twitter later, so you'll actually be able to click those links and get that information.

OK, so I pitched this talk as talking about the lifecycle of an OpenStack VM. We all probably know this already, but how do we go about creating an OpenStack VM? At the highest level, you could look at the Horizon interface for this. There's a bunch of stuff in here, but basically all you're doing is choosing a name for your VM, choosing a flavor, which is the size of virtual machine you want, and choosing the image you want to run, whether that's a Fedora image or a RHEL image or an Ubuntu image or whatever. And that's basically it. On the command line, it's similar: you've got an openstack server create command, you choose a flavor, you choose an image, and you give your virtual machine a name. And behind all of that, then, you've got the REST API. The REST API is posting a little JSON blob to a URL. It's got the virtual machine name, it's got a reference to the image, it's got a reference to the flavor, and it's got an authentication token. You could spend a bunch of time just talking about how that authentication token was obtained. But that's how a virtual machine is created.
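To make that concrete, here's a rough sketch of what that create-server call looks like if you drive the REST API directly from Python. The endpoint, the token and the image and flavor references below are placeholders, not values from the talk; in a real cloud you'd get the token from Keystone and look up the image and flavor IDs first.

```python
import json
import requests

# Placeholder endpoint and token; a real client would authenticate against
# Keystone and discover the Nova endpoint from the service catalog.
nova_url = "http://controller:8774/v2.1/servers"
token = "gAAAAAB..."

body = {
    "server": {
        "name": "irc-bot",
        "imageRef": "70a599e0-31e7-49b7-b260-868f441e862b",  # image UUID
        "flavorRef": "1",                                     # flavor ID
    }
}

resp = requests.post(
    nova_url,
    headers={"X-Auth-Token": token, "Content-Type": "application/json"},
    data=json.dumps(body),
)
print(resp.status_code, resp.json()["server"]["id"])
```

Nova accepts the request, returns 202 Accepted, and the real work happens asynchronously, which is what the rest of this section walks through.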
But what happens when that virtual machine request comes into our rack of gear? So you're sitting out there on the internet somewhere, some magic happens that transports your request over the internet, and it comes into your rack of gear, into the top of rack switch. And then some other magic happens, and now your Python controller process handles that request and creates your virtual machine. Digging in a bit further, it's not as simple as that. Actually, your request comes in and it's addressed to a virtual IP on one of your controller nodes. That virtual IP actually maps to HAProxy; this is the Red Hat reference architecture I'm talking about here. The virtual IP maps to a HAProxy instance, and the HAProxy instance is load balancing across those Python apps which are handling your requests. So the request might come into controller one, and it's actually sent to the Python app running on controller three.

And it's not even as simple as that, right? So HAProxy, the kernel on controller three, is handling the request. But you don't actually just have one Python app running for handling these requests. The Nova API service, as an example, will typically be run so that you've got a worker process per core on your machine. So I've actually logged into a controller node with 96 cores, and there are 96 Nova API processes running. So what's really interesting here is, I know a fair bit about what's going on here, but I actually don't know how this works. Because in this case, you have 96 Nova API processes all listening on the same socket. It's pretty interesting how that works. How does the kernel decide which of those worker processes to give the request to? And I did a little bit of research, but not enough research to actually find an answer. It either wakes up all of your worker processes and they all rush to accept the request, or it's clever enough to wake up just one of the worker processes. And so that in itself is probably a weekend project to figure out what's going on there. And it could actually result in a really interesting performance improvement to make to OpenStack if we're not doing the right thing.

But it's not even as simple as that. The kernel on controller three receives the request. It wakes up, we assume, one of the worker processes. The worker process is sitting there in what we call a main loop, calling the poll system call, so it's basically blocked, waiting for any activity on any of the sockets it cares about. Your request comes in, poll returns, and now our asynchronous IO framework, called eventlet, takes that request. It knows which eventlet coroutine is waiting for a request on this socket. It dispatches that coroutine in what's called a green thread, which then basically runs something we call a WSGI pipeline. That's the Web Server Gateway Interface; it's basically a standard in Python for how an HTTP request is handled by some Python code. And eventually, the code that you know in Nova handles that create server API request gets called.

So I guess what I'm trying to point out here is, if you were reading the code in Nova for handling an API request, for handling the API request for creating a server, you know that's coming across the internet somewhere, and you know you've got this rack of machines, and there's actually a ton of really, really interesting stuff happening in between that's worth digging into.

Which version of OpenStack? Any of the latest versions. Is there a specific reason you're wondering, do you think some of this has changed? That eventlet has? OK, no, that's a good question. And actually, I'd love people to ask more questions. So the question is, has eventlet been replaced? As I said, eventlet is an asynchronous IO handling framework. And the idea of that is, if you go to do some IO, if you go to write to a socket or read from a socket, and you can't currently do that, you basically don't want your process to sleep. You want to pause what you're doing and do some other useful work. And typically, you use what's called a main loop for handling that, and eventlet is the framework we have for doing that. And so the question is, are we still using eventlet? And in one case, we're not. In Keystone, we currently don't use eventlet; in Keystone, we actually use Apache's mod_wsgi module. And so in that case, Apache is basically handling the request and spawning off a WSGI pipeline. But in the case of Nova or Ceilometer or Neutron or anything like that, they are all still using eventlet. So it's only Keystone that has switched away from eventlet. No problem.
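Just to make that worker model concrete, here's a toy sketch of the pattern described above: bind a listening socket once, fork a handful of worker processes, and let each one serve requests with eventlet's WSGI server. This is only an illustration of the idea, not how oslo.service actually wires it up; the port and worker count are just examples.

```python
import multiprocessing
import os

import eventlet
from eventlet import wsgi


def app(environ, start_response):
    # A trivial WSGI app standing in for the real Nova API pipeline.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [('handled by pid %d\n' % os.getpid()).encode()]


# Bind once in the parent; the forked workers all inherit this socket and
# all sit waiting to accept on it, which is exactly the "who does the
# kernel wake up?" question from above.
listener = eventlet.listen(('0.0.0.0', 8774))


def worker():
    # Each worker runs its own eventlet hub; requests are handled in
    # green threads inside that process.
    wsgi.server(listener, app)


if __name__ == '__main__':
    for _ in range(4):  # e.g. one worker per core
        multiprocessing.Process(target=worker).start()
```

For what it's worth, my understanding is that a plain blocking accept() on a shared socket gets an exclusive wakeup on modern Linux kernels, while processes polling the socket, as eventlet does, can all be notified; SO_REUSEPORT was added to the kernel partly to spread accepts across workers more fairly. But that's exactly the kind of thing the weekend project above would confirm.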
So in terms of more information there, there was a really good talk at the last summit about HA and load balancing, which covers a reference architecture for HA and load balancing. If you want to learn more about asynchronous IO, I wrote a blog a while ago about how that works in Python and how eventlet is implemented, essentially. And that detail about having multiple processes all listening on the same socket: actually, if you go back to Richard Stevens' book, I looked it up there recently, it has a really, really good explanation of how and why that works.

OK, so we've talked about your create server request coming in and being accepted by the Nova service. What happens next? There's a typical kind of diagram that we use to explain how this works. Your HTTP request comes into the Nova API service, and the next thing it does is create a row in the database which describes your server. So it will go to the database, it will create a row, it will give your instance a UUID, and it will record the name, the flavor, and the image. And basically, its job is done. Well, almost: before it returns your response, the Nova API service also sends a message over the message bus to the Conductor service, letting it know that a create server request has come in. Then the Nova API's job really is done, and it immediately returns a response to your HTTP request.

So the Conductor service, the next job it has to do is decide which of those compute nodes in your three racks of gear to run your server on. It needs to make a decision: basically, what's the best host for running this virtual machine? That's essentially what the scheduler needs to do. So the Conductor asks the scheduler which host is the best host for running the virtual machine on, and the scheduler has awareness of the resources available on all of the compute nodes in your deployment. The scheduler will filter out all of the compute nodes that can't currently run your VM, maybe there are no resources left available, and it will sort the hosts that could run your virtual machine to try and make a decision about which host is the best one. So it might decide it wants to spread the virtual machines out across all of your hosts, or it might decide it wants to pack all of the virtual machines onto the smallest number of hosts possible. But whichever it decides, it's going to return a host to the Nova Conductor service. And then the Nova Conductor service sends a message to the Nova Compute service on that host to say, please start the virtual machine. At that point, the Nova Compute service running on your compute node is going to download the image, it's going to create some XML for libvirt, and it's going to run your virtual machine.
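As a toy illustration of that filter-and-weigh idea, here's a little sketch. The host names and numbers are made up, and the real scheduler has many more filters and weighers than this, but the shape of the decision is the same.

```python
# Toy version of the filter-and-weigh approach the Nova scheduler takes.
hosts = [
    {'name': 'compute-01', 'free_ram_mb': 2048,  'free_vcpus': 2},
    {'name': 'compute-02', 'free_ram_mb': 65536, 'free_vcpus': 30},
    {'name': 'compute-03', 'free_ram_mb': 8192,  'free_vcpus': 0},
]
request = {'ram_mb': 4096, 'vcpus': 2}

# Filter: drop hosts that simply cannot fit the requested flavor.
candidates = [h for h in hosts
              if h['free_ram_mb'] >= request['ram_mb']
              and h['free_vcpus'] >= request['vcpus']]

# Weigh: prefer the host with the most free RAM to spread VMs out, or
# negate the key to pack them onto as few hosts as possible.
best = max(candidates, key=lambda h: h['free_ram_mb'])
print(best['name'])  # -> compute-02
```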
So again, I don't want to go into too much more detail about that. There are plenty of other talks about this. Steve Gordon from Red Hat actually gave a really good talk in Vancouver, I think, that goes a bit deeper. Rafi Khardalian from Metacloud, quite a few summits ago, gave a really, really good talk digging deep into how Nova uses libvirt and KVM. And if you want to learn more about the Conductor service and how the Conductor service works with the Scheduler service, there's a really good blog post from Russell Bryant about that.

OK, what I thought I'd talk about next is a very specific type of virtual machine. Basically, posing this question: what if you want to run a very, very specific type of virtual machine that's going to give you deterministic high performance? You want to run some workload that's really latency-sensitive. You want to know that it's going to get the best out of the hardware available, and that it's going to be really responsive to the needs of your workload.

So you want a virtual machine that maybe has eight physical cores dedicated to its virtual CPUs. It's a virtual machine with eight virtual CPUs, but underneath the hood, it actually has eight dedicated CPU cores for that virtual machine. It also has memory that's spread across two NUMA nodes. If you don't know what NUMA is: basically, with many modern servers, you have a bunch of CPUs with some memory that's local to those CPUs, and another bunch of CPUs with memory that's local to them. And if you have some code that's running on this set of CPUs and it needs to access some memory over there, that request needs to go across a shared bus to get access to that memory. So as much as possible, you want the code running on this set of CPUs to access the NUMA memory that's local to that set of CPUs, and not have to go across the bus to a different NUMA node. So in this case, we actually want a virtual machine that's so large that it's going to be spread across two NUMA nodes. But we also want the guest to be aware of that. We want the virtual machine to actually have a virtual NUMA topology of its own, so that it's aware of the need for code to stay local to those NUMA nodes. We also, for performance reasons, want to use dedicated host memory with two megabyte huge pages rather than your standard four kilobyte pages. And we want to have two SR-IOV virtual functions assigned to the virtual machine, but we also want to be aware that those SR-IOV virtual functions have locality to the NUMA nodes as well.

And finally, just to make it interesting... let's pull it up again. Are we seeing any changes up there? OK, this is bugging me. OK, now, so lots of really interesting stuff. Just to throw one last interesting requirement into the mix: rather than the virtual machine just seeing eight plain virtual CPUs, we want to set that up as two virtual sockets where each socket has four cores. And maybe we only want to do that for Windows virtual machines.

So you might be thinking, that's absolutely crazy. You don't want to do this. That's just not cloudy, right? That doesn't seem like something you'd ever want to do with a cloud. Just get out of here, get lost. But I really like this way of thinking about it: there's more to cloud these days, there's more to what we're trying to do with OpenStack, than just replicating what EC2 was doing in October 2008 when EC2 first removed its beta label. The world has moved on.
EC2, even if you just define cloud as EC2, is doing a bunch of really interesting stuff these days. And when you look at my requirements for my virtual machine there, you could just think of that as an EC2 high performance flavor. It actually does make sense, what we want here. So that quote wasn't from anyone particularly insightful; that was just from me.

So how do we go about doing this with OpenStack these days? It's actually pretty straightforward. The set of requirements I had sounds crazy, but what we've recently added, in the Kilo and Liberty releases, is a set of Nova instance type (flavor) properties and also image properties for defining this stuff. So what you see here is three different instance type properties. One is called CPU policy, and with it we can basically say that physical CPUs should be dedicated to the virtual CPUs, which is one of the things I asked for. We can also say how many NUMA nodes we want, and we can also say what host memory page size we want. Basically, these are all properties on our instance type, and we just call our instance type m1.medium.performance. So you can imagine a user of an OpenStack cloud coming in, going back to that UI earlier, asking for a virtual machine, giving it a name, and basically just saying, I want a high performance virtual machine. And this is what's happening in the background. And the last thing I mentioned, about having a particular virtual CPU topology, we can set that up as an image property, the hardware CPU sockets (hw_cpu_sockets) image property. And we can set that as a property on the Windows image so that we actually only get this virtual CPU topology for Windows images. There's a small sketch of setting these flavor and image properties just below. I have a bunch of links at the end of this. I think some of it came in Kilo; it's definitely there in Liberty, but I think most of this was actually there in Kilo as well. Is there any documentation? Yes, there is. These links I have here are actually really good documentation for everything I described there.

But actually, what I wanted to do was dig in a little bit deeper into how we've implemented this. So if you go to a compute node and you run something called virsh capabilities, it shows you a bunch of information about the capabilities of that hardware platform. In this case, you might go to one of those machines in our setup with three racks. And what you're seeing here is the NUMA topology of that machine. You're seeing that there are two NUMA cells, and we're looking at one of the NUMA cells here. We're seeing that there's a certain amount of memory available in that NUMA cell. We're seeing how many four kilobyte pages there are and whether any two megabyte pages are available. And we're seeing a bunch of information about the CPUs and the cost of a memory access going from those CPUs to this NUMA cell. So that's one aspect of this: we need to understand what the hardware platform can do, or what that hardware platform has available.

And then when we go to create the virtual machine, there's the libvirt XML which describes the virtual machine, and it actually has a bunch of really interesting information for implementing the virtual machine this way. The cputune element basically describes, for each virtual CPU, which physical CPU it's running on. It also has this emulator pin setting: the QEMU process has a thread which helps with IO, and we want to confine that thread to run on just one of the physical CPUs that we're using for virtual CPUs. The XML also has a bunch of settings for how the virtual NUMA topology is implemented and how the virtual CPU topology is implemented.
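As a rough sketch of how you might set those flavor and image properties up from Python, here's what it could look like with python-novaclient. The flavor name, vCPU count, NUMA node count and page size follow the talk; the authentication details and the RAM and disk sizes are placeholders, and you could do exactly the same thing with the nova or openstack command line clients.

```python
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client

# Placeholder credentials; substitute your own cloud's details.
auth = v3.Password(auth_url='http://controller:5000/v3',
                   username='admin', password='secret',
                   project_name='admin',
                   user_domain_id='default', project_domain_id='default')
nova = client.Client('2', session=session.Session(auth=auth))

# Eight vCPUs as in the talk; the RAM and disk sizes are just examples.
flavor = nova.flavors.create('m1.medium.performance',
                             ram=8192, vcpus=8, disk=40)
flavor.set_keys({
    'hw:cpu_policy':    'dedicated',  # pin each vCPU to its own host core
    'hw:numa_nodes':    '2',          # expose a two-node virtual NUMA topology
    'hw:mem_page_size': '2048',       # back guest memory with 2 MB huge pages
})
```

The virtual CPU topology piece goes on the image rather than the flavor, for example by setting hw_cpu_sockets=2 and hw_cpu_cores=4 as properties on the Windows image in Glance, so only guests booted from that image get the two-socket, four-cores-per-socket layout.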
So I'm running out of time here, and I've just got one last thing to cover. But again, I'm going to post this on Twitter, and you'll be able to dig into any of these articles or videos about how this stuff is implemented. And I'm going to throw my laptop out the window in a minute.

OK, the last thing I wanted to cover was just to give a quick run through of how KVM actually works. I think everyone has a fair idea that KVM is this technology for running virtual machines, but how does that actually work? What I'm talking about here is based on an article that was recently published on LWN by Josh Triplett, and I thought it was just a really, really good article, really worth spending the time on to understand what's going on here.

So to create a KVM virtual machine, it's actually as simple as this. You go to /dev/kvm and you open it. You do this ioctl, which is called KVM_CREATE_VM, and you get a file descriptor back. You've now created a virtual machine. The next thing you need to do is set up some memory for that virtual machine. So in this example, what we're doing is creating just one 4 kilobyte page and copying some code into that 4 kilobyte page. Then we've got a structure which basically describes that memory region: we want it to occupy slot 0, we want to map the 4 kilobyte page not to the first page of guest physical memory but to the second page, we're describing that the memory size is 4 kilobytes, and we're giving a pointer to that memory. And then we do this KVM_SET_USER_MEMORY_REGION ioctl on the file descriptor that we got previously. Now our virtual machine has some memory.

The next thing we want to do is set up a virtual CPU for the virtual machine; a virtual machine without a CPU isn't much use. So again, there's an ioctl, which is KVM_CREATE_VCPU, and we now have a file descriptor for managing that virtual CPU. We've got a bunch of stuff here for mapping some structures which describe the state of the virtual CPU: the state of various registers and the state of the instruction pointer. We want to pass in some parameters in the AX and BX registers there. We set all of that CPU state up, and then we do this KVM_SET_REGS ioctl on the file descriptor for the virtual CPU.

So we almost have a virtual machine that's ready to run now. To actually run it, there's another ioctl, which is called KVM_RUN. What we're actually doing here is running that virtual CPU, and what this ioctl is going to do is start running the code for the virtual machine. That code is going to keep running until it does something that needs to be handled by the hypervisor. So in the example in Josh's article, the code is very simple. It takes the parameters that we passed in the AX and BX registers, adds them together, and then goes to write the result to a serial port. So it can run all of that code, but as soon as it goes to output to the serial port, we basically exit from running that virtual CPU. That ioctl returns, and now we're back in the hypervisor again, and we need to handle the reason that we exited the guest code. In this case of writing to the virtual serial port, we'd get a KVM_EXIT_IO, and we might handle that just by outputting the result to the screen.

And so what you have here is a loop where you're running the virtual CPU, you're handling the reason for the VM exit, and then you're running the virtual CPU again. You're just going around and around in this loop, and you're doing that for one virtual CPU. So if you've got eight virtual CPUs, you're actually going to have a QEMU thread for each of those virtual CPUs in a loop like this, just running the code and handling each of the exits.
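Just to show how little ceremony there is in those first steps, here's a tiny Python sketch of the opening ioctls. The ioctl numbers are hardcoded from linux/kvm.h for illustration; setting up guest memory, the registers and the KVM_RUN loop needs mmap and struct packing, so that part is left out here, and Josh's article walks through the full version in C.

```python
import fcntl
import os

# ioctl numbers from linux/kvm.h, hardcoded here for illustration.
KVM_GET_API_VERSION = 0xAE00  # _IO(KVMIO, 0x00)
KVM_CREATE_VM       = 0xAE01  # _IO(KVMIO, 0x01)
KVM_CREATE_VCPU     = 0xAE41  # _IO(KVMIO, 0x41)

kvm = os.open('/dev/kvm', os.O_RDWR)

# The stable KVM API reports version 12.
assert fcntl.ioctl(kvm, KVM_GET_API_VERSION) == 12

# Each of these ioctls hands back a new file descriptor: one for the VM
# as a whole, and then one per virtual CPU.
vm_fd = fcntl.ioctl(kvm, KVM_CREATE_VM, 0)
vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, 0)

print('vm fd:', vm_fd, 'vcpu fd:', vcpu_fd)
```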
So if you want to learn more about that, Josh's article is quite good. But if you want to dig a bit deeper, what I still find really good is Avi Kivity's first paper about KVM, really describing the design of KVM. Everything there is still very relevant.

So for me, that's just... this is going to be so annoying. So for me, these are just a few small examples of some of the really, really interesting stuff that OpenStack is pulling together. When I was initially thinking about writing this talk and about the kind of things I could cover, this is the list I came up with. You could talk about how you design the core network that feeds into your top of rack switches. You could talk about bonding. You could talk in detail about how Open vSwitch works or how Ceph works. You could talk more about async IO or RabbitMQ or MariaDB. Another really interesting thing is something called sVirt, where, when we're running that virtual machine, we're actually confining that process on the host side with an SELinux context, so that if there's an attacker inside your virtual machine, they can't break out to your host. We could talk about what real-time KVM is going to look like, or what the qcow2 image format looks like. Basically, there's a whole ton of really, really interesting stuff here. OpenStack relies on all of this stuff, and I think it's really worth spending some time learning and understanding some of it.

So for me, that's really the conclusion here. When you're learning about OpenStack, don't just allow yourself to skim at the high level. We're describing OpenStack these days as an integration engine: it's integrating a whole ton of different technologies. Most of those technologies are open source, and there are really huge opportunities out there to dig deep and really learn how these technologies work.

So I'm out of time here. I hope that was interesting and useful, despite the display issues; they certainly stressed me out. But we're out of time. Anyone got some quick questions? Yeah? [Audience question about over-committing resources and whether one virtual machine can influence another.] OK. Yeah, so I think that's an interesting question. I'll repeat it: the question is, there's a concept of over-committing resources, so that you could, say, create virtual machines that consume more memory than you have physical memory, and that's called over-committing. And then there's the concept of resource isolation, so that virtual machines shouldn't be able to influence other virtual machines. I think we could talk about that a bit further, because we've gone over time here and people are already starting to leave.
But basically, the concept of over-committing is really useful to cloud administrators who want to get the best use out of the resources they have available, and the point of resource isolation is that you basically want to provide an SLA to your users, right? And the technology is there so that you can enforce an SLA for your users while still doing over-committing. Does that make sense? OK. Yep. OK, so people have started leaving. How about you and I just talk directly after this? That's OK. We've just run out of time. OK, thank you very much, everybody. Thank you.