Good afternoon, everyone. My name is Doug Williams, and this is my colleague from Red Hat, Ayal Baron. There's actually a little bit of history to this presentation. Good. Aren't screensavers wonderful? If you look at the abstract, you'll see a gentleman named Mark Wagner who was originally supposed to present it. He worked on the presentation and then had to go in for knee surgery. Then it passed to a gentleman named Andy Catherill, and he got sick mid-week, so Ayal and I joined in. Our goal is to complete the presentation without medical issues. With any luck, we'll get through this, so bear with us, please.

So... the topic of OpenStack performance is a really broad one, and I do want to set some expectations. We're only going to scratch the surface, and it's a work in progress. What we really want to do is a couple of things. We want to talk a little bit about how you conceptualize OpenStack, because you can think of OpenStack as really a couple of things. You can think of it as the individual projects: Nova, Cinder, Keystone. Much of that lives in the provisioning realm. But you can also think about it as what's in the rest of the platform: the operating system, the hypervisor, the storage technology, and how they play together at the system level. So we're going to try to cover a few of those topics together. We'll talk a little about how you conceptualize it, how you would think about it, and what technologies fit where that could have an effect on performance. Then we'll start to drill down, first on some foundational elements, things that anchor the performance of OpenStack everywhere, then on OpenStack Compute and its individual components, and then we'll broaden into OpenStack Cinder and a little bit of OpenStack Swift if time permits.

What you'll see is really a work in progress. It reflects our best effort at where some of the characterizations stood as of the middle of last week. You'll see a mix of results. Where possible, we're going to try to share Havana results. Sometimes we had older results that we view as still current, so we'll share those. I also wanted to give you a sense of some of the things we're not quite ready to present. Probably the biggest area is Neutron, and the reason is that Neutron reflects a really substantial change to OpenStack. It's really pushing things like Open vSwitch, GRE, and VXLAN, and you're still seeing a lot of innovation around those. We will be doing performance characterization there; it's currently in flight, just not ready to present yet, so expect results in the future. The other area is more focus on the end-to-end provisioning use cases, all the way into Heat. Within Red Hat we're putting a lot of energy into that testing, and those results will be reportable in the future.

I can't imagine anyone in this room who hasn't seen this slide: the high-level view of the various components of OpenStack. When I talk to our performance team, I like them to think about each of these components, such as Nova, Cinder, Neutron, Glance, and Swift, in really two domains. The first is the classical domain of what it means to run a VM, what the performance of the VM is like in steady state; if you come from a networking background, this is classically referred to as the data plane. The second, very distinctly, is a lot of the new value in OpenStack, which is provisioning: how do you provision a VM, how do you stand up the image?
How do you snapshot the image? How do you shut it down? Really, the provisioning use cases as well. You can think of each of these elements, in many cases, as having a set of performance characteristics focused on the steady state and a set focused on the control plane operations. The exceptions are Swift and Glance. Swift, because it's a storage layer in and of itself, largely lives in the data plane, and Glance's job is to provision images, so it really plays more of a supporting role in the control plane.

You've probably also seen this diagram, and I'm not going to talk through it in detail, but if you look at the individual services within OpenStack, there are a couple of patterns. One is that within any individual service, such as Nova, you have individual software components, a database component, and a messaging component. So there's a set of foundational services, such as messaging and the database, that dictates performance, or can become a performance limiter, plus the performance of the individual elements. And as you aggregate them together through something like Heat, there's the federation of those services and the performance of provisioning a template, which is the collection of all of that. This is a little bit of why, when we start to look at control plane performance, there's a fair amount of complexity: it's the scope of the use cases that need to be tested.

What I've tried to do is annotate that conceptualization of what's in the control plane and what's in the data plane with technologies you might typically see in an OpenStack deployment, and then I'll get more specific about which technologies Red Hat selects, both in our RDO distribution and in our RHEL OpenStack Platform distribution. What you'll tend to see on the control plane is largely dominated by the code developed by the OpenStack project itself. That's the Python code, typically Python 2.7 today, with the various issues around the global interpreter lock, and eventlet handling concurrency. Then you'll see some foundational elements, both those that are part of OpenStack itself, such as Keystone for security (Ayal will talk a little bit about the fetching of security tokens and therefore the performance effect on the security subsystem of, for example, a Nova operation), and those we're not talking about now but expect to in the future, such as Ceilometer; it's the same idea, individual services create events and therefore place performance demands on the monitoring service. And then there are the services such as the database and the messaging layer.

As I mentioned, the classic OpenStack control plane is Python 2.7 and the OpenStack code. For the database, what you tend to see is MySQL, though lately a lot of people are using MariaDB, and as they chase higher performance and availability targets, they're either doing active/passive failover, which performs pretty much the same as plain MariaDB, or introducing another technology such as Galera for active-active, and paying a performance tax in order to deliver high availability. On the right-hand side, what you'll see is technologies that OpenStack manages but that often come from other heritages. The heart of Nova, with the exception of ESXi, is the Linux node, so that's going to be, hopefully, an enterprise Linux, some from Red Hat, but certainly Ubuntu is very common as well.
The hypervisor technology is typically KVM; Xen is also involved, and ESXi. Then, when you have Nova instance storage, that's really pulling in technologies such as XFS or ext4, so those will often dictate the performance of your ephemeral storage, plus, depending on your storage partner, whether they use any backplane RAID for striping, which can have an effect, as we'll discuss later, on things like boot performance. As you get into Cinder, this is where you're either going to be using some of the traditional storage technologies, EMC and NetApp, some of the newer technologies such as GlusterFS and Ceph, which we'll talk about a little more, or some of the specialty technologies, particularly SSD arrays, and in many cases that's where the storage path into Linux has a big effect on performance. Neutron is undergoing a massive amount of change, generally the switch from Linux bridge to Open vSwitch, rapid evolution in the Linux community around Open vSwitch, and in particular some maturation around GRE and VXLAN as well. And many of the deployments that use a more basic model push the burden out of the software stack into the switch, so the switch dominates performance; you may also see this around L3 performance, in L3 gateways and the like, depending on whether it's hardware or software.

I just wanted to walk you through that broad set of technologies. What you'll see today are results that Red Hat measures, so they'll be a little biased towards our technology choices. That continues to be Python 2.7 and the OpenStack bits, and for Red Hat that sits on RHEL 6.5. There have been a lot of enhancements to the kernel based on OpenStack's needs, bringing in things like network namespaces and a lot of work around Open vSwitch, and there's still a pretty broad family of supported storage technologies. For databases, Red Hat tends to focus on MariaDB, and for Red Hat, Qpid is the underlying messaging layer.

The other thing that will affect your performance, or your performance demands, particularly at the cell level, is best thought of along two dimensions. One is the easy case: the more servers in a cell, the more provisioning operations happen, and you expect that to scale somewhat linearly with the number of nodes. The other, and this is really workload- and environment-based, is the intensity, the duty cycle, of the workload, and I want to point out a spectrum here, because as you go through sizing, calibrating on the life cycle of a VM can have a big effect on the demand on the control plane. At one end of the spectrum are fairly traditional apps: you deploy one, set it, forget it, and it runs on without changes for months, so you can imagine the control plane sitting pretty much idle all the time. Slightly more dynamic, or quite a bit more dynamic, is the more modern style of scale-out apps, where the app will autoflex, so there might be a provisioning operation every hour or two, maybe more often depending on the peaks and troughs of the workload. Further along, it might be a test/dev environment, where someone is tearing down and reprovisioning apps at a very frequent rate. And then there's what you heard a little about from Mark McLoughlin and others on continuous integration, where they talked about the frequency with which they tear down OpenStack instances for CI.
This is probably the most intense use case, because individual test flows of service components are tearing down and creating VMs all the time. So, depending on your workload, that will really dictate where you sit on that spectrum, and therefore the operating point of the control plane. With that, why don't we switch over to Ayal, and he'll go into a bit more depth.

Thank you, Doug. So I think that's enough talking; let's look at some numbers. I'll dive straight into it. Again, we're just scratching the surface. Let's take a look at Keystone. These results use UUID for the token provider, not PKI, and they're from Folsom. We saw a lot of chattiness: when you do a Horizon login, it gets three tokens; the Horizon image page, two tokens. So there's a lot of room for optimization in that space. Another side of it is that the token database grows without cleanup. That has a lot of effects; it can really degrade performance, so you really should clean it up. You get problems with indexing, and for every 250,000 rows, response time grows by about 0.1 seconds. This is actually pretty easily taken care of with a cron job that just runs keystone-manage to flush expired tokens; there's a sketch of this below. But the point is that you need to take these things into consideration. You need to make sure you not only install, but then configure your machine. You need to do some setup in order to make sure that your performance remains constant over time.

Another aspect of this is actually just Python. If you use curl on the command line, you can run a command in half a second. If you use nova image-show, your performance degrades and it takes 2.6 seconds. This is due to the fact that Python's httplib passes data one byte at a time. It's a known issue, but these things have to be taken into consideration. So if you're automating things, you should probably do it directly with curl, for example, and not through Python in this case. By the way, if you have any questions, please stop me.
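As a concrete illustration of that cleanup job, here is a minimal sketch, assuming the Havana-era keystone-manage token_flush subcommand and a cron.d-style crontab; adapt the schedule, user, and log path to your own packaging:

```
# /etc/cron.d/keystone-token-flush (sketch)
# Flush expired Keystone tokens hourly so the token table stays small.
0 * * * * keystone /usr/bin/keystone-manage token_flush >> /var/log/keystone/token-flush.log 2>&1
```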
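To see why byte-at-a-time reads hurt so badly, here is a small standalone Python timing sketch. It is not OpenStack or httplib code, just a demonstration of the cost of a one-byte chunk size versus 64 KB on an in-memory 10 MB payload:

```python
# Compare reading a 10 MB "response body" one byte at a time versus in
# 64 KB chunks. The one-byte loop makes ~10.5 million read() calls.
import io
import time

payload = b"x" * (10 * 1024 * 1024)  # stand-in for a 10 MB HTTP body

def drain(stream, chunk_size):
    while stream.read(chunk_size):
        pass

for chunk_size in (1, 64 * 1024):
    stream = io.BytesIO(payload)
    start = time.time()
    drain(stream, chunk_size)
    print("chunk_size=%-6d took %.3f seconds" % (chunk_size, time.time() - start))
```

On a typical machine the one-byte case is orders of magnitude slower, which is directionally consistent with the half-second versus 2.6-second gap above.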
So let's look at Nova. Who in the audience knows what SPECvirt is? Okay, very few people. SPECvirt is the leading benchmark standard for virtualization. It's not OpenStack-specific; it compares VMware, KVM, et cetera, any virtualization technology you have. The good news is that KVM is the leading hypervisor, and with OpenStack we use KVM on top of RHEL, so we get pretty good performance numbers. A little more about SPECvirt: the way it works is that it has tiles. Each tile is composed of six virtual machines running different workloads, and the more tiles you can create on the fly while keeping the SLA, the better your benchmark result. So here you can see the SPECvirt results. This is SPECvirt 2010; there's a new SPECvirt 2013, but there aren't results for it yet. As you can see, the more sockets you have, the higher the density you can get, and the better the performance. The blue columns are VMware results; the red columns are KVM, on top of either RHEL or RHEV in this case. Two-socket is the commodity hardware nowadays, so that's the sweet spot for OpenStack, right? You want to get more and more machines cheaply, so that's probably the relevant side of the chart for OpenStack. If you have enterprise virtualization, your pets-type workloads, those fall on the right side, where you're willing to spend more money and get more performance out of each machine.

Now, although we're claiming that SPECvirt is valid for OpenStack, we need to validate that RHEL with KVM in fact behaves the same way under OpenStack as it did when tested generically in SPECvirt. So we ran a Java workload, and there are basically two things you can see here. This is ROS, which is Red Hat OpenStack, versus an untuned libvirt/KVM setup. It's comparable; the results are pretty much the same. In fact, with ROS they're even better out of the box, because with Packstack we tune all the machines to use a tuned profile called virtual-host. So out of the box you get even better performance than plain libvirt over KVM. A few things to note, though. In SPECvirt submissions, the people who submit the results basically tune the hell out of the machines. They use things like SR-IOV and NUMA node bindings. These are things OpenStack still does not do, but I expect they will come with time.

Another aspect to look at is overcommit. OpenStack has a very aggressive CPU overcommit default: it overcommits 16 to 1, which means you can run 16 virtual CPUs for every physical CPU you have. On the memory side, it's much lower, a 1.5 overcommit, so for every gig of physical memory you can allocate at most a gig and a half of virtual memory (see the config sketch a little further down). This is important because with memory, if you overcommit too aggressively, your performance runs into the ground and you start thrashing, so you really can't overcommit too much. It very much depends on the workload and the type of things you do. OpenStack still doesn't integrate with things like KSM, but I expect those things will come.

Yes. Well, the more control you're given, obviously, the better you can tune your machines. These things will come. Today the scheduler is really, really basic. You can only control a single parameter, which is memory: the only thing taken into consideration when scheduling a virtual machine is how much memory it takes up. It totally ignores CPU, totally ignores other aspects. I'm not sure if there was already a session on this, but there is work being done on improving the scheduler to take a lot more parameters into consideration. Once that happens, you'll also see things like cgroups coming into play, where you can in fact exercise control: even if you allocate three virtual CPUs, you may not want to let a virtual machine go beyond, I don't know, 60% of a single physical CPU, things like that.

Right, so virt-manager is a single-host application. Yes, but the point I'm trying to make is that with OpenStack, one of the basic tenets is scalability. You don't get scalability by tuning each compute node separately. You get scalability by being able to scale linearly and by keeping your hardware as homogeneous as possible. So what you really want is to get the best out-of-the-box performance you can on the one hand, and possibly to split your hosts up into cells, homogeneous cells; you can be heterogeneous across cells, but within a cell you want everything to be the same. Then, if you control the scheduler properly, you can schedule different types of workloads, different profiles, to different cells.
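For reference, here is where those two ratios live. This is a minimal nova.conf sketch using the Havana-era flat option names, with the default values quoted above; verify the names against your release:

```
# /etc/nova/nova.conf
# Scheduler allocation ratios (defaults shown):
cpu_allocation_ratio = 16.0   # 16 vCPUs per physical CPU
ram_allocation_ratio = 1.5    # 1.5 GB virtual per 1 GB of physical RAM
```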
Okay, so this chart is basically showing the effect of CPU overcommit. Each virtual machine here had two vCPUs, and the machine had 16 physical CPUs. You can see that it peaks at eight virtual machines, exactly the one-to-one physical allocation, and then, as you start to overcommit, performance goes down. One thing to note, though, is that at some point, once you start reaching memory overcommit, you will see a cliff because of thrashing or excessive paging.

Okay, so let's take a look at ephemeral storage. Today, out of the box, when you run an instance in OpenStack, the image from Glance is stored on the local hard drive of the compute node. This is something you don't necessarily take into consideration, but it can affect your performance significantly. There's a trade-off between the number of hard drives you put in your compute node and the performance versus, sorry, the footprint. The more hard drives, obviously, the better the performance, assuming you actually use those hard drives, but also the bigger the footprint: more space in your racks, more power, more money. I'll also compare how network-based storage does against local storage. We tested three profiles: a single system disk; a stripe of seven disks, which is RAID 0; and network-based storage using Fibre Channel and SSD drives. Of course, from a performance perspective, this comparison is not fair, but the point to understand is that if you don't take these things into consideration, if you don't tune, if you don't configure your setup properly, you will take a severe hit.

The first thing we tested was Nova boot time with a range of two to 16 virtual machines. In red, you see the local system disk. Note that each virtual machine had a different image; we wanted to make sure we weren't caching anything and getting better performance by going through the page cache. So red is the system disk, blue is the seven striped local disks, and yellow is Cinder boot-from-volume with Fibre Channel SSDs. With two virtual machines, it doesn't make much difference. Once we jump to 16 virtual machines, you can see that performance on a single system disk runs into the ground; boot time per machine goes up significantly. And it not only affects the boot time, it also affects the number of virtual machines you can effectively run on a host. In red, we stopped measuring once the boot time got so bad that it was basically irrelevant.

Then we wanted to see how long it takes to perform a snapshot. The way snapshots happen in Nova today, if it's a live snapshot, is that you have a block rebase in QEMU. QEMU starts replicating all I/O until it has a full copy of the entire chain; this is a qcow2 chain, so it has external snapshots. Once that copy is finished, we run what is called qemu-img convert, which collapses the entire image into a raw file on the side, and then it gets uploaded into Glance. We did not measure the time it takes to upload into Glance, but we did measure how long it takes to convert the entire chain, basically your root disk. You can see that when taking one snapshot, the average snapshot time is relatively okay, but the more snapshots you take on the system disk, the longer it takes. If you're using local RAID, the seven-disk stripe, you can keep your snapshot times low. So this doesn't just affect runtime; it also affects the time it takes to run provisioning operations and things like that. And the point is that the destination of the snapshot is tunable, so instead of having it go to the system disk by default, you can change it and get much better performance.
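As a sketch of that tunable: in Havana-era Nova with the libvirt driver, the scratch directory used while assembling snapshots is controlled by a flat option along the lines below. Treat the exact option name as something to verify for your release (newer releases move it into a [libvirt] section), and the path shown is purely illustrative:

```
# /etc/nova/nova.conf
# Put snapshot scratch space on the fast local stripe instead of the
# system disk (example path; the default lives under instances_path).
libvirt_snapshots_directory = /mnt/raid0/nova-snapshots
```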
This chart shows the effect of taking multiple snapshots: the average time per snapshot goes up the more snapshots you take concurrently.

And now, this is something completely different. Who here is familiar with the m1 virtual machine profiles? Okay, so basically this is equivalent to what you'll see in Amazon. When you want to provision an instance, you have a profile family called m1, with medium, large, and extra large sizes. The difference between medium and large is that large has twice the vCPUs, and Keith, can you remind me, the memory ratio? m1 medium to large, is it also double the memory? Yeah, okay. So double the vCPUs, and from large to extra large, double the vCPUs again. Naturally, you can run fewer extra large machines than medium machines on the same physical host. Now, the throughput you get for medium machines caps at 21 VMs, at 61%; that is, if you take the baseline, which is the red line here, a single machine, and multiply it by 21, you get 61% of that, 440,000 or whatever; I don't remember the units tested. You can see there's a difference between running m1 medium machines and large machines. For large and extra large, you get capped at the same ratio, but you can get a lot more aggregate throughput running many small machines than running fewer, larger virtual machines.

This is another look at the same thing. Here, instead of the total, the aggregate throughput, you're looking at throughput per virtual machine, the average. You can see it more clearly: the red is the baseline on the left, running a single medium virtual machine, and the blue is the average when running 21 such machines. Per virtual machine, you get lower throughput. Another view of the same data: instead of throughput, this is latency, so here lower is better. On the left is the average application latency when running 21 mediums, and the red is the latency when running a single virtual machine; the same for large and extra large. Any questions on this before I move on? Okay.

So let's look at the scheduler. As I said before, the scheduler only takes memory allocation into consideration; it does not measure CPU load to make scheduling decisions. If you have differing configurations, and this is what I was referring to earlier, you can see very bad allocation decisions. There are very few tunables. Just to be clear, what happened here is that we had four machines, all with 24 physical CPUs, but the two machines on the left had 48 gigabytes of RAM and the two on the right had 96, and the Nova scheduler kept scheduling more and more virtual machines onto just the two machines on the right. We reached a point where we had 49 virtual machines on each of the two right-hand nodes versus a single virtual machine on the left. There is a tunable you can use, scheduler_host_subset_size. What it does is basically say: the scheduler has a ranked list of hosts; take, in this case, the top four hosts and choose randomly among them. Now, here this was a bit of a cheat, because we had only four hosts, so we effectively changed the scheduling from memory-based to random-based.
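For reference, the tunable in question, as a one-line nova.conf sketch matching the experiment described above (the default is 1, meaning always pick the top-weighted host):

```
# /etc/nova/nova.conf
# Choose randomly among the 4 best-weighted hosts rather than always the top one.
scheduler_host_subset_size = 4
```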
But you can see here what we did: we added 20 virtual machines on every run, every color being a set of 20 virtual machines, and you can see how the scheduler placed them in this case. The most important takeaway is that you really need to either write your own scheduler that takes more things into consideration, or make sure you have homogeneous hardware within an allocation, within a cell or within a group of hosts being allocated to, and make sure not to mix and match. Okay, now I'll pass it back to Doug.

Okay, we've got about five more minutes, so I want to highlight a couple of storage topics on Cinder and Swift in the available time and leave a little room for Q&A. One of the evolutions going on in Cinder is really the richness of the storage topology. If you go back to Diablo or so, you could have any storage you wanted as long as it was block storage: QEMU to a block device. By Folsom, the whole NAS world had been brought in, NFS providers and things like that. One of the more interesting innovations lately leverages some work out of the QEMU community, and I'm personally aware of some contributions from IBM there, where the QEMU emulation layer, part of KVM, rather than going to the OS to do file or block operations, can call a library directly. So you're starting to see some of these software-defined storage stacks (some of you who attended the panel session know there's a lot of debate on that term) where a portion of the storage stack, often called the client, lives on the hypervisor side, with the ability to link it in directly. In the case of Gluster, there's something called libgfapi, its client library; Ceph has something called librados, its client. With direct linking, an I/O comes through QEMU, QEMU calls into that client library, which performs the next step and immediately sends the I/O request straight out over the network, and the only context switch into the kernel is for the network traffic, not the storage traffic. It also means that the client side, which can run into performance saturation, is instantiated per guest: every individual guest picks up its own copy of the client. So if I'm running 40 VMs, I've got 40 copies of the client, and even if there were client saturation, it's not a shared resource, it's a resource per guest, which gives us a lot of flexibility on performance.

So one of the things we've been looking at at Red Hat is how to deliver that, how to link these together but make it supportable. And that really means extending QEMU to have loadable modules. What we wanted to avoid is the situation that happens today, where once someone links it in, the storage vendor has to distribute the hypervisor pieces to you, and then it's unclear who's responsible for updates. So a major enhancement, delivered in RHEL 6.5, and you'd expect the rest of the Enterprise Linux community to pick it up, is making this loadable, so your storage provider can supply the client library. It links in with the hypervisor in a very supportable manner but still benefits from this really attractive performance improvement.
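To make that path concrete, here is a minimal configuration sketch for Havana-era GlusterFS-backed Cinder with direct QEMU attachment via libgfapi. The option names are the documented Havana ones, but the Gluster host and volume name are hypothetical, and you should verify the details against your distribution:

```
# /etc/cinder/cinder.conf -- GlusterFS backend
volume_driver = cinder.volume.drivers.glusterfs.GlusterfsDriver
glusterfs_shares_config = /etc/cinder/shares.conf

# /etc/cinder/shares.conf -- one Gluster volume per line, for example:
#   gluster-server:/cinder-volumes

# /etc/nova/nova.conf -- let QEMU attach Gluster volumes through
# libgfapi instead of a FUSE mount:
qemu_allowed_storage_drivers = gluster
```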
To give you a sense of this, I want to share some results using GlusterFS. These results are from a pair of back-end nodes: two-socket x86 servers with 12 drives apiece behind a PERC, in this case Dell's flavor of backplane RAID, with a little bit of write-back cache. We wanted to benchmark that and look at both IOPS and bandwidth, and what we're measuring is not aggregate performance but how much performance we can deliver to an individual hypervisor node. What you can see here is that for reasonably sized sequential I/Os, 64K, we're delivering just north of a gigabyte per second, and guess what, with a 10GbE network, that's wire-saturated. You'll see write performance at half that, but that's because it's a replicated write: every user write creates two writes. So again, it's network-limited performance here.

For random I/O the performance is a little different, so I'm showing it as IOPS instead. What you see is read performance that tops out at about 2.2K IOPS. Why does it top out there? Twelve drives per node, large-form-factor drives, generally 100 IOPS per drive; guess what, 24 drives is going to give you about that expected performance. This really highlights that if you're using something like GlusterFS or Ceph, the decisions you make on drive technology, 15K RPM drives versus 7200 RPM drives, have a big effect, because the larger-form-factor drives can be something like three times more cost-effective. That's an important trade-off. So that gives you a sense of the read performance, and of course you'd want writes replicated. In the case of Gluster, it's two-way replication on top of backplane RAID, so writes come in at about half. If you had a technology that did three-way replication, you'd see about a third, because you're consuming three head seeks on the spindles.

Now, what I intentionally did in these results is use VMs whose virtual images were 48 gigs, and a large number of them, eight, which guaranteed the working set was so large that it wouldn't fit in the buffer cache of your storage pool. If we ran something with a much smaller image size, you'd see much higher IOPS rates, but that's because you're not exercising the spindles; the operations are hitting the buffer cache of your storage-serving nodes. Our next step here is to look at two things: VM caching leveraging SSDs, because VMs have hotspots and that's a very promising technology, and then some of the vendors, such as LSI, and you'll see this from others, are starting to bundle SSDs in with the cheap storage to do hotspot management. So you'll see substantial performance improvements in that space.

I'll skip the next two slides and highlight one other thing. Something Red Hat has been working on with the Swift community is extending Swift beyond what it is today, a proxy layer and then a collection of rings that do the persistence, and that only comprehends that, to allow the storage to have pluggable backends. So you can have POSIX backends in addition, which matters particularly if you're dealing with storage interoperability; we use this to provide multi-protocol access. One of the things that brings up is that the libraries used across a lot of OpenStack rely on a mechanism called eventlet. What eventlet does is take an individual thread, as managed by the OS, and run multiple individual operations within it: a lightweight threading model where a thread of execution, when it performs a network operation, yields to the next operation. One of the things that happens, and we see this both in the Swift ring structure and in pluggable backends like the Gluster one, is that when you perform a disk I/O, that yield doesn't happen. The thread blocks, which means the other operations sharing that thread have to wait for the blocking I/O to complete. So here's what really comes up: the default configuration for Swift tries to have a large number of clients per worker, 1024, but a very small number of these workers. What we're showing is what happens in that case: as you increase the number of operations, individual clients sharing a thread stall one another, and you start to see a very significant latency push-out. There are some fairly simple tuning changes, growing the number of workers and really minimizing the number of clients per worker, that have a big effect. My understanding is that IBM has published similar results for vanilla Swift, where they again show that reducing max_clients from 1024 to something more like 32 to 64 avoids the latency push-outs you'd otherwise see.
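To illustrate the blocking behavior just described, here is a small standalone Python sketch; it is not Swift code, it assumes the eventlet library is installed, and the file path is hypothetical. One greenthread yields cooperatively via eventlet.sleep, while another does plain blocking file reads that never yield and therefore stall the first:

```python
# Standalone sketch of the eventlet behavior described above (not Swift
# code). Greenthreads yield cooperatively on eventlet.sleep / network
# I/O, but plain blocking disk reads never yield, so they stall every
# other greenthread sharing the hub. Requires: pip install eventlet.
import time
import eventlet

start = time.time()

def heartbeat():
    # Prints roughly every 0.1s while the hub is free; watch the
    # timestamps jump while disk_hog holds the hub.
    for _ in range(20):
        eventlet.sleep(0.1)  # cooperative yield
        print("heartbeat at %.2fs" % (time.time() - start))

def disk_hog(path):
    with open(path, "rb") as f:          # ordinary (non-green) file I/O
        while f.read(64 * 1024 * 1024):  # each read blocks the whole hub
            pass

pool = eventlet.GreenPool()
pool.spawn(heartbeat)
pool.spawn(disk_hog, "/path/to/large/file")  # hypothetical path
pool.waitall()
```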
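And the corresponding tuning, as a sketch of the Swift server config knobs being discussed; workers and max_clients are real Swift options, but the specific values are illustrative, not a recommendation:

```
# /etc/swift/proxy-server.conf (similar settings apply to the object,
# container, and account servers)
[DEFAULT]
workers = 16       # more worker processes, e.g. one per CPU core
max_clients = 64   # down from the default 1024 clients per worker
```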
So I think with that, we're pretty much out of time, so maybe one or two questions if anyone has any. Otherwise, we're certainly around afterwards. Thank you.