Well, welcome, everybody. To be honest, I didn't expect such a crowd, but I'm happy you're here. So this is everything you ever wanted to know about OpenStack at scale. Seriously, everything. No, but we're going to go through our views on OpenStack scalability and some basic scaling models and paradigms. And let's see here. Maybe I was supposed to explain this on the "why" slide. OK, right. So we're going to talk about basic scaling models. We're going to talk about how they apply to OpenStack, where they fail, where they succeed. But I just want to introduce you to the motivation of this talk. So I work with Mirantis full-time. I've worked with OpenStack for the last four years, Mirantis for the last two. And I think I have heard this question more than any other question about OpenStack: how does it scale? How do you go bigger? I think the other question is, how do you upgrade it? So we want to do a general overview, component by component, and then we will have some time for questions at the end. Let's see here. Here we go. So there's our overview. Three core sections: scaling models, core OpenStack services, and then some special buggers at the end that we'll make special mention of. I'm going to go ahead and turn it over to Randy. Hi, everyone. My name is Randy DeFow. I've worked at Mirantis for about a year now. My background is a little less on the system architecture side; Mike specializes in that, and he's run very large OpenStack clouds. I come a little bit more from the application and cloud solution side, but I have worked with a lot of different types of distributed systems over my career, including some of the big data systems and distributed consensus algorithms. So I found some of the scaling problems in OpenStack to be pretty interesting from that perspective. So before we jump into the component-by-component view, I just wanted to cover some general scaling models and paradigms, which I think help frame the problem.
I mean, one thing that's really interesting to me about OpenStack, versus systems I worked with in the past like Hadoop, is that OpenStack is a very complicated system, and there are so many different components that scale in different ways. And so one of the big challenges, I think, for those of you who maybe have worked with other systems in the past and are now looking at scaling OpenStack, is that you have to be really specific about which particular bottleneck you're trying to address. Because with OpenStack, what I found personally is that if you dive into a problem, you can quickly become confused looking at a lot of different possibilities. So you have to be really careful about identifying the actual specific component that's the bottleneck before you figure out the best way to solve it. And I can say that from somewhat painful experience, even over the last couple of weeks, as Mike knows. So anyway, one traditional scaling model is scaling up: just throwing more resources at the problem. And this is still relevant in OpenStack. Mike will talk a little bit about how this works, even for the MySQL or Postgres database that's used by OpenStack. And this, of course, works until you hit some limit that is just too hard to overcome. What's interesting is that you can think about this limit for a specific thing, like a MySQL database: you can only throw so much RAM at the problem, or you can only increase drive speed to a certain extent. But you can also look at it as a system as a whole. So if you start thinking about a particular component of OpenStack, say a Galera cluster, even though it is, in one sense, a distributed system, there's still a fundamental bottleneck you hit if you look at it as a system as a whole. And the example I always fall back on is distributed consensus algorithms like Paxos.
You reach a certain size, the quorum gets too big, and adding more nodes doesn't help you, because the overhead of the voting gets so heavy. And we actually see that in OpenStack. For instance, one customer I worked with tried to have a Keystone database that was distributed over many sites. Each one had a four-node Galera cluster. And they were talking about going to six sites: a 24-node Galera cluster, if you think of it as a single unit. It just starts to get too big. So trying to scale Galera beyond 15 nodes can be really difficult. The next general scaling model, of course, is the scale-out model. I always love the Grace Hopper quote, because I just have this picture in my head of a gigantic ox for some reason. For a lot of things in OpenStack, you can just try to add more pieces, more nodes, to the problem. So a lot of the OpenStack control services you can scale out by, say, running five Keystone controllers instead of three, if Keystone is a bottleneck for you. But again, looking at it as a system, the key thing you have to be aware of is that there's always a tax you have to pay. There's some level of coordination that you have to invest in to make those five Keystone instances work together as a coherent unit. And in OpenStack that's really important, because those overheads tend to become some key bottlenecks: things like RabbitMQ, which is the glue that holds a lot of these pieces together. Mike will talk more about that later. So again, when you're looking at scaling problems in OpenStack, it's really important to think about the difference between looking at components and looking at systems. Because by scaling out a particular component of OpenStack, you're actually scaling up the system as a whole, and you'll end up running into some of these bottlenecks like I just talked about.
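The arithmetic behind that Galera example is easy to sketch. This is a back-of-envelope model, not Galera's actual quorum or certification implementation; it just shows that the majority quorum and the per-commit replication fan-out both grow with cluster size:

```python
def quorum_size(nodes: int) -> int:
    # A majority quorum needs strictly more than half the members up.
    return nodes // 2 + 1

def writeset_copies(nodes: int) -> int:
    # Synchronous replication ships every write set to every other
    # member, so per-commit work grows linearly with cluster size.
    return nodes - 1

# One site: a 4-node cluster needs 3 members up for quorum.
# Six sites fused into one 24-node cluster: quorum is 13 members,
# and every commit has to reach 23 peers.
for n in (4, 24):
    print(n, quorum_size(n), writeset_copies(n))
```

The point is that the coordination cost is a property of the system as a whole, no matter how "distributed" each member looks in isolation.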
For instance, those of you who've worked in big data before know that Hadoop, as a distributed file system, can scale out to be fairly large. But even in Hadoop, you end up with things like the NameNode, or in some cases even the ResourceManager in YARN, becoming a new bottleneck you have to worry about. If you're not familiar with it, the NameNode is the part of the Hadoop distributed file system that knows where the different chunks of files are on the actual servers in the distributed file system. And this started out being a single instance, so it was definitely a scaling bottleneck: if you just kept adding more data nodes into your Hadoop cluster, it could become a real scaling issue. So the Hadoop community has started to look at things like sharding it, dividing the distributed file system into different namespaces. And you'll see that same pattern happen with some parts of OpenStack. For instance, even if you look at the data plane in OpenStack, you can keep adding more and more compute nodes, but at some point it starts to stress out things like the Nova scheduler. And so the OpenStack community is starting to look at things like federation and sharding. If you've ever heard of Nova cells, which is still a somewhat experimental feature but is going to become a lot more prominent in Mitaka and Newton, there was actually a really good talk on that a couple of days ago; you can look up the video. You'll see some of these same patterns being repeated now in OpenStack. So I just wanted to focus on one particular example that I think is interesting, because it ties into some of the buzz you'll hear about containers and how they may fit into the OpenStack picture.
So look at schedulers. I'm personally invested in this because I've been helping someone troubleshoot a potential scheduler bottleneck over the past few weeks. The Nova scheduler, if you look at it just in isolation, does have a scale-out model: you can add more Nova scheduler services running on more controller nodes. But from a system point of view, it's very much a centralized system. The Nova scheduler has to have global knowledge of the capabilities of the entire data plane, unless you are one of the early adopters of things like Nova cells. And not only that, but through the use of filters it has to have a lot of knowledge about what tenants may be asking of it. Tenants can pass in hints, like a flavor that's tied to specific host aggregate properties, or maybe it needs to know that a host has SR-IOV capabilities. Because it has to have this global view of pretty much everything about the system and everything tenants may be asking of it, the Nova scheduler ends up being a common scaling bottleneck. What we've seen in practice in our scale lab is that you can run the scheduler up to about 12,000 VMs on a couple hundred compute nodes. Depending on your configuration, your mileage may vary, but that's a ballpark figure for where we think the monolithic scheduler is going to become a real bottleneck. And it's interesting to compare that to some of the newer container-based schedulers. Apache Mesos is one I like to look at; it has a much different, federated scheduling model. There's a separation of concerns there, where the Mesos scheduler itself really doesn't want to know everything about the entire state of the cluster and everything tenants may be asking of it. It really just passes on resource offers from different parts of the cluster, and then application frameworks can choose to accept or reject those resource offers.
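To make the "global knowledge" point on the Nova side concrete, here is a toy model of the filter-scheduler pattern. The host fields and filter functions are illustrative, not Nova's real API; the thing to notice is that every filter consumes the full, current host list, which is why the scheduler needs an up-to-date view of the whole data plane:

```python
# Toy filter scheduler: each filter prunes the global host list.

def ram_filter(host, req):
    # Does the host have enough free RAM for the requested flavor?
    return host["free_ram_mb"] >= req["ram_mb"]

def aggregate_filter(host, req):
    # e.g. a flavor tied to hosts tagged with sriov=true.
    return all(host["properties"].get(k) == v
               for k, v in req.get("aggregate_props", {}).items())

def schedule(hosts, req, filters=(ram_filter, aggregate_filter)):
    candidates = hosts
    for f in filters:
        candidates = [h for h in candidates if f(h, req)]
    # Real Nova then weighs the survivors; here we just take the first.
    return candidates[0]["name"] if candidates else None

hosts = [
    {"name": "cmp1", "free_ram_mb": 2048, "properties": {}},
    {"name": "cmp2", "free_ram_mb": 8192, "properties": {"sriov": "true"}},
]
req = {"ram_mb": 4096, "aggregate_props": {"sriov": "true"}}
print(schedule(hosts, req))  # only cmp2 survives both filters
```

Every request walks the entire inventory, so the cost grows with the number of compute nodes times the number of filters, which is the centralization Mesos-style resource offers avoid.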
So in some ways, it's a much more lightweight scheduling model. I'm oversimplifying, but it is a useful comparison. What's really interesting about Mesos is that there are production systems known to run at 50,000 nodes, which to me is several times larger than what we would feel comfortable with for a single monolithic OpenStack data plane. And this is really interesting because some of the new container concepts floating around OpenStack, like Magnum, actually introduce this kind of separation of scheduling concerns. Magnum is a fairly new project, and I don't have a lot of extensive field experience with it, but projects like Magnum, or the Kubernetes packages in Murano, will actually use the Nova scheduler to carve off a chunk of your OpenStack cloud, directly or indirectly, orchestrated via Heat or something like that. They'll take a chunk of Nova and other resources and just make it available to one of these other schedulers. And then you can go and invoke kubectl if you're using Kubernetes, or Marathon if you're using Mesos, and use a container orchestration engine. So you get the separation of scheduling concerns. That's an interesting model. I think it's an open question operationally how well that works, and there are some other questions it raises. But anyway, these are some of the newer directions that you may see in OpenStack over the next few releases. So enough of the theoretical; let's get down to the nitty-gritty about the different OpenStack services. A lot of this information summarizes field knowledge that folks like Mike and some of the other solution and system architects at Mirantis have put together over the years. We'll start with Keystone. Keystone, of course, is a central service for OpenStack; pretty much every other service has to interact with Keystone to some extent.
Keystone, to me, is a pretty interesting system, because the Keystone API services are really easy to scale. They're active-active: behind a load balancer, you can scale out by just adding more Keystone controller nodes into the control plane. What's more interesting about Keystone is the authentication and authorization backends. And here, depending on your backend, you have a lot of different ways you can possibly scale up or scale out. For instance, if you're using internal Keystone, just keeping it in the same MySQL database the rest of the control plane uses, then it can scale out the same as your MySQL database. If you're using LDAP, I really think of that as a scale-up model, because LDAP is typically owned by some other department, and I make it their problem. But LDAP is a system that has proven it can run at pretty large scale. And then, for some of the newer deployments, particularly if you're running multi-region, you can actually have federated Keystone: local Keystones that defer back to a centralized Keystone for authentication. So there are a lot of different ways you can try to scale Keystone, depending on your deployment model. And with that, I'll turn it over to Mike, and he's going to walk through some of the other services. Thanks, Randy. So Nova is an interesting case. I think pretty much every OpenStack deployment out there probably uses Nova as one of the core services. We've talked a little bit about the scheduler. Nova has been deployed at scale at different locations; we have some examples: Acorn, Bluehost, CERN, Rackspace. That's primarily because it has this idea of application-level sharding, which we'll call cells. So Cells v1 is not for everyone, and not recommended unless you have a team of experts to help you out. But Cells v2, which is going to be implemented in Newton, will become the way that Nova works. You will always have a federated model that you can fall back on.
So there's real good progress in Nova. Very quickly, I'm going to go over the components. This is an intermediate-level presentation, so I'm not going to go into great depth, but we'll go over them and talk about their roles and how they scale. So the API service in Nova is a pretty typical OpenStack service. You'll see an API for almost all OpenStack services, and they all kind of look like this. They're launched via Apache or Nginx with mod_wsgi, or you can use the WSGI launcher directly. You just spin up more workers, more workers on more nodes, scale that out, put a load balancer in front of it, and it's a fairly horizontally scaled little service. It's pretty easy. The scheduler, again, we've talked about. It has this worldview, right? It wants to do an optimal placement, so it needs to walk through the entire database, and if you think of compute nodes and their capabilities as a graph, it's a kind of complex thing that it does. But that aside, I think the real problem we see with the scheduler right now is just that this dataset is held in the MySQL database. So we get lock contention, and we sometimes get erratic performance, just because everything else in OpenStack uses MySQL too. So it is a horizontally scalable service, but in practice it's really a scale-up thing for those of us running OpenStack in production today. We can throw a strong CPU at it. We can make sure that our DB is really big. Maybe we can use some of the read-slave functionality, or some sort of Galera sharding in some middleware, to make sure that you hit an idle DB. But yeah, we have some problems in the implementation that we're working on fixing. Conductor, if you're familiar with the conductor service, was introduced in Grizzly. Conductor is an orchestrator. It does do some transactional operations, but the heavy lifting of conductor is really proxying a bunch of DB calls from Nova compute nodes.
So anything that your compute nodes ask for is going through conductor. And by the way, conductor rides over the message bus just like everything else. So this is a major thing that we see problems with. There is a sizable amount of traffic that goes on here. Even in a baseline cloud that's doing nothing, you have periodic tasks running, you have resources reporting in, and so this thing is always handling a lot of traffic. So we do see issues with this, though that's more related to our RabbitMQ issues than to the service itself. The service is horizontally scalable. You can blast it out to lots of nodes, lots of threads. It's a good design there. So moving on to Neutron. Neutron is kind of the exception among the core services, in that its API service doesn't just handle RESTful operations; it also handles some intra-component communication via the message bus. But Neutron server is horizontally scalable. You can run a bunch of them, and it's been that way since Havana or so, maybe a little post-Havana. And then of course we have the common agents here. We're going to have the L3 agent that handles routing. If you're running DVR, of course, you're in a fully distributed mode. But there's also a federated mode, where you can have L3 agents running on multiple parts of your network, on multiple network nodes. And then the tenant routers and all those things that get scheduled by the L3 agents will end up on different nodes, which distributes the load. The one gotcha here is that the placement capability is not that great. It's advancing, but it's fairly rudimentary at this point. The L2 agent is what's responsible for doing all the wiring, the virtual wiring, so to speak. And it's a scale-up model, I suppose. Maybe it would be a problem if we were networking thousands of containers on a single compute node, but right now it's not really a big deal. The DHCP agent, again, you can run in a fully distributed fashion.
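For reference, the distributed-versus-centralized routing choice just described shows up as a single option in the L3 agent's configuration. These are the stock upstream values for l3_agent.ini (which value goes on which node depends on your deployment):

```ini
[DEFAULT]
# On compute nodes, route east-west and floating-IP traffic locally:
agent_mode = dvr

# On network nodes that still centralize SNAT:
# agent_mode = dvr_snat

# Classic centralized routing on network nodes:
# agent_mode = legacy
```

In legacy mode, all tenant routers land on the network nodes the L3 agents run on, which is where the placement limitations Mike mentions start to matter.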
Cinder. So we have a volume service, which again has a scale-out model. The caveat is: what is your storage backend? Do you have a storage backend? Is it Ceph? Is it some sort of NAS? Is it LVM? These are all going to have different behaviors. The Cinder service itself, though, you can run as multiple services; it's kind of a multi-worker model. For example, I have a buddy who runs Ceph at a fairly large scale, at a company next door to mine. And he was having some problems with Ceph. He kept noticing that the Cinder volume worker was blocking, and he came to me and asked, you know, what do I do? Let's debug this. Well, we went into it, and yeah, it was one of those eventlet problems where a C library was doing a blocking call. Eventlet didn't know about it, so the whole thread was stuck. But the solution was just to run more Cinder volume services. So we would run like 12, and everything was fine. The scheduler has a similar problem, but not as big a problem domain, so we tend not to see problems with it in production. Backup, again, can be a service that handles a little bit more throughput. So if you're using it in your deployment, just make sure you have lots of RAM and that it's spread out the way it should be. The API is horizontally scalable. Okay, Swift. Swift is kind of a Dynamo-like (I shouldn't say Dynamo-inspired) object storage system. It has a nice DHT model, a nice scale-out model. To handle front-end traffic, you have a proxy service, and this proxy service can be distributed over a wide range of servers and load balanced. And then you have container, account, and object services: container and account are responsible for the mappings, and object for the actual storage. And it's a pretty simple model, right?
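The core property of that DHT model can be shown with a miniature consistent-hash ring. This is nothing like Swift's real partitioned, replicated ring builder; it's just enough to show that changing the cluster remaps a fraction of the objects rather than reshuffling everything:

```python
import hashlib
from bisect import bisect

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=64):
    # Each node gets many virtual points on the ring for even spread.
    return sorted((_hash(f"{n}-{i}"), n)
                  for n in nodes for i in range(vnodes))

def locate(ring, key):
    # A key belongs to the first node point clockwise from its hash.
    hashes = [h for h, _ in ring]
    return ring[bisect(hashes, _hash(key)) % len(ring)][1]

keys = [f"object-{i}" for i in range(1000)]
before = build_ring(["node1", "node2", "node3"])
after = build_ring(["node1", "node2", "node3", "node4"])
moved = sum(locate(before, k) != locate(after, k) for k in keys)
# Roughly a quarter of the keys move, and only onto the new node.
print(moved)
```

Every key that changes owner moves to node4; the rest stay put, which is exactly the "remove buckets and add buckets" rebalancing cost described next.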
The catch here is that when you're adding more storage nodes to your cluster, or replacing nodes or hard drives, or whenever you mess with your cluster, you're essentially removing buckets from and adding buckets to the distributed hash table. The placement algorithm tries to minimize this, but you will experience rebalancing when that happens. Glance, this is our image service. The API, again, is one of those services you can run across multiple nodes. It has a little caveat: there's lots of throughput, since images can be gigs and gigs, depending on what your workload is. So what's important is that on these boxes, you need large amounts of memory. You want to grab an image once and then keep it in the Linux kernel page cache. If you have 256 gigs of memory and your 12 most common images are in the page cache, you can serve those straight to the network card. So yeah, that's kind of the only thing I would say about that. The registry is a really lightweight service; just run a bunch of those. So now we get to my favorites, the special mentions. When I deployed my first OpenStack cloud, I ran into some MySQL problems. Looking back on the experience, I believe they're mostly implementation issues. We have some concurrency issues, and we don't have the best queries going to those databases. But when we talk about MySQL as a system, Randy already mentioned that a relational database is difficult to scale. It has a point of diminishing returns. When you're sending writes to all members of a cluster, eventually that write burden is just so large that it doesn't make sense. So we do see sort of an end to that scalability model. But for the OpenStack deployments of today, it's just fine. You use Galera, or you use MySQL with standard asynchronous slaves. If you're using Galera, use it in the active setup, load balance those queries, and you're going to be fine.
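As one sketch of that load-balanced Galera setup, an HAProxy stanza that sends writes to a single node and keeps the others as hot standbys might look like the following. The IPs are made up, and the health check on port 9200 assumes a clustercheck-style monitor agent, which is a common convention rather than the only way to do it:

```
listen mysql-galera
    bind 10.0.0.10:3306
    mode tcp
    option httpchk                        # pair with a clustercheck agent
    server db1 10.0.0.11:3306 check port 9200
    server db2 10.0.0.12:3306 check port 9200 backup
    server db3 10.0.0.13:3306 check port 9200 backup
```

Read-heavy traffic can then be pointed at a second frontend that balances across all members, which is the read/write split that comes up again in the Q&A.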
But yeah, OpenStack's mission statement is that we want to be massively scalable. And massively scalable to me, well, AT&T says they're going to go to 300,000 nodes. That to me is massively scalable. So this may be something that changes in the future. Excuse me. Okay, now RabbitMQ. This is my ultimate favorite; Randy's going to be sick of hearing this. So RabbitMQ was a quaint choice when we were first developing OpenStack and we were playing around and wanted to make something cool. It really is at the root of a huge percentage of problems in OpenStack deployments today in production. What we've observed is that when you do the fully HA queues, active-active, like a three-member cluster where you're replicating all messages across all members of the cluster, it's just not workable once you get above about 200 compute nodes on a busy cloud. So let's say you don't do that, you don't replicate messages and just have an active-passive setup. Then you can get up to about 500 to 700 nodes, but again, you're going to run into the same wall. And I'm not blaming Rabbit. There are multiple reasons why this happens. My view is that we shouldn't have a broker in the middle of a massively distributed system. There's no need for a broker. Direct communication, federated communication, is the way to go. And that's changing. Oslo has a fairly mature ZeroMQ driver that came out in Mitaka. It's mature in terms of development; it hasn't been used in production, but it has promise. So, all right, conclusions from this talk. Here are the tactical things. Understand that you have limits on performance. Understand that you have limits on scalability. You need to match these carefully with your application's needs, and you need to figure out where the give and take is, right? What are your SLAs, and what are the real SLAs? Having a performance pipeline is invaluable.
Since we have CI in the gate, and typically when we do development on OpenStack we have CI in our local OpenStack, we don't see things like syntax errors or a lot of the easier problems happening. What we see are performance regressions and feature regressions. So that pipeline is really important. And then, if you want to have an easy time deploying OpenStack at scale right now, your option is kind of to use small clouds at this point. That's the easy way to do it. Nova cells could do it, but you need a team for that. Like I said, you need to learn to identify your bottlenecks, learn to troubleshoot OpenStack, and learn what the troublesome components are: MySQL and RabbitMQ. But we see improvements coming. Nova Cells v2 is coming, which is really exciting, along with the messaging improvements that I already mentioned. And just to wrap up, I guess, with some longer-term thoughts and a more strategic direction on scalability in OpenStack. Like Mike said, the mission statement of OpenStack is to be a massively scalable cloud. And I think what we're seeing is that you can reliably run a couple hundred nodes in a production environment. A lot of times people ask me: Mirantis has a distribution, of course, so what does our scale lab certify? Our scale lab tests up to 200 compute nodes at a density of 60 VMs per node, so 12,000 VMs. But that's a very specific configuration, and it's not exactly a realistic runtime environment. The Mirantis reference architecture makes a lot of assumptions. It has a lot of opinions about how to deploy things. It doesn't use SDN, for instance. So your mileage may vary depending on your configuration. And going back to the comment about the CI/CD pipeline: the performance-testing tools in the OpenStack community, Rally and Shaker, are very good test suites, but they don't always run a representative set of performance workloads.
So for example, if you're a service provider and your workloads tend to be VNFs, that's a very different performance profile from what most people would test in the lab with tools like Rally and Shaker. VNF workloads are going to be extremely network-intensive, so you may be talking about using things like DPDK or SR-IOV, and they will likely be less intensive on other parts of the system compared to a general-purpose IT workload. So I think it's useful to have a distribution with a scale lab that says: under this type of environment, this is what you can scale out to. But it is so important to do your own testing with your own kinds of workloads. And again, another comment Mike made: understand what your real performance requirements are. Is it important to be able to sustain 100 tenants booting VMs at the same time? Is it okay if that is slow, or do you really need it to respond in X number of minutes or X number of seconds? So if you're going to try to push beyond that comfort zone of 50 to a couple hundred compute nodes, it's very important to do some investment in this area on your own. Because, going back to the more strategic points here, we know that beyond that comfort zone there are some known bottlenecks, and they are being addressed. But there are some fundamental questions about how to really get to the next level of massively scalable OpenStack clouds. There are some of those out there, CERN, Bluehost, a few others, but they're all very heavily custom-engineered. And what we're seeing now is that the trend is not toward investing a huge amount in one heavily custom-engineered cloud. It's people looking to roll out a large number of clouds in different geographies to serve their customer base. And if you're rolling out 50 clouds or 100 clouds, you can't custom-engineer every one of them.
So there are some real fundamental questions that we're going to keep working on with the community, and I think everybody in the community needs to participate in these discussions on how to get to that next level of very predictable scalability. So, just a few comments on what may be coming beyond the next release cycle. Mike mentioned there's Nova Cells v2 coming in Newton, and some messaging improvements. But beyond that, I think it's really interesting, with all the buzz about containers: how might containers change this picture? On the data plane, container workloads obviously have a lot of advantages for the application developer or the tenant VNF developer, because containers are easy to scale out and have a lot of other nice properties. But from an OpenStack performance perspective, if you're using an alternate scheduler like Kubernetes or Mesos, containers could potentially take some of the pressure off some of the known OpenStack bottlenecks, like the Nova scheduler and RabbitMQ. If those components only know about VMs, and there's a second scheduler handling the greater number of containers, that's an interesting model. And it'll be interesting to see how it plays out as more people start doing things like that in production. But it poses a whole new set of questions. For instance, if you've invested a lot of effort in monitoring your data plane, collecting logs, collecting performance data from Ceilometer, or using tools like StackLight to get some metrics out of it, that whole picture changes when containers are in the mix. Some projects like Magnum have some interfaces with Ceilometer, but they collect a different set of metrics. And if you're orchestrating, say, Kubernetes through Murano, you don't even get that much. So it poses a lot of different operational questions to think about.
In terms of the underlay and the control plane, there's a lot of interest in containerized control planes, and that will certainly help scalability in the sense that if you want to deploy five Keystone controllers instead of three, it's really easy to do that with containers: just spin up a couple of new containers. But fundamentally, there are parts of the control plane that aren't going to fit that microservices scaling model: MySQL, RabbitMQ, things like that, and Neutron L3 agents, unless you're using an SDN that's more of a microservices-compatible model. And again, there may be more of a direction toward federation: Nova cells to tier the scheduler, the message bus, and the database, and some other models like that. So that's another interesting area to look at. And again, I think we as a community need to really put some effort into this area, because people who adopt OpenStack aren't necessarily going to want to deploy a lot of small clouds. In some cases that makes sense, but then you need something on top of it, something that can orchestrate the things that should be in common: if you have images that have to be shared, or if you want to collect metrics across all your clouds. So that's a choice that some people make. There was a talk, I think yesterday, about how to deploy a large number of small clouds, but then you have to invest in that management layer on top. So if you'd rather go with the one-big-cloud model, or a handful of really big clouds, as a community I think we need to start answering that question more. There are a lot of thoughts and a lot of ideas here, but there's nothing concrete beyond things like Nova cells and some of the messaging improvements. There are more questions than concrete answers right now. So with that, I think we've got time for a few questions. Mike will answer all the hard ones. If you have questions, please step up to the mic.
Is there any tuning that you recommend on the database? I've heard, like on the Nova side, there are records that could be cleaned up? There are. So the queries tend to be very join-heavy. So, like, your sort buffer: you want to have the right packet size to be able to track and transfer all that data. Those are the main ones that I would pay attention to. Also, InnoDB write concurrency and I/O threads; these are things that tend not to be tuned by default, and they're pretty important. Yeah, and I think Mike mentioned this in his talk too, but with MySQL sometimes, if you're using regular monitoring tools, you may find that your database is still a bottleneck even if the database nodes aren't running hot. In that case, it could be a lock contention problem. So one thing I think I've learned is to have a good DBA who knows how to troubleshoot MySQL. That would be really helpful. I should mention something else. I gave a talk in Portland a few years ago, and another way to just get rid of this is to distribute your reads and writes. The only native way to do that in OpenStack today is in Nova. It's actually code that I wrote, so I'm going to promote it. So yeah, you can use a read slave and offload some of those reads to a slave cluster, and that'll limit your contention a bit. Yeah, and if you're a little bit more adventurous, I think with some custom health monitors in HAProxy, you can distribute reads across all the members of an active-passive Galera cluster and still have writes go to a single master. More questions? We covered everything. Very comprehensive. Okay, you've been a great audience. Thanks very much. Thank you very much.