All right, that's kind of a nerve-wracking way to start a presentation. So anyway, just to introduce myself, I'm Mike Wilson. Some of you might know me. Most of you probably don't. I work for Mirantis, and previously I was at a nice place called Bluehost, where I did a cool deployment of OpenStack at scale. Some of my colleagues from Bluehost are here in the audience lending me their laptops and stuff. So hats off to them, thank you guys. Even though I'm a traitor. So I wanna start out this talk by telling a little story, and it has to do with deploying at scale. It will give you the big picture for why I'm even giving this presentation, why this is even a relevant topic. In my first at-scale deployment, we had to spin up on the order of tens of thousands of compute nodes for the particular thing that we were doing. You can go look at my Portland talk if you wanna hear more about that. But one of the things that surprised us is that as we started spinning everything up and got to a couple thousand nodes, and these were just nova-compute services and kind of the bare bones of an OpenStack installation, nova-compute, Neutron, Keystone, MySQL just started crawling. It couldn't handle any kind of queries. Our API slowed down. It was just completely unworkable. So our first shot at scaling the database, the central piece that was so slow for all these OpenStack services, was adding more hardware. We found the biggest box that we could out in our server racks. Give it more cores, give it more memory, let's give it SSDs, let's see how that goes. And that got us roughly double our capacity. So we got up to about 4,000 or 5,000 nodes, and we had the same crawling, horrible, MySQL-not-responding problem. So we took a little different tack this time, and we started optimizing MySQL settings. And that got us to about 10,000 or 11,000 nodes. And then we saw the same bad behavior that we had before.
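(For context, the kind of knob-turning involved at that stage looks roughly like the my.cnf fragment below. These values are purely illustrative, not settings from the talk; tune for your own hardware and workload.)

```ini
# Illustrative my.cnf fragment only -- values depend on your hardware.
[mysqld]
innodb_buffer_pool_size        = 64G   # keep the hot working set in RAM
innodb_flush_log_at_trx_commit = 2     # trade some durability for throughput
innodb_io_capacity             = 4000  # let InnoDB actually use the SSDs
max_connections                = 4096  # thousands of nova-compute workers
```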
So we stopped doing the wrong thing, and we started analyzing our approach and deciding, well, we need a fundamental change of direction. We need a place to offload this workload. We need to be able to scale it horizontally and reliably. By the way, at this point we really didn't have any API users. The only load being generated was periodic tasks, the baseline load that you would have in any OpenStack installation. So what we did is we took a bunch of that traffic that was being generated by periodic tasks and we offloaded it to replication slaves. All right, so there's the big picture. Ever since that happened, I've wanted to get this into upstream OpenStack, and in the Icehouse release I was successful at that. And I've wanted to share this experience with the community at large. I've been thinking about this a lot. It's been occupying my mind for months now. I'm talking to my wife about it, talking to my colleagues about it. And I think I was maybe in a furniture store or something, and I came up with this really succinct, pictographic representation of what you need to do. That was a joke. Anyway, if you haven't been by the Mirantis booth, it was kind of inspired by our awesome swag. So here are your simple instructions on how to scale the DB layer for your OpenStack service. We start out with some tools. We have some instructions along the way. We end up with a nice product at the end. We'll go through each of those steps. So first of all, one thing that OpenStack services have in common, for the most part, with a few exceptions, is a SQL backend. This is generally MySQL or Postgres. Like I said, there are exceptions to this. I'm not gonna talk so much about Postgres today because I just don't have a ton of experience with it. I know that some of you out there are Postgres zealots.
Feel comforted in your superiority, or at least your analogous functionality with the MySQL things that I'll talk about. Pgpool, anyone? You can feel superior. It's okay. So this picture here is kind of what we do in proofs of concept for OpenStack. We spin up a single database instance and make sure that everything works. In production, we do not wanna go it alone. We wanna have database replicas. We wanna have help. Why? The first reason is availability. No one likes it when you lose your database and you lose your cloud. That's no fun. So right away we need some sort of disaster recovery strategy. Also, nobody likes downtime, so we need high availability. We need active-active or active-passive setups, and this requires us to replicate our database in one fashion or another. Another reason is scalability. Like I said, with just baseline load, not doing anything other than spinning up nodes and installing nova-compute and a few other things on them, our baseline load was killing our database, just because of periodic tasks. Now, I know we can go in and optimize our periodic tasks, or turn them off, or go into the code and optimize the queries and so on. We didn't really have time to do that. So anyway, that's one use case for scalability at the DB layer. Also, since I've been at Mirantis, I've gotten acquainted with some of our customers that don't really have a lot of nodes, way less than thousands, but they have really busy APIs. And those APIs contend on database resources. Again, I know that the database is not the be-all and end-all of making APIs fast, but it's just another example. Another reason you might want to replicate your database is performance. You may have an application that is okay with running your database on some seven-year-old hardware that's also running Nagios and serving something via Apache Tomcat, and it's on 5,400 RPM spindles.
Maybe that's fine for your application. And by the way, that costs way less money than buying new hardware, right? So that's cool. We want to be able to support that use case. And we also want to support the use case where round-trip time is the metric, where it's how we recognize success in our deployment. For that requirement, we can deploy a separate, super nice hardware class, send our workload to that super nice hardware, and enjoy the benefits. Reliability. I don't think I've named this point correctly, but I'll try to explain it well. Some of you out here are probably software developers, and some of you are probably operations-oriented. So I'll tell a story that happens quite a lot. You're a software developer. You've been given some requirements, and you write a beautiful piece of software. My goodness, it's gorgeous. It's elegant. It works so well. You're so proud of it. You turn it over to operations and they go: oh, it's really complicated. It's hard to use. It's hard to monitor. It's hard to configure. And I don't understand what's going on. This is the reliability, ease-of-use, not-a-pain-in-the-butt factor that I'm talking about here. We want things to just work, with minimal operational pain. This may not be obvious right away, but we'll make the connection for why replicas help out with this. There are probably other reasons; these are just a few that I came up with. Now let's move on to the moving parts, how we make it happen. By the way, if you guys have time, you really need to Google IKEA parodies. There are some really talented people out there. Hats off to whoever came up with the Rectabular Exclusion Bracket. And the trichometric indicator support. Yeah, these are brilliant. So I put these complicated parts up here for a reason. Like, we're looking at this guy right here.
And it looks like it has three places where we can stick it into that other part. But does it really? I mean, whoa, am I really seeing what I think I'm seeing? I don't think that's actually possible. And these ambihelical hex nuts, these are really cool too, but I'm not sure they can actually screw on. Right? So there are lots of moving parts in scaling out a relational database. I wanna talk about three strategies that are commonly employed. Sharding doesn't necessarily have anything to do with the database in and of itself, but it does address some of our problems of performance, reliability, and scalability, so we'll talk about it. With sharding, especially when you have strongly relational data, you generally end up moving the sharding logic into your app. And that can be hard to do when you didn't design your app with that in mind in the first place. We have an example of this in an OpenStack project, in Nova. I don't want to make anybody mad by saying this, but the Cells API was kind of an afterthought. It's kind of a bolt-on, and it's a way to shard Nova installations. And it's elegant, it's pretty freaking cool, and it accomplishes its goal of handling these scalability issues. The downsides: it's a bit more complicated, there's a bit more management involved, and it's harder to develop features for. The point I'm getting at is that it's just not transparent to the application. It's ideal to do this when you're writing the application, not after the fact. So yeah, these are just my opinions on sharding. Let's move on to the next model, which is asynchronous replication. There are some pros and cons to that. Your data is just kind of fired off, and the master doesn't try to get any confirmation that it was ever written. That's the point of async, right? It's fast and unreliable. And there are some misnomers out there in the MySQL community.
They have this thing called semi-synchronous replication. That's just a lie. It's synchronous replication with some caveats. Don't pay attention to that; brush it aside. Another con of asynchronous replication is that it's very sensitive to operational entropy and external events. Now, I know some of you out there do this. Please don't do this. But I know you already have, and I know you will in the future, because it's just the way things are. You set up a database slave, and it replicates from the master, and then someone comes up with the brilliant idea of: why don't we put another database on that slave? Or why don't we use this slave to do dumps so that we can do backups? Or why don't we run humongous, locking report queries that take three minutes to complete? Yeah, I see some guilty faces out there. We know that because of the single-threaded nature of replication in base MySQL, that's a problem. You kind of need to set those slaves aside and use them only for what they're meant for. There are some ways to speed that up, to make that better. Percona has some cache-warming type stuff that will help you out. MariaDB has a parallel replication model, which is pretty cool. But yeah, it's something to take into account. Now, some of the advantages of asynchronous replication: it's very simple, it's very performant, it's very easy to understand. I also think one big advantage is that it's not surprising. Everyone that's ever dealt with databases across the industry has dealt with this asynchronous replication scenario at one time or another. So you have lots of people that know how to use it, and there are lots of tools built around it. In general, it makes operations very easy and very familiar for a lot of folks. Let's talk about synchronous replication. I've kind of mixed synchronous replication together with clustering. I'm sorry, bad me. But I feel like they deserve a discussion together. So, a big advantage of synchronous replication: commits are guaranteed.
I can read and write from any node at any time. The data that is supposed to be there, that my application has been guaranteed is there, is actually going to be there. Causality, I guess, is the fancy word for that. It's not single-threaded. I can have multiple readers. I can have multiple writers. No waiting in line. And it's transparent to the application. Hooray, my application does not have to care about the persistence of relational data. It hands it off to the database and says: that's your problem to scale, go deal with it. Some of the cons. Galera is kind of the only, slash most popular, way to do clustering and synchronous replication in the MySQL world. It requires a little bit more package maintenance. You need a wsrep patch. I think there are packages out there, but you kind of want the latest and most up to date. The Galera certification-based replication model is exactly what it implies: it's a certification that things are gonna work across my synchronous cluster. But that certification can fail, and sometimes in surprising ways, and if you have enough of these certification failures, your cluster is FUBARed, right? I mean, this is a way that Galera kind of reaches its hands into your application. Just be aware of that. Also, split brain is kind of an operations nightmare for anybody that's dealt with clusters and quorum. The split brain problem is not fun to solve. You should also be aware that if you're counting on a synchronous Galera replicating cluster, your writes are only gonna be as fast as your slowest node. Or another way to say that: if you have a slow node in your cluster, everything is slowed down. Also, writes are gonna be writes times n, where n is the number of nodes in your cluster. All right, I'm pointing these out, but I don't want to make this sound bad. This is just the reality of synchronization points. I think we're all familiar with that. In general, Galera is freaking awesome.
Clustering is freaking awesome. But I just wanna point out some of the pros and cons of all these approaches. I'm not gonna claim to have done a thorough job of covering everything that can be done, but I want to take all of these, what I would call imperfect approaches, and see how we can marry them and use them as a scaling strategy going forward for our OpenStack services. So this is the result of all the work that I've done. I have a shelf. No, no, this is a metaphor. But I actually have a way to put things into compartments, which is really cool. I can have a compartment for writes. I can have a compartment for reads. And, doing something surprising here, I can have a compartment for the API. We'll go into this a little bit later. I just wanna explain the work that I've done in Nova today, what actually works as of the Icehouse release. I talked earlier about those periodic tasks, how they generate a lot of load and they're kind of heavy on the system, and maybe you don't wanna go in and turn them off and on and tune them around. I think there are 15, 16, 17 or so periodic tasks that run in Nova, and there are five left that don't send their reads to the slave. I intend to finish that in Juno. But for the most part, yeah, your reads will go to a slave if you have one configured. So that's really cool. And then we have this part of the configuration that we're all familiar with: in our nova.conf, we specify a connection to a SQL host. This guy is supposed to have a replication relationship with another handle that we've called the slave connection handle. So you have that guy in your configuration, and your periodic tasks are gonna send their load off to wherever you've pointed it. I've got this imbalance here, this little balance scale going on. This guy is still doing the majority of the action, by far.
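In nova.conf terms, that pairing looks roughly like this (hostnames and credentials are placeholders; `slave_connection` is the Icehouse-era option name alongside the familiar `connection`):

```ini
[database]
# Normal read/write handle -- the master everything defaults to.
connection = mysql://nova:secret@db-master/nova
# Optional read-only handle -- an asynchronous replication slave.
# Periodic tasks that tolerate slightly stale data prefer this one.
slave_connection = mysql://nova:secret@db-slave/nova
```

If `slave_connection` is left unset, everything simply falls back to the master handle.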
Like I said, in my scale deployment this was about a third of the load, but my scale deployment is kind of unique. It's kind of a different beast from what you guys probably have. So before I move on to the next slide, do we have any questions at this point? I just wanna answer any of those. Okay. So, all right. This really isn't a perfect solution for scaling going forward; I know that. I wanna talk about what we could do with this compartmentalizing of workloads going forward, and what the advantages and disadvantages of this approach could be. So these two guys up here, oh excuse me, this guy and this guy, would be the write handle, or the normal handle that we're all familiar with today, and this new slave handle that was added in Icehouse. There's a large case for something that I've called causal reads, which I totally just made up and probably makes no sense at all, but I'll explain what it means. Let's say, for example, you create an instance in Nova and you're using Neutron, not nova-network, and you don't give it a port, you give it a network. One of the things that Nova will do is call the Neutron API and create a port right away, right? And then in a later stage of provisioning, it builds a sort of network information structure, and so it asks the Neutron API for information. If you pointed the Neutron API at this asynchronous replication cluster here, there's a possibility that you'll write the information through the Neutron API and it'll get committed to the database, and then you'll read it from a slave and it won't be there, and things won't work right. Your instance will spin up with no network. I use this use case because this actually happened to me. So there are a couple of approaches. I could just retry a bunch and make that work. I don't think that's optimal.
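To make it concrete, the retry band-aid amounts to something like the hypothetical sketch below: poll the slave until the row we just wrote shows up, or give up. The hidden sleep loop sitting in the provisioning path is exactly why it's unappealing.

```python
# Hypothetical sketch of retrying a stale async-replica read.
# fetch() returns None until the write has replicated to the slave.
import time

def read_with_retry(fetch, retries=5, delay=0.2):
    for _ in range(retries):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay)  # block provisioning while the slave catches up
    raise LookupError("replica never caught up")
```

Every caller pays the replication-lag tax in wall-clock time, and a lagging slave turns into a failed boot anyway.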
I think what's optimal is being able to further separate the write workload, optimize a hardware class for that workload, and separate out any reads that are causal in nature, that need the data to be exactly as you've written it, and put those on a separate cluster, giving the application a separate way to send its requests there. And then we can still end up with this asynchronous replication guy. This gives some advantages, obviously. You can separate the workloads. There's a cost advantage too: these asynchronous replication slaves tend to be super cheap, and you can just have a dozen of them and throw them out there as you need them. So that's a cost advantage, rather than having one big, expensive database cluster. And you can keep this cluster, the synchronous relationship between a master and a synchronous slave, pretty small, which is good. You have less chance of running over other transactions, you're doing fewer writes overall, and you're buying less hardware. And this fits the availability requirements that a lot of people have fairly nicely. If this guy fails, we move over to this guy, right? And then we pick a slave from here, make sure that he's up to date, and promote him to be the new API handle. So I don't think this is perfect, but I think this kind of tiered replication approach, combined with some optimization and some not-so-abusive use of our database APIs and our services across OpenStack, allows us to scale, maybe not to what the charter wants, which is millions of nodes, but hundreds of thousands is very doable. So, oh, look at that. I skipped a bunch of these. Bad me. Any questions? Can you just say it in the mic so it gets recorded for the benefit of posterity? Where is the decision made whether you're gonna call the asynchronous instance or the fully synchronized instance? Is that done in the API layer? And how does that work?
Does it first call one, and if it doesn't get what it expects, it calls the other one? How does that work? So, the decision is made by the application. For example, in Nova, these periodic reads basically know that if a slave connection is configured, they prefer to send their reads there. It's just a little Boolean flag that we pass on to the object layer. And the object layer knows: oh, okay, I'll pass this through the database API, and the database API is gonna use this other handle. So it's very context dependent for the asynchronous case. For the other case, where reads have to be up to date and in sync, it's actually much easier to do. We can intercept any kind of read at the database driver itself and just make the decision. We can say: well, this didn't go to the slave handle, so it must be going to our all-other-reads handle. Does that answer your question? So my question is, can you please go back to the previous slide? Certainly. So this node that is labeled causal reads, it seems like the most loaded one, because it receives writes from the all-writes node and it is read by this async replication node. So does it actually help to split the workload, given that you still have one node that is the most loaded, and that becomes the bottleneck? So that is a really fair question, and that's up for debate. I don't really have empirical data to say yes or no, but I'll give you what I can from my experience in an at-scale environment. There was kind of an unreasonable jump from being stuck at 12,000 nodes and being really slow to, as soon as we moved reads off, being able to scale horizontally. Right now that environment is at 20,000 nodes or something, and it's just fine. I attribute that to contention: lock contention, row contention, table contention, other stuff like that.
I also attribute it to the fact that, in this environment, there are no reads going to that write master. There are no reads at all. It's completely optimized for a write workload. Yeah, but it is read by the async replication. Right, yeah. So again, my gut feeling is that yes, separating the workloads allows you to optimize the MySQL instance here for a write-heavy workload and the MySQL instance here for a read-heavy workload. But it's up for grabs; I still can't say that definitively. If you're using replicas that are synchronous, then you can actually load balance the API calls across those replicas. So what if you don't use Galera? Semi-synchronous is actually synchronous, right? That's another way to look at this, and there are other synchronous slaving models out there. Well, with semi-sync replication, the semi-sync is on the binlog, so you're still single-threaded, right? Right, yeah. But if you're using Galera for the synchronous replication side, you would typically have at least three nodes, because you want to avoid that split-brain situation. Sure, yep. So you'd have your quorum, but then that allows you to load balance across multiple synced nodes. I've seen this where you've got three or four nodes, however many, with the async slaves hanging off of those, and that tends to work, and it solves the problem that you just mentioned. Okay. Cool. Just curious what kind of experience you have with synchronous replication as far as quorum sizes. Are we talking three, five, or larger, maybe relative to the number of nodes in the cluster? So I'm fairly new to the synchronous side of things. My limited experience is a quorum size of around five; that tends to be the average. I don't know exactly, but I think that holds for hundreds of compute nodes. I don't know enough about the threshold where we would consider a different setup. So for Nova, I know the workload pretty well.
There's a presentation that I'm gonna refer you to here that does a really good breakdown of the workload for each service. Nova is more reads than writes, I would say about 60/40. I don't have anything in my head for the other services, but check out this presentation, because this is actually Jay Pipes, and he's done a really good job of breaking that down. Yep. So, conceptually, the way I'm envisioning this is that I don't care about it inside of OpenStack services. I delegate that responsibility to the load balancer, or the VIP, or whatever you're using to distribute load. It should be doing health checks. Maybe it's your operations team sitting there watching Nagios and going: wow, the slave's really far behind, shut it down. But I'm not taking responsibility for that in a service. Yeah. So, like I said, at my first deployment, all reads go to a slave and all writes go to another handle. And we ran into that Neutron use case that I talked about; that's the significant one. After we did that, we went through and looked at the APIs that we exercised, and we decided based on context: yeah, this one is good to keep sending to asynchronous slaves, or no, this one has got to be up to date. But that deployment is a special case. The API is special. It's not used quite like we would expect it to be used. Another question that I have is failover and disaster recovery. Do you have some automation for this, or what do you do if any of these nodes go down? So, as far as async replication, there are lots of tools built around providing HA: promoting a slave, figuring out which one is the most up to date and making that guy the master. With this little tiered model that I'm toying around with, you'd probably be able to mix that with the existing tool set that exists for clustering.
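As a sketch of the decision those promotion tools automate: given each slave's replicated binlog position, promote the one that has applied the most of the failed master's log. This is a hypothetical simplification with made-up names; real tools also fetch missing events and re-point the surviving slaves at the winner.

```python
# Hypothetical sketch: pick the most up-to-date slave to promote.
# Each slave reports (binlog_file_index, binlog_position); tuple
# comparison means "further along in a later file wins".
def choose_new_master(slaves):
    return max(slaves, key=lambda name: slaves[name])

slaves = {
    "db-slave-1": (42, 1048576),   # furthest along in the newest file
    "db-slave-2": (42, 917504),
    "db-slave-3": (41, 2097152),   # still on an earlier binlog file
}
```

After promotion, the remaining slaves would be re-pointed at the winner with CHANGE MASTER TO.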
But yeah, a Google search will show you all kinds of tools around asynchronous slaves. Does that answer your question? Well, for the async part, yes, it's quite clear how we handle it, but for the synchronous part, what do we do in that case? So Galera has a scheme for that, and I'm purposely not gonna talk about it because I don't know enough about it, but it's fairly automated in how it heals itself, from what I understand. Well, from what I know, Galera sometimes has issues with bringing slaves back online. Yeah, yeah. The theory is: I take a guy that's pristine out, I take the broken guy out, I use the pristine guy to rebuild him, I do some certification, I put him back in. And sometimes that doesn't work. But for the most part, I feel like it's a pretty good model. Let's see here. So I hadn't really thought about that part. I kind of just left it up to Galera and said, you know what, they'll work out some of these kinks, they'll work out some of these bugs. So it's half done and still needs some work? Yeah, it still needs work. Yeah. Okay, thank you. Yeah. Does it affect, oh. In Nova, the periodic tasks don't care about data that's behind. Yeah, yeah, the scheduler does get state out of the database. But in the general case, that's acceptable for the most part, right? The scheduler can work with slightly out-of-date information, because, by the way, our schedulers are always out of date. So we're not really changing things a whole lot there. As far as the other periodic tasks, they're mostly cleanup tasks, so they're not sensitive to old data. And what we've put in the operations guide is: please don't let your slaves get more than 100 milliseconds behind. That's your responsibility as an operator, and those are the assumptions that the code makes. Your work looks great for reads, but I want to run a problem by you that we encountered, which is focused more on writes. Sure.
So with 100 compute nodes, and two services, nova-network and nova-compute, on each node, we observed a lot of contention on the services table. Yeah. I don't think your work would really address that. No, but the service group API does address that very well. You have lots of back-end choices for the service group API. You can use ZooKeeper, you can use memcache, or you can use SQL. And the default SQL implementation is rather bad for any sort of scale. So I just recommend you look at some of those other back-ends, because they're pretty cool. Cool, thanks. What else? Well, thank you all. You've been a great audience. Oh, it looks like I'm done. I am totally out of time. So thank you very much. You have my contact information. Feel free to come find me on Twitter or IRC.