past our starting point, so we're gonna get going here. People can straggle in if they want to; that's all good. We'll have time at the end for Q&A, but that doesn't mean you shouldn't ask questions as we go along if you have them. Just raise your hand and I'll call on you. I'm Randy Bias; this is Dan Snedden. We're with Cloudscaling. Cloudscaling has done a lot around production-grade OpenStack deployments, beginning in 2011 with the first OpenStack storage public cloud deployment anywhere outside of Rackspace, and we've got a lot of different perspectives on how to do redundancy, fault tolerance, and resiliency in systems, and in OpenStack in particular. And it's interesting, because I was actually told last night by the track chairs that this session had some of the most votes of all the sessions out there, and Florian's presentation on HA did as well, so I think there's a keen interest in the community as a whole about how to solve some of the redundancy problems, because it's not really baked into OpenStack, right? If you take stock OpenStack, download it, and try to run a production-grade system off it, that's not gonna be a fun adventure, right? You're gonna have to put a lot around it to make it actually work. So we're gonna talk about that today. And it's basically a three-part story. We're gonna discuss how we think about HA, and HA pairs in particular, and why they're not the only way to think about redundancy; discuss some of the different ways there are to do redundancy at a high level; and then deep dive into some of the ways we did redundancy in our Open Cloud System, which is an OpenStack-powered distribution that's really designed to be production grade. And so we did things a little bit differently, and that might be useful for you or not, I don't know, but we're gonna share. So first, I need to level set us, right? So what do I mean by HA? Well, I kinda mean what everybody else means when they say HA, because usually they mean this active-passive mode or this active-active mode. They mean two boxes that are really viewed as one. So that's what people mostly mean when they say HA. Now, highly available systems have a lot more than just pairs in them, right? But this is what most people mean. So for the duration of this talk, when I say HA, I mean an HA pair. Now, I think a lot of people would like to take a pair and scale it out. And there are ways to do this: HA clusters. Most people don't try to do that. And the reason is that HA clusters tend to be a clusterfuck, right? Yes, I did just drop the F-bomb. So the problem is that HA pairs traditionally are all about sharing state. They're about state synchronization. That's why you use them, right? You wanna make sure that when you fail over from the primary to the secondary, everything that was on the primary is on the secondary as well. Now, when you get to a cluster model, things get very complicated. Here I've got an example diagram from a real-world vendor of how they do the first two boxes in the cluster. Now, imagine what happens when you add the third and the fourth box, right? You have an HA pair. You connect a couple of cables between them. You're doing state synchronization between them. You add the third box. What do you do? Do you direct-connect the three? No, you add a switch. Now you've got a switch in your HA cluster, right?
The complications and the complexity grow exponentially as you go from the second to the third to the fourth to the fifth to the sixth box. So in the real world, most people don't run clusters. In the late 90s, I was deploying huge numbers of HA firewalls for customers. It was all Check Point. And at that time, we were at the point where we were deploying two of these pairs per week. And we did a lot of clusters as well. And the clusters had huge amounts of problems. They were very complicated. We had to do multicast MAC addresses. There was all this setup. And they had a tendency to fall over. So we backed away from them. And what we did is we started to just make all the boxes as big as possible so that we didn't have to scale them out. We could just scale them up. So what I see in a lot of the OpenStack community, and in enterprise data centers as a whole, is that there's this predominant pattern, which is that whenever I've got something that might be a single point of failure, I just make two of them. It's basically a hammer that gets used to solve all redundancy and resiliency problems. And the challenge with that is that not all resiliency and redundancy problems are nails. You really need to use the right tool for the job. So the hammer has a tendency to have these two major problems. The first is that there's a propensity for catastrophic failures, and I want to talk about that in a little bit more detail. The second is that they don't really scale out. What I mean by catastrophic failure is that an HA pair tends to be all the way up, right up until it fails, and then it's gone. You're toast until you basically recover it yourself. In the airline industry, it takes seven simultaneous failures for an aircraft to drop out of the sky. Pilot error, navigator error, just as examples, plus actual mechanical failures in multiple places. It takes seven simultaneous failures. For an HA pair, it sometimes takes one failure for the entire HA pair to go belly up. And the problem is that when it goes belly up, you go from 100% uptime to 100% downtime. So just briefly, I think everybody knows what scale out and scale up are, but I want to reinforce this to be really clear. Scale up is the whole process whereby, when you've got a box and you basically run out of resources and you need the box to do more, you replace it with a bigger box. And you see this in enterprise data centers all the time with pairs: your two firewalls run out of capacity. What do you do? You get two bigger firewalls. You get two bigger switches. You get two bigger load balancers. You get two bigger, two bigger, two bigger. That's how everybody does it at the moment. So that's not a scale-out pattern. A scale-out pattern is one where, when you run out of resources, you add another box, and you just keep adding boxes until you get to N. Now, there are complexities. It doesn't always cleanly get to N. But we've seen some examples like Netflix, who, even with state sharing using Cassandra, is able to ramp up from 50 to 300 parallel servers running Cassandra on Amazon Web Services and do millions of simultaneous writes, triple replicated. So scale up, excuse me, scale out is the de facto pattern for cloud. And it even works with services that have state, although we're not going to talk about that today. Excuse me while I get some water. So scaling out is a mindset.
I mean, it's not a particular thing, other than the fact of adding more boxes horizontally. And what's interesting about that is that you see a very different way that people think about solving the scale-up versus the scale-out problem. When it's scale up, people have this tendency to treat their servers like pets. They get sick and everybody scrambles to go fix the server to make it better. That's the challenge. The mail server's down, go fix Bob the mail server, the CEO can't get his email, we're all fucked. That's that sort of mentality. The scale-out mentality is more of: we're going to have as many servers as we want, we're going to number them like cattle, and if one of them gets sick, we're going to take that guy out back and shoot it in the head. Boom, done. You don't care when a server fails in the scale-out mentality. The scale-up mentality is very, very different. It's all about risk mitigation versus risk acceptance. In the scale-out model, you don't care if servers die. They come and they go, who cares? Scale up, you care a lot. So you make each server HA. These are some examples of systems that we know have had major catastrophic failures using HA, and I'm going to talk about one in particular. But what's interesting to know here is that hardware almost never causes failures. It does, right? I'm not saying it never does, but it's a really solved problem. Hardware tends not to cause major catastrophic failures. Major catastrophic failures like the AWS EBS outage tend to be caused by operator error or software bugs. Those are the two things. No form of redundancy inherently protects you from people who make mistakes or from software bugs. Nothing does, right? And I'm going to go into that in one more second, but to give one example: in 2007, this company called FlexiScale, one of the early infrastructure-as-a-service public providers in the UK, had a major catastrophic failure for 24 hours where they lost their entire cloud. The whole thing, boom, gone. Just completely down. What happened is an operator created a null config on the primary SAN. There was only one HA pair, two boxes. And that primary SAN dutifully synced the null configuration to the secondary SAN. All the configuration was gone. 24 hours manually reconstructing. And there are any number of other examples here we could go into. Dan or Ike will brief you on any of them in private if you want. But the point is, again, just to reinforce, that one of the fundamental challenges with running an HA pair is that they're great until they're not, and then you're in deep doggie-doodle. This is the way I think about HA pairs, right? It's essentially an all-in move. You're putting all your eggs in one basket, one gigantic failure domain, and you're assuming that it's never gonna fail. Again, it's the risk mitigation approach: I'm going to pretend that I can figure out every way that my system is gonna fail. Obviously, this is delusional. And then I'm going to try to account for every way that the system can fail. And then what happens when the system fails is that I've got no plan, right? It just fails. There's a better way to think about this, which is to reduce risk and to have more of a risk-acceptance-type approach. And this is what the Googles and Amazons of this world do. They do more of that scale-out approach.
They take all those eggs, they spread them across many baskets, and they try to set things up so you've got a small failure domain, so that if one basket goes belly up, you've only lost a few eggs, right? These big failure domains, this risk-mitigation approach of putting everything in one place, wind up creating these huge craters. That is fundamentally scale-up, again, right? The approach of doing scale-out and having little cells, little small failure domains with containment or fault isolation, so failure doesn't propagate through the system, that's the way you do the scale-out pattern. And just to give a somewhat unfair example, but one that really makes it clear: take a shared-state system that does HA, a paired database, and compare it to a bunch of web servers that are being load balanced, right? And ask, what happens if I lose two boxes of my HA pair versus two boxes out of my web server array? Something very, very different happens. In one instance, you lose everything, and in the other instance, you only lose 20% of your capacity, or, if your load balancer's active health checks are working, not even that. So what's usually HA in OpenStack? Well, the answer is that because we're using hammers, it's usually everything, right? People put that HA pair pattern on all the components. But the thing is, when we actually look at what actually has state in OpenStack, there's not a lot. It's mostly the database. You could argue that the RPC system does (it doesn't in ours), and that's it. So all these other components really would benefit from more of a scale-out model rather than that HA pair model, and that's where I'm gonna let Dan take over. Thank you. Ready? So what you see up here is really just a handful of the fault-tolerance methodologies that are commonly in use. There are dozens of them. They've been developed over the course of many years in the internet and UNIX communities. There are lots of ways of doing this, but there was no single method that we felt provided all the features that were needed for OpenStack. So what we've done is taken several of these methods and combined them together, and we get distinct advantages from that approach. We call this combination service distribution. Service distribution gets us some specific features that you don't get from most normal HA implementations. In addition to high availability, it's implemented in a way that's resilient. And what I mean by that is that resiliency is the ability to fail gracefully, to fail partially, to have fractional failures rather than total failures, to route around failure, and to adapt to change. It's a stateless solution. And we felt that this was important, to avoid the kind of state failures that Randy has alluded to, where you lose your entire cluster because something went wrong in the state, or because it went schizophrenic and the left hand didn't know what the right hand was doing. So wouldn't it be nice if the left hand didn't need to know what the right hand, or any other hands, were doing? So that's what we've done. It's stateless. It also scales out horizontally. And this is really important in OpenStack, because we tend to go big when we build a cloud. So this has no limitation on the number of service endpoints providing a service. So here's the actual implementation. This is the configuration needed.
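[Editor's note: the configuration slide isn't reproduced in this transcript. A minimal sketch of the kind of Quagga ospfd.conf being described, with a shared anycast service address advertised into OSPF, might look like the following. The hostname, address, and area are illustrative assumptions, not Cloudscaling's actual configuration.

    ! /etc/quagga/ospfd.conf (sketch)
    hostname api-endpoint-1
    !
    router ospf
     ! Advertise the loopback-hosted anycast service address into area 0.
     ! Every service endpoint advertises the same /32, so the routing
     ! fabric sees equal-cost paths to all of them.
     network 10.0.100.5/32 area 0.0.0.0
    !

]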
This is the minimum configuration to get a single service distributed among many service endpoints. And you'll notice there's not a lot of code, right? That's not a lot of configuration. OSPF in this case is just an example. There are multiple dynamic routing protocols that will work for this, but OSPF is a good one here. And we've got a simple OSPF configuration with one network, and that's a single IP address that we're then going to share among all service endpoints. That's a technique known as anycast. Instead of one-to-one, which is unicast, or one-to-many, which is multicast, it's one-to-any-available-server. And if your routers are running in the mode where it's per-flow load balancing, then those sessions are sticky. Those sessions are guaranteed to have the same source and destination. Per-flow ECMP. Did you wanna describe ECMP real quick, Dan, for the people who aren't network folks? Yeah, equal-cost multipath is the load balancing algorithm that routers use when they've got multiple routes that are at the same cost. They load balance across them all. And if it's per-flow, that means that a single TCP flow continues to go to the same destination for the duration of the session. ECMP is as old as the internet, for those who aren't clear, right? I mean, it predates firewalls, load balancers, all that stuff, right? It's very early TCP/IP. It's had all the bugs shaken out of it, it's extremely well understood, it's supported in every single switch out there, et cetera, et cetera, et cetera. Now, this is an optional component, but we've added a load balancing proxy. In this example, we've used an application HTTP proxy. It allows us to have finer-grained control over the distribution to the back ends. It allows us to do some protocol examination to make sure that it's valid HTTP, and it gives us service checks: once it detects that a service is not responding properly, it takes it out of the pool, keeps running service checks, and once the service is back up, starts sending traffic to it again. This configuration here, this little snippet, will get you two back ends and a listener on a custom HTTP port. Can I interject real quick, Dan? Can you go back? So we used this technique, with just OSPF and anycast, in both the KT and Internap OpenStack storage Swift deployments, and it was basically OSPF, anycast, Quagga, and the Swift proxies, and that's it. And we had full-on, super-redundant, scale-out load balancing across the entire Swift cluster with no single points of failure. Extremely performant, extremely resilient, didn't cost anything, didn't need load balancers, worked brilliantly, still does to this day. There's no load balancer in those examples I gave. [Inaudible audience question] This? Yeah, these don't share any state. These are just independent silos. He's gonna get to that in a second, but there's no state shared between them. Let me touch very briefly on the technologies used here. Quagga is the open source routing daemon that's used on Linux and the open source Unixes. It's standard, very well tested. The load balancing proxy that we've used is Pound. You can use any one of many, but we like Pound. It's solid, it's written in C, it's all in memory, and it doesn't require any shared state. We use it more for SSL termination, really, than load balancing, although it does do some load balancing. Go ahead. So let's look at how this works for OpenStack, because that's the goal here. We have resilient OpenStack. We have OpenStack with no single point of failure. We have OpenStack with graceful failure, partial failure.
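[Editor's note: the Pound snippet from the slide isn't reproduced in this transcript. A hedged sketch of a pound.cfg with one listener on a custom HTTP port and two back ends, as described above, might look like this; the addresses and ports are illustrative assumptions.

    # /etc/pound/pound.cfg (sketch)
    ListenHTTP
        Address 10.0.100.5        # the anycast service address
        Port    9292              # custom HTTP port, illustrative
        Service
            BackEnd
                Address 192.168.10.11
                Port    8080
            End
            BackEnd
                Address 192.168.10.12
                Port    8080
            End
        End
    End

Pound's built-in health checks pull a dead back end out of the pool and re-add it once it responds again, which is the behavior described above.]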
We've done that by using the service distribution method, which I just described, for the API endpoints and for the server threads, the worker threads, the service consumers. We've also got RPC that we've made resilient by swapping out the RabbitMQ queue and using ZeroMQ. We've contributed that code back into Folsom. It's been accepted. Now it's just a configuration option. There are some limitations on how you can make MySQL resilient, so we've used multi-master replication and simple high availability there. So the advantages of the service distribution method: you've got true horizontal scalability, with no limitations on the number of servers providing the service endpoints. Servers are always running, and that's really important. You're not losing any of your capacity. You don't have servers sitting idle. They can be placed anywhere within your network; there's no limitation there. There's no state shared between them. It's reduced complexity compared to some other high availability methods. And instead of a failover model, where your worst-case scenario is active-passive and half of your capacity is never used, in the distributed model you can have as many servers as you'd like, no limit. And it works really well for multiple sites, too. And this is a key advantage, we think, that is not provided by most other high availability solutions. When a service request comes in, your routers will forward it to the nearest available server for that service. But if that one's not available, they'll send it across the network to the nearest one that is available, perhaps at another site. Let's look at this in action, step by step, how this works. So I mentioned Quagga, the OSPF daemon. It sends an OSPF advertisement out to the router and says, I'm a valid route for such-and-such networks. In this case, we've got the same IP address configured on all boxes, on a loopback interface, so there are no conflicts. But they're all advertising to the routing fabric and saying, I'm a valid route for that. When a request comes into the routing fabric, it uses ECMP to load balance the incoming connections roughly evenly across all available service endpoints. That's where they hit the application proxy, which does some finer-grained load balancing and sends them out to N backend servers. There's no limit on the number of backend servers; you have a lot of flexibility there. Now let's look at how this fails. And this is the resiliency part of it: graceful failure, fractional failure. When the requests come into the routing fabric, the routing fabric distributes them to the load balancing proxies. They then distribute them to all servers on the backend. On this slide I put 10, so imagine that each server is servicing 10% of the requests. If one of the load balancing proxies in the middle fails, the routing fabric is going to know that that's no longer a valid route. It's gonna send connections to the other service endpoints, which are going to forward on to all of the servers. No lost capacity; 100% of your capacity is still available. But if you lost a backend server, some of your capacity is going to be impacted. If you lost one of 10 servers, then each of the other servers is gonna see roughly a 10% increase in load. Now, that's much better than when you've got an HA pair: one of them fails, and the other one has a 100% increase in load. We've got another example of using this. Now, this is one of our own services. Cloudscaling provides a NAT service.
That lives on the edge of the cloud, where (for the network engineers out there) we think it should be, rather than on the compute nodes. And the service distribution technique makes sure that all the inbound and outbound traffic is load balanced between all of those endpoints. So that gives you scale-out horizontal scalability. If one NAT server can handle, say, five gigabits of traffic, then as you scale out horizontally, chances are you can get beyond your bandwidth needs. And this is the other part of what we needed. There was no one approach that was going to work for every high availability problem in OpenStack. The RPC is also a problem. RabbitMQ by default has a single point of failure because it's brokered. All servers speak to RPC, RPC speaks to all servers; it's in the middle. And that becomes a single point of failure. And there's state in there. Many people say, well, hey, I'll just have multiple RPC brokers, active-passive, or even active-active with some tricks to make sure that I'm not tripping over that. Okay, but we think it's better to just remove that entirely. To have peer-to-peer, brokerless messaging using ZeroMQ. This is a very standard message queue that's in use in very, very large installations at very high capacity. It's just a configuration switch now, because we've contributed that code back upstream to Folsom. Yes, so just real quick on ZeroMQ. ZeroMQ was actually written by several of the folks who created the AMQP protocol initially. And they sort of left that business and started to rethink what they were doing, because in many production systems, as in OpenStack, messaging patterns are used strictly as an RPC mechanism, and the store-and-forward capabilities aren't strictly required. In some cases they actually create problems. So we're gonna go into Q&A here in a second. We'll have about 10 minutes for that, which is great, because I'm sure there's gonna be lots of questions. But just to recap what we learned: hammers are really for nails. There are a lot of other tools in the toolbox. This scale-out model, I think, if you're really gonna be serious about cloud computing and things like this, you really need to think about what scale out means in that context. And then, design for failure, right? I mean, Google has had four nines of uptime over the last several years across its various services. And that's because they built a software layer that assumes things will fail all the time, and they've built a hardware layer underneath that assumes the same. And the tricks they use, like taking the UPS out of the data center and putting individual batteries on each of the motherboards, those are the different ways of thinking about solving these problems that are really important to think about on a go-forward basis. If you wanna make a difference in the cloud computing world, if you want OpenStack to be great, then you need to think about and try to internalize what design for failure means. What are the tricks the larger guys are using? Because they're the ones who have really cracked the nut on what scale out means, right? If you wanna use the patterns that have been in use in the enterprise data center for the last 30 years, that's great. You go ahead and do that. I think that's appropriate in many cases, but I think that's looking back and not looking forward.
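[Editor's note: the "configuration switch" mentioned above isn't shown in the transcript. A hedged sketch of what selecting the ZeroMQ RPC driver looked like in a Folsom-era nova.conf follows; the exact option names should be checked against the Folsom documentation.

    # nova.conf (sketch): selecting the brokerless ZeroMQ RPC driver
    rpc_backend = nova.openstack.common.rpc.impl_zmq
    # With no broker in the middle, each node binds its own ZeroMQ
    # sockets and peers talk directly; a matchmaker maps topics to hosts.

]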
And then finally, if it's not 100% clear, we favor and believe in resiliency over redundancy. As much as possible, you'd like the system to have a plan for when it fails, as opposed to just trying to keep it from failing, right? Plan for failure, have it occur, have the system act in a resilient manner. All right, so. You said you used just a general HA pair for the database, and then it's not clear how you use that with multiple data centers. So we would not replicate the database between data centers, because in our particular model we use Nova to basically drive an availability zone, so there's no need to replicate across data centers. You would simply have the tenants fail over their capacity. But to answer the question directly: we use MMR, multi-master replication, with a traditional HA failover mechanism. We can go into some detail if you want, but if you went and looked at the Mirantis blog posting on redundancy, it's that exact same approach they were taking. Yeah, I'm actually from Mirantis, I think I'm probably the person who wrote that blog. Yeah, it's exactly like the way you did your MMR, pretty much. Thanks. Uh-oh, it's Florian. Now we get the curveballs. I actually have a few, but I just wanted to follow up on this one first. One at a time. Yeah, that's right. Just following up on this one. Yeah. Why not Galera? You know, I'd have to talk to the engineers who did that design work. I don't think there was any particular, like, there was no strong religion about using Galera or MMR. We even talked about doing clustering with the NDB engine. You know, I don't know how it is for you guys, but in our OpenStack deployments for our customers, the database isn't really a scaling problem. I mean, there are other problems, like SQLAlchemy, and if you stay for the session after this one you'll hear a lot about that; that's really where the bottlenecks are in the app. So the database isn't getting hit very hard. And the other thing about MySQL, which we all know, is that it's super well understood at this point; scaling it up or out and all that is just really, really trivial. So that just hasn't been a problem so far. I'd like to point out that while that's what we've chosen for the clouds that we build for our customers, there's no dependency. So if you wanted to use Galera, you could keep the other methodologies that we presented today and use it that way. Yeah, I'm curious how you solved some of the problems around your Nova Network implementation. I can see how your Quagga strategy works for finding the gateway and that sort of stuff, but at least with the RabbitMQ-based implementation, and maybe your ZeroMQ is how you address this, you can't have two Nova Network nodes with the same name configuring the same IPs. They fight with each other and wind up split-brained. So did you guys have to write a custom network driver, or does your ZeroMQ allow for multiple subscribers to take the same request and perform the same action? So that's a good question. We actually do have a custom network driver, but it's not required for this; that's separate. What we've done is put the IP addresses on loopback interfaces rather than on physical interfaces. So those addresses exist, but they exist inside the box. There's no way to get to them from the outside, except through the dynamic routing. It's all layer three, it's not layer two, so there's never any ARP request.
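[Editor's note: a minimal sketch of the loopback placement being described, with an illustrative address. Because the shared service address lives on lo rather than on a physical interface, it never answers ARP on the wire and is reachable only through the routed fabric.

    # On every service endpoint (sketch; the address is illustrative)
    ip addr add 10.0.100.5/32 dev lo
    # Quagga then advertises this /32 into OSPF, as sketched earlier.

]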
Yeah, sorry, I guess what I meant is, when you do an allocation for a new project and it's a new tenant, it's picking a network and it's gonna bring that up. At least with the RabbitMQ-based implementation, you can't have two running Nova Network nodes that will build the bridge and bring up the IP, even if you put that IP on a loopback. So does your ZeroMQ solve that, or is there something else you're doing to address that? We do something else. We do something that's not covered here. Basically, we've reproduced the Amazon Web Services networking model, so it's an all-layer-three routed topology down to the VMs. There are no VLANs in use in the system. We've taken the Nova Network controller and put it to the side, and all it does now is run security groups, iptables, and metadata. Other than that, the NAT's done at the edge through our NAT service. IP allocation is static, because any time we deploy a node, there's basically a set of IPs that are allocated to that node, just like on Amazon. So everything's completely static, and the DHCP comes off a distributed DHCP service that runs on every compute node. It's actually the BusyBox udhcpd service. So it's a stripped-down, super reliable service, and if you lost that box, you'd just lose the state on that box. And because we derive all the MAC addresses from the IP addresses, literally when you deploy IP addresses in a rack for a particular box, you know what routes are in the rack, what routes go to that box, what IPs are on the box, and what MAC addresses are on the box. The whole system's laid out very statically, so IP allocation can just happen. We can run tons and tons of parallel threads to do scheduling and network allocation. It's blazing fast, and nothing ever really falls over. Thank you. Yeah. Oh, come on. All right, keep going, Florian, if nobody else is gonna get in here. Yeah, so one thing I'd like to thank you for is actually pointing out this discrepancy between what many people understand as HA, namely this boring shared-state failover thing, this HA pair, and everything else. I often have a hard time explaining that it's a lot more than that. Right. I'm thinking that with bringing HA to OpenStack, most of the time what we really have is sort of a user confidence problem. No matter what approach we choose, people say, okay, I'm having a problem wrapping my head around Pacemaker because it seems like a complicated beast and I don't really wanna use it. And what I've seen relatively often is people getting terribly scared by using Quagga and multi-master MySQL replication. What is your approach to alleviate those user fears and build that confidence? Yeah, I mean, we haven't had that problem. I'm sure that it exists. With the customers that engage us, I spend more time explaining to their network teams why Quagga is acceptable than what it is. I mean, they're typically very sophisticated and they're looking for a production deployment, so we have kind of the opposite end of the spectrum, right? The level of sophistication, like, I have a 36-page technical reference paper that just describes how all the networking in the system works, because we get asked so many questions about it that if we don't cover it up front, we get crucified. So I have the opposite problem. Your problem exists, though. I don't wanna pretend that it doesn't. That's absolutely the case.
And one way that we address it is with automated deployment. We've automated everything, so that helps. So Randy, how do you handle it when the network team won't play ball? We haven't had that problem yet, because they find out that we're very sophisticated network guys ourselves. So that hasn't been a problem. They've mostly kicked the tires in every circumstance, wanted to drive in some, and then they push us around a lot on hardware vendor selection and stuff. But here's the thing, right? The layer three networking topology and using ECMP and all that stuff, network guys really get that. It's really old school, networking from the 90s; nothing's really changed since then. So once we explain it, they think it's a little odd, because they haven't seen enterprise data centers that look like ISP backbones before, but then they understand it. So they usually get in line. I get a lot of questions about our plans for SDN and stuff like that, but let's not devolve this conversation into that, okay? More questions? So, there's still one thing about the database setup you have that I don't understand. Which data center is the database in? Say that again, please. Which data center contains the database replicas? Or does every data center contain them? It's all local. It's a local instance with the replication. So every data center has a cluster of a few database instances, right? Yeah, so we really follow the Amazon Web Services model in terms of our thinking. A data center is a failure zone, so you're not expected to replicate across that. Okay, so basically a failure domain is a data center, and each one is an independent OpenStack installation. There's a failure domain at the node, a failure domain at the rack, a failure domain at what we call a subzone, which is the network layer, a failure domain at the availability zone, and then a failure domain at the region. Okay, thanks. Oh, the MySQL stuff, I forgot to mention this: we actually don't do automated failover. Even though it's multi-master replication, and you guys would be interested in this, we don't do active-active; we do active-passive, and we don't do automated failover. What we found is that a lot of the customers we talk to run production systems where they've had problems with MySQL and other databases doing an automated failover from the primary to the secondary without an operator check on it, and the replication's not up to date, or something broke, or there was a software bug, and that's bad. So it's actually a manual failover with operator intervention on our part, and that's a deliberate design decision, just like a lot of large websites make, because we'd rather have our operator go in and do a sanity check on the secondary and make sure that everything's in good shape before failing over. That's really only gonna impact the API endpoints, and not any of the running production workloads. I still think that's mostly a best practice. So, right after us is a presentation from LivingSocial, and there's gonna be some information there about the performance problems I alluded to earlier regarding SQLAlchemy, with some really great data and charts, so I hope you stick around for that. Thank you for coming.
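[Editor's note: the multi-master replication setup described above isn't shown in the transcript. A generic sketch of a two-master MySQL MMR configuration, with offset auto-increment values so writes on the two masters can't collide, might look like the following; the server IDs and values are illustrative assumptions, not Cloudscaling's actual configuration.

    # my.cnf on master A (sketch)
    [mysqld]
    server-id                = 1
    log-bin                  = mysql-bin
    auto_increment_increment = 2    # two masters in the pair
    auto_increment_offset    = 1    # master A generates odd IDs

    # my.cnf on master B (sketch)
    [mysqld]
    server-id                = 2
    log-bin                  = mysql-bin
    auto_increment_increment = 2
    auto_increment_offset    = 2    # master B generates even IDs

Each master is then pointed at the other with CHANGE MASTER TO and started as a replica of its peer; per the talk, failover between them is a deliberate, operator-driven step rather than an automated one.]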