All right, welcome to this talk. My name is Greg. I've been working on Ceph for over 10 years now, most recently on stretch clusters, which is exactly what we're going to talk about. I work at Red Hat, and have since they acquired the Ceph company five and a half years ago. First, we're going to go through the fastest quick intro to Ceph you've ever been given, and then we'll talk about the specific problem.

In a Ceph cluster, there are three main daemons. There are the monitors, here on the left. There's usually a small number of those, mostly three or five. They're responsible for keeping track of the other daemons that are running as part of the Ceph cluster: they keep track of their monitor peers; they keep track of the object storage daemons (OSDs), which run one per hard drive and are responsible for serving I/O and maintaining data durability and availability; and they keep track of the metadata servers, which are only for the Ceph file system, which we're not going to talk about today.

A typical Ceph cluster: the slide says RADOS because that is the object store that is built out of the monitors and the OSDs. It stands for Reliable Autonomic Distributed Object Store. A RADOS cluster will usually be composed of three or five monitors and tens to thousands of OSDs.

When you perform a write in RADOS, you have an application that runs the librados client library or whatever. It connects to the monitors, becomes an authenticated client of the cluster, and gets an OSD map describing what the cluster looks like. The OSD map gets updated periodically as OSDs go up or down or get added or removed, but more or less, the client gets an OSD map once and then it's got what it needs. To do I/O on object foo, it runs our special CRUSH algorithm, which is a bit of magic, to find out which OSDs are responsible for object foo, whether it exists yet or not. It sends the write operation to what we call the primary OSD for that object. That primary OSD gets the request, performs permission checking and validation, makes the change locally, and sends the update off to its replica or peer OSDs. Those OSDs also commit it to disk, they reply back to the primary, and the primary replies to the client.

Visually: the client talks to the monitors, and then it says, all right, I want to write to, in this case, this is the primary. The primary replicates. The replicas reply, sure. The primary replies to the application. This matters mostly because you'll notice the primary is in charge of all of the writes, so it's in charge of the ordering of operations, serialization, logging, and things like that. (I'll show a toy sketch of this flow in a moment.) OK, that was fast.

So, the context of this talk: a stretch cluster. Ceph is designed to run on a local area network, so even in a stretch cluster, we still expect low latency, maybe five milliseconds or something, and very high bandwidth. For every write, the client has to talk to the primary OSD and the primary OSD has to replicate that data across the network links, so probably every write has to cross the inter-site links, and about a third or half of the reads do as well. The big difference is that in a single data center, you're running on a LAN and you probably have a lot of redundancy in your connections between different servers, whereas in a stretch cluster, you probably have two or three data centers which each have one connection to the others over some kind of dark fiber or rented link.
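Here is that toy sketch of the write path. This is not Ceph's actual code: real placement uses CRUSH over the OSD map, and the simple hash plus every name here is a made-up stand-in. It just shows the shape of the flow: the client computes an acting set once, sends the write to the primary, and the primary fans it out and acks only after every replica has committed.

```cpp
// Toy model of a replicated RADOS write. NOT Ceph code: a plain hash
// stands in for CRUSH, and an in-memory vector stands in for the disk.
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Osd {
    std::vector<std::string> log;  // stands in for the on-disk object store
    bool commit(const std::string& op) {  // a replica applies and acks
        log.push_back(op);
        return true;
    }
};

// Stand-in for CRUSH: deterministically map an object name to an acting
// set of `replicas` OSDs. acting[0] is the primary.
std::vector<std::size_t> place(const std::string& oid,
                               std::size_t num_osds, std::size_t replicas) {
    std::vector<std::size_t> acting;
    std::size_t h = std::hash<std::string>{}(oid);
    for (std::size_t i = 0; i < replicas; ++i)
        acting.push_back((h + i) % num_osds);
    return acting;
}

int main() {
    std::vector<Osd> cluster(6);

    // Client side: one placement computation, then talk to the primary only.
    auto acting = place("foo", cluster.size(), 3);

    // Primary side: commit locally, replicate to peers, ack the client
    // only after every replica has committed. This is why the primary
    // owns the ordering of operations.
    std::size_t acks = cluster[acting[0]].commit("write foo v1") ? 1 : 0;
    for (std::size_t i = 1; i < acting.size(); ++i)
        acks += cluster[acting[i]].commit("write foo v1") ? 1 : 0;

    if (acks == acting.size())
        std::cout << "ack to client from primary osd." << acting[0] << "\n";
}
```

Every one of those replication hops is traffic that, in a stretch cluster, has to cross the inter-site link.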
And so your odds are much higher than usual that you get some kind of network split where one of the data centers can talk to everybody, but another data center can't. So you might have portions of your cluster that aren't visible to each other, even though they're all running. You also have a much higher chance than usual of a fire, or a temporary power outage, or somebody making a mistake on the network, totally disconnecting an entire half or third of your cluster than you would in a local data center. It can still happen in a local data center too; the odds are just a lot higher. And if you're running a stretch cluster, it's probably because you want to be more resilient to those things, so it's more important that we handle them quickly and without administrator intervention.

You can deploy one of these stretch clusters today. It's not super common, but we've talked to a lot of people, and I think Red Hat has some customers who run with three data centers. You put a monitor in every data center and about a third of your OSDs in each. This mostly works fine under most conditions; there are just a few cases we're going to talk about where it doesn't work great.

The target scenario, though, is where we have two data centers that each have half the OSDs and one or two monitors each, and then a third data center, or just a virtual machine running in a cloud somewhere that may have much higher latency, is used for running a monitor that we call a tiebreaker. Its only job is, if there's a net split between the data centers, to pick one of them to win. And this does not work right now, for reasons that we'll get into shortly.

You might ask why you don't just run your monitors in two data centers. The answer is that the monitors are consensus driven: in order to make any change to the cluster, you need more than half of your monitors to agree to it. So if you have two monitors in one data center and one in the other, and you lose the data center with two monitors, the remaining monitor can't do anything. You'd have to have an administrator come in and do a lot of manual surgery to make the cluster come back alive and start doing useful things again.

The big problem with the two data center case, or even the three data center case, is leader elections. The monitors, in order to make things simpler, always pick a leader, and the leader is responsible for everything that happens to the monitors. If you make a change as a client, or an OSD turns on, or whatever, we need to make a change to the cluster maps. Any one of the monitors can notice that change or get requested to make it, but they forward the request to the leader, and the leader distributes the changes to all of the other monitors, which we call peons. As a consequence, peons don't talk to each other; they only talk to the leader, but the leader does talk to everybody else. The only time we care about the full network connectivity is during election periods.

The leader elections are pretty simple for what they are. Any monitor can decide to start an election, but it really only happens for a couple of reasons: either something timed out, or they just turned on.
When you decide to start an election, you send out a proposal to all the other monitors and you tag it with an election epoch. That is just an increasing number we can use to make sure a message isn't stale when we receive it. So for some reason, someone decided to send a proposal, and they send it to all the monitors they know about, whether those monitors are really alive or not. Everything else is driven either off of timing out an election, or off of receiving a proposal or one of the other messages.

When you get the proposal, if the sender is someone that isn't in the quorum with you right now, you say, oh, we have someone who's alive who's not in our quorum; we should start an election with everybody and let them join. So you start a new election with a higher epoch. If the sender is already in the quorum, we have to decide whether we want them to win or not. In the current election system, the way we decide that is that the monitors all have an ID, from zero, one, two on up, and the one with the lowest ID wins. If the sender of the propose has a lower ID, we defer to them; otherwise we say, hey, I should win, and we send off a proposal of our own. Then, when you're running an election, if you get a defer from everybody in the cluster, you just become the leader and send out a victory message. If we time out the election because one of the monitors isn't alive, but enough of our peers have deferred to us, then we say, hooray, we're going to be the leader, and we send out a victory message saying, I've got zero and one in my quorum, but two's not in it. And if we time out and we haven't had a defer, well, that last point doesn't really matter right now. I know that sounds like a lot.

So visually, we've got three monitors here. We're going to call them zero, one, and two, because those are the IDs they work with internally. For some reason, monitor one has decided to send a propose, and so he proposes out to everybody. Monitor two receives that propose and says, oh, monitor one has a lower ID than me, and I haven't seen election epoch one yet, so sure, I defer to you. But monitor zero goes, hey, no, I'm a better number than you are. I want to be the leader, so I'm going to bump the election epoch up and propose to you and the other monitors I know about. Then one and two both get that proposal from zero and say, yes, you are a better number than me; I defer to you in this election. So zero wins, and he says, here's the quorum, it's all of us. Hooray. That works great in most cases.
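In toy form, the classic rule each monitor applies when a propose arrives looks roughly like this. It's a simplification of the behavior just described, not the actual Ceph Elector code; all the machinery around timeouts and victory messages is elided.

```cpp
// Toy version of the classic election rule: lowest monitor ID wins,
// and the election epoch fences off stale messages.
#include <cassert>

struct Monitor {
    int id;             // monitor rank: 0, 1, 2, ...
    int epoch = 0;      // highest election epoch seen so far

    enum Action { IGNORE, DEFER, COUNTER_PROPOSE };

    Action handle_propose(int sender_id, int sender_epoch) {
        if (sender_epoch < epoch)
            return IGNORE;          // old message from a stale election
        epoch = sender_epoch;
        if (sender_id < id)
            return DEFER;           // lower ID beats us: vote for the sender
        ++epoch;                    // we are the better number: bump the
        return COUNTER_PROPOSE;     // epoch and propose ourself instead
    }
};

int main() {
    // The healthy three-monitor story from the slides:
    Monitor mon0{0}, mon2{2};
    assert(mon2.handle_propose(1, 1) == Monitor::DEFER);   // mon.1 proposes
    assert(mon0.handle_propose(1, 1) == Monitor::COUNTER_PROPOSE);
    assert(mon2.handle_propose(0, 2) == Monitor::DEFER);   // mon.0 wins
}
```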
But if you have a net split between zero and one, something terrible happens. One might again decide to propose, maybe because they were just in a full quorum, but now one hasn't heard from zero in a while and has timed out. So he proposes with epoch three, but that message to zero doesn't get anywhere because of the net split. Two says, oh sure, you're a better number than me and I haven't seen this epoch before, so I will defer to you. After a timeout period, monitor one says, hey, I won, with me and two, and I haven't heard from anyone else or seen another election. One and two run along happily.

Except that at the same time, monitor zero is going, gee, I haven't heard from monitor one in a while, and I sent him a thing that he was supposed to commit. So monitor zero goes, hey, I'm proposing a new election in epoch three, because we need a new quorum. And monitor two goes, huh, I already saw epoch three and you are not in my quorum right now, so I'm going to propose a new election so that we can all get in a big happy quorum together. Both zero and one receive that proposal from two and go, oh no, I should be the leader, and they send out proposals with a new epoch. Depending on which order they arrive in, two might defer to only zero, or might defer to one and then zero. Either way, we end up in the case where two is in a quorum with one of these monitors, and the other one times out its election timer and says, hey, we should run an election, because I'm alive and I'm not in the quorum. And that just happens continuously. You can't make forward progress in the cluster when this happens, and it's terrible.

So a while ago I was asked to make stretch clusters possible, and this was the first problem I identified. I came up with a plan. First of all, I needed to be able to change the code at all. The election code we've been talking about is some of the oldest code in the Ceph project. It mixed up the message passing between monitor daemons and the logic deciding what should happen when you get those messages in the same functions, and so it wasn't unit testable at all. We were confident in it: it works really well when there aren't net splits, because it's reasonably simple, and we've run a lot of integration tests against it. But I was going to make a much more complicated thing, and I wanted to be able to test it on my local computer without running fake clusters.

So the first thing I did was split the election logic out into its own class that deals with abstract propose and ack concepts via function calls, and leave the message passing between monitors in the existing Elector class. (There's a sketch of this shape coming up.) Then I wrote unit tests for what we already had, which also demonstrated the things that didn't work, like that net split case I just showed you. This was great, because it makes the algorithm itself, just the ideas of it, a lot easier to iterate on and experiment with. When I first came up with the idea we're about to talk about next, I missed several subtleties of it, and the unit tests revealed all of them to me except one. I also came up with new invariants that I needed to maintain, or rather old invariants that had been implicit in the way we did elections, which I needed to spell out explicitly and make sure I maintained more carefully on my own, and those are unit tested too. The test binary runs in under a second. It's built with the GTest framework, so I can pick a single test to run, and because it's all just running inline in a single thread with time steps, I can create complex scenarios where I very carefully order which messages appear at what time and make sure it all behaves as it should.

The basic idea is that, hey, picking monitors based off of their ID number is often dumb. We should instead pick the monitor that has the best connection to all the other monitors. And so that's what we do.
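Going back to that refactor for a second, its shape is very roughly the following. The names here are illustrative, not the real Ceph class definitions: the point is just that the decision logic talks to the world only through an interface, so a unit test can stand in for the messenger.

```cpp
// Sketch of the testability refactor: pure election decisions in one
// class, message passing hidden behind an interface. Illustrative only.
#include <cassert>

struct ElectionOwner {              // implemented by the real messenger...
    virtual void propose(int epoch) = 0;
    virtual void defer(int to_id) = 0;
    virtual ~ElectionOwner() = default;
};

class ElectionLogic {
    ElectionOwner& owner;
    int id, epoch = 0;
public:
    ElectionLogic(ElectionOwner& o, int my_id) : owner(o), id(my_id) {}
    void start() { owner.propose(++epoch); }
    void receive_propose(int from_id, int e) {
        if (e < epoch) return;      // stale election message
        epoch = e;
        if (from_id < id) owner.defer(from_id);
        else start();               // counter-propose with a higher epoch
    }
};

// ...and by a fake in the unit tests, which just records what happened.
struct FakeOwner : ElectionOwner {
    int proposes = 0, deferred_to = -1;
    void propose(int) override { ++proposes; }
    void defer(int to) override { deferred_to = to; }
};

int main() {
    // No daemons, no network: drive the logic with plain function calls.
    FakeOwner fake;
    ElectionLogic logic(fake, 2);
    logic.receive_propose(0, 1);    // a lower-ranked peer proposes
    assert(fake.deferred_to == 0);  // the logic chose to defer, as expected
}
```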
We built a new heartbeat pinging system between the monitors. Each monitor maintains, for every peer, whether it thinks its connection to that peer is alive or not, and a score that reflects how often that peer has been alive over the past period of time. Then we share those scores broadly, so a monitor always knows not just its own current view of all of its peers, but also a pretty recent idea of what monitor one thinks of all of its peers and what monitor two thinks of all the other monitors. When we receive a propose message, we calculate the score of the sender, of ourself, and of anyone else we might have deferred to, based off of our global view of the cluster, and we choose whether to defer, whether to ignore it, or whether to start a new election proposing ourself. It's a little more complicated than that, because those scores change over time and you don't want them changing while an election is running, so we take snapshots of the scores that only switch over at certain times, but that's the basic idea. This code exists, it's in a pull request, it's gotten good reviews, and I've got just a few things left to change. I was hoping to get it all done before this talk and I didn't quite make it, but next week, I suspect, I'll get this merged.

So instead of the old classic mode I was talking about, we built what I'm calling connectivity mode for the monitor elections. All the time, the monitors are pinging each other and maintaining these scores. They all say, hey, my peers are up and I'm up, and because they've all been up for a long time, everyone's score is one, which is the highest. If we get a net split, they'll try to ping each other, but it won't work, so there won't be pings happening, and they'll say, hey, that guy's down, and he's been down long enough that I've degraded him from a score of one to a score of 0.8. So say the net split happens: zero and one have both timed out, because they aren't talking to each other and one of them was the leader, and each says, hey, I'm going to propose a new election to monitor two, who I can talk to, and we're going to try to form a new quorum. And monitor two goes, hey, no, my score is better than yours, to both of them: I propose a new election in which I'm the leader. Monitors zero and one, having gotten score updates from everyone else around the cluster, agree: yes, you should be the leader. Monitor two wins and you have a happy quorum, which is composed of all of the monitors in the cluster, even though they aren't all talking to each other.
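A rough model of the score-keeping follows. The decay and recovery rates are invented for illustration; the real handling, including the snapshots taken around elections, is more involved. A peer that answers pings holds a score of one, and an unreachable peer decays toward zero, like the 0.8 in the story above.

```cpp
// Toy model of per-peer connectivity scores. The constants are made up.
#include <algorithm>
#include <iostream>
#include <map>

class ConnectionTracker {
    std::map<int, double> scores;  // peer id -> score in [0.0, 1.0]
public:
    void report_ping(int peer, bool replied) {
        double& s = scores[peer];                    // new peers start at 0
        if (replied) s = std::min(1.0, s + 0.05);    // earn trust back slowly
        else         s *= 0.9;                       // decay while down
    }
    double score(int peer) const {
        auto it = scores.find(peer);
        return it == scores.end() ? 0.0 : it->second;
    }
};

int main() {
    ConnectionTracker t;
    for (int i = 0; i < 50; ++i) t.report_ping(1, true);  // long healthy run
    std::cout << t.score(1) << "\n";                      // 1: all good

    t.report_ping(1, false);                              // net split begins
    t.report_ping(1, false);
    std::cout << t.score(1) << "\n";  // 0.81: degraded, like the 0.8 above
}
```

When a propose arrives, the receiver compares scores like these, combined across everyone's shared views, instead of comparing raw monitor IDs.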
One of the key differences in connectivity mode is this. Say, unlike in classic mode, we have four monitors, as illustrated here, with a really limited set of connections across the cluster: one can only talk to two, and three can only talk to zero. Now, the tiebreaker between equal scores is still the ID, so with the numbers I've got up here, I'm pretty sure zero would win the election, and he would have a quorum of zero, two, and three. But one can sit there going, hey, I'm not in the quorum, and keep trying to call elections with himself in the quorum by talking to monitor two. In connectivity mode, when monitor two gets that proposal, he says: I'm sorry, your score is not good enough; I would keep electing monitor zero as the leader; you need to go talk to monitor zero to get into the quorum at all, and to call an election. So monitor two just drops those proposals. In this case we have a stable quorum: you can keep operating your Ceph cluster with whichever OSDs can talk to two or zero or three, life moves forward and is good, and the monitors are happy.

Oh yes, question: does it still work if only five of these six link directions are working, so one monitor can see the other, but not the reverse? So in that case, if they can't communicate in both directions, both of them will mark the other guy down. If it fails in only one direction, then, actually, I think both will still mark the other guy down, because they're not getting ping replies. But even if one says they're down and the other says they're up, that still reduces their scores, because, and I didn't actually say this explicitly, in that case monitor zero gives one a score of 0.8, but it gets counted as zero in the score calculation because it's marked down. So no one's going to win in that case.

The other problem we've identified with stretch clusters, oh my, is OSD peering. We have a primary OSD, and when he turns on, or when the cluster changes in some way and he becomes primary, he needs to make sure he's got the newest version of all the data he's responsible for. So he looks at the old cluster maps the monitors give him and asks, all right, who was responsible for this data? He asks those OSDs for the versions they have and for the newest versions of the data.

Visually speaking, we've got these two OSDs. The new primary doesn't have version five; he's only got update A. So he asks for the version, it gets sent back, and now he knows there's a newer version, but he doesn't have it yet. He asks for the updates. Now he knows what the updates are, but he still doesn't have them: these are logs of the updates, saying what they covered, but not the data content. Then he can incrementally ask for the updated data, get it back, and move himself forward in time, as he requested.

In real life, you will generally have not two but three copies of all the data, and because we want to always be able to recover from losing a drive, there's a default minimum size. It's a thing you can change, but by default, the minimum number of current copies for a primary OSD to go active and serve reads and writes on a piece of data is two: he needs himself and another peer who have the current version of the data. Because if we go active with just a primary who's got the newest version of the data, and then you write to him and he dies, you're out of luck. You've lost data, there's no coming back, and for us in Ceph, that's really bad. Only the OSDs which are involved in serving data know the specific versions, but because we want to know which OSDs to talk to, the monitors know who was allowed to make updates, so you can query those maps and say, hey, who do I need to talk to to see the newest version of the data?
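As a toy model of that catch-up, here is the three-step exchange from the diagram: learn the peer's newest version, fetch the log entries describing the missing updates, then pull the data itself. This is only illustrative and nothing like the real peering code; all the names are made up.

```cpp
// Toy model of primary catch-up during peering: versions first, then
// update logs, then the data itself. Not the real Ceph peering code.
#include <iostream>
#include <map>
#include <string>

struct OsdState {
    int last_version = 0;
    std::map<int, std::string> log;   // version -> what the update covered
    std::map<int, std::string> data;  // version -> the object contents
};

void catch_up(OsdState& primary, const OsdState& peer) {
    // Step 1: ask the peer what version it has.
    int newest = peer.last_version;
    // Step 2: fetch log entries past what we hold. Now we know *what*
    // we're missing, but we still don't have the data content.
    for (int v = primary.last_version + 1; v <= newest; ++v)
        primary.log[v] = peer.log.at(v);
    // Step 3: incrementally pull the data and move forward in time.
    for (int v = primary.last_version + 1; v <= newest; ++v)
        primary.data[v] = peer.data.at(v);
    primary.last_version = newest;
}

int main() {
    OsdState primary, replica;
    primary.last_version = 1;              // primary only has update A
    primary.log[1] = "wrote A"; primary.data[1] = "A";
    replica = primary;
    replica.last_version = 2;              // replica saw a later update
    replica.log[2] = "wrote B"; replica.data[2] = "B";

    catch_up(primary, replica);
    std::cout << "primary caught up to v" << primary.last_version << "\n";
}
```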
So say we start off with a system like this, where we have two main data centers and all the OSDs are just peering back and forth, which is unrealistic, but fine. If we lose one of the OSDs in a data center, time might move forward and he's now out of date, which is fine at first glance. But if we then lose, say, data center one, then even if this OSD 2.0 comes back, he knows that he's out of date, that he doesn't have the newest version of the data, and that he can't serve reads and writes to clients, because he might serve them old, bad data. So you're just kind of stuck, you can't move forward, and that's bad. It's not usually going to happen the way I described there, where you lose half the data because you only have two OSDs, but if you have a wide, big cluster, the odds are pretty good that you always have at least one OSD that's out of date because it's rebooting, or power went out in a rack, or something. And we don't need most of the data to be up to date; we need all of it.

So the design target for stretch mode is that we have two main data centers, and each of them holds two copies. The reasoning behind that is that if you lose one of the data centers, you still have two copies, so the remaining data center is a supportable data center on its own, and you don't need to immediately get the dead data center back for us to be okay with you reading and writing.

Now, by default a Ceph OSD can talk to any monitor it wants to, but in stretch mode we restrict it to talking to the monitors in its own data center. Because if it can still talk to the tiebreaker monitor, an OSD can keep itself alive even if it can't talk to any other live OSDs, and that's bad: it might be supposed to be the primary, but not able to talk to anybody. We make it talk only to the monitors in its data center, and so that's prevented. And then we extend the peering algorithm so that not only do you need a certain number of copies, you need copies in more than one data center; there's a sketch of that check below. This is in progress; I'm expecting to have a PR next week. I have about a minute and a half left, so, you know, this works, or at least it will work.
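A sketch of that extended activation check: count not just how many current copies you have, but how many data centers they span. The names and the framing as a standalone function are my illustration, not the in-progress PR.

```cpp
// Sketch of the stretch-mode activation check: min_size current copies
// AND current copies in more than one data center. Illustrative only.
#include <cstddef>
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct Replica {
    std::string datacenter;
    bool has_current_data;
};

bool can_go_active(const std::vector<Replica>& acting,
                   std::size_t min_size, bool stretch_mode) {
    std::size_t current = 0;
    std::set<std::string> dcs_with_current;
    for (const auto& r : acting) {
        if (!r.has_current_data) continue;
        ++current;
        dcs_with_current.insert(r.datacenter);
    }
    if (current < min_size)
        return false;                      // the classic min_size rule
    if (stretch_mode && dcs_with_current.size() < 2)
        return false;                      // the new stretch-mode rule
    return true;
}

int main() {
    // Four copies, two per data center, but only dc1's are current:
    // min_size=2 alone would go active; stretch mode refuses until a
    // copy in dc2 catches up.
    std::vector<Replica> acting = {
        {"dc1", true}, {"dc1", true}, {"dc2", false}, {"dc2", false}};
    std::cout << can_go_active(acting, 2, false) << "\n";  // 1
    std::cout << can_go_active(acting, 2, true) << "\n";   // 0
}
```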
Do we have any questions? Go for it.

Q: On the first of your diagrams on partitioning, if I've lost connectivity between 0 and 1, 2 is still in... A: Sorry, which picture? Way back? Oh, way back, the problem with the leadership algorithm, the elections. Yes, okay. Q: So you're saying that I can now have a cluster with 0, 1, and 2, but 0 and 1 can't talk to each other? A: A monitor cluster, yes. Q: So what happens if 0 and 1 are my two primary data centers and 2 is my tiebreaker? Haven't I got a live cluster where I can't talk between my data centers and I haven't been able to elect? A: Oh yeah, sorry, I didn't make that clear; you're right. Stretch mode puts the cluster into a special mode where it notices, hey, you guys can't talk to each other and I'm getting failure reports from the OSDs across that divide. That's part of going into stretch cluster mode that I didn't put on the slides. And, sorry, I also left this out: monitor 2 is not allowed to become a leader in the stretch mode case. We just mark it that way, so either 1 or 0 is going to be the leader, the monitors on the other side are out, and the OSDs over there get marked down too. It's just an extra part of the algorithm that I skipped over, I'm sorry; that's a great question. I need to update my slide deck before I do this again. Does that make sense? It's just a configuration: you say monitor 2 is our tiebreaker, a special monitor who's not allowed to lead, because their job is to vote so that one of the other guys wins.

Q: It seems as though you need to have all-to-all connectivity within the active OSDs? A: Yes, right. So if this is your data center representation, monitor 2 picks 0 or 1, just flipping a coin. Let's say they picked 1. Now 1 and 2 are in a quorum together, they know monitor 0 is out of quorum, and the OSDs in data center 1 are reporting the failure of the OSDs in data center 0. They're like, oh, we're not talking to anybody in data center 0, it's dead. And then everything is in data center 1, so you have all-to-all connectivity. Q: So do they also stop trying to communicate with the lost data center? A: Yeah, OSDs only talk to the OSDs that are in their peer set, and the lost data center will all be marked down, so they're never in their peer set.

All right, well, I will be around the rest of the weekend, and our time is up, so thanks very much, everyone. Enjoy the rest of the weekend.