So, yeah, let's welcome him for some core distributed algorithms.

All right, thanks very much. As you said, my name is Greg. I work at Red Hat as a principal software engineer, and I've been on the Ceph project for over a decade now, which at my age is a little scary to think about. Every time I give a talk in Europe, someone tells me that the content they understood was great, but that I talk too fast. So if that happens to you, just start waving wildly and I'll slow down. And I'm happy to take questions during the presentation: raise your hand and I'll repeat the question back and answer it, or maybe I'll tell you that you're three slides ahead of me and to just wait a minute.

I assume you all basically know about Ceph: it's a distributed object storage system. We have three principal daemons, although only two of them matter today. In the middle here we have the object storage daemons, or OSDs, which are responsible for storing data and serving read and write I/O to clients, and, when something happens to the cluster that changes its state, for making sure that all the data ends up on the OSDs it's supposed to be on, so that we maintain our replication and durability guarantees. We have the monitors, which keep track of the other participants in the cluster. They maintain a series of cluster maps; in particular, for this talk, the OSD map, which says which OSDs are part of the cluster, whether those OSDs are up or down right now, what their addresses are, and so on. And then we have the metadata server, which doesn't matter at all for this talk, but which you heard about earlier from Patrick and Jeff and may hear more about later. A RADOS cluster generally consists of three or five monitors and a whole bunch of OSDs working together.

When we serve writes out of RADOS, an application connects to the cluster, which means it goes and talks to the monitors and says, hey, what does the cluster look like? I want to talk to it.
It gets back the OSD map and the other maps describing the state of the system. When it wants to do a read or a write to a particular object, say "foo", it runs our magic CRUSH algorithm, which says which OSD is responsible for serving and processing operations on the object foo; that's called the primary OSD. The client sends the write operation to the primary. The primary runs a bunch of validation to make sure the client is allowed to do that, does some pre-processing, and sends replication I/O out to the other peers, or replicas, that share that state. Once the replicas have replied, it returns to the client. Visually speaking, it's one round trip to the monitor, and then we can do a whole bunch of writes off of that one round trip, and it looks like this.

So, for the purposes of this talk, which is about stretch clusters in Ceph, this is what we mean. We're not talking about big wide-area networks with hundreds of milliseconds of latency stretched around the world; Ceph is designed for a local area network and expects fast interconnects. But a thing that has become more popular, that we see more and more often and that people certainly ask for, is having two or three data centers pretty close together on dark fiber, so maybe five milliseconds of ping and fast interconnects, or running in different availability zones inside of an Amazon cloud or some other cloud provider. So we still expect low latency and fast ping times, but you are split into two or three data centers or something similar.
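The write path just described (a client deterministically picks a primary, the primary fans the write out to its replicas, and the client is only acknowledged once every replica has acknowledged) can be sketched in a few lines of Python. This is a toy illustration, not the real CRUSH algorithm or RADOS protocol; all the names here are invented for the example.

```python
# Toy sketch of the RADOS write path: deterministic placement plus
# primary-driven replication. NOT the real CRUSH algorithm.
import hashlib

def acting_set(obj_name, osds, replicas=3):
    """Deterministically map an object name to an ordered list of OSDs.
    The first entry plays the role of the primary."""
    h = int(hashlib.sha256(obj_name.encode()).hexdigest(), 16)
    start = h % len(osds)
    return [osds[(start + i) % len(osds)] for i in range(replicas)]

class ToyOSD:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}

    def replicate(self, obj_name, data):
        # A real OSD validates, journals, etc.; here we just store and ack.
        self.store[obj_name] = data
        return True

def client_write(obj_name, data, osds):
    acting = acting_set(obj_name, osds)
    primary, replicas = acting[0], acting[1:]
    primary.store[obj_name] = data                      # primary applies the write
    acks = [r.replicate(obj_name, data) for r in replicas]
    return all(acks)                                    # ack client only after all replicas ack
```

Because the placement is a pure function of the object name and the OSD list, any client holding the same maps computes the same primary without asking the monitors again, which is why one round trip to the monitor is enough for many writes.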
So the particular risk we're worried about is an asymmetric network split. That's a lot less likely to happen within a single data center, but between sites you probably only have one link between each pair of sites, so it becomes pretty likely that instead of losing a single rack at a time, you might lose a whole data center, and that data center might represent fully half of the cluster, or certainly a third of it. Losing that much of a Ceph cluster on a local area network is possible too (you might lose a power unit that powers the whole thing, or half of it), but the ways in which it happens are different, and you hope it's less likely.

People can deploy stretch clusters today with three data centers; here's a very simple one. You probably wouldn't really do it with two OSDs in each data center, but it looks like this. Or they might do what they often ask for, although we mostly don't encourage it: two data centers, plus a third offsite monitor that might be in the cloud or a VM just running somewhere. The reason you want three sites is that if you have two monitors in one data center and that data center goes down, you can't make any progress. The monitors are a Paxos consensus-based system, so making a change to any of the cluster maps requires more than half the monitors to agree. If you have three monitors, you need two of them to agree; if you have five, you need three. So if this happens and you lose your whole data center, you might have a surviving data center with all of the data still available in it, but with only one monitor it can't say, oh hey, there's only one data center alive, so just continue with this one. It requires manual administrator intervention.
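The majority rule described above is simple arithmetic, and it is worth seeing why one surviving monitor out of three is stuck. A minimal sketch (illustrative function names, not Ceph code):

```python
def quorum_needed(num_monitors):
    """Paxos-style majority: strictly more than half must agree."""
    return num_monitors // 2 + 1

def can_make_progress(surviving, total):
    """Can the surviving monitors still form a quorum for the original cluster size?"""
    return surviving >= quorum_needed(total)
```

With three monitors, `quorum_needed(3)` is 2, so the single monitor left after losing a two-monitor data center cannot form a quorum on its own, which is exactly the situation that requires manual intervention.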
So, the big problem. I was asked to make this possible because of some Red Hat product needs around the OpenShift container storage that we were talking about last session. Monitors have to elect a leader. That's because they're consensus-based: you can send a request to any of the monitors, but they want to pick one of them to decide what order the operations happen in and to maintain membership. So all updates go to the leader, and it distributes those changes out to the other monitors, which are called peons. (And I'm talking too fast.) So a request goes to the monitor cluster; if it lands on a peon, one of the not-in-charge monitors, it gets forwarded to the leader. That has some interesting consequences: during an election, everyone talks to everyone else to try and pick a leader, but once an election has happened, the leader talks to all of its peons and the peons don't need to talk to each other.

I've written out the algorithm here, and I'm going to go through it really quickly, because I've got a bunch of pictures that will make it much more obvious. An election starts for some reason. Generally, either a monitor turns on and joins the cluster by starting an election, or some kind of timeout happens, because the monitors hold leases over the data they can serve reads from. So for some reason a monitor says, I want to start an election, and I'm going to propose to everybody in the system that I become the leader in this new election epoch. We have an election epoch to make sure that everyone's on the same page and that we're not acting on old messages. Then everything after starting an election is driven by receiving messages from the other peers. When you receive a propose, you can do three things.
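The classic propose-handling rules (lower ID is the better candidate: defer to a lower-ranked proposer, bump the epoch and counter-propose against a higher-ranked one) can be sketched as a tiny state machine. This is a simplified Python illustration of the behavior described in the talk, not the actual Ceph election code.

```python
# Sketch of the "classic" election strategy: rank 0 beats rank 1 beats rank 2.
class ClassicElector:
    def __init__(self, rank):
        self.rank = rank     # monitor ID; lower is a "better" candidate
        self.epoch = 0

    def start_election(self):
        """Kick off an election by proposing ourselves in a new epoch."""
        self.epoch += 1
        return ("propose", self.rank, self.epoch)

    def handle_propose(self, sender_rank, sender_epoch):
        """React to a peer's proposal, per the classic rules."""
        self.epoch = max(self.epoch, sender_epoch)
        if sender_rank < self.rank:
            # Sender is a better candidate: defer to it.
            return ("defer", sender_rank, self.epoch)
        # We are the better candidate: bump the epoch and propose ourselves.
        self.epoch += 1
        return ("propose", self.rank, self.epoch)
```

Running the three-monitor scenario from the slides through this sketch: monitor 1 proposes in epoch 1, monitor 2 defers to it, but monitor 0 counter-proposes in epoch 2, and then both 1 and 2 defer to 0, which wins.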
You can say, hey, that monitor is not in the quorum and I want to help it get into the quorum, so I'm going to start a new election to help it join, by sending out proposals of my own. Or, if the sender is a better candidate for leader than me (the way we classically decide someone's a better candidate is that they have a lower ID number), then we'll defer to them. Or, if the sender has a higher ID than us, we'll say, hey, I should be the leader, and we'll bump the epoch and propose again. And then, to win: we win if we get a deferral message from all of our peers; we win if we time out the election but more than half of the peers, including ourselves, have deferred; otherwise, when we time out, we start a new election.

So, visually, we have three monitors here, numbered zero, one, and two. Monitor one says, hey, I want to be the leader, it's epoch one; maybe they all just turned on. Monitor two gets that proposal and says, oh well, you're a better number than me, so sure, you can be the leader. But at the same time, monitor zero gets that message and says, no, I'm a better number than you are, so I'm going to bump to epoch two and propose to everybody. So even though monitor one got a deferral, it also got that propose, and it says, oh, you're better than me, zero; and two obviously says, you're better than me. So one and two both defer to zero, and zero becomes the leader and sends out a victory message. That works great normally.

But if you have a netsplit, very exciting things happen. In this case they were already together, and we netsplit between zero and one. Monitor one eventually times out, because it's not getting any updates from monitor zero, and it says, hey, I'm going to bump the epoch and propose that I be the leader. Monitor two gets that message and says, oh well, sure, you're a better number than me, you can be the leader. And after, I think, five seconds or something, monitor one says, hey, I won, because monitor two agreed with me, so I'm forming a quorum; but it's only the two of us, just monitors one and two. Meanwhile, monitor zero eventually goes, hey, I haven't heard back from my peons in a while, and I think I should be the leader, so I'd better run a new election in case one of them died. So monitor zero also sends out a propose, with epoch three, because it only saw epoch two and bumped it. Monitor two says, oh hey, I got a proposal from a monitor that's not in the quorum; I'd better propose an election to everybody to make sure it gets in. So monitor two proposes, with epoch four, that it be the leader, even though it knows it will lose later. Monitors one and zero both get that proposal from two and say, hey, you can't be the leader, I've got a better number than you. So they propose back, and they try to propose to each other, but they can't, because the network is still broken. Monitor two gets one of their proposals first, and this cycle just repeats, so you never, ever get a working quorum, because the monitors keep trying to elect a leader and can't keep one stable.

This was the first problem we identified when I was asked to make stretch clusters work, and we came up with a plan. The first part of the plan was to make it possible to change the code at all. The election code in Ceph is some of the oldest code in the system; it's from around 2004, written by grad students, and if you've ever read grad student code, it can be really exciting. We were pretty confident in it because it had been stable for a long time, and because we run a lot of tests where we do nasty things to the system and it didn't break. But we actually just saw a new bug from it a month and a half ago, one that other people assumed was a regression from my earlier changes, and it wasn't. So there were some issues. And one of the things that made it particularly difficult was that the code mixed message passing
from monitor A to B with the election logic about what to do with those messages, in the same functions, and that made it hard to test and hard to update. In particular, I wanted to maintain multiple strategies for running elections: the classic one we've been using, and the new one I'm about to talk about. So I decided to split that up into a new election logic class, which decides what to do when, say, we receive a propose, and the message-passing elector class that we already had. Once I made that split, I also wrote unit tests for the election logic class, which was very exciting for me, because if you want to test this in a lab (we have a big lab we can use for testing), it still needs to run for tens of minutes or several hours, and it's hard to create specific scenarios. Whereas in unit tests I can poke directly at the code, it all runs inline in one process, and we do time-step advancement. That made it a lot easier to work with, and easier to iterate and experiment when I designed the algorithm. It turns out that when I designed it, the core idea was fine, but I needed a bunch of new invariants: things that are always true when you're just going by ID numbers, but that you have to make true deliberately when you're using more complicated systems.

So here's an example of the unit tests I ended up building. This one is just a demonstration of the scenario I showed you graphically before. The function is called blocked_connection_continues_election, and we pass in an election strategy; for instance the strategy CLASSIC, which is what I named the old one I talked about. We create an election, and this one has five electors, fake monitors, in it, and we give it the strategy, which here is classic. Then we say, hey, we're going to block messages between monitors zero and one, and then we're going to turn
all agree on what the epoch is, because if they all agree on the leader but some of them think the epoch is ten steps behind the others, something went wrong. The test harness is about 500 lines, and this is one of the simpler tests. This particular one passes on the classic strategy, but once I fixed netsplits blocking progress, it fails, so I don't run it on my new strategy.

My new strategy is called the connectivity strategy. The idea is: we want the most-connected monitor to win. So we run heartbeats between all the monitors, to try and figure out how stable they are, and we generate scores from them. These scores are per pairwise connection: each is a "this connection is alive" number representing how alive the connection has been over the past time period. It's not quite the right formula, but we have a half-life of 12 hours by default, and we ping every second, or maybe every two seconds. Assuming the ping comes back within a two-second period, we say the connection is alive, and has been alive since the last time we got a ping, so we increase the score; otherwise we decrease the score and mark the connection as down. Each monitor maintains scores for all of its peers, but they also share those scores widely: as monitor zero, I know exactly what the current scores are from me to everyone else, and I have a slightly out-of-date view of what monitor one thinks the scores are from it to everyone else. Then, when we get a propose message, instead of looking at the ID number, we look at the total score, and the one with the higher connectivity score wins. We still tie-break on ID number, but a tie is super unlikely once anything has died.

We can also specify that monitors are disallowed. If you have two data centers close to each other and a tiebreaker monitor far away, you don't really want everything to have to go out to the tiebreaker monitor, which might be hundreds of milliseconds away. So we say it's disallowed: its only job is to pick one data center to win. There's a lot of nuance to that, but that's basically the idea, and there's a pull request up for this now.

Visually speaking: we've got our three monitors, they're pinging each other happily, and they're maintaining scores. I've just got their local scores here, but everyone says everyone's up, with a score of one. If we netsplit, then after a while zero and one both say that the other is down, and they decrease their scores for each other. But monitor two can still talk to everyone, and it says so; monitor two's scores for everyone else are still high, and everyone's scores for monitor two are high. So after a timeout period, monitors zero and one will both propose to monitor two, because they can't talk to each other. Monitor two gets those proposals, runs the numbers, and says, hey, I've got a better score than you guys do, so I'm going to propose that I be the leader in epoch four. Both of them say, yes, I agree, you have a better score than me, you can be the leader, and monitor two wins.

But suppose you then go on to disallow monitor two (I don't remember whether disallowing itself triggers an election). We were running as before, and then we said, oh, I don't want monitor two to be in charge, so we disallow monitor two and a new election happens. If one or zero proposes to monitor two, monitor two will say, hey, I am not allowed to be the leader; what actually happens is that if you're disallowed, you just have a score of zero. So it says, well, monitors one and zero both have the same score, but I'm going to pick one of them, and monitor zero is the one I pick: it's got the lower ID number and there was a tie. Normally there probably wouldn't actually be a tie, just because of timing, so one of them would win effectively at random, but whatever. And then, after the timeout period, monitor zero says, hey, I won, I have a quorum of zero and two. Keep in mind, though, that while
that's happening, even though the quorum is zero and two, one and two are still happily pinging back and forth, maintaining and changing scores, and those scores are being propagated back out to the other monitors through the working connections. Monitor one is also going, I'm running, I want to be in a quorum, I'm going to propose to everyone I can see, so it keeps sending out those propose messages. But in this strategy, when monitor two receives a propose message from an out-of-quorum peer, it says, oh, you are out of quorum, and it runs a test: if you were in the quorum, which monitor would I vote for now? If it would vote for the same one it voted for last time, it just ignores the proposal. If the proposal would change the outcome, it might respond, or bump the epoch and say, hey, I want to do an election. But in this case it just says, nope, your sending a proposal would not change the outcome of where we are right now, so I'm dropping it.

Oh, a question. The question is: what happens if you have an overloaded link with high latency or packet loss, but not a full break? Often an overloaded link with high latency or packet loss will look to Ceph like a completely dead link, just because of the amount of traffic we shove through it. If it keeps serving all the messages quickly enough, then it might stay up and we'll have a problem; but eventually we would expect that link to get backed up, and the pings will stop happening, or stop being replied to quickly enough, and so the monitors will migrate away from letting that one be the leader, and you'll start getting failure reports from the OSDs and the monitors and so on. So with a partial failure you get partial detection, and it may be slow for a while, but it's a better state than where we are now.

Anyway, going back to the unit testing. Now that I have this cool connectivity strategy, we want to test it, and instead of testing it in the real world, we can just set it up to look the way we want. I have this test here where the ConnectionTracker is the thing that keeps track of all the scores that everyone sees, and a connection report is: I, monitor zero, have these scores for myself and for monitors one and two. So we grab out a connection report and directly edit the history, which is the score (I guess I didn't change the liveness), bumping up the version so that this propagates, and we can do this to a bunch of them. Then I disable the pinging between monitors, which means they'll stop sending the ping messages, so they will not change the scores away from 0.5, and I know exactly what the scores are; the only time the scores propagate is when they send election messages to each other. Then I run forward in time, start everything up, and run the time steps, and I make sure, hey, we resolved on a leader, and that leader is stable and the quorum is stable, even though they had disagreeing scores for each other; and we check that the leaders agree and the epochs agree. So that's an example of how we can test.

With those changes, the monitors are happy in multi-site: you can have them in a cluster, they deal with netsplits, and they can deal with dropping one of the data centers if we need to for some other reason, perhaps because of what we're doing with the OSDs. So: when a cluster changes, when an OSD comes up or down or something about the cluster map changes, the OSDs look at that change and say, hey, do I have any data I'm responsible for now that someone else used to be responsible for? If that's the case, they send out a notification saying, hey, you're supposed to be primary for this piece of data now. And when you find out you're primary, you need to make sure you have the newest version of the data, and that means going
through the OSD maps and finding the old peers who might have data that you don't. You query those peers and ask what version they have, and if they have newer data, you ask for their update logs and then for the updates themselves. Graphically speaking: in this case we have two OSDs, and the one on the left just rebooted or something, so it's got version 5 of the data, while the one on the right has version 8. The left one asks for the version and gets it back; it still knows it's got version 5 with content A, but now it knows there are newer updates. So it asks for the updates: what are they? This is really a collection of objects, so it's asking which objects changed, and it gets back, oh, it's these ones. Then it says, hey, send me the new data. This is a series of messages, because we don't want any of them to be too large; so it might get two objects at a time, know it has one left, ask again, and get it back, until it has the newest objects. Makes sense, generally speaking.

So that's a really simple example. In real life we generally have three copies of all the data; I'm not putting that on the screen because it's harder to draw and harder to show the problems. But with three copies of all the data, we generally require a minimum size of two to go active, that is, to serve any reads or writes on the data. The reason for that is that we are a strongly consistent system: we notice if we aren't consistent, and we notice when we could become inconsistent. In particular, if we go active with one OSD holding the data, and we accept a write, and then that OSD goes away, we've lost data and we can't get it back. And if you are running a system with three copies, and we tell you that one OSD died and now you've lost all your data and can't access your block device, you're going to be really angry at us.

Now, the way this works is that we don't want everyone else to know the exact version everything is at, because that would mean talking to them on every update; but the OSDs know the exact version they're at, and the monitors know which OSDs were allowed to take updates. We do work to make sure that's the case, and the reason is that when someone becomes a new primary, they need to know who to go ask for the data. So the OSDs report in to the monitors: hey, I'm going active, mark down that I'm allowed to make changes. This is one of the things we do just to make sure that we can stay absolutely consistent, and so that we know who to talk to.

So, suppose we have the model with two main data centers and one offsite tiebreaker monitor. This is a really simplified version of it, obviously, but the problem exists even with more complicated setups. Say we have data center one on the left and data center two on the right, with OSD 1.0 and OSD 2.0, and they're only peering straight across, for whatever reason. Your cluster is running and everyone's at version 8, but then we lose OSD 2.0. Life is sad, but hey, we still have OSD 1.0, so it can keep serving the writes and everyone can keep making progress. But now we lose data center one, and that's bad, because OSD 1.0 had the newest version of some data that no one else had. In particular, maybe we get OSD 2.0 back, but it still can't go active and serve reads and writes, because it knows: OSD 1.0 went active and probably did some I/O,
and is the data now at a newer version than mine? So even though you built this thing for data-center-loss tolerance, it no longer tolerates data center loss, because of just one bad OSD that wasn't up to date, and that makes everybody sad. Now, there are systems where this is okay; there are storage systems where, if you do that to them, they just won't even notice that you went back in time, and maybe your application won't notice either, and maybe you won't notice either. But maybe your application does notice, and things go horribly, horribly wrong. So we don't let that happen. And the example I've given you is a toy example, but it's not super unlikely that you have, say, dozens of OSDs and one of them rebooted, because you were updating server software or upgrading Ceph; so one OSD was rebooting while its peer kept processing data in the other data center, and then the data center link broke. Most of the data is still available, but some of it isn't, and we really need all of it, because of the way we distribute virtual block devices and things.

So this is work in progress, what I'm about to start talking about; I'm working on it now. I was hoping to have the pull request up so I could direct you to it here, but not quite. Very, very soon. The design target for our stretch mode is two main data centers, plus a third (or fifth) tiebreaker monitor somewhere else, and two copies of the data in each data center. The reason for that is mostly what Red Hat supports and what I felt comfortable letting out into the world: we do support two copies with min size one if you're running on all flash, so if you lose a data center out of this, you still have two copies of all the data, and the surviving data center and set of monitors are, on their own, a supportable configuration, even with the dead data center, if you just cut it down.

We restrict the OSD-to-monitor communication so it stays within a data center. By default, the OSDs are allowed to talk to any monitor they want, but besides the fact that it might be higher latency to go out and talk to someone else, we also want to notice when OSDs are netsplit from their peers. OSDs ping each other just like the monitors do now, and they report when they can't talk to their peers, to try and mark them dead; but if an OSD can talk to any monitor that's in the quorum, it gets to stay alive. So if there's a netsplit between the data centers, we don't want the OSDs in the losing data center to all be talking to the tiebreaker monitor, because they can reach it, and having that tiebreaker monitor keep them alive; that would be bad. So we say OSDs are only allowed to talk to monitors in their own data center to stay alive. Then, in addition to requiring a minimum number of OSDs to be alive to go active, we say you have to have OSDs from more than one data center, or availability zone, or whatever; it's a configurable thing. That means that if we lose an OSD, we'll have to make another copy in the same data center; by default you'll have two in each, so you'll lose an OSD, but we'll make sure that we go active with at least one OSD from both sides, and probably all three of them. And then, let's see; right, I already went through this. Cool.

So, visually: everybody's pinging each other; we lose one OSD and the cluster keeps running. In this case OSD 1.0 isn't allowed to do anything on its own, so you have a data unavailability, which is not great, but you can go work on getting OSD 2.0 back online, and meanwhile OSDs 1.1 and 2.1 are updating their data. But then, when the links break, the tiebreaker monitor chooses data center two to survive, for whatever reason, and they make their own little cluster: they change the rules to say you are allowed to go active without having OSDs in multiple data centers, and we are disallowing the monitors in data
center one from becoming the leader. Then we can keep making updates to OSD 2.1, and when OSD 2.0 comes back, it can keep taking updates, and life is great. And I have five minutes, if you guys have questions. Everybody's stunned, I guess.

You mentioned there's a half-life of 12 hours; why did you settle on that number? It was completely arbitrary; I needed to write a number down. It's configurable, but basically we don't want to immediately forget that a connection died, but we also want it to age out, because if you reboot the server a monitor is on, its score will drop, and we want that to age out.

The question is when this is coming out, and the answer is: it's going to be in before Octopus is released, at the end of the month or in March, or else I'm going to be in a lot of trouble.

The question was: with two data centers, am I suggesting four copies instead of three? Yes, that's correct. Like I said during the talk, I really want that, because if you lose a data center, maybe it's temporary, but there's a good chance it's very long term, and I want the surviving cluster to be supportable on its own. Some people may be sad about that, and in that case you can run with three data centers, which will mostly work, although it might still have trouble with netsplits on the OSD side; I'm planning to get to that after the Octopus release, but for the product needs Red Hat gave me, this was sufficient, and we could go to four copies.

The question was whether the reason we have ping statistics in the manager node list is because of this feature. Actually, no, it has nothing to do with it; it's just good information. Network issues underneath Ceph have in general resulted in a lot of problems that are hard to diagnose, so we want to expose that more carefully and make it more obvious when they're probably the underlying cause of an issue. Ah, so, yes, the monitors use their own statistics. I think the ping statistics that you get in the manager are actually just from the OSD heartbeating, although I'm not totally sure about that; I guess we could plug these into the manager as well, but we don't yet.

Someone in the back had a question. Yes, the question was whether five milliseconds is a hard limit or something we arrived at by testing, and the answer is: five milliseconds is not completely arbitrary, but it's not a hard limit. It's just that we have run stuff in a lot of data centers that had latency up to around two and a half or three milliseconds, and five milliseconds, when you add up the latencies on doing write I/Os, is, you know, not great, but it's sort of in the range of what you might see from a spinning hard drive, just a little bit faster, so it's plausible. Whereas if you do a hundred milliseconds, then some of your primaries are going to be across your remote link, your reads are going to be slow, and your writes are like two and a half of those hops or something, so they're going to be really slow, and everyone's just miserable. So it just needs to be, you know, kind of data-center-ish.

Ah, so the question was why I keep talking about having our own pings. I've been writing down "pings", but it's actually a normal Ceph over-the-wire message that goes through the normal Ceph communication pipeline; it's not a network primitive, and that's on purpose. We want to make sure that the entire stack is working properly, because it may be that the monitor is still alive, technically, but if it's got gigabytes worth of data in swap and can't handle messages, then we want to mark it as not behaving.

Ah, the question was about messenger v2, and it is not necessary for this. I mean, you'll have it, and you should run messenger v2 if you have it, but this is at a higher layer. I'm sorry; ah, the question was about whether this will be in a specific Linux distribution, and I have no idea. This is going into the next upstream Ceph release, and distributions will get it when they get it. It has nothing to do with what the clients are, so it's all server side; it's just part of the packages.

Okay, well, thank you. Thanks, all.