All right, thank you everyone for joining us to talk about Swift rings. I see a lot of Swift contributors in the room; they'll keep us honest. For the rest of you, I hope you're ready to dive into some technical stuff. Christian will guide us, I'm sure. All right, thank you very much, Clay. My name is Christian Schwede, I'm a principal engineer working at Red Hat. Clay described me as a stand-up guy, whatever that means. And me, as Clay said: I'm Clay Gerrard, I work at SwiftStack, which is another company contributing to Swift in addition to Red Hat. I'm a senior software engineer there and I've been working on Swift for quite a while.

All right, so with that cleared up: another title for this talk could have been "Rings 201". Swift has been around for a long time. It is a deeply entrenched open-source technology, and there's a lot of information out there available for Swift. We wanted to take this opportunity, since this is being recorded, to go a little bit deeper on some of the really important technology that Swift is built on. In this talk we first want to cover why rings matter, what the rings are, and how they work. Then we'll get into some deeper material, and as an aside we're really thinking about the people operating Swift clusters, trying to give some tips so that people understand how to use rings, how to do interesting things with them, and how they power different things you can do in your Swift clusters.

This is not Swift 101. If you're looking for a more general introduction to Swift, there have been some great talks in the past; we have a few links listed here for you. And if you'd rather read than watch videos, there's also something linked on this slide with a short introduction to Swift itself and the concepts behind it. If you're interested, please have a look.

All right, so first: why the heck are we talking about rings? Rings are one of the basic tenets of Swift's architecture, and they're a really important concept that permeates the whole system. They are not merely an implementation detail; Swift's rings are a core component of its architecture and they enable a lot of different things. Anyone who has operated a Swift cluster understands that this is not your typical sysadmin work; Swift operators are some of the best in the business. From the very beginning, when they're configuring their Swift cluster, they have to think about the topology of this distributed system. There's a lot going on when you run a large-scale distributed storage topology, and in particular operators are manipulating rings as they think about their clusters and how they're rebalancing, managing, and adding capacity.
These folks are ring masters. So, your rings contain most of the information that is required to run a Swift cluster. All your devices are contained within the rings, and your servers too, but it's not only that: you have the concept of zones, where you group multiple servers into a failure domain, for example. You can extend that a little further with regions, where you group servers and zones into multiple regions if you run a geographically distributed cluster, so users might be able to store data both in the United States and in Europe. At the same time you can use multiple rings with storage policies, which allows an operator or a user to decide where data should be stored. Even in a geographically distributed cluster you can decide that this data should only be stored in Europe, or only in the United States, while other data should be stored fully distributed. This is made possible by storage policies, which basically let you use multiple rings for objects. The default has been a replicated strategy, which worked well for a long time, but with storage policies you can also use erasure coding, which gives you better usage of your devices: you don't need to store all the data, say, three times replicated, so you have less overhead, and that allows you to lower your costs significantly.

With any of these concepts you can go even deeper. At the device level you might have different types of hardware in your cluster, like SSDs. Zones represent failure domains, which may affect how you think about upgrading your cluster. There's all the cross-region stuff we just talked about. The storage policies each have their own ring, the same as the account and container rings. So this is definitely not something that doesn't matter; this is something operators are thinking about as they work through their Swift clusters. That's why rings matter, and what we hope to do here is talk a little bit about the data placement technology, the algorithms that go into Swift, so that you have some concepts to take back when you're thinking about your cluster topology and how you want to employ these rings to do interesting things in your clusters.

Luckily, Swift rings are actually very simple. Consistent hashing was first introduced back in 1997, which was the same year the HTTP/1.1 RFC (RFC 2068) was approved. So this is old tech, and it really is a simple idea for distributing things.

Actually, let's start a little earlier. Let's assume you are a company and you want to archive some snail mail, classic letters, and you have a bunch of zip codes for different companies: for example the zip code for SwiftStack and the zip code for Red Hat. Now you want to distribute your snail mail in an organized way into a limited number of boxes to archive it. One approach would be to track the zip codes and create a table: this zip code goes into that box, another zip code goes into this box. Well, that's a little bit complicated.
The more zip codes you have, the worse it gets. If you also use the company names, you might have millions of these entries, and whenever a new zip code comes in you need to update the tables. If you have a distributed cluster it becomes really complicated: you need to update and distribute all this stuff all the time. An easier way is to just use a distribution function and remember that function. In this case it's very simple, a modulo operation. That's not exactly how it works in Swift, but for this example I just take the zip code, apply a modulo to it, and that gives me the number of the box where the letter should go. This scales very well: it basically doesn't matter how many zip codes you have as input, it's still the same function, and you can easily describe which box contains your letters.

Within Swift we aren't using boxes, we're using partitions. What we do is take the object name, including the account and the container name, which together describe a namespace, and this namespace is mapped onto a fixed number of partitions within your Swift cluster. A Swift partition is not a partition on your hard drive in the partition-table sense you might be familiar with; it's really a directory if you look on the disk. If you go to one of your object servers and dive into Swift's directory structure, you will see something like this, and at the end of the whole path you'll find a .data file; that is a replicated object. Before that you find some cryptic-looking parts: your partition number, which is computed from the object name; the hashed object name itself; and a suffix taken from the hashed object name, which is another subdirectory level. If you look into the Swift source code and start digging around in how this is built, you'll see slightly different names: we have a partition here, a suffix here, and a hash here. Those are the names used in the source code. That's really all you have to remember if you're digging down through the Swift directories: part, suffix, hash. Easy as that.

All right, so that's a little bit about consistent hashing. We're going to be talking about the partitions that Swift maps to devices. If you've dealt with other consistent hashing algorithms, you may have heard these referred to as vnodes or even buckets, but Swift calls them partitions. The reason we refer to Swift's as a modified consistent hashing algorithm is that, in addition to spreading your namespace across all of your partitions, the partitions also have replicas. And the key thing about the ring, the main data structure that it distributes in addition to the lookup function, is the replica-to-part-to-device-id table (replica2part2dev_id in the code). It's actually a really simple thing.
We call these things rings, and I don't know why: they're really more of a table. This is the main data structure, and all it really says is that for each replica of every single partition, we write down which device that replica is on. That's it. This is essentially our address book. The data structure is large; there are a lot of partitions, 2^10 to 2^16 depending on your part power, and with your replicas you might even have seven or eight or ten entries per partition if you're doing erasure coding (they're not actually replicas at that point, they're fragments). But if you understand that this is all the data structure is, you could put those anywhere you want; you could write down any devices. Of course, it's the job of the ring code to figure out which ones are the right ones to put there.

Once you have this table in hand, you can provide the main function of the ring, which is to locate data on disks, or on nodes specifically. We classify the placement of a partition in two categories. The main one is the primary lookup. This comes out of the get_nodes function, and it's basically just an indexed lookup into the ring: once you know the name of an object, you compute its hash and map it onto a partition, and that partition is an index directly into the table, which tells you exactly which three (or more) primary devices hold it. It's very quick, and you know exactly what the primary devices are. But in a highly available system the idea is that at any time one of those nodes, or several of them, can be unavailable, and this is where Swift introduces the concept of handoffs. Sometimes I think people get confused and assume the handoff nodes are written down somewhere, that it's something we know. It's really important to take away that the handoff nodes are just a deterministic function. You can ask get_more_nodes for as many more locations as you want; if one of them is unavailable you can always get another one, because it's just an algorithm that traverses that table, skipping devices that are in a similar failure domain, and you always get them back in the same order. It's deterministic, but we never write down the handoff nodes; it's just an algorithm, like this little factory here.
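Here is a minimal, self-contained sketch of the lookup path just described. The tiny table, the device list, and the helper names are made up for this example; the real ring stores much richer device metadata (ip, port, region, zone, weight) and mixes a per-cluster hash prefix/suffix into the MD5.

```python
from hashlib import md5
import struct

PART_POWER = 2                       # 2**2 = 4 partitions, just for illustration
PART_SHIFT = 32 - PART_POWER

# One row per replica, one column per partition; each cell is a device id.
replica2part2dev = [
    [0, 1, 2, 3],                    # replica 0
    [1, 2, 3, 0],                    # replica 1
    [2, 3, 0, 1],                    # replica 2
]
devs = [{'id': i, 'zone': i % 2, 'device': 'sd' + 'abcd'[i]} for i in range(4)]

def get_part(account, container, obj):
    # Hash the full object path and keep the top PART_POWER bits of the digest.
    path = ('/%s/%s/%s' % (account, container, obj)).encode('utf-8')
    return struct.unpack_from('>I', md5(path).digest())[0] >> PART_SHIFT

def get_nodes(part):
    # Primary lookup: a plain indexed read of the table, one device per replica.
    return [devs[row[part]] for row in replica2part2dev]

def get_more_nodes(part):
    # Handoffs are never written down: they are generated deterministically by
    # walking the remaining devices, preferring unused failure domains first.
    primaries = get_nodes(part)
    primary_ids = {d['id'] for d in primaries}
    primary_zones = {d['zone'] for d in primaries}
    others = [d for d in devs if d['id'] not in primary_ids]
    yield from sorted(others, key=lambda d: d['zone'] in primary_zones)

part = get_part('AUTH_test', 'photos', 'cat.jpg')
print(part, [d['device'] for d in get_nodes(part)], list(get_more_nodes(part)))
```

The point is simply that the primary lookup is an array index, while handoffs are regenerated on demand in a fixed order rather than being stored anywhere.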
Okay, so now we know how a ring is used to provide Swift with the locations of the data it needs to find, and we understand what that data structure looks like. But if it's just a consistent hashing algorithm where you keep track of multiple replicas, why can't we just write down any devices in that table? That's where we've learned over the years that there are properties that make one ring good and another ring bad. So obviously a good ring is good: it's got good dispersion, good balance, a little overload but not too much; and if you have a bad ring you get these very helpful error messages that tell you exactly what's wrong. No, that's not really how it works. Rings are not simply good or bad; rings fall on a spectrum of gray, and there are a number of trade-offs you have to make whenever you're assigning partitions across the failure domains in your cluster and trying to balance things out. So we're going to dive into some of these concepts, so that as you're thinking about your ring and reading the feedback from the rebalance operation, you have some ideas to map it to and can understand how it works out.

All right, so the rebalance algorithm is actually somewhat constrained. We have devices we want to distribute our partitions to, but of course we don't do that randomly. We want to take into account that these devices are attached to actual servers, and these servers are grouped within zones, which might in practice be racks. If you have multiple racks, you put a few of these servers into one rack and a few others into another rack, so that even if a full rack becomes unavailable, due to a switch error, a power outage, whatever, Swift is still operational and can still return the data you stored earlier. And if you have multiple data centers, you can assign servers, in groups, to multiple regions. The algorithm needs to find a way to balance all of this out and to assign partitions, and especially their replicas, to different failure domains.

When you look into the source code of the Swift ring-building algorithm, you'll see that we often use the term tiers. These nest: a region tier contains, for example, multiple zone tiers, a zone tier then contains multiple device tiers, and so on and so forth. What's important is that a failure domain fails together. If all your replicas are stored on a single server, they will fail together, and that's bad, so you need to know your failure domains: power outlets, network switches, things like that, maybe even different parts within the same data center, for example separated by fire walls, just to be sure that if there's a fire in one part of the data center your data is still stored safely.

Now let's have a look at what actually happens within Swift itself. We have a bunch of buckets here, and these buckets might be disks. These buckets hold partitions, and the partitions hold our object data. In a well-balanced cluster, everything is distributed very evenly and assigned across all nodes. Let's have a look at a specific partition that holds three replicas; these three replicas are, for example, one object, a photo or a video or whatever. Now, these disks are normally attached to a server, right?
For example, six disks attached to one server and six other disks attached to another server. The bad thing here is that if that server fails and it held all the replicas, you will have a very unhappy user, and probably a very unhappy operator afterwards as well. So the ring algorithm needs to find a way to distribute this better: at least one replica should be stored on the other server, in the other failure domain, so that even if the server on the left fails your user is still happy, because he or she can still retrieve the data later on.

We have a term for this, and we call it dispersion. It's a measurement of the risk that we lose access to replicas if a failure domain becomes unavailable; it tells you whether placement is as unique as possible. The best case is a dispersion of zero: even if one of your failure domains is unavailable, the data is still available from a different failure domain. So if you have two failure domains and you distribute your replicas across those two, that should be fine; if one of them fails you're still in good shape. But if you add, say, another zone, another failure domain, more servers, and then both of the failure domains on the left go out of service, it gets complicated again: the ring can't help us and data is unavailable. That's not what we want.

Yeah, and the key thing there is that dispersion doesn't mean you only have one replica per failure domain; it means placement is as unique as possible. If you have as many failure domains as replicas, your replicas should be spread out evenly among them, and if that's not the case because of some of these other constraints, that will lead to bad dispersion. So it's important to understand what those dispersion numbers are trying to tell you; there are other tools to help you out.

You've also got to balance the rings. Looking at another example: given this fundamental constraint of dispersion, one idea would be to just put one replica evenly across all of our failure domains. Say we have three failure domains and we're placing five partitions; we put one replica of each partition in each failure domain and keep that up, and things are going fine until suddenly we see that some devices in one of the failure domains are already holding several partitions each. If you keep this up for a while, the devices in the smaller failure domains get more and more overloaded while devices in the larger ones go underused. That's an example of what can happen at larger scale: if you have very differently sized regions, or in this case servers, you end up with some devices packed full while other devices are not used at all. That is not a satisfactory placement algorithm, even though everything is fully dispersed: capacity is not being used evenly, and nobody is going to be satisfied if the storage system simply isn't using some of the available capacity.
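To make those two metrics a bit more concrete, here is a toy sketch; the function names and the shape of the `assignment` mapping are assumptions for this example, not Swift's ring-builder code. It treats dispersion as "how many replica placements share a failure domain with a sibling replica" and balance as "how far the busiest domain sits above its fair share by weight".

```python
from collections import Counter

def dispersion(assignment):
    # assignment[part] is the list of failure domains holding each replica of
    # that partition. 0.0 means no partition has two replicas co-located.
    at_risk = total = 0
    for domains in assignment.values():
        counts = Counter(domains)
        total += len(domains)
        at_risk += sum(c for c in counts.values() if c > 1)
    return 100.0 * at_risk / total

def worst_balance(assignment, weights):
    # How far (in percent) the most loaded failure domain is above the share
    # its weight says it should carry.
    load = Counter(d for domains in assignment.values() for d in domains)
    total_parts = sum(load.values())
    total_weight = sum(weights.values())
    fair = {d: total_parts * w / total_weight for d, w in weights.items()}
    return max(100.0 * (load[d] - fair[d]) / fair[d] for d in weights)

# Partition 7 keeps two replicas in zone z1 -> a third of placements are at
# risk, and z1 carries 50% more than its fair share with equal weights.
example = {7: ['z1', 'z1', 'z2'], 8: ['z1', 'z2', 'z3']}
print(dispersion(example), worst_balance(example, {'z1': 1, 'z2': 1, 'z3': 1}))
```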
We don't have a situation where our capacity is being evenly used Nobody wants to no one's going to be satisfied if the failures if the storage system is just not using some of the available capacity So that's a couple of things we want to take a break now and talk about What happens when you do swift ring Swift ring builder rebalance to a ring Yeah, so actually rebalance happens if you for example add a new capacity to your rings What you want to do is that the existing data is distributed well after you add a new capacity, right? You start with 10 servers most of them They are maybe full already you add 10 more servers now you want to have these 20 servers Maybe with a fill level of 50% or something like that So that's where the rebounds progress comes into play Again a simple example you have four buckets and you have three replicas. These are stored some way or distributed some way like this for example and Now you assign or trigger a rebalance actually which is a command that you apply to your ring not even on the cluster it could be on an operator's notebook somewhere else and The rebounds algorithm will tell you or will tell the ring. Okay, this is your new distribution So before we had one copy or one replica on the lower right and This replica should actually move to the upper right bucket, right? Well, it's not as easy as that Actually this introduces a failure so The replica is not immediately moved right if I rebalance my ring and push a ring out to the whole cluster The ring knows, okay, I expect my data on the upper right note But the data might not be there yet. It still needs to be replicated to this cluster So you have a failure actually the proxy will work around this so the proxy knows, okay If one copy is not there it will try another copy. So it will still find copies on the notes on your right on the right side and it will work around this but It's a little bit more complicated. So if you rebalance multiple times in a row for example and In that case you would introduce more failures and it might happen that all three replicas are now or should now be moved to a different note to a different disk and All of them are now unavailable. So you clearly want to avoid this That's the reason why the rebalance algorithm only moves a single replica at a time So when you call it once only one replica will move now When you call it at a second time, it might move another copy And to limit this a little bit we have another constraint, which is the min-part hours So the min-part hours tells the Well test the algorithm, okay You are only allowed to move one replica within a given time range That might be one hour or 24 hours. Whatever is suitable to your cluster so before you Before you finished replete a full replication cycle. You should not rebalance the same ring again All right Now normally you have a few more Partitions per disk, right? You don't have only a single partition per disk. 
Now, normally you have more than a single partition per disk; you'd have more data. In this example we have three partitions distributed across four buckets, and each partition has three replicas stored on different nodes. Now we're adding more capacity, and these new buckets are a little bigger than the old ones. We have another concept within Swift rings called weights. The weight tells the algorithm: this bucket is much bigger and can hold more partitions, so that we get a roughly even utilization on each of the disks. Even if you have, say, 2 TB disks in your cluster at a fill level of 75 percent, your 10 TB disks next to them should also end up at a 75 percent fill level.

So the rebalance algorithm assigns new locations for some of the replicas, but the data actually needs to move, and that's the job of the replication process. What you can see here is that replication will try to distribute the data across multiple targets; the targets might be disks or servers. So the rebalance and replication processes together move some data, but only a single replica of any partition at a time, as you can see here.

I just said you should only rebalance again after a full replication cycle, so you need to know when your full replication cycle is done. One option is to look at your log files, but that's a tedious process, especially on larger clusters: with a hundred nodes you'd need to query all of them. There's a very useful tool for this, swift-dispersion-report (together with swift-dispersion-populate, which creates some small objects distributed across your cluster). The dispersion report then queries all of the replicas of those objects. After a rebalance some replicas might not be in their primary locations yet, and the report tells you this. For example, this output came from a three-replica cluster where we made a large change, and only 79 percent of the object replicas were found in place at that moment. With three replicas you should always be between 66 percent and 100 percent, and that 79 percent should keep increasing as the replicators fix things for you and move the replicas into the correct places.

All right, let's look at this some more; I've got some visual props to help us out. This is a little setup I had, with a standing region in DFW, and I was adding a region in Sydney. You can see that during the first crank of the rebalance of the ring, these devices, which were actually servers that all had the same amount of storage capacity, only got about a thousand partitions; I didn't assign as many partitions to them because we did a gradual weight increase, so you can see there are fewer partitions assigned. You can also see that the standing nodes down in DFW had a lot more actual capacity on disk, actual weight on the devices, and the SYD region was starting to fill up. So that gives you an idea of what's going on during the rebalance.
You've got The the capacity filling across So looking at it another way This is Another way to think about it You have all of your primary partitions that we talked about those are the assigned locations on the disk Actually in the ring in the replica part to dev table and then you have your handoff partitions Which are partitions that are not supposed to be on this disk They're assigned elsewhere and you see right here We made our first ring push and that immediately like Christian was talking about introduces a fault into the cluster Suddenly a bunch of partitions are not in the right place the total number of partitions didn't change But our representation of them did and that's why that big red dip comes down And you can see slowly as that green line is pushing up as the replication is actually pushing The partitions out and in this small cluster everything sort of finished at the same time finally once all of the partitions were actually out in the primary locations then the Standing nodes that were holding the handoff partitions were able to delete their local copy This is why whenever you see New rings come out into the cluster that you're going to see the total capacity used go up for a while because you have to Replicate them out first then you can delete the ones that you have but as soon as the Everything was synced out. They were able to validate that the replica was fully placed out on all the nodes They could delete their local copy you see those handoff partitions trending down and as soon as it finished We kick out that next ring and so it's sort of the process repeats again All right, so I think we have all the tools now to talk about what's going on in a rebalance and in particular Examine some cluster topologies whenever some of our fundamental constraints about balance and dispersion come in conflict with each other So dispersion is all about spreading things out. It's about achieving highly availability And balance is all about making the maximum usage of the capacity that's available in most rings We can solve both of these things perfectly and just give you a ring that works really well But in certain situations these can come into conflict and I think we can we can maybe look into why that's the case But to do so we need some mental props that we can use to sort of think about how things work And I'm going to introduce a concept to you that we use in the ring code and hopefully you guys can follow along Mental thought exercise here. Imagine that you have two failure domains and three replicas Of part worth of partitions that you want to place So the first thing that you'll do is put one replicas worth of partitions in one failure domain and the other Another replicas worth of partitions in the others failure domain. That's only two. You still have one left So you split it up half in each All right, so it's very easy to model. This is just being 1.5, right? But it's 1.5. What? Well, it's we call them replicants It's not a third or a fourth or a fifth. It's a portion of a replica. It's a replicant obviously But what does that really mean? I mean we can't split up a partition But if you think of all the partitions that have multiple replicas, you could say that well, this has 1.5 replicants It has a replica of every partition. That's the whole one and then it has a 0.5 replicants or you know, how many partitions have two replicas here? 
Well, exactly half of them. Exactly half of your partitions have two replicas in this failure domain. That's how it works out whenever you start to split things up. Let's do another exercise and get a better feel for it. We're going to start with four zones. I hope you can follow along here; it's going to go fast, we're running out of time, so you'll have to give me some slack. Let's say these three are all equal in size, and this one up here is twice as big. If these four failure domains are sized like this, we can think of the cluster as basically five equal units, and we're going to put three replicas into a topology that looks like this. If we want to put three replicas into five evenly sized units, we can divide it out and figure that each unit is worth about 0.6: three over five, double both and that's six over ten, so 0.6 per unit.

Now we're going to say this big one holds about one whole replica. I've drawn this in crayon here, so cut me some slack; it's a little longer than 0.6 doubled, but it's close enough as a visual representation. If we place one replica here and then 0.6 in each of the lower three, we haven't placed all of our replicanths yet: 0.6 plus 0.6 plus 0.6 in the lower three is 1.8, plus another one up top makes 2.8. There are still two tenths of a replica left, and where do they have to go? They have to go up here on the top, and we actually saw this coming: if this zone is exactly twice the size and each unit is worth 0.6, we knew it was going to hold more than one whole replica's worth of partitions. So looking at it here, we're actually holding two replicas of some partitions in this zone. How many partitions are holding two replicas? Exactly 20 percent, 0.2, because the way it works out this zone is holding 1.2 replicanths.

We don't want to do that. This introduces risk into this failure domain: if we were ever to lose it, that's multiple replicas at risk, particularly during a rebalance where we may already have introduced a fault because we're changing where things are. Losing that single zone would cut out two of our replicas, which means at least 20 percent of our partitions are at high risk. So we don't want to assign it like this; we would rather assign those partitions somewhere else. But these other zones down here are already at capacity, and every time we take partitions out of the top we have to assign them somewhere else, so we end up filling them into the ones on the bottom. In fact, the more partitions we take out of the larger, fatter zone, the more go into the other ones that didn't want to hold them. And you can actually calculate how much of this you need to do: when we're done, we know we've assigned one whole replica to the top domain, and on the lower domains there are two replicanths left.
We have to split those two replicanths among the three lower domains. We know they only wanted to hold 0.6 each, and we've pushed that up to two thirds, 0.666 repeating. So it turns out the amount of extra partitions we have to assign into the failure domains down on the bottom is about 11 percent more, somewhere between 11 and 12 percent. That much overload allows this ring to have full dispersion, and that's quite a lot. It's saying that if we allowed every single device in the ring to hold up to 11 percent more partitions than it would want by weight, and then told the rebalancing algorithm to go at it, it would give us a fully dispersed ring. But it does that at the cost of overloading the devices in the smaller failure domains by as much as 11 percent, which can be quite a lot.

Yeah, so the overload factor was introduced fairly recently, one or two releases ago I think, so it's not available if you're using an older OpenStack Swift version. But it's a good idea to use it. Don't use a value that is too high, because then your drives fill up; if you don't use enough, you might run into the problems you just saw, and if there's a disaster, well, hopefully you can recover somehow. Using a value of about 10 percent is probably fine in most use cases, maybe even 5 percent, but 10 percent is really a good way to balance this out.

All right, so we talked about partitions, and now we need to talk about partition power, because when you create a ring you set a value called the partition power. Why is the partition power required, and what does it tell you? We need to balance some unknown amount of incoming data. You have objects; some of them are zero bytes in size, others are five gigabytes, and you don't know that in advance, yet we just store them on disks. So some partitions might use, say, ten gigabytes for two five-gigabyte objects, while other partitions use just a few bytes on disk. What you want is to assign multiple partitions per disk so that, at the end of the day, it balances out: with, say, a thousand partitions, the smaller ones average out against the bigger ones when you distribute them across the cluster.

Let me give you an example. Assume you have a single partition per disk, and you put randomly sized data onto this cluster. You will see that some of your disks are already filled to 100 percent while others are only at maybe 78 percent, with an average of about 86 percent. It's not very well balanced. So let's increase the number of partitions per disk, say to 10 partitions.
That's much better already. Some disks are again at 100 percent, because the cluster is already full, at least for those disks, but now you have an average of 95 percent per disk, which is much better than before: you're not throwing away unused space on some of your disks. If you go to 100 partitions per disk, you'll see that it balances out very well; the average is about 99 percent, so there's very little difference between all the disks in your cluster. Now you might get the idea that it would be good to just keep increasing this number, to use a million partitions per disk. That's actually a bad idea, because replication time goes up; it would take weeks if not months. So you need to find a number of partitions per disk that balances things out and at the same time gives you reasonable replication times.

The problem here is that the partition power is fixed: you set it once and you can't change it afterwards. If you add more disks to your cluster, the existing, fixed number of partitions is spread across the bigger cluster, and you have fewer partitions per disk. A good approach is to choose a partition power that gives you at most, say, a thousand or a few thousand partitions per disk based on today's needs. Don't shoot for the stars and imagine you're going to be the next Facebook storing the next few dozen exabytes if you're not sure about that. If you plan that way, it's also very unlikely you'll need a partition power much larger than 20, that is, 2 to the power of 20; 32 is the hard maximum, and you definitely don't want 32. Absolutely not. If you're not sure about selecting a partition power, either join us on the IRC channel and ask, or just use Clay's tool: Clay has written a small Python script where you enter the number of disks you have and it gives you an estimate of a good value for you.

My observation has just been that people tend to use their cluster before they add capacity to it, and if you're planning on getting use out of your cluster you need to plan for today. Luckily, the way the exponential growth works out and the way the balancing works out, if you just use a sane number of partitions for today, you'll be able to scale that cluster one or two orders of magnitude before you have any sort of trouble with the balancing issues Christian was talking about. However, sometimes you become a unicorn, right? You're the next Facebook, the next Flickr, you're storing billions of cat pictures and you have skyrocketing growth. In that case, remember that the partition power is fixed, but you still want a balanced cluster. We're working on that: we're working on a way to increase the partition power. There's a patch available; have a look if you're interested. But keep in mind that it will probably never be possible to decrease the partition power, at least not without significant and serious downtime; it's only possible, or hopefully becoming possible, to increase it. So it shouldn't be a big problem if you select a sane partition power today and then get much bigger. Yeah, Christian's got you covered, hopefully.
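In the spirit of that rule of thumb, here is a rough sizing helper. It's a hypothetical sketch, not Clay's actual script: it picks the largest partition power that keeps each of today's disks at or below a target number of partitions.

```python
import math

def suggest_part_power(num_disks, replicas=3, max_parts_per_disk=1000):
    # Largest partition power that keeps each of today's disks at or below the
    # target number of partitions; clamped to Swift's hard maximum of 32.
    p = int(math.log2(num_disks * max_parts_per_disk / replicas))
    return max(1, min(p, 32))

# The wrap-up example that follows: 4 servers x 8 disks, 3 replicas.
p = suggest_part_power(32)
print(p, 2 ** p * 3 / 32.0)   # -> 13, i.e. 768 partitions per disk today
```

With a slightly higher target per disk you land on 14, which is what the wrap-up example below uses (around 1,500 partitions per disk); either choice leaves one to two orders of magnitude of growth before balance starts to suffer.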
All right, that was a lot of information, so let's do a quick wrap-up and revisit the keywords from the last 30 or 35 minutes. What does a good cluster look like? Let's assume you have four servers, and each has eight disks of four terabytes. You assign a weight to them, 4000 here; it's not a random value, it's just a value, but it matches the capacity of the disks in the cluster. These four servers need some partition power; in this case I choose 14, which gives me about 16,000 partitions. With three replicas and a total of 32 disks in my cluster, that ends up at about 1,500 partitions per disk, which is fine for today and gives enough room to grow the cluster to maybe 15, 20, or even 30 times its current size; that should work out fine. Now, your servers are probably not all located in a single rack; you have multiple racks, so you put them into different zones, and those are your failure domains: two failure domains here, for example one per rack, each with its own switch.

Okay, time passes and you add more servers. These new servers only have six disks each, but they're newer disks, five terabytes in size, so you use a different weight to tell the ring algorithm that these are a bit bigger than the others. And these new servers are in a new data center, so you use the concept of regions to tell the algorithm: these are my disks and servers in my main data center, while these are in my second data center.

If you sum up all these values, you'll see that the two zones on the left will each hold about 64 terabytes of data, while region two can only hold 60 terabytes. Now I want to store three replicas, and that doesn't match; that's the problem we talked about earlier. I can't find a weight assignment that guarantees at least one replica of every object in region two, at least not under the strict constraints: I would actually need 62.66 terabytes of space in each of these regions and zones to distribute the data equally across my cluster. That means I need to use an overload factor, and the ring-builder tool will tell you the overload factor that would fix the problem; in this case it would be about 4.4 to 4.5 percent. As Clay said earlier, ten percent is probably fine, and it covers this case.

In fact, if you're following the advice we've given and you create your new rings with an overload of about ten percent, the ring will automatically use just the 4.5 percent. It only takes advantage of as much overload as it needs; it's striving to optimize for balance, and it will optimize perfectly for balance unless you allow it to over-assign your devices a little bit. In this case you're going to see a little bit of extra weight on the devices in region two, maybe as much as an extra five percent of variance beyond the little bit of variance you normally get just between parts in general. All right, that should be fine, right? Yeah, that's great.
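For completeness, here is roughly what that wrap-up topology looks like driven through the RingBuilder API, which is the same code the swift-ring-builder CLI uses under the hood. Treat it as a sketch: the IPs, port, and device names are invented, and the exact API surface can vary a little between Swift versions.

```python
from swift.common.ring import RingBuilder

builder = RingBuilder(14, 3, 24)   # part_power=14, replicas=3, min_part_hours=24
builder.overload = 0.1             # allow up to 10% overload where needed

# Region 1: four servers with eight 4 TB disks each, split across two zones (racks).
for server in range(4):
    for disk in range(8):
        builder.add_dev({'region': 1, 'zone': 1 + server // 2,
                         'ip': '10.0.1.%d' % (server + 1), 'port': 6200,
                         'device': 'd%d' % disk, 'weight': 4000})

# Region 2, added later: two servers with six 5 TB disks each in a new data center.
for server in range(2):
    for disk in range(6):
        builder.add_dev({'region': 2, 'zone': 1,
                         'ip': '10.0.2.%d' % (server + 1), 'port': 6200,
                         'device': 'd%d' % disk, 'weight': 5000})

builder.rebalance()
print(builder.get_balance(), builder.dispersion)

# Region 2 holds 2 * 6 * 5 = 60 TB, but one replica's worth of the total is
# (128 + 60) / 3 = 62.67 TB, so roughly 62.67 / 60 - 1 = 4.4% overload is
# needed; with overload set to 10% the builder only uses the headroom it needs.
```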
All right, I can't believe we managed it: 42 minutes, we just barely made it. Hopefully you have some questions; we blasted through a lot. We'll make sure this gets up online and the slides get posted so you can look at it later. Do any of the Swift contributors have questions or want to make any corrections? Actually, Clay, we forgot something. What's that? Composite rings; we didn't talk about composite rings. Yeah, sure. Composite rings are something that one of our contributors from NTT is working on, and Sam from SwiftStack has been looking at it as well. We're trying to give you more capabilities in the way you manage global clusters, by having multiple rings and rings that are composed of other rings. So there'll be more to talk about next time. Cool, sounds great. Thanks.