Well, thank you everybody for attending this talk about using OpenStack for extreme data durability. I'm here together with Florent Flament, and I'll let him introduce himself.

So hi, my name is Florent Flament. I'm working at Cloudwatt on OpenStack; basically, I'm trying to keep OpenStack working at Cloudwatt. Here are my details if you want to contact me.

Yeah, my name is Christian Schwede. I'm a developer working at eNovance, Red Hat, mostly on Swift automation, testing, and developer tools. I'm also one of the Swift core developers, so there's a very good chance to meet me on the OpenStack Swift IRC channel, and of course you can reach me via my email address and my Twitter handle.

So before talking about durability and availability in Swift, let's first have a small look into the architecture of Swift. Swift is an object store; that means it's not a block storage and it's not a file system. Most of the time, or all of the time, you access your data in Swift through a URL, where you send a PUT command to store data, or try to get data back with a GET command. These operations are executed against a proxy node. This is one part of a Swift cluster: the proxy node is the part your users are talking to, and in the back end you have some storage nodes that actually store your data on disks.

Now, if you start with a very small cluster, let's say only two nodes, and you configure Swift to have three replicas of every object, then these objects land on different disks in your cluster and on different storage nodes: in this case on three different disks across two storage nodes. But well, that's not the recommended way to operate a Swift cluster; the documentation says to start with at least five different storage nodes. So let's add a few more nodes. If you have more nodes than replicas configured, your data ends up on different storage nodes. Swift always tries to place the replicas inside your cluster as far apart from each other as possible: if you have more disks than replicas configured, they end up on different disks; if you have more storage nodes than replicas, they end up on different storage nodes.

Now, there are what I would call failure domains in your data center, for example your network uplink, your power supply, or network switches. So you might want to ensure that each copy of your objects ends up in a different area of your data center. To ensure this there's another concept, and this is called zones. For example, you can group different storage nodes into different zones. Let's say you have three racks with a bunch of storage servers in each rack: you can ensure that when you put an object into Swift, all three replicas end up in different racks of your Swift cluster.
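Before we go deeper, here is what that client view looks like in practice: a minimal sketch using the python-swiftclient library. The auth endpoint, credentials, and container and object names are placeholders, not details from the talk.

```python
# Minimal sketch of the client view: a PUT and a GET against the proxy node
# via python-swiftclient. Endpoint, credentials, and names are placeholders.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://keystone.example.com/v2.0",  # hypothetical auth endpoint
    user="tenant:demo",
    key="secret",
    auth_version="2",
)

# PUT: upload an object. The proxy only reports success once a quorum of
# replicas (two out of three by default) has been written.
conn.put_container("photos")
conn.put_object("photos", "cat.jpg", contents=b"...image bytes...")

# GET: read it back through the same URL-based interface.
headers, body = conn.get_object("photos", "cat.jpg")
print(headers["etag"], len(body))
```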
Now, what you also have to keep in mind is that each time you put an object into Swift, Swift only confirms success if the object has been written at least a quorum number of times; in the case of three replicas, that means it only confirms the operation after two successful write operations.

Now, hopefully you have a really large Swift cluster, and you also want to ensure that your data, or a part of your data, ends up in a different data center. You can extend this concept and add regions, grouping a bunch of servers and zones into regions. In this case, where we have two thirds of the storage nodes in one region and one third of the storage nodes in another region, you end up with two copies in your first data center and a third copy in your second data center. And of course you're not limited to two regions; you can have three or five regions, for example one in the United States, maybe one in Europe, and another one in Asia.

But Swift needs to know where to place the data, where to read it and where to write it, and this is done with a ring, a hash ring. Florent will tell us a little bit more about the ring.

Okay, thanks Christian. So the ring is something Swift developers are always talking about, because it's basically one of the main components of Swift. It's the map of the data you have in your cluster. You have one ring file per type of data in Swift: for instance, one file for accounts, one for containers, and one for objects. These files map each copy of each piece of data onto physical devices through a mechanism that we call partitions.

So what's a partition? A partition is a number that is computed from an object's name, including the container and account names; it's what we call the full path of the object. The partition number is computed from a fraction of the MD5 hash of this full path, and the number of bits considered and interpreted by Swift depends on your cluster.
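To make the partition mechanism concrete, here is a simplified sketch. Real Swift additionally salts the hashed path with a cluster-wide prefix and suffix from swift.conf, which this toy version skips, and the table values below are made up for illustration.

```python
# Simplified sketch of how Swift derives a partition from an object's full
# path, and how the ring maps (replica, partition) to a device.
import struct
from hashlib import md5

PART_POWER = 3                 # 2**3 = 8 partitions, as in the upcoming example
PART_SHIFT = 32 - PART_POWER

def get_partition(account, container, obj):
    """Top PART_POWER bits of the MD5 of the object's full path."""
    digest = md5(f"/{account}/{container}/{obj}".encode()).digest()
    return struct.unpack_from(">I", digest)[0] >> PART_SHIFT

# Toy assignment table: one row per replica, one column per partition,
# values are device ids (numbers invented for illustration).
replica2part2dev = [
    [0, 1, 2, 3, 4, 5, 0, 1],   # replica 0
    [2, 3, 4, 5, 0, 1, 2, 3],   # replica 1
    [4, 5, 0, 1, 2, 3, 4, 5],   # replica 2
]
# Toy devices table: everything a node needs to reach each device.
devices = {i: {"ip": f"192.168.0.{11 + i}", "port": 6000, "device": f"sd{c}1"}
           for i, c in enumerate("bcdefg")}

part = get_partition("AUTH_demo", "photos", "cat.jpg")
for replica, row in enumerate(replica2part2dev):
    print(f"replica {replica} of partition {part} -> {devices[row[part]]}")
```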
It's a setting that you choose during the creation of the ring, and this setting is quite important, because once you've specified the number of bits that will be interpreted, you cannot update it. There are recommendations in the Swift operations guide, in the OpenStack documentation, that will help you choose the best number.

So basically, there are three elements in a ring file. There is one big table which associates each partition, for each copy, with a device ID. Then you have a second table, the devices table, which associates each device ID with all the information a Swift node requires in order to reach this device. And then you have the number of bits to consider in the hash of the object. There is a little tool that I linked there, called SwiftSense, that allows you to visualize the ring; that's the common representation, although I would rather represent this structure as a table.

So here is a little example to make it clear what the ring is. This is an example of a cluster with eight partitions and three replicas. At the top left you have the two-dimensional table that allows you to find the device ID for an object. For instance, if you have an object whose partition number is two, you select column number two, and if you want to know where replica number one is, which is copy number one, you go to the intersection of row number one and column number two, and you get device ID three. So this is a device where the object is stored, at least the first replica. Then you go to the devices table, and here you see that this device is located on the host whose IP address is 192.168.0.11, the port to reach the service is 6000, and the device name is sdc1. And finally, the last piece of information in the ring file is the bit count, the number of bits to consider in the hash of the object, which is three in this case, because you have eight partitions, which is two to the power of three.

Still about the ring: there is a new feature in Swift, which landed in Swift 2.0, called storage policies. It allows operators to let users choose between different storage strategies. This is made possible with the use of several sets of ring files, one set of ring files per storage strategy. For instance, you can have a basic storage policy where each object has two copies stored on spinning disks in a single data center; then you can have a strong storage policy with three copies of each object in three different data centers; and then you can have, for instance, a fast policy where all data is stored on SSD devices.
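As a hedged illustration of how a user would pick one of these strategies: policies are selected per container, at creation time, through the X-Storage-Policy header. The policy names below are just the hypothetical examples from the talk; a real cluster defines its own names in swift.conf.

```python
# Selecting a storage policy when creating a container. The policy names
# ("strong", "fast") are the hypothetical examples from the talk.
from swiftclient.client import Connection

conn = Connection(authurl="https://keystone.example.com/v2.0",  # placeholder
                  user="tenant:demo", key="secret", auth_version="2")

# Everything written to this container lands on the SSD-backed "fast" rings.
conn.put_container("thumbnails", headers={"X-Storage-Policy": "fast"})

# Everything here gets three copies across three data centers.
conn.put_container("archives", headers={"X-Storage-Policy": "strong"})
```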
So I'll let Christian talk about availability and durability. Okay, thanks very much. You're welcome.

So, availability and durability: what do I mean by this? With durability I actually mean that you don't lose an object; with availability I mean that you're able to access that object. Now, for example, if you power off your storage nodes, your objects are still there, but they are not accessible.

So let's first have a look at durability inside Swift. What you see in your data center during a given year is a lot of disk failures. From my experience, and from others, you will most likely see two to five percent of your disks fail in a given year, which is quite a lot, and doesn't match the mean time between failures given by hard disk vendors, but it's what we actually see in operations. You also have a very good chance of getting a read error during operations of a Swift cluster, because your disk is not able to recover a corrupted sector, for example. Vendors most likely specify a probability of ten to the power of minus fifteen per bit read, and if you do the calculation, that means if you read about twelve terabytes of data, you will have at least one corruption during read operations.

So let's assume you have three replicas of an object. With a two to five percent probability per year of losing one of the disks the object is stored on, you will at some point have a degraded cluster, at least for this object: you only have two replicas left. Now, Swift has some built-in mechanisms to recover from that. One is the replicators, which recover from the remaining two replicas and recreate the third one. The other is the object auditors, which walk over your objects and check for bit errors; in case an object is corrupted, they move it aside, and the replicators later on recover the missing replica.

Now, if you have only two replicas left, there's also a chance that you get another disk failure. The probability here is much smaller, because you're no longer looking at one year, but only at, let's say, ten or twenty hours: the time until replication normally recreates the third copy. And if you're really unlucky, then you lose yet another disk where your object is stored. Finally, if you continue with that, for example if you don't let the replicators run, you might end up with some data loss.

If you take only these numbers into account and calculate the probability of losing an object, you get a durability in the range of ten to eleven nines in a given year for one object with three replicas. So if you're interested in the numbers and the calculation itself, there's a small website. It looks like this: you can put some numbers in, and you get a number and a graph. I will come back to that in a few seconds. This is public, and you see the link here.
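Here is a rough sketch of the simplified model behind numbers like these; the linked calculator is more careful, and this only conveys the shape of the computation. The idea: an object is lost only if, after one of its disks fails, the disks holding the remaining replicas also fail before replication restores the missing copy.

```python
# Crude annualized loss-probability model: first failure can happen any time
# in the year; every remaining replica must then fail within the recovery
# window. Ignores correlated failures and many real-world effects.
import math

HOURS_PER_YEAR = 8760

def loss_probability(afr, recovery_hours, replicas=3):
    """Annualized probability of losing one given object."""
    # Probability that one specific disk fails inside the recovery window.
    p_window = afr * recovery_hours / HOURS_PER_YEAR
    return replicas * afr * p_window ** (replicas - 1)

for afr in (0.02, 0.05):            # the 2-5% annual failure rates seen above
    p = loss_probability(afr, recovery_hours=20)
    print(f"AFR {afr:.0%}, 20 h recovery window: ~{-math.log10(p):.1f} nines")
```

Depending on the assumptions, this lands around nine to ten nines, the same ballpark as the figure quoted above.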
It will take well if you're lucky it will be done maybe in 20 hours But if you have billions of objects with one bite size, it will take even more time So and during this time span You're already at risk to have another disc failure, right? So to keep this low you can Do something different you set the weight of this device to zero in the ring Rebalance the ring and push a new ring now. What now happens is that the remaining discs inside your cluster Each get one or more new partitions But only if a subset of the partitions so each of these discs get only a one new partition the example and You need only to replicate much less data to the remaining disc and this is much faster So and once you have done this of course you can add new discs afterward and Replicate and distribute the data more equal to your remaining discs Now let's have a quick. Well, let's let's think about the durability number. I gave you just two minutes before a Chance of one in ten billion to have data loss Because of a disc failure does that make sense at all? I don't think so Because it's much you're the chance that you actually have Real disaster for example that you lose a complete data center because of fire or well maybe an earthquake a thunderstorm whatever else it's much higher and To get to these numbers you really need to distribute your data across more data centers and As far as possible from each other And as I said earlier, this is done with a concept of regions So for example in this case where you have two data centers and so one data center is a little bit smaller Stores only one sort of the data You can actually lose one of the data centers and you're still able to access your data You're also still able to write new data to the Swift cluster and once the second data center comes back online The data that is now placed. Let's assume in data center number zero region zero is replicated To the second data center because Swift now recognize. Okay. These originals are accessible again This also increases availability. So Network outlinks for example are quite well quite often a little bit This is a wrong term, but let's assume you have five hours on of network outage in a year That decreases your availability for your swift cluster for your data and the swift cluster if you distribute it this across multiple data centers for example, then you have a much better chance to deliver the data to your customers and your users and And of course You may want to Well upgrade your swift cluster or distribute your swift cluster now to two data centers or even more and How do you do this? Flora will tell us. Yeah, thanks Christian. 
And how do you do this? Florent will tell us. Yeah, thanks Christian.

So I will tell you what we did at Cloudwatt in order to take into account what Christian told us about having several data centers. But first, just one word about a peculiarity of Swift: Swift has the good habit of splitting the complex task of storing your data safely into two simple tasks, because there are basically two steps when configuring a Swift cluster. First, you've got a standard tool called swift-ring-builder that allows you to manipulate ring files and actually create them. This tool keeps all the architectural information about your cluster in files, manipulated by the tool, which are called builder files: all the information about the relations between your devices, your nodes, your regions, and your zones. And this tool is in charge of smartly assigning the partitions that we talked about before, as far apart as possible in your cluster. For instance, the different copies of an object will, if possible, be stored in different regions, then in different zones, then on different nodes, and then on different devices. At the end, this tool generates the ring files, which can be checked by an operator; they are flat files, and once this is done, these files that store the map of your data can be uploaded to the Swift nodes. On the other hand, you have the processes running on your Swift cluster that are in charge of ensuring that your data is stored, uncorrupted, at the appropriate location.

So, as Christian told us earlier, if you want to benefit from the durability of Swift, it's better to separate your data into different data centers. At Cloudwatt, we had been running a cluster of nearly one hundred nodes in a single data center for a few years, and we recently decided to split the cluster into two regions, in order to avoid losing data in the full crash of a data center, because of a fire, for instance, or a plane crash.

So in order to do this, we decided to follow a bunch of guidelines, in order to be sure that we wouldn't lose any data of our users, and also to limit the impact on the performance of the cluster, so that our users could keep using our service. To ensure that no data is lost, we wanted to be able to move only one replica at a time at each step, so that the operation of adding a new region is done in separate, small steps: if anything goes wrong, it doesn't impact the whole cluster and all the data. We wanted to be able to check for data corruption during each operation, in order to see if any abnormal amount of corruption appears. We wanted to be able to check that the location of the data is correct at the end of each step. And we wanted to be able to roll back in case of a failure.
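A rough way to verify the first guideline, that at most one replica of any partition moves per step, is to diff the assignment tables of the old and new rings after a rebalance. This peeks at an internal attribute of RingData and assumes a whole-number replica count, so treat it as a debugging sketch rather than a stable interface.

```python
# Check that a rebalance reassigned at most one replica of any partition.
# _replica2part2dev_id is an internal RingData structure; file names are
# placeholders for the pre- and post-rebalance rings.
from swift.common.ring import RingData

old = RingData.load("object.ring.gz.bak")
new = RingData.load("object.ring.gz")

worst = 0
parts = len(old._replica2part2dev_id[0])
for part in range(parts):
    moved = sum(
        1
        for old_row, new_row in zip(old._replica2part2dev_id,
                                    new._replica2part2dev_id)
        if old_row[part] != new_row[part]
    )
    worst = max(worst, moved)

print(f"at most {worst} replica(s) of any partition reassigned")
assert worst <= 1, "more than one replica of some partition moved!"
```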
And the second aspect is the impact on the performance of the cluster. So we first wanted to get an idea of the load on the cluster across the different days of the week and the different times of the day, in order to be able to run our operations during the periods where the load is lowest; for instance, during the night we have fewer users than during the day. We also wanted to be able to assess the load that would be incurred by the cluster during our operations; for instance, we wanted to know the network bandwidth that would be used at each step. We wanted the steps to be small enough to fit into the time frames where the user load on our cluster is low. And eventually, we wanted to be able to control which nodes our users access, so that users keep reading from lightly loaded nodes while the nodes that are stressed by our operations are avoided.

So, what is nice is that all the items shown in blue have been implemented in Swift for a while. For instance, moving only one replica at a time is something that is already integrated into the swift-ring-builder rebalance command: if we do only one rebalance operation at each step, we should move at most one copy of each object at a time, so we are sure that two copies of each object are untouched if something goes wrong. Checking for data corruption is done by a process called the Swift object auditor, which continuously computes the hash of each object on the nodes; if the hash doesn't match the hash that was recorded when the object was created, the object is quarantined and considered deleted, and it is replaced by Swift's replication mechanism. Then, to check data location, there is a tool shipped with Swift called swift-dispersion-report: it lets you check the percentage of your data that is at the appropriate location. If you push a new ring, for instance, you can follow the state of your cluster: you will see that 80% of the data is at the correct location, and when you relaunch the tool several hours later, you will see that percentage increase, to 90% for instance. Rolling back is something made easy by Swift, because the ring files copied onto the Swift nodes are reloaded automatically by the processes, without any need for a restart: if we push new ring files and something goes wrong, we just push the former ring files, and the data goes back to its former location. And eventually, to control which Swift nodes are reached by our users, there is a mechanism called read and write affinities that we can tune in the Swift configuration files.
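As a toy model of what read affinity does: the proxy sorts the nodes holding a partition by an operator-defined priority before reading, driven by a proxy-server.conf setting such as read_affinity = r1=100, r2=200, where lower values win. The sketch below only illustrates the idea; it is not the actual proxy code.

```python
# Simplified model of read affinity: sort a partition's primary nodes by a
# per-region priority so the local data center is tried first.
nodes = [
    {"region": 1, "zone": 1, "ip": "10.0.1.11"},
    {"region": 2, "zone": 1, "ip": "10.0.2.11"},
    {"region": 1, "zone": 2, "ip": "10.0.1.21"},
]

priorities = {1: 100, 2: 200}   # prefer region 1 (the local data center)

def read_order(node):
    # Unlisted regions sort last, mirroring how unmatched nodes are deprioritized.
    return priorities.get(node["region"], float("inf"))

for node in sorted(nodes, key=read_order):
    print(node["ip"])           # local-region nodes come first
```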
The new thing is that since Swift 2.2, which landed just last month actually, we have the ability to move the data to a new region smoothly, step by step. This was impossible before Swift 2.2.

So, to wrap up this part about adding a new region: we want to be able to move our data smoothly to the new region when we add it, and this is really possible since Swift 2.2. In the end, if we want at least one copy of each piece of data in the new data center, the new region must have a total weight of at least one third, because the weight is the mechanism used in Swift to tell it how much data we want in a given region, node, or device. So as a complete example: to add a new region, we can add new devices and new nodes to the new region with a very low weight for each device, so that only one partition is assigned to each device in the new region. Then, in subsequent steps, we can increase the device weights in the new region by steps of five percent, for instance, in order to move five percent of the data of the whole cluster to the new region each time, and so on and so forth. And so we can increase the weight of the new region progressively until it reaches one third of the total weight of the cluster. I give more details about how to compute the weights in order to have a given percentage of your data in the new region in an article whose link is there, and I also provide some example scripts to compute these weights.
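Here is a minimal sketch of the weight arithmetic behind those steps, under the simplifying assumption that adding weight equal to five percent of the current total moves roughly five percent of the data; the linked article has the exact formulas.

```python
# Grow the new region's share of the total weight in small increments until
# it holds one third, i.e. one replica out of three.
def weight_schedule(old_region_weight, step_fraction=0.05, target_share=1 / 3):
    """Yield the new region's total weight after each step."""
    new_weight = 0.0
    while True:
        total = old_region_weight + new_weight
        if new_weight / total >= target_share:
            return
        # Each step adds weight worth step_fraction of the whole cluster,
        # moving roughly that fraction of the data (approximation).
        new_weight += step_fraction * total
        yield new_weight

# Example: the existing region has a total weight of 6000.
for i, w in enumerate(weight_schedule(6000), start=1):
    print(f"step {i}: new region weight = {w:.0f} "
          f"({w / (6000 + w):.0%} of cluster)")
```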
Okay, Christian, I'll let you provide an outlook and summarize. Yeah, thanks, you're welcome. So, a quick outlook on what's coming up next in Swift, and a little summary of this talk.

One of the hot topics in the Swift developer community is erasure coding, and the people over at SwiftStack and elsewhere are really pushing hard for this feature; well, it's coming real soon now, hopefully. So what's the idea of erasure coding? Instead of storing, for example, three replicas of each object, you apply an erasure code to the object, and this means that you split your object into multiple smaller fragments, for example fourteen, store these on different disks and nodes, and so on. You are then able to recover, to rebuild an object, from only a subset of the fragments, for example ten, if you configured a 10-of-14 scheme. So what does it mean? It means that you can tolerate the loss of four fragments and still recover, so you have a higher durability for your objects; you can tolerate more losses in your cluster. But the important part is that you only have roughly 40 percent overhead in this example (four extra fragments on top of ten data fragments), compared to 200 percent if you have three replicas. Which means your cluster is much cheaper, or the operations in your cluster are much cheaper.

For the durability calculation, what I want to do is a more detailed calculation that takes into account the number of disks, servers, and partitions, for example, and adds the math for erasure coding. This is also a community effort; we started the discussion on this about a month ago at the last Swift hackathon, with people involved from SwiftStack, IBM, Seagate, Red Hat, well, most of the community. We want to do an ad hoc session at the end of the week, so if you're interested in this, just drop me a line by email. And maybe it's even possible to include a sample calculator in the Swift documentation itself, which would be great, I think.

Just to wrap up this talk: Swift gives you very high availability, even if large parts of your cluster are failing. What I forgot to mention earlier is that, using the zone concept, you can also avoid data loss because of, for example, upgrades of your cluster. So if you're upgrading your cluster, let's say a kernel upgrade or a firmware upgrade of your disk controllers or whatever else, do it zone by zone, and between each zone, watch the cluster for a few hours and see if everything works fine. If something is failing, then you can stop your upgrade and, well, remove the zone or whatever you need to do, and your remaining zones are not affected by the upgrade at that point in time. The automatic failure-correction mechanisms inside Swift also ensure a high durability of your data. And depending on your cluster configuration, you can even exceed industry standards: well, let's say you have some data that is a billion dollars worth; maybe you store it in five different data centers at the same time. If you have the money for it, go for it. The latest Swift release, the Juno release, gives you, as Florent told us just a few minutes ago, a smoother and more predictable way to upgrade and extend your Swift cluster in very fine-grained steps when adding new zones or regions. The storage policies also allow you to give your users more options to store data. So, for example, you don't want to run your complete cluster with four replicas because it's too expensive, but a subset of the data needs four replicas: then you provide a storage policy with four replicas, and your users can select the storage policy they need for their data. And erasure coding, finally, will increase the durability inside your cluster even more, while at the same time lowering your operational costs, which is a great feature. With these words, I'm done. So thank you very much for attending the talk, and if you have any questions, feel free to ask now.

No questions? There's a question. Could you please use a microphone?

Thanks. So actually, I have two questions. The first one has to do with reducing the weight of a failed drive to zero and then pushing the ring back. From your experience, does that make sense? It seems to me like a lot of pushing of rings to do. So does it make sense in large deployments, where two to five percent of the disks are failing?

Well, from my experience, pushing a new ring is a day-to-day operation, because you always have disk failures, and from time to time you're upgrading your clusters, you're changing IP addresses of nodes, stuff like that. So for the clusters I'm aware of, most of the time people have some automated mechanisms for operations like this, to push a new ring and so on. From my perspective, it's fine to push new rings even if you only have a failed device.
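Such automation is often just a small push script. A purely hypothetical sketch, with made-up hostnames and paths: copy the rebalanced rings to every node; as mentioned earlier, Swift's daemons reload a changed ring on their own, so nothing needs restarting.

```python
# Hypothetical ring-push script: distribute new ring files to all nodes.
import subprocess

NODES = ["storage-01", "storage-02", "storage-03"]   # hypothetical hosts
RINGS = ["account.ring.gz", "container.ring.gz", "object.ring.gz"]

for node in NODES:
    for ring in RINGS:
        subprocess.run(
            ["scp", ring, f"{node}:/etc/swift/{ring}"],
            check=True,   # stop the rollout if a copy fails
        )
        print(f"pushed {ring} to {node}")
```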
Okay, so the second one: can you elaborate a little bit more on the new feature in 2.2 that allows you to do the more fine-grained control?

Yeah. So, before 2.2, when you had, for example, a single region and you added a new region, Swift always tried very hard to ensure that you have at least one copy in the new region. Which actually means: if you have a cluster with only a single region before 2.2 and you add a new region, one third of the data will need to be rebalanced. It's still true that one third of the data needs to be rebalanced, but before it was a single step, and you had no control over the amount of data moved at a given point in time. With 2.2, let's say you add a new region, and you add, let's say, a hundred disks in the new region: you set the weight of the disks to zero when you add them to the new region. There's no data movement at that point in time, because the swift-ring-builder now knows there's not enough weight in the new region to hold all the data. Now you increase the weight of the devices in your new region, for example to one percent of the total weight in that region, and the swift-ring-builder will only move a subset of the partitions to the new region, which actually means that you only move a little bit of data, and not all the data that needs to be stored in that region at once. And finally, in the example from before, where you have two data centers, one data center with one third of the data and another data center with two thirds of the data, you need to ensure that the total weights in the regions end up at one third and two thirds of the total weight in the cluster. So by using the weight parameter with the ring builder, you're now able to control the data flow much better than before. Yeah, you're welcome.

There's another question. So, hello, my question is about adding a new region. Why can't you just add the new region and tell Swift: please sync it slowly? Why do you need to add five percent, then ten percent, if you need about fifty percent at the end?

Well, it's mainly because the swift-ring-builder is actually just a small tool that operates on your ring. Of course, we could add more features, like telling the swift-ring-builder to move only some part of the data without calculating the amount of data yourself, but you need to be in control of which rings are pushed to your cluster, so I don't think it should be done automatically. Well, do you want to add something? Yes.

Just a typical case that I think about: if you have petabytes of data and you move one third of your data in one go, it will take several days, or maybe weeks, to move all your data, and during this period your cluster will be heavily loaded and your users will be impacted. By splitting the movement, the operation, into small steps, we can do it, for instance, a hundred gigabytes at a time, which can be done during the night, and you won't impact your users as much.

Maybe we could just specify a speed for syncing data? It would be easier, I think.

This is an option too, and it has its pros and cons. One thing we wanted was also to be able to roll back: if at a given step we have a failure, or devices that are not good, or a part of our network that doesn't work, we want to be able to roll back only one step, and not to cancel everything and redo it again.
So, if you're limiting the replication traffic at the network level or at the storage level, for example, that's fine when you add a new region, because you can just say, okay, let's replicate at ten megabytes per second or whatever else. But the problem at that point is: if you have actual failures in your cluster, you want to recover from these failures as fast as possible, so you would need a concept to split the traffic. Instead of that, we keep the replication rate high and control the amount of data that is moved, for example when you add new regions or zones, with the ring files.

Yes, but if a lot of disks fail while the speed is turned down, recovery will be slow for customers until it has synced; so we would need to raise the speed again, not only...

Yeah, sure, but now you have more control points. And of course, you can still limit your replication traffic on the back-end network, because we do the same thing. Okay, thank you. Okay, thanks.

Do your durability calculations reflect the probability of losing a specific object, or of losing some object in the cluster?

A specific object. Because if you think about this, let's think about a cluster that stores fifty billion objects: the expected number of objects lost in a given year will be one, or even bigger than one. But to compare with other existing public cloud vendors, they always look at one object: what's the probability of losing this particular object? As for the probability that you lose at least one out of all the objects in your cluster: the bigger the cluster gets, at one point in time you will lose at least one object per year, or even more. But that doesn't mean the probability for a given object is increasing; it's just the probability for the whole set of objects inside your cluster.
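The arithmetic behind that distinction, as a toy example: the per-object loss probability stays fixed, while the expected number of lost objects, and the chance of losing some object, grows with the object count (assuming independent losses).

```python
# Per-object durability vs. cluster-wide expected losses.
p_loss = 1e-10                      # per-object annual loss probability

for n_objects in (1e6, 1e9, 50e9):
    expected_losses = n_objects * p_loss
    # P(at least one loss) = 1 - (1 - p)^n, assuming independence.
    p_any = 1 - (1 - p_loss) ** n_objects
    print(f"{n_objects:.0e} objects: expect {expected_losses:.3f} losses/yr, "
          f"P(any loss) = {p_any:.1%}")
```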
Have you done anything to document or calculate how that probability changes with the number of objects in a cluster? Or is that planned?

Yes, but it's in progress. It's something we have in mind for the next steps, and maybe we'll integrate it into the Swift documentation itself. Thank you. Yeah.

Just another quick durability question. A lot of research has indicated that there's a pretty strong correlation between disk failures, whether it's manufacturing lots or any number of reasons. I'm just curious whether you included those kinds of correlated failure conditions in your calculations. Our experience is that those have a pretty high impact, and your durability numbers seem a little high.

No, not in this calculation. But from my experience: well, if you create a new Swift cluster from scratch, of course it's quite likely that you use the same kind of disks on all your storage nodes. But if you grow a Swift cluster over time, you add new storage nodes with different kinds of disks and so on. Of course, you can also start with a Swift cluster that involves different vendors to lower this problem. At the same time, if you look into publications from large storage operators, for example from Google or from Backblaze, you will see that this number of two to five percent disk failures per year is actually quite a good and reasonable number in day-to-day operations. This is nothing that correlates with the numbers given by disk vendors, because a disk vendor will give you, for example, a mean time between failures of one million hours. One million hours! A year has 8,760 hours, so it's unlikely that you'd see the first failure within a hundred years. So two to five percent disk failures in a year is quite a reasonable number, from my experience.

Yeah, I was mainly asking about the risk, once you've lost one disk, of losing a second disk or a third disk.

So, one of the things about losing disks in the probability calculation, which is a simplified model, actually: if you get to a number of ten nines, or even seven nines maybe, in the calculation, it's unlikely that you have data loss because of disk failures alone. It's much more likely that you have some other kind of disaster in your data center: as I said, fire or whatever else. So you need to take care of that too. It doesn't make sense to optimize only for disk failures; you need to take other failures into account as well at that point.

Okay, okay. Yeah, thank you very much.