Okay, it's time, let's start. Thank you for joining. I'm Romain from OVH. I've been working at OVH for the last three years on our Swift product. Today I'm going to share with you our experience of migrating large clusters from a replica policy to an erasure coding policy.

First of all, a quick overview of what we do with Swift. In December 2011 we launched an offer for the general public called hubiC. It's a cloud storage solution with applications for mobile and desktop and also a web interface. At that time it was not running on Swift, and we quickly found out it would not scale enough. So one year later we migrated all the data to Swift, and since then it has been running smoothly. One year ago we launched a public cloud offer: the instance part based on Nova and Neutron and all that stuff, and also, of course, an object storage based on Swift. It works fine, as always with Swift. And six months ago we decided to convert our hubiC clusters to erasure coding.

To give you a few numbers showing how well it works: you will see a plus sign in front of every number, just because it's growing so fast that I almost lost track of them. We have between 25 and 30 petabytes of user data, more than 10 billion objects, and quite a lot of devices, between 25,000 and 30,000.

Before going into the details of the conversion, I'm going to show you quickly the difference between replication and erasure coding. In a replication policy, each object is written multiple times in your cluster. In this example we have a replication factor of three, so every time you upload an object it is written three times in the cluster, hopefully on three different servers, maybe in three different zones if you have them. This replication factor drives the durability of the object: if we lose two devices in this example, the object is still available. The more replicas you have, the more durable the object is. Having multiple replicas also makes it easier to scale your download bandwidth, because you can access all the replicas of the object in parallel; that can help you build something like a CDN, because you can scale your bandwidth horizontally. The drawback of a replication policy is that everything is written multiple times in the cluster. That is the overhead: in this example, if the user uploads six bytes, 18 bytes are written to your cluster, an overhead of three, the same as the replication factor.

On your cluster, each replica is stored in a file; I show an example of the path at the top. There are two important pieces of information in the path of your file. The first is the hash, which is computed from the object URL; the partition and the suffix that you can see right before the hash are extracted from the hash itself. The second important piece of information is the timestamp: the date the object was uploaded to the cluster. It is set by the proxy during the upload; the user cannot set it. It's very important, because the whole eventual consistency model of Swift is based on it. If for any reason you end up with different versions of your object in your cluster, after a network split for example, Swift will use this timestamp to pick the correct version of the object, which is the most recent one.
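To make that path concrete, here is a minimal sketch of how Swift derives it, in the spirit of swift.common.utils.hash_path and the ring. The prefix/suffix values and part_power below are placeholders, and the partition arithmetic is a simplification of what the ring actually does with the hash:

```python
from hashlib import md5

HASH_PATH_PREFIX = b''        # placeholder; set in /etc/swift/swift.conf
HASH_PATH_SUFFIX = b'secret'  # placeholder; set in /etc/swift/swift.conf

def hash_path(account, container, obj):
    # md5 over the salted object URL, like swift.common.utils.hash_path
    path = '/%s/%s/%s' % (account, container, obj)
    return md5(HASH_PATH_PREFIX + path.encode() + HASH_PATH_SUFFIX).hexdigest()

def object_path(device, part_power, account, container, obj, timestamp):
    h = hash_path(account, container, obj)
    partition = int(h, 16) >> (128 - part_power)  # top bits of the hash
    suffix = h[-3:]                               # last 3 hex digits
    return '/srv/node/%s/objects/%d/%s/%s/%s.data' % (
        device, partition, suffix, h, timestamp)

print(object_path('sdb1', 18, 'AUTH_test', 'photos', 'cat.jpg',
                  '1472940619.12345'))
```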
Erasure coding is a bit different: your object is not written multiple times in the cluster, but split into fragments, and some parity fragments are added. In this example, my object, which is still six bytes, is split into three data fragments, and one parity fragment is added. This means you can control the overhead of your policy. With a replication factor of three, you have an overhead of three; here, we only have an overhead of 1.3. But durability is not very good with this configuration, because if we lose two devices the object is not available anymore. With erasure coding you can choose the numbers: you can, for example, have 10 data fragments and two parity fragments. This gives you an overhead of 1.2, and a much better durability, almost the same as with three replicas, because if you lose two fragments you can still access your object.

So this looks good: you can control the durability, you can control the overhead. But it's not perfect. Every time you access your object, you have to fetch the fragments to rebuild it, so you will not be able to scale the way you can with replicas; you cannot have many parallel downloads of the same object. Also, all the computation, splitting the object into fragments and calculating the parity fragments, is done on the proxy, so there is some extra CPU consumption on the proxy. It's not a lot, but it exists, and you have to take it into account when you prepare your infrastructure.

If you look at the path, it's the same on disk: each fragment is stored in a file. The only difference is a new piece of information at the end, the fragment index. Because the object is split in a deterministic way on the proxy, when you fetch the fragments to rebuild the object you have to regroup them in the same order they were split in.

As we just saw, each fragment is unique in the cluster. With a replica policy, when you lose a device you can just copy another replica of the object onto the device you just replaced. You cannot do that with erasure coding: each fragment is unique, so you cannot just copy one fragment in place of another. You have to rebuild the missing fragment. This is why you cannot use rsync anymore with erasure coding, and why there is a new protocol in Swift. Well, new... it has been around for a while; it's called ssync, and it's the protocol used with erasure coding. So instead of the object-replicator we have for replicas, we now have the object-reconstructor. The object-reconstructor does two kinds of jobs. The first is the revert job: moving data to the correct device. For example, if you just did a rebalance, it's the reconstructor that runs revert jobs to move fragments where they should be. The second is the sync job: rebuilding the missing fragments.

So now that everybody in the room is an erasure coding expert, I think we can do some conversion. Let's see how we do that. First of all, what were our requirements for converting our cluster to erasure coding? First, it must be transparent to our users. hubiC is a general public solution, so our customers are not API experts or storage policy experts. They don't know what erasure coding is, and they don't want to know. So it must be transparent to them; they must not notice that we are changing the storage policy.
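To see the fragment mechanics and the overhead arithmetic in action, here is a small sketch with PyECLib, the library Swift's EC policies are built on. The ec_type value depends on which liberasurecode backends you have installed, so treat this as illustrative rather than a recommended configuration:

```python
from pyeclib.ec_iface import ECDriver

# 10 data fragments + 2 parity fragments, the scheme from the slide.
driver = ECDriver(k=10, m=2, ec_type='liberasurecode_rs_vand')

data = b'x' * 1024 * 1024           # a 1 MiB object (one EC segment)
fragments = driver.encode(data)     # k + m = 12 fragments
overhead = len(b''.join(fragments)) / len(data)
print('overhead ~= %.2f' % overhead)  # ~1.2, plus a small per-fragment header

# Any k fragments are enough to rebuild the object: lose two, still decode.
survivors = fragments[:5] + fragments[7:]
assert driver.decode(survivors) == data
```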
Second requirement: the conversion must happen in place. Moving to erasure coding, let's be honest, is for cost reasons, to reduce our costs. So if we had to spawn a new cluster with petabytes of capacity just to copy the data, and then end up with an empty cluster on the other side, it would be a waste of money. We want the conversion to happen inside the same cluster. And last but not least, it must be scalable. The conversion must scale, because we have petabytes of data and billions of objects, and we don't want to wait a few years to see the end of this process. Actually, a few months looked like a nice target.

Replica and erasure coding are just storage policies in Swift. A storage policy is an object feature, but it is declared at the container level. So when we convert objects from replica to erasure coding, we must take care to keep this information in sync. The easy part, which was the first step, was simply to declare the new erasure coding policy and set it as the default in Swift, so all new customers automatically land on erasure coding. That really is the easy part, just a few lines of configuration. After that, we had to convert the existing customers to erasure coding.

As I just said, the storage policy is declared inside the container, so we must take care of updating the container while converting the objects. As of today, there is no API to change the storage policy of a container. So basically, it comes down to running some SQL queries against the SQLite database of the container. I'll give you a small example of the queries; there are some more precautions around them, but it ends up being just updates in the database. Of course, you have to run them on all replicas of your container. And to avoid Swift processes like the auditor and the replicator messing around while you're updating the database, you have to disable them.

One thing with this kind of modification: as you may know, containers are updated by the object servers. When you upload an object, it's the object server that updates the container listing. So we have to take care that, if an upload started before we ran these update queries, the object server does not insert the object with the old storage policy. We added a few lines of code to handle this case too: basically, comparing the storage policy of incoming container updates with the storage policy declared in the container, and fixing it on the fly if they don't match.

Once we've done that, Swift thinks our objects are in erasure coding. But they're not; they're still in the replication policy. Because we cannot convert all the objects instantaneously, we have to maintain access to the old data. We handle this at the proxy level. When a user tries to download an object he uploaded a few days or a few months ago, the proxy first tries to reach the data on the object servers in the erasure coding policy. If the object has not yet been converted, the object server will just say, sorry, I don't have this file here, and return an HTTP 404. The proxy catches this error and simply reruns the request against the replication policy. It translates very easily into code; this is the only code you will see in this presentation. Basically, on the third line you run the normal request, and in the condition underneath, if it's a 404, you rerun it with the replication policy.
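The queries on the slide aren't reproduced in the transcript, but a hedged sketch of the idea, driven from Python's sqlite3, looks roughly like this. The policy indexes, the database path, and the exact table names are assumptions; in recent Swift schemas the policy lives in a storage_policy_index column of both container_stat and the object table:

```python
import sqlite3

EC_POLICY_INDEX = 1  # hypothetical: 0 = replica policy, 1 = the new EC policy

# Run with container services stopped, and on every replica of the DB.
db = sqlite3.connect('/srv/node/sdb1/containers/1234/f00/HASH/HASH.db')
with db:
    # The container's own declared policy...
    db.execute('UPDATE container_stat SET storage_policy_index = ?',
               (EC_POLICY_INDEX,))
    # ...and every object row it lists.
    db.execute('UPDATE object SET storage_policy_index = ?',
               (EC_POLICY_INDEX,))
```

And a minimal sketch of the proxy-side fallback he describes; the function and policy names here are invented for illustration:

```python
def handle_get(req):
    # Try the object where Swift now expects it: the EC policy.
    resp = run_request(req, policy=EC_POLICY)
    if resp.status_int == 404:
        # Not converted yet: fall back to the old replica policy.
        resp = run_request(req, policy=REPLICA_POLICY)
    return resp
```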
There are some more conditions to handle a few corner cases, but this is 90% of the code, actually. What I just showed you is the example for GET, but it's the same logic for DELETE, or for POST to modify metadata.

Now that we've handled the transparency part of our requirements, let's see how we can make it scale. The idea is to run it where the scaling happens in Swift. If we look at the numbers on hubiC, we have around 20 to 30 proxies, around 50 account and container servers, and about 5,000 object servers. So it's clear, just looking at the numbers, that the scaling in Swift happens on the object servers. If you're running a Swift cluster at large scale, you may have had scaling issues with the object-expirer or container-sync. I'm deeply convinced that if they were running closer to the object servers, we wouldn't have these kinds of issues. The community is working on it, so hopefully it will be fixed one way or another. But my point is: the scaling is in the object servers. So this is where we run the process that converts objects from replica to erasure coding.

One problem with running this process on the object server is that the object server has no idea which objects it is storing, which objects are on its devices. Handling that is actually quite simple: we scan the devices and build a map from account and container to object hash. If you remember from the beginning of the presentation, the hash is computed from the account, container and object of the URL, so it's unique, and with just this information you can rebuild the full path of the object on the device. So we create a database mapping containers to hashes, as in the sketch below. We run the scan during the night, because most of our customers are European, so we have low traffic at night, and scanning a device consumes some IO. Once we've done that, we have the mapping: if we decide to convert a container, we can quickly find all the objects we have on the devices of the object server just by looking at this database, and quickly convert the container to erasure coding.

The conversion happens between the object servers themselves, because we run the process on the object servers. To do that, we use a class in the Swift code called the internal client. It makes your process act like a Swift proxy, with the pipeline and almost all the stuff, but you don't need all the features you usually have on your proxy: you don't have to put a telemetry middleware or an authentication middleware, SLO, DLO; you don't have to write them in your configuration. You can keep it to the minimum. So we use this class on the object server, and it acts like a Swift proxy: you do an upload with it and, like a Swift proxy, it writes to the object servers that should handle this object. On the left is the example of an object in replica: the process acts like the Swift proxy and writes it to all the object servers that are supposed to handle this object in erasure coding. That is what happens when you convert one object. But we don't work object by object, we work by container. When we launch the conversion of a container, there are many objects; if it's a big container, there is at least one object on each device. So it looks like this: every object server converts, in every direction, to all the others. This is what makes it very fast, actually, because all the objects get converted together.
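A rough sketch of what that nightly scan could look like, assuming the metadata fits in a single xattr chunk and goes into a local SQLite file. Real Swift object metadata is a pickled dict in the user.swift.metadata extended attribute (possibly split across user.swift.metadata, user.swift.metadata1, ...), which is exactly why the scan must open every file:

```python
import os
import pickle
import sqlite3
import xattr  # pyxattr / xattr package

db = sqlite3.connect('/var/lib/convert/sdb1-map.db')  # hypothetical path
db.execute('CREATE TABLE IF NOT EXISTS map '
           '(account TEXT, container TEXT, hash TEXT)')

for dirpath, dirnames, filenames in os.walk('/srv/node/sdb1/objects'):
    for filename in filenames:
        if not filename.endswith('.data'):
            continue
        raw = xattr.getxattr(os.path.join(dirpath, filename),
                             'user.swift.metadata')
        meta = pickle.loads(raw)
        # meta['name'] is '/account/container/object'
        account, container = meta['name'].split('/', 3)[1:3]
        obj_hash = os.path.basename(dirpath)  # the hash directory
        db.execute('INSERT INTO map VALUES (?, ?, ?)',
                   (account, container, obj_hash))
db.commit()
```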
So except if you have billions of objects in one container, which I guess doesn't work quite right in Swift anyway, you don't have many problems. At that point, we had a working solution. Were we missing anything? Actually, we were. What could possibly go wrong? Well, a few things, let's be honest.

First one: I told you how we maintain access to the data that is not yet converted, and I told you it works for GET, POST and DELETE. The HEAD request is a bit different from the others. Because it's optimized, it doesn't access all the fragments at the same time: it only has to reach one fragment to return the metadata to the customer. So the proxy just tries to reach the first object server, and if it doesn't get the information, it tries the second one, the third, fourth, fifth, et cetera, until it runs out of primary nodes. Then it tries the handoff nodes, 15 more, so it makes 30 requests. During that time, the user is waiting, or maybe not waiting anymore, because it can take a lot of time. Only after those 30 requests does the proxy give up, and then our code falls back to replica. To handle that, we just limited the number of tries the proxy will do. We chose five tries, because we think it's a good compromise: if the proxy can't find any fragment on the first five nodes, we have a big problem, a very big one, and answering HEAD requests won't be our main concern. This works because we don't have that much dispersion in our cluster. So this was the first problem our users reported: latency on HEAD requests.

Second problem: when you start the conversion of a container, especially a big one, all the object servers start converting together. And as they act like proxies, they make requests to check that the container exists, some kind of HEAD request. When you have 20,000 processes reaching out to the same container server, it hurts the container server. The fix is pretty easy: just add the cache middleware to the pipeline of the conversion process. Just adding that, using memcached, fixed this load problem on the account and container servers. We also added a distributed mutex to the solution to control the number of parallel conversions, because converting too many objects at the same time was creating a lot of hiccups on the cluster. So we throttle the number of conversions a bit, just to handle this extra load.

Third problem, and hopefully we spotted this one before our customers did. The arrow in the middle is a timeline. Let's say you have your object in replica, uploaded with timestamp 1. If, before the conversion, your user uploads a new version of the object, it will be uploaded in erasure coding. So you have two versions of your object: one in replica, and the new one, with timestamp 2, in the erasure coding policy. But then your user deletes this object very quickly, so you end up with a tombstone file carrying timestamp 3. At some point, the tombstone gets reclaimed; by default, I think, after one week in a Swift cluster. So if your object was not converted within one week of your user deleting the new version, the conversion process has no way to know there was a newer version in erasure coding, and when the conversion happens, it will recreate the object from the old version the user had a long time ago.
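Conceptually, the HEAD fix is just capping the node iteration, as in the sketch below; in stock Swift the same knob exists as the request_node_count option in proxy-server.conf, whose default of "2 * replicas" is exactly what produced the 30 requests described above:

```python
from itertools import chain, islice

def nodes_to_try(primary_nodes, handoff_nodes, max_tries=5):
    # Instead of walking all 15 primaries plus 15 handoffs (30 requests)
    # before giving up, stop after max_tries nodes so the fallback to the
    # replica policy kicks in quickly for not-yet-converted objects.
    return islice(chain(primary_nodes, handoff_nodes), max_tries)
```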
It can be a bit disturbing for customers to see old versions of objects reappearing in their folders or in their accounts. It's not very complicated to handle: you just have to make sure the conversion happens in less than the reclaim_age you wrote in your configuration. The default is one week, and the conversion of a container is quite fast, so you should not hit this case. But if you set the reclaim age to, I don't know, a few hours, it could be a risk.

Last one: I told you we scan the devices to build a mapping between containers and object hashes. The problem with that is, first, that it's IO intensive. Scanning all the devices every night means a lot of listdirs, and that's a Python call; every file must be opened to read the extended attributes and get the account and container. So it's very IO intensive. And you have to do it every day, because every day there are new uploads and new deletes, and if you do rebalances, a lot of data moves between devices. So it really must be done every day.

There is a solution; it's not implemented yet, but we are working on it: updating the database in real time. We first thought of using a feature of the kernel called eBPF. It's in recent kernels, and it allows a process to hook into some kernel functions. The ones we were interested in were vfs_create, vfs_link and vfs_rename: with these three functions, you can follow the life of a file on every device. The problem with this solution is that it was asynchronous, so the database was sometimes behind the reality of the file system; and if we missed some events from the kernel, the database was completely out of sync. So instead, we are working on extending the DiskFile class in Swift to update the database in real time for each object creation or deletion. The catch with going through DiskFile is that only Swift knows about the updates to the file system. So if you are, for example, using rsync to rebalance your cluster, your database will be out of sync. So just use ssync. Actually, we have very good experience with it; even for replication it's better than rsync, so we moved completely away from rsync. And of course, don't go deleting files by hand on your devices. Don't mess with the files; Swift handles everything.

I really like having the database up to date in real time on my object servers, because it opens up some possibilities. For example, you can map more than just the account and container: you can map the X-Delete-At header, so you can expire objects locally, faster. You can store something like the atime of your objects, the last access, and make tiering decisions based on it. You can also map some inode information, which would let you access your files faster than going through the standard file system path. And if you index everything, you could avoid the listdir calls that the replicator, the reconstructor and the auditor use a lot, and which consume a lot of IO on the devices. So I think this database opens interesting possibilities, and we are working on that at OVH.

So what is the current situation? We had about 26 petabytes to convert. We started in March, and from March to August we converted about half of our cluster. Not all of the conversion was done by our process: we had some help from the official hubiC application which, instead of overwriting existing objects, always deletes the old version and uploads a new version of a customer's file. So it helped us a lot too.
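A rough sketch of the DiskFile direction, assuming you hook the writer that finalizes each .data file. DiskFileWriter and hash_path are real names in swift.obj.diskfile and swift.common.utils, but this subclass is illustrative, not their actual patch; a real version would also hook deletions and batch the writes:

```python
import sqlite3
from swift.common.utils import hash_path
from swift.obj.diskfile import DiskFileWriter

class IndexingDiskFileWriter(DiskFileWriter):
    """Record each finalized object in the local container->hash map."""

    def put(self, metadata):
        super(IndexingDiskFileWriter, self).put(metadata)
        # metadata['name'] is '/account/container/object'
        _, account, container, obj = metadata['name'].split('/', 3)
        # Hypothetical map DB; opening per put() is only sketch-grade.
        with sqlite3.connect('/var/lib/convert/local-map.db') as db:
            db.execute('INSERT INTO map VALUES (?, ?, ?)',
                       (account, container,
                        hash_path(account, container, obj)))
```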
We paused this conversion because we are having some scaling issues related to erasure coding and the file system. This is stuff we are working on, so we can finish the conversion as soon as we can. So this is a good time to give some feedback on erasure coding.

First of all, if you're wondering about bandwidth, it's pretty good; you won't see a lot of difference with replica. Except if you access the same object a lot: if you have a lot of parallel downloads of the same object, you will have better performance with replica. But for a normal user, a general public use case like ours with hubiC, it's not a problem. The extra CPU load on the proxy is very, very low; actually, we didn't see a difference. But you do have a slightly increased latency to access your objects. It's linked to the way erasure coding works: there is a buffer on the proxy that must be filled before the data is sent to the customer. So whether that matters depends on your use case, I guess.

Rebalance: if you're running a big erasure coding cluster, a rebalance has a very big impact on performance. You must really run it during low traffic, at night if you can, and in small steps. So it will take longer to add devices to your cluster, and you should anticipate more when adding capacity. I will skip the reasons; there is some discussion on that.

Another feedback on erasure coding is that you will have more files on your cluster: instead of having, in our case, three data files per object, we now have 15 data files and 15 durable files. Well, the durable files will disappear soon, because a patch landed in git recently, but still, you have more files. Each file consumes one inode; actually, since we ssync fragments, each fragment consumes two inodes, one for the data file and one for its directory. So you end up with a lot more inodes on your disks. In our case, at half of the conversion, on a six terabyte disk we have about 43 million inodes. One inode is one kilobyte, so that's 43 gigabytes of inodes per device. It cannot fit in memory, of course, unless you dedicate 64 gigabytes of memory to one disk, but I doubt you will. In our situation, it means we have about 85% cache misses on inodes. So a lot of disk accesses are done just to read inodes, not to read data; it's a waste of time, I would say.

Another interesting number, if we look at the file system: I was talking about listdir. A listdir is actually a readdir, which translates to the getdents syscall, and about 75% of these getdents don't come from the cache. It means that every time you read a directory, whether you're doing a listdir or just walking through a hierarchy of paths, most of them come from the disk and not from memory. So it's a lot of lost IOPS.

The last one is not really feedback; it's more like a tip if you're going to run erasure coding. When you configure your policy, you have a configuration parameter called ec_object_segment_size; it's the size of the buffer erasure coding works on. You have another parameter, not related at all, called client_timeout, which is how long the object server will wait for data coming from the proxy. When your customer uploads an object to your proxy, it is not sent directly to the object server; it sits in the proxy's buffer.
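The failure mode is easy to put numbers on. A back-of-the-envelope check with illustrative values (1 MiB is the documented ec_object_segment_size default, 60 seconds the usual client_timeout default; your values may differ):

```python
# Minimum client upload speed needed so the proxy fills one EC segment
# before the object server's client_timeout fires.
ec_object_segment_size = 1048576   # bytes (1 MiB, documented default)
client_timeout = 60                # seconds (object-server default)

min_speed = ec_object_segment_size / client_timeout
print('clients slower than ~%.1f KB/s will time out' % (min_speed / 1024))
# ~17.1 KB/s: a user on a very slow uplink trips the timeout, so either
# lower the segment size or raise client_timeout (what OVH did).
```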
But the proxy has already opened a connection to the object server, and if your user is uploading really slowly, you will hit the timeout before the buffer is filled. It took us some hours to understand that. So you have to tune these two parameters together; it depends on your users' bandwidth. In our case, as we already had erasure coding running, we couldn't change the segment size, so we increased the client timeout.

All this work, we didn't do it all by ourselves. There is a review in progress on changing the storage policy of a container, and we took a lot of code from there. Hopefully it will get merged one day; it's very interesting for operators. And before ending: if you liked what we are doing with Swift, you can join us. We are recruiting, really. So come talk to me, or find me on IRC. If you want to have fun with Swift, we have a lot of things to do. I think we have five minutes for questions, so if you have some, feel free. There is a microphone there.

[Audience] Thank you. What do you guys use for the distributed mutex?

Actually, we built a solution based on RabbitMQ. It looks strange, but it works efficiently, and it's very simple. I like it when it's simple.

[Audience] What did you guys end up running for your reclaim age? We're going through a process of increasing the default one-week reclaim age on a lot of our clusters; we're thinking a little bit longer is better. I'm going to ask Brian the same question. What did you guys settle on? Are you still doing a week?

Yeah, we're still doing a week. And we don't really want to increase it, because it would increase the number of files on the devices. So one week is good for us.

[Audience] Maybe I misunderstood, but from what I got, when you start to migrate your objects, you start from an index that you built by scraping the storage nodes. Since you have the object in replica in three places, how do you handle not starting the migration three times over? Because you want to migrate once.

Sure. Actually, it wouldn't be a problem, but we run the conversion on only one zone. We have three zones, so it works that way. I think we're good. Thank you.