So welcome, everybody. We will start. We are OVH. You probably heard about us this morning during the keynote. We are a cloud provider, and we are here to share a little bit of our experience with Swift infrastructure management and what we did on the automation side.

My name is Jean-Daniel. I have been working at OVH for two years as a technical evangelist. You can find me on Twitter under my nickname. And as I'm not the technical guy, I came with Romain.

And I'm Romain. I'm a developer at OVH; I've been there for three years, and I've been working on Swift for almost three years. I'm one of the two guys on the Swift team at OVH, and I'm sometimes on the IRC channel of Swift.

OK, let me show you and explain a little bit about our Swift journey and what we do with Swift. We started our first cloud product in 2011. The name was Hubic, and the goal was to provide space in the cloud for our users to synchronize their data across their different devices. It was based on OpenVT. As the success came, we quickly had to think about a more scalable solution. One year later, in 2012, we did some R&D on Swift, and after three months of moving all the accounts from OpenVT to Swift, we were able to run a large cluster. It was not so very large in 2012, but it's still the same cluster we are running now, so we have done a lot of upgrades and scaling actions on it.

Last year, we launched our public cloud offer, which is a classical IaaS product. We provide Nova, Neutron, Glance, Cinder, Keystone and, of course, Swift. So if you are used to using Swift at another provider, you can use it the same way at OVH.

You have some numbers here. Oh, sorry, maybe I should first explain the difference between Hubic and the public cloud. On Hubic, we don't provide the Keystone API; we only provide access through our own web and mobile applications. On the public cloud product, we provide all the classical APIs, Keystone and every product I already mentioned.

So, you have some numbers here. We manage more than 25 petabytes of user data, which means we have almost 75 petabytes of space on our disks. We manage more than 10 billion objects, and everything is running on 23,000 devices.

Now let me explain a little bit about our infrastructure. This is a very high-level view of it. We have many clusters; the three most important are two in France and one in North America, and they are all built the same way. We deploy the configuration with Puppet, and we monitor our clusters with Shinken. Currently we are running Juno and Mitaka.

On the right side, you have a zoom on our infrastructure. First you have the HAProxy, which is targeted by a round-robin DNS and which is, in fact, the Swift endpoint. The reason why we use HAProxy is the SSL computation: HAProxy uses hardware acceleration, so it's really efficient for SSL transactions. Then you have Keystone and the Galera cluster; I will show you a little bit more about this part on the next slide. The proxy nodes also hold the account and container services, which is why we had to use hardware with good IOPS. And then we have the large storage nodes. Here we have different hardware profiles; the goal is to get a good ratio between CPU power and the space available on the disks.

So this is a quick zoom on the relation between HAProxy and the Keystone Galera cluster. As you can see, we route the requests depending on the type of the request. If it's a write request, we send it to one node, and if it's a read request, we send it to the other node. The reason why we do that is that when we didn't, we had many locks on the database: every write request was distributed across the different nodes, and we had a lot of locks. With this routing, we don't have any locks. And it is a priority list: if the first node is not available, HAProxy will select the second node, and so on. We had a similar problem to manage when a node fails and wants to re-sync itself, so we use the same list, but in reverse order, to avoid selecting the write node. With this workflow, we are able to re-sync without disturbing the whole cluster.
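As a rough illustration of that priority-list idea, here is a minimal sketch (the node names are made up, and the real routing is of course expressed in the HAProxy configuration, not in code like this): writes always go to the first healthy node of a fixed list, and a recovering node picks its re-sync source from the end of the same list, so it never hits the node currently taking writes.

```python
# Minimal sketch of the priority-list routing described above.
# Hypothetical node names; not OVH's actual configuration.
GALERA_NODES = ["galera-1", "galera-2", "galera-3"]  # fixed priority order

def pick_write_node(is_up):
    """All write requests go to the first healthy node of the list."""
    for node in GALERA_NODES:
        if is_up(node):
            return node
    raise RuntimeError("no Galera node available")

def pick_resync_source(is_up):
    """A node that comes back walks the same list in reverse, so it never
    re-syncs from the node currently taking the writes."""
    for node in reversed(GALERA_NODES):
        if is_up(node):
            return node
    raise RuntimeError("no Galera node available")
```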
So now that you have an idea of our infrastructure and the size we manage, we want to share our automation with you. These are the two most important actions we have to handle almost every day: the first one is adding a node to the cluster, and the second one is managing failed disks.

So what happens when we add a node? This is an overview of our process. The first thing to know is that the OVH storage team is a classical customer of the OVH dedicated server team. So, as a classical customer, we have to order a new server in our data center through the OVH API. When the server is available, our robot launches the installation, and when the installation is done, we collect the hardware profile of the node and associate a role with this server in a database.

Now we have an available server and a defined role, so we are able to run the Puppet modules. We have 20 Puppet modules to manage everything from disk partitioning and formatting to the HAProxy configuration, for example. There is no human intervention: the whole configuration of the node is managed by Puppet. And then it's time to add the node to the Swift cluster, which is done through Swift ring manipulation by our robot; Romain will explain a little bit more about it. What I wanted to share here is that, for us, adding a new server to a cluster is just about ordering a new server through our API. There is no other human intervention.

And this is a screenshot of our swift-recon output, which shows the device usage on one of our clusters; we have many of them like this. As you can see, most of our devices are filled between 85% and 92%, and some devices are filled a little bit more than 50%. That is probably because those devices were added a few hours or a few days ago, and the robot is filling them more and more by manipulating the ring. Now Romain will explain to you how it's done.

So, one of the most important parts of increasing the cluster size is adding the devices to the ring. As Jean-Daniel told you, we describe our cluster in a database, which you can see on the left of the slide, and we have a robot that tries to synchronize the ring with what is in the database, so that the ring eventually matches what we have there. But first of all, it's important to check that you can effectively manipulate the ring: you have to check whether the cluster is healthy enough to be rebalanced.

The metric we want to control is called dispersion. We run on every storage server a probe that calculates the dispersion of all the local devices, and this information is sent to our monitoring, so we can get an alert if dispersion is too high on some devices.
We also get graphs of the evolution of the dispersion of each device. This information is also collected on the Swift proxies, using an adapted version of the swift-recon middleware and command-line interface.

Before explaining how the probe works, I'm just going to explain quickly what dispersion is. Dispersion is when data is not where it is supposed to be. Each time you rebalance your ring to add devices, you create dispersion, because you ask Swift to move data from one device to another, and moving the data is not immediate, as it may be gigabytes or terabytes of data. Also, if you have a disk that is down, Swift will write data temporarily to a backup device, which is called a handoff device. This is also dispersion.

So, back to our probe. On the left you can see an example of a small ring. It's actually a simplified view of a ring with three replicas and four devices. We see that, in this ring, device 2 is storing partition 3, partition 1 and partition 2. That is the view from the ring. If you check on your device, a partition is just a folder, so you can see which partitions are on your device just by listing its content. In this example we have partitions 1, 2 and 3, but we also have partition 0. Partition 0 must not be here; it must be on device 0, device 1 and device 3. So we will call it a dispersed partition. It may be here because you rebalanced your ring right before, and the processes that fix the dispersion haven't moved the data yet. It may also be here because device 0 is down and Swift is writing data to device 2 until device 0 gets back. Dispersion is fixed by two processes in Swift: the object-replicator and the object-reconstructor.

So our probe just compares what is on the device with what is in the ring. This information is sent to monitoring, and it is also collected on the proxy: all the object servers on the right calculate their dispersion, and the proxy collects the dispersion of all the devices and calculates an average for the cluster.

When our robot decides to modify the ring to grow the cluster, it first gets the dispersion of the cluster from a proxy and decides whether the cluster is OK to be rebalanced. We consider that the dispersion is OK when it is lower than 0.1%. It's a really low value, but on a healthy cluster it should be under that threshold almost all the time. So the robot checks the dispersion; if it's OK, it fetches all the information from the database, it fetches all the information from the ring of the cluster, and it compares both. Then it creates a list of jobs. We have three kinds of jobs: adding a device to the ring, removing a device, and updating the weight of a device.

A quick note on the weight updates: I like to keep a friendly relationship with my network team, so I want to avoid creating too much data movement at a time. That's why we only increase the weight by 100 points at each run. So we have this list of jobs, they are applied to the ring, and finally the ring is rebalanced. Then all the object servers get this new ring; they fetch the new ring every 30 minutes. And then it's up to Swift, the object-replicator or the object-reconstructor, to move the data to its correct devices.

We use the size of the device as the weight we want to reach, so it's a size in gigabytes. For example, if we are adding a 6-terabyte device to the ring, we will try to reach 6,000 points for its weight.
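To make the probe's comparison concrete, here is a much-simplified sketch of a per-device misplaced-partition check (this is not OVH's actual probe: it assumes device names are unique in the ring, and the real probe also matches the local IP/port and checks that each partition really holds data).

```python
# Simplified sketch of a per-device dispersion probe: list the partition
# directories on the disk and compare them with the ring assignments.
import os
from swift.common.ring import Ring

def device_dispersion(ring_path, srv_node, device_name):
    ring = Ring(ring_path)                       # e.g. /etc/swift/object.ring.gz
    objects_dir = os.path.join(srv_node, device_name, 'objects')
    partitions = [p for p in os.listdir(objects_dir) if p.isdigit()]
    misplaced = 0
    for part in partitions:
        primaries = [d['device'] for d in ring.get_part_nodes(int(part))]
        if device_name not in primaries:
            misplaced += 1                       # this partition belongs on other devices
    return 100.0 * misplaced / max(len(partitions), 1)

# Example: device_dispersion('/etc/swift/object.ring.gz', '/srv/node', 'sdb')
```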
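And here is a similarly hedged sketch of the weight-update job, using Swift's RingBuilder API (illustrative only; the real robot also adds and removes devices, and only touches the ring once the dispersion check described above has passed).

```python
# Sketch of the "+100 weight points per run" job. Target weights would come
# from the inventory database: the device size in gigabytes, e.g. 6000 for
# a 6 TB drive. Not OVH's actual robot code.
from swift.common.ring import RingBuilder

WEIGHT_STEP = 100   # limit the amount of data moved by each rebalance

def bump_weights(builder_file, target_weights):
    """target_weights maps a ring device id to its final weight."""
    builder = RingBuilder.load(builder_file)
    changed = False
    for dev in builder.devs:
        if dev is None:                          # removed devices leave holes
            continue
        target = target_weights.get(dev['id'], dev['weight'])
        if dev['weight'] < target:
            builder.set_dev_weight(dev['id'], min(dev['weight'] + WEIGHT_STEP, target))
            changed = True
    if changed:
        builder.rebalance()
        builder.save(builder_file)
        # write the .ring.gz that the object servers will fetch
        builder.get_ring().save(builder_file.replace('.builder', '.ring.gz'))
```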
If you do the quick math, because we add only 100 points at each run, it means we may have to replay this whole scenario up to 60 times for a device to be fully added to the ring. I can't imagine doing the same thing 60 times in a row, because it's boring and because it's error-prone. That's why we automated it.

Increasing a cluster is something we do a lot, but there is something we do even more: replacing failing disks. Every year we replace about 1,000 disks in our clusters. That's an average of three disks per day, if you work 24 hours a day, seven days a week. I don't do that; I only work five days a week, and clearly not 24 hours a day, so it's more like five disks per day. If a human had to do it, it would be a half-time job, because replacing a disk is not just replacing the disk: you have to do some checks before, and there are a lot of operations to do after the replacement.

For disk replacement, we rely a lot on monitoring. As Jean-Daniel told you, we use Shinken for the monitoring, and because of the number of servers and devices we have in our clusters, we use passive checks. Passive checks mean it is the server that sends the information to the monitoring server. To write good automation, you have to have the right information about the state of your cluster, so we basically write a probe for every recurring problem, and of course failing disks are one of the most recurring problems. We have three ways to consume all the information in our monitoring: the first one is a web interface for humans, the second one is text messages for the people on duty, and the last one is an API for all of our robots.

There is one robot we really like; we call it self-healing. That's him on the picture. Its job is to watch what is going on in our clusters and to take decisions. For example, if a server is unresponsive, the robot will auto-reboot it. If an object-replicator is stuck, self-healing will restart the process. And if we have a failing disk, the robot will decide to replace it.

As I told you, replacing a disk involves many steps and many checks. It's mainly a four-step process. The first one is collecting the information about the alarms in the monitoring: the robot connects to the monitoring API and gets all the alarms for the failing disk. Then it decides whether it is safe or not to replace the disk; I will say more about that in a moment. If it is safe, it sets the server in maintenance and creates a ticket for our data center team so that they physically replace the disk. When that is done, it reinstalls the operating system if needed, if it was a primary disk, and it runs Puppet. Puppet formats the disk, mounts it and restarts the processes if needed. And once it's done, if there are no more alarms on the server, the server is taken out of maintenance and it's back in production.

I just told you that the robot has to decide whether it is safe to replace the disk or not. What does it mean for a disk replacement to be safe? It means the safety of all the partitions the disk is storing. We have another example of a ring here, and we have an alarm on device 2, so we check which partitions are on device 2: in this example, partitions 3, 1 and 2. Then we check that each of these partitions is safe. For example, if we check partition 3, we see it is also stored on device 0 and device 5, so we check that device 0 and device 5 are OK. What does that mean? They are OK if there is no alarm for the device, but also if the device hasn't been replaced recently. Why that? Because if device 0 was changed one hour ago, sure, there are no alarms on it, but there is also no data on the disk yet, because it takes time to reconstruct all the data that must be on it. We ran some benchmarks, and we know that in our clusters it takes about 12 hours to reconstruct one terabyte of data. So, depending on the size of the disk, we know how long we must wait to be sure all the data is back on it. We do this check for every partition, and if every partition is OK, it means we can safely change the disk. So in this example, if device 0, device 5 and device 6 are OK, it is safe to replace device 2, and the robot will do all the steps I just described.
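As a rough sketch of that decision (illustrative only: `has_alarm()`, `replaced_at()` and `size_tb()` are hypothetical helpers that would query the monitoring API, a replacement history and the inventory, and the partition list would come from a listdir of the failing device; this is not OVH's actual robot code):

```python
# Sketch of the "is it safe to replace this disk?" decision.
# 'ring' is a swift.common.ring.Ring, e.g. Ring('/etc/swift/object.ring.gz').
import time

RESYNC_HOURS_PER_TB = 12   # from the benchmarks mentioned above

def safe_to_replace(ring, failing_dev_id, partitions, size_tb, has_alarm, replaced_at):
    """Every partition held by the failing device must have all of its other
    replicas on devices that are healthy and fully re-synced."""
    for part in partitions:
        for dev in ring.get_part_nodes(part):
            if dev['id'] == failing_dev_id:
                continue
            if has_alarm(dev['id']):
                return False    # another copy of this partition is already at risk
            resync_seconds = size_tb(dev['id']) * RESYNC_HOURS_PER_TB * 3600
            if time.time() - replaced_at(dev['id']) < resync_seconds:
                return False    # replaced too recently, its data may not be back yet
    return True
```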
There is one last, tricky thing about disk replacement. If you try to hot-replace a disk in your server, when it's supported, you cannot always unmount the disk, because some I/Os may be stuck on the file system if it is corrupted. We tried many ways to do it the right way, like killing the processes, force-unmounting the disk or lazily unmounting it. At best it was not working, and at worst the kernel was crashing. So we took the easy way, the most secure way: we just pull out the disk without unmounting the file system first. Sure, the kernel will cry a bit, but in the end it's OK with it.

The tricky part is that if you pull out the disk without unmounting it, the block device is still in use in the kernel, so when the new disk is put in the server, it won't get the same letter. For example, if you take out sdb and rack a new disk, it will be named sdc, because sdb is still in use. And as we automated the formatting of the disks, it was a bit tricky to tell Puppet that the disk that was sdb is now sdc, and we don't like to guess this kind of information when it's about formatting a device. So we implemented a kind of constant naming of the devices in udev: we use a name for each device based on the enclosure ID and the slot ID of the disk. Some people agree with me. This way, no matter the order you put your disks in the server, and even if you reboot the server, they will always get the same name, so it's safe to do the formatting with Puppet and things like that.

So this is how we automated disk replacement in our clusters. As you can see, there is no human intervention, because humans do wrong things; they make errors when they do things, and our robots are really reliable. But with all these automations there is no more work left for humans, because the robots do everything, and it was almost getting boring to manage Swift clusters. That's why we had to create some new challenges for ourselves. The current challenge we are trying to achieve is to convert about 20 petabytes of data to erasure coding, and I'm sure we will have nice feedback to give on that pretty soon. It's a work in progress, so we hope to be able to present the results in Barcelona in six months. Thank you. If you have any questions, it's time.

Well, I'd like to thank David, who is in France managing the clusters while I'm having fun here.

How do you deal with enclosure hangups? Because sometimes drives can hang the whole enclosure and all disks become unresponsive. How do you know which drive to extract?

Thank you. So we check SMART values, and we check the kernel messages in our probe.
And we are able to link the kernel name, say sdb, with the name we gave. We wrote some stuff in Perl for that. But maybe that was not your question.

Not exactly. There are cases where a device blocks all I/O on the enclosure it is connected to, so you get kernel messages for every device in this enclosure. They all look failed, but only one device is the real source of the problem.

We never had this kind of issue.

You're lucky.

Yeah, maybe.

Hi, what base operating system are you using?

We are running on Debian.

And are you using the Puppet OpenStack modules?

We are using some of them, for example for Keystone. For Swift, we started our own module a long time ago, so we are still running on it.

Thanks.

Hi. Do you guys typically have any problems with the extra I/O you incur from a rebalance affecting your latency on customer-side API traffic at all?

It really depends on the size of the cluster and the kind of disks you choose. No matter the size of the disk, if you're running SATA disks, you have the same amount of IOPS available per disk, so if you have more disks, you don't have this kind of issue. We try to build big clusters with small disks, and it works better. Of course there is some impact on the performance of the cluster, but most of the time it's not really significant.

OK. And can you talk a little bit through what your zoning strategy is within the ring?

Can you repeat, please?

Yeah, can you talk about how you structure your zones in the actual Swift ring?

Yeah, sure. We don't use the multi-region feature in Swift; each of our clusters is built with one zone in one data center. And we start a cluster with a part power of 18 or 19, depending on what we think the future of the cluster will be. That's basically all for a ring: one zone in one data center.

When you were describing the dispersion collection, it's run per node, and then you're retrieving those metrics via the proxy, which is not the stock swift-dispersion-populate and swift-dispersion-report commands. So I wonder if those enhancements or changes that you guys have made are something that's currently available, or if you think they might be interesting enough to share. Or maybe you could go into more detail about how they're different from what other people are maybe using.

They're not available for now. They could be, but they are based on our probe, which I don't think could go upstream, because it runs every five minutes and it sometimes takes a lot of time to run: we have to do a listdir of all the devices, and we also have to check whether each partition really has data, so we check the content of the partition, and that takes time. So we cannot do it in the recon middleware, because it would block a worker for the duration of the run, which is about one minute. So it's a probe running from cron, and the information is collected with swift-recon.

Yeah, I think that's a great strategy: just run it from cron, dump it out into the recon cache, and then have the recon interface report it.

If you think it can be interesting for everybody, I will try to commit it, of course.

I think so. I think, yeah, in some ways it's a better way to organize the problem. Does the collection look for both misplaced partitions and also missing partitions? I mean, at 10 billion objects, you guys have a pretty filled-out cluster, so every partition that might exist in the cluster surely does; that directory is somewhere.
So with the dispersion metric that a node is collecting for a device, does it say, "hey, according to the ring, this partition should exist on me and it's not here," as much as, "I have this partition that actually belongs somewhere else"? Do you report either direction, do you differentiate, or is one more important than the other in your experience?

We only report misplaced partitions, because when a cluster is starting there is no way to know whether a partition is missing or whether it just doesn't exist yet. So for now, we only report misplaced partitions.

And when you do the collection, do you have something running in the proxy, sort of like an admin middleware, or do you just hit the recon interface on all of the object nodes concurrently?

So we extended the recon middleware on the object servers, and we patched the swift-recon CLI to generate JSON output for every metric the recon can get.

That sounds like another interesting enhancement.

All right, I think that's it for now. Great work.

OK, thank you.

OK, I think we're good. Thank you, everybody.