Okay, so hi everyone. I'm really glad to see you here. It's four in the afternoon on Saturday, so a lot of people have started heading out, and I'm really glad to see this many people here. Today I will try to talk to you about backups of Ceph, because we know there are two kinds of people, right? Those who do backups and those who don't. And actually there's a third group: those who do backups but don't know how to restore them. So always remember to also test restoring your backups. My talk will mostly be about our experience at OVH, the company I work for, and the story of how we got to where we are today. My name is Bartek Święski and my daily job is at OVH in Wrocław, Poland, where we have a research and development center. My current focus is on bringing awesomeness to our Ceph infrastructure.

So, quick show of hands: who does not know Ceph? Okay, a few people. Who does not know OVH? Okay, also a few people. So very quickly about Ceph itself, we've seen a few introductions already, but this is an open source project; the most important fact is that it is open source. It is currently developed under the umbrella of Red Hat. It is meant to be network storage with different access methods, and it should be a perfect solution for large-scale deployments, so it should scale out very, very easily. It should be reliable: if there is any problem, it should detect it, and once it detects it, it should be able to recover from the situation and heal the data that was lost. And last but not least, it's very fast. When you look at the history of Ceph, a lot of effort has gone into making it a very fast platform and squeezing the last bit of performance out of your hardware. That's actually why we like it at OVH, because we want fast things.

And how much do we like it? Enough to have 40 petabytes of raw hard drive storage that we need to manage somehow, and it's growing. It's funny, because last time I presented we had 30 petabytes, and that was, I think, half a year ago. So the growth is really fast, and it is a lot, really a lot. We don't put everything into one cluster; we split it into smaller parts. Right now we operate around 150 clusters in our infrastructure, both internal and external ones. Our base workload is images in OpenStack, and that's where we put the focus of our backup infrastructure.

So I just said that Ceph is reliable, right? It will heal itself. It will replicate data, and it will also take your physical infrastructure into account, so it will make sure the data is replicated, for instance, across racks, not inside one server. If we have all these benefits of Ceph, why do we need to back up? We have three copies; can't we consider that a backup? We could, but better safe than sorry. There can always be some kind of software bug, and we don't know when it will strike. We haven't yet seen anything that could destroy our data, but we know that bugs happen, right? And it's better to be prepared for that kind of situation. Bugs can happen in Ceph, let's say, but also in the lower parts of the stack. Let's say there's a bug in the Linux kernel, something in the filesystem layer, and it starts erasing data; we want to be protected from that kind of situation. There is also software running on our hard drives.
And it can happen that it sometimes just behaves differently than expected. So that's one thing. The second thing is that we want to be prepared for disaster. When you work at a scale like OVH's, probabilities stop being intuitive, for example the probability of hard drive failure. If you have a laptop or a desktop, you can expect to replace your hard drive after a few years, right? For a few years the hard drive should work. When you have a few thousand hard drives, you can expect drives to be dying daily. And remember, we have three copies, so if three hard drives die at the same time, we have just lost our data, right? It's very, very unlikely, but we have already had a situation where two hard drives died at the same time. Ceph is very good in that situation, because when you lose two copies out of three, it just blocks traffic to the cluster. Our clients will complain, but it starts copying the data as fast as it can, so that it again has at least two copies; that's the usual setup. So Ceph helped here, but as I said before, better safe than sorry.

We've also seen problems because we are using FileStore, which is, let's say, the original data storage layer for Ceph, and it uses the XFS filesystem, which tends to have problems when there is a power failure. We've seen that when you just cut the power and bring the machine back, XFS sometimes cannot recover. That means that if we have a big power failure, and we know that big power failures happen, we may have problems, right? Luckily we haven't had this yet, but let's be prepared.

But when you look at the statistics, what's the biggest factor in losing data? It's always the human, right? Always the operators. So let's be prepared for the situation where someone is on, let's say, a very hard shift and mistypes something on the command line, and then gets that cold feeling of having just done something very wrong. We want to be protected against that. Also, because we provide services to our clients, I can assure you that sooner or later someone will come to you and say: I've lost my data, please, please, can you recover it for me? Because even if we do backups, our clients don't always do them, right? And it would be very good to be able to help that kind of client, especially if he says he will pay a lot of money for it, right? Yeah.

Also, when we talk about Ceph, Ceph works very well in a single location, where the nodes of the cluster can connect to each other quickly. If we start putting data in a data center very far away, the ping grows, Ceph has performance issues, and it's not yet fully ready to handle that kind of situation, geographical replication. And we of course know that the best backup is a backup somewhere far away, right? I'm not talking about Mars here, but the other side of the globe would be best. So we need to somehow overcome this and be prepared for that kind of scenario as well. I see that inside Ceph itself there is a lot of effort to bring this kind of geographical replication and backup. Right now you can do it for RBD, but it's very early and it doesn't yet scale well, so we have to have something instead, right? We keep looking into this, and we also see that there's an effort to do the same for the RADOS gateway, if someone is interested; I've heard some rumors, so it may be a very, very good solution in the future, but we need something for today.
So we know what we want, so let's do some planning. We started planning at OVH what we wanted to do with our backups. Let's first think about the backup software. We need some kind of software that will back up our infrastructure, and we made a few assumptions about what we want. We want something that does compression. Why? Because if we compress the data, we need less backup storage, right? Less means cheaper, and of course we can upload it faster. We would also like some kind of deduplication. Why? Because we will be backing up the same thing several times in a row: we have our one RBD image, and the next time we back it up, the tool can compare it with the previous contents, and there will be only a small difference between the two. We also want encryption; that was one of our main goals. We don't want anything to leave our clusters unencrypted. This is customer data, so even if we're only talking inside the OVH network, we want to keep the clients' data inside our clusters; everything that leaves them must be encrypted. We also want something that is fast, because we have a lot of data to back up. And because exporting data from Ceph uses rbd export, we want to be able to process a stream of data: no file, just a stream of data sent to the backup, and everything should work. And we also wanted support for OpenStack Swift. That's a funny requirement, because we already have OpenStack Swift at OVH, and it is a totally separate thing from Ceph: separate code, separate team. So if something goes wrong on the Ceph side, whether it's a Ceph bug or something in our management infrastructure, the separate team should still be fine. Even if in our case we break everything, the other team should still have the data.

Unfortunately, we didn't find a solution that ticks all those boxes. So instead we looked at what experience we already have at OVH, and it turned out that there is a team doing something similar, backups of RBD images, and they are using something called Duplicity, which is open source backup software. We decided to use it because we already had knowledge of how it works and what its quirks are; there were some things they warned us about from the very start that we had to take into account. Still, we keep analyzing alternatives. There is a very nice, promising project called restic. It's still pretty young and it doesn't have compression, but it looks like it may be a replacement for us in the future.

Okay, so we selected the software; let's think about resources. We had to make some assumptions, because we don't know what kind of data our clients have. We assumed that after compression and deduplication we would be left with only about 30% of the original data. We also want to reuse as much of our internal infrastructure as possible, so, as I already said, we used the internal OVH Swift cluster. And we knew we would need compute power, because we need to do this compression and deduplication, so we decided to just use the OVH cloud, where you spawn virtual machines and can very easily spawn new ones and destroy old ones. That seemed very flexible, and if there are any problems with resources, there are dedicated teams to handle them, so it's less burden for us.

Okay, so let's also do some example calculations. At that time we had 20 petabytes of raw data, and we have three copies, so when we divide 20 by 3, we get about 6.6 petabytes of data. And let's say we want to back this up daily. So let's just divide the numbers: 281 gigabytes per hour to back up, so 4.7 gigabytes per minute, 0.078 gigabytes per second, so a throughput of around 0.63 gigabits per second.
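As a rough illustration, here is a small back-of-the-envelope sketch of those numbers in Python. It only converts the figures quoted above between units (assuming decimal units, 1 PB = 1000 TB and 1 byte = 8 bits); the per-hour rate is simply the figure from the talk.

```python
# Back-of-the-envelope numbers from the capacity planning above.
# Decimal units assumed: 1 PB = 1000 TB, 1 GB = 8 Gbit.

raw_capacity_pb = 20          # raw hard drive capacity at planning time
replicas = 3                  # Ceph keeps three copies of everything

logical_data_pb = raw_capacity_pb / replicas
print(f"logical data to back up: {logical_data_pb:.1f} PB")   # ~6.7 PB

gb_per_hour = 281             # per-hour backup rate quoted in the talk
gb_per_minute = gb_per_hour / 60                              # ~4.7 GB/min
gb_per_second = gb_per_minute / 60                            # ~0.078 GB/s
gbit_per_second = gb_per_second * 8                           # ~0.63 Gbit/s

print(f"{gb_per_minute:.1f} GB/min, {gb_per_second:.3f} GB/s, "
      f"~{gbit_per_second:.2f} Gbit/s")
```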
So it doesn't sound that bad, right? We usually operate on 40-gigabit NICs, so we should be able to handle it. We did a similar calculation for raw hard drive speeds, so you can estimate how fast it can go, and the result was that we could basically back up everything in a few hours if we are lucky.

And this was the initial design. We have the Swift cluster. For the image, we of course need to create a snapshot, to have it in some kind of consistent state, because we don't want to back up something that is changing. Then we export this data through some kind of backup virtual machine. We also containerized parts of our infrastructure, to be isolated from differences between Ceph versions and to have local storage cleaned up automatically. We put the data into Duplicity and then push it out to our Swift storage. It looked fine, so let's see how it worked in practice.

We started implementing it, and very soon we started hitting challenges. The first thing to hit us was Duplicity: our choice of software has some limits. The first limit is that it doesn't work with data streams; it only works with files. So what do we do? We take the image, dump it locally, and feed this local copy of the data to Duplicity. We have backup VMs, and we have local storage of, let's say, 500 gigabytes at most, but some of our clients have 2-terabyte images. Can we handle that? We thought we could use hole punching in the Linux kernel: what we dump onto the local filesystem has a lot of zeros, so we punch out those zeros and maybe we get lucky and fit those gigabytes into much less space. But no, it didn't work out.

So we actually, I think I've just jumped to the next problem, but we also had problems with Duplicity because it doesn't tolerate large files. If you go beyond, let's say, a few hundred megabytes, the rsync algorithm used inside Duplicity starts having performance issues: the CPUs are maxed out, you wait hours for the backup, and nothing moves. Our colleagues at OVH said: you need to split, you need to chop it into smaller pieces; you can only back up smaller pieces. And that's what we started doing. When we dump data onto the local hard drive so that Duplicity can back it up, we split it into 256-megabyte chunks. That turned out to be pretty good; we didn't have to go any smaller, Duplicity was able to handle this size pretty well. And for the problem I mentioned before, we have large, large images and basically limited local space, so we also needed to split the whole image into chunks; we call these chunks, and we back up such a large image in parts, each part separately.

It took us some time to implement this and fix all the issues, and basically this is the architecture. We start with the Ceph cluster, an RBD image, and a snapshot. We cut it into pieces; that's the first layer of cutting, and out of it we get 25-gigabyte chunks. Then we have to split those chunks into files, each file being 256 megabytes, put them on a local SSD drive, and then we can push them through Duplicity and finally out to Swift. Yeah, so we made it. It took us a few months, but this is what is running in our production right now, and it works really well.
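To make the chunking concrete, here is a minimal sketch of what an export worker along these lines could look like, using the Python rados/rbd bindings that ship with Ceph. The pool, image, and snapshot names, the chunk layout, and the helper itself are illustrative assumptions, not our actual production code.

```python
# Minimal sketch: read one 25 GB chunk of an RBD snapshot and dump it into
# 256 MB files on local SSD, ready to be fed to Duplicity.
# Names and sizes are illustrative; error handling is omitted.
import os
import rados
import rbd

CHUNK_SIZE = 25 * 1024**3   # first-level cut: 25 GB chunks
FILE_SIZE = 256 * 1024**2   # second-level cut: 256 MB local files

def dump_chunk(pool, image_name, snap_name, chunk_index, out_dir):
    """Dump one 25 GB chunk of an RBD snapshot into 256 MB local files."""
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            with rbd.Image(ioctx, image_name, snapshot=snap_name,
                           read_only=True) as image:
                chunk_start = chunk_index * CHUNK_SIZE
                chunk_len = max(0, min(CHUNK_SIZE, image.size() - chunk_start))
                offset = 0
                while offset < chunk_len:
                    file_len = min(FILE_SIZE, chunk_len - offset)
                    path = os.path.join(out_dir, 'chunk-%04d-part-%04d'
                                        % (chunk_index, offset // FILE_SIZE))
                    with open(path, 'wb') as f:
                        # One librbd read per 256 MB file keeps memory bounded.
                        f.write(image.read(chunk_start + offset, file_len))
                    offset += file_len
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

# Example: dump the third chunk of an image to local SSD before running Duplicity.
# dump_chunk('volumes', 'vm-disk-1', 'backup-snap', 2, '/mnt/ssd/work')
```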
So really, we don't have to care how large our clients' images are. I've seen 30 terabytes once, so it can get pretty, pretty large, and we can handle it with this system. Of course the time needed to back it up will be proportionally longer, but it works.

Okay, but we also thought about some alternatives. We have a lot of splitting here, a lot of files, some local storage; maybe we could do better. And yes, we can: FUSE, filesystem in userspace, a very, very nice mechanism in the Linux world. Basically, you create a userspace application that presents a virtual filesystem somewhere under a directory. So if we create that kind of FUSE client, what can we do? We can do everything we did previously with files and with this chopping: we initially select only a portion of the image, then internally, in memory, chop it into smaller pieces and expose it as an already-chopped series of files, and just point Duplicity at it; it should treat them as if they were just standard files. We wrote this client and right now it's in testing. It's very promising, because we basically don't need any local space: whatever we take from the cluster, the FUSE client just pulls into memory, and when Duplicity requests the data, it goes straight through the kernel and back to Duplicity, which sends it off.

Backing up this way was pretty simple; we had some problems with restoring data, because Duplicity has its own rules, right? For instance, when you start restoring with Duplicity, it first erases everything in the destination directory, so you need to pretend that you handle the removals. Then, when files start to appear, you need to handle that situation, and of course keep in mind that you cannot show files that haven't actually been fully written yet. A tricky situation, but we have a very bright colleague and he implemented it. We plan to push this into production pretty soon, just to remove the need for local SSD space, because that's currently one of our limiting factors: we need a lot of VMs with a reasonable amount of local hard drive space.
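To give an idea of the shape of such a FUSE client, here is a minimal read-only sketch built on the fusepy library and the Python rbd bindings; it exposes an already-open RBD image as a directory of fixed-size virtual "chunk" files that a file-based tool like Duplicity could read. It is an illustration of the idea, not the client described in the talk (which also has to handle the restore path).

```python
# Minimal sketch: expose an open rbd.Image as a read-only directory of
# fixed-size "chunk" files, so a file-based tool can back it up without
# any local staging space. Assumes the fusepy package (pip install fusepy).
import errno
import stat
from fuse import FUSE, FuseOSError, Operations

class RBDChunkFS(Operations):
    def __init__(self, image, chunk_size=256 * 1024**2):
        self.image = image                      # an already-open rbd.Image
        self.chunk_size = chunk_size
        self.size = image.size()
        self.nchunks = (self.size + chunk_size - 1) // chunk_size

    def _index(self, path):
        try:
            return int(path.lstrip('/').split('-')[1])
        except (IndexError, ValueError):
            raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return ['.', '..'] + ['chunk-%05d' % i for i in range(self.nchunks)]

    def getattr(self, path, fh=None):
        if path == '/':
            return dict(st_mode=stat.S_IFDIR | 0o555, st_nlink=2)
        idx = self._index(path)
        if idx >= self.nchunks:
            raise FuseOSError(errno.ENOENT)
        length = min(self.chunk_size, self.size - idx * self.chunk_size)
        return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1, st_size=length)

    def read(self, path, size, offset, fh):
        idx = self._index(path)
        base = idx * self.chunk_size
        end = min(base + self.chunk_size, self.size)
        length = max(0, min(size, end - (base + offset)))
        # Read straight from the RBD snapshot; nothing touches local disk.
        return self.image.read(base + offset, length) if length else b''

# Usage sketch (image opened read-only at a snapshot, as in the previous example):
# FUSE(RBDChunkFS(image), '/mnt/rbd-chunks', foreground=True, ro=True)
```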
Okay, so let's also talk about something else: the impact on production. When we start reading data from our clusters, we can expect it to affect our clients. So from the beginning we knew we had to throttle the backup system; we had to be able to very easily shape the amount of data flowing through it. That's why we built in a few layers where we basically have a valve we can turn, to push more through the backups or to reduce it.

One of those valves comes from the fact that we use Celery tasks. Celery is a task-processing framework; it has workers, and the number of workers we spawn is basically the number of simultaneous backups in the system. So just by spawning more workers or reducing their number, we can easily run more or fewer parallel backups in the whole system. But we still want to protect each individual cluster, so we implemented some limits, for example how many backups can run against one cluster at a time. We also want to limit when we back up: when we schedule new backups, we can for instance say, schedule this only during the night, because clients don't do much on this cluster then. There are also the backup virtual machines: we have a virtual machine with, let's say, a 200-gigabyte hard drive, and we need to put backups there, each of 25 gigabytes, because that's the chunk size, so we don't want to cross certain limits; each backup VM can accept no more than some calculated number of backups at a time. And for one image we decided there can only be one backup running at a time: we either finish the backup or cancel it. That just simplified everything. All those limits we implemented using ZooKeeper, with semaphores and locks built on top of ZooKeeper, basically.

Okay, so let's see how it scales. Of course we had issues; scaling always reveals some issues. Even if you test locally and try to do some simulation, you can never predict what will happen at large scale in production. ZooKeeper, our choice here, sometimes has a problem, because ZooKeeper is good for rarely-changing data. It will distribute the data across its own cluster, but when you start making frequent changes, every once in a while there will be a snapshot. Snapshots take hard drive space, so if you make a lot of changes, and acquiring locks and semaphores makes a lot of changes, you get alerts that you are running out of space, because there are a lot of snapshots. What we did was simply increase the number of transactions after which ZooKeeper takes a snapshot, and that basically fixed our problems with ZooKeeper.
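For illustration, the kind of ZooKeeper-backed limit described above can be built with the kazoo library's lock and semaphore recipes; a minimal sketch, with made-up paths, limits, and a hypothetical do_backup callable, might look like this.

```python
# Minimal sketch of ZooKeeper-backed throttling with kazoo:
# a per-cluster semaphore limits concurrent backups on one Ceph cluster,
# and a per-image lock ensures only one backup of an image runs at a time.
# Paths, limits, and the backup function are illustrative assumptions.
from kazoo.client import KazooClient
from kazoo.recipe.lock import Lock, Semaphore

def run_backup(cluster_name, image_name, do_backup,
               zk_hosts='zk1:2181,zk2:2181,zk3:2181', max_per_cluster=4):
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        # At most `max_per_cluster` simultaneous backups against this cluster.
        cluster_sem = Semaphore(zk, '/backups/clusters/%s' % cluster_name,
                                max_leases=max_per_cluster)
        # Only one backup of a given image at a time.
        image_lock = Lock(zk, '/backups/images/%s/%s' % (cluster_name,
                                                         image_name))
        with cluster_sem:
            with image_lock:
                do_backup(cluster_name, image_name)
    finally:
        zk.stop()
        zk.close()
```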
It also turned out that we had problems with the Celery workers, because they have some problems with memory (sometimes the memory just spikes) and also problems with CPU when there are connectivity issues. We run this on virtual machines, and on VMs you don't get guaranteed CPU characteristics: sometimes you have plenty of CPU, sometimes it gets saturated very quickly because some other VM is doing something CPU-intensive. Celery doesn't like that, because workers have to respond to pings over the network, and when a worker doesn't answer within a reasonable time, it is declared dead and kicked out. We had situations where half of our Celery workers were considered dead. What we did was increase the ping timeout, and right now it works reasonably well, but we're still looking forward to a new version of Celery fixing this. We also have problems with Docker: we containerize everything so that it gets cleaned up automatically, but with a lot of local hard drive activity you put a lot of data inside Docker, sometimes it doesn't get cleaned up, and you just run out of local space. That is something we are still fighting with. And Duplicity itself has problems at scale: it turns out that you can do the backup reasonably fast, but a restore needs about four times longer, because of how it's written.

So we started looking for something better, and we came up with a hot/cold backup strategy. Okay, so let's think about what we can do better, and this is basically something we should have started with: what's the fastest way to back up a Ceph cluster? It's to use another Ceph cluster, because Ceph is fast, right? If you back up Ceph onto Ceph, it should be fast. So what do we do? We create a separate cluster that is meant to hold just a copy of the original one and is dedicated to backup.

We also remembered that there is a very nice feature, rbd export-diff and rbd import-diff. Basically, in Ceph you can say: export only the changes since a given snapshot, and then apply just those changes somewhere else. So we thought, yeah, there must be something in it, and intuitively we felt we could benefit here; as I will show later, we did. There is also one very nice property which was crucial for us: if you have this kind of spare cluster holding the data of, let's say, thousands of virtual machines, and we have a disaster and our original cluster dies, what do we do? We take the spare backup cluster, plug it into OpenStack, and start firing the VMs back up, so we can recover very quickly. We don't have to copy the data back from somewhere; we just use the cluster we already have.

We also thought we could reuse our previous architecture, the code we wrote to chop things up, manage everything, and have these throttles everywhere. We figured we could easily use the same code, and with this backup cluster in place we keep doing what we did before: take the hot backup and put it through Duplicity. So we have two levels: the original cluster and the hot backup cluster, and then the cold Duplicity backend behind it. We still have the separate software solution and the separate team, we still have the data held somewhere else, and if we break everything, there is a separate team that should still have our clients' data.

The architecture is actually simpler than with Duplicity alone. We take the original cluster and make a snapshot of the image. We decided to keep the 25-gigabyte chunks, because thanks to them, when there is a problem, a network problem for example, we can just restart one part instead of the whole image, so there is some advantage there. We take the chunk, put it through a backup container (I'll explain in a minute what that is) and apply it onto the backup copy of the image. A backup container is something we can very easily spawn next to the backup cluster; we actually don't need that much processing power here, because there is no compression or deduplication, it's just a pipe of data from one place to another.

And we get great advantages from that solution. It turned out that we can back up very large clusters in less than 24 hours. We also don't need that much computing power, because we don't do the compression and deduplication; the data just flows through. And recovery is very fast, because we can take this backup cluster and use it instead.
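Conceptually, the hot path boils down to piping rbd export-diff into rbd import-diff between the two clusters. Here is a minimal sketch of that pipe in Python; the pool, image, and snapshot names and the cluster config paths are illustrative assumptions, and the real system described in the talk does this per 25 GB chunk with its own Python tooling rather than per whole image.

```python
# Minimal sketch: stream only the changes since the last backup snapshot from
# the production cluster straight into the backup cluster, with no local staging.
# Cluster/pool/image/snapshot names are illustrative; error handling is omitted.
import subprocess

def incremental_backup(pool, image, last_snap, new_snap,
                       src_conf='/etc/ceph/prod.conf',
                       dst_conf='/etc/ceph/backup.conf'):
    spec = '%s/%s' % (pool, image)

    # 1. Freeze a new point-in-time view on the production cluster.
    subprocess.check_call(['rbd', '-c', src_conf, 'snap', 'create',
                           '%s@%s' % (spec, new_snap)])

    # 2. Export only the delta between the two snapshots and pipe it
    #    directly into import-diff on the backup cluster.
    export = subprocess.Popen(['rbd', '-c', src_conf, 'export-diff',
                               '--from-snap', last_snap,
                               '%s@%s' % (spec, new_snap), '-'],
                              stdout=subprocess.PIPE)
    subprocess.check_call(['rbd', '-c', dst_conf, 'import-diff', '-', spec],
                          stdin=export.stdout)
    export.stdout.close()
    if export.wait() != 0:
        raise RuntimeError('rbd export-diff failed')

# Example:
# incremental_backup('volumes', 'vm-disk-1', 'backup-2018-02-02', 'backup-2018-02-03')
```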
Okay, so let's talk about how well it works and bring some numbers from OVH. First a global statistic: right now we actively back up 34 clusters, about 1000 images daily, and it is still mostly using the old Duplicity/Swift-based solution. That means roughly 0.6 petabytes of data daily. So it already works well, but we can still do better.

This is a case study of one of our biggest clusters, which we experimented with; it is actually the first production cluster where we deployed the Ceph-on-Ceph infrastructure. We started at about 3,350 images backed up per week; that was what we could do on this cluster. It turned out we actually weren't able to back up the whole cluster in one week with the first infrastructure, because the largest images simply needed more time. Once we implemented Ceph-on-Ceph, that's the second bar here, we still kept the 7-day period, and it turned out we could back up the whole cluster easily within one week; we had 4,776 images there and everything could be backed up in one week. Then we shortened the backup period to one day, and the performance spike is where we implemented the diff strategy. It shows our intuition was right: if we export only diffs, we can take advantage of the fact that in RBD images only a fraction of the data actually changes, and you don't have to re-upload everything else every time. It's around a 10x performance improvement over the original implementation.

One more nice chart. This basically shows the age of the backup: when we do a new backup, it drops back down to very young. On average our images initially had backups about 6 days old, so many images we were actually able to back up more frequently than once a week, but the worst case was around 1.7 weeks. Once we started doing Ceph-to-Ceph backups, we reduced that down to around 7 days. You can see some spikes here; that's basically where we were fighting with Celery and ZooKeeper, and that's why it doesn't always work perfectly. But the most important thing is here: we switch to diffs and suddenly we see a drop. That is, I think, the most important point of this talk. What you can also see is that there's an initial part where we are already on the Ceph level but without diffs; that's because you first have to do a full backup, which takes a few days. Once the first backup is done and we have the required snapshot, we start applying diffs instead of full backups, and that's where we get this reduction in time.

Similarly, this chart shows whether we can keep up with the amount of data. This line basically shows how many images we have backed up, and it resets to zero once per the time window we want to keep. That means that with a total of around 4,700 images, this line should climb to the top and cross that level at some point before we can say we can back up everything in time. Initially, with the Duplicity/Swift implementation, we were not able to get there. Once we switched to Ceph-on-Ceph, if we just ignore this period, the trend is such that it can easily cross the line. And once we add the diffs, we can very easily back up everything within one day.

Just to sum up: backups at scale are possible. We do this; we have a lot of data and we can back it up. But you should use what Ceph actually gives you: use export-diff, and that will give you the fastest way to back up your data. You can definitely back up clusters that are highly utilized; on our clusters there is a lot of activity, a lot going on 24 hours a day, and still every 24 hours you can back up your clients' data. That is something you can easily achieve; you can even get shorter backup times, but you have to watch out for the latency on the cluster, because it may start increasing. And even if you have Ceph-on-Ceph, think about some alternative storage as well, because we never know what will happen and what bugs may be waiting for us in the next few years. What we found is that Ceph-on-Ceph is good as a first-line backup, with some kind of cold backup behind it.

So that will be it from my side. There are some nice images here, these are the attributions, and we have a few minutes for questions.
So the question is: if we have a workload like databases, do we have any coordination between what is going on inside the VM, like dumping the data, and our backup system? Currently we don't do this, because first we want to have the backups ready and working on the 24-hour scale. But we have discussed it with our colleagues in the OpenStack team, because in OpenStack you can also do backups, and it has the ability to freeze the local filesystem, do a flush, some kind of sync, and take the snapshot at that point. So we want to explore this in the future; it's not yet implemented, but we're thinking about it.

The question was about restic and the fact that it supports streaming. We are still considering it and we want to try it in the next few months. The streaming feature is actually very important, because then you don't need local storage, so that could be one of the benefits of restic. At this point in time we're just focusing on this hot/cold strategy and want to have that first; then we will try to improve the cold backend, and restic will most likely be a very good candidate there.

Have we considered other candidates than restic? There were a few; right now I can't recall how many, but basically once we heard that other people at OVH were using Duplicity, we knew we could use their knowledge, so we didn't do very extensive research in that regard.

The question was: if we do the Ceph-on-Ceph backup, do we end up with an equally sized backup, or do we have fewer replicas? Currently we are using an identical cluster, the same size and the same number of replicas, because when we have to do a quick switch in case of failure, the backup cluster has to be able to handle the traffic. But we are also considering a setup with totally different hardware tailored for backup: more hard drives, less performance, and maybe even two replicas instead of three, because that would be a cost saver. Okay, here.

So the question is: when we do a chunked backup, do we reapply the chunks on the backup cluster at the end? Each chunk in our system is a task, and each task is responsible for exporting the diff of only that chunk. We created a small script in Python where you can very easily export the diff of just one chunk, and on the other end we do something similar: we apply only the part concerning that chunk. Once we have all the chunks applied to the head of the backup image, we can finish the whole backup and take a snapshot at the end.

Are we still using ZooKeeper? Yes. Replacing it is something we will consider; I think we were thinking about etcd, but right now we have just solved our major problems and started focusing on growing our infrastructure, so it's definitely something that will be on our roadmap. Okay, that's all I've got, so thank you.