Hello everyone, my name is Axel and I work at Quarkslab, which is an information security company. I'm part of the infrastructure team, so one of our roles is to manage all the servers and services of Quarkslab, and one of our duties, of course, is to handle backups of all those services. After we had an issue with a backup server, we decided to redo the whole backup system at Quarkslab, and in this talk I will present the solution we designed and implemented. It can also be applied to a personal infrastructure: if you have one, you can back it up using this solution as well. Everything is on GitHub, and you have links at the end of the presentation.

At Quarkslab, our infrastructure is composed of different types of servers. You have virtual machines, of course. You also have bare metal servers that are on-premise. There are what we call project servers, which can contain sensitive information that only a limited number of people at Quarkslab are allowed to access. Sometimes that doesn't include us, but we still have to back them up and find a solution to do it. And there are also servers in the cloud: mostly bare metal servers at hosting providers, encrypted as well, that we need to handle. Basically, what we want is to be able to back up all those types of servers.

So we looked at existing solutions to manage backups, and we found them quite complex. They often try to handle every case and be very generic, with the ability to process anything from a Windows server to all kinds of machines. They often need an agent on the host that has to run continuously, and a server, maybe with a database. They provide access control lists, web interfaces to manage the backups, things like this. And we found it quite hard to actually understand every part of each system.
What we believe is that we need to understand how everything functions in order to be able to debug any issue that occurs. So we decided to go with tools we knew, and to package them together to answer our needs. Basically, backups rely on two things. To have effective backups, you first need effective storage: if you write backup data to disk, you want to be able to get it back later. That's essential, because even if your backup software is really advanced, it can't by itself survive a hardware failure. So to handle the hardware and storage side of things, we decided to go with two tools: FreeNAS and OpenZFS.

FreeNAS is a FreeBSD-based distribution that you can install on a server. The server can have any disks you want; they don't have to be SAS drives or enterprise drives, they can be off-the-shelf consumer hard drives that anyone can buy. FreeNAS basically transforms this server into a very powerful solution for storing files reliably. It offers interfaces like NFS so you can access and store your data on the server. It also does a really great job of sending emails and automating things like disk scrubbing, so you can verify that the data on all your disks is intact.

This server relies on the file system, which is ZFS. ZFS is more than just a file system: it also handles the RAID side of things. You provide it with multiple disks, and if you want redundancy across the data you write to those disks, ZFS can do this. It also offers functionality similar to LVM: for example, if you need to create separate partitions, you can do it quite easily, and resize them.
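As a rough sketch of the RAID and LVM-like abilities just described (the pool name, disk names and sizes here are made up for illustration):

```shell
# Create a pool with double-parity redundancy across six disks
# (any two disks can fail without data loss).
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# Create separate "partitions" (ZFS datasets), resizable at any time via quotas.
zfs create tank/backups
zfs set quota=2T tank/backups
zfs set compression=lz4 tank/backups

# Periodically verify that all data on the disks is intact (scrubbing).
zpool scrub tank
```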
What it brings with these partitions, which are called datasets, is the ability to take snapshots of them. At a point in time, you can just freeze the content of the dataset where you store your data, and then mount it read-only later if you want, so you can access those files as they were at the time you took the snapshot. Another ability is that you can actually send snapshots to other hosts that run ZFS, and FreeNAS does a great job of integrating this. It allows you, for example, to do off-site backups of your data: you just take a snapshot of your dataset and send it to a remote ZFS server. So you have the ability to self-host a complete backup system; you don't have to rely on existing backup services or on servers hosted by other people. At Quarkslab we are very keen on self-hosting everything we can, because we sometimes deal with sensitive information and we don't trust external entities to handle it for us.

On the hardware side, another issue you have to take care of is the hard drives themselves. For example, here is a Seagate model that had a 32% failure rate after a few years of use, which is quite unusual. Seagate was trying a new technology at the time, and they claimed it was very safe, but it wasn't really, and they faced a class action lawsuit. If your server happens to have only those disks inside, you have a very high probability of failure, and even ZFS won't be able to do anything about it if you lose a majority of your drives. So it's very important to have different types of disks in your server.

So effective storage is one part of the picture. The other part is having effective backups, and here we decided to use Borg, because we already used Borg ourselves.
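The snapshot and off-site replication workflow described above can be sketched as follows (dataset, snapshot and host names are illustrative):

```shell
# Freeze the dataset's content at a point in time.
zfs snapshot tank/backups@2019-06-01

# Snapshots can be browsed read-only later, showing the files as they were.
ls /tank/backups/.zfs/snapshot/2019-06-01/

# Send the snapshot to a remote ZFS host for an off-site copy.
zfs send tank/backups@2019-06-01 | ssh offsite.example.com zfs receive pool/backups

# Subsequent snapshots can be sent incrementally: only the delta travels.
zfs send -i tank/backups@2019-06-01 tank/backups@2019-06-08 \
    | ssh offsite.example.com zfs receive pool/backups
```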
For example, I use Borg to back up my personal laptop, and I decided it was also a good idea to use this tool to back up our servers. Borg provides a number of advantages. It's open source and a standalone binary, so you can just drop it on a server and it works; you don't have to install a ton of dependencies, which makes it very easy to install on a heterogeneous infrastructure. It offers compression, deduplication and encryption. Basically, it works at the level of blocks of files: instead of treating each file individually, it treats data as blocks. A file can contain multiple blocks, and deduplication happens across those blocks, not across the files themselves. This is really powerful because, for example, if you have virtual machine disk images on your system, you can have multiple virtual machines that share the same base image with small modifications inside each. Borg recognizes the data that is identical in every file, which allows it to deduplicate very effectively: only the data that is unique across all the files in the file system will be backed up. So the deduplication is very powerful, and on top of it there is compression as well. This means that if your data is highly compressible, like logs, for example, it will be compressed quite effectively. It's also quite fast, and you can do remote backups over SSH.

Here is an example: my last backup of this personal laptop. You can see that it backed up 1.4 million files in four minutes. There were 350 gigabytes of total used data on the drive; it compressed that down to 228 gigabytes just by compressing the files that were highly compressible, and it further deduplicated the data. Of course, this is my last backup, and I had previous backups.
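In command form, the Borg features just described look roughly like this (repository paths and host names are illustrative, not the actual setup from the talk):

```shell
# One self-contained binary; initialize an encrypted repository.
borg init --encryption=repokey-blake2 /mnt/backup/laptop

# Create an archive: chunks are deduplicated against everything already
# stored, then compressed; --stats prints size figures like the ones quoted.
borg create --stats --compression zstd /mnt/backup/laptop::{now} /home /etc

# Remote backups over SSH work the same way.
borg create --stats ssh://backup@backupserver/./repos/laptop::{now} /home
```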
In this last backup of 350 gigabytes, only 247 megabytes had changed since the previous backup, so it only stores 247 new megabytes of data on the disk. And you can see that it's quite effective, because the total amount of data ever backed up on my machine is 5.5 terabytes of combined data across all backups. Those 5.5 terabytes went down to 3.47 terabytes of compressed data, and then deduplication brought those 3.47 terabytes down to 266 gigabytes. So effectively, on my backup hard drive, only 266 gigabytes are used, but if I were to expand all the backups over time to recover all the files, it would amount to 5.5 terabytes of combined data. It's really, really effective deduplication, and it's very fast: 1.4 million files scanned in 4 minutes to determine the differences between the previous backup and this one.

So we decided to use Borg on the different servers. Like I said, with Borg you can do backups over SSH: the server that needs to be backed up connects to the backup server and does its backup across the SSH connection. The issue we have is that our backup server is self-hosted in our internal, on-premise infrastructure, because we mostly have self-hosted servers. But some hosts are on the internet, like the bare metal servers at hosting providers I mentioned, and we need to back them up as well. We can connect from the backup server to the external server without any issue, but the external server can't connect back to the backup server; it can't initiate a connection, because our firewall only allows outgoing connections, not incoming ones. And we don't want to allow incoming connections, for security reasons.
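To put those numbers in perspective, here is a quick back-of-the-envelope calculation using the figures quoted above:

```python
# Figures from the backup stats quoted above (approximate, in gigabytes).
original_gb = 5.5 * 1024      # all backups expanded: ~5.5 TB
compressed_gb = 3.47 * 1024   # after compression: ~3.47 TB
deduplicated_gb = 266         # actually stored on disk: 266 GB

compression_ratio = original_gb / compressed_gb
overall_ratio = original_gb / deduplicated_gb

print(f"compression alone: {compression_ratio:.2f}x")
print(f"compression + deduplication: {overall_ratio:.1f}x")
print(f"space used: {deduplicated_gb / original_gb:.1%} of the expanded data")
```

So the stored data is roughly a twentieth of what the expanded backups would occupy, which matches the speaker's point about how effective block-level deduplication is.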
We don't want to expose any internal service directly to the internet. So the issue is that Borg expects the server that needs to be backed up to connect to the server storing the backups, but here the connection would be blocked by the firewall. To solve this, we initiate the connection from the backup server itself, and we do reverse port forwarding using SSH. This means the backup server initiates the connection to the remote server, and across this initial SSH connection it opens a listening port on the remote server. All data coming into this port is redirected to the local backup server, on any port we want, here the SSH port. Basically, this allows us to expose the SSH server of the backup server to any external host without the host having to connect to the backup server, because we initiate the connection first: we establish an SSH tunnel between the two servers, so the data can flow back across this tunnel, inside the SSH connection that is already established. This also works for internal hosts, of course. There is some overhead in that case, because we use two SSH connections where we didn't need to, since an internal server can connect directly to the backup server on the same network. But it works for both cases, so it's the most generic approach, and that's what we implemented.

The second issue we have to deal with is that some servers handle sensitive information, and we want all our backups to be encrypted. This means the encryption key has to be stored on the server itself, so it knows which key to use. The server does its backup across the SSH tunnel as I just explained, so the data flows encrypted across the tunnel and is stored encrypted on the backup server. The issue is: what happens if your server dies, for example if a hard drive dies, given that the encryption key was stored on it?
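A minimal sketch of the reverse port forwarding described above (host names and port numbers are illustrative):

```shell
# From the backup server: connect OUT to the remote host, and open a
# listening port 2222 there that tunnels back to our local SSH server.
ssh -R 2222:localhost:22 root@remote.example.com

# On the remote host, Borg can now reach the backup server through the
# tunnel, even though the firewall blocks direct incoming connections:
borg create ssh://backup@localhost:2222/./repos/remote::{now} /etc /var
```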
In that case you have no way to recover your backups, so that's a bit of an issue. We don't have a very good workflow to handle this at the moment. What we do is store a copy of the backup encryption key on the infrastructure team's laptops, so we have a few backups of this key. If the server catches fire, we have a backup of the encryption key that we can directly use to recover the data. But the encryption key is not present on the backup server itself, because if someone found a vulnerability in the backup server and gained access to it, we don't want that person to be able to read the backups, since they contain sensitive information. So we prefer to keep the keys stored separately from the encrypted data itself.

This is also very convenient because for some of the servers, we want to back them up using this infrastructure, but we are not allowed to access the data inside. That's a bit of an issue, because if we had a copy of the encryption key, we could just use it to decrypt the data. So what we do instead is tell the person managing the server how to set up the system, and then this person stores a copy of the key themselves, securely. This means that we, the infrastructure team, never have access to the decryption key, but if the server catches fire, we can provide the person with the encrypted files and they can decrypt them themselves.

To handle executing the backup process on all the various servers, we created a small Python 3 script, which is basically in charge of connecting to the remote servers at regular intervals. It triggers the SSH tunnel creation and the backup process. This script runs on the backup server itself, collects the logs from all the various backups, and sends us emails. It sends us emails on error, of course.
So if a backup has an issue, we are informed directly. But it also sends emails on success, which is very important, because if you never receive emails when your backups succeed, you are never sure that the backup process actually executed. The backup server could be down for whatever reason, and then you would not receive any error message telling you that the backups can't be done, precisely because the server is down. So it's important for us to receive emails when everything is going well, because then we know that everything actually went well. Here is the subject line of the emails that are sent to us: it reports the total number of servers backed up, the number of servers that had issues during the backup, and the total time spent backing up the servers, which is a useful metric if you compare the times between different runs. You can tell instantly if, for example, some server is acting weird because it takes much longer than usual to back up. The script also handles storage of all the backup logs, so we can access them whenever we want. And it's really simple: we have full code coverage with unit tests, and it's available on GitHub, so you can use this script if you want to replicate the same concept of backing up remote servers using Borg.

This script runs in a FreeBSD jail on the FreeNAS server itself; the recommended way to run code on a FreeNAS server is to create a jail and run the code inside it, so that's what we do. And we automate a few things: the creation and provisioning of the jail, and the provisioning of the servers that need to be backed up, using Ansible. You can find the various Ansible roles and an example playbook that uses those roles on GitHub as well, so if you want to recreate the solution as is, you can do so very easily.
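The actual orchestration script is on GitHub; purely as an illustration of the reporting idea described above (this is a hypothetical sketch, not the real code), building the success/failure summary for the email subject might look like:

```python
# Hypothetical sketch of the email-subject summary described in the talk;
# the real script lives in Quarkslab's GitHub repositories.
from dataclasses import dataclass

@dataclass
class BackupResult:
    host: str
    ok: bool
    duration_s: float

def summary_subject(results: list[BackupResult]) -> str:
    """Build a subject line: total hosts, failures, and total time spent."""
    failed = sum(1 for r in results if not r.ok)
    total_time = sum(r.duration_s for r in results)
    status = "OK" if failed == 0 else "ERROR"
    return (f"[backup] {status}: {len(results)} hosts, "
            f"{failed} failed, {total_time / 60:.1f} min total")

results = [BackupResult("web1", True, 240.0), BackupResult("db1", True, 600.0)]
print(summary_subject(results))
# "[backup] OK: 2 hosts, 0 failed, 14.0 min total"
```

Comparing the total time across runs, as the talk suggests, then becomes a simple diff between two subject lines.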
Now I'll present the complete backup process. You have the backup server here, with a cron job running periodically on it, once a day in our case, but you can use whatever frequency you want. This cron job is in charge of executing the backup script I just talked about. The script reads its configuration file, which basically contains the list of hosts that need to be backed up, how to connect to those hosts (which SSH port to use), and various other information, for example what the subject of the success and error emails should be. Then it runs SSH commands: it creates an SSH connection to each host we need to back up. It does this sequentially for each host; we could also parallelize it, but that's a work in progress, it's not done yet. For each host, it creates a connection and establishes the reverse port forwarding over SSH, to provide the way back for hosts that are on the internet, for example.

The host that needs to be backed up receives the SSH connection, and the tunnel is established. In the SSH configuration of that host, two things are important. First, we permit root login, but we only allow commands that are predetermined in advance, so we don't actually expose root access to potential attackers; well, we expose it, but they can only run the one command we defined. Second, we permit user environments, so we can send environment variables over SSH. So we don't actually execute arbitrary commands on the server: it's restricted to executing only one command specified in advance, and this command is a call to Borgmatic, which is a tool that handles Borg configuration.
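On the host being backed up, the two SSH restrictions just described could be sketched like this (key material and paths are placeholders, not the actual Quarkslab configuration):

```shell
# /etc/ssh/sshd_config (excerpt): root may log in, but only to run a
# forced command, and environment variables may be passed over SSH.
#   PermitRootLogin forced-commands-only
#   PermitUserEnvironment yes

# /root/.ssh/authorized_keys (excerpt): the backup server's key can only
# trigger borgmatic, nothing else ("AAAA..." stands for the public key).
command="/usr/local/bin/borgmatic",restrict ssh-ed25519 AAAA... backup-server
```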
Basically, we tell Borgmatic: you can read this configuration file to know which settings to pass to Borg, which directories to back up, which server to back up to. If the repository doesn't exist on the backup host yet, it creates it, using an encryption key that we store not on the backup server but on the host that needs to be backed up. Then it creates a new backup, checks all existing backups for integrity, so we are sure the backups don't become corrupted over time and that we can actually reassemble all the various blocks from all the backups we stored, and finally it deletes the backups we don't need anymore because they have expired, for example. It uses an SSH key to connect to the backup server, which we provisioned in advance using Ansible, but you can also provision it by hand, of course.

The host then establishes an SSH connection to the backup server using the mechanism integrated into Borg. The backup server receives the incoming SSH connection established by Borg, and it also restricts it to executing one single command only, which is borg serve. Borg then acts as a remote server and handles all the incoming data very efficiently, because it understands what the Borg instance on the server being backed up is sending. We work in append-only mode, which means the remote server can only store new blocks on the backup server; it can't actually delete any blocks. This is very important, because if someone compromised one of our remote hosts and wanted to cause us harm, they could just instruct the host to delete all of its backups, and then we would have no backups left.
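On the backup server side, the restriction to borg serve in append-only mode can be enforced through the forced command attached to the client's key; a sketch (paths and key material are illustrative):

```shell
# ~backup/.ssh/authorized_keys on the backup server: this host's key may
# only run "borg serve", in append-only mode, confined to its own
# repository path, so it can neither delete blocks nor touch other repos.
command="borg serve --append-only --restrict-to-path /mnt/backups/repos/this-host",restrict ssh-ed25519 AAAA... root@this-host
```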
By using append-only mode, the server can't delete any information, so we are always sure to have actual backups of the server on our backup server, even if the remote server is compromised. We know we can roll back to a state from before it was compromised, if we need to. We also restrict the connection to a specific folder, so it can only store backups in that folder. It's the same concern: if someone compromised a remote host, they could instruct Borg to overwrite the backups of other servers. You don't want this; you want it restricted to only one repository, and that's what we do here. This restriction is attached to the SSH key of this host, which we have on our backup server. Then the connection is established, borg serve is running on one side, Borg on the other, and they can exchange data, which is encrypted, of course, with the encryption key present on the host being backed up. Once the backup is finished, the script sends us an email: for example, the backup completed successfully, we don't have any errors, it spent this much time backing up, and we have only successes, no failures or skipped hosts.

So, thank you. Everything I talked about, everything we created, that is the Ansible roles to set up the backup system, the script that handles all the backups, and the backup process itself, is documented and available on GitHub, on the Quarkslab organization, and these are the names of the various repositories. Feel free to ask me any questions. I will also be outside the room if you have questions that are a bit too long to discuss here, and feel free to take a look at the scripts if the system seems interesting to you and you want to implement it yourself. Yes? Can you repeat the question, sorry? Yes? I'm sorry, I don't understand your question.
So basically the Borg server allows new blocks to be stored inside it, but it doesn't allow any blocks to be erased. The Borg server has read-write access to the files themselves, but it restricts writes to new blocks only; it doesn't allow any existing blocks to be erased. Yeah? [Question, partly inaudible, about whether restoring a backup causes issues in append-only mode.] No, we never experienced any issues with it, so I don't know, sorry. [Question about mounting read-only snapshots.] Yeah, that's handled by ZFS. I skipped over it a bit because I didn't have time to present everything, but the data is stored by the Borg server onto a ZFS dataset, and we can create snapshots of this dataset, which are read-only; the dataset itself remains read-write in normal use, and it's only those snapshots that we can mount at a later date to access the backup files as they were at that date. This protects us, for example, if there is an issue in the script running in the jail on the backup server: if the script decided to erase all the backup files, we could restore a ZFS snapshot of the data from the previous day and recover it. I don't know if that answers your question, but we can talk about it a bit later if you want.

Any other questions? Yeah, so restore is also quite fast. The only thing is that it uses quite a lot of CPU to do the backups and the restores, because it has to compute hashes of all the chunks of each file to know which ones differ, and it also does compression over those chunks, so during the restore it has to do the same operations.
So it's quite CPU intensive during the backup itself, but you can mount any backup you want using Borg, which exposes it as a file system, a FUSE file system, and I didn't notice any slow access as long as you have enough CPU to handle it, basically. Time's up. [Question: does the backup server back up the servers sequentially, or in parallel?] Sorry, I didn't repeat the question: is the backup process sequential or parallel? It's sequential right now, but nothing prevents it from being parallelized. We provide one configuration file to the script, and the script goes through it in order, but we also added a mode quite recently where we can instruct the script to back up only one specific host from the configuration file. So you could imagine a setup with N cron entries, one per server that needs to be backed up, that all execute at the same time and all use the same configuration file, but each restrict the backup to one specific host. You would just receive a lot of success or error emails, but that's all. So it can be done. Time's up; I'll be outside if you have any more questions. Thank you for listening.