Alright, it's 1500 hours. Time for the first afternoon talk here in Heidelberg. It's about backups out of the box with Debian, and our speaker is Lars Wirzenius. Give him a big hand.
Thank you. Can everyone hear me at the back? Okay, I shall try to speak loudly and clearly. Thank you for coming. I will start with a couple of questions to the audience. How many of you have set up a backup system for yourselves that you like and trust? Okay, maybe half. That's expected. How many of you have done that for a family member or other loved one, a friend, or at work? Yes. About the same people, slightly fewer. How many of you have run a backup this year? Any kind of backup, even if you don't like it or trust it? Excellent. How many of you have done that this month? How many of you have done it at DebConf? How many of you are running one now? Excellent. Give the man a round of applause.
One of the things that people in a position to do technical things often have to do is provide technical support for people that they possibly owe their birth to, or are siblings of, or are married to, or whatever. The family IT support. Anyone here in that position? Yeah, most of you know the situation. It's nice when you install Debian and it provides certain things out of the box, like a mail server for sending mail, or a web browser, or cron for automating things, or, dare I say, an init system. That is part of what an operating system should provide, an operating system in the modern Linux distribution sense. It should come with a number of services that most people want. Not necessarily everyone, but those people who don't want them can tweak. But if the majority want something, say, support for printers and scanners, it's nice if you can plug in a printer and it works without having to rewrite your /etc/printcap. That was not fun.
I have been thinking about backups for a while, and I think it would be nice if Debian came set up, in default installs, with the necessary services to make backups easy. And this is what this talk is about. I will start with a bit of personal history. April 1984. My father has started his own company, he has bought a computer for his home office, and he lets me play with it. This is my introduction to computers. My father gives me a short lecture saying that there are certain things one shouldn't do with computers, like throwing them out of the window. And he gives me two floppies. His computer had two floppy drives. He told me: this is floppy number one, this is where you put all your data. This is floppy number two. Every time you have made any changes, you make a copy. So I was indoctrinated into backups from the start. So far I have not experienced catastrophic data loss for my own data. Some people in my circle of friends claim that I am obsessed with backups. This is not true. Not at all true.
I have a picture of how I did backups some years ago, when I was trying to set up a company to provide backups as a service. The entirely illegible top circle, red for those who can see colors, represents my laptop. From that I backed up to a server I had co-located, and that was my backup number one. That was then backed up to a second server in the same rack, which was a cold spare. So if the first server went down or broke, I had a working second server within minutes. The second server was then backed up to a secret server located somewhere entirely different, so that in case the front-end servers got broken into and someone modified the backups, there was a history of the backups on the secret server. Then at home I had a pair of USB drives to which I backed up; I took one of them to the office where I worked and kept the other one at home.
I had a file server at home to which I backed up the laptop, and then I had a backup server that backed up the file server. I don't think this is obsessive. These are merely ordinary precautions.
So the grand vision I have is that you install Debian, you tell Debian where you want to have your backups, and it all happens. You don't need to do anything else. Just like if you want to print something, you tell Debian that you have now installed a printer. Ideally you don't even have to do that: the system notices that, oh, there's a printer, I shall print something there. There will be a large number of pages with some weird junk on them because Debian got confused between PCL and PostScript. But I don't like printers. I would like backups to run automatically and also be verified automatically, so that if there's a problem with, say, the USB drive you're backing up to, you get notified. If you want to go and have a look, you can always check the status of your backups, but you don't have to. It will just run. If you want to restore data, that's when you need to do something interactive. But as far as I know, nobody ever restores anything.
I have written a backup program called Obnam. This talk is not about Obnam. We're Debian, we choose every alternative. There needs to be a default that we use, but whatever that is, it's not relevant here. Obnam is not the solution I'm advocating. Competing operating systems, made by American companies near Seattle or in California, without naming names: when you get them, they come installed with backup software, and sometimes they push you into using it. These are desktop solutions. What I would like is for any kind of Debian system to have backups. Who here has a server? Who here would like their server to be backed up? Yeah, that's what I thought. If you happen to run Debian on, I don't know, a watch or a phone or a router, there's no reason why that couldn't be backed up.
So this is not only about desktops. Backups have to be stored somewhere. That's the unfortunate part of backups: you have to buy more hard drives. Typically, people think about backup storage as USB drives, and they're fine. They're easy. For a non-technical person, it's very obvious: yes, I have this drive on my desk, it's called backups, backups happen. That would be really nice. We don't have that now. But there are other ways of providing storage. You could have a backup server on your LAN and set up backups so that whenever that server is available, when your laptop is at home, backups happen. Or it could be a server on the internet, what is currently known as the cloud. Backing up to the cloud, having your backups in the cloud, is sometimes funny for Finnish people, because "in the cloud", in Finnish, means being high. I would prefer to have my backups somewhere slightly more reliable than being high. But there's no reason why we can't support this. Obviously, there needs to be some kind of configuration saying, yeah, I want to use that backup server, and then it just happens. This is also a business opportunity for ISPs. They could make it very easy for Debian to provide this configuration. Some people really like tapes, and if you do that, I'm not judging you. So, that is the grand vision. Any questions so far? Expressing the grand vision is the simple part. Nobody has questions. Good, everyone agrees. I'm continuing a little bit, and then expressing more details.
I would like there to be a configuration format for this, a way to configure these backups so that regardless of what software is being used to provide backups, whether it's Obnam or tar or cat, the user experience of configuring it would always be basically the same. Ideally, there's nothing to configure, but the world is not quite that simple.
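As an illustration of the configuration idea just described, here is a minimal sketch in Python. The talk does not define any format, so every name here (`Repository`, `BackupConfig`, `switch_backend`, the field names and the example locations) is a hypothetical assumption. The only point it shows is that the user-facing settings can stay the same when the backup program underneath is swapped.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class Repository:
    """One place where backups are stored (all field names are hypothetical)."""
    name: str      # label shown to the user, e.g. "usb-drive"
    kind: str      # "usb", "lan" or "internet", the storage types from the talk
    location: str  # mount point or URL; the values used below are made up


@dataclass(frozen=True)
class BackupConfig:
    """What the user configures, independent of the backup program in use."""
    backend: str   # e.g. "obnam" or "tar"
    repositories: List[Repository] = field(default_factory=list)


def switch_backend(config: BackupConfig, backend: str) -> BackupConfig:
    """Return a copy that uses another backup program; nothing else changes."""
    return replace(config, backend=backend)


config = BackupConfig(
    backend="obnam",
    repositories=[
        Repository("usb-drive", "usb", "/media/backups"),
        Repository("home-server", "lan", "sftp://backup.example.lan/laptop"),
    ],
)

# Swapping the backend leaves the list of repositories untouched.
tar_config = switch_backend(config, "tar")
```

In this sketch the user only ever edits the repository list; which program actually moves the data is a single field that a distribution default could fill in.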
Good defaults are essential. If we require people to start writing regular expressions for all the files that they don't want to back up because they're useless, like, say, web browser cache files, which can be numerous and large but are mostly obsolete by the time the backup runs, we fail. We don't want people to have to do that themselves. We should provide defaults for anything that is usually safe to exclude from backups. Anything that is actually important should be tweakable. I don't want to say that configuration is unnecessary, but defaults are important, and it should be easy to do the things you want to do. I'd also like to support multiple backup repositories, that is, multiple locations to which you back up, so you can have both the USB drive and the server on the internet.
Everything should be automatic, so that after the initial configuration, backups just happen. I've seen backups being handled by various people in various organizations over the years. In situations where you have a paid person to do things like rotate tapes from the office to the bank and back, that works. If we expect normal people, who are not paid to take care of their backups, to do this, we fail. If we expect people to go out and buy a stack of Blu-ray discs and swap 50 discs into a drive in order to run a backup, this isn't going to work. Other operations that need to happen are things like checking that the backup storage is still valid, that the backup repository is internally consistent, and that the backup data matches live data whenever possible. These are all the kinds of things that those who are satisfied and happy with their backup solutions are obviously already doing, but making it happen for everyone is the actual challenge.
Oh yeah, restores. Restores are unhappy moments. If you need to restore something, you are under some stress, possibly a very large amount of stress, and your spouse or your boss might be screaming in both your ears at the same time.
So my preference is to make restores as simple and obvious as possible. I would like something like this: you open your file browser, the backups are just visible there, and you can just copy the data you need. Most people who can use a computer can copy a file, so this is not new stuff for them. If you need to go and enter hexadecimal numbers into a command line tool, not so much. Obviously, if this is architected properly, we can provide a number of user interfaces and user experiences, and those of us who really like hexadecimal numbers can use those.
I haven't done anything. This talk is about expressing what I would like to see, and seeing if anyone else agrees. OK, several people, many people. Good. There are a number of things that need to be done. That's not a well thought out list. That's something I wrote because I needed to fill a slide. I'm not saying that if we do this, we're done, and I'm not even saying that we have to do this or we will never be done, but these are things that might be useful to consider. I seem to have entirely forgotten to put some consulting management sales speech in there. Sorry.
Would anyone be interested in helping in any form? One person. Anyone else? Filing bugs is helping. Trying this, testing this, speaking about this, telling the people who do the technical parts what you would need and like to have, is helping. With these definitions, would anyone like to help? Anyone like to tell other people what they should do? OK, some positive response. Most of this should not be very difficult. The most difficult part is writing the actual back-end backup application that does the shuffling of data to some place or other and brings it back when needed. That's what I do with Obnam. All this other stuff is mostly about taking pieces that exist, writing some small stuff, and then integrating this into something that works.
If we can find enough people to work on this, we can do this for stretch, which would mean that it's the first Debian release where people can rely on backups being there out of the box. I happen to think that would be cool, but I admit to being slightly weird about backups. Just slightly weird, I'm not obsessing about anything. And my backups are still running. OK, that was a very short talk. I was hoping to have more questions. Anyone have any opinions? Wait for the microphone, please stand up.
Given who we are, might it not be harder to agree on something than to implement it?
No, I think it's very easy to agree, as long as I'm doing it alone. When I said that Debian has a tendency to choose all the options, I meant that. We have a tendency, when it makes sense, to make it possible for people to mix and match various components. So we have a default MTA, the mail server: Exim 4. Sorry, it took me a while to remember the software that I always replace. But I can replace it, that's the nice part. And what we have in Debian is a system as a whole where taking one of these components and replacing it with something else is usually quite easy, when the various components are sufficiently similar and compatible and we can provide the interfaces that are needed.
I have two questions. The first is about a system like Postgres. You can't just back it up; you need to take some additional steps. Are you planning to provide special handlers for specific software, or are you looking at a solution whereby Postgres will eventually install its own backup handler?
Yes. The second one of those. I haven't decided which one. I haven't architected this at all, except a little bit in my head. I didn't want to come here and present that, yeah, this is what you people should do.
I've maintained logcheck for a very long time and it was a nightmare in this respect, because we did both. So I think you're going to have to make sure you do one thing well from the start.
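To make the second option from that question concrete, where each package installs its own backup handler, here is a minimal sketch in Python. The speaker states that none of this has been designed, so the whole interface is an assumption: the hook directory layout, the `*.sh` naming, the `BACKUP_DUMP_DIR` environment variable and the function name `run_pre_backup_hooks` are all hypothetical. The sketch assumes hooks are shell scripts run with `sh` in name order before the backup starts.

```python
import os
import subprocess
import tempfile
from pathlib import Path
from typing import List


def run_pre_backup_hooks(hook_dir: Path, dump_dir: Path) -> List[str]:
    """Run each *.sh hook in name order and return the names that ran.

    Each hook is told where to leave its dump through the (hypothetical)
    BACKUP_DUMP_DIR environment variable.
    """
    env = dict(os.environ, BACKUP_DUMP_DIR=str(dump_dir))
    ran = []
    for hook in sorted(hook_dir.glob("*.sh")):
        # check=True: a failing hook raises instead of being silently ignored
        subprocess.run(["sh", str(hook)], env=env, check=True)
        ran.append(hook.name)
    return ran


# Demonstration with a stand-in hook. A real database package would ship
# a hook that calls its own dump tool instead of echo.
with tempfile.TemporaryDirectory() as tmp:
    hooks = Path(tmp, "hooks")
    dumps = Path(tmp, "dumps")
    hooks.mkdir()
    dumps.mkdir()
    Path(hooks, "50-database.sh").write_text(
        'echo "pretend dump" > "$BACKUP_DUMP_DIR/database.dump"\n'
    )
    ran_hooks = run_pre_backup_hooks(hooks, dumps)
    dump_files = sorted(p.name for p in dumps.iterdir())
```

With this shape the backup program needs no knowledge of any particular database: it runs whatever hooks are installed and then backs up the dump directory like any other set of files.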
The second question is about automated backups. I run automated backups. My mom runs automated backups, thanks to me. And that's great, because it works, except when it doesn't. Occasionally, for instance, you're sitting at home doing a backup, and you accidentally downloaded that Blu-ray thing and left it in your home directory, and now it has to back up like 24 gigabytes. Yes. That's right, exactly. That's the one I meant. And now it's uploading 24 gigabytes through your DSL uplink, and it fails to do that and takes the next 400 days until it finally finishes, because it only runs during the night or whatever. And so the PhD thesis or whatever my mom's working on, which comes right afterwards in the file system, if you know what I mean, never gets backed up until day 401, by which time the laptop has failed. Is there a way for her to, A, find out? Is there a way in your vaporware? How are you thinking about letting her know that this hasn't been backed up yet, that there's a problem? And do you have a sensible idea for her to be able to say, this is important to me and this is less important to me, so prioritize this?
So yes, there absolutely needs to be a way for the user to say: I care about this data. I don't care about this data at all, because I can always re-download the Debian installation BD-ROM image. And this data I would like to have backed up, but it's not as important as all the other data. This is the kind of thing that people need to be able to do, absolutely. How does that happen? I don't know yet. But technical details are solvable, more or less, mostly, sometimes, by someone. Before we have the next question: I would like there to be a way to notify the user, saying, oh, I have noticed this large new file that you didn't have yesterday when I ran the previous backup. What should I do? And then we can have defaults saying, it looks like an ISO file. We probably don't need to worry about that very much.
We put it at the end of the queue. And we notify the user, and the user can say, oh, I don't care about that at all. At which point it gets removed from the queue entirely. If we do this well, or well enough, then the user will rarely be bothered with these questions. I would not like the user to be bothered with, oh, you have a doc file there. It's two gigabytes. It'll take two minutes to back up to the local machine. Do you want to back this up? This would be idiotic. Bombarding the user with questions all the time would not be useful. But when there are really exceptional cases, like tens of gigabytes of data in a new file, yeah, sure. Ask the user.
Hi. So, about the Postgres backup case: this is an example where it would be useful if a package also installed some information that says, although the database files are in /var/lib, it doesn't make sense to back up this particular directory. Ignore it. It might of course make sense to not back up /var/cache. The name says it's cache data; it can be recreated. But /var/lib would be backed up, although it doesn't make sense for MySQL. It's the same, and there might be other programs like that. And then there is data in the home directory. If programs do as the XDG directory specification says, they use the .cache directory, and you can just ignore everything that's inside there. But unfortunately not all programs do this, not all packages do that. So it would be useful to be able to say, okay, in all home directories this dot-something directory is not worth backing up, because it's just cache, and there might even be a lot of data inside. And related to this, I have a question to the audience, or an appeal. I have a page in the Debian wiki about the XDG directory specification and a long-running request that the freedesktop people include a state directory, which is also data that is not worth backing up. But many programs modify their config file for data like the last window position.
And if somebody here has relations to folks at the freedesktop group, or knows how the process works, it would be nice to have a state-type directory in this directory specification.
Yeah, I agree that it would be nice to separate out things that are just data that is never useful or not important to back up. Luckily, things like window positions are quite small, but it's annoying that they're in the config information. When it comes to cache directories specifically, there is already a specification for tagging a directory as a cache directory. It uses a file called CACHEDIR.TAG, and there's a link to it on the internet that someone should find. Obnam supports this, and a few other programs support this, but it's mostly unknown.
Yeah, should we make it a policy that MySQL and Postgres should put these files in /var/lib/postgres and /var/lib/mysql?
I wouldn't mind. I have a lot of time for questions. Actually, no, Harry, who was... Yeah, did our director want to ask something here?
What about encryption? I use full disk encryption on my desktop and on my servers, and to me encryption is needed. So when I back up things, it would be stupid not to encrypt the files I back up, because I don't necessarily trust the place where I back things up. So if not a default, I think encryption should definitely be an option.
I fully agree, and I think the default should be to use encryption when backing up remotely.
Hi, I would totally disagree with Thomas about not backing up /var/lib, because sometimes I just want a quick backup of my database, and it's much faster to shut down the database and take a snapshot of /var/lib than to dump the database and then copy the same data. It would take twice as much time, because my I/O is probably the limit when taking a dump. So if I'm in a hurry, rushing to the airport, and I want to do a backup, a quick backup that probably works 90% of the time is better than nothing.
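The cache directory tag mentioned in the answer above is simple enough to sketch. The Cache Directory Tagging Specification says a directory is a cache if it contains a file named CACHEDIR.TAG that begins with a fixed signature line. The Python below is a minimal illustration of that check, not code from Obnam or any other tool; the function name `is_cache_dir` and the demonstration directories are made up for the example.

```python
import tempfile
from pathlib import Path

# Fixed signature defined by the Cache Directory Tagging Specification.
CACHEDIR_SIGNATURE = b"Signature: 8a477f597d28d172789f06886806bc55"


def is_cache_dir(directory: Path) -> bool:
    """Return True if directory holds a CACHEDIR.TAG file with a valid header."""
    tag = directory / "CACHEDIR.TAG"
    try:
        with tag.open("rb") as f:
            # Only the leading bytes matter; the rest of the file is free-form.
            return f.read(len(CACHEDIR_SIGNATURE)) == CACHEDIR_SIGNATURE
    except OSError:
        # Missing or unreadable tag file: treat as an ordinary directory.
        return False


# Demonstration with one tagged and one untagged directory.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp, "thumbnails")
    documents = Path(tmp, "documents")
    cache.mkdir()
    documents.mkdir()
    (cache / "CACHEDIR.TAG").write_bytes(CACHEDIR_SIGNATURE + b"\n# cache\n")
    tagged = is_cache_dir(cache)
    untagged = is_cache_dir(documents)
```

A backup tool walking the file system could call a check like this on each directory and skip the tagged ones by default, which is one way to get sensible exclusions without the user writing any patterns.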
So excluding it always is wrong. Your vaporware should probably have a quick backup that probably works, and a good backup, which will take some time, but maybe I don't have that time at the moment.
That's also an excellent point, and it's not just airports. Say you're at a talk at DebConf and you want to go to the next talk, and you don't want to shut your laptop, because carrying it open while running a backup is a little bit risky, as I've noticed this week. So my ideal backup solution would be such that when you're working on something on your laptop, by the time you're ready to go and you shut your laptop and suspend it, it's already backed up. Backed up as you make changes. I don't know when I will be there, but I'm working towards it. But in general, even if it's not about things like shutting down your databases, just being able to say, I'm at the airport, I have five minutes, back up all the important bits, would be really useful.
Hi. I saw on a previous slide that you want a restore to be as easy as a copy. My point is, a copy is not a backup. The main difference is that sometimes you want to restore an old copy from a month ago. So I don't think it's as easy as a copy.
I think it is. If you do it right, it's that easy.
I mean there should be some way to say I want not the most recent copy but, for example, an older one.
Absolutely. I agree with that. But that can be done by showing all the backups of your data in the file browser in parallel. So you can see this is from yesterday, that's from last week, that's from last month. And then the user can say, I want that file from yesterday but that thing from last year, because I made all sorts of stupid changes. So we agree; we just need to agree on the implementation.
Can we revisit the database problem one more time? Besides the system databases we also have personal information management tools, like in KDE.
It's Akonadi, and it helpfully stores your personal contacts in a MySQL database in your home directory. So the same problems that we have for system directories containing databases also apply to home directories, and we don't know exactly where they are. I think a more generic solution than handling those system directories is required here.
Absolutely. I think some kind of hook system would be nice, where, sorry, I missed the name of the program, but the thing that uses MySQL in a home directory can be told that a backup is about to run: do whatever you need. And then use these hooks at both the user and the system level, so that if the solution to the problem is that, yeah, you do a MySQL dump into a file and that file gets backed up, then that's what happens. It's even better if this can be done without having to dump a 14-petabyte database every time you run a backup. But there are tools for this, and we need to find a way to integrate those tools into a system that just works.
I think on one of your slides you said the technical parts are not that difficult here. I think that's true in part: if you have a local disk, we have full disk encryption, we can use something like Btrfs on top, and then we get snapshots and all these nice things. But if we integrate this with remote services, possibly different kinds of remote services, it's a whole lot more complicated, and I think there's probably a lot of work that needs to be done there to provide some sort of working interface, and one that's compatible with all the different types of backups that we need to be able to do.
If I'm understanding you correctly, you're implying that cat is not the perfect backup tool. I agree. But there is a set of programs that can be used to solve this, and it's possible that some of them only work on, say, local hard drives or tape robots, for those who actually want them.
Or some possibly only work realistically on a remote server. Maybe someone decides to write something that only ever works with their own personal PHP script on the server. It's really nice to be able to allow them to integrate that into this thing. But these are certainly things that need to be considered when the solution gets architected and implemented.
When we want to have this fully automated and not have to think about it, I believe we quickly run into the question that we have to manage different states of the nation. Am I at home on my own network? Am I traveling and have limited bandwidth, even if I'm connected to a wire? Do I have limited battery? Am I giving a presentation right now, and I can't spare anything from my bandwidth or my CPU, or I don't want the hard disk to spin up? There are of course other use cases for this: during a presentation I don't want notifications for new emails to pop up, and so on. Sorry, yes, of course. But my question is, is there previous work regarding these questions, to let the computer know that currently you can do whatever you want, or please don't do these things because you are currently being used to give a presentation, or I'm traveling? Does systemd do this already? Is there something like that, to say to the laptop, you are in that state and thus you shouldn't do those kinds of things?
I believe there is, and I don't know what the current status of that is, but there are people who do things like this: they have a script that gets run by ifupdown on network connections; it says this looks like my home network and sets a flag file somewhere, and then the cron job looks at the flag file. It gets a bit complicated, but there is some work on that. And this is again an excellent point that can be taken into consideration when designing these backup things. There is no point in running a backup if you are going to run out of battery in five minutes.
But to be sure, it's not only about the network but, for example, also about notifications.
Many of the problems that have been talked about here seem to be solved by simply having a D-Bus announcement saying, I'm doing a backup now. Programs that are concerned that you should not do a backup could then signal back and say, stop, I am running and I'm noticing that I have low power, or I'm doing something more important, so stop what you're doing and do your backup next time.
D-Bus seems like an excellent way of solving this. It can be integrated into the kernel. That was a joke.
I just started to realize that maybe there are many types of backup. For example, what most people think of as a backup is like a snapshot: now I want a copy of how the disk is now. But in the world of databases, for example, you have a sort of continuous backup. You have journaling, so every modification is sent somewhere else, and if you crash you can recover this journal and recreate the state of the database at that moment. So I was wondering if this kind of backup can be managed in the same way or in the same infrastructure.
I would hope it can. The more, how shall I say, not difficult but complicated situations like this we have, the harder it becomes. But we're Debian, we can do it. Martin was waiting.
It sounds to me that if you do this at the user level you're going to have inconsistencies at some point in time. Generally one thing you could do is LVM snapshots and so on, but
that would then require every single Debian system to have LVM, which I don't think is your goal. How have you thought about this? Have you thought about using a sort of standard? Btrfs does it too, without LVM, and there are other file systems out there that do it. Have you thought about using some sort of abstraction layer here?
Yes, I think some kind of consistency needs to be attempted. File system or block device level snapshots are not a complete solution, because even if you take an LVM snapshot, you can have an application that needs to write updates to two files, and if it has written something to the first file when you take the snapshot, the snapshot is inconsistent. I'm not saying that this is a good way to write applications, but raise your hand if you've ever noticed that programmers are a little bit stupid sometimes. I don't have a solution for complete consistency. I don't know if one exists. But we can see what we can do, and if we are consistent to, I don't know, seven to fifteen nines, that will be fine. You had a question.
My question goes in a similar direction. We already talked about how you would like to be able to plug in some functionality at will, to make it a bit modular. For Obnam you said you want to make it very generic so that it doesn't assume much. But especially for consistency, assuming some things can help, while making them required is a problem too, because maybe people don't have Btrfs, for example. With Btrfs you could do a snapshot; with a database which properly synchronizes its data and uses a journal file, you can get away with a snapshot, I think. And you can even take the old snapshot and the new snapshot and get only the changed files and the new files within a very short time, a shorter time than any find will need to find the files on the disk, for example. Do you think of ways to integrate this functionality as an option? What do you want to assume as a requirement on the machines, and what do you want to have as an option? Have you thought about
this? A little bit, but I thought it would be a really good idea to gather a group that thinks about this together, because I'm a bear of very little brain. I'm not really smart all the time. Sometimes I do really stupid things, like giving very short talks with long Q&A sessions.
I think this will have to be the final question; perhaps we can have time for one more after that.
Hello. I thought it would seem necessary that individual packages dictate files that should not be backed up: individual packages provide an exclude list, so that individual package maintainers can specify that, instead of trying to build a comprehensive database within one backup tool. Also, for individual packages, I was thinking of databases, to provide a script to be triggered in case they need to know that a backup is about to start, or even to dump out a backup. Like in the Postgres situation, if you did want to create a full dump that could be backed up more easily than the contents of /var/lib, the Postgres package could exclude /var/lib/postgres and create a flatter backup elsewhere at the request of the backup program.
Yeah, I agree. In general we need to be able to accommodate these kinds of special needs of things that do anything except have a single file that they never write to, because that would be really easy to back up.
I think we are out of time. I don't know if there are further questions. We thank you. Lars, do you want to allow one quick question?
It's not so much a question, but the scope of what you're proposing is huge, and the ambition level of it is huge, and I'm thinking it might be helpful if you prioritized some kind of scope on this, because there are so many use situations with Debian and it's used in so many ways. If you make an application that can work in every situation, absolutely, it's very ambitious. I'm not saying it can't be done, but how you'd use it in the real world might be a problem. So some kind of scope on
your ambition would be good.
I agree, and my first scope would be to back up files that are not databases and don't require any kind of special handling, because that's doable for stretch, and we can then start adding hooks and so on for stretch plus one. We have plenty of time.
Thank you, Lars, and thanks all for coming. This concludes the second talk this afternoon.