Hey everybody, welcome to my talk about bup. I think I'll be pretty short on time, so please ask questions on Twitter using these hashtags if you have any. I am Zoran Zaric. I study computer science at TU Darmstadt, and I've been involved in bup since April of 2010. I'll first give a little motivation on what we want from backups and what our goals are, then I'll go into detail on git's file structure, and then I'll show you what bup is.

Okay, backups. Yeah, sure, we want to do backups: we want to have our data, and it shouldn't be corrupted. First, we want space-efficient backups. We don't want complete, full copies for every backup we do; that would fill up our drives pretty soon. So: space efficiency. Second, we want convenient access to our backups. It doesn't help if we have ten years of backups in one tar file and have to download 200 gigs to get two files back. Third, we want to be safe against bit rot, those small changes on disks that break our files. That isn't too good; we want to be safe against that. And we want to be sure that nobody can change the history of our backups: if somebody infiltrates our system and changes anything in our backups, we want to know that they did.

Who of you knows git? Great. Who of you knows the internals of git? Okay, fine, then this shouldn't be news to you. Git is a distributed version control system; is there anybody here who needs an explanation of that? Okay, cool. It is content-addressed: in git you have these SHA-1 hashes, and everything is addressed by a SHA-1, every piece of content, every commit, everything. Every object, every commit (and I'll go into detail on this) is immutable: you save something to git, and it is never changed. And git is snapshot-based, not diff-based like Subversion, and this makes it pretty fast. Okay, let's take a look at a git repository, or better, at its objects.
First, we have blobs. Blobs are really just content: no files, just content. Any file that contains "hello world" has a file name, but the blob is only the content, "hello world". The second kind of objects are trees. Trees are kind of like file system directories: they have references to blobs and to other trees. So with trees you can build trees. Yo dawg.

Third are commits. Yeah, you probably know what a commit is: commits are snapshots of the state of a repository at some point in time. They always have a reference to a tree, and they might have a reference to a parent, unless they are the very first commit. And fourth, we have tags and branches. Tags and branches are references to commits, and they are just plain text files. By default you have a master branch, and this master branch is a file in your git directory that contains a SHA-1 in ASCII; with it you can retrieve the commit.

This is how a git repository looks, the tree of a git repository. First we have master, the branch, which has a reference to some commit. This commit has a reference to a tree and to its parent, and the tree has references, sorry, to blobs. And you see: because we are content-addressed and we have immutable objects, we can reuse objects. This might be the README of some project, and it didn't change, so both trees, both commits, use the same blob. This is pretty neat: it brings deduplication on the file level, which is nice for a backup.

The last objects are pack files. If you do a commit in your git repository, every object is written to .git/objects; these are loose objects, single files. This is nice for the start, but once you have a lot of them it gets really expensive to retrieve them, because you have to look up every inode. So we have pack files: pack files are just objects written one after another into one big file.
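To make that content addressing concrete, here is a small sketch in Python (my own illustration, not code from git or bup) of how a git-style blob id is computed and why identical content is stored only once:

```python
import hashlib

def blob_id(data: bytes) -> str:
    # Git hashes a header ("blob <size>\0") followed by the raw content.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# A toy content-addressed object store: identical content, identical id.
store = {}

def put(data: bytes) -> str:
    oid = blob_id(data)
    store[oid] = data   # writing the same content twice changes nothing
    return oid

readme = b"hello world\n"
id1 = put(readme)       # first backup references this blob
id2 = put(readme)       # second backup: same content, same id
assert id1 == id2 and len(store) == 1   # stored only once
```

Because the id is derived purely from the content, two snapshots that contain the same unchanged README end up pointing at one and the same blob: that is the file-level deduplication just mentioned.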
Pack files can be up to two gigs each, and you can do a sequential read if you retrieve more than one blob, more than one object, at a time, and this is pretty fast. Indeed, this is faster than a copy on the file system.

Okay, sounds pretty great. Why don't we do backups with git, just plain git? I guess some of you might have had that idea: hey, git is great, it does deduplication, it keeps everything small, I'll use it. And here I have this 200 gig SQL dump, and I throw it at git, and: damn it, it's slow. When writing pack files, git throws both versions of a file through xdelta, and if you know xdelta, you know that xdelta has the assumption that both versions fit into memory. This is fine for most source code, and it might be fine for JPEG files, but if you have, like I said, a 200 gig SQL dump or some VMware image or anything like that, it gets pretty slow and memory-hungry. Second, git doesn't store metadata. Okay, it stores the executable flag, and it stores symlinks, but nothing else: no permissions, no owners, no groups, no ACLs, nothing.

That brings us to bup. Bup is a piece of software that uses the git file format and is developed by Avery Pennarun, who also developed git-subtree, sshuttle, and redo. That's the URL of the project. All development is done over the mailing list: we have no bug tracker, we don't do pull requests on GitHub, patches are sent to the mailing list. Yeah, that's the heart of the project. At the moment bup isn't production-ready yet, I'd say. There's a Debian package, but you know, Debian packages are pretty old, and you want to be pretty bleeding-edge with bup.
So you'd better clone the repository and run make; no configure step required. And of course we have some dependencies.

Let's take a look at how bup is used. Currently a backup action is done in two steps. First, an index step: the index step traverses the file system, takes a look at every file, and writes it to an index, so it can tell whether a file needs to be backed up or has been backed up already. And the save step traverses this index. This has the advantage that we have a progress bar that is reliable, not like a Windows progress bar, if you know the xkcd. Yeah, the save step traverses the index, writes all the needed stuff to the repository, and creates a commit. The third command is for backing up to a server. This is done over SSH, so you can do push-style backups to a remote machine. This one is doing a backup of a server to your local machine, so you can do pull-style backups too. With the on command you can run commands on a remote server; you have to have bup installed there, of course, and you have to have SSH configured. Then you do the index step and your save step, and you have your backups locally. This is how you can take a look at your backup: this is my backup set, the name of the backup, this is a symlink to the latest backup, and this is my home directory.

Okay, let's take a look at bup's features. Funny fact: the restore command has only been there since 0.23, I think. There were ways to restore single files before, but restoring a complete backup is pretty new. Okay, the first feature is deduplication. You know, one of our motivations is to have space-efficient backups. This is the link to a mailing list thread that discusses some benchmarks I did. Somebody came to the mailing list and said: okay, I use rsnapshot, why should I use bup?
Well, yeah. So I sat down and made a benchmark. I started two Amazon EC2 machines with Debian, put some random data there and changed a bit of it, did my backups on both machines, and, well, you can read it in the thread: rsnapshot took 4.97 gigs, bup 2.18. I also did an import of my rsnapshot backups. At the moment I use rsnapshot for my production backups, and yeah, my rsnapshot backups are 12.6 gigs, and bup fits them into 4.6.

We have metadata support, kind of, almost done. In fact, we have a working patch set that has been just about to be merged into master for half a year now, I think. Yeah, if we do something, we want to do it right, we want to have it tested, and we want to be sure it works one hundred percent, because it's backups, and you don't want to have your backups messed up. So we don't only store owner and permissions, we store times as exactly as the file system we're backing up does. That gives us the timestamps, permissions, extended attributes, ACLs, and SELinux attributes.

A nice feature is the FUSE module. You can mount your backup with the FUSE module and browse it with your file manager; you can restore files from your file manager, just copy them out. There are probably some sysadmins here too, and often you have somebody who says: oh no, I deleted that file and I need it back. And you can just say: okay, here's your backup, just mount this Samba share and get it yourself. Another nice feature is the web interface. You can browse your backups with a nice web interface in your browser and also retrieve files from there. It runs on DD-WRT; in fact, I thought about leaving that slide out, but yeah, you can run it on your router, on your NAS, whatever. Pretty neat. We have an import script for rsnapshot.
So if you're using rsnapshot right now and think about using bup from tomorrow on, or from next year on, or whenever, there's a script for it, and it doesn't just save your rsnapshot tree, it also creates commits with the correct timestamps. Probably more importers will follow; I heard duplicity is pretty widely used too, so I think I'll write a duplicity importer.

We are fully compatible with git. You can use the git binaries to access the data in a bup backup; you can look at your backups with gitk, GitX, tig, you name it. Yeah, this is something that's pretty important to us too, because we might someday say: okay, bup isn't interesting to us anymore. You can always fork it, it's LGPL, but you can always access your data with plain git too. And if you remember, one of our motivations is to be safe against bit rot and failing media. For this we use par2. par2 creates parity information, so you can restore files if some bits are broken.

Yeah, okay, let's take a look at how bup does things: the algorithms and data structures. We have hash splitting, midx files, and bloom filters. Okay, what is hash splitting, and how do we do deduplication? I have to say, we don't do deduplication only on the file level like git does it; we go deeper and split files into chunks. So if you have a 200 gig SQL dump and change one line in the middle, you don't have a full copy of that file, but only the changes. This is done with hash splitting. We use a rolling checksum algorithm, the one that's used in rsync too. Who's familiar with rsync? Okay, rsync is a tool you can use to copy big files to some server, or anywhere, and it only transfers the deltas. So: you have a 200 gig SQL dump, you change one line, and you want to upload it again.
You can use rsync for exactly that. We use this algorithm. A rolling checksum is an algorithm that iterates over a file with a window and calculates a hash sum based on that window, and this hash sum depends only on the content inside the window, not on its position in the file or on some fixed offsets. And if this hash sum has 11 ones in its least significant bits, we create a new chunk. Why 11? Don't ask us; if you have a better number, come to the mailing list and help us, computer science to the rescue. So we get chunks of about eight kilobytes on average, and save big files as chunks. This is how we get deduplication.

Do you remember the pack files in git? Every pack file is just objects written out one after another, and idx files are indices for these pack files. If you want to look up an object in a pack file, you need about three to four lookups per pack file. In a big backup set you can have several pack files, and when you do a backup you want to check: do I know this blob, do I know this tree? Okay, I have to iterate over all pack files: I didn't find it, next pack file, I didn't find it, three to four lookups per pack file. If you divide, I don't know, 200 gigs by eight kilobytes, you know how many objects you get, and that is many lookups. We have midx files for this: a midx file is an index over several pack files, and with it an object is found in about two lookups. The only problem is that if you add something to your repository, you have to rewrite the complete midx file, and this takes some time too.

So we use bloom filters. Who of you knows bloom filters? Okay. Bloom filters are a probabilistic data structure: you can check whether you have seen a datum before, with a given probability, and that probability is the probability of false positives. At first this sounds weird: why would you use something that has a false positive rate?
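To sketch what that means in code (a toy illustration, not bup's actual implementation, which derives the bit positions from the objects' SHA-1s), a bloom filter is just a bit array plus a few hash functions; a lookup can answer "probably seen" with some false positive rate, but a negative answer is always reliable:

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits: int = 1 << 16, nhashes: int = 4):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)   # the bit array, all zeros

    def _positions(self, item: bytes):
        # Derive k bit positions from salted SHA-1 digests of the item.
        for i in range(self.nhashes):
            h = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(h[:4], "big") % self.nbits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)   # OR the new bits in

    def __contains__(self, item: bytes) -> bool:
        # True means "probably seen" (false positives possible);
        # False is definitive: a bloom filter has no false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
assert b"some-object-id" not in bf   # empty filter: definitely unseen
bf.add(b"some-object-id")
assert b"some-object-id" in bf       # added items are always found
```

This is also why a positive answer from the filter still gets confirmed against the real index afterwards: only the negative answers can be trusted outright.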
But they're pretty fast, you can append to them, and you can calculate how big your probability of a false positive is. And if that probability gets bigger than one percent, we just recreate the bloom filter. A bloom filter is a bit array. You know bit arrays? They are just arrays of bits: you have some binary memory and you set bits in it. You use a hash function whose result tells you which bits to set, and the bloom filter, this bit array, is updated by taking the existing bit array and adding the new ones with bitwise OR. Yeah, and when an object is found in the bloom filter, we do a midx lookup to be sure that we don't have a false positive and don't skip objects that we need.

Okay, recent things. We have, like I said, metadata support that is about to be finished. It's available, publicly available, you can use it, you can test it. If you're into Python: bup is developed in Python, join us, write tests. I recently wrote a repack patch set. Before this it wasn't possible to get rid of old backups: you added your new backups and you added your new backups, and there was no way to get rid of them. In fact, this isn't a trivial thing: you have to get all the references to the objects you need and then rewrite all the pack files. And we have an inotify-based daemon being discussed at the moment on the mailing list. inotify is a kernel facility that notifies you when files are changed, so you can say: okay, I want to monitor my home directory, and every time something changes,
I want it backed up. This might result in a lot of commits, but your hourly, or half-hourly, or whatever interval you choose, backups will be very, very fast then, because nothing has to be saved, just a new commit is done, and you can throw away the inotify branch later.

And yeah, don't ask what bup can do for you; you know it. It's Python; performance-critical parts, like calculating the rolling checksum, are implemented in C. We don't have a native Windows port, but bup runs fine in Cygwin; I have already backed up Windows systems with it. But yeah, if you're into Python and Windows, join us. Also, we don't have metadata support for OS X or Windows. OS X has, I think it's called FSEvents, which is something like inotify; you could port the inotify daemon to it. We are pretty short on end-user interfaces. Like you saw in the examples, backing things up is an index step and a save step; this will probably be done with one command later. But we don't have a GUI, we don't have nice diffs, this kind of stuff that you could do if you wanted to.

Yeah, that's pretty new, I didn't have time to look into it, but it looks pretty neat. Sorry: the question, or interruption, was that there's a KDE interface to bup. That's pretty new; I didn't have time to look too deeply into it, but the screenshots look nice. What's so funny about that? Thanks.

This is how you can contact me: spam me on Twitter or contact me on IRC. Yeah, I think we have some time for questions if there are any; very few minutes, maybe two short questions or something. I think we start with one from IRC. Like I said, if you have more questions, either grab me here, I'll be around for a moment, or use the hashtags #bup and #28c3 on Twitter. Two people are asking: does bup work with special files, so can you back up /, the root directory?
You can back up your root directory, but devices aren't saved yet, so no special files for now. Okay, there was a question over there. You can back up /dev/urandom, though; might take some time. Does it work? Yeah.

How is the repack implemented? Because I can imagine that if you remove early backups, later commits may depend on blobs that were introduced into the system very early. Yep. What I do is: I traverse the commit history, I traverse the whole tree, and keep references to the objects that I've seen. And then I go through my pack files and write every object that I've seen, that I need, to new pack files. That's not trivial: if you have a big backup and you just save your SHA-1s to a list or anything, you'll probably run out of memory. So I use bit arrays again. So the bloom filter step is repeated? No, I don't do hashing: for every pack file I have a bit array, and I set the bits of the objects that I need.

Okay, maybe one last question. You said you provide deduplication on the sub-file level; how do you generate those objects? Because git objects are big chunks of a whole file; how do you split them and still be compatible with the git format? We split big files into trees and blobs again. So if you have some dump.sql in your backup, in your git repository there'll be a dump.sql.bup, and this is a tree that points to trees and blobs. So if you access big files with plain git, you'll get those chunks, but there's a one-liner on the mailing list that packs them together again. So if everything breaks, you'll get it back.

Okay, time's over. Thank you, Zoran.
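As a footnote to that last answer, the content-defined chunk splitting can be sketched like this in Python (a deliberately simplified rolling sum, not bup's actual rollsum; the 64-byte window is an arbitrary choice here, and the mask has the 11 one-bits quoted in the talk):

```python
def split_chunks(data: bytes, window: int = 64, mask: int = 0x7FF):
    """Split data wherever a rolling sum has all mask-bits set to one."""
    chunks, start, rollsum = [], 0, 0
    for i, byte in enumerate(data):
        rollsum += byte
        if i >= window:
            rollsum -= data[i - window]   # slide the window forward
        # The boundary decision depends only on the window's content,
        # so an edit early in the file cannot shift every later boundary.
        if rollsum & mask == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])       # final partial chunk
    return chunks

data = bytes(range(256)) * 64
chunks = split_chunks(data)
assert b"".join(chunks) == data           # chunks reassemble the file exactly
```

The point of splitting on the checksum value instead of at fixed offsets is that inserting a byte near the start of a file only disturbs nearby boundaries; later windows still produce the same sums, so later chunks keep their old ids and stay deduplicated.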