Hi, I'm Lennart Poettering. You might know me from systemd stuff. I'm going to talk today about casync, which is my little side project. What's very unfortunate is that exactly at this time, or in 20 minutes or so, there's another talk about casync somewhere on the other end of the campus. I would much rather be at that one, actually. Anyway, for those of you who might have seen this talk before, for example at All Systems Go! or at DefCon last month, it's probably a good idea to go to the other one, because you'll learn something new — this one is mostly the same slides as before, just a little bit updated. Anyway, I'm talking about casync. Let's jump right in.

What is casync? The name is supposed to suggest the relationship to content-addressable file systems. By the way, if anybody has a question about anything I'm saying here, I much prefer that you interrupt me right away so we can discuss it on topic, rather than doing all of that at the end. So by all means, you're completely welcome to interrupt me. Content-addressable file systems — everybody knows those; probably all of you play around with Git every day, so you know how this works. Content-addressable file systems are these things where you have hashes that refer to objects, and you can use the hashes in place of the objects, and then build trees and things like that out of them. That's one concept casync picks up. The other one is rsync. I'm pretty sure everybody knows rsync too. What most people who use rsync every day probably don't know is the actually smart part of it: the rsync algorithm. The rsync algorithm only kicks in when rsync realizes the same file exists on both sides, on the local side and on the remote side. Then the actually smart part of rsync takes place, which is that it tries to figure out the differences and to recognize the same data blocks in those files even if they are shifted within the file by variable amounts of bytes. So yeah, rsync is an awesome technology; I think it was originally written back in the nineties. Unfortunately the ideas behind it never became standard in what people do: there are lots of projects that use the rsync algorithm, but there are also a lot of projects that should use it and don't. It's an interesting little algorithm, and we'll talk about it later on.

So casync is the combination of the two. Of course that doesn't really say much, so what is it actually? I call it a content-addressable data synchronization tool. It's a little bit like rsync and a little bit like Git, but it's also not like Git and not like rsync. Its primary use case is file system trees: it can synchronize file system trees for cases where you have many similar trees. One major use case, and the original one — the one I initially cared about the most — is delivering OS images. OS images meaning container images, VM images, IoT images, whatever you want to call them: something large that you tend to update at pretty regular intervals but that mostly stays the same, except for a couple of things you fix or add. OS, container, IoT, VM, whatever you call it. It has two other use cases, though, that I'm mostly focusing on adding functionality for right now, and which I think are pretty much the same thing: synchronization of your home directory between systems, and backup.
Backup of course is similar in some ways but different in others. It's similar because you also deal with large file system trees — for example your home directory — which, every time you want to do a backup, mostly stay the same except for the few places where they don't. A backup system that wants to be efficient should take advantage of this. If people back up their stuff with tar, that's not very efficient, because every time you tar up the whole thing: first of all it's very slow, and secondly you store a lot of redundant data. So yeah, these two use cases — image delivery, and backup slash home-directory synchronization — are different, but I think they're similar enough that we can cover them with the same program.

casync can operate on two layers; which one you pick is up to you and depends on your use case. First, it can operate on the block layer. The images it delivers to systems can then basically be what you dd off a block device: a raw ext4 image — the actual blocks of the ext4 file system — or a SquashFS image, or whatever you like. Secondly, it can operate on the file system level. In that case it looks at files and directories, like your home directory. That's the level further up; there you're independent of the underlying file system, to some degree at least, and you just look at more structured data in tree form.

So what does casync actually do? Understanding that is the core of this talk. The first step is that it serializes everything. If we operate on a block device that's easy: we just read it off the block device, it's already serialized. If we operate on the file system level it's also something everybody does all the time, which is essentially tarring things up. I don't actually use tar, for a couple of reasons. One of them is that I care about reproducibility, and tar has certain issues with that. Reproducibility basically means that for a given file system tree on your disk you get the guarantee that its serialization is exactly one thing and one thing only, and does not change depending on which day of the week you do the serialization, what the backing file system is, or anything else like that. Generally tar is not very good at that, because in a tar archive the files inside a directory appear in whatever order the file system decides to push them out, and that can depend on many factors, including hash algorithms and whatnot. So anyway, catar is what I call my serialization format — all the things I came up with here start with the two characters c and a. The catar thing is essentially just tar, except that it's reproducible and well-defined, and secondly it's random-access, so that if you want to access some file at the end of the serialization you don't have to read all of the beginning like you do with tar. But for the sake of discussion let's just assume it's the same thing as tar. On the block layer it's easy, we just read block after block; on the file system layer it's almost as easy, we just tar it up. That was the first step.

Now we have this long serialization — it starts somewhere, it ends somewhere, and it's just a series of bytes. Next we split it up, the way the rsync algorithm does: we take the serialization and chop it into a series of chunks. These chunks do not all have the same size; the size is a function of what's in the data.
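To make the reproducibility point concrete before we get to chunking: here is a minimal sketch (in Python, and not casync's actual format) of what a deterministic serialization means — directory entries are visited in a defined, sorted order, so the same tree always yields the same byte stream, no matter what order the file system would return the entries in. The real catar format additionally encodes selectable metadata, headers and footers.

```python
import os
import stat

def serialize(path: str) -> bytes:
    """Toy deterministic serialization of a directory tree.

    Entries are visited in sorted order, so the output depends only on the
    tree's contents, never on the order the backing file system happens to
    list them in. (Illustration only; catar itself is far richer.)
    """
    out = bytearray()
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        st = os.lstat(full)
        out += f"ENTRY {name} {stat.S_IMODE(st.st_mode):o}\n".encode()
        if stat.S_ISDIR(st.st_mode):
            out += serialize(full)      # composable: a subtree serializes into its parent
        elif stat.S_ISREG(st.st_mode):
            with open(full, "rb") as f:
                out += f.read()
    return bytes(out)
```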
What actually happens there — this is ultimately the rsync algorithm — is that a hash function is calculated for every window of 48 bytes of the stream, and whenever the resulting hash value satisfies some mathematical condition, hash modulo q equals some fixed value, we make a cut. What's the effect of this? The effect is that the same data results in cuts between chunks at the very same places. Why is that interesting? Because if we cut at equidistant intervals instead — say we always cut after 64K — and you insert one byte at the front, then all the chunks in the rest of the serialization would change too, because they all got shifted a little bit to the right. And that would happen all the time, because most of the time we're talking about a tar-like serialization here: if you add one file at the front, you added a couple of bytes at the front, and all the chunks after it would change. The rsync-style chunking has the effect that the same data results in the same chunking locations. That basically means that if you insert a byte somewhere or remove a byte somewhere, the change does not explode into the rest of the serialization; it changes the one chunk around it, but after that chunk you get back to the same chunking you had before. That's the interesting bit about it. It's what deduplicating file systems — at least the good ones — generally do, it's what rsync does, and it's what Dropbox does to some degree.

So, summary again: the first step is to serialize everything, the second step is to slice it up into little chunks. Why do we do this? So that adding an extra byte in front doesn't ripple through the rest of the stream. The algorithm we use for this is buzhash, which is a cyclic hash function; that it's buzhash specifically is ultimately just an implementation detail. The good thing about it is that it's relatively cheap to calculate: think about it — if you naively recalculated a hash over a full 48-byte window for every position as you shift it along the stream, you would calculate a lot of stuff. The fact that we use a rolling hash like buzhash makes the common case much, much cheaper. So, as I mentioned: you take the hash calculated over each 48-byte window, take it modulo some value q, and check whether it equals q minus 1. If it does, you place a cut; if it doesn't, you calculate the same thing for the next byte, check again, and so on and so on. By picking q the right way you can select the average chunk size. The idea is basically that on average we want 64K chunks; they can be a little smaller or a little larger, that's fine, as long as the average is 64K. That was the second step.

The third step: after we have chunked everything up, we calculate a strong hash function over the chunks. The buzhash — that rotating hash function — we forget about at this point; it's not a strong hash function, we only used it for chunking things up.
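Here is a minimal sketch of that kind of content-defined chunking. It uses a simple polynomial rolling hash over a 48-byte window instead of the buzhash casync actually uses, and the minimum/maximum bounds and constants are illustrative values I picked, but the structure is the same: a cut is placed wherever the hash of the last 48 bytes satisfies the modulo condition, so identical data produces identical cut points regardless of what was inserted or removed earlier in the stream.

```python
WINDOW = 48            # rolling-hash window width, as described above
AVG_SIZE = 64 * 1024   # q: target average chunk size
MIN_SIZE = 16 * 1024   # illustrative lower bound on chunk size
MAX_SIZE = 256 * 1024  # illustrative upper bound on chunk size
BASE = 257
MOD = (1 << 61) - 1    # large prime modulus for the toy rolling hash

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks in `data`.

    A cut is made after a byte whenever the rolling hash of the preceding
    WINDOW bytes is congruent to AVG_SIZE - 1 modulo AVG_SIZE, which places
    cuts every AVG_SIZE bytes on average but always at the same data, so an
    insertion only disturbs the chunk it lands in.
    """
    top = pow(BASE, WINDOW, MOD)   # weight of the byte that leaves the window
    start = 0
    h = 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i - start >= WINDOW:                       # drop the byte sliding out of the window
            h = (h - data[i - WINDOW] * top) % MOD
        size = i + 1 - start
        if size >= MAX_SIZE or (size >= MIN_SIZE and h % AVG_SIZE == AVG_SIZE - 1):
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)
```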
For the third step we use a strong hash function; the one we're actually using is SHA-512/256. It's a not-so-well-known member of the SHA-2 family: it's basically SHA-512 truncated to 256 bits, and the reason that's a good thing is that it's a lot faster to calculate than SHA-256 on 64-bit processors, which is what we mostly have. It's not literally SHA-512 cut to 256 bits, though, because it starts from different initialization values, so it produces different results. Anyway, long story short, it's a strong hash function — where Git uses SHA-1, we use something a little more modern. When we hash these chunks, we can then use the hashes as identifiers for the chunks. At the same time as we do all this, we write out what I call an index file, which is very simple: ultimately it's just a list of these hashes. Actually it's not just a list of hashes, because I want random access, and random access into the serialization stream is a little difficult if every chunk has a different size — if I want to go to byte five million of the serialization, I'd otherwise have to scan from the start, which would be O(n), and that would suck. So it's really a list of offsets together with the hashes. These index files are sufficient to define one version of the tree very explicitly, because the hashes refer to chunks, and if you concatenate the chunks in the order the index file says, you're back at the original serialization.

Fourth step: after we've written out the index file and have these chunks, we compress the chunks individually, using some standard compression — zstd, that fancy Facebook compression algorithm everybody likes these days. Then every single one of those chunks — which, as mentioned, are around 64K uncompressed — is placed in one big directory, which is what we call the chunk store, where each file is named after its hash. So now we basically have one big directory with lots of little files, individually compressed, whose names are all hashes, and you just have to pick out the right ones in the right order and you're back at the original serialization. And that's really all casync ultimately does. To recapitulate: we serialize first, then we chunk, then we hash everything and create the index at the same time, then we compress the chunks and store them somewhere. If we want to extract one of these things we do the opposite: we acquire the index file, go through it item by item, look each chunk up in the chunk store, uncompress it, concatenate the whole thing so that we have the serialization, and deserialize it to disk. And that's really all there is.

Yeah, this is basically very similar to Git. Well, Git doesn't do anything with serialization, and Git also focuses very much on files. One of the major differences here is that I don't care about file boundaries; I get rid of them very early on. Git, and rsync too — they all care about the file boundary. I think that's a weakness. I mean, there are strengths in some ways and weaknesses in others, but what rsync and Git are never capable of is tracking content that moves between files.
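And a matching sketch of steps three and four, under the same toy assumptions: each chunk is hashed with a strong hash (SHA-512/256 where the local OpenSSL exposes it, plain SHA-256 as a fallback here), compressed, and dropped into a flat store directory named after its hash, while the index records the end offsets and hashes. casync itself uses zstd and a binary index format; zlib and a plain-text index stand in for them in this illustration.

```python
import hashlib
import os
import zlib

def strong_hash(data: bytes) -> str:
    """SHA-512/256, as described above; fall back to SHA-256 if the local
    OpenSSL build doesn't expose sha512_256."""
    try:
        h = hashlib.new("sha512_256")
    except ValueError:
        h = hashlib.sha256()
    h.update(data)
    return h.hexdigest()

def store_chunks(data: bytes, store_dir: str, index_path: str) -> None:
    """Write every chunk into a flat chunk-store directory, named after its
    hash, and write an index listing one "end-offset hash" pair per chunk."""
    os.makedirs(store_dir, exist_ok=True)
    with open(index_path, "w") as index:
        for start, end in chunk_boundaries(data):   # chunker from the previous sketch
            chunk = data[start:end]
            digest = strong_hash(chunk)
            path = os.path.join(store_dir, digest)
            if not os.path.exists(path):            # identical chunks are stored only once
                with open(path, "wb") as f:
                    f.write(zlib.compress(chunk))   # casync uses zstd here instead
            index.write(f"{end} {digest}\n")        # offsets are what make random access cheap
```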
With rsync specifically that's a problem, because rsync is not capable of recognizing when you rename a file or something like that: the rsync algorithm is only applied to individual files, and only if they exist under the same name on both sides. Then it's efficient; otherwise it's not. This scheme, however, has the benefit that I serialize first, then forget everything about what I just serialized — file boundaries and all — and chunk it up at that point. That has the nice effect that small files get lumped together with other small files until we reach the average chunk size, and big files get split up into small pieces around the average chunk size. It also has the nice effect that the same files moving between directories are recognized perfectly well, because I only care about the data contents, not about file boundaries or anything like that. So yeah, that's the key difference to rsync and Git. There are many systems that use some of these bits, but to my knowledge there's no system that uses this combination to build something like this. The key difference: forget the file boundaries. Limiting yourself by always keeping the file boundary in mind makes a lot of things very difficult, like tracking changes across files.

If we don't care about file boundaries, how do hard links work? That's a very good question — by the way, I figure I should repeat the questions. So the question was: if we don't care about file boundaries, how do hard links work? I don't care about hard links — that's my answer. It's not quite that easy; I do care about hard links. I thought for a long time about whether I should actually serialize the fact that they exist. Ultimately, they generally don't exist: if you tar up a Debian image or a Fedora image or something, you won't find many hard links in there. I do use them heavily, though. There's a mode, which I'll come to later, where you can use an existing tree that you already have on the system, extract a new version next to it, and casync will, if you want, hard-link the old version to the new version, so you can have two versions of your image and everything that's identical is hard-linked together, if you follow what I mean. But let's talk about that a little later.

Any other questions at this point? Yeah — so the question was about hash collisions. Well, sure, in theory it happens, but we don't have enough atoms in this universe to build a machine where that can actually happen. It did happen for SHA-1; this is a much stronger one, SHA-512/256. And if we live in a world where collisions actually happen, then we have a problem anyway — Git does, and everything else too. Before I have to think about that, a lot of other people will have to think about it and fix it for themselves, and then I can just copy the solution. Anything else at this point?

How does it handle repetitive data? So, how does it handle repetitive data within files themselves? Well, the serialization reads all the files, and if you have repetitive data within the same file, that will always result in the same chunk.
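A tiny usage example of the two sketches above, just to illustrate the file-boundary point: two streams that carry the same payload at different positions still end up sharing most of their chunk hashes, because the cut points are derived from the data itself, not from where files start and end. The function names are the hypothetical helpers defined earlier, not casync APIs.

```python
import os

# The same 4 MiB payload twice; the second stream has extra bytes in front,
# as if a file had been added at the start of the serialization.
payload = os.urandom(4 * 1024 * 1024)
plain = payload
shifted = os.urandom(8000) + payload

hashes_plain = {strong_hash(plain[s:e]) for s, e in chunk_boundaries(plain)}
hashes_shifted = {strong_hash(shifted[s:e]) for s, e in chunk_boundaries(shifted)}

shared = hashes_plain & hashes_shifted
print(f"{len(shared)} of {len(hashes_shifted)} chunks of the shifted stream were seen before")
```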
So on the server, the number of chunks that end up there — if you add them up by size — will of course be much smaller. It will automatically recognize similarities, or identical data, within files, and it will identify them across files too, because it doesn't really care about files. That's a good thing. So yeah, it's very efficient; it throws out all the duplicated data. Of course, always within the limits of the average chunk size: if you have identical data that is shorter than the average or minimal chunk size, we'll never be able to find it. It's not built for that.

What is the name of the compression algorithm? The name of the compression algorithm is zstd. It's a Facebook thing; it compresses relatively well and is very fast. But I designed it to be swappable — you can use xz, and you can use other things too. You can replace the hash function as well, and you might even want to: if you care about 32-bit processors, SHA-256 is faster on those, so if that's what you care about, you should probably swap it out. But I think most computers are 64-bit these days, so that's what we default to.

OK, let's continue. So, on average the chunks are evenly sized, and we can recognize similar blocks in different files — file renames, files moving around — and we can recognize the same content within files, too. And why do we want all of this? Similar file system trees will result in mostly the same chunk files, and hence you get efficient storage of many related trees, all without keeping any kind of history. What's also really nice about this is that everything is implicitly validated. So there's this dm-verity thing that some of you might have heard of — oh, that reminds me that my talk has apparently begun, nice. dm-verity is a nice bit of functionality that basically allows the system to validate every read access to the hard drive and to guarantee cryptographically that what is being read is actually the version the vendor put together. This stuff implicitly gives you similar behavior, because we check everything against the index, and the index contains cryptographically strong hash values of all the chunks. So we get the same guarantee: you won't be able to play games with us and feed us wrong data without us noticing.

It's also relatively CDN-friendly — content delivery network friendly — because the chunks are roughly always the same size, and you can pick, when you use casync, what the average chunk size shall be. With CDNs you generally pay for the number of objects requested or something like that, so you can actually tune how much you want to pay the CDN: use larger chunks and fewer objects get requested by clients when they download something, but you'll find fewer shared patterns and won't be able to deduplicate as much. So it's relatively CDN-friendly. Other systems — OSTree, for example — traditionally started out, at least, as something where every individual file would be put on the HTTP server as its own little object.
So if you look at /etc or something, where you have these tons of very small files — hosts and whatever they're called — they would all become little objects, and you would have to pay the CDN for millions and millions of GET requests, because every client requests them. Their way out of that problem was adding binary deltas. Binary deltas are actually very much opposed to what I'm doing here, because binary deltas between different versions always imply history: you need somebody who sits down and figures out which image versions are worth updating between, so that the diff can be pre-calculated and put on the server. That's management work. With this stuff, nothing of that sort is required: every image stands on its own, and if data blocks are reused, that's detected automatically. I don't care about history, I don't care about anyone managing anything, I just care about the chunks, and they deduplicate themselves all by themselves. Any questions otherwise?

So why all this? Well, when acquiring a new image we can take advantage of the fact that you usually already have one version — if you do an operating system upgrade, that's the definition of an upgrade. So we can go over the file system we already have, run the same algorithm I just explained, and get a list of hashes for chunks we can read locally from the version we already have. When we do an update, we just copy those from the file system we already have into the new version we're about to create, and only the chunks we don't have yet do we actually acquire from the internet. So basically, everything we already have becomes a pool of reusable chunks. The interesting thing about this is that it even allows updating, relatively efficiently, between theoretically foreign images. For example, you could tell casync to use your Debian image as the seed for installing a new version of your Fedora image, and casync would automatically recognize the similarities — and there are some. It's not going to be as efficient as when they actually share common history, but it will recognize similarities like time zone data and locale data, which tend to change relatively seldom. Which is kind of nice: you don't need any actual historical relationship; if there is one, that translates to better efficiency, but you can throw any kind of tree at it and casync will recognize the similarities, and if there aren't any, it doesn't hurt — it just makes things a little slower initially because we have to index everything, but other than that it doesn't cost you anything.

So yeah, there's automatic, robust reuse of what was downloaded before, and everything is cryptographically verified. Even if the old version was modified locally, because somebody hacked it or whatever else, we wouldn't use it: we read the stuff from disk, hash it again, check it against what we expect it to be, and if it doesn't match we just don't use it and take the data from the internet instead.

You said at the beginning that you can choose the chunk size on average, right? So I guess if you choose the smallest size, the chances of reusability are much higher. What's the trade-off, and how do you choose that size? That's a very good question.
It really depends on your use case — by the way, the question was about chunk sizes, how you choose the right chunk size for what you want to do, and what the trade-offs are. I can't really give you the perfect answer, because it really depends on what you're doing, and for many use cases I don't have any answer at all. As mentioned, for the backup case things will be very different than for the image delivery case, and for the image delivery case it really depends on what you actually ship. For example, some people ship SquashFS — I actually have a slide about this later. SquashFS is in theory very much contradictory to this concept, because SquashFS removes redundancy anyway, so casync won't recognize any duplicated data within it, and in theory, if you have fully compressed data, every change at the beginning explodes into the rest of the image anyway. Thankfully SquashFS isn't like that, because it still needs to be a random-access file system: what it actually does is compress a bit, stop the compression, restart the compression for the next piece, and add an index at the end. Because of that, if you align things properly — the block size SquashFS uses and the chunk size casync uses — you can actually deliver even SquashFS relatively efficiently. But if you're asking me what the right settings are, I can't really tell you; it depends on your use case, and people have to crunch the numbers first to figure out what's right for them. This is actually something the other talk at the other end of the venue is doing: he put a lot of container images into casync and tried to figure out when this starts making a lot of sense and when it doesn't do so much, and he gave me a quick overview of the results — but if you want the answer to that question, you have to go to the other talk.

On this note, about this question: can we have indexes for different chunk sizes in one place? So the question was whether we can have, at the same place, index files for different chunk sizes. Sure, you can do anything — casync won't stop you — but of course if you operate with different chunk sizes, casync will chunk at different places, so it's very unlikely that you can reuse chunks between them, because we chunk differently. You can store them together, but honestly, before you deploy casync you should do your homework and figure out what the right chunk size for your data is. It's not completely set in stone — you can change it later, it won't hurt you technically. The only thing that hurts is that if you fiddle around with the chunk size all the time over the course of your project, each time you do it the level of chunk reuse usually drops to zero. It won't create technical problems; it's just that if you fiddle too much, the bandwidth savings this is supposed to provide won't be delivered.

So in principle, could you update, say, live IoT devices with a new image, and what might the gotchas be? So the question was about updating live IoT devices with this, and what the gotchas are. This is definitely one of the use cases — you saw IoT on the slide — and well, the double-buffering thing, the A/B
partition scheme, should work pretty well with this. It really depends on what your constraints are: do you care about runtime for this or not? If you take advantage of the seeding stuff — seeding is what I call it when you look at the operating system version you already have, chop it up and use it as a pool of chunks — that of course takes time, indexing all of that. You could cache the results of the indexing, but if you have file system trees that change, we can't maintain that cache on the block layer at least; there's no sufficient API in Linux for detecting changes there. On the file system layer it's not too much of a problem. But if you want a one-stop solution that's already well tested, where people know exactly which parameters to put in, this isn't that yet. This stuff is still relatively new. I know a lot of people have been doing this — as I learned recently, there's even a reimplementation of the casync client side, at least, in Go that nobody told me about until yesterday — so apparently there's some adoption, but ultimately most of the code is less than a year old. So my answer is: this is absolutely the use case for it. What I want to build is self-updating stuff where people are a little bit smarter than just dd'ing things around or tarring things around, but it's not a ready-made solution; it's not a product, it's a building block that you have to fit to whatever you want to build.

There was a talk yesterday on exactly that. Oh, there was another one about this? Good that I now know who did that — awesome, they're great people. I should probably watch the video of that one and the other one. Three casync talks at the same conference is awesome. Anyway, if nobody has questions, let's continue with the slides.

This is something I really want to stress: there's no revision control. I already mentioned this — this is the problem OSTree tried to solve by having binary deltas between the key versions people want to upgrade between, and with casync that's not necessary at all. No revision control. Revision control is a useful thing for developers, I think, but for deployment I think it's a weakness. For example, I think it's a weakness of the Docker model, because Docker has these layers, which are two things at once: on one side they are revision control for developers, so that they start from the Debian base and then make their changes on top, but on the other hand they are also the way Docker tries to reduce downloads — because everybody already has some version, people don't need to download it again. I don't think you should ever intermix those two. And since this is primarily about delivery, and not supposed to be another Git — not supposed to be revision control — the emphasis really is: there is no history, and you don't have to manage anything. Everything is a standalone, individual thing, and if two people happen to have the same data somewhere and share no history at all, the same chunks will be recognized anyway. So forget revision control; we find the similarities automatically. Revision control is for developers, not for deployment.

Then, as already mentioned, I'm not using tar, I'm using catar, this little format I came up with. It's strictly reproducible: there's only one valid catar serialization
for any given directory tree. The format is defined so that it could actually be a tar 2.0, with all the warts of tar removed — but then again, I'm not pushing for that; that's not the problem we're trying to solve. And it's random-access. The random access is awesome because it basically allows casync to mount these index files remotely into the local file system via FUSE, download the chunks in the background as the client accesses them, and — because I have random access — give you a proper file system that ends up on your local system incrementally as you use it. The lack of random access is very much a shortcoming of tar.

There are a couple of other things. I care a lot about metadata control. Metadata control means that depending on your use case you need different metadata in your archives. For example, if you build a container image, you generally don't care about mtimes — modification times — because mtimes contradict reproducibility: if you put your image together today with GCC and RPM or whatever and generate the exact same bytes on disk, they will still have different mtimes than if you do it tomorrow, because the modification times will then be tomorrow's. So for container or IoT images you generally don't want mtimes, because they mean changes you're not interested in. If, however, you do a backup of your home directory, you very much care about mtimes — they're actually useful for figuring out on which date you worked on which document. So in catar I care a lot about letting you explicitly pick the metadata that is included in the serialization, and that's also kind of a requirement for reproducibility: only when you're in control of whether to store mtimes, whether to store access modes, whether to store user identities — chown-style ownership and things like that — and whether to store ACLs or extended attributes, can you actually make sense of reproducibility. The metadata we can store is very comprehensive; I don't know of any tool that goes into that much detail, because we store all the weird stuff we have nowadays in Linux, like the chattr file attributes, quota project IDs, and all the exotic weirdness that exists these days.

Yeah — a question at this point? If you're mounting your serialized archive, is that read-only, or...?
That's read-only. All of this is about reproducibility, about immutability, so that every access is validated all the time through this index thing, and that basically means everything is read-only. If you're looking for a general-purpose file system, this is not it. This is an archive format, an image delivery format, and a cryptographically secure one.

When you actually play around with casync you'll see a couple of different file types. The primary one is .caidx — that's the index file, which as mentioned is just a list of hashes with the offsets, i.e. the lengths, of the individual chunks. Then there's .caibx, which is exactly the same thing internally; the difference is only semantic — .caibx is what you get when you operate on the block layer, .caidx when you operate on the file system layer. .catar, already mentioned, is pretty much the same thing as tar except that it has reproducibility, comprehensive file attributes, random access and these kinds of things. And then there's .castr, which is the chunk store directory — it's not a file, it's a directory, and if you go in there you'll see a lot of little files, all named after hashes. If you look into them you'll see that they're all zstd-compressed, and if you decompress one and use OpenSSL to compute the hash, you'll find that it's exactly the file name it's stored under. So if you come into contact with this, these are the four things you will see — there are a couple more, actually, but these are the ones that actually matter.
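That "decompress and check with OpenSSL" remark amounts to something like the following sketch, again using the zlib-and-SHA stand-ins from the earlier examples rather than casync's real on-disk layout: walk the store directory, decompress each chunk file, hash it, and compare the result to the file name it is stored under.

```python
import os
import zlib

def verify_store(store_dir: str) -> bool:
    """Check that every chunk file in a (toy) store decompresses to data whose
    strong hash matches the file name it is stored under."""
    ok = True
    for name in os.listdir(store_dir):
        with open(os.path.join(store_dir, name), "rb") as f:
            data = zlib.decompress(f.read())   # real casync chunks are zstd-compressed
        if strong_hash(data) != name:          # strong_hash() from the earlier sketch
            print(f"corrupted chunk: {name}")
            ok = False
    return ok
```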
So does that mean that when you use casync the disk usage on your system grows, because you're creating the .castr and have to generate all those hashes? So the question is whether my local file system grows if I use this. casync can store stuff locally if you want, and of course then you pay for the local store. But it can also do the same thing remotely — right now only via SSH to some other side — and in that case we only keep a little bit of temporary data, nothing substantial; it probably takes up, I don't know, 10 megabytes at most or something. The idea really is that when you create an archive like this, we send the other side the list of chunks we would like to store there, the other side tells us "I already have these chunks, but those ones I still need", and then we send it only the ones it still needs. So it's relatively efficient there. Then again, so far this was optimized for making delivery cheap, not for making creation cheap — for the extraction step, not the archiving step. That is changing now as I look more into the backup case, because for backup the archiving step suddenly becomes the big problem: it needs to be fast, and things like that.

OK, so the next question was: if we generate an index file, then changes are made to the directory, and we generate another index file, do we take advantage of the fact that we already indexed it once? The answer to that is yes — since yesterday. It's a big thing: if we care about the backup case, this is what we need to optimize for. I want to go for high-frequency backup; what I'm really hoping to deliver eventually is that you can back up your home directory every five minutes or so without paying massive amounts of time for it. Linux makes this really, really hard, because we can't know which files have changed; there is no generally accepted API for that. btrfs has something like it, but it doesn't really work — and who uses btrfs anyway — and the other file systems don't have it at all. So what we actually do is what Git does and what everybody else does as well: we stat the whole tree and see what has actually changed. We try to be a little smarter than most, though: we use file system generation counters, a little-known feature of most Linux file systems. Nobody knows how precisely they are actually defined, but the essence is that supposedly every change you make to a file increases them in some way, so if they haven't changed you can use that to know the file hasn't changed. Since yesterday — when my colleague merged my patch that takes advantage of that — we have this caching in place: for the first iteration we serialize the whole thing as I explained, but then we store the information that the last time we looked at this file — with this mtime, this inode, this generation counter — it hashed to this value, and, by the way, the chunk this hash refers to also covers these and these other files with these inode numbers, mtimes and so on. Next time we verify whether that's still the case, and if it is, we just use the stored hash: we don't have to read the data off disk, we don't have to hash it again, we don't have to compress it again. So yes, we do take advantage of that now, since the patch was merged yesterday.

Bup? Yes — so the question was whether I've looked at Borg backup and bup. Yes; there are a couple of systems like this, and they all end up doing the same thing, because we have nothing in this area. Ideally, if Linux were a really good operating system, we'd have better APIs for this — for example a recursive mtime. People already wanted that 15 years ago and it's still not there; people want it for search engines, people want it for backup solutions. If it existed I would love to use it, but because we don't have recursive mtimes we have to descend into the entire tree and stat everything, which sucks. Thankfully Linux has been optimized for exactly that, because that's what everybody does — that's why Git is actually pretty fast; if it wasn't, people wouldn't be such fans of Git.

So the question was how the index files and the chunk stores are linked up. They aren't. The idea really is that the data can come from anywhere: you tell casync, when you invoke it, which stores it shall use, and it can use any number of stores. The idea is even that later on it can use local seeds — meaning local versions you already have — plus stores, and ultimately I want to get to the point where we have a multicast protocol that can ask for these chunks on the local broadcast domain of your network as well. The idea is basically that if you have a big cloud installation where every node runs the same operating system, then instead of every single
node downloading the new version of the operating system from a central server, they constantly ask over multicast whether the other nodes already have the chunks. And they would even be perfectly safe doing that, because it's all cryptographically secure: if somebody gives you bullshit data, you just calculate the hash, figure out it's not right, and throw it away. The idea really is that this model is not only useful for cloud setups, it's also useful for IoT devices, if you have lots of them. So yes, on purpose there is no implicit connection between the chunk store and the index, because I want to enable people to get the chunks from wherever they like.

Building on that question and the multicast idea: have you considered, as the next step, DHTs and then eventually essentially a BitTorrent clone? So the question was what the next step after the multicast stuff is, whether we do something like BitTorrent or DHTs. Well, that's not my focus right now — image delivery and backup are — and for backup I'm not even sure how you would use torrent-like stuff there. Sure, lots of things are possible, but this isn't finished yet, not even the stuff I'm doing, and while I have ideas about lots of things, not even the multicast thing is in any way more than a start.

Could you use inotify to directly find the files that changed? So the question was whether we can use inotify to get notifications about changes. inotify is an awful idea — I know, it's another one of those Linux file system APIs that suck. First of all it's asynchronous, and it throws events away if the event queue gets crowded, and if it does that, you're supposed to rescan everything anyway. Then it doesn't cover offline changes: if you take the hard disk out, put it in some other device and make a change there, inotify of course will never tell you about it. And the biggest problem is that it's not recursive. This is supposed to work on home directories that are gigabytes or terabytes in size, with deep directory trees, and inotify does not work recursively — people try, and then they run out of inotify watches. It's just not designed for that; it doesn't work. And I also really don't want an online component, because an online component sucks for embedded stuff and things like that. I want a component that looks at the state as it is now, makes the best of it, and goes away; eventually you call it again and it does its best again. It shouldn't have to stay around and watch whatever you do, because that's evil anyway.

More questions? Where's my slide about this — I had a slide about this. Ah, there it is, of course. So there's this slide here: this is what you can type. You have this .caidx, and if you don't specify a store explicitly, casync is smart enough to look for the store next to where the index is, so it will automatically derive the URL http://example.com/default.castr and look for the chunks there. So you have this one thing and it just works: you do this, you get a directory somewhere, you can go into it, and it will download everything. It's actually kind of nice because it does this progressively while you access it — there's prefetching in there, but whatever you access gets pulled with higher priority, and things like that.
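Pulling the consumer side together — index, local seeds or stores, remote or multicast sources, and verification of every chunk — here is a rough sketch under the same toy assumptions as the earlier examples; the fetch_remote callback and the helper names are made up for the illustration and are not casync's API.

```python
import os
import zlib

def assemble(index_path: str, local_stores: list, fetch_remote) -> bytes:
    """Reassemble the serialized stream described by a (toy) index file.

    `local_stores` are directories of already downloaded or seeded chunks;
    `fetch_remote(digest)` is any callable returning compressed chunk data,
    whether from HTTP, SSH or a multicast peer. Every chunk is verified
    against its digest, so an untrusted source cannot slip in wrong data.
    """
    out = bytearray()
    with open(index_path) as index:
        for line in index:
            end, digest = line.split()
            chunk = None
            for store in local_stores:                 # prefer cheap local copies
                path = os.path.join(store, digest)
                if os.path.exists(path):
                    with open(path, "rb") as f:
                        chunk = zlib.decompress(f.read())
                    break
            if chunk is None:
                chunk = zlib.decompress(fetch_remote(digest))
            if strong_hash(chunk) != digest:           # reject tampered or stale data
                raise ValueError(f"chunk {digest} failed verification")
            out += chunk
            assert len(out) == int(end)                # end offsets double-check the stream
    return bytes(out)
```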
Well, the file names are just the hashes, so there's nothing else there — the question was where the directory tree information comes from when you do something like this. Basically, there are multiple layers here. There's the index layer, and I have a random-access interface to it: I can say "give me byte one million of my stream", and there's a relatively efficient O(log n) algorithm to figure out which chunk that falls into; then I can download that chunk from wherever — or maybe I already have it, I don't know. And then, within that, I have random access on the upper layer too, the catar layer. The way that format works is that it's composable, meaning the serialization of a directory tree is strictly the serialization of all the stuff within it, concatenated, plus some header and footer. It's strictly composable. The composability is a nice property when we want to recognize reused data, and it basically means there are never pointers from the outer structure into the inner one, or from the inner one out, if you see what I mean. The random access is achieved by having, at the end of every directory, a little table that we use to find the right file. So ultimately it's a little bit like a file system, but a weird one, because I don't only want random access while reading, I also want efficient serialized access while writing and reading. So it's a hybrid: something that is random-access but also streamable, because the streamability of tar is actually kind of nice and I thought that would be nice to keep. So it's something like a file system: you have the upper layer on top of the lower layer, and the upper layer figures out where to look on the lower level and translates that into the actual chunks that have to be requested over HTTP. I hope that answers your question.

Does that random access mean you can just pull any part of an image? So the question was whether random access means you can download any part of an image — whether you can extract random parts of the archive, random subtrees. Yes, you can in principle. I don't think I've implemented it yet, though — not because it isn't possible, just because I was too lazy.
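The "give me byte one million" lookup mentioned a moment ago is, in this picture, just a binary search over the end offsets stored in the index; a small sketch, assuming the index has already been parsed into a sorted list of (end offset, digest) pairs as in the earlier examples:

```python
from bisect import bisect_left

def chunk_for_offset(index_entries, offset):
    """Given index entries as a sorted list of (end_offset, digest) pairs and a
    byte offset into the serialized stream, return the digest of the chunk
    containing that byte and the chunk's start offset, in O(log n)."""
    ends = [end for end, _ in index_entries]
    i = bisect_left(ends, offset + 1)   # first chunk whose end offset is past the byte
    if i == len(index_entries):
        raise IndexError("offset lies past the end of the stream")
    start = index_entries[i - 1][0] if i > 0 else 0
    return index_entries[i][1], start
```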
There's also, by the way, this thing here, which is the same as the mounting but on the block layer: as you can see, it takes a .caibx instead of a .caidx, and if you do this you basically get a block device like any other. You can mount it, you can look at it with whatever tool you like, and as you access it, it downloads stuff in the background and makes it available locally. Kind of cool, actually. We already mentioned that hard links are pretty nice — you can use them across your multiple trees — and we also do reflinks. How many minutes do I have?

So, it does reflinks. Reflinks are a newer file system concept — btrfs has had them for ages and now XFS is getting them too. Reflinks are basically a way to have two files in the file system share the same data on disk, so that the copies don't come at the full price of being copies, all in a copy-on-write fashion: when you write to one of the two files, the data gets duplicated automatically so they don't interfere with each other. That's massively different from hard links, because with hard links both paths to the file are identical, and if you change one, the other changes too. The hard-link mode is optional precisely because of that effect: if you use the hard-link stuff, you can never write to the trees you just extracted, because you would also modify the other trees you might have. The reflink thing, however, is fully transparent to applications because of its copy-on-write nature. It's actually really cool: I use btrfs, and if I extract one of my images and then a different version of the same image, the disk space usage is, I don't know, maybe 1% more that I pay for each additional image version, and they're identical in every other way. I don't know of any backup system that can deliver anything like this, by the way. But yeah, cool stuff. Any questions otherwise at this point?

Yes — is it also useful if you have an IoT device with an image size of 20 kilobytes or 100 kilobytes and a very slow connection to it; in your opinion, would it be suitable? Well, the question was whether this is suitable for very constrained systems, where you have very little space and very little time. I don't know; that's not the world I live in. This is not optimized for utter minimalism. It relies on OpenSSL and those kinds of things — it's not that we pull in millions of dependencies, but we do pull in OpenSSL at minimum, and libacl and the like. And the stuff it does — it calculates hashes and compresses and so on — is probably not optimized for the tiniest systems. That said, I'm pretty sure it's fine for regular ARM devices like the ones you find everywhere; it should work perfectly fine on a Raspberry Pi. But if you're talking about microcontrollers — no, forget it.

What about the backup case, will you add encryption? Yeah, that's a big thing. Let me quickly — I've got like two minutes or something. Three minutes. Reproducibility matters a lot to me. Because we can serialize all this stuff, and because it is so perfectly reproducible, there are actually these two commands, digest and mtree. If you run digest on any directory, it will calculate a digest for you that identifies that version of the tree precisely. What it actually does is catar it up, throw away all the data, and just calculate the checksum over it. So it's a very nice way to get a checksum for a directory and to see whether the directory changed in any way. It's completely out of scope for the rest of this, but it's kind of nice to have. casync mtree is similar. mtree is a format the FreeBSD people came up with — a manifest tree. It's basically a list of the files and directories that should be there, saying the content should have this hash, and the metadata should be this and this and this.
And casync mtree allows you to very efficiently generate exactly that, from a casync index or from a raw file system. So, I don't know, this stuff is kind of useful; it's at the fringes of what we do — a side effect of the fact that we can do it very easily — but it's worth mentioning.

casync can do local operation, directory to directory or to a local file system. It can download over HTTP, HTTPS, FTP and SFTP — the usual set. It can currently upload only via SSH, but that's not because I don't want to support more; I just haven't found the time yet. The idea is that later on we maybe get back ends for S3 or whatever else, so you can do a local backup and put it on Amazon or wherever you like. Also interesting: you get UID shifting, for those few people who use containers with user namespaces — I don't want to go into detail there.

One other thing that's actually kind of nice: when you operate on the block device layer, the file systems you store on your block devices are usually much smaller than the block devices themselves. For example, you have a SquashFS and it's compressed, but the partition you put it in usually has at least twice the size, because you want to have some room for upgrades and things like that. That's annoying for a tool like this, because depending on how big the SD card — or whatever you have in your embedded device — is, you might end up with completely differently sized partitions, with different amounts of space behind them, and that space might not even be initialized; who knows. Anyway, I think casync is relatively smart here: it can read the size from the file system itself. It has a minimal parser for a couple of important file system headers and figures out the actual size of the file system, so it can work with that.

The last bit is about the future, and that's the encryption stuff. Over Christmas I sat down with my brother, who is a crypto postdoc, to figure out how we do the cryptography. He'll get a paper out of it, and I'll get a crypto system that will hopefully convince people. But yes, there's going to be crypto, and it's going to be strong crypto. The idea is basically that I want to go for a model where people use this for their home directory and can store it on servers they don't have to trust: the servers don't know what they're storing and have no way to figure it out, all the information is available on the client side, and we will still deduplicate — at least within your own data, potentially also with others. Anyway, no time anymore, right? One question? One question, yeah.

What about saving and restoring fancy file labels, like SELinux labels or whatever? Yeah, the question was about saving and restoring fancy file metadata like SELinux labels. As mentioned, metadata control is important. There's this one slide here: metadata controls. You have these --with and --without switches, and with those arguments you list explicitly the metadata you want to store and the metadata you don't want to store. And that covers a whole lot of metadata — the SELinux label, which is an extended attribute, project quotas, UIDs, whatever you like. Depending on your use case, you say which ones you want and which ones you don't, right?
For the home directory backup case, for example, you would probably say: I'm not interested in user identities, because the stuff is all owned by you anyway, but I do want mtimes. For the IoT case you'd say: I care about file ownership, but I don't care about mtimes. So you specify exactly what you want, and I think there are about 40 or so different bits you can pick from. Anyway, I think that's my time. Thank you very much. If you have any questions, ask them very soon, because I'm heading off to the airport — and yeah, unfortunately I couldn't see the other talk, which I would really have loved to see. Thank you very much.