Then let's get started. I'm going to talk about casync, a new project I have been working on. By the way, for the people standing around here: you can go up there as well and look down from there if you want a closer look. So, casync is a new project I've been working on. It's a system for synchronizing OS images from the internet to your devices. It's supposed to be a generic tool, so this is a talk that is hopefully interesting for embedded developers and for server people, but actually for desktop people as well. It's supposed to be generic infrastructure, nothing specific. It's not a product; it's just a building block you can build your stuff on. The name casync is supposed to reflect the two major inspirations that led to it. The first one is content-addressable file systems, which I'm pretty sure all of you have already come into contact with, because that's what git is: you have a hash that identifies some blob of some kind, and you use the hash to reference it, so the address of that content is the hash. And then there's another concept which all of you have probably also come into contact with, which is rsync. rsync is a generic file synchronization tool, and there's actually a very interesting algorithm behind it. Most people who use rsync probably never came into contact with that amazing algorithm; most people just assume it's kind of like an scp that works better, right?
But actually it's way more than that. It's pretty exciting technology that people invented in the 90s or so, and I think the rsync algorithm deserves a lot more time in the limelight, because it's so awesome. casync is just the combination of these two ideas: content-addressable file systems and the rsync algorithm. So what does that specifically mean? It's a content-addressable data synchronization tool, a file system tree synchronizer. It's basically supposed to speed up synchronizing file system trees when you have multiple of them and they are large but very similar. The use cases I already mentioned: updating your OS, updating your containers, your IoT devices, your VM images if you have them. It's about delivering the initial image as well as doing updates later on. It can cover different related use cases as well, one of them being synchronizing your home directory or doing backups. The big difference is that the first use case, delivering images to clients, is about downloading something from the internet and installing it locally; in the case of home directory synchronization and backup it's the other way around: you have something on the individual device and you push it into the cloud. The actual problem sets are very similar, but also distinct in some cases. That's why for now we only focus on the first bit, but the latter bits hopefully come later as well. casync, in contrast to rsync, is a tool that can work on two different layers. One of them is the block layer: you can use it to deliver file systems on the block layer, and by that I mean actual raw file system images, like an ext2 or ext3 image that you read from disk, or a squashfs image, or whatever you have. It can also operate on the file system level.
That's how rsync does it, right: it looks at the individual files and directories and the metadata about them, and serializes them. The fact that casync does both makes it particularly powerful. Different use cases might want to deliver things on different layers. For example, if you do IoT stuff, it might be interesting to deliver images on the block layer: you have your squashfs, you want to deliver it as a squashfs, and it should always stay a squashfs. For other use cases the file system level is more interesting. Containers traditionally, like Docker and these kinds of things, deliver their stuff as tarballs, hence they operate on the file system level, so in that area it might make more sense to operate on the file system level. snap, this Ubuntu desktop app thing, actually works on the file system level as well. So it really depends on your use case and what you're designing your stuff for, whether you want to operate on the block layer or on the file system level, and both are okay. So let's have a technical look at what casync really does and why it is an interesting concept. For that we'll just discuss what it actually does when you upload or download an image with casync. The first step of uploading is that we serialize everything. On the block layer that's easy: we just read the sectors of the disk, one after the other, and that's our serialization, because that's what a block device is. On the file system level it's almost as easy, and everybody's used to this kind of stuff, because it's effectively tar: go through all the files and directories, put them in a specific order, store some metadata about them, and turn the whole thing into one big long stream. That's the first step. The second step is to split up the serialization.
Specifically, we take the serialization and chop it up into a series of chunks. The interesting thing is that we don't always chop it up into chunks of the same size, which is what traditional systems do. Instead we chunk it up into chunks of varying sizes, where the sizes depend on the contents of the blocks, so the same data should result in the same chunk boundaries. Why this complexity of varying chunk sizes? It's to deal nicely with multiple images that are very similar but not identical. Let's say you have version one of your file directory tree and then version two, and very little changed; however, a couple of bytes were inserted somewhere in a text file because you noticed you made a typo, mistyped a comma for a full stop, and wanted to fix that. So you added a couple of bytes and removed a couple of bytes or something like that. Now, if we chunked everything up into equally sized blocks, this would mean that from the point where your change was made, all the blocks following it would change as well, because a few bytes are added to one of these blocks, so towards the end a few have to be removed, or the other way around: a ripple effect. With equally sized blocks, up to the point of your first change the blocks stay the same, but everything after that basically explodes and becomes a change. If you instead use varying block sizes, where the block size is derived from the contents of the block, you can protect yourself from that. Ultimately what you get is that the blocks immediately surrounding the change will change, but as soon as enough bytes have been processed afterwards, you are back in the original chunking logic, so you get back to the same chunk boundaries as before the change.
This is actually what the rsync algorithm is about: this is the stuff that the people who invented rsync came up with. In rsync, though, it only becomes visible if you have two file system trees where two files, in the source and the destination tree, carry the exact same name, because only then does rsync actually start doing its rsync algorithm. If the file system trees contain the same files but under different names, rsync is not able to optimize anything. So, the way deriving the size of chunks from the contents of the chunks actually works is by using a so-called cyclic hash function. We could in principle use any hash function, even something remotely similar to a cryptographic hash function, but it would be pretty inefficient if we did. What actually happens is that we go through the data stream, our serialization, and hash the first byte, then the first and second byte, then the first, second, and third byte, and so on: we slide a window over the entire serialization stream, and for every window, at every single byte position, we calculate the hash function and then run a test on the resulting hash value. The test used is this modulo operation here. It doesn't have to be exactly this modulo operation; this is what we use in casync for a couple of reasons, but you can use anything else. It's a modulo operation with h being the hash value that you calculated at a specific position, and q being a value which allows you to pick the average chunk size. Under the assumption that the hash function is close to perfect, this basically means that by picking q
we can make a chunk cut at a specific byte position more likely or less likely. On top of this hash trick, where we calculate the hash function for the window at every single byte and run the test, we also enforce a maximum and a minimum chunk size. That means even if the test succeeds very early on, we ignore that until we have reached a certain minimum number of bytes; similarly, if we reach a maximum number of bytes without the test ever having succeeded, we force a break. So our chunks have a minimum size and a maximum size, and on average they are somewhere in between; by picking q the right way we can select the average chunk size. After we did the chunking, let's recapitulate: after we serialized everything and chunked everything up, we now have these little blocks, all of different sizes, that together result in the original stream again. We then take each individual chunk and calculate the hash value of it, with a strong hash function this time. The cyclic hash which we used for the chunking is only used for the chunking; now that we have chunked things up, we forget about all that and apply a cryptographically strong hash function to each chunk, to make the chunk recognizable. What we actually use here is SHA-512/256. That's a standards-defined hash function which is basically SHA-512 truncated to 256 bits, with a different initialization. The reason we use it is that we want something relatively short but also quick to calculate, and on most CPUs of today SHA-512 is actually faster to calculate than SHA-256. Ultimately it doesn't really matter much which cryptographic hash function we
use, as long as we use a good one. At the same time as we go through all these chunks that we just generated and calculate the hashes of all of them, we write out an index file. An index file is ultimately just a list of these hashes. Hence the index file identifies the original stream perfectly, because it is simply a list of chunk identifiers, and if you put the chunks belonging to these identifiers back together in the same order, you arrive at the original stream again. After we have done that, we go through all the chunks we generated and whose hash values we stored, compress them individually, and place them in one special directory that we call the chunk store, the file name of each chunk being the hash value generated for it. And we do that in compressed fashion. The compression we actually use these days is zstd, but you can choose whatever you want; you can do it with xz, which is a little bit slower but compresses a little bit better. zstd is kind of the new industry standard, I guess, for people who build new systems, so we use that too. This chunk store is really a very, very simple thing: it's just a directory with a couple of files that happen to be named after their hashes. You can actually go to these files, uncompress them in the shell with the zstd tools, pipe the result into your SHA-512/256 tool, and verify that each chunk contains exactly what its file name suggests. And after we have done that, it's all complete. So again: we serialized everything, we chunked it into pieces, we calculated the hash function of each of these pieces, we took the hashes and noted them in the index file, and we compressed the individual chunks and dropped them into individual files in a directory.
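As a rough illustration of this pipeline, here is a minimal Python sketch. It is not casync's implementation: casync is written in C and uses a buzhash rolling hash, SHA-512/256, and zstd, while this sketch substitutes CRC32 (rehashed per window rather than rolled), SHA-256, and zlib from the standard library; all size parameters are invented for illustration.

```python
import hashlib
import os
import zlib

# Illustrative parameters only -- casync's real defaults differ.
WINDOW = 48        # bytes hashed at each position (the sliding window)
MIN_SIZE = 2048    # enforced minimum chunk size
MAX_SIZE = 65536   # enforced maximum chunk size
Q = 16384          # discriminator: picks the average chunk size

def chunk(stream: bytes) -> list[bytes]:
    """Content-defined chunking: cut where hash(window) mod Q hits a fixed value."""
    chunks, start, pos = [], 0, 0
    while pos < len(stream):
        pos += 1
        if pos - start < MIN_SIZE:       # ignore test hits before the minimum size
            continue
        window = stream[pos - WINDOW:pos]
        # Stand-in for the rolling hash: a real implementation updates the
        # hash incrementally per byte instead of rehashing the whole window.
        h = zlib.crc32(window)
        if h % Q == Q - 1 or pos - start >= MAX_SIZE:
            chunks.append(stream[start:pos])
            start = pos
    if start < len(stream):              # trailing chunk, may be short
        chunks.append(stream[start:])
    return chunks

def make(stream: bytes, store: str) -> list[str]:
    """Compress each chunk into the store, named by its hash; return the index."""
    os.makedirs(store, exist_ok=True)
    index = []
    for c in chunk(stream):
        digest = hashlib.sha256(c).hexdigest()   # stand-in for SHA-512/256
        index.append(digest)
        with open(os.path.join(store, digest), "wb") as f:
            f.write(zlib.compress(c))            # stand-in for zstd
    return index
```

The index (list of hashes) plus the store directory is the complete output; concatenating the decompressed chunks named in the index reproduces the original stream byte for byte.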
That's what happens when you package up some kind of directory tree or some kind of file system image with casync: as input you put that in, and as output you get one index file and one of these store directories. To put it very briefly: serialization, chunking, hashing, indexing, and compressed storage, that's the chain we run through here. So let's now think about the other operation, extracting it again. For that we simply reverse this. First step: we acquire the index file. Again, the index file is this list of hashes, simply referencing the chunks. Second step: we acquire and uncompress each chunk file from the chunk store that is listed in the index file. After concatenating them we have the original serialization back, and we simply deserialize that again, thus regenerating the original block device data or the file system tree. And that's already it; the extraction actually fits on one slide. Now you may wonder how this compares to systems that are in some ways related. First of all rsync, which again was one of the major inspirations, but other systems have similar properties: OSTree is one I want to mention, and restic, which is a backup tool, and similar systems like borg. For us it's really important that we forget the file boundary. All the other systems tend to put a lot of value on the file boundary. rsync, for example, iterates through the file system trees, and the rsync algorithm, as mentioned, is never actually used until a file on the source and a file on the destination are found that have the exact same path. Until rsync finds that, the rsync algorithm isn't used. So for rsync the file boundary is everything: it's how they think, it's how the algorithm works.
It first looks for the files and then it applies the algorithm. casync doesn't do that; in casync we give up the file boundary. We really, really don't want the file boundary, because caring about the file boundary means that small files translate into small blocks and large files into large blocks. If you instead serialize everything first, giving up on the file boundary, you get this nice effect that small files are lumped together: multiple small files become one chunk all together, and on the other hand large files are split into individual chunks. So on average you get the average chunk size that we're actually looking for, with neither overly small chunks nor overly large ones. This is actually a key difference. I think, for example, it's one of the weaknesses of OSTree that they do not give up on this file boundary. Unix directory trees traditionally have a lot of small files: look into /etc, things like /etc/hostname, /etc/hosts, /etc/fstab; usually those files have less than a kilobyte or so, and if you always generate one object, one chunk, out of each of them, then you pay heavily, because you will have so many small chunks around. This way our chunks are evenly sized, and we can even recognize similar blocks in different files, which is something that rsync can't do. In rsync, if you rename a file on disk, so on the source something that last week was called foo is now called bar, and then you rsync it to a destination that had a copy of foo, rsync is not smart enough to notice that the file is already there, just under a different name, because again, the file boundary is everything for it, and the way into it is the file path, hence it will not recognize that. For us,
it's different: we don't care about the file boundary, and hence we neatly handle renames and things like that, because we actually just care about contents. So why do we do all this? Similar file system images will result in mostly the same chunk files, which basically means we can very efficiently store many related trees. Think about it: you do your IoT development, you have an OS image put together like a classic Unix system. Usually, between all the versions, most of it stays exactly the same; it's just in a few places, where a couple of files changed, that you actually made changes. Because of the smart chunking algorithm, most versions will result in mostly the same chunks; they will end up with exactly the same hash values. The only chunks that change are the ones you really modified. So it becomes very efficient to store many related trees, and it's not just about storage, it's also about delivering them. Another benefit is that the stream is always implicitly validated, because everything is content-addressed with a strong hash function. Every time you download one of these images, as we download each chunk we can validate that the chunk actually contains what its name suggests, because the name, after all, is the strong hash value. So you get something like dm-verity-like behavior. For those who don't know, dm-verity is a scheme, my other talk actually mentioned it a little as well, for making block devices secure against offline modification.
What actually happens there is that every single disk read access you do is verified with cryptographic primitives to match the original version that was deployed on the system. In casync we get some very similar behavior, because the index file has the list of all the hashes of all the chunks, and as we go through the serialization we validate all of that. So if the index file wasn't modified, we know that the entire stream wasn't modified either. What's also cool about this is that when you download these images, it becomes very CDN-friendly, CDN in the sense of content delivery network. By picking this q parameter of the modulo check that I mentioned, you can configure how small or how large your chunks will be on average, and this is what you want to tune if you deploy your images on a CDN, because you can basically decide: do I want relatively small chunks on average? If so, the chance of reusing old versions is increased; however, you pay more money, because most of the CDN networks actually make you pay for individual requests, not for sizes, and the price you pay for the metadata also becomes a little bit higher. You can also increase the average chunk size if you like, and if you do, then of course there's a smaller chance of reusing what you already have, but you have fewer requests, and so you save a little bit of money on your side of things. But it's extremely CDN-friendly, because on average all the chunks will have the same size. For a system like OSTree, again,
that's a very expensive thing, because, as mentioned, they usually end up with lots of very small objects, and since you pay for every single object request, this kind of breaks its neck. There are other reasons too. When acquiring a new version of a file system image, we can apply this algorithm, the serialization, the chunking, and so on, to what we already have: for example to an old version of the image that we already have, or to something that is somehow related. For example, say you do an IoT A/B setup, an A/B setup meaning that you have one operating system image on the system that is currently booted, and then another one which is the next one that's going to be booted, or was the last one booted, so that you always have one version that is in a perfect condition while the other one, the one currently being updated, might be in an imperfect condition. The nice thing is: if you use casync to do updates on such a system, you can use what you already have on disk to update the other version. Because you will chunk it up the same way and hash it the same way, you can do cross-updates, recognize all the same blocks, and copy them locally instead of retrieving them from the internet. So you basically get a pool of reusable chunks from whatever you throw at casync to use as a seed. This is actually really powerful, because the trees that you can use as a pool don't even have to be related to what you want to update to.
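The seeding idea above can be made concrete with a small sketch. This is not casync's actual API; `plan_update` and `chunker` are names invented for illustration, and SHA-256 stands in for casync's SHA-512/256. The point is only the mechanism: chunk whatever local data you have with the same algorithm, and any chunk whose hash already appears locally never has to be downloaded.

```python
import hashlib

def plan_update(new_index, seed_streams, chunker):
    """Split the chunks a new image needs into locally available vs. to-download.

    new_index:    list of chunk digests for the image we want
    seed_streams: byte streams we already have locally (the old image,
                  or any unrelated image -- any pool candidate works)
    chunker:      the same content-defined chunking function used on upload
    """
    pool = {}
    for stream in seed_streams:
        for c in chunker(stream):
            pool[hashlib.sha256(c).hexdigest()] = c   # stand-in for SHA-512/256
    local = [h for h in new_index if h in pool]       # copy these from disk
    download = [h for h in new_index if h not in pool]  # fetch only these
    return local, download
```

Note that there is no history anywhere in this: the seed pool is just a set of content hashes, so it works equally well whether the seed is last week's version of the same image or a completely unrelated one.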
For example, you can do cross-updates with this: say you have a Debian OS image or a Debian container image or something like that, and you're downloading a Fedora image for your new container. Then there's a very good chance that much of the stuff contained in there is actually very similar, because Linux distributions don't tend to be completely different; they have lots of data, like locale data and timezone data, that's pretty much exactly the same on all our Linux distributions. Because casync doesn't really care what you throw at it, it tries to find similarities in whatever you give it, and if there are none, that doesn't hurt either. So there's automatic, robust reuse of what was downloaded before, without there having to be a historical relationship. This is very different from other systems like restic and things like that, which basically implement something a little bit like a version control system.
What they do, and for example the Chrome OS updater works like this too, is that they generally say: okay, I'm doing an image today, and I'm doing an image tomorrow, and to make updates efficient I calculate the delta in advance and put that on a server, so that people can download the delta. This sounds like a very good solution, but it tends to explode, because not everybody updates their system all the time. So it's not sufficient to keep one delta around; you basically have to keep deltas around between all the versions that you want to permit upgrades between, and that ultimately becomes very expensive, just from a management perspective, because for every single update you do, you have to calculate deltas to every older version that you might want to support updates from. With casync you don't need that, because with casync the only delta algorithm is the client requesting the chunks that it needs, and it will automatically use what it already has, regardless of whether there's a historical relationship or not. Of course, if there is a historical relationship, there's a much higher chance of reusing stuff, but if there's not, it's fine: it will still use what's usable and leave what's not. So this makes the management of providing these images a lot easier, because you don't have to think about history anymore; you can just put your stuff there, and it will be used if possible, and if not, it's not. So let's focus on a key difference to Docker, which I already hinted at: casync will not do revision control for you.
Which I think is actually the key thing: revision control, in my opinion, is a very useful developer tool, but it should stay a developer tool. It's not a tool for deployment; it's not the way to optimize your deltas. And Docker and these things aren't clear on this: they have this kind of weird version control which they also use for delivery, and it's how they work, but I don't think the two should be intermixed. For version control you want metadata, you want information about times and whatnot; for deployment you don't care about all of that. There is no reason for the end-user device to know what last week's version was; all you need to know is: this is the block device of the new one, and that's the block device of the old one, if you still care. So much about the difference to the existing systems; there's lots more, but I don't have infinite time here to talk about them. Now, a couple of points unrelated to this specific thing. One question people might ask: as mentioned, casync can operate on the block layer and on the file system layer; when it operates on the file system layer,
it does something that is pretty much tar: it goes through the files and directories and serializes everything so that it can chunk it up later. So the question is, why don't we actually use tar? We don't, and the reason is that we want a couple of properties that tar cannot provide. So we came up with, or I came up with, something called catar. You can actually use casync in this mode as well: you can just tell casync to generate a .catar file from a directory tree, and then no chunking or anything like that will happen, but you still get the serialization. Not sure why you would do that, but you can if you like. So catar is a lot like tar, but it's strictly reproducible, and I put a lot of emphasis on this. It basically means that one directory tree will always result in the exact same serialization, which is a property that tar doesn't have. tar is very, very weakly defined: you can basically put the files into the tar in any order you like, and that's what most operating systems, or most implementations of tar, do; they put the files inside the directories in the order that the file system happened to pass them to the application, and that can be anything. So it really depends: if you use ext3 or btrfs, things will already be in a different order, and it's actually even worse than that, because file systems, even the same implementation, tend not to have a well-defined order in which they return things. In catar all of that is not permitted: there is only one valid serialization of one directory tree. Another big benefit of catar that tar doesn't have is random access.
I want the ability that, if you have serialized everything and you want to jump to file so-and-so in that directory, you do not have to go through all of the beginning of the stream until you find that file, but can look at the serialization and jump directly to where you need to go. These are the two big things that classic tar does not provide which I think are very crucial for a system like this. Random access is kind of cool because it basically allows you to mount the casync archives, the .catar files, locally as if they were a file system: you basically get a read-only file system there. Let's talk a little bit about the artifacts that casync will generate for you, so you actually see how it works. We have the .caidx files: these are the index files that I mentioned, basically lists of hashes. A .caidx is an index file that references a directory tree, for when we operate on the file system layer. A .caibx is the same thing, but about the serialization of a block device. Ultimately they are exactly the same file format with exactly the same stuff; they are only named differently to give the user a hint about what they reference, one layer or the other. Then there is the .catar suffix, already mentioned: that's basically what you get when you don't do the chunking and all this stuff, but just want something like a better tar that is reproducible and mountable. And then there is .castr, which is actually a suffix for a directory instead of a file, and that's the store directory, where we put our compressed chunks.
If you cd into that, you'll just see all the chunk files in there, nicely compressed, and you can verify them with your shell tools if you like. So this is just to give you a little bit of a feeling for what the artifacts are that you'll see on disk. Something I forgot to mention when I talked about the reproducibility of the tar serialization is metadata. Metadata means all the stuff you store about files that is not the contents, meaning for example mtimes, modification times, user stuff like ownership of files, and things like that. Metadata is a big thing, because when metadata shows up in a serialization, it often conflicts with your reproducibility. Take mtimes, for example: in some cases, say I back up my home directory with casync every day, I actually care about mtimes a lot, because I want to know whether I worked on this document today and on that document last week. But for OS trees that doesn't matter; it usually hurts, because you want reproducibility. We had a talk about reproducible builds: if you run GCC on your C program today and tomorrow and it generates exactly the same output, that's a cool thing, and if you then mess it up by giving both outputs different mtimes, because you ran it at two different times, you lose a lot. So the point I'm trying to make here is: depending on your use case, you have to control which kind of metadata you actually want to store.
So if you build an IoT image, you probably do not care about mtimes, but you do care about file ownership, about permission bits, and these kinds of things; in that case you want those. In the case where I back up my home directory, I do care about the mtimes, but I probably do not care about things like numeric user IDs and numeric group IDs, because I want the ability to extract this on some other machine, where I have a slightly different UID or GID, and I don't want this to conflict. So the corollary is that we should have a way to control precisely which metadata gets stored and restored in the serialization, and that's what casync provides you with. We have a couple of switches; this is actually supposed to read --with and --without, but LaTeX found it funny to turn it into one dash, I don't know why. So when you invoke casync you have these --with and --without switches, and you can specify precisely what kind of metadata you want: do you want mtimes, do you want ACLs, do you want extended attributes, do you want the btrfs concepts like subvolumes and things like that? By default, if you don't specify anything, we are as precise as possible.
We actually go into a lot of detail about saving and restoring your file system trees, in a lot more detail than tar does, actually. For example, we can store btrfs subvolume information for you, and we can store the chattr stuff: for those who know Linux well, there are a couple of additional file system attributes that you can attach to files, to make them immutable and things like that. We save and restore that for you as well, if you like; for some use cases that might be useful. So, so much about metadata: casync puts you in control, depending on your use case, to pick exactly what you want, if you want more, if you want less.

Another point that is interesting to discuss is squashfs images. I already mentioned squashfs a couple of times; squashfs, for those who don't know, is the compressed file system for Linux. Usually compression and chunking conceptually conflict, because compression is supposed to remove all the entropy from your stream, and that basically means (that's the purpose of compression, ultimately, in a way) that if you make one change at the beginning, this will explode into the rest of the stream. This is of course exactly what we don't want with casync. So one would assume that squashfs and casync conflict, but actually they don't at all, because squashfs is actually a random-access file system: it's efficient to access a file at the end of the file system. They don't compress the entire thing from top to bottom in one single stream. What they do instead is they have some structure for how you can find files, and some structure for how you can find individual blocks, but the blocks are individually compressed. This actually helps us a lot, because they don't have one compressor that goes from the front of the serialization all the way to the end in one go; instead they have a compressor that
starts with one block and ends it, then starts the next block and ends it, and so on. So what we can do now is synchronize that up: if they do this kind of chunking on some level, and we do this kind of chunking on some level, we can take advantage of that, because we can still recognize the chunks. I'm not going to tell you exactly how to synchronize them up, because I didn't do the research; there's a GitHub issue about that specific thing. The point I'm really trying to make is: compression by default appears to be contradictory to the chunking stuff, but actually it isn't in real-life cases like squashfs. It's just a matter of a little bit of research into how you pick the chunk size that squashfs uses in relationship to the one that casync uses.

I mentioned this already: for me, reproducibility is everything. I think security requires reproducibility. That means that if I deploy an image today and deploy an image tomorrow, I not only need to guarantee that exactly the same version ends up on the machine, but I also need to provide tools so that you can verify that. As a result of this, casync actually has some tools, casync digest and casync mtree, that are completely outside of the scope of delivering images to people but are actually pretty useful in the general case. What casync digest does is: you invoke it, and it will calculate a hash sum of the object that you invoke it on. For example, if you don't specify any arguments, it will operate on the working directory you call it in. What it will do then is implicitly, behind the scenes, serialize the entire working directory and throw all that away, but before throwing it away it will calculate a strong hash value of it and then show you the hash value.
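Conceptually that is a stable recursive tree hash. A minimal Python sketch of the idea (mine, not casync's implementation; real casync hashes its full serialization, metadata included, not just names and contents):

```python
import hashlib
import os
import tempfile

def digest_tree(root):
    """Walk a directory in a fixed (sorted) order and hash every
    relative path and file body into one SHA-256, yielding a stable
    whole-tree fingerprint. Simplified model of `casync digest`."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # pin traversal order, filesystems don't
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            h.update(os.path.relpath(full, root).encode() + b"\0")
            with open(full, "rb") as f:
                h.update(f.read())
    return h.hexdigest()

# Two separate trees with identical contents digest identically.
with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
    for d in (d1, d2):
        os.makedirs(os.path.join(d, "etc"))
        with open(os.path.join(d, "etc", "os-release"), "w") as f:
            f.write("ID=demo\n")
    a, b = digest_tree(d1), digest_tree(d2)

assert a == b
```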
So casync digest is a little bit like invoking sha256sum from the command line with a star or something, but then actually doing that recursively for the entire tree, in a completely stable and guaranteed way. casync mtree is pretty similar. mtree is a concept that was introduced by the BSD people; it's short for "manifest tree". It's basically a way to list a directory tree: a standard file format for showing a list of files in a directory tree with their hashes, so that you can verify them. casync mtree can generate that for you, and can validate it for you as well, and it can generate it from an index file or from a directory tree or from everything. So yeah, reproducibility matters, and casync, completely outside of the scope of actual delivery, provides you with these functions.

What's kind of cool with casync is this: if you have multiple local images installed, for example a couple of container trees, they're pretty much all closely related, because they're all versions of Fedora that have been adjusted to some specific service, or SUSE or Ubuntu or whatever you have. Then it's kind of cool if we can take advantage of the fact that in large parts they contain the same data, by telling the file system about it. I'm speaking entirely about running casync on the file system layer now. For that you can invoke casync with a special switch that tells it that when extracting a casync archive, like when downloading it from the Internet and things like that, it should use a local pool as a source, not only for copying things over but also, optionally, for creating hard links instead, where that's possible. It's something that we basically copied from OSTree, because in OSTree you get this property: you have one file system tree, and you get another file system tree, and all the files that are
identical are hard-linked. With casync you can do the same thing, and you can do it completely directly; casync will do it for you if you like. Hard links have a bit of a problem though: if you ever modify the tree, then both the original tree and the one you modified will change; they propagate changes between the trees. Modern file systems, specifically XFS and btrfs and a couple of others, actually know a concept called reflinks, referential links. It's a copy-on-write concept. It's like a hard link in a way: you can have one file and another file, and they have the same contents, stored only once on disk. However, because it's copy-on-write, as you modify one of these two files the unification is split up and you get two versions of it. Reflinks are kind of cool, and casync is able to create them for you. So basically, if you have one container tree already unpacked, and you call casync to unpack a related other container tree, then casync will use the first one as a pool, will recognize similar blocks, and will then use reflinks to make sure that the second one is stored as efficiently as possible and reuses as many blocks as possible from the original one. That's actually super powerful, and I'm not aware of any other tool that can do this kind of stuff.

I've got like five minutes left, or three minutes now. I've got two more slides or something, but I think it's okay if I don't cover them; I'd rather go to questions. Anyone have a question?

Thanks. Since you said that you don't care that much about the file boundary: is the transfer always an all-or-nothing proposition, or can you just specify a slice of the destination file system to transfer over?
So with casync, yeah, you can always download only parts of it. This is actually used, for example, for this thing: there's a casync command where you basically get a local block device that is backed by the data in that index file. It will give you a device node in /dev, /dev/something, and as you access that device, casync will go and download exactly the chunks needed for that from the server. So you basically get something like an on-demand-downloading file system thingy. We can also do that on the file system layer, where you basically say: I want to mount this .caidx, this index file that references a directory serialization, and have it locally; and we will not download it all in the beginning, we will download it as you go, and we'll actually cache the chunks locally. So it's not an all-or-nothing thing; it's designed to be random access, and it's designed to delay downloads until the point where they're actually necessary.

Suppose I have some other tool which wants to unpack file system trees but has no idea how to handle device files, and I want to integrate casync into this. Have you considered putting bits somewhere clearly in the hash identifier, so that I can tell instantly, offline, what kind of object I have?

Well, I mean, there are no different kinds of objects. In casync we split everything up into chunks; we forget everything about the context of the data that we're operating on. The serialization for us is just a serialization, and then we chunk it up and store that away, and at that point
there's no object type or anything. So I'm not sure I follow. What's interesting to mention is that with casync you can also pass the serialization already pre-packaged into casync. For example, you could use GNU tar to tar something up, push that thing into casync, and casync will just do the chunking and everything for you. How the serialization comes to be, whether casync generates it or you pass it in ready-made, is completely up to you; casync works both ways. It's relatively generic in that regard. Maybe that's an answer to your question.

Hello, my question would be: it doesn't matter whether it's a block device or a file system, so if it's a block device I might have some sparse space left, like for future use, in embedded or anything; or if I have a file system, I might have sparse files. How do you handle that? Do you just deliver a single chunk for all the zeros, or do you ignore it?

That's a very good question. The entire system is designed so that we recognize chunks, and these chunks can have any kind of contents, including nothing, that is, NULs. If casync works correctly, it will recognize that you have zeros here and zeros there and zeros over there, and it will merge them all into the same chunk with the same hash value. So casync will automatically recognize that and handle it the right way. That said, casync extraction is actually smart: if it notices that it is extracting a series of zero bytes, it will automatically generate sparse files, but you can turn that off if you don't like it.
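That answer can be sketched with a toy content-addressed chunk store (fixed-size chunks here for brevity; casync actually cuts chunk boundaries with a rolling hash, so this is only the dedup idea, not the real algorithm):

```python
import hashlib

def chunk_and_store(data, store, chunk_size=4):
    """Cut `data` into chunks and put them into a content-addressed
    store keyed by hash; return the list of chunk hashes (the
    'index'). Identical chunks, such as runs of zeros, collapse
    into a single stored object."""
    index = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store[digest] = chunk  # same content, same key: stored once
        index.append(digest)
    return index

store = {}
# A "disk image": two separate runs of zeros around some payload.
image = b"\0" * 8 + b"DATA" + b"\0" * 8
index = chunk_and_store(image, store)

# Five chunks are referenced, but every all-zero chunk hashes to the
# same value, so the store holds only two distinct chunks.
assert len(index) == 5
assert len(store) == 2
```

On extraction, a writer that notices an all-zero chunk can simply seek past it instead of writing the zeros, which is how sparse files fall out of this design almost for free.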
So the idea is basically to generate the most efficient directory representation possible, using reflinks, blah blah blah, and also sparse files where we can. What casync doesn't do, however, and it has been requested, but I'm not sure I want to implement it, is reproduce the exact sparse file layout of your source on the destination. We will either make it less sparse or make it more sparse, because we don't actually store any information about where the holes in the file are or are not; we just consider zeros.

Okay, that's my time. If you have any further questions, I'll be around here. Thank you very much for your interest.