I'm John Wheely, as you probably figured out, and I'm in the file system storage area at Red Hat. I want to talk a little bit about the deduplication index that's used by VDO. (I'm having a little difficulty with the new platform here, so I'll just skip this slide.)

A quick review of what VDO is: it's a device mapper target called dm-vdo. Like any device mapper target, it maps from a logical block address space to some physical storage underneath, and while it's doing that, it does some deduplication and compression. The compression part is out of scope for this discussion. How does it do the deduplication? It computes a hash signature for each block that's written and looks it up in an index called UDS, and based on that lookup, it gets a hint about whether it can deduplicate the block.

So what does an index have to have in order to be useful for VDO? It has to have minimal overhead, both in time and space. In particular, it must not be a bottleneck on the write path; the write path is complex enough as it is. You'd also like to minimize the storage overhead, and the reason for that, of course, is that if finding and keeping track of duplicates takes too much of your storage, you lose the benefit of the deduplication. It has to have a small memory footprint, because in general the memory is better used for other things on a busy system. And as a sub-goal, in order to meet those overhead goals, it should take advantage of some properties of actual real-world data workloads. The interesting properties are, first, temporal locality, which means that if a block is found to be a duplicate, it's more likely to be a duplicate of something that was written recently than of something that was written a long time ago; and second, often a kind of spatial locality: runs of duplicate blocks, particularly in things like backup workloads.
Today's full backup probably has a lot of runs of blocks that are duplicates of yesterday's full backup.

So how do we minimize the time and keep the index from being a bottleneck on the write path? Well, it's very decoupled from VDO; so decoupled that it's a separate module. The module that manages the underlying storage and the mapping from logical blocks to that storage is the kvdo module, and the module that manages the deduplication index is the uds module. The interface between the two is very slender: it is asynchronous, and it's treated as advisory rather than definitive. Asynchronous means that VDO computes the hash signature for a block of data, appends some metadata that it's going to use later, launches a request to the UDS index, and gets notified some time later by a callback. And VDO treats the answer as advisory rather than definitive, so VDO actually has to go do some more work even when it gets a positive answer that a match was found. What this means is that VDO can launch a request to the index and then go about its business, doing whatever it can usefully do; when the callback provides the answer, VDO can act on it and do the deduplication, or it can just ignore it. If the system is heavily loaded, or the storage is slow, VDO can just carry on without the deduplication. The only thing you lose there is a potential deduplication; there's no correctness issue, and it will never lose data.

How do we minimize the memory and the storage? The index is sized by its memory footprint; we decided that was going to be the defining characteristic, and that's defined once at creation time and can't be changed. The memory footprint ranges from a quarter of a gigabyte to a terabyte, and it determines the overall size of the index, the size on storage.
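The decoupled, advisory interface described above can be sketched in a few lines. This is a toy model, not the real kvdo/uds kernel API: the class and method names are invented for illustration, and a real request is dispatched asynchronously rather than answered inline.

```python
import hashlib

class AdvisoryIndex:
    """Toy stand-in for a UDS-like advisory deduplication index.

    Names and structure are illustrative only; the real kvdo/uds
    interface is an asynchronous, callback-driven kernel API.
    """
    def __init__(self):
        self.entries = {}  # hash signature -> caller-supplied metadata

    def post(self, block, metadata, callback):
        # Compute a hash signature for the block of data.
        signature = hashlib.sha256(block).digest()[:16]
        advice = self.entries.get(signature)  # prior metadata, if any
        self.entries[signature] = metadata    # remember the newest write
        # "Asynchronous": the answer arrives via a callback, and the
        # caller treats it as a hint, never as proof of a duplicate.
        callback(signature, advice)

def on_advice(signature, advice):
    if advice is None:
        print("new block: write it normally")
    else:
        # Advisory only: VDO must still verify the stored data matches.
        print("possible duplicate near", advice)

index = AdvisoryIndex()
index.post(b"some 4 KiB of data", metadata="pbn 42", callback=on_advice)
index.post(b"some 4 KiB of data", metadata="pbn 99", callback=on_advice)
```

The key property is visible in the callback: a positive answer is only advice about where a duplicate might be, so losing or ignoring it costs a deduplication opportunity, never correctness.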
And that in turn defines the deduplication window, which is basically how far back you can look for duplicate blocks. The total storage is 10 times the memory footprint, and that provides a deduplication window of about 10 terabytes per gigabyte of memory footprint. This also means that the on-storage part, the whole index, is a fixed size and doesn't grow, and that's useful for VDO because it can allocate a fixed amount of space to hold the index.

So how do we manage this? The part that's in memory has to be a very efficient way to drill down into the total index. For terminology: we call the whole index a volume; the volume is divided into chapters; the chapters have pages; and the pages contain records, where records are just pairs of hash signature and metadata.

What are the in-memory pieces? There is a master index, there's an open chapter, and there's a page cache. The master index is used to find a chapter; each chapter has its own index that's used to find a page within the chapter; and then the page is searched for a match. The open chapter is the chapter that's currently being added to. In order to exploit the temporal locality, a duplicate being more likely a duplicate of something recent, a request issued to the UDS index first looks in the master index to find a likely chapter. If that chapter is the newest chapter, that is, the open chapter, it looks there; if it finds the hash in the open chapter, we're all set, and it calls the callback with a positive answer. If it's not found, it adds it to the open chapter with the metadata that was passed with the request. If instead the master index pointed at an older chapter, it goes and consults the chapter index for that chapter, and that gives it a page within the chapter to look in.
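The sizing rules above are simple enough to write down as arithmetic. This is back-of-envelope only, using the round numbers quoted in the talk (storage is about 10 times the memory footprint, and each gigabyte of memory buys roughly 10 terabytes of deduplication window), not exact on-disk figures.

```python
def index_geometry(memory_gib):
    """Rough UDS sizing from the talk's round numbers (illustrative,
    not exact): storage is ~10x the memory footprint, and the
    deduplication window is ~10 TiB per GiB of memory."""
    storage_gib = 10 * memory_gib
    window_tib = 10 * memory_gib
    return storage_gib, window_tib

# The configured footprint ranges from a quarter gigabyte up to a terabyte:
for mem_gib in (0.25, 1, 1024):
    storage, window = index_geometry(mem_gib)
    print(f"{mem_gib} GiB memory -> ~{storage} GiB on storage, "
          f"~{window} TiB deduplication window")
```

Because both figures are fixed multiples of the memory footprint, fixing the footprint at creation time fixes the on-storage size too, which is what lets VDO preallocate the space.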
The page cache holds the most recently accessed pages, so again we're exploiting that temporal locality. If the page is in the page cache, it can look directly there; if it's not, it evicts the oldest page, brings the page in, and searches it. In all cases, if the hash signature is found somewhere in the volume, it gets moved to the open chapter with the metadata that was already associated with it, and that lets VDO know where it can look for deduplication. If it wasn't found anywhere in the volume, that means it's a new block of data, or at least a new signature, and it gets added to the open chapter with the new metadata that was provided with the request.

So how can we locate these things efficiently in a small memory footprint? By the magic of delta lists. Delta lists take advantage of the statistical properties of random numbers. With a good hash algorithm, the hashes of blocks are statistically random; they're independent from each other. The particular statistical property that's important here is this: if you take a sorted list of random numbers, the differences between successive entries in that sorted list have a probability distribution that clusters around the mean difference. So the hashes in a delta index can be represented by the differences between successive hashes in the sorted list. Those differences cluster around the mean, but of course they come in different binary orders of magnitude, so the most compact way to represent them is with a variable-length encoding, a Huffman code. This requires a hash algorithm with a good distribution, but not necessarily a cryptographic hash; the hashes just have to be statistically random.
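The page cache behavior described here, keep the most recently accessed chapter pages in memory and evict the oldest to make room, is a plain LRU cache, and can be sketched as follows. The class and its `read_page` hook are invented for illustration; the real cache reads chapter pages from the volume on storage.

```python
from collections import OrderedDict

class PageCache:
    """Minimal LRU page cache sketch: the most recently accessed
    pages stay in memory, and the least recently used page is
    evicted to make room.  Illustrative only; a real implementation
    reads chapter pages from the on-storage volume."""
    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page   # fetches a page from the volume
        self.pages = OrderedDict()   # page number -> page contents

    def get(self, page_number):
        if page_number in self.pages:
            self.pages.move_to_end(page_number)  # refresh its recency
            return self.pages[page_number]
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)       # evict the oldest page
        page = self.read_page(page_number)
        self.pages[page_number] = page
        return page

cache = PageCache(capacity=2, read_page=lambda n: f"contents of page {n}")
cache.get(7)
print(list(cache.pages))   # [7]
```

With temporally local workloads, most lookups land on a page that a recent lookup already pulled in, so the cache stays small relative to the volume.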
And although MurmurHash3 is not a cryptographic hash, and it is theoretically possible to construct blocks of data that would have specific hashes or patterns of hashes, a denial-of-service attack is not terribly practical, because UDS does protect itself from perverse patterns of hashes.

To walk a request through this process: it would be impractical to keep the master index as a sorted list of all the hashes that have ever been seen. The hashes are 128 bits, so that would be a very long list. Also, because of the variable-length encoding, a delta list has to be searched with a linear search, because each difference has to be reconstructed from its code. So what the master index actually has is an array of delta lists. To look up a hash in the master index, UDS takes a subset of the bits of the hash; again, with a good hash, any subset of the bits should be as random as the whole hash. It uses one subset of the bits to locate a delta list within the master index, and then it uses some more bits from the hash to match against the entries in that delta list. And that delta list, you can see, is much smaller, because it covers only the hashes that share the first subset of bits. The payload, the value associated with each partial hash in the master index, points to a chapter; it locates a chapter.

So the next step is to look at the chapter index. Each chapter has a chapter index that works the same way: it is an array of delta lists. Searching the chapter index takes a different subset of the bits of the hash, finds a delta list within the chapter index, and then searches that delta list linearly. And that yields a page. Then, when the page is pulled into memory, if it wasn't already there, the records within the page are searched with a binary search; they're sorted, but in effect randomly distributed.
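The delta-list idea can be shown concretely. This sketch stores a sorted set of partial hash values as successive differences and searches by accumulating them; the real UDS additionally compresses each delta with a variable-length code (which is exactly why the search must be linear), but plain integers keep the sketch short.

```python
def build_delta_list(partial_hashes):
    """Represent a sorted set of partial hash values by successive
    differences.  With statistically random hashes the deltas cluster
    around their mean, which is what makes a variable-length encoding
    pay off; this sketch keeps the deltas as plain integers."""
    ordered = sorted(partial_hashes)
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]

def delta_list_contains(deltas, target):
    # Linear search: each delta must be decoded and accumulated in
    # turn, which is why delta lists are kept short by fanning the
    # hashes out across an array of lists.
    value = 0
    for delta in deltas:
        value += delta
        if value == target:
            return True
        if value > target:
            return False
    return False

import random
random.seed(1)
hashes = [random.getrandbits(24) for _ in range(1000)]
deltas = build_delta_list(hashes)
print(delta_list_contains(deltas, hashes[0]))   # True
```

Fanning out by a subset of the hash bits, as the master index does, just means maintaining an array of these lists and picking one by those bits before doing the linear scan.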
They're sorted, but they may not be uniform across the space.

There's one more way we can exploit the locality properties, and that is the spatial locality. Many data workloads have runs of duplicate blocks; for instance, today's full backup probably has a lot of runs of the same blocks that were in yesterday's full backup. In that case, we can keep just a sample of the hashes in the master index, because once a page containing one of those sampled hashes is in memory, a lot of the adjacent hashes will also be there. And that gets a 10-times-larger volume for the same size memory footprint.

So where can we go with this? UDS is particularly suited to deduplication because it takes advantage of those properties. The important things are that it's advisory; that it's a fixed size, which means the oldest entries age out as new entries are added; and that it's optimized to find the most recently used things first. So it's particularly suited to deduplication. It is, of course, the index for VDO. Before it was open source, it was also used in a storage array from a major storage vendor to do their deduplication, and they do much the same thing as VDO does as far as managing the block mappings and the deduplication. Its behavior is something like a soft hash map. The key thing is that the user would have to actually manage the underlying data, use the index only for advice, and do garbage collection or otherwise manage the data. So it's not really a general-purpose hash table.

I kind of rushed through this because we started late. Are there places I can go back and elaborate, or things that weren't clear?

Okay, the question was that a good hash algorithm is needed, but not necessarily a cryptographic hash. The benefit of a cryptographic hash is that you can treat the index as more authoritative.
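The sparse-sampling trick can be sketched as a simple predicate on the hash. The sampling rate here is an invented, illustrative constant, not the rate UDS actually uses; the point is only that membership in the sample is decided by bits of the hash itself, so duplicates of a sampled block are sampled too.

```python
SAMPLE_RATE_BITS = 5   # index roughly 1 in 32 hashes (illustrative rate)

def is_sampled(hash_value):
    """Sparse indexing sketch: only hashes whose low bits are all zero
    go into the master index.  Runs of duplicate blocks still dedupe
    well, because finding one sampled hash pulls its chapter page into
    the page cache, where the run's neighbours are found as well."""
    return hash_value & ((1 << SAMPLE_RATE_BITS) - 1) == 0

sampled = sum(is_sampled(h) for h in range(1 << 10))
print(sampled, "of", 1 << 10, "hashes indexed")   # 32 of 1024
```

Indexing a fraction of the hashes shrinks the master index proportionally, which is how the same memory footprint covers a roughly 10-times-larger volume on workloads with runs of duplicates.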
Another thing UDS has been used for; actually, that's one of the things I rushed past. VDO was originally created by a company called Permabit, starting 10 or 12 years ago. Permabit was acquired by Red Hat a few years ago, the code's been open sourced, and we're working on getting it into the kernel.

There is kind of latent support for larger hashes and for a cryptographic hash. The current hashes are 128 bits; a cryptographic hash at 128 bits probably doesn't buy much, but a cryptographic hash at 256 bits does. A cryptographic hash, first of all, has the property that it's hard to reverse: it's computationally infeasible to create blocks of data that would have a specific hash, which would probably make the index more resistant to denial-of-service attacks. But also, several of us at Permabit a number of years ago wrote a prototype backup capability, similar to rsync in that it generated a cryptographic hash, and used UDS to deduplicate the backup traffic. In a case like that, a strong cryptographic hash means that a false positive, that is, two blocks that accidentally hash to the same value, is less likely than a cosmic ray zapping a bit on the storage.

To back up one more step: when VDO finds a match in the UDS index, it still has to go and verify that the block of data already on storage matches the block that's being written. With a cryptographic hash, you can skip some of that overhead, because you can be reasonably sure that if the cryptographic hash matches, the probability of two blocks having the same hash is smaller than the error that would be introduced by various other things, such as a bit error on the storage.
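The "less likely than a cosmic ray" claim is just birthday-bound arithmetic, which can be checked directly. The block count below is an illustrative example (a petabyte of 4 KiB blocks), and the formula is the standard small-probability approximation, not anything specific to UDS.

```python
def collision_probability(n_blocks, hash_bits):
    """Birthday-bound approximation: the probability that any two of
    n blocks share a b-bit hash is roughly n^2 / 2^(b+1).  Accurate
    when the result is small."""
    return n_blocks * n_blocks / 2.0 ** (hash_bits + 1)

# A petabyte of 4 KiB blocks is 2^38 (~2.7e11) blocks.
n = 2**50 // 4096
print(collision_probability(n, 128))   # small, but conceivable
print(collision_probability(n, 256))   # vanishingly small
```

At 128 bits the accidental-collision probability is already tiny, but at 256 bits it drops far below the rate of undetected media bit errors, which is why a strong hash lets you skip the verify-on-match read.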
That was a little disjointed, but it's at least something about the value of a cryptographic hash. Anything else?

One thing I was thinking of, by way of something besides deduplication that might be able to use UDS, would be some kind of real-time analytics where you were most interested in the most recent data. That would still require that the client be prepared to do garbage collection or otherwise manage the data. In all cases, if UDS is being used to keep track of data, the client has to do the actual data management: it has to do garbage collection, or somehow keep track of the data so that, for instance, aging older entries out of the index doesn't leave dangling references.

It looks like our time is pretty much up. Am I right? Thank you, and I hope this was at least moderately interesting and/or useful.