All right, it's 5:30. I think we'll get started. Okay, I'm Jeff Spies. I'm the Chief Technology Officer and Co-Founder of the Center for Open Science. I'm also the co-lead of SHARE and a visiting assistant professor at the University of Virginia's Department of Engineering and Society. This is what I've been calling my fun talk. I've been looking forward to this for a while, because I think there's a lot we can learn about data integrity from these very neat but very stigmatized technologies like Bitcoin, BitTorrent, and Usenet.

So, librarians and archivists care about data integrity: persistence, preservation, decentralization, distribution, robustness, fault tolerance, and inclusivity. And these are actually the concerns of these technologies, these protocols, these services like Bitcoin, BitTorrent, and Usenet. These services do carry a stigma, because the quality of pseudo-anonymity makes them prone to illegal use cases. But again, I think there's a lot that we can learn from them.

So I want to start with a question. If someone removed the Oxford commas from a document, for example, the title of my talk, how would I know that? The simplest way is direct comparison. For every character in my title, I could go through and say the D and the D are the same, the A and the A are the same, all the way to the comma. Okay, it's missing. Someone thinks that librarians are archivists and criminals; that was not my intent. So now I know that this has been subtly changed out from under me.

But what if direct comparison is impossible or impractical? For example, if I send you a file, you don't necessarily have direct access to my version anymore. You can't do this character-by-character comparison, this byte-by-byte comparison. Or let's say I want to track the integrity of the file over time; that original file may be corrupted. So what metadata do I need to store and/or share with you that would represent this idea of sameness, that these are identical?

How about file size? We can look at that. The title of my talk has 56 characters; it's 56 bytes of information. If we just took the length of it and compared it to the one with the Oxford comma removed, we'd see a difference. But this is very easy to get around. I can just add a space there, and then in this latter example I wouldn't necessarily know that anything has changed. If a bit flipped but the same number of bits were there, I wouldn't recognize the change.

And this is where hash functions come in. Can I just get a quick show of hands of how many people know what a hash function is? Okay, good. So this will be a pretty high-level overview, and I'm happy to go deeper in the Q&A or later.

Hash functions, a few definitions. Hash functions deterministically map input data to output that is generally smaller in size. So if I take, and I hope you can see this well, the MD5 hash function, I tell it I'm hashing a string, this is just on the command line on a Mac or Unix platform, md5 -s and then the string I'm looking to hash, and I get an output that is smaller than the title. MD5 is a 128-bit hash, so I get 32 characters of information in hexadecimal notation. If it used a larger alphabet than hexadecimal, the printed output would actually be even shorter. And if you were to run the same command, you would get exactly the same output. This is because it's deterministic. Okay, and hash functions map arbitrarily sized input data to output that is of fixed size.
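To make that concrete, here is a minimal sketch of those properties using Python's hashlib, which comes up again later in the talk. The title string is my reconstruction from the 56-character count mentioned above, so treat the exact wording as an assumption.

```python
import hashlib

# Assumed title text: 56 characters, matching the count given in the talk.
title = "Data Integrity for Librarians, Archivists, and Criminals"
# Comma removed and a space added, so the length is still 56 bytes.
tampered = "Data Integrity for Librarians, Archivists and  Criminals"

print(len(title), len(tampered))  # 56 56 -- a length check misses the change

# Deterministic: hashing the same input twice gives the same digest.
print(hashlib.md5(title.encode()).hexdigest())
print(hashlib.md5(title.encode()).hexdigest())  # identical to the line above

# Fixed size: MD5 is 128 bits (32 hex characters), no matter how big the input.
print(len(hashlib.md5(b"x" * 10_000_000).hexdigest()))  # 32

# And the tampered title hashes to something completely different.
print(hashlib.md5(tampered.encode()).hexdigest())
```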
So if I do that hash again, but compare it to a hash of the title without the subtitle, I get two different outputs, but they are the same size, a fixed size. The smaller input did not change the size of the output. I could do this with a PDF, with a PowerPoint presentation, I could do this with gigabytes of data, terabytes of data, and I would get the same 128-bit output, the same 32 hexadecimal characters.

And hash functions map input uniformly over an output range. They should appear random; they should be random. I ran 1.6 million hashes against random strings, and because there are 16 possible characters in the hexadecimal output, if I take just the first character of each hash and plot it, we see about 100,000 counts of each character, because the output is uniformly distributed. I could do the same thing for the second character, the third character. I could take the first two characters and map that out across the space, and I would see the same uniform distribution.

And then a special type of hash that I want to focus on is the cryptographic hash. These hashes are non-invertible and collision resistant: for a given input, the output is practically unique. Here, I've taken the hash of the title, but I've just removed the S. The two outputs are very different. It wasn't just one character of the hash that changed, even though I only changed one character in the title. This gets at the idea of being non-invertible and collision resistant. Non-invertible means that simply knowing the output tells me nothing of the input. You can't give me this new hash and expect me to somehow tell you it was the one missing an S; I know nothing of the content. And they're collision resistant in that it's hard to find two inputs that have the same identical output. This is important. The birthday paradox tells us that with a 128-bit hash like MD5, it would take 2^64 hashes of random inputs to find a collision. Given any two strings or files, the chance of collision is 2^-128. This is practically unique.

So back to my original question. If someone removed the Oxford commas from a document, how would I know that? Well, I could just take the MD5 hash of both versions, compare them, and see that the commas were removed. Someone rejected clarity and communication and removed the comma. That'll be the end of my Oxford comma jokes. I'm happy to talk about those in Q&A if you'd like. They are very important. And you're the only audience I can actually tell that to and think you'll laugh or find it funny.
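Here is a small Python sketch of those properties: comparing the two versions by hash, showing how a one-character change scrambles the whole digest, and tallying the first hex character over many random inputs to illustrate the uniform distribution. The strings are illustrative, not the exact slide text.

```python
import hashlib, os
from collections import Counter

with_commas = b"Librarians, Archivists, and Criminals"
without = b"Librarians, Archivists and Criminals"

# Comparing the MD5 of both versions shows immediately that something changed.
print(hashlib.md5(with_commas).hexdigest())
print(hashlib.md5(without).hexdigest())

# Avalanche effect: drop a single trailing "s" and most of the digest differs.
a = hashlib.md5(b"Criminals").hexdigest()
b = hashlib.md5(b"Criminal").hexdigest()
print(sum(x != y for x, y in zip(a, b)), "of 32 hex characters differ")

# Uniformity: over 1.6 million random inputs, each of the 16 hex characters
# shows up as the first character roughly 100,000 times.
counts = Counter(hashlib.md5(os.urandom(16)).hexdigest()[0] for _ in range(1_600_000))
print(sorted(counts.items()))
```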
So why does this matter? Because data integrity is important; it's the concept we started with. Media is prone to data degradation, data decay, data rot. Unintentional corruption, which covers most of the use cases we'll be thinking about, is pretty rare per event. Hard drives are pretty reliable these days, but we're working with such large scales of data, and accessing it so quickly and so often, that even those low-probability events become quite common. Two examples of that, and these come straight from the Wikipedia article on data corruption. NetApp found more than 400,000 silent data corruptions; 30,000 of those were not detected by the hardware RAID that was supposed to catch these changes. That is a lot of unknown corruption. CERN found 128 megabytes of data permanently corrupted across 97 petabytes. Yes, that's a big amount of data, but 128 megabytes of permanent corruption is a lot of bad files. That is a lot of bad data, or data that we can no longer really trust.

Okay, so this is again where these cryptographic hash functions come in. There are quite a few of them. MD5 is the one I've been talking about. It isn't recommended if security is an issue, if you're concerned about the chance of intentional corruption, someone falsifying data: there are some vulnerabilities, and it's just a very fast hash, so brute-force attempts are easy, because it's cheap to run through many different inputs trying to hit a certain hash. SHA-1 is in that same camp; there aren't necessarily the same vulnerabilities, but it is a pretty fast hash. NIST recommends that people use the SHA-2 family, SHA-256 and SHA-512. These are more complex hashes. They have a greater number of bits, they result in longer outputs, and they're slower to calculate.

There are also benefits beyond just the identification of data, because of this uniqueness property: we can save space. This is called content-addressable storage. We can change our storage from, for example, Mary storing her data on a file system under her folder, John also storing some data, they have the same file name but different file sizes, different hashes, and then Chris maybe making a copy of Mary's data and storing it in their folder. If we have a content-addressable storage system on the back end of this, we only need to store that file once. We identify it by its hash, in this case 23c1d; I'm just using a shortened version of a hash. We've saved all that space, and we can create interfaces on top of it, virtual file systems with folders and files, if people need them.

So the takeaways from this are that we should include hashes with downloads. If people are downloading data from us, from our services, from our websites or applications, you can include hashes. Then teach them how to use those hashes with tools that are pretty common, like md5 and md5sum, the equivalents for SHA-1 and SHA-256, and Python's hashlib library, which is very nice and very easy to use. And then we should be thinking about using content-addressable storage, if only to save space. We're collecting a lot of data. We want people to reuse that data, we want people to reproduce it, and we're going to see a lot of copies of it. A content-addressable storage system is quite efficient for that.
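To make the content-addressable idea concrete, here is a toy sketch in Python. The store layout and function names are made up for illustration; the point is just that identical content hashes to the same address and therefore costs nothing to store again.

```python
import hashlib, os

STORE = "cas-store"  # hypothetical directory where blobs live, named by their hash

def put(data: bytes) -> str:
    """Write a blob once, keyed by its SHA-256; duplicate uploads are free."""
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(STORE, digest)
    if not os.path.exists(path):
        os.makedirs(STORE, exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
    return digest  # the address Mary, John, or Chris would reference

def get(digest: str) -> bytes:
    with open(os.path.join(STORE, digest), "rb") as f:
        return f.read()

# Mary uploads a file; Chris makes a copy. Only one blob is ever written.
mary = put(b"some shared dataset")
chris = put(b"some shared dataset")
assert mary == chris
```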
Okay, so now to the fun stuff. Usenet: how many people have heard of Usenet? How many people have used Usenet? Good. Established in 1980, it was one of the first worldwide distributed discussion networks. Users can read and post messages on topics called newsgroups. It's similar to the bulletin board systems of the time and to modern internet forums. There was no central server or administration needed; the news servers talked to each other. And over time, people started using it to store and share files. But that was a bit problematic. Usenet was not designed to transmit binary files. More than that, there were message size limits. Most clients were similar to or the same as email clients, so you had the same limitations on how big a file you could post to Usenet. And this is where some of the unsavoriness comes in, as people start using it as a peer-to-peer file sharing service for sometimes illegal or copyrighted information. How people got around these constraints, though, being very clever, was to split the files up into what were called RAR archives. That would result in many files. They would then encode those from binary into something that Usenet could handle, a text-based encoding like yEnc.

But what would happen is you'd go to download these files, and sometimes there'd be hundreds of these split files and one or two would be missing. Or you'd find out that a file was corrupted. There were synchronization problems, things were taken down, copyright notices would be posted, so partial files would be pulled from the systems. And so what people did was use a common technique called parity computation. And I want to make sure that's differentiated from party computation. Thank you for laughing. The best-known case of this is redundant arrays of independent disks, RAID. The idea being that you could lose some data, you could lose a drive, and you'd be able to recreate it.

One form of parity comes from the XOR operation. It's pretty simple: you XOR a zero and a one, you get a one; a one and a zero, a one. Everything else, you get a zero. So how this works is that you might have data on three drives. You XOR the contents of those drives. We're going to go down that first column: one XOR one is zero, then zero XOR zero is zero, so that's how I get the first bit. One and zero is one; one and zero is one. One and zero is one; one and one is zero. That's how I get this information, and I put it on this parity drive. So we did this calculation, and I just stick the result on this drive. Now if we lose one of the source drives, we can take that same XOR operation over the remaining data, and if we do this calculation we get back exactly the data that was on the lost drive. This is very powerful. This is how RAID 5 works, for example. Data is striped across multiple drives, and the parity information for each stripe is stored on one of those drives. So the A blocks are stored on the first three drives with their parity on the fourth; for the B blocks, the parity is on the third disk, and so forth. And so we use this type of paradigm when it comes to filling in these gaps.

One of, I'd say, the most authoritative implementations of this is the par2 command-line tool, which we use to create parity archives. It's very simple. You run the command par2 create, you set a redundancy level with the -r flag, we'll say 10, which gives us 10% redundancy, so 10% can fail. You tell it the name of the PAR2 file to generate, and you point it at the Ubuntu ISO that we're going to distribute. And you get a bunch of other files: not only do you have the split files, you get these PAR2 archive files. The first of them is an index file of hashes. This is where the hashes come in to verify the integrity, so they play a very core role. And the other files hold the parity information that can be used to regenerate missing data. And when we have this, you just run two other commands: you do a verify and then you do a repair. And that's it. You've now recovered that lost Ubuntu release. And the great thing is you don't even need a complete volume set of those parity archives; you can still recover data even with corruption and loss within the PAR2 files themselves. So this is very neat.

So I think the takeaway here is that we should be using parity archives. We can't trust systems like RAID to catch all of the corruption that exists in our systems. And it doesn't cost us a lot to keep this redundant information, which can be distributed or used in the case of what would otherwise be permanent failure.
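Here is a minimal Python sketch of that XOR trick with three made-up "drives". Real PAR2 files use a more sophisticated Reed-Solomon style code plus the hash index described above, but the recover-from-what-remains principle is the same.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Three data "drives" (made-up contents) plus one computed parity block.
d1, d2, d3 = b"first block!", b"second block", b"third block!"
parity = xor_blocks(d1, d2, d3)

# Lose any one drive: XOR-ing everything that remains reproduces it exactly.
recovered = xor_blocks(d1, d3, parity)
assert recovered == d2
```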
Okay, next: BitTorrent. I'm not even going to ask you to raise your hands about this. In 2009, I think it was responsible for 20% of internet traffic. Now that it's sort of fighting with Netflix and YouTube, that's down to 3.35% of internet traffic, but that's still a lot of traffic going across torrent networks. This is a peer-to-peer file sharing protocol, designed and released in 2001. How it works is sort of similar: we split files up and we share with each other the information that we have. So I might have the first few blocks of information; I share those with one person, and they share some of the information they have with another person. This really maximizes bandwidth throughput, and it maximizes the ability to contribute: anyone can take part in the sharing and storage of information.

And where it gets to the core concept of hashes, again, is that a torrent file is really just metadata and a list of hashes. When you give me one of those pieces that you have and I don't, I just look up the hash and see if they match. If they do, I keep that piece. If they don't, I get it from someone else. So this is a very efficient system.

Another role that hashes play in more modern torrent systems is the distributed hash table. This is a widely used data structure that makes it possible to look things up in a decentralized and distributed way. For torrents, for example, I can look up a torrent by its hash, by its identifier, basically, ask a bunch of peers if they have that information, and get their IP addresses back if they do. So we distribute the lookup process. This adds a high degree of fault tolerance and a large degree of scalability; no one server is responsible for serving out all of this information.

Takeaways there: we should use the torrent protocol for data storage, and we should use distributed hash tables to provide lookup for these content-addressable storage systems and peer discovery networks.
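A small Python sketch of that piece-verification idea follows. It's simplified: real .torrent files store raw 20-byte SHA-1 digests inside a bencoded metadata structure, but the keep-it-only-if-the-hash-matches logic is the same.

```python
import hashlib

PIECE_SIZE = 256 * 1024  # pieces are fixed-size chunks; real torrents pick a power of two

def make_piece_hashes(data: bytes) -> list[str]:
    """Split content into pieces and record each piece's SHA-1, like a .torrent does."""
    return [hashlib.sha1(data[i:i + PIECE_SIZE]).hexdigest()
            for i in range(0, len(data), PIECE_SIZE)]

def accept_piece(index: int, piece: bytes, piece_hashes: list[str]) -> bool:
    """Keep a piece a peer sent us only if it matches the published hash."""
    return hashlib.sha1(piece).hexdigest() == piece_hashes[index]

# The downloader checks every received piece against the torrent's hash list;
# a corrupted or forged piece is thrown away and requested from another peer.
```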
Okay, Bitcoin. A cryptocurrency, an electronic payment system, introduced in 2008. It's based on a blockchain, which is a public, distributed, immutable ledger. It can also be thought of as a database, a key-value database. Bitcoins are mined by people who create and verify hashes, and the hash plays a central role throughout the Bitcoin protocol. As of today, one Bitcoin was worth over $1,100 US. Bitcoin, again, is a very benign technology; it's being used for some illegal practices because of the pseudo-anonymity. We saw it being used for the sale and purchase of drugs and whatnot on Silk Road, and for some other sometimes pretty scary services like assassination markets. But you can look that up on your own. There's some really interesting stuff behind it, and one piece of that is the blockchain.

The blockchain is, literally, a chain. The header of block 12 contains the previous block's hash, a timestamp, another hash of the actual transactions, the who sold what to whom, when they sold it, for how much, and then a nonce, some randomness that you'll see in a moment. That previous hash in block 12 points to the hash of block 11's header, which has block 10's hash built into it. And so we create this chain such that if you change anything, the body of the transactions, the header information, the hashes, the whole chain falls apart. We know it won't verify, and then people won't distribute and use it. The body of block 11 also includes those transactions by a hash; this is actually a neat data structure called a Merkle tree, which I'm not going to get into.

But what the miners are doing is hashing these headers, which contain all of this rich information, to verify the data. So mining is actually hashing; it's SHA-256 hashing. Mining requires proof of work, meaning the work was difficult and the result is easily verifiable. We don't want mining to be too easy, because there'd be too much Bitcoin and it would lose all its value, so the protocol sets a difficulty rate. Hashing has a known difficulty and is easily verified, which makes it a good computation for this purpose. So the miners' task, their goal, is to generate a hash of the transaction metadata, those headers, by adding some random data at the end via the nonce, sometimes changing the timestamp just a little bit, such that the resulting hash starts with, for example, 18 zeros. We're trying to meet some difficulty level. I can't just generate a hash with 18 zeros; I have to do a lot of work to find one. And so there are a lot of computers built specifically for hashing; that's the only really viable way to make money right now as a miner. But what they basically do is take the SHA-256 of that metadata, of that header, then add a one, then a two, then a three, and keep going until they find, for example, 18 leading zeros. Last night, miners generated a hash like this. You can see those zeros. Out of all the things they tried, they found one that met this difficulty criterion. They were rewarded with 12.5 Bitcoin, more than $14,000, for finding that hash. This is the work they're doing to add these transactions to the blockchain.

The challenge here is that the blockchain grows and grows and grows. Right now, the Bitcoin blockchain is about 110 gigabytes. That's a pretty sizable amount of information. And it is immutable. There are a couple of attacks that can change history, which I'm not going to go into, but it is very hard to erase anything in particular from the blockchain. You should just consider it practically impossible; it's intended to be persistent in terms of what it holds. And because you can use encoding tricks like the ones used on Usenet, you can put pretty much anything in there. They would just be transactions that don't mean anything, but they can get written into the blockchain. So, for example, there's some pretty bad stuff, links to some illegal things, that exists in the blockchain, which creates some ethical and legal questions around who has that on their computers and what that means. And those things will be there forever.

So, the takeaways here, I think, are that blockchains are great for distributing immutable records. We've been hearing about ideas, for example, of turning data repositories into blockchains: just store the data in a blockchain, store journals in blockchains. I think it's a little quick to do that. There's still work being done in this area in terms of revoking content from blockchains in an efficacious way. And so, instead, I would go back to hashes. If someone puts up a journal article that accidentally includes, for example, personal health information, it would have to be taken down. We have to be able to get it back. And these accidents happen all the time; this isn't intentional stuff of people not wanting to share, these are accidents that we have to be able to deal with. So rather than putting content in the blockchain, and also to reduce the size of that chain, I just recommend storing hashes: things that we can then follow up on, perhaps with persistent identifiers or metadata, to find the content and then guarantee that it is the same content that's referenced. But if it had to be taken down, a notice could be put there, so that you still have the ability to revoke or retract information. And then there's some really interesting stuff with regards to compensation. You can imagine a system where journals extract fees via transactions, say for copyediting or peer review, and people who want to pay for that could do so intrinsically through the blockchain.
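Here is a toy Python sketch pulling those pieces together: blocks that store only a content hash rather than the content itself, chained by the previous block's hash, with a small nonce search standing in for proof of work. The field names and the tiny difficulty are made up; real Bitcoin mining double-hashes a binary header against a numeric target, but the loop has the same shape.

```python
import hashlib, json, time

def mine(block: dict, difficulty: int = 4) -> dict:
    """Increment the nonce until the block's SHA-256 starts with `difficulty` zeros."""
    while True:
        digest = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            block["hash"] = digest
            return block
        block["nonce"] += 1

def add_block(chain: list, content: bytes) -> list:
    """Append a block that records only the hash of the content, not the content itself."""
    block = {
        "prev_hash": chain[-1]["hash"] if chain else "0" * 64,
        "content_hash": hashlib.sha256(content).hexdigest(),
        "timestamp": time.time(),
        "nonce": 0,
    }
    chain.append(mine(block))
    return chain

chain = []
add_block(chain, b"journal article v1")
add_block(chain, b"dataset v2")
# Changing anything in an earlier block breaks every later prev_hash link,
# while the heavyweight content stays outside the chain and can be retracted.
```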
So this has been a sort of whirlwind tour of these technologies and of hashes. I think it's one of my favorite topics in computer science and in the development world. Things I didn't cover: I've left out a lot of details. If you download my slides, you'll see a bunch of slides that I just didn't have time to get through. I didn't talk about hashes as data structures, or hashes in website security, which is a fun topic. I didn't talk about other forms of parity, issues with RAID 5, other types of RAID, file systems that are better at handling data integrity and data corruption, or vulnerabilities of these decentralized and distributed systems. But all of this is interesting stuff, and once you really grasp the core concepts, it's a little more manageable to understand.

What have I stolen? Since I'm being recorded, nothing. No copyrighted material, just ideas. For example, in the Open Science Framework, one of the tools and services that we develop at the Center for Open Science, every version of a file comes with hashes: a fast hash, MD5, and a slow hash, SHA-256. So people can verify that the data is the data they think it is, or was intended to be. We also use a content-addressable storage system. I want people to fork each other's work. I want them to make copies of each other's work and reference it. I want them to register their content, which means making a copy and archiving it. And I want them to do that because it's good for scholarship, but I also want them to do it because it doesn't cost us anything; it's okay if they do. If this weren't a content-addressable system, it would cost us a lot of money. We also store three types of hashes, MD5, SHA-1, and SHA-256, to do different types of auditing procedures. And we store parity archives for every file on the system, so even if the storage systems that we use do have some of this invisible corruption, we should be able to recover that data.

And then with my R&D team, COS Labs, we're exploring the use of the blockchain and torrent protocols for distributing and storing metadata and data. Data storage is a big problem in this space, and I think this could be one of the solutions. Other groups are doing these things too. These protocols can foster sustainability, giving us new ways to pay for these sorts of things or to distribute the cost burden, and they facilitate collaboration and inclusivity by increasing the number of individuals who can contribute. I think these make for very interesting citizen science projects. I think they're interesting ways for people to contribute in ways that they couldn't before. The torrent network is, I think, a good example of that. What can you steal?
Hashes; teach people to check hashes; content-addressable storage; parity archives; blockchains; compensation via these transaction fees; the torrent protocol to increase inclusive contribution; and distributed hash tables for these scalable lookups. That's all I have. You can find the presentation online, and I'm happy to answer questions or talk about this later. Thanks.