Hi, I'm Jeff Spies. I was, well, I'm the co-founder of the Center for Open Science. I recently resigned as its Chief Technology Officer. I'm now doing consulting under this 221B company that I started. I'm also still working on the SHARE project in partnership with the Association of Research Libraries. And I'm a visiting assistant professor in the University of Virginia's Department of Engineering and Society. One of the reasons I left COS was to work on things like this. They weren't really part of our mission, so it wasn't fair to work on them at COS as we tightened our mission and scope with products. And I think some of these things are absolutely critical right now, especially while we're talking about concepts like institutional ownership, and while we have very motivated, very wealthy groups that only know the business model of lock-in rapidly making their way, with copyright assignment agreements, toward these other aspects of scholarship: data, metadata, preprints, ideas, analytics, software, and all this other stuff. How many people know about decentralized technology? A little bit. OK, how many people have heard of torrents? OK, how many people have heard of IPFS? OK, well, good. The Dat project? Any cryptocurrencies? Good. OK, so I would like discussion during this. I would like discussion afterwards. I can get pretty technical on this stuff if there's something you want to know about these things. We can try to tie it to how we can use it. We can try to pull out the pieces that at least I perceive as being the practical and incremental elements. So just interrupt. I don't want to just do a Bitcoin lecture today, so I don't know if you need it. If you don't need it, just give me some sort of cues, nonverbal or verbal, and I will try to understand them. OK. All right, so centralized, decentralized. This is how we've been talking about technology since about 1964, when these sorts of models were proposed.
Obviously, the centralized model has a single point of failure. And the decentralized one still has points of failure at its hubs. And this is sort of the thing the community has been wondering about: is that the right word for that model? Is it distributed? Is it federated? What is that model? In 1964 we certainly had networks, but things have changed, and our ways of talking about them have changed. So this is the actual 1964 instantiation, where we had centralized, decentralized, and distributed. In my research in network analysis and whatnot, I would have probably said this last one is just a highly connected graph, a highly connected decentralized graph. This middle one certainly has some hubs in it, with a very low number of degrees between them and, I guess, this main hub, which from a point-of-failure standpoint doesn't look good. And so I think the community is moving towards this: this one is still obviously centralized, and these are decentralized. And if we're going to call this one something, it's probably closer to federated than just decentralized. And then this distributed one is also decentralized; it's degrees of connection, really, just how connected it is. It's a highly connected graph. So that's the definition phase of the talk. I'm happy to talk about this piece of it. I don't know how interesting it is. If you're interested, though, I will pretend like I'm interested. So, the benefits of decentralization. We had those two models that had one point of failure, and that is sort of the point of decentralization: to get rid of that point of failure. We want fault-tolerant systems. We want nodes to be able to come and go, or be destroyed, and still have the functionality of the network survive. My favorite example of this was sensors thrown into a river, and they go where they go. The whole system can still work, but you're going to lose sensors. Some are going to get caught in the trees. Some are going to get eaten.
Some are just going to go down different forks of the river. But you have this fault tolerance, this ability to withstand loss of some aspects of the network. And the connectedness is actually a really interesting piece of that. This isn't a social network talk, but you may be familiar with Kevin Bacon numbers, or the more academically familiar Erdős number, things like that. Those work on this balance of connectedness and degrees of separation, where you have a lot of hubs, and the hubs are easy to get to and therefore connect you very quickly into another group. These exponential networks, these small-world networks, as they were called; Barabási did a lot of work on this. Those have some interesting properties, and we see them in most human networks. You can actually exploit them pretty well, and exploit them in a positive way, because if you can bring in a hub, you bring in a lot of people very quickly. So from a product point of view, that's one thing we can learn right away from network analysis: get the hubs, and then you will get the audience. That's where to focus time. So, attack resistance: if you take out a node, you don't lose the whole kit and caboodle. You still have something going, some functionality. And collusion resistance: if there's some bad actor trying to cause mischief in your network, those decentralization properties make it such that they can't be as impactful as if there were low connectedness or a single point of failure. And we'll talk about a few of the attacks that come from this. OK, so we hear about this stuff with scholarship pretty much all the time now. And I think we need to be asking ourselves: what are we trying to achieve? Are we trying to achieve properties like this? Is this our goal? Why is that our goal? What problems are we actually trying to solve?
What is the most practical way to get there without having to wait for these technologies to really mature into something we'd want to use with something as important as the world's corpus of knowledge? So keep those questions in mind, and let's talk about that. This is a follow-up to my talk at last spring's CNI meeting, where I went through a few of these topics, so I'm going to repeat a little bit from that. OK, so hash functions. How many people know what a hash function is? How many people don't know what a hash function is? Very good. I'm glad you admit it. Please admit when you don't know something. So a hash is, I think, best summarized as a fingerprint. That's the easiest way to think of it. It's a unique representation of something. My fingerprint is me, but it's a smaller representation. So hash functions map an arbitrary size to a smaller size, ideally much smaller; you can take a book and get it down to a few characters. The output is usually a fixed size: however many characters you're reducing down to, all the hashes are the same length. It's deterministically mapped: if you use an MD5 hash function and I use the MD5 hash function on the same input, we get the same value. It's deterministic. OK, we've talked about fixed size; it's also uniformly distributed. If I have a set of characters that I'm producing for the hash, I have an equal chance of getting ABC as CBA or BCA or whatever. It's a uniform distribution that we map this function to. So it looks, well, this is technically a bad way to describe it, but it looks sort of random. You don't see a relationship between the hashes, even if you change only a single character. And we can see this. My whole story from the last talk was: if some terrible person were removing Oxford commas, and I wanted to catch them when they removed them, I could just hash the text and see a difference. And you can see the hashes are the same size.
There are only two commas removed, but the hashes look very different. That's the uniformity piece. And it's smaller; the hash is smaller text. This is a hexadecimal representation, but it's smaller text. And we could have taken a whole book and caught all of the comma errors. And this is important, actually very important nowadays, because we have so much data. We have RAID systems, hardware that protects against data degradation, data rot, data decay. But because we have so much data, even small probabilities of error on hard drives, even in RAID hardware meant to prevent data degradation, become quite meaningful across large data sets. We will see them come up. And the bad thing is that it's silent; we don't always know it's happened. NetApp found more than 400,000 silent data corruptions across 1.5 million drives over 41 months. 30,000 of those weren't detected by RAID. Now, the wrong bit being flipped in the right header of a file could mean you can't get into that file anymore. You just can't. If it's an encrypted file, you're not getting anything out of it. CERN found the same thing: across 97 petabytes over six months, about 120 megabytes of data was permanently gone. That doesn't seem like a lot, but again, these little bits of error spread across many files can have devastating consequences. So, in general, if you're going to generate a file for download, also generate the hash, so that when someone downloads that file, they can compare the hash and make sure they got the right thing. Make sure there wasn't a transmission error. Make sure a year later, if they're reproducing a study, that it's the same data set they're reproducing, and it's not that some bits got flipped and that's why it's not reproducible. Hashes are useful ways to do that.
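The Oxford comma story is easy to reproduce with Python's standard hashlib; the sentences below are made up for illustration, but the properties (fixed length, determinism, wildly different output for a tiny change) are exactly the ones described above:

```python
import hashlib

# Hypothetical sentences; the only difference is one removed Oxford comma.
original = b"We collected apples, oranges, and pears."
no_oxford = b"We collected apples, oranges and pears."

h1 = hashlib.md5(original).hexdigest()
h2 = hashlib.md5(no_oxford).hexdigest()

print(h1)
print(h2)
print(len(h1) == len(h2))  # True: fixed-size output (32 hex characters for MD5)
print(h1 != h2)            # True: a single removed comma is detectable
```

Running the hash twice on the same bytes always gives the same value, which is what makes publishing a hash alongside a download useful for integrity checks.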
Another useful thing with hashes that I don't see a lot in some of our data repositories is this idea of content-addressable storage. We now have a way to reduce a big file down to a few characters, relatively uniquely. I didn't mention this before: the length of these hashes gives me some certainty that they're going to be unique. The longer they are, the fewer collisions; the shorter they are, most likely, the more collisions. And some hash functions are just attackable; you don't use those anymore. This collision resistance is part of being a good cryptographic hash. We don't want to take one piece of text and another piece of text and get the same hash. That's what we're trying to avoid, and typically, the longer the hash, the less chance of that. With large hashes, collisions would be very, very difficult. OK, so if we have this uniqueness, we can just name the files with the hash. And then whenever we have any duplication of data, we store it once, and we keep some metadata that says marysdata.csv is 23c1d, but chris-marysdata.csv is also the same thing. We stored it once; we can represent it virtually twice. That's not a big deal. And so as we think about reproducibility and replicability and version control and all of these things, it's not bad to actually keep copies. It's not bad to keep that provenance around, because we can do things like this. OK, so different types of hashes. MD5, I mentioned, is a 128-bit hash, so there are 2^128 possible values for that hash. That's a lot of possible values. And as we go up to 512 bits, and most people use the 256-bit or 512-bit versions, it's a lot of potential unique strings of characters. The collision rate is fairly low. If you want to know what that collision rate is, you can look at the birthday attack; it will tell you the probability with which you might get one of these collisions.
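The content-addressable idea above can be sketched in a few lines. This is a toy in-memory store, not any particular repository's implementation; the filenames echo the marysdata.csv example, and the class and method names are mine:

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: blobs keyed by their own hash."""
    def __init__(self):
        self.blobs = {}   # hash -> bytes: each unique blob stored exactly once
        self.names = {}   # filename -> hash: cheap "virtual" copies

    def put(self, filename, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data   # duplicates land on the same key
        self.names[filename] = digest
        return digest

    def get(self, filename):
        return self.blobs[self.names[filename]]

store = ContentStore()
a = store.put("marysdata.csv", b"x,y\n1,2\n")
b = store.put("chris-marysdata.csv", b"x,y\n1,2\n")  # same bytes, new name

print(a == b)            # True: the name IS the content's hash
print(len(store.blobs))  # 1: stored once, represented twice
```

Deduplication falls out for free, and the hash doubles as an integrity check on retrieval.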
We typically don't use those older hashes anymore for anything security-related, because they have specific attacks against them. NIST recommends that we be in the SHA-2 space. OK, so two technical, two non-technical, where am I at? You're my favorite audience member so far. OK, so what is this? Does anybody recognize this? Anybody? OK, it's a NASDAQ chart. OK, you, I think, indicated you know what cryptocurrencies are. You may have heard of Bitcoin. This is when the Long Island Iced Tea Corporation changed its name to Long Blockchain Corp and saw an immediate spike in its price. They make tea, and the tea sort of looks like a blockchain, I guess, in that picture. I think, as of yesterday, they were booted from the NASDAQ, and you can't get this data anymore; I had to find a picture of it, because they are not doing so well. But we saw this huge increase in price because of the word blockchain. For tea. And I think this may explain why we're hearing it pretty much constantly when talking about science and scholarship. Do you hear it a lot? Are any of you working on blockchains for science or scholarship? Michael, yes. And I'm sure yours is fantastic; I bet I want to talk to you about it. Others probably are just tagging blockchain onto things and thinking, well, blockchain, that's it. We're hearing about putting data on the blockchain, journals on the blockchain, preprints on the blockchain, registrations on the blockchain, all of these things on the blockchain. And I don't know if we really know what that means or why we're doing it. Sort of like Long Blockchain Corp. Here's an interesting one: a journal, an open-access, non-APC journal about blockchains, that gets put on the blockchain. So along with Michael's, these two are very good examples of using the blockchain for scholarship. You may have seen Digital Science's white paper on blockchains for research. They reviewed a lot of this stuff.
It was a fine piece, a lot of blockchain talk. When I was at COS, I put an intern on a blockchain project for a week or two, just so when people asked me if I was working on blockchain, I could say, yes, it's in R&D, and then continue on. OK, so Bitcoin and the blockchain. You know what Bitcoin is. I found, by the way, I don't know if you've had the experience of finding an old wallet with cash in it. I found an old wallet with cash in it, and that was very exciting. It wasn't a lot, but it was still very exciting. So the blockchain is a public, distributed, immutable ledger, or database. And those are some interesting words: a ledger, a database; public, distributed, immutable. We'll parse those apart. Bitcoins are mined by people who create and verify hashes. This hashing thing comes up a lot when we talk about distributed and decentralized technology; that's why it really is something to get your head around, because it is so important in this stuff. Today Bitcoin was around $6,800, down quite a bit, luckily, from when I found my wallet. So the simplest way to explain a blockchain is that it's sort of like hashes of hashes. There's some information in these block headers, one piece of which is the previous hash: the hash of the previous block's header, which itself includes a hash of that block's body. And so any tampering with any of these things is going to change the hash of one of them, which means the whole thing falls apart; nothing would validate. To show this a little more confusingly: if hash 11 is the hash of the hash of header 10 plus the metadata from block 11, we can keep going down. We replace header 10 with the hash of header 9 plus the metadata of block 10; metadata 11 stays in there; we hash all of that. If we break header 9 down, it's the hash of the hash of header 8 plus the metadata of 9, plus the metadata of 10, plus the metadata of 11, all hashed. It's this hashing of hashing.
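The hashes-of-hashes structure just described can be sketched in a few lines of Python. This is a caricature of a block header chain, not Bitcoin's actual serialization; the field names are mine:

```python
import hashlib, json

def block_hash(prev_hash, data):
    """Header hash covers the previous header's hash plus this block's data."""
    return hashlib.sha256((prev_hash + json.dumps(data)).encode()).hexdigest()

def build_chain(records):
    chain, prev = [], "0" * 64  # all-zero predecessor for the genesis block
    for data in records:
        h = block_hash(prev, data)
        chain.append({"prev": prev, "data": data, "hash": h})
        prev = h
    return chain

def valid(chain):
    prev = "0" * 64
    for block in chain:
        if block["prev"] != prev or block_hash(prev, block["data"]) != block["hash"]:
            return False
        prev = block["hash"]
    return True

chain = build_chain([{"tx": 1}, {"tx": 2}, {"tx": 3}])
print(valid(chain))           # True
chain[1]["data"]["tx"] = 99   # tamper with an early block...
print(valid(chain))           # False: nothing after it validates anymore
```

Changing one block's data changes its hash, which breaks the "prev" link of every subsequent block, which is exactly the tamper-evidence property discussed next.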
And that provides some interesting features around tampering. This is the idea of decentralization where we're getting fault tolerance, safety from collusion, and these sorts of things. If you change the content of a block, its hash changes, and therefore the content of every subsequent block changes as well. This makes the data in the blockchain immutable. And that immutability, I think, is what people who go beyond just tagging blockchain onto something are actually thinking about. It would be nice to put our data somewhere immutable when we have administrations removing it from public websites and whatnot. That seems like a good thing to do with an immutable data structure: it can never go away. You're going to see this keep happening; the PowerPoint is on automatic mode, and I'll just keep clicking back. So, where were we? OK. Registrations: clinical-trial-like registrations, where you pre-register something and time-stamp it, and you don't want that to change. Again, a nice feature of the blockchain is that you have this immutability, and you have the time stamps. And so then we can go back and say, well, did you analyze other variables that you didn't mention in your analysis plan? Did you not look at some that you did mention? This is actually a big problem in the clinical trial space. Comparetrials.org is a nice site that shows that even with the great work that clinicaltrials.gov has done, we still have a lot of data exploration, a lot of p-hacking going on, in an area that's supposed to save lives and that we really should be taking very seriously. So I like it; it makes a lot of sense. OK. But is the blockchain really, actually immutable? It's immutable by consensus. And that's a phrase I want you to keep in mind, because we'll talk about it more. Anytime the blockchain changes, including additions, that happens by consensus.
We have to agree that we're going to change the blockchain. And this happens, for example, with what are called hard and soft forks: I fork the blockchain, creating a new branch of it and starting from there. We recently saw this on the Ethereum blockchain, with a vulnerability that lost $250 million worth of Ether. We can say, OK, people don't like that, we're going to roll it back. I'm going to convince the community to go along with me, and if I get enough people to say, yep, this is the way we're doing it, then we can start on this new branch and erase those transactions. You can't change something deep in the blockchain, because that changes everything after it; you'd have to propagate those changes and then get people to agree. But we can roll back pretty easily by doing this fork. It's not easy, and it's not desirable. Even in this loss of $250 million, Ethereum now has three forks, because there were forks that continued on; the idealists said, no, we shouldn't be forking this. This is the way it was built, and this is it. So it's not an easy thing to do, but you can do it. So there's a little bit of mutability there. The other piece of this is proof of work. We make it difficult to get Bitcoin: you have to do this mining on expensive equipment. We make it difficult for a reason, but we make it very easily verifiable. We just calculate the hash to make sure that, if we found the block we wanted, we found the hash we wanted. Mining is hashing, and we know how hard hashing is. We know how hard it is to SHA-256 something, and how easy it is to verify. So what mining actually is, is calculating a bunch of hashes until you find one that meets a certain criterion. And this proof of work gives us some safety. The safety comes from these typical attacks. Actually, these forks happen all the time in the blockchain.
You get two people who claim credit for finding the hash that meets the criteria. They want the 12.5 Bitcoin you get from that. So this person says, I have it, and that person says they have it. And we deal with that by consensus again. And this is an easy consensus: you ask who has the longest chain of these calculated hashes, and you take that one. That one wins. So this happens all the time, and we deal with it; there are ways to do that, and I won't get into the details. The interesting thing is this combination of proof of work, which is hard, plus this conflict resolution. In a distributed system, we're going to have conflict. We're going to disagree about things, because we're working on things separately; we might not be connected. So we're going to have these conflicts, and we have to come to consensus. From an attack point of view, because you have to produce the longest chain, anyone who really wants to attack the blockchain has to have more power than everybody else. They have to be able to hash faster than everybody else. They need to capture a majority of the hashing rate, which is why it's called a 51% attack. They have to capture that much so that they can more rapidly calculate those hashes, get the longer chain, and win the consensus battle. So there are these protections built in, and that is not an easy thing to do. To get the majority of computational power in Bitcoin right now would cost about $4.5 billion. So it's not going to be just a bunch of hackers getting together and saying, we're going to do this. It would need to be something like a government. And if you look at how much governments spend on banking and whatnot, this is something a government could do. And so we might need to be concerned, or maybe we don't.
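Mining-as-hash-search, as described above, can be sketched very simply: try nonces until the hash of the header plus nonce meets a criterion (here, a few leading zeros; real Bitcoin compares against a numeric target, and this header string is made up):

```python
import hashlib

def mine(header, difficulty=4):
    """Search for a nonce whose hash starts with `difficulty` zeros.
    Hard to find (many hashes), trivial for anyone else to verify (one hash)."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{header}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

nonce, digest = mine("block-42-header")
print(digest.startswith("0000"))  # True, and verifying took a single SHA-256
```

The asymmetry is the whole point: finding the nonce takes on average tens of thousands of hashes at this toy difficulty (and astronomically more on the real network), but checking someone's claimed answer is one hash call.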
It's something for us to think about in scholarship, because there are these aspects we're going to need to protect. It's an important thing. We don't want this immutability to be in question, if that's really what we're trying to get out of it. So how do we put those protections in place? How do we think about these things? There's a lot of thought right now about not proof of work but proof of stake, basically ownership of coin. If you own most of the coin, you can move the market anyway, and you're probably very rich, so it probably doesn't make sense for you to actually attack it. People like that type of proof a little better. But these are the things we need to think about if we're going to be dumping all of our data onto the blockchain. OK, this is a good article; you should give it a read if you haven't seen it. Gideon Greenspan is actually quite thoughtful about this stuff. He has an interesting piece on what immutability is in the blockchain. OK, so if the blockchain provides immutability such that things cannot be removed, it's pretty permanent; it's pretty hard to remove something from the blockchain. And if we start putting all of our stuff there, well, how do we deal with this, for example? This is how large the blockchain is right now: 163 gigabytes. And we all should have that on our computers if we're going to be putting stuff in the blockchain. Ideally, everybody has it. There are ways for not everybody to have all of it, but still, some people are going to have all of it if you want that auditable trail. And by the way, if you want to look up data from earlier in the blockchain, you'd better have all of that data. So it gets very complicated for us. In a financial system, money comes in and out, and you often don't need the transactions from way back. But we like to cite things from a few years ago. So I think this could be a problem for us.
So this is a simple one. What about the regulation coming out of the EU this spring, the GDPR? Are you familiar with this? The right to be forgotten? How are you to be forgotten in an immutable system that can't be changed, or that makes it very hard to change? Something to think about. Some people claim it's not an issue. Accenture, for example, says this is a huge issue: we couldn't use this in our banking system because of that need. OK, how about mischief? This is what Accenture calls it. I don't know if I'd use the word mischief, but I can't think of a better word. There is pornography, child pornography, classified documents, all sorts of stuff in the blockchain. You can just encode it into the format it needs and put it in there. Now we all have the blockchain on our computers. So are we, yeah, legally, are we liable? I don't know how our general counsel is going to like all of our campus desktops having this stuff on them. Well, I have a feeling what they might say. So we have some implications there. The last one I think is the most interesting; these other ones are important, but this one is the most interesting. Ethically, can we expect end users, and really, can we expect humans, to understand true immutability? Is that a concept that philosophically we can understand? It is permanent, and not just sort of permanent, where you can ask people to take things down, but there-forever permanent. I have a hard time believing that our users, who don't like to remember their passwords, are going to be able to understand this sort of thing. And I don't even know if I can understand it, and I've been in this space for a while. Give me some indication of how this is landing, or I'm just going to, I don't know what. OK, there is a way around this, if we still like the blockchain. There's an interesting piece, Accenture funded it: we can use these chameleon hash functions.
It basically allows you to have a key that lets you create collisions. So when I need to replace a hash in the blockchain, I can actually create another piece of data that matches that hash, stick that in there, and we're good to go. It just continues on; nothing needs to know. There are no forks, no rollbacks. They're really, really interesting; the math is interesting. It's called a trapdoor key. The question, though, is who we trust with the trapdoor key. There are ideas to distribute that too, using secure multi-party computation to reassemble it, where all of us have a little bit of the key and we all put it in together and turn it. It's an interesting proposal. Here's the paper; it's a really good one. But who do you give this to? So trust is an important element. With all these things in mind, I don't know how we can store data in the blockchain right now. I think we might be able to store identifiers to data, maybe hashes of things, and have those hashes pull out of a content-addressable system somewhere, so that we can always change it, remove it, handle a takedown notice when something is copyrighted and we need to get it out of there. But I don't know if we should be dumping data and journals and preprints and all this into the blockchain. Do you really need a blockchain for that? The conclusion is: unless you meet these five to eight criteria, you probably don't, and you can just use regular file storage, a centralized database, a database with replication, or subscription databases. OK. So I'm going to fly through a bunch of technology here that might incrementally get us some of these pieces. We just need to figure out what is right for scholarship, what is right for our groups. OK. So rsync, my favorite command-line tool. It's a fast, incremental file transfer, backup, cloning, and mirroring program.
If you want a distributed system, you could just pass files back and forth every night between repositories, and you get some of those features of distribution and decentralization. I give you my SSH key; we now trust each other. I take your stuff, you take mine. In fact, you can even set it up to back things up, so that if there are changes, the old versions go into another directory. You can have websites generated off this that allow you to go back and download those old versions. It's really nice stuff, and it would work pretty easily. So if all we need is mirroring and backups, I don't know if we should go with the blockchain when we have rsync, and we could probably set this up in 20 minutes. You can add parity files. And it is not what you think. I'm below 50-50 for laughs on that one, so it's leaving my slide deck. That was it, though. That's done; that one's gone. Parity is what's in RAID systems: you can lose pieces of data and recover them if you have these parity files. I think I blew the punchline. Parity computation; that's why you didn't laugh. It's parity computation. I'm reversing my decision, and we'll try it one more time. OK, so yes? Oh, it's immutable? OK, well, we'll see if we can come to some consensus later. So there's another tool, my second favorite command-line tool: par2. You can generate these parity files and just send them along with your files. And as long as you don't lose more than, for example, 5% of the file, as long as that much isn't corrupted or lost, you can recover it by running some simple commands. So it's a really simple way to protect even against those silent disk errors, and it doesn't cost that much in terms of storage. We get a lot of the value of decentralization out of parity files. Give them to someone else; back them up somewhere else.
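par2 itself uses Reed-Solomon codes, but the core parity idea can be illustrated with plain XOR, the same trick RAID 5 uses: keep one parity chunk, and you can reconstruct any single lost chunk from the survivors. A minimal sketch, with made-up chunk contents:

```python
from functools import reduce

def xor_parity(chunks):
    """One parity chunk, RAID-5 style: the byte-wise XOR of all chunks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

data = [b"AAAA", b"BBBB", b"CCCC"]   # equal-size chunks of a file
parity = xor_parity(data)            # store or ship this alongside the data

# Simulate losing chunk 1, then recover it from the survivors plus parity:
recovered = xor_parity([data[0], data[2], parity])
print(recovered == data[1])  # True
```

This recovers exactly one missing chunk; Reed-Solomon schemes like par2's generalize the idea so you can lose several chunks (that 5% figure) and still rebuild.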
Three places, geographically distributed in case of nuclear attack, and we're good. You don't have to worry then; you don't have to worry about RAID errors. I don't know if we need the blockchain, if that's what you're going after. We have another, very similar system: LOCKSS. I'm sure you're familiar with it, so I'm just going to skip over it. They do some interesting consensus stuff, by the way: if there's a conflict, if someone says, I have a file and this is its hash, and you say you have the file and that's its hash, we sample the trusted network and we have a vote. Majority wins. Then there's Git, a distributed version control system. I keep a copy, a clone of a repository; you keep a clone. We're working independently, and sometimes we have these conflicts, these merge conflicts. What's really interesting is that the data structures Git uses are hashes of hashes. And I think just about every system we design, any data management system, can probably use hashes of hashes. It's a nice way to tie provenance together very strongly. You don't necessarily have to keep the data, but you keep the hashes of the hashes, and you know then that this thing you had to remove for some reason was in the provenance of this other object. OK, commits are stored the same way. It's content-addressable; those are the hashes stored in the file system. You have these trees of hashes. So the name of that top tree is 3C4EDC: that is the hash of those other three hashes, the children's hashes. And then 8D8329F is the hash of all its children, or its leaves. So we can create trees that show different branches of provenance, different people working on things, and how they manage conflict. Really interesting from a provenance standpoint. Commits, like I said, work the same way. You actually see the changes: you said one thing, I said another.
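Those Git-style trees of hashes can be sketched as a tiny Merkle tree. This is not Git's actual object format, just the naming-by-children's-hashes idea; the file contents are made up:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def tree_hash(child_hashes):
    """A tree node's name is the hash of its children's hashes, Merkle-style."""
    return h("".join(child_hashes).encode())

leaves = [h(b"file-a"), h(b"file-b"), h(b"file-c")]
root = tree_hash(leaves)   # the "name" of the whole tree

# Verification: do the children still sum up to the parent?
print(tree_hash(leaves) == root)   # True: nothing changed
leaves[0] = h(b"file-a-tampered")
print(tree_hash(leaves) == root)   # False: editing any leaf changes the root
```

This is why handing someone a single root hash is enough to certify an entire history: any change anywhere below it is detectable by recomputing upward.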
Even a third person can come in and say they have another piece of this, and I can then incorporate several different pieces of different branches in different ways. And I have a record of all of that, and I have a way to give you a hash and say, this is your record. If you ever question whether I changed it, you just check the hash, and we see whether the children sum up to the parent, and the children's children sum up to that child. So again, very nice. The interesting thing about this one is that there is sort of a central authority: someone gets to deploy the code. I make the decision on what's right; I do a code review and say, OK, this branch is good, this one's not. So it's mostly distributed, it allows for distribution, but there is usually a centralized point with most code bases. You do have forks, and you do have independent work and all of that, but there is usually some involvement from a central authority. OK, NoSQL databases that can span multiple data sites, multiple data centers, multiple data warehouses. It just works out of the box; they do this stuff for you. Couchbase is a merger of CouchDB and Membase. Interesting stuff, real-time. Graph databases do the same thing. I could be in two places in the world and have all my data distributed. This one's really nice: it does automatic sharding and rebalancing and all of that; it just does it. CockroachDB, if we like SQL. You can install it in many different places, and you can actually run transactions, transaction-based queries. What that means is, I know what I'm getting, and if there's a conflict, I know what happens. It's strongly consistent. So if you're operating in one data center and I'm operating in another and I make a query, I'm guaranteed to get a resolved data set. Other systems are eventually consistent, where I might ask for some data and pull it out of my database, and it's a version I had changed, but you've changed yours.
And then eventually those two things will become consistent. This one is consistent right off the bat; it uses a multi-phase transaction protocol. Really neat. Those are the more typical databases. A little less typical is GunDB. It's a graph database, and it's esoteric: it lives in the browser. If you touch a web page with this database embedded in it, you are now hosting the database in a peer-to-peer way. It can be centralized, distributed, decentralized, it doesn't matter, and it's real-time. It's interesting. You know what BitTorrent is: pieces of files spread across a bunch of nodes, with metadata that includes lists of hashes. To get a nicely decentralized torrent system, you need to move away from centralized trackers. I need to find the peers who have this data. You could just tell me who my peers are, or I can start with a seed set of peers, my peers tell me who their peers are, and I build up a distributed hash table, a data structure again built on hashes of hashes. Then I can very efficiently query for who has a certain Linux ISO, or movie, or TV show, or whatever else you do with torrents. So we can get pretty nice decentralization out of this data structure. I'm hand-waving a lot of the details, but it's pretty fault-tolerant, pretty scalable, and we'll see it come up again. WebTorrent is really neat: if you go to webtorrent.io, you will start finding peers from a DHT and downloading movies from each other in the browser. It's very cool. It's built on WebRTC. Neat stuff. Now we get to what I think is a really interesting space when it comes to what the future is going to look like. I asked earlier whether you knew about IPFS, and I saw a lot of hands. It's a decentralized file system, a decentralized web. It's being talked about as the next HTTP, the next TCP/IP, the next way we do the internet. I don't know about that, but it's interesting.
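As a toy illustration of the tracker-free peer lookup just described, here is a minimal sketch of Kademlia-style XOR routing, the scheme BitTorrent's trackerless DHT uses. The peer names and content name are made up, and a real DHT spreads the routing table across nodes rather than holding it all in one dict.

```python
import hashlib

def to_key(name: str) -> int:
    # Node IDs and content keys live in the same 160-bit hash space.
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

# Hypothetical peers; in a real DHT each node only knows a few others.
nodes = {to_key(f"peer-{i}"): f"peer-{i}" for i in range(8)}

def responsible_node(content_key: int) -> str:
    # Kademlia routes toward the node whose ID has the smallest XOR
    # distance to the key; that node tracks who holds the content.
    return nodes[min(nodes, key=lambda node_id: node_id ^ content_key)]

key = to_key("some-linux.iso")
print(responsible_node(key))
```

The XOR metric is what makes the lookup efficient: each routing hop can halve the remaining distance, so queries take a logarithmic number of steps even in a large swarm.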
Data is stored in chunks, and it's content addressable, the same things we've seen before. The hashes are SHA-256s. There's some interesting stuff about passing metadata about the system inside the hash. I won't get into the details, but if you're interested, they're called unsigned varints with prefix continuation bits, and they let different people be on different versions of the system and still know what they're getting, without checking with a centralized authority. One of the hard parts of a decentralized system is naming. A name in IPFS is the hash of a public key. I generate a keypair, a public key I can share with you, and we take the hash of that, and that hash is now a unique representation of me, basically. My website would be that long string of characters, and that's how you find my stuff. But as you surf, you start collecting some of those pieces, and then you are hosting those things; so if I find you as a peer in that DHT, you might host my site for someone else. I had started developing specifications for a system like this, and as I dug deeper into a few of these, I realized I was developing what was essentially IPFS. They use Git-like data structures to represent their data, so they have provenance and version control built in, in a very similar way to Git, and they use those prefix continuation bits to pass around system information. But the thing I don't love about it is that they are very, very committed to objects being permanent. This is an immutable file system; they do not want these objects to mutate. You have different pointers, but the objects are there forever: if you know the ID, you can always get to the object in the system. Then there's the Dat project, which is very, very similar. I find a little more alignment with IPFS, but I need to dig a little deeper into it. It started off as sort of a GitHub for data.
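The naming scheme just described, a peer name derived by hashing a public key, can be sketched as follows. The key bytes are a made-up stand-in (real peers use the serialized bytes of an RSA or Ed25519 key), and IPFS actually encodes the digest as a base58 multihash rather than plain hex.

```python
import hashlib

# Hypothetical serialized public key bytes; a real system would use the
# DER/PEM encoding of an actual RSA or Ed25519 public key.
public_key = b"example public key bytes for jeff"

# The peer's name is the hash of its public key. Anyone who later
# receives the key can re-hash it and confirm it matches the name,
# with no central registry involved.
peer_name = hashlib.sha256(public_key).hexdigest()
print(peer_name)
```

That "long string of characters" is self-certifying: the name itself proves which key it belongs to.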
And for those reasons, I don't know if we can put people's content in there without the ability to take it down. I don't know if we can keep bad actors from putting content in there. And I don't know how to deal with the new EU laws. Yeah. And I think we can achieve what we want in somewhat safer ways, because one of the important things we have going for us is that we could probably define a network of trust. We know each other. We can make deals in person. There is some benefit to the idea of federation, to these governance models. We have some shared values, a lot of shared values, and there are enough of us that we might be able to rally consensus for something if we can talk to each other. So I don't know if we need content to just live forever anywhere. I like the inclusivity of it. I want people to be able to participate in this stuff. Anybody should be able to host these things, come on and help with metadata, help with these things. But I think we might need to fall back to this trust network, where I can just make some calls and say, we've got to get this piece of data out of our systems. And maybe the other players involved don't have access to that type of data, or there's a curation process; there are many solutions for it. But I don't know if we're there with true immutability. Okay, so lock-in business models, I'll skip that. It's important that we own our work. It's important that we have the protections that decentralization provides. So how do I know whom to trust? How do I know what content is trusted? It's public-private key digital signatures. I can share with you a public key that's matched to a private key, and if I sign an object with my private key, you can tell that I signed it, because it will resolve with the public key: if you try to decrypt the signature with the public key, it will decrypt. And what we actually do is sign hashes.
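Signing hashes can be sketched with textbook RSA. This is a deliberately tiny toy keypair for illustration only; real systems use 2048-bit-plus keys or Ed25519 through a vetted library, never hand-rolled crypto. The document bytes are made up.

```python
import hashlib

# Toy textbook-RSA keypair (tiny primes, illustration only).
p, q = 61, 53
n = p * q            # public modulus, 3233
e = 17               # public exponent
d = 2753             # private exponent: (e * d) % ((p-1)*(q-1)) == 1

def sign(document: bytes) -> int:
    # Sign the *hash* of the document, not the document itself.
    digest = int.from_bytes(hashlib.sha256(document).digest(), "big") % n
    return pow(digest, d, n)   # "encrypt" the hash with the private key

def verify(document: bytes, signature: int) -> bool:
    digest = int.from_bytes(hashlib.sha256(document).digest(), "big") % n
    return pow(signature, e, n) == digest  # "decrypt" with the public key

doc = b"dataset v1, from jeffspies.com"
sig = sign(doc)
assert verify(doc, sig)  # the hashes match, so the signature resolves
# A tampered document hashes differently, so verification fails on it.
```

Hashing first is what keeps signatures cheap: the expensive public-key operation runs over a short digest, not the whole file.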
So if I tell you I have this file and it's from me, and you ask, how am I sure it's you? You don't see me; it's emailed to you, sent over the wire. Well, you can find my public key. I've probably given it to you before; I have it on my website. And you can then decrypt the signature and see that the hashes match. A lot of the systems I've mentioned have this built into them. This trust piece is a really important part of all this. Where can I store that public key? Well, I can give it to you with every piece of data, but then you've got to confirm that it's actually mine, or keep track of it in a cache. Or you can use another distributed system that also has a sort of centralized root: DNS. Now, people would say DNS is not decentralized. You have ICANN, which hands out domain names, and they could be bought, and they've done bad things in the past, and they're aligned with GoDaddy and all of this stuff. So yeah, maybe it's not totally distributed, not totally decentralized; you do have a point of failure. But it's been working out okay for a while. So if I'm weighing true immutability and all these problems, maybe I can make the concession that we store these public keys in DNS. On a domain name I buy, jeffspies.com, check it out, I can put my key into that distributed server network. In fact, when I sign up for Google Mail, I put verification keys in there so Google can check that I should be the only one with access to my domain's DNS entries. They just look that up: you do an nslookup and you find the key. So I can just put my public key in there. And there are some interesting protections we can put in place on top of this, because people do like to forget their passwords.
In this system of trust, I can generate two keypairs. One I use for signing; the other key I either just write down and put in a bank vault, or give to another trusted party, such that if my signing key were ever lost, they can come along and say, well, I have Jeff's other key. And because we have this trust network, people might believe that Michael and I did exchange keys in the past, and that he's not just trying to take over all my data and metadata. The naming piece, I think, is really interesting, because naming is hard in a decentralized system. Yeah, so I would make the two public keys available. One of the private keys needs to sit on my server so I can sign things as I push them out into the repository. The other private key I would keep to the side, never on the server, never in RAM, so there's no opportunity for malicious behavior, unless I forgot my first private key, or someone stole it, and I needed to come in and say: this is me; you remember, I gave the backup to Michael; Michael can verify that; he can sign something with it. That sort of thing. People don't like not being able to recover their passwords. And people don't use LastPass; people don't use 1Password. If you're not using LastPass or 1Password, you should look those tools up, and every website should have a unique, long, complicated password. Okay, naming is hard. I could use UUIDs. Have you heard of UUIDs? These universally unique identifiers. There are four or more versions of them; some have time built into them, some incorporate your machine's network address, and they try to create some uniqueness that way. But time is a weird one, because we're all operating in different places in the world and our clocks aren't always the same, so there's some weirdness there. And this isn't just about UUIDs: whenever time is involved in any of this stuff, clocks agreeing is actually really important.
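The two flavors mentioned can be seen with Python's standard uuid module: version 1 mixes a timestamp and the host's node address into the ID, while version 4 is purely random and sidesteps the clock problem entirely.

```python
import uuid

# Version 1: timestamp plus the host's node (usually MAC) address.
# Uniqueness leans on clocks behaving, which is the weirdness above.
time_based = uuid.uuid1()

# Version 4: 122 random bits; no clock involved, collisions negligible,
# and nothing about the host or the time leaks into the identifier.
random_based = uuid.uuid4()

print(time_based.version, random_based.version)  # 1 4
```

For decentralized naming, version 4 is usually the safer default, since it asks nothing of anyone's clock.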
And so can we trust that? Are we all on an atomic clock? Do we all have a hardware link to a pole on the building that makes sure we're getting high-resolution timing? I've had to deal with this when I worked in video. I mean, it's not fun. So naming gets hard for some of these reasons. DNS? Decentralized-ish, okay. So we have the need to name things in the system, so we can share data and share metadata. Could we create identifiers in a decentralized way? Yes, absolutely, we can and we do. The handle system is a distributed system, and DOIs are handles; it's the same deal. But URLs? We're told never to use them, always use a DOI, because you don't know whether a URL will persist. But if you have a really permanent system, why not? If I can sign things, if I can guarantee that jeffspies.com is me, and that if I ever go away, if I ever don't pay my $10 to renew the domain name, Michael is going to stand those things up for me, then why not call me jeffspies.com in your system and let me generate my own IDs? I don't know why we don't do that, necessarily. It's free; it's not consortium-run, and all those things. And I like these groups; I'm not too worried about these groups right now. But it doesn't seem very inclusive. Some of these things cost a lot of money to participate in, and I think that might create exclusion criteria for certain people who want to participate in the process. And it doesn't have to be that way. It could be $10 a year. How do I know whom to trust? If I know their name and I have a signature, or maybe I've trusted them before and I have that information cached, that's whom I trust. How do I know to trust their content? Because it's signed, with their name attached, and I can pull their public key and check that the hash of the data they're giving me was signed by them. So trust can be dealt with in this way. And if we have a system where we sort of trust each other to begin with, why don't we sign signatures in the same way that we hash hashes?
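One hedged way to picture "signing signatures the way we hash hashes" is a tamper-evident log of trust statements, where each entry commits to the hash of everything before it. The names here are hypothetical, and a real system would additionally sign each entry with the author's private key rather than relying on hashes alone.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

# Hypothetical trust statements; each log entry carries the hash of the
# chain so far, Git-style, so the history is tamper-evident.
statements = [b"jeffspies.com trusts michael.example",
              b"jeffspies.com trusts michael.example",
              b"michael.example now hosts jeffspies.com content"]

log = []
head = h(b"genesis")
for statement in statements:
    log.append({"prev": head, "statement": statement})
    head = h(head.encode() + statement)

def chain_valid(log, final_hash):
    # Replay the hashes; editing or reordering any statement breaks it.
    running = h(b"genesis")
    for entry in log:
        if entry["prev"] != running:
            return False
        running = h(running.encode() + entry["statement"])
    return running == final_hash

assert chain_valid(log, head)
```

Given such a log, anyone can later check whether Jeff really did keep endorsing Michael before Michael started hosting his content.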
And you can use Merkle trees of signatures rather than just Merkle trees of hashes, these summed hashes. Then if there were ever a question, say, did Michael take over all of Jeff's stuff?, we can ask: did Jeff trust Michael? We can look at the provenance, stored in something like a Git tree, all hash of hash, and see that Jeff kept signing signatures saying he trusted Michael. So it probably makes sense that Michael is taking this over; something happened to Jeff, and Michael is going to host his content for a while. We can use the fact that we are a social network, that we do know each other, that we have some trust. We don't necessarily need full decentralization, and we can even use some of these decentralization techniques to add a little more strength to that. Okay, what I'm doing now is taking all of this stuff, taking these pieces, and thinking about this. I met George Strahan, and he put me onto some of the Digital Object Architecture (DOA) material, which I hadn't read in a long time. I'm talking next week to Bob Kahn, who came up with DOA. They had a lot of these ideas in the past: they separated objects, identifiers, registries (which hold the metadata), and repositories (which hold the data), and they put public signatures in there. I think there are elements we can bring in from these new decentralized technologies, but I think this is a really interesting use case for what right now we mostly know as the handle system, which was pretty well specified in terms of how these things could operate together. So that's what I'm going to be focused on in some of my own projects in the near future. I'm sorry, DOA is the Digital Object Architecture. Bob Kahn is one of the fathers of TCP/IP, a father-of-the-internet sort of figure. He and others were very thoughtful about what the next stage of the internet could look like beyond that low-level TCP/IP. Larry, the CTO there, has been doing great work.
And it was this model, but without some of the advances we've since made in the decentralized space and in the speed of the internet and things like that. Okay, we have just a couple of minutes. Any questions, comments? Yep, thank you.