Well, welcome back, everybody, to the last lecture of 162. This is kind of a special lecture. I did get some requests for more information about distributed storage and quantum computing, and so I think we're going to do that. And I want to make sure that we talk through the Chord algorithm, since that's, I think, a relatively simple thing to understand, and it's very cool and applied pretty much everywhere. So if you remember, one of the things we talked about last week was the CAP theorem, which was really a conjecture that Eric Brewer put forth back in the early 2000s. It basically said that you could get consistency, availability, or partition tolerance, but you couldn't get all three at once. You might be able to get two of them at once. And so that's the so-called theorem. And we've talked through a number of reasons why that might be true. But certainly you can imagine that if you have to be tolerant to cutting the network in half, then it's going to be very hard to be both consistent and available all the time. So oftentimes the CAP theorem is a good way to understand global storage systems. Now, at the very end of last lecture, we were talking about key-value stores. And the cool thing about key-value stores is they're very simple in interface. Excuse me. So basically, you can have an arbitrary key, although that's usually a hash over some value, and you can have a value associated with it. And if you do put(key, value), that goes somewhere into the ether, and then when you do get(key), you get back the value that you started with. And so this interface is extremely simple. It's certainly an interface many of you have used in languages on a single machine. What's interesting is that if you use this in a global storage system, it turns out the interface is simple enough that you can have some pretty interesting implementations. And if you remember, we started talking about key-value stores with this notion of a distributed hash table, where what I've got in yellow here is really the key-value table that we might think about on one node, except that in reality this gets distributed over a whole bunch of nodes. And there are many parts to the question of how. One is, how do we actually do that distribution? Another is, when some client does a get, how does it figure out which node to go to? Clearly, we don't want to have a single routing table in the middle of the network; that's going to be really expensive. And then, what happens if one of these storage nodes fails? There are many failure modes you can imagine. There are performance problems and scalability issues, where we would like to increase the size of the system by just adding more nodes down at the bottom here. And so far, we haven't really talked about how to even make that work. So today, I want to tell you about the Chord algorithm, which has been turned into storage systems of many sorts, including those used by Amazon, Facebook, et cetera. Before we get there, I wanted to remind you of this notion of recursive versus iterative lookups. So here's an example of a recursive lookup, which is like routing. Basically, if I say I want to get whatever key 14 has got, the request goes to the master directory, and then that directory forwards it on; it routes it to the particular node that's got the result. And then the node returns to the directory, which returns back to the original client.
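Just to pin down that put/get interface before we go distributed, here's a minimal sketch in Python. The class, the file name, and the single-node table are all invented for illustration; the point is only the shape of the interface.

```python
# Minimal sketch of the key/value interface: put(key, value), then get(key)
# returns the value you started with. On one machine this is just a dict;
# in a DHT the same interface is backed by many nodes.
import hashlib

class KeyValueStore:
    def __init__(self):
        self.table = {}           # on a single node, just a hash table

    def put(self, key: bytes, value: bytes) -> None:
        self.table[key] = value   # in a DHT, this goes "into the ether"

    def get(self, key: bytes) -> bytes:
        return self.table[key]

kv = KeyValueStore()
key = hashlib.sha256(b"my-photo.jpg").digest()   # key is often a hash
kv.put(key, b"...image bytes...")
assert kv.get(key) == b"...image bytes..."
```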
So the recursive example is routing its way through the directory. An iterative approach is one in which the client itself talks to the directory and then talks to the individual nodes, and we're not routing queries through anything; every individual client is doing that particular lookup. And you can imagine that this second example might be more scalable, because we're going to have many clients all driving the lookups. It gets a little tricky to maintain consistency, though, and that's one of the reasons we might want the recursive version. Another is that the recursive version is just faster, because we're basically routing the shortest path to the result and back, whereas the iterative one is potentially twice the latency, if you think about the round trips involved. So let's keep those two implementation techniques in mind. They're really interchangeable and are more about what to do with the control-plane portion of this than anything else. Now, the challenge with that central directory is that it's got many entries that are key-value pairs, or at least key-to-node mappings, and you could have billions of entries in the system. So I would say that anything that treats the directory as a single server is bound to be a bad idea, because it's just not going to scale to billions of entries very well. Back in the early 2000s, myself and a bunch of other researchers started looking at peer-to-peer technologies as a way to solve this problem. And one solution here is consistent hashing, which I'm going to tell you about. We did tell you about it last time, but I want to re-emphasize what it is. The idea behind consistent hashing is that it's a way to take your keys and figure out a clean way to distribute them throughout the system without anyone having to know about all of the nodes that are participating. This seems like a strong ask when you think about it. If you look back at this diagram for a moment: if there are hundreds or thousands or millions of servers down here, and we have to somehow, with consistent hashing, figure out which node to go to without going through a master directory, and such that all these nodes don't know about each other, that seems pretty difficult. And that's one of the reasons the Chord algorithm is so interesting, all right? So this is basically going to be a mechanism to divide our space up, and we'll talk through that in the next slide. And then I'm going to show you how the Chord algorithm lets you get by with knowing essentially only a logarithmic number of the nodes in the total system, and you can still do this well. So each one of those storage nodes is going to get a unique ID, and that unique ID is going to be in the hash space. Imagine you take the node's IP address and its name and whatever, you concatenate all those things together, and you hash them, and you get a single 256-bit ID out of that. Now, we're going to talk more about secure hashes a little later in the lecture. So every node has an ID, and it's going to be in this ring space, this one-dimensional space from 0 to 2 to the m minus 1, where m is going to be big. And so let me just show you the picture here. Here's an example of the ring — the ring to rule them all. For the sake of class, I'm only going to talk about m equals 6, so there are really only 64 possible spots on this ring: 2 to the 6 gives you 64. In reality, m is probably 256, because we're using SHA-256 to do our hashing.
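To make the node-placement step concrete, here's a small sketch of hashing an identity down to a ring position. The IP address and node name are made up; truncating SHA-256 with a modulus is just for the m = 6 illustration, since in practice you'd keep all 256 bits.

```python
# Sketch: placing nodes (and keys) on the ring by hashing.
# m = 6 for illustration, as in lecture; in practice m would be 256.
import hashlib

m = 6
RING_SIZE = 2 ** m

def ring_id(data: bytes) -> int:
    """Hash arbitrary data down to a position on the ring [0, 2^m)."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

# A node's ID: hash over its IP address concatenated with its name.
node_id = ring_id(b"128.32.0.1" + b"storage-node-A")

# A key lands on the same ring the same way:
key_pos = ring_id(b"some key")
```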
And so let's just say there's a lot more slots on here than 64, but let's use 64 for our illustration. First of all, notice on this ring is a set of servers, each with an ID that's been acquired by hashing its IP address and its name or whatever. So this node here has an ID of 4, and that means we think of it as in position 4 on the ring. This one has an ID of 8; we think of it as position 8. This one has an ID of 32; we think of it as position 32. Now, hopefully what you can see here is that these are not evenly spaced. Probabilistically, they're uniformly distributed, but they're really a random hash over some data associated with the node, so they're spread through the ring without being equally spaced. And it's going to be important, in fact, that we have a good secure hash here that picks these positions in a way that's hard for anybody to fake. And then, once we've put these on the ring, now what we're going to say is: take, for instance, node 8. We're going to say that node 8 stores all keys from 5 — which is just after 4 — through 8. And 15 is going to store everything from 9 to 15. And 20 is going to store everything from 16 to 20. So the way to think about this is that we've put a bunch of storage nodes on this ring, and then we decide where to store our key-value pairs based on where the key is on this ring. Now, there's a lot of stuff I haven't told you yet. Like, for instance, what does this mean physically? I haven't told you, because since these are randomly hashed, these nodes are going to be spread physically all over the planet, potentially. The other thing I haven't told you about is how much each of these nodes needs to know about the others. So just to emphasize here: key 14, value 14, is going to be stored on this server. And why is that? Because server 15 is the closest one whose own ID is bigger than or equal to the key we're storing. I want to pause here and see if there are any questions. And like I said, in practice m is really something more like 256, so this ring is really big, and these nodes are much more sparsely distributed around the ring. OK, questions? We only have a very small class today, so you guys are likely to get your questions answered. Anybody? No questions. All right, should we move on? Now, Chord is a system that was developed by a group of researchers at MIT and at Berkeley, and you can think of it as a distributed lookup service. In my view, I like to teach about it because it's the simplest and cleanest algorithm for distributed storage that I have seen, and it's a comparison point for all sorts of other algorithms. The important aspect of the design space for Chord is that we want to decouple correctness from efficiency. So we want to figure out what we need from that ring and the storage servers on it so that the algorithm I'm going to describe to you is correct, and then we'll talk about how to make it efficient. And the thing that's interesting about Chord is that we're going to combine that central directory and the storage nodes together and spread them amongst all the nodes. So we no longer have a single lookup directory and a set of storage servers; instead, we have a set of storage servers that just talk to each other to make this work. And the properties are as follows.
For correctness, we're going to make sure that each node knows about its neighbors on the ring. So it needs to know how to go forward and how to go backward — a predecessor and a successor on the ring. And as long as the ring is connected, the ring is going to perform its tasks correctly. Then, from a performance standpoint, we're going to start adding some more neighbors: we're going to start learning about a logarithmic number of neighbors across the ring, and that's going to give us a much more efficient lookup. Now, there are many other structured peer-to-peer lookup services like this. Tapestry is one that I worked on here at Berkeley; Bamboo is another one I worked on. There's Pastry — that was a Microsoft Research project. There's Kademlia. There are a lot of interesting ones, several designed here at Berkeley. So this problem of how to look up a key-value pair got a lot of study in the early 2000s. And the way to think about Chord's lookup mechanism is, once again, routing. We're going to describe this in a recursive fashion to start with, and then, of course, you can do it in an iterative way as well; I think the recursive version is a lot easier to think about. So every node in the system is going to know who its successor node is. And here we have an example where some client talks to node 4 and says, "Here, look up key 37 for me." So what's going to happen? Well, we're going to start routing the request from the point at which we entered until we find out which node is responsible for key 37. Looking down from above — which a distributed algorithm can't actually do, since no node knows about all the other nodes — you can clearly see the lookup for key 37 is going to want to get back node 44, because that's what's going to store key 37. Why is that? Well, 37 is going to get stored on the closest node clockwise on the ring, and from 37, the closest one clockwise is 44. So how does this happen? Well, 4 says, "I don't know where it is," so it routes to 8. And 8 says, "I don't know where it is." It routes to 15, then to 20, 32, 35. At this point, 35 knows that its successor is 44, so it responds back and says, "Hey, I happen to know that node 44 is responsible for key 37." And at that point, node 4 can talk back to the client, and the client now knows to just talk to 44. Now, if we wanted to be fully recursive, we could have 35 pass the query on to 44 and have 44 send the value back; that would be another option. So if you notice here — how did we find the first node, again? That's a great question. The answer is that any client that talks to the storage system needs to know at least one node in the system, and it doesn't matter which one it is. In this case, the client, which I haven't shown separately out here, happened to know about node 4, and node 4 serves as a gateway into the ring. Did that answer the question? Now, you can see that this doesn't seem very ideal, because if I've got 1,000 storage nodes, I may have to take many hops to find what I want here. It turns out that the worst-case lookup here is order n. So that's probably bad, but we're going to show you how to get log n in a little bit. It's going to be a dynamic performance optimization that's going to be pretty interesting.
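Here's what that order-n successor routing looks like in code — a minimal sketch, with an invented Node class and the lecture's example ring baked in.

```python
# Sketch of the order-n lookup: each node knows only its successor, and we
# route around the ring until some node's successor owns the key.
class Node:
    def __init__(self, node_id: int):
        self.id = node_id
        self.successor: "Node" = None  # set when the ring is built

    def in_range(self, key: int) -> bool:
        """Is `key` in (self.id, successor.id], wrapping around the ring?"""
        a, b = self.id, self.successor.id
        return (a < key <= b) if a < b else (key > a or key <= b)

    def lookup(self, key: int) -> "Node":
        if self.in_range(key):
            return self.successor          # my successor stores the key
        return self.successor.lookup(key)  # else forward around the ring

# Build the example ring from lecture and look up key 37 starting at node 4:
# it hops 4 -> 8 -> 15 -> 20 -> 32 -> 35, and 35 answers "node 44".
nodes = [Node(i) for i in (4, 8, 15, 20, 32, 35, 44, 58)]
for a, b in zip(nodes, nodes[1:] + nodes[:1]):
    a.successor = b
assert nodes[0].lookup(37).id == 44
```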
Now, what I want you to see, though, is that from a correctness standpoint, as long as every node knows who its predecessor and successor are — and in this case, just its successor — then we can always find the server we're looking for. OK? Now, what does this really mean? So here's this ring, and here's key 14 stored on node 15, let's say. What it really means is something more like this. Since we were taking hashes over their IP addresses and some metadata, these nodes could be anywhere in the world, and we're connecting them together based on their hashed names. So 4 talks to 8, 8 talks to 15, and so on. So, for instance, key 14 happens to be stored here on the East Coast, and node 4 is up in Alaska. Based on what I just showed you on the previous slide, if somebody were to ask node 4 for key 14, we would go from Alaska over to the East Coast, over to the West Coast, and we'd get the result. OK? So really, because the hash is a randomizing function, we've scrambled the geography of this ring. Now, that's actually good, and the reason it's good is that it means no particular part of the world becomes a hotspot. It means, unfortunately, though, that we don't have the most local of lookups, because if we started at node 4, it'd be nice if we could just go down to 15 and back. OK? Now, there's a really good question here about redundancy: how do we get redundancy out of this? For the moment, suspend that question for just a second. Certainly, we could put RAID storage on each of these nodes, and that would be great if the disks fail. But we would like something even more powerful, because, I don't know, if there's a big earthquake and California falls off into the ocean, it'd be nice to know that key 14 survived somehow. So in addition to the RAID redundancy that we've been talking about in class, there's some other sort of redundancy that we want here. OK? Yeah — by the way, if you've ever seen one of the original Superman movies, basically the plot is that the bad guy buys up a bunch of soon-to-be beachfront property in Nevada, and then has a plan to cause California to fall into the ocean, and therefore have really expensive property. Fortunately, Superman saves the day, and it doesn't happen. OK. So if we move forward with this — by the way, I'm showing you these clients now to make this a little clearer. The clients need to know one gateway into the system in order to talk to the system. OK, so that's going to be part of the initial lookup. And by the way, that's pretty similar to what happens with DNS: you need to know how to talk to at least one local DNS server somewhere before you can start resolving names. OK. So the first thing I want to talk about is how we can make sure this ring stays connected, even though nodes are failing and coming back. And the way we make sure it's connected is with a dynamic stabilization procedure. Every node can run stabilize, in which it asks its current successor node who that node's predecessor is, and figures out whether there's something wrong with who's connected to whom. And if it finds a problem, it can run notify to help reconnect the ring. OK, and these are the kinds of things that are a lot easier to see with animations. So let's suppose that we have this ring, and what I want to show you here, for instance, is a new node — or a node that crashed and is coming back — and suppose that what it wants to do is join the ring.
So what does it do? Well, just like we've been saying with clients, presumably it knows one of the nodes in the system. And remember, this ring has nothing to do with locality, so that node could be, I don't know, 8 or 15 or something. So this new node needs to know one gateway node, and I'm going to say 15, just for the sake of argument. And what are we going to do to join? Well, we're going to send a join message to the node we happen to know about. What's interesting about this algorithm is that all the ring is going to do is figure out who is responsible for storing key 50, as if this were just a regular key-value lookup. So we're going to work our way through, and eventually 44 is going to say, "Well, I know about 58. Here you go — 58 is where key 50 would be." And notice what we've done. Now, all of a sudden, the new node 50 knows that it needs 58 to be its successor and 44 to be its predecessor. So just by asking the ring where key 50 belongs, it now has some information about nodes it can talk to. And so 50 starts by updating its successor to 58. Now it's technically connected, somewhat, to node 58, but 44 is also connected to 58, so we have a kind of weird, partially connected ring. Let's look through what happens. Node 50 is going to run stabilize, so it's going to talk to the successor it knows about and ask it, "Who's your predecessor?" And when it does that, what does it get back? Node 58 says, "Oh, my predecessor is 44." At that point, things are getting interesting, because now 50 can notify node 58 and say, "Hey, you know what, I'm actually a node you should know about for your predecessor." Meanwhile, at some point 44 is going to be running its own stabilize — stabilize is running continuously — and 44 is going to ask 58 who it thinks its predecessor is, and 58 is going to say, "Well, I think it's 50." OK, and at that point, 44 knows something's wrong here, so it's going to change its successor to 50. And then, finally, it's going to notify 50 about itself, at which point 50 knows its predecessor. And when all is said and done, node 50 has joined. Now, what I want to point out about this joining operation — I went through it pretty quickly, but you're welcome to go back to the slides and animate it through — is that really what happens is we have this continuous stabilize procedure that everybody's running all the time. They're asking their successor who the successor thinks its predecessor is, and they just run this over and over again. And what happens is the ring keeps converging into something connected. That will happen even as nodes fail and come back up and so on; it'll converge to a connected state. But what you can see pretty easily, I think, is that if you lose two nodes in a row, then what I've just described to you is no longer going to work. So there is a way to completely break the ring such that the stabilize procedure won't reconnect it. Can anybody think about what the right thing to do in that scenario is? How do we make sure that two failed nodes in a row can't prevent the ring from reattaching itself? Anybody? Perfect: we need to know more than just one successor and one predecessor.
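Before we add that extra redundancy, here's the basic join/stabilize/notify machinery in code — a simplified sketch of the protocol just walked through, with no failures handled and all names invented for illustration.

```python
# Sketch of Chord's join/stabilize/notify loop (simplified; no leaf sets yet).
def between(x: int, a: int, b: int) -> bool:
    """Is x strictly inside the ring interval (a, b), wrapping around?"""
    return (a < x < b) if a < b else (x > a or x < b)

class Node:
    def __init__(self, node_id: int):
        self.id = node_id
        self.successor = self        # a node alone is its own successor
        self.predecessor = None

    def lookup(self, key: int) -> "Node":
        if key == self.successor.id or between(key, self.id, self.successor.id):
            return self.successor
        return self.successor.lookup(key)

    def join(self, gateway: "Node"):
        # Ask the ring who would store "key" self.id: that's our successor.
        self.successor = gateway.lookup(self.id)

    def stabilize(self):
        # Periodically: "Hey successor, who do you think your predecessor is?"
        x = self.successor.predecessor
        if x is not None and between(x.id, self.id, self.successor.id):
            self.successor = x       # somebody (e.g. node 50) slipped in between
        self.successor.notify(self)  # "you should know about me"

    def notify(self, candidate: "Node"):
        if self.predecessor is None or between(
                candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate
```

Running stabilize repeatedly on every node is exactly the convergence process described above: each pass either confirms the links or repairs one of them.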
What that's typically called is the leaf set: if you look at any given node, the leaf set is multiple nodes forward and multiple nodes backward. As long as we maintain that leaf set, we can reconnect in a way that's stable against all sorts of failures. And one thing I posted last time — I could move it to today's reading, I guess, if you want — is one of the original Chord papers, which walks you through how many of these leaf-set nodes you need to make the probability of a permanently disconnected ring so small that you wouldn't care about it. All right, good. Now, questions? Are we good? Now, one thing I will point out is that so far we still have this pretty expensive lookup process, which is order n. But we have figured out how to make this stable. So first of all, as long as we have a fully connected ring, we can always find the storage for the data, and therefore we have a correct algorithm. Now, the question in the chat there is a good one, which is: suppose we had some key stored on node 50, and node 50 disconnects — then all of a sudden the set of keys stored on that node is suddenly unavailable. I'm assuming that's what you're thinking about, and that's correct, so we'll have to fix that problem. Right now, we're just interested in the lookup process of figuring out which node should hold our data; we'll worry about making sure the data doesn't go away in a moment, OK? So — oh, OK, not exactly. Let's see. Oh, I see: if you mean disconnecting every other node, it turns out that will converge pretty well, especially if you have multiple links. But try going through the process and disconnecting one node and then another, skipping one in the middle — you'll see that you can eventually get this to stabilize and reconnect, OK? Now, the really strong way to keep things connected is to have a leaf set with more than one node, by the way. And what you should do is take a look at that paper, because they describe this in more detail. Basically, what you want is a stabilization procedure that can work even when several nodes in a row are failing, and what you'll see is that there's a way to do that as long as you have multiple links. And what we're going to do right now for performance is going to make it even harder to destroy the connectivity, OK? So if you look here, the question is: how do we do better than order n? And better than order n is the following. Rather than just keeping track of the node one forward and one backward, we're going to keep track of the node at our current position plus 1, plus 2, plus 4, plus 8, plus 16, and so on. And what I mean by "keep track of" is this: here at node 80, the question would be, what node would store 81? Well, that would be 96. What node would store 82? That would also be 96. What node would store 84? That'd be 96. At some point we get to, what node would store 80 plus 32, so 112? Well, that would be 112. And we're going to keep track of a logarithmic number of these pointers. Of course, the way we find out about them is we just query the ring — "Who would store each of these keys?" — and what comes back from the ring is which node is responsible.
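That table of pointers is the finger table, and building it is just a handful of ordinary ring lookups. Here's a sketch; m = 7 is an assumption so that the node 80 / 96 / 112 example fits on the ring, and `lookup` stands for any routine that asks the ring who stores a key.

```python
# Sketch of building a finger table. finger[i] answers the question
# "which node stores key (n + 2^i) mod 2^m?" -- we just ask the ring.
m = 7  # assumed ring size for the 80/96/112 example

def build_fingers(n: int, lookup):
    """`lookup(key)` asks the ring which node is responsible for `key`."""
    return [lookup((n + 2 ** i) % 2 ** m) for i in range(m)]

# For node 80, with nodes 80, 96, 112, ... on the ring, this asks:
#   who stores 81? -> 96     who stores 84? -> 96
#   who stores 88? -> 96     who stores 112? -> 112   ... and so on.
```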
The powerful thing about this is that, once I've got all these pointers, I can do a really fast routing process to figure out which node is going to store the key I'm interested in. And one thing that's very helpful here, I think, is to think about this as bit correction. I am at a certain position, and I'm interested in a certain key, and what I'm going to do is correct the bits one at a time using this finger table. So my first routing hop is going to say: if I'm at 80 and I want to get somewhere over on the ring, I'm going to correct the highest-order bit where I differ from my target — in this case it would be a 1 in the high position that I turn to 0 — by taking a long hop. Then I'll take a less long hop, and so on, and I end up with a logarithmic number of hops to get to my destination. You can view it as correcting the bits from my starting point to my ending point, one at a time, by taking these various hops. And that's how we end up with logarithmic routing time. Furthermore, this forest of additional pointers, plus the extra pointers from the leaf set, together make it really hard to be unable to reconnect the ring. If you read that Chord paper, it talks about how you make use of all the information you've got to keep the ring connected. OK, questions? We're good. Now, let's think a little more about these links. So basically, first of all, we're going to have more than one forward and backward link — the leaf set. And in the predecessor-reply message, a node can send its k minus 1 successors as well. So you can see what's going to happen: during this heartbeat process — the stabilize process of asking, "Hey, my successor, who's your predecessor?" — we're going to get back multiple nodes, which gives us a forest of connectivity forward and backward, and that allows us to keep our leaf set as correct as possible. And I will point out, by the way, that these links are really just an approximation of what we need. If I try to take a long hop and the node that finger happens to point to is down, that's OK — I'll just take the next shorter one. I can always revert to taking the order-n routing path until I've got some of these long hops available to me again. So it converges very nicely on a performant version of things, but it can always fall back on the circular routing process if some of the fingers aren't correct. And we just keep refreshing them over and over again: there's a finger-table maintenance process that keeps renewing these pointers. The good thing about that is that as new nodes come in, the finger table adjusts, and as old nodes leave, the finger table adjusts, so we keep our log n lookup. And you end up with lookups succeeding with really high probability even if half of the nodes fail: if each node keeps on the order of log n neighbors, where n is the number of nodes in the system, you can still find data even when half of your nodes fail. That's roughly what's proved in the Chord paper. And that's not that many neighbors, because it's a logarithmic number. So, before I go on to fault tolerance for the data itself, does anybody have any questions on this? We good?
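Here's the bit-correcting routing loop as code — a sketch reusing `between` from the stabilize sketch, with a `fingers` list assumed on each node. Note the fallback to the plain successor hop, which is exactly how correctness survives stale fingers.

```python
# Sketch of the log-n routing loop over the finger table.
def closest_preceding_finger(node, key):
    # Scan longest hops first; take the farthest finger that doesn't
    # overshoot the key. This is the "correct the highest bit" step.
    for finger in reversed(node.fingers):
        if between(finger.id, node.id, key):
            return finger
    return node.successor            # worst case: plain one-step ring hop

def fast_lookup(node, key):
    # Hop until the key lands between us and our successor; then we're done.
    while not (key == node.successor.id or
               between(key, node.id, node.successor.id)):
        node = closest_preceding_finger(node, key)
    return node.successor
```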
So now let's look back at what we had a slide before: we had key 14 stored on node 15. And the downside, of course, the way I've described this, is that the only place key 14 is stored is on node 15. Now, if you look at the consistent hashing, what it says is that if node 15 weren't there, key 14 would be stored on 20 — that's just the next node up from 14. But since the only copy of key 14 is currently stored on node 15, if 15 dies or goes away, we don't have the data. So it's fine that the consistent hashing tells us where it should now be stored, but we have nothing left to store there, because we've lost our data. So we've got to do something else here. And the way we're going to do that is to take the forward leaf set — or you can do both forward and backward, depending on the algorithm — and store key 14 on the successive nodes that we know about because of the leaf set. So we'll store it on 20 and 32. What's good about this is that if node 15 fails — the node that's supposed to store the key — we've already got a copy on node 20. And node 20 can notice, "Oh, 15 went down," and start the process of making sure that 35 gets a replica, so that we always have three copies in our leaf set. So if we think of the leaf set not only as keeping the ring connected but also as how we replicate, then we have a dynamic process that automatically adapts as nodes fail, by replicating on successive nodes and making sure that we always have a given number of copies in the system, in addition to the ring staying connected. So if node 15 fails, what we'll do is add an extra copy on node 35, okay? Questions? And the ring's going to stay connected because of our connectivity algorithm. So what's good about this is, like I said, you store the data in the Chord ring and it's very hard to destroy. Okay, why are they called leaf sets? That's a good question. The reason is that, in some sense, if you take any given starting node, like 58, and you view the set of fingers, that's a tree — and eventually you get down to the leaf set. So it's like a tree with leaves; that's where "leaf" comes from. And here's an example of that. If you look at what happens with leaf sets, what I'll show you here is a starting node: its leaf set is in green, and the finger tables are in red. And suppose I'm trying to get from here over to here, okay? What I'm going to do is start bit-correcting. So I might take a long hop first, and now I can check all of the fingers — or I guess you could call them branches, if we don't want to mix metaphors too much — and the leaves, and none of those quite have what I want. So I'll follow one of these branches; it's a little shorter. And eventually I'll get to a node where, because of its leaf set, that node knows which node is the one I'm looking for — which is going to be the one that's just bigger clockwise from the ID I'm looking at, okay? So the leaf set not only helps us with our replication, it also serves as part of the last couple of hops: we can use the leaf set to find who's supposed to have our data. All right.
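Here's the replication side of that as a sketch, assuming the Node objects from the earlier sketches each also carry a `table` dict. The three-copy count matches the lecture example; the function names are invented.

```python
# Sketch of leaf-set replication: the "home" node that consistent hashing
# picks stores the key, and so do the next REPLICAS-1 nodes clockwise.
REPLICAS = 3

def store(home, key, value):
    node = home
    for _ in range(REPLICAS):
        node.table[key] = value      # copy on home, then on its successors
        node = node.successor

def on_predecessor_failure(node):
    # node's predecessor died; node already holds copies of its keys, so it
    # pushes them one node further along to get back up to REPLICAS copies.
    target = node
    for _ in range(REPLICAS - 1):
        target = target.successor
    for key, value in node.table.items():
        target.table[key] = value
```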
Now, let's look at replication from a physical standpoint. If you look again at this ring I showed you a little while ago, that ring is mapped physically onto nodes that are spread widely, and now we can see another big advantage of the randomness introduced in Chord: these copies are actually stored in geographically separate places. So when the big one happens in California, it's not likely to take out things over here in Minnesota, okay? The randomness is helping us avoid correlated failures — where, yeah, we have a bunch of copies, but they're all in the same machine room and the building got struck by lightning. That doesn't happen with the Chord algorithm; it's geographically distributed by nature. Okay. The downside, of course, is that performance might hurt if you happen to be too far away from a copy. And so I will tell you that there are subsequent versions of Chord in which, when you're doing this routing and you have a lot of options — see how many places we could go — we can pick hops that advance us the furthest along the ring while keeping latency as low as possible. So we can actually take locality into account to some extent in Chord, and make our routing less like bouncing back and forth across the planet randomly and more like working our way physically toward the thing we're interested in. All right, good. Last but not least — and I didn't have slides for this, but I wanted to point it out — one of the things we can do with Chord is use it to store the locations of data rather than the data itself. Think of this like a DNS built out of Chord. The client doesn't know where the data it's interested in is, so it asks the Chord ring. The Chord ring tells it who to talk to, and then it can talk directly to that node and exchange data over the shortest path possible, using TCP/IP or whatever. And so you can now get the best of both worlds: you have a very hard-to-destroy lookup process, and then you can choose — here's the client, and it's using some data — to replicate that data close to the client, and maybe a couple of other places near it. So even though the initial lookup might be geographically spread out, once you start using the data and know where it is, you can get good locality out of it. And that's pretty much what we did with the Tapestry lookup process back in OceanStore. I did want to point out that this Chord ring I've just described is actually used in lots of cloud services these days — the idea, at least. So, for instance, DynamoDB — and I have a paper for that up on the reading from last time — uses Chord rings, but rather than spreading them around the planet, it uses them within Amazon's machine rooms as a way to distribute load. So when you're ordering things from Amazon, and you're putting things in your cart, all of that data is actually stored in something like a Chord ring in a machine room. Okay, and because the applications are worried about people — about not annoying them when they want them to buy things — what the Chord ring is really about is making sure they can bound the latency of retrieving something to a certain number of nines, okay? So availability is an important aspect here: basically, you have a service guarantee that says we'll get a response within 300 milliseconds for, say, 99.9% of the requests, okay?
And so that's part of the way the Chord algorithms are adapted in a real cloud service, okay? And notice that this is very much in contrast to what we've been talking about for much of the rest of the term, which has focused on mean response time. Instead, we want guaranteed performance, okay? And again, I want you to think back to when we were talking about real-time scheduling: what was important there was keeping the variance of the scheduling time low — keeping the predictability high and the timing tight — rather than making things as fast as possible on average. Now, S3 is actually using something slightly different, but there are lots of schemes out there. What's good about the various systems using Chord-like algorithms is that they're scalable: as you can imagine, if you don't have enough performance, you can just start adding more nodes and it adapts automatically, which is pretty good. So what I wanted to do next — I'm going to leave that there. I want to talk a little about security, walk through a couple of things, and then try to get to quantum computing as well, since I know some of you asked questions about that. So I'm going to leave this topic, unless there are more questions, okay? So I'm going to talk through a couple of things, and I'm assuming everybody kind of knows this, but I want to make sure we all have the same terminology. Security is an interesting thing: it's basically computing in the presence of an adversary. I'm assuming several of you have taken 161 — I don't know if that's true or not; we have a very small class tonight. You can start worrying about things like: can an adversary prevent me from making forward progress? Can failures undermine reliability, robustness, fault tolerance, et cetera? Security is about dealing with the actions of a knowledgeable attacker who's really trying to cause harm, and we want to make sure they can't screw us up. We talked about Byzantine agreement a couple of weeks ago; that's one example, where an adversary tries to prevent a decision-making process from working. But in general, wherever there's an adversary, there's a security problem, okay? And there have been many such problems — people breaking into systems — and it's a constant arms race, preventing people from breaking into the things you care about as they invent new techniques. And the distinction between protection and security, I think, is an important one: protection is the set of mechanisms we talk about in this class, while security is using those mechanisms to prevent misuse of resources. So, for instance, virtual memory is a mechanism that can be used for protection. A security policy would be making sure that when we use virtual memory, we don't let malicious processes — or different processes owned by different people — use the same memory and have the potential to screw each other up. That would be a security policy built with our protection mechanisms, okay? So I wanted to point out something interesting. I don't know if you've ever seen this before, but here is a car in a ditch. What's interesting about this particular car in the ditch is that back in July of 2015, a team of researchers took complete remote control of a Jeep SUV by exploiting its firmware over the Sprint cellular network. And they caused the car to speed up and slow down and veer off the road — totally wirelessly.
So this is a little scary to think about. Now, fortunately, no humans were harmed, and the people whose car was driven off the road were researchers as well. But this is something one might hope our security policies could prevent. The thing that gets in the way of preventing incidents like this is that there's an increasing amount of machine-to-machine communication — machines talking to other machines, controlling each other — and making sure a malicious person can't get in the middle of that and cause unexpected behavior is very tricky. There's this term, cyber-physical systems — I don't know how many of you have heard of it — and the idea is computers controlling physical things using policies and algorithms. The problem is that if somebody manages to get in and mess with those cyber-physical systems, they can cause physical harm, okay? And that's a problem. So part of the question is the following. Let's talk about the data. There was firmware in this car, and one might argue that one of the problems was that the firmware was accepted as authentic even though it came from a malicious third party. So one important question is: do you know where your data came from? That's a provenance question. Another is: do you know whether it's been reordered or changed or altered in any way? That's an integrity question. And really, this is a question about the rise of fake data, which is much worse than fake news — it's about corrupting the data and making the system behave very badly. So we have several security requirements that people talk about. Authentication is making sure that a user who's making changes to the system is really who they claim to be. Data integrity is making sure that the data hasn't changed, okay? So that's important. Confidentiality is making sure that the data is readable only by authorized users; that often involves encryption of some sort. And then non-repudiation is a surprisingly important one that people don't often talk about: if a sender makes a statement or sends a message, they can't later claim, "Well, I didn't really send that; somebody malicious did." It's about making sure you can't repudiate things you've previously said. And so I'm hoping that if you haven't taken 161, it's on your list, because there's a very interesting set of topics there. Cryptography is one of the central tools behind many of these mechanisms — you just have to use it correctly. This is communication in the presence of adversaries, and it's been studied for thousands of years. There's actually a book called The Code Book, which you should look up, that covers thousands of years of cryptography. The central goal has always been confidentiality: encoding information so an adversary can't extract it. The general premise is that there's a key, and if you have the key, you can decode things; if you don't have the key, it's essentially impossible. What's gotten more interesting over the years, of course, is public-key cryptography, where there are really two associated keys — you encode with one and decode with the other — and that leads to all sorts of really interesting authentication capabilities.
So, basic cryptography, which you've probably heard about: you have a secret key, you take the plaintext and encrypt it with the secret key, and you send over the internet something called ciphertext, which is encrypted, and the other side can decrypt it. And assuming the key hasn't been leaked, and you have a good algorithm like AES, it's not possible for an adversary to send a message that the receiver will treat as real, because you have to have the secret key. Now, one thing you do need to do to make this single-key, symmetric encryption work — it's symmetric because the same secret is used on both sides — is to prevent an adversary from holding onto an old message and replaying it later. So you have to start adding what are called nonces, which are things like timestamps and so on, so that every message you send is unique, and if somebody replays an old version, you can detect it. But I'm assuming everybody understands the idea of encryption with a symmetric key. The other thing is, I mentioned hashes earlier. The idea of a secure hash function is one where you take data, run it through the hash function, and get a bunch of bits out, such that if you change the data even slightly, then with a good hash function roughly half of the output bits change. So the change from "Fox" to "The red fox runs across the ice" will give you something very different — and if you take "Fox" and just add a few characters after it, that will also change the output drastically, okay? The hash of a message is a set of bits, say 256 bits, and this is exactly what we used on the Chord ring: that ring had 2 to the m possibilities, and a position might be the result of a hash function like SHA-256, okay? What makes a hash secure is that it's not feasible for somebody to come up with another input that matches a given hash output. In fact, with a good secure hash function, it's not even feasible to come up with two different items of your own choosing that have the same hash. And that's why we can use hashes as a proxy for the data itself — a lot of what you hear about in the security literature assumes that the hash is a reasonable proxy for the data, okay? SHA-256 is a good example. So here, for instance, if we share a secret key k, we can take a plaintext — something like a contract — and run the key together with the message through a hash function; the result is called a digest, or an HMAC. Now we can send the data across along with that digest, and at the other side we can verify by recomputing the HMAC, okay? If the one that was sent matches the one you computed yourself, then you know the message is not corrupted; otherwise it's corrupted. So we can use hashes to prove, after the transmission has happened, that the data is authentic, okay? Hashing is pretty powerful, and I'm not going to have a lot of time to go through all of it — that's a 161 topic — but keep it in your lexicon: hashing is a good way to ensure the integrity of data on the other side. So, for instance, in that firmware problem with the car, we could have a key that came only from the manufacturer in a secret way, and we could check the integrity of the firmware against the manufacturer.
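Here's what that shared-key integrity check looks like in code, using Python's built-in hmac module. (Standard HMAC is a keyed-hash construction rather than literally hashing key-appended-to-message, but it plays exactly the role described; the key and message here are invented.)

```python
# Sketch of the shared-key integrity (HMAC) check just described.
import hmac, hashlib, os

key = os.urandom(32)                     # the shared secret k
message = b"I agree to pay $100"         # the plaintext contract

digest = hmac.new(key, message, hashlib.sha256).digest()
# ... send (message, digest) across the network ...

received_msg, received_digest = message, digest
ok = hmac.compare_digest(
    hmac.new(key, received_msg, hashlib.sha256).digest(),
    received_digest)
print("message authentic" if ok else "message corrupted")
```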
And if the integrity check didn't match, we could know that the firmware is probably bogus and we shouldn't be using it, okay? Now, the downside of everything we've talked about so far is that both sides share the same key. So if you leak the key, you've got problems, okay? And furthermore, you have to somehow share the key in the first place, which requires you to go into a dark alley and, you know, hand the key over. So this seems like only part of the solution. The interesting thing here is the idea of public/private key pairs and public-key encryption. The cool thing about that is that you can distribute the public key freely. Let's suppose somebody over here wants anybody to be able to send them a secure message. They generate a public/private key pair. They can give the public key to somebody else — they can broadcast it to the world — and then anybody who wants to send a message just encrypts it with the public key, and the only way to decrypt it is with the private key. That private key is something I hold secret, but the public key I broadcast. And this is the basis for all sorts of modern algorithms, okay? Among other things, if I take a hash over some data and encrypt that hash with my private key, then anybody with my public key can verify that the data made it through intact and could only have come from me. That's part of how we actually sign things, right? So, for instance, here's Alice and Bob — let me show you a fine little protocol here. Bob sends his public key out into the wild to Alice. Now Alice can encrypt messages and send them to Bob, and only Bob can decrypt them. Alice can send her public key, and now Bob can send things to her. And when we're done, we've got a secure channel between the two of them, built from public information, all right? And this is another mechanism we can build all sorts of stuff on top of, okay? Now, the question of how to know whether a public key is valid requires public-key infrastructure, but that's another story. So — I went through this very quickly. How many people have never seen this kind of thing before? Or is this all pretty familiar? Okay, good — oh, great, this is in CS70, great. So let's talk about a project that I've been working on. Again, you could view security as trying to protect things with a firewall, or you could view security as being all about the data — and if you can protect the data, then you can protect everything, okay? And if you think about the Internet of Things, one way to look at it is that we have a whole bunch of devices and compute elements all over the world, and it's really a graph of services that we want to connect. So distributed storage is everywhere — every arrow represents communication, and we've got storage everywhere. And really, what we want is to make sure the data can only be written by authorized parties and only read by authorized parties, okay? Now, secure enclaves are a topic for another day as well, but briefly: an enclave is a special protected execution environment in modern hardware that basically allows you to set up a secure channel and do some secure computation in a way that not even the local operating system can see the data, okay? And so if we have secure enclaves everywhere and securely encrypted data, then perhaps we can do some interesting things, okay? And we can do them securely.
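Going back to that Alice-and-Bob exchange for a second, here's one half of it as a sketch, assuming the third-party `cryptography` package is installed. RSA with OAEP padding is just one reasonable choice; the message is invented.

```python
# Sketch of public-key encryption: Bob publishes a public key; Alice
# encrypts with it; only Bob's private key can decrypt.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Bob generates a key pair and broadcasts the public half.
bob_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
bob_public = bob_private.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Alice encrypts with Bob's public key...
ciphertext = bob_public.encrypt(b"hi Bob, it's Alice", oaep)
# ...and only Bob's private key recovers the plaintext.
assert bob_private.decrypt(ciphertext, oaep) == b"hi Bob, it's Alice"
```

Run the same thing in the other direction with Alice's key pair and you have the two-way secure channel from the slide.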
And so, let me see — I'm running low on time here. I wanted to say something a little bit interesting about why data breaches, which we've heard a lot about in the last few years, are so prevalent, okay? The problem is that people who are trying to build secure systems kind of think of it this way. They say: well, I've got a secure network on the left, I've got a secure cloud in the middle, I've got a secure network on the right, and they're so secure that the only thing I have to worry about is securing the communication between these parties — and if I do that, then the system's secure, okay? This is what I like to call border security rather than data security. You think, well, I'm going to put up some firewalls, and now I can say: look, this is a trusted computing base that's secure, this is one as well, there's one around the cloud, and then the only thing left is cell phones, which I make secure tunnels to, and everything is just fine. The problem with this point of view — which you've probably heard about everywhere — is that the moment we have any breach inside the trusted computing base, then all of a sudden not only is the data breached, but somebody inside that firewall can produce data that looks authentic even though it's not, because people are trusting that if it came through the firewall properly, it must be authentic. And if you think back to what we've been talking about with that car: suddenly you could have firmware that looks like it's from the manufacturer even though it's from an adversary. And now we have this issue that physical devices relying on this border security suddenly start doing things they're not supposed to. Okay, so the real reason we get these data breaches everywhere is that people think they can put up boundaries that can't be breached — and of course we know that's not true. And the deeper problem is that not only are things breached, but the integrity and the provenance of the data are not known. So what do we do? The data-centric vision, which is the one I've adopted in my research group, is one in which we think about shipping containers full of data. If you think about the Port of Oakland, you've probably all seen these shipping containers. The container was a great invention, back in 1956. Before 1956, what happened was longshoremen would take a bunch of goods down to a ship and play Tetris with them, trying to fit everything onto the ship. Then the ship would go to its destination, there'd be people there to unpack it, and then you'd have to figure out how to put things on trucks, and so on. It was a mess. And basically, one person said: why don't we just make containers that are all the same size and shape? And all of a sudden, ships, trains, cranes — all of the infrastructure for handling these things — became the same across the planet. Now I can ship something from my house in Lafayette to the outskirts of Beijing just by calling the right trucks to come pick up a shipping container, which gets taken to the Port of Oakland and put on a ship, and then it goes across the ocean and is unloaded, and so on. And what made that possible? The standardization of the container. So the idea in our group is to ask: can we use this idea to help in some way?
And the idea is very simple: we think of shipping containers full of data, which we call data capsules. The reason I've got this little green boundary around here is that it's a data capsule, and inside the data capsule is a bunch of transactions that are hashed — remember those hashes we talked about — and signed, where we use a private key to sign a hash over something. As a result, we can trust that a transaction really came from the person who claims to have written it, because only they hold the private key. And as a result, these data capsules give us a cryptographically secure way of moving data around — to the edge, to the cloud, and back again — in a way that nobody can fake. Another way to look at this is that it's almost a blockchain in a box. So what we're doing in our group is looking at how to take these data capsules and make them a standard that everybody uses, so that on top of them you can build file systems and databases and everything you're used to, while underneath, the network knows how to ship these things around and route queries to them. Think of the underlying network, again, as the ships and trains and cranes and planes that handle this standardized metadata. And what is the standardized metadata? Well, it's a hash over an owner key and some other metadata about who created the capsule. That forms a unique address that you can route to in our system — not unlike an IP address, but it's a unique hash over the data. You can imagine these things being small, so they could live on phones, or really large, so they could be terabyte-sized databases; but that standardized metadata is what allows them to be shared securely across pretty much the whole planet. Okay, so why does this idea help? The network effect, okay — pun actually intended here. Standardization makes it possible for the infrastructure to be put everywhere, and it benefits everyone. Federation: you can actually build a market of service providers. The data becomes a first-class entity, so your data can float pretty much anywhere — you could put a data-capsule server in your house, and all of a sudden your local data capsules could be stored there. Or, if you're doing some communication with somebody else, you could get a copy of their data capsules. And again, because it's like a blockchain in a box, it's not possible for somebody to fake data that doesn't belong in there, okay? Think back to that firmware issue with the car in the ditch, okay? The other thing is that the metadata we're looking at actually has details about what the network should and should not enforce. So you can even start talking about privacy: if you had a bunch of cameras in a local edge domain, taking a bunch of data and putting it in data capsules, you could make sure that the network would refuse to route those capsules outside of the house, or outside of your building, if that was disallowed. All right, so the vision here is really the following: you have a bunch of resources underneath, spread through the cloud and through endpoints all over the network; you've got data capsules that have the ability to float anywhere they're allowed; and that's what we call the global data plane, okay? The global data plane is something that spans the globe — just like, if you remember, the Chord ring spans the globe, okay?
Again, it's a peer-to-peer system, just like IP, that does routing and multicast, and it has things we call trust domains, plus accounting. Below, you can have many utility providers — kind of like Comcast or AT&T — that all provide service. And above, there's an API that allows all of these apps to access their data in the data capsules, okay? This is like me, in my house, calling a truck to pick up a shipping container, okay? And this vision, if it's ultimately completed, is one where, instead of paying for IP service, you would pay for data-capsule service, and you'd be able to store your data in a way that was secure, could be used anywhere you want, and you'd own your data. So, the physical view — just one last little slide here, and then I'll move on to the quantum computing, since I want to make sure we cover that. If you think about IP, the way we talked about it briefly a couple of weeks ago: if you look at the physical view of IP, there's a bunch of routing clouds, and there are also transit providers, okay? That's exactly how we get IP working. And in the global data plane we do the same thing: we have global-data-plane domains, we have routers that route global-data-plane traffic, and they're tied together just like we get with TCP/IP, okay? There are peering arrangements, just like with IP, and we have name resolvers that help us find our data capsules. And as a result — forgive me, I'm building all this up — we can actually have clients, which might be compute nodes, little robots, smart cars, or Teslas, that all tie into the global data plane and access data they're authorized to use, wherever it happens to be. And if they need high performance or privacy, they can pull it into their local domain. So the physical view of the GDP is really that instead of thinking about packets, you're now thinking about data and its integrity and its provenance, okay? It's a switch in viewpoint on how we want to be dealing with data. All right — sorry if that's a lot of information, but I wanted to see if there are any questions before I switch over to some quantum computing. All righty, give me a second, I'll be right back, and then we'll see if there are any questions that came up. One moment. Okay, good — so we have some good questions here. First question: how do we know the data is secured? So, just like with a blockchain — let me back up to this picture, which I think is a good one to be talking about. What we know is the following: the metadata is, among other things, a hash over the public key of the owner, okay? All of these records have to be signed by the owner, and anybody can verify that the data in here was put there by the right owner, okay? So that gives us integrity and provenance. It means we can know that none of the data in here could have been put there by an adversary. So that's the first thing we know. And the second thing is, of course, that we can put arbitrary encryption on top of this as well, to make it private. So really, the signatures are about integrity and who put the data there, and the encryption is about privacy — and there are many ways of deciding which keys to use for that encryption and how to share them with the people you want to be able to decrypt. Okay, so that's the security of this, and what you know is that when you get some data from the network, you can immediately verify that the data is what it's supposed to be.
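To make that concrete, here's a toy version of a hash-chained, signed log — a sketch only, not the project's actual format; the record structure and function names are invented, and the `cryptography` package is assumed.

```python
# A toy "capsule": an append-only log where each record carries a hash
# pointer to the previous record and an owner signature over its own hash.
import hashlib
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

owner_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

def append(log, data: bytes):
    prev_hash = log[-1]["hash"] if log else b"\x00" * 32
    h = hashlib.sha256(prev_hash + data).digest()   # hash-pointer chain
    sig = owner_key.sign(h, pss, hashes.SHA256())   # only the owner can sign
    log.append({"data": data, "hash": h, "sig": sig})

def verify(log, owner_public) -> bool:
    prev_hash = b"\x00" * 32
    for rec in log:
        h = hashlib.sha256(prev_hash + rec["data"]).digest()
        if h != rec["hash"]:
            return False                            # chain was tampered with
        owner_public.verify(rec["sig"], h, pss, hashes.SHA256())  # raises if forged
        prev_hash = h
    return True

log = []
append(log, b"transaction 1")
append(log, b"transaction 2")
assert verify(log, owner_key.public_key())
```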
These are all the things you get out of a blockchain, by the way, for those of you familiar with Bitcoin or whatever; you get them here with the data capsules. So I like to call these cryptographically hardened bundles of data: if somebody tries to put garbage in there, a legitimate person who looks at it can just throw the garbage out, because there's no way that garbage could have been put in there in a way that meets the integrity constraints of the data. Okay, so it's not forgeable. It maintains its integrity, the transactions can't be swapped, and so on. It's a uniquely high-integrity kind of bundle of data. And if you build file systems and whatnot on top of this, really what you're doing is appending data to it, and it becomes a secure log on which you can build pretty much anything: databases, file systems, all sorts of stuff. Okay, so for this link structure within the data capsule, think of Git; you're all familiar with Git now. This is like a Git tree, with signatures, and with integrity through hashes. Okay, and the signatures can't be faked, again, because only the proper owner knows the private key that produces them. Of course, if the private key is compromised, that's a problem, but potentially every data capsule could have a unique private key, which leads to an interesting key-management issue; that could be another topic. Okay, did I answer those questions? So the vision here really is of pretty much everybody using data capsules everywhere, okay? And if you can get that to happen, then you potentially have a very interesting scenario. I just wanted to share another point really quickly: the way we're looking at our routing plane is that the data plane itself is a series of routers that all know roughly where the data capsules are, and it has some very interesting properties. If any of you want to come work on this project, come talk to me separately; we have plenty of places where you could contribute, okay? Now, I promised you some quantum computing, but I did want to show you one other interesting slide first, which gives you an idea of how we can build things up. The data capsule infrastructure spans the globe, kind of like TCP/IP does. In that scenario you get these GDP switches, which are just an overlay network on top of IP. You can have location services and storage services for storing data capsules, you can have secure enclaves that let you do secure computation, and then you can have lots of interesting clients. Part of what we're doing is working with roboticists and machine-learning folks to put their data and their models, for grasping and so on, inside of data capsules. As a result, those can reside securely at the edge, in your robots or whatever, in a way that can't be breached, okay? So this is really targeted at secure edge infrastructure in addition to the cloud. The data capsules can move back and forth, but you certainly need something like this on the edge, because those pieces of hardware are easily breached, and you want to make sure your data is secured and unforgeable. All right, good.
So let me say a little bit about using quantum mechanics to compute. And since there's only a few of you tonight, if you're willing to hear me out, I can talk for a little bit longer to get through a couple of other things on quantum computing. What does it mean to use quantum mechanics to compute? It's basically using two weird but useful properties of quantum mechanics: quantization and superposition. I don't know how many of you have taken a quantum mechanics class, but remember the orbitals from chemistry? Electrons were only allowed in certain rings, spheres actually, around atoms, and that was because of quantum mechanics. That's quantization: it says an electron can sit at one discrete level or the next, but nowhere in between. That quantization gives us the ability to talk about something like a one or a zero, so the idea of digital data is already buried in it. But because it's quantum mechanics, we can do the second thing, which is superposition. That's having a bit which is both a zero and a one, in some fraction between the two: say, 50% zero and 50% one, or anything in between. That's called superposition, and that's where things get interesting. What's interesting to me, having designed computers at various times in my life, is that in most digital abstractions you might learn about in 151, or 141, or one of those various VLSI classes, you spend a lot of time trying to get rid of the quantum effects. You want a zero to be a zero and a one to be a one, and you want them to stay that way; it's when they don't stay that way that you have problems, and that's when you bring in error correction codes and all that stuff we talked about last week and the week before. In quantum computing, however, if you're willing to allow things to not always be a one or always a zero, you can use quantization and superposition together to compute, okay? And there are some interesting results here. For instance, Shor's factoring algorithm factors large numbers in polynomial time, even though the best known classical algorithms are super-polynomial in the number of bits (sub-exponential, but still intractable). So if you could get Shor's algorithm running on a quantum computer, pretty much all RSA cryptography would be broken, because you could factor, okay? Another interesting result is Grover's algorithm. It's not as spectacular, but it's still pretty interesting. Imagine you've got an unsorted database of millions of elements. What does unsorted mean? It's not sorted, right? So if you wanted to find a value among a million items, how many elements would you have to look through on average before you found the one you want? Half of them. Grover's algorithm, on a quantum computer, lets you find items in an unsorted database in time proportional to the square root of N rather than N/2, okay? That's pretty interesting, right? (There's a quick back-of-the-envelope comparison below.) By the way, 191 is mentioned in the chat; that's a good class to take if you're interested in quantum computing. Another application, my favorite, and I think the best application of quantum computing, is what I like to call material simulation. This was the original application of quantum computing that people thought of.
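Before getting to material simulation, here are some numbers behind that Grover claim: the expected classical probe count is about N/2, while Grover's iteration count is on the order of (pi/4) times the square root of N, a standard result. The script below is just that arithmetic, not a quantum simulation.

```python
# Rough scaling comparison: classical unsorted search vs. Grover's algorithm.
import math

for n in (10**6, 10**9, 10**12):
    classical = n / 2                      # expected probes for a linear scan
    grover = math.pi / 4 * math.sqrt(n)    # Grover iterations, ~ (pi/4) sqrt(N)
    print(f"N = {n:>14,}: classical ~ {classical:>15,.0f} probes, "
          f"Grover ~ {grover:>12,.0f} iterations")
```

At a million items that's 500,000 probes versus fewer than 800 iterations, which is why even this "unspectacular" quadratic speedup matters.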
And the basic idea of material simulation is this: if I want to design a brand-new element or a brand-new material to build things out of, and I want to take into account all the quantum mechanical effects, then classically I'd have to do something exponentially hard, but it would be linear time on a quantum computer. So if I'm really interested in designing exotic new materials to build interesting things, I probably want a quantum computer. There are many other algorithms out there these days, people have been slowly working on them, but these are some pretty good ones that give you an idea of why this might be interesting, okay? Furthermore, we've got Google, we've got IBM, and Microsoft is in here as well; building these quantum machines is very popular with big companies right now. Both Google's and IBM's machines use superconducting qubits. The parts of the machine you see here are normally put into a dewar and run at a few kelvin or colder, something really cold. This particular type of quantum computing technology is not going to be in your laptop, or at least not in any laptop I'd want to put on my lap. But there are other technologies, including ion traps, that are potentially pretty interesting; there have been thoughts over the years that they might be able to run at something closer to room temperature, though we're not there yet. The current goal of Google and IBM, and there's been some notion that maybe they've shown this, is to do something they call quantum supremacy, which is basically to prove that there really is a possibility that quantum computers could be faster than classical ones. The issue here is that these computers being built by Google and IBM have maybe on the order of a hundred qubits, maximum. It's very hard to do anything interesting with a hundred qubits, but they're focusing on demonstrations that show that with those hundred qubits they could potentially do something a lot better than a classical machine. That's what's called quantum supremacy. So to give you a little flavor of quantum computing to go away with, here's a version of quantization that's particularly simple once you get it. There are certain particles, protons and electrons and neutrons are good examples, that are called spin one-half particles. Physicists treat these things like they're spinning like a top, okay? Except what's interesting is that they can only spin with the axis pointing up or down, nowhere in between, okay? That's the quantization thing. What I'm showing here is the spin relative to a magnetic field, north and south. And what's interesting about spin up or spin down is that now I've got a zero and a one, okay? Suddenly I've got binary; that's interesting, right? So particles like protons or electrons have this intrinsic spin, and now I've got one and zero, or up and down, okay? And there's a notation, Dirac's bra-ket notation, that takes this messy physical situation and writes it as a zero or a one inside those brackets, representing spin up and spin down, okay? Or vice versa, depending on how you set it up: if the spin is aligned with the field, that's the lower-energy state, so that's typically the zero. Now, one proposal for building quantum computers from way back when was called the Kane proposal.
In the Kane proposal, the spins are what you get when you embed phosphorus impurity atoms into silicon. Those phosphorus impurities have a spin up or spin down that can be treated as one and zero, and you can use electrons to manipulate them, okay? That was one of several proposals I like to think of as scalable solutions built on top of silicon, which are exciting because maybe you could get Moore's law out of them. So this is an example of something people were looking at, okay? But the operating temperature here was less than one kelvin, which is really cool, okay? Now here's where the quantum computing story gets pretty tricky, so bear with me a little; I know I'm going a tiny bit over. The zero and the one are actually wave functions, if you take quantum mechanics, representing spin up and spin down. And what's interesting is that the wave function in quantum mechanics is a complex function, so I can add the two together as ψ = c₀|0⟩ + c₁|1⟩, with complex coefficients, and all I need to ensure is that |c₀|² + |c₁|² = 1. What happens is that these end up being probabilities: if I actually looked, I'd see a zero with probability |c₀|², or a one with probability |c₁|². So this ψ is a superposition of zero-ness and one-ness together, okay? Now, I realize this looks a little weird; we don't normally get wave-function notation in 162. But like I said, what's very interesting is that this describes a combination of zero-ness and one-ness where the probabilities can be adjusted anywhere you want, as long as their norms squared add up to one, okay? And if you measure the bit, if you actually ask whether you have a zero or a one, what's funny is you never find this combined thing: when you look, you find either up or down, with the given probabilities, okay? All right, now bear with me. Those of you who are skeptics would say: really, I just don't know whether it's a zero or a one, so these c₀ and c₁ merely represent my lack of knowledge, and once I finally look, I find out what it was, okay? I'm sure several of you think that's the case, and that's one option. The other option is that this is a real effect: the proton, or whatever we're looking at, actually is sort of in one state and sort of in the other. Those are the two options, and it turns out there's a set of famous Bell inequality experiments showing that reality is actually the second choice. So, as weird as it is, the proton really is a combination of zero and one at the beginning, and it's only when we look carefully, when we actually try to measure it, that it gets forced into one state or the other, okay? If you think about this in terms of building a quantum computer, there are a couple of interesting consequences. For one, we've got to make sure the environment never measures before we're ready, because otherwise we'll destroy this interesting superposition, and we may need it to compute, okay? And just to show you how weird this really is: if I have a bunch of bits, n bits together, there are 2ⁿ possible values of those n bits; you know that, right?
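Before scaling up to n bits, here's a minimal classical simulation of the single-qubit measurement rule just described. The helper name and the sampling approach are just illustrative, but the probabilities follow exactly the |c₀|², |c₁|² rule above.

```python
# Classical simulation of measuring one qubit in the state c0|0> + c1|1>.
import random

def measure(c0: complex, c1: complex) -> int:
    """Outcome 0 with probability |c0|^2, outcome 1 with probability |c1|^2."""
    p0 = abs(c0) ** 2
    assert abs(p0 + abs(c1) ** 2 - 1.0) < 1e-9, "state must be normalized"
    return 0 if random.random() < p0 else 1

# An equal superposition: 50% zero-ness, 50% one-ness.
c0 = c1 = 1 / 2 ** 0.5
counts = [0, 0]
for _ in range(10_000):
    counts[measure(c0, c1)] += 1
print(counts)  # roughly [5000, 5000]
```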
Now, here's a three-bit example of a superposition in which there are eight options, and I can simultaneously have many of those eight values in different proportions, as long as the probabilities sum to one, okay? So as long as |c₀₀₀|² + |c₀₀₁|² + ... + |c₁₁₁|² = 1, that's a real physical situation representing a single register in my computer that holds a superposition of all of those values, and the moment I take a look, I force it to be one value. And if you measure only one bit, you can ask: what's the probability the first bit is a zero? Well, which basis states have a zero in the first bit? Those four. So the probability of finding a zero in the first bit is the sum of their norms squared, the probability of finding a one is the sum for the other four, and you can go from there, okay? (There's a small sketch of this arithmetic below.) So we really don't want the environment to measure this before we're ready, and in fact we have quantum error correction codes, believe it or not, which can protect this quantum information from being measured by accident by the environment. As a result, we can hold these quantum states long enough to actually do something interesting with them. Now let me show you a simple two-bit state, okay? This is called an EPR pair, for Einstein-Podolsky-Rosen, who produced it as a thought experiment. The idea is I've got two bits, but I don't have all four options: I have only a zero-zero or a one-one, okay? Those are my two options. And then I separate the two bits, the two spins; in fact, maybe I send one on a rocket ship to go light-years away. So now the two bits represented by this state are light-years apart, and if we measure one of them, say we measure back on Earth and find a zero, we know instantaneously that there's a zero on the other side, okay? That looks like faster-than-light travel, instantaneous travel of information from the Earth out to that far planet. Einstein really didn't like this; he called it spooky action at a distance, okay? But what's interesting is that you can prove no actual information is transferred, okay? We can, however, use this to do what's called teleportation: take quantum information on one side, do some measurements, send some classical data to the remote side, and recreate the quantum state at the other side. That's teleportation, okay? All right, I'm about to lose a bunch of you, but let me show you how you factor with this, because I think it's interesting. The way we build a computer here is we take a complex state like I just showed you, and we put it through a bunch of adders or whatever you want to call them, which are really all unitary transformations: things that make sure the probabilities always add up to one. Then we measure a result, and the output is our answer, okay? That's the basic computing paradigm: you input a register with superposed values, you do a bunch of computing on it such that the probabilities are preserved, and you measure, okay?
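Going back to the three-bit register for a second, here is that partial-measurement arithmetic in the same classical-simulation style as before. The particular amplitudes are made up for illustration, and the bit-indexing convention (bit 2 as the first, most significant bit) is an assumption.

```python
# Sketch of a three-bit superposition: eight amplitudes, one per basis state.
import random

# Hypothetical real amplitudes, indexed by the 3-bit value; norms sum to 1.
amps = [0.5, 0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.5]
assert abs(sum(a * a for a in amps) - 1.0) < 1e-9

def measure_first_bit(amps):
    """Measure the first (most significant) bit and collapse the state."""
    # Probability of 0 = sum of |c|^2 over states whose first bit is 0.
    p0 = sum(a * a for i, a in enumerate(amps) if ((i >> 2) & 1) == 0)
    outcome = 0 if random.random() < p0 else 1
    # Keep only the consistent amplitudes and renormalize; the rest go to zero.
    keep = p0 if outcome == 0 else 1 - p0
    post = [a / keep ** 0.5 if ((i >> 2) & 1) == outcome else 0.0
            for i, a in enumerate(amps)]
    return outcome, post

outcome, post_state = measure_first_bit(amps)
print(outcome, [round(a, 3) for a in post_state])
```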
And the way the paradigm plays out is this: if you put in an input with all possible combinations of the inputs at equal amplitude, it looks like you're doing the computation on all possible values at once, but when you measure, you pick out exactly one, and that's the answer you get, okay? If you don't do anything clever in the middle, this just looks like a random computation of the kind you see in CS70 or 170: you randomly pick an input, compute on it, and look at the result. That's not very interesting, right? We already know how to do that. What we'd like instead is that when you take an input state that's a complex combination of possible inputs and run it through the quantum computer, the measured output is, with high probability, some answer that was hard to find. That's what we'd like, okay? If the 2ⁿ inputs are equally probable, the 2ⁿ outputs could all be equally probable too; what we want is for the output probability to pile up high on the answer we care about. And it turns out that something like a Fourier transform does the trick, okay? If we can do a Fourier transform on the right input, we can get an interesting output. So bear with me, and I'll show you Shor's factoring in one slide. This is the thing that would break RSA, okay? The difficulty behind RSA, the type of cryptography you might use with your bank across the internet, is taking a large, publicly known number and factoring it into two large prime factors. If you can do that, you break the cryptography. Classically, the best known algorithms are super-polynomial, so as long as the numbers are big enough, nobody's going to break it; a quantum computer can do it in polynomial time. Let me show you how, in a nutshell. You pick a random x between 2 and N − 1; that's easy. If the GCD, the greatest common divisor, of x and N is not one, you just found a factor: you win. Assume that didn't happen. Then find the smallest integer r such that x raised to the r is congruent to 1 mod N. That's just modular multiplication, but finding r efficiently is the really hard part. If r is odd, we have to repeat with a new x. If r is even, then let a = x^(r/2) mod N; since x^r ≡ 1 (mod N), we know a² ≡ 1 (mod N), so (a − 1)(a + 1) is a multiple of N. There's another failure mode where a = N − 1, but if it isn't, then gcd(a ± 1, N) is a non-trivial factor. So if we could somehow quickly figure out which r satisfies that equation, we'd win, and that's something you can't do easily classically. With a quantum computer, though, and I guess I don't have time to do this in full because we're running out of time, I can set up a situation where the input to my algorithm is all the possible values of k: I take a superposition over all k, compute x^k for each, and do a Fourier transform. The key fact is that x^k, as k runs through its values, is a periodic function with period r, because if x^r ≡ 1, then x^(2r) ≡ 1 and x^(3r) ≡ 1 and so on. (A classical sketch of this whole reduction follows below.)
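Here is a minimal classical sketch of that reduction, under the assumption that N is an odd composite that isn't a prime power. The order-finding loop below is brute force, which is exactly the exponential step the quantum Fourier transform replaces with a polynomial-time one.

```python
# Classical sketch of Shor's reduction; n must be an odd composite, not a
# prime power. find_order is the step a quantum computer would accelerate.
import math
import random

def find_order(x: int, n: int) -> int:
    """Smallest r with x**r = 1 (mod n). Brute force here; quantum does it fast."""
    r, v = 1, x % n
    while v != 1:
        v = (v * x) % n
        r += 1
    return r

def shor_factor(n: int) -> int:
    while True:
        x = random.randrange(2, n)
        g = math.gcd(x, n)
        if g > 1:
            return g                  # lucky: x already shares a factor with n
        r = find_order(x, n)
        if r % 2 == 1:
            continue                  # odd order: retry with a new x
        a = pow(x, r // 2, n)         # a*a = 1 (mod n), so n | (a-1)(a+1)
        if a == n - 1:
            continue                  # trivial square root: retry
        return math.gcd(a - 1, n)     # guaranteed non-trivial factor of n

print(shor_factor(15))    # 3 or 5
print(shor_factor(3127))  # 53 or 59, since 3127 = 53 * 59
```

On a real quantum computer, find_order is the only piece that changes; all the gcd bookkeeping around it stays classical.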
So I can essentially do that Fourier transform on a quantum computer, get those peaks, and the Fourier transform tells me the frequency, which gives me the value of r that I need. I have to repeat this a polynomial number of times, and then voila, I've just factored the number. Okay, so that's the essence of Shor's factoring algorithm. It all hinges on the fact that I can prepare a superposition of all the values x^k, with k running over its whole range, do a Fourier transform, and read off the result. Now, the interesting question is: is this something to worry about? And the answer is, so far, no, but it's looking like it's getting closer and closer, okay? There's been a lot of very interesting progress in quantum computing in the last five years. I did a lot of research on quantum computers in the early 2000s, up through about 2010 or 2011, and I think it's looking more promising now. One of the things we did, and we don't have time to talk about it, is investigate what it would look like to build that factoring algorithm as quantum circuits that could run on a quantum computer. We looked at ways of optimizing it and compared the performance of different options for Shor's factoring algorithm as quantum circuits, and we built a CAD tool to do that. So I think it's a pretty interesting area right now, and there's a lot of interest in it. All right, sorry I kept you guys way over, but this is the last lecture, and I figured some of you might be interested. We talked about key-value stores, and we talked about Chord. Hopefully I gave you an idea of Chord, because Chord is the root from which pretty much all the interesting peer-to-peer algorithms come, and it's used in a lot of areas right now. We briefly went through some cryptography, and then I talked about how data capsules are all about the data. That's a new model where the data can float to the edge, into the cloud, and back, and I think it's a pretty exciting project we've got going, if anybody's interested in that. And then I told you a little bit about quantum computing; feel free to come ask me about it, or look at 191, which is an interesting class on quantum computing. All right, well, thank you everybody. Sorry for going way over today, and thank you to those of you who stuck around. I hope you have a good finish to project three, and those of you listening in cyberspace later as well, you are all great. I'm going to miss you guys, and I hope you have a wonderful holiday. Have a good rest of your semester, and I hope you don't have too many finals in a week. Bye now.