So my name is Ryan Worl, and I'm here today to talk about how to build anything with FoundationDB. I know that sounds a little nuts. I can't cover literally everything because I'm only here for half an hour, but I want to give you some ideas and some inspiration, so that even if you're not super familiar with FoundationDB today, you can dream up things to build in the future. And I want to give you the foundations of how to use keys and values to do real work.

We're going to do that with two examples. The first is modeling what looks like a relational database schema, which you're all very familiar with by now after all the talks today about the Record Layer; some of the same details came up in the CouchDB talk as well. The second problem is adding features to an existing distributed system that doesn't provide strong guarantees. We're going to do that by adding consistent object listing, basically removing some of the problems with Amazon S3.

When you start using FoundationDB, you're basically dropped into a world with nothing, and you have to figure out what to do yourself. It's not quite literally nothing, because you have the tuple layer, but it's not a whole lot. I think this quote is from Dave Scherer, though I'm not entirely sure; I believe it's from the first Hacker News thread when FoundationDB was open sourced. It's the best description I have ever seen of what it feels like to use FoundationDB. What I want to do today is give you the hammer and the nails with which to take those two-by-four framing studs and build a house, or a database. An alternative way to look at it is that you get some Lego, which is a little more fun than building a house.

When we're talking about databases, I want to do some definitions first and map out where things fit. On the projector the colors probably don't come across as vibrantly as they do on my screen, but the letters on the slide are divided between blue and pink, and what we're going to talk about today is basically the C in ACID. FoundationDB does the other letters for you.

So what is that C? I have to talk about this for a second: I'm not talking about the C in the CAP theorem. That's not what this talk is about. It's really not relevant for the first half; it'll be slightly more relevant for the second half, but even then, not really. That's not the C I mean. This is the C I'm talking about: your program brings the database from one valid state to another. Emphasis on "your program". That is your job. Even if you're just building a Rails app on top of a SQL database, it is still your job, at least partially. And you need the A, I, and D from ACID in order to do it.

My experience in the real world is that data corruption is a combination of bad application code, as in bugs you all write in your programs, and bad databases. I'm sure you've all experienced both. Some problems are hard to track down to one or the other, but it's definitely a combination of the two. Hopefully FoundationDB gets rid of the bad-databases part, so it's just your bad code that's the problem now. You've got to fix that.

Most databases provide you with a few things in the C realm, beyond the A, I, and D that I already mentioned.
The most important one, I would say, is no false positives or negatives from index lookups. You really need that one if you have indexes, but if this is your layer, that's your job. There's schema management, like inserting and indexing records, plus some elements of type checking; not all databases do that, but a lot of them do. And there are foreign-key-style referential integrity constraints. Not all databases support those either, but if you add them, they're useful for ensuring there are no bugs in your program logic.

So how do you create consistency? Remember, this is consistency in the ACID sense, not the CAP theorem sense. You're provided with A, I, and D, and C is your layer's responsibility. How do you do that?

In the example today, we're going to talk about a simple SQL-like schema. It's not about any particular database; it's just a logical model to think about. It's a bit oversimplified, but not too far off, and I think the Record Layer talks gave you a lot of information about how to productionize this type of thing.

More definitions, because these words are overloaded in a lot of domains. What do I mean by a database? A named container for tables, and some other things that I'm not going to talk about; we're only covering tables today. What is a table? A named container for indexes and attributes. What is an index? A description of how you transform a record into some key that you're going to index, and of what you're going to store in the value. An attribute, I think you all get: it's some field on some record, defined at the level of the table. What's a foreign key? I touched on that before, but it's a way to manage the lifecycle of the relationship between records across tables. A lot of databases enforce that using an index under the hood, even if you don't declare one. It's not required in theory, but it's a common way to do it.

Using the tuple layer, which is the tool you get in the bindings for structuring your keys, and which is a good idea to use, you could model this as: you've got a database called "school"; it's got a table called "students"; there's an index that is the primary index; and the value you store under this index key, which I'll explain a little more, is the value for that ID column. You can see this structure on the slide, and I'm going to use it throughout the rest of this section of the talk. Each block is one component in a tuple.

The insight here, which may not be obvious when you're thinking about how to model a SQL database, is that the primary index is an index like any other index. You can put data in it like any other index; it's not some special thing. That's the kind of creativity you need when thinking about how to model things.
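To make that concrete, here's a rough sketch of what that primary index could look like with the Python bindings' tuple layer. The subspace names and the name and zip attributes are just the made-up example from the slides, not anything canonical:

```python
import fdb

fdb.api_version(610)
db = fdb.open()

# ('school', 'students', 'primary', <id>) -> packed non-key attributes.
# Long, readable names here are only for illustration; a real layer would
# shorten these prefixes, as discussed below.
students_primary = fdb.Subspace(('school', 'students', 'primary'))

@fdb.transactional
def insert_student(tr, student_id, name, zipcode):
    # The primary index is just an index: the key carries the primary key,
    # the value carries the remaining attributes.
    tr[students_primary.pack((student_id,))] = fdb.tuple.pack((name, zipcode))

@fdb.transactional
def get_student(tr, student_id):
    v = tr[students_primary.pack((student_id,))]
    return fdb.tuple.unpack(v) if v.present() else None
```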
So what about a non-primary index? That's another index too. As an aside, the long names in the tuple components are just for illustration. You wouldn't actually do that in a real layer, though maybe you could once Redwood ships prefix compression; today you'd use something like the directory layer to turn those names into short prefixes, or do something equivalent yourself. There's a benefit to doing this beyond prefix compression that I'll talk about later.

Remember what I said an index is: some rules for transforming a record into a key that you put in the index, and a value that you store. This particular index is a non-unique index. You can imagine some other person with my name in some zip code I used to live in who is also indexed in this table, so you need to store the primary key of the record at the end of the index key. If another person named Ryan lived in zip code 10075, they could have primary key two, and both records can live in the index, so that when you do a range scan you get both back.

That's the difference between a unique and a non-unique index key. In a unique index, you wouldn't necessarily put the primary key at the end of the key. These are choices you can make, but putting it there is a method of structuring your keys so they're unique, not of enforcing the uniqueness rule; that's something you do in your code. You'd still presumably read that key and make sure it doesn't already exist before you blindly write to it. The point here is just the key structure. And you want the last element of the key to be the primary key so that when you do a range scan you get all the matching entries back and can dereference them against the base table.

This next point was covered in the CouchDB talk today, and it's very important. If your users expect their data to come back in the sort order of their native language, not just byte order, you need to support what's typically called collation. You store a representation that has gone through something like ICU to produce a key that sorts correctly for the database, but you also need to store the representation the user would see. Basically, if your app is going to be used by people who don't just speak English, you need to do this. CockroachDB, a SQL database built on top of a key-value store, has very good documentation about how they do this; it's a real thing that works in the world and has lots of details, just like the Record Layer, so you can go check it out. I'm going to upload a new version of the slides with this in it, so you don't need to take a picture.

Now I can explain why, beyond the prefix compression benefits, you'd want some indirection between the name of a table, for example, and the ID of the table. If you want to support renaming a table, or any other logical schema object, you don't want to have to rewrite all the data just because you stuck the name in the key. That would be annoying. So there are real benefits to remapping the schema object's name to some ID.
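Before we get to schema changes, here's a minimal sketch of that non-unique index shape, with the base record and the index entry written in one transaction; again, the subspace and attribute names are the hypothetical example from the slides:

```python
import fdb

fdb.api_version(610)
db = fdb.open()

students_primary = fdb.Subspace(('school', 'students', 'primary'))
# Non-unique index: ('school', 'students', 'by_name_zip', <name>, <zip>, <pk>) -> ''.
# The primary key at the end keeps entries distinct, so two people with the
# same name and zip can both live in the index.
by_name_zip = fdb.Subspace(('school', 'students', 'by_name_zip'))

@fdb.transactional
def insert_student(tr, student_id, name, zipcode):
    # Record and index entry change atomically, so the index can never
    # disagree with the base table.
    tr[students_primary.pack((student_id,))] = fdb.tuple.pack((name, zipcode))
    tr[by_name_zip.pack((name, zipcode, student_id))] = b''

@fdb.transactional
def find_student_ids(tr, name, zipcode):
    # Range-scan the (name, zip) prefix and pull the primary key off the
    # end of each index key, ready to dereference against the base table.
    return [by_name_zip.unpack(k)[-1]
            for k, v in tr[by_name_zip.range((name, zipcode))]]
```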
Another feature you may want in your layer is non-blocking schema changes. The consistent metadata management feature added in 6.1 makes this much easier than it would have been in the past, which is why I can actually explain it in 30 minutes. When I say "consistent" here, it's not the moving-the-database-from-one-valid-state-to-another kind; the metadata management feature is consistent in the CAP theorem sense, and that's the one I mean right now. The rule is: an index must not return false negatives by serving queries before it is fully built.

How does the metadata version feature help? If you update the metadata object and update the metadata version key in the same transaction, you can, as was described in the lightning talk, keep a history of the metadata and cache it in your layer, so you don't have to repeatedly read it out of the database and cause a hotspot. It's stored at the key everyone saw earlier today, and you use it to signal that metadata has changed in your layer. Again, this is stuff you've already seen. And it's free to read this key, which is another thing I want to emphasize: starting in 6.1, it's sent to you automatically when you start a transaction.

The paper I think Alex referred me to about how to do this is "Online, Asynchronous Schema Change in F1" from Google. F1 is basically their SQL layer on top of Spanner, which puts it in roughly the same situation you'd be in with FoundationDB, so the approach carries over. It's a little tougher for them because they don't have the notion of a metadata version, so the paper goes into lots of complicated rules about how to cope without one. Luckily, you can just implement the state machines from the paper, and it's a lot easier.

I'm going to describe adding an index; there's a whole bunch of other schema changes you can do that I won't cover. Basically, you update the schema object and the metadata version in the same transaction to signal that the schema has changed. You're doing this in a transaction, so it's atomic, just like anything else. You set the index's initial state to write-only, which means it's invisible to reads from other transactions. Why does it have to be invisible to reads? Because it's still being built, so if you served reads from it, it could serve false negatives. As new transactions start after the metadata change, they do write into the index on inserts, updates, and deletes; they just don't read from it. And in the background you run some type of backfill job to add the existing records.

One way to implement the backfill is to store the version of the metadata each record was written with, so you know whether it came from the old or the new version of the schema. Then you do a big scan in the background, use that version to detect whether each record was written before or after the change, and add the missing ones to the index. That background scan is the thing Nicholas just mentioned is becoming more parallel in a future version of the Record Layer; again, you can reference the Record Layer for how to implement these bits efficiently.

When the background scan has indexed all the records, you update the metadata again, along with the metadata version, to say the index is ready for read-write traffic. And then you're done. This is a lot simpler than what's described in the paper, because you have access to that consistent metadata version.
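Roughly, the caching pattern could look like this in the Python bindings. The schema key location, the cache shape, and the value encoding on the version bump are my own assumptions for this sketch, so check the metadataVersion documentation for your API version:

```python
import struct
import fdb

fdb.api_version(610)
db = fdb.open()

# Readable by any transaction; since 6.1 its value is shipped to the client
# along with the read version when the transaction starts, so checking it
# costs nothing.
METADATA_VERSION_KEY = b'\xff/metadataVersion'
SCHEMA_KEY = fdb.tuple.pack(('school', 'schema'))  # hypothetical location

_cache = {'version': None, 'schema': None}

@fdb.transactional
def read_schema(tr):
    v = tr[METADATA_VERSION_KEY]
    if _cache['version'] is not None and v == _cache['version']:
        return _cache['schema']  # nothing changed anywhere; skip the read
    _cache['version'] = bytes(v)
    _cache['schema'] = bytes(tr[SCHEMA_KEY])
    return _cache['schema']

@fdb.transactional
def update_schema(tr, new_schema):
    tr[SCHEMA_KEY] = new_schema
    # Bump the global version in the same transaction, so every client's
    # cache invalidates atomically with the schema change. The encoding
    # (14 zero bytes plus a 4-byte versionstamp offset) is what recent API
    # versions expect, but verify it against your binding's docs.
    tr.set_versionstamped_value(METADATA_VERSION_KEY,
                                b'\x00' * 14 + struct.pack('<L', 0))
```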
Another feature you may want to implement, because business people commonly request it, is change data capture, so you can audit changes to your tables. The goal is to log the before and after version of every record written to the database, in the order the changes happened, in some type of log structure. Those of you who understand FDB will understand that this is hard. It's not something you get for free, which is why I'm giving you a warning: if you do this, you really have to understand what you're doing and do some careful capacity planning, because this is basically an append-only structure. Say you write the log in FDB and periodically move it into some other system; if you fall behind on moving it, you will run out of disk space very rapidly.

So how could you implement this in FDB? Again, it has performance implications, but depending on how valuable it is to you, it's a thing you can do. Before you write to a key that logically represents a record, like in the primary index of the table, you read it first. That has performance implications, because you're doing a read before every write, and it potentially has conflict implications too. Then you store the new version of the record and the old version together in another value, in some log-looking structure, using a versionstamped operation. The reason you can't do the read concurrently with the write, or after it, is the read-your-writes cache: you'd just get back the value you had written, which is useless for our goal here.

When I keep saying "old" and "new", or "before" and "after", what do those mean? For an insert, the before image is null, because it's a new record, and the after image is the inserted record. If you've seen the MySQL binlog or any other database change log, this should look familiar. For an update, the before is the old value and the after is the new value; that one makes obvious sense. For a delete, the before is the old value and the after is null, because the record no longer exists.

So how would you implement something like this? This is again a simplified key structure, but the primary index and the change data capture log are, once more, just indexes. On the top of the slide, imagine you read out the old value and you're about to update it to the new value. You write both the old and the new value into the change data capture index with a versionstamped operation, so that in the future you can read the log back out in order.

One more thing, if you're familiar with FoundationDB: you have to avoid write hotspots. A log like this naturally appends to the end of one range, so you need to structure your writes to hit multiple ranges at once and avoid overloading one storage server. A very naive way to spread the data out is to pick a random number mod the number of partitions. To read the log back, you read every partition from whatever version you're starting at up to the version you want. There are many fancier schemes you could come up with, especially if you wanted to keep related records in the same logical partition. And the reason the partition count should be a power of two is that, in theory, you can then split and merge partitions; if you don't pick a power of two, that's harder to do.
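Sketching that write path in Python, with the same hypothetical students table; the log layout, the value encoding, and the partition count are all made up for illustration:

```python
import os
import fdb

fdb.api_version(610)
db = fdb.open()

students_primary = fdb.Subspace(('school', 'students', 'primary'))
NUM_PARTITIONS = 8  # power of two, so partitions can be split or merged later

@fdb.transactional
def update_with_cdc(tr, student_id, new_value):
    record_key = students_primary.pack((student_id,))
    # Read BEFORE writing: with read-your-writes, reading afterwards would
    # just hand back the value we wrote, losing the before image.
    old = tr[record_key]
    before = bytes(old) if old.present() else None  # None before image = insert
    # Random partition spreads the append-only log across ranges so one
    # storage server doesn't take every write.
    partition = int.from_bytes(os.urandom(2), 'big') % NUM_PARTITIONS
    # The incomplete versionstamp in the key is filled in at commit time,
    # which orders the log entries by commit version.
    log_key = fdb.tuple.pack_with_versionstamp(
        ('school', 'students', 'cdc', partition, fdb.tuple.Versionstamp()))
    tr.set_versionstamped_key(log_key, fdb.tuple.pack((before, new_value)))
    tr[record_key] = new_value
```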
A feature request I have, maybe somewhat related to the work going on for backup, is exposing the data on the transaction logs (TLogs) to applications; some people might find that easier than implementing this scheme. It doesn't give you the before images, but it does give you the after images, which may be more valuable to some people than the before images. You could also implement this on the storage server. Again, that's kind of crazy and it's not free, but that's another spot where a change-feeds type feature could be implemented for you. It breaks the key-value abstraction once again in favor of efficiency for a high-value use case. It's arguable whether it's high value for you or not, but that's a common theme, especially with the query push-down feature Evan talked about this morning: breaking the abstraction in favor of a little more efficiency.

So, on to the second half, or the last third, I guess, however you want to count it. Let me describe why Amazon S3 is not fun to work with. It has a lot of limitations, and these are quotes from the documentation. "Amazon S3 offers eventual consistency for overwrite PUTs and DELETEs in all regions." That doesn't sound good. I'm not going to read the next one, but it basically says that object listing is not consistent either; again, not fun if you want to work directly against S3. This one is not obvious at all: if you make a request to S3 for an object that doesn't exist, and then you write that object, say because you were speculatively hoping it would exist at some point, future reads of that key are only eventually consistent. I don't know why; that's just how it is. Maybe they fixed it and never updated the documentation, but it's something you have to live with. And this one is basically the killer: there's no conflict resolution, no locking, basically. If two writers simultaneously write to the same key, you'd better hope your timestamp is later, otherwise the other writer wins. What we're going to build is somewhat of an object locking mechanism, except FoundationDB takes care of that part for you.

So what can you do with S3? What does it actually give you? Assuming you read only keys that were already written, and you know for a fact they were written, and assuming you never update them, you get consistent read-after-write. That's basically the only thing. You do not get consistent listing in any way. That's not a lot to work with if you want to operate directly against S3.

So how do we fix listing? This part is kind of easy: you write all objects to S3 first, in a way I'll describe in a moment, and then you write a pointer to each object into FDB. If you perform your equivalent of S3 list operations using only FoundationDB get-range requests, you get consistent object listing. There are lots more details you could go into about exactly which semantics you want, but that's the gist. We need a little more than that to make it work, though. I'm also going to assume you don't care about garbage-collecting failed puts to S3. You can imagine a scenario where you successfully write to S3 and then FoundationDB is unavailable when you go to write the pointer, so now you have some garbage data in S3. That's fixable, but not that interesting.

So the way you write objects now is: pick some ID that is random, long, and unique; write the object to S3 under that ID and wait for S3 to acknowledge the request; and then write the real key you wanted, along with the pointer to the S3 object, into FDB.
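Here's a rough sketch of that put-then-point scheme, including the listing, read, and delete paths described next. The bucket name and key layout are hypothetical, and I'm skipping the failed-put garbage collection as discussed:

```python
import uuid
import fdb
import boto3

fdb.api_version(610)
db = fdb.open()
s3 = boto3.client('s3')

BUCKET = 'my-bucket'  # hypothetical
pointers = fdb.Subspace(('s3', 'pointers'))
gc_queue = fdb.Subspace(('s3', 'gc'))

def put_object(db, logical_key, data):
    # Random, never-reused blob ID: new-key read-after-write is the one
    # consistent operation S3 gives us.
    blob_id = uuid.uuid4().hex
    s3.put_object(Bucket=BUCKET, Key=blob_id, Body=data)  # wait for the S3 ack
    _set_pointer(db, logical_key, blob_id)  # only then publish the pointer

@fdb.transactional
def _set_pointer(tr, logical_key, blob_id):
    tr[pointers.pack((logical_key,))] = blob_id.encode()

@fdb.transactional
def list_objects(tr):
    # Consistent listing comes entirely from FDB; S3 is never consulted.
    return [pointers.unpack(k)[0] for k, v in tr[pointers.range()]]

def get_object(db, logical_key):
    blob_id = _get_pointer(db, logical_key)
    if blob_id is None:
        return None
    return s3.get_object(Bucket=BUCKET, Key=blob_id)['Body'].read()

@fdb.transactional
def _get_pointer(tr, logical_key):
    v = tr[pointers.pack((logical_key,))]
    return bytes(v).decode() if v.present() else None

@fdb.transactional
def delete_object(tr, logical_key):
    k = pointers.pack((logical_key,))
    v = tr[k]
    if v.present():
        # Enqueue the orphaned blob so it's eventually deleted from S3,
        # then drop the pointer; listings stop seeing it immediately.
        tr[gc_queue.pack((uuid.uuid4().hex,))] = bytes(v)
        del tr[k]
```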
Now you can do things like conditional writes based on the metadata, the way Google Cloud Storage can; the reason Google Cloud Storage can do that is that it's backed by Spanner. So how do you read out of this structure? You go to FoundationDB first, read the key, and then follow the pointer into S3. It's slightly higher latency, but really not much; S3 is not particularly fast to begin with. And how do you delete data? You add the pointer stored in that key to some type of queue that ensures the S3 object is eventually deleted, and then you delete the key from FoundationDB.

Everything I just described is basically how you can wrap some other system in a shell of FoundationDB that protects you from all of its bad spots and gives you features that make it easier to program against. Imagine some analytics application working against data in S3 that previously had to deal with these artifacts of eventual consistency by retrying until it saw what it expected to see, and all kinds of other tricks, and that can now mostly avoid them.

So go build stuff. Thank you.