All right, this is the last session for the Data Facility Lectures, and we're super excited to have Lauren Fuchs from Oracle. Lauren did her undergrad and PhD at RPI, Rensselaer, in New York. I'm told to skip the master's part — she never even bothered with one; she has only a bachelor's and a PhD. That's crazy. Right — yeah, they give it to you for free, you just sign a form. Or maybe RPI is different. You don't care. I don't need it. She's been at Oracle since 2007, working on core development of the system. And she's going to go retro today and not have any PowerPoint; everything's going to be on the board. So let's thank Lauren for being here.

Thanks. Morning, everyone. Thanks for that introduction, Andy. Before I get started, let me get a feel for the audience here. Who here is an undergraduate student? None? So who's here as a graduate student or a postdoc? OK, and I assume everyone else, you are professors? OK, so I assume I don't need to describe to you what a B+ tree is or what the ACID properties are and all that.

All right. So before I get into the design of Berkeley DB, I'm going to start with its history. It all began a long time ago, in 1991, at the University of California, Berkeley — hence the name. It began with some graduate students: Margo, Keith, and a few others. They were finding that the existing database library for UNIX did not meet their needs. So, being the intrepid students they were, they wrote their own. They created the database, put it online, made it open source, and went on to other things. As time went by, as often happens with open source products, they'd get requests for bug fixes. And being the busy students they were, they had to go, oh yeah, sure, we'll try to fix it, but we don't know when we'll get to it. This went on for several years, until 1996. An up-and-coming web browser company named Netscape came to them with a list of bug fixes. Again: we'll get to it when we can, but we're busy. So Netscape came back and said, OK, if you do this now, we'll give you a big stack of money. And that was the birth of Berkeley DB as a commercial product.

Since then, Berkeley DB has been deployed millions of times. It's become the back end of many different products, ranging from Amazon to MySQL to Subversion. Pretty much, if you log on to Linux or any Unix system like that, BDB is already installed. We are everywhere.

Now, before I get started, one other thing. The Keith in the story I just told you — being the slacker that he is, he started another database company and created another database after Berkeley DB was sold off to Oracle in 2006. That new database, WiredTiger — he's the same Keith who talked to you at the beginning of this seminar.

So now, on to the meat of the presentation: the design of Berkeley DB. BDB can be summed up in two words: simple and flexible. We are a key-value store. There's no schema, no query engine, no optimizer, none of that. We take your data — doesn't matter what it is. We'll take anything: pictures, video, text, it's all the same to us. We'll store it for you, and at some later time, we'll give it back to you. Because we're so flexible, as I pointed out before, we're the back end to a lot of other databases that are more complex. MongoDB used us as a back end, as a proof of concept that they could get rid of their old one. Berkeley DB offers a SQL API, which we created just by taking SQLite, ripping out its pager, and sticking ourselves in. So that's a description of how flexible we are.
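For reference, that whole surface area — store a key, get it back — looks roughly like this through the C API. A minimal sketch: error handling is omitted and the file name mydb.db is made up.

```c
#include <string.h>
#include <db.h>    /* Berkeley DB public API */

int main(void) {
    DB *dbp;
    DBT key, data;

    /* Create a database handle and open (or create) a B+tree database file. */
    db_create(&dbp, NULL, 0);
    dbp->open(dbp, NULL, "mydb.db", NULL, DB_BTREE, DB_CREATE, 0664);

    /* Keys and values are just byte strings to Berkeley DB. */
    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = "CMU";           key.size = 3;
    data.data = "is awesome";   data.size = 10;

    dbp->put(dbp, NULL, &key, &data, 0);   /* store it ...        */

    memset(&data, 0, sizeof(data));
    dbp->get(dbp, NULL, &key, &data, 0);   /* ... and get it back */

    dbp->close(dbp, 0);
    return 0;
}
```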
Now, as you all probably know, storing and getting data back, even though it sounds like a simple operation, is not — not once you want ACID protection for it. Berkeley DB provides ACID protection using five major subsystems: the cache, the data store, locking, logging, and recovery. And I'll start today with the cache.

Let's see. Every Berkeley DB database is just a set of pages. You start with page 0 — and the page number is going to be very important later — then page 1, and so on and so on, page 3,000, on and on, up to terabytes of data if you want. All pages are the same size, and that size never changes once the database is created.

So let's say we have an example database, MyDB, and we have a record in MyDB where the key is CMU and the data is "awesome." We'll say this record is stored on page 3,000 at offset 1,000. If MyDB is on disk and we want to access that record, all we do is open the file, start at the beginning, seek to 3,000 times the page size, read that page into the cache, and then go to offset 1,000 and read the key and the data.

Now, the cache, of course, wouldn't be very efficient if it had to read the entire database into memory, or go to the disk every time it read something. So the cache is a hash table in shared memory — yes, we use shared memory — shared by the databases, where each page is hashed by its page number: 0, 1, 3,000, and so on. Like most caches, when you go to read a page, it first looks through the hash table, in the correct bucket, for the page. If it's there — yay, we can read our data. If not, pull it in from disk, and then we can read our data.

So going back to the example of looking up our key, there's an obvious question: how do you know that CMU, "awesome," is at page 3,000, offset 1,000? Well, that brings us to the data store. Berkeley DB offers several different methods for organizing the data: B+ tree, hash, heap, and queue. You decide which type is used when you create the database. We'll say, for example, that MyDB is a B+ tree.

So again, we're trying to locate the key CMU and pull up its data. How we do this is as follows. First, we go to the root of the B+ tree, which is always stored on page 0 of the database. From there, it works pretty much the way you all know a B+ tree works: the page contains keys and the page numbers of other pages, so we do a binary search for the key and find that the next page to search is page 13. There, we binary search again and are pointed to page 3,000 — and since that's a leaf page, it contains the data for CMU. So it's pretty much a basic B+ tree.

However, there is one wrinkle here. Berkeley DB supports keys of any length. I mean, you can have keys that are multiple gigabytes long and we'll support it. I don't know why you'd want to do that, but it's possible. Have you seen people do that? We once had a bug where a customer ended up with multi-megabyte keys, but that was a product where we were generating the keys for them, and they managed to hit one awful spot where the keys just kept growing. As you can imagine, it was ridiculously slow. But generally, the keys are pretty small. Still, since we support keys of any size, building a binary search structure becomes kind of a problem, and I'll show you how Berkeley DB solves that.
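Going back to the cache lookup for a moment, the idea is just: hash the page number, look in that bucket, and on a miss read the page from the file at page number times page size. A simplified sketch of that idea — not Berkeley DB's actual memory pool code; the names and sizes here are made up.

```c
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE   4096
#define NUM_BUCKETS 1024

/* One cached page, kept on a per-bucket chain. */
struct cached_page {
    unsigned pgno;
    char data[PAGE_SIZE];
    struct cached_page *next;
};

static struct cached_page *buckets[NUM_BUCKETS];

/* Return the page, reading it from the database file on a cache miss. */
static struct cached_page *get_page(FILE *dbfile, unsigned pgno) {
    unsigned b = pgno % NUM_BUCKETS;            /* hash by page number */
    struct cached_page *p;

    for (p = buckets[b]; p != NULL; p = p->next)
        if (p->pgno == pgno)
            return p;                           /* cache hit */

    /* Cache miss: seek to pgno * PAGE_SIZE and read the page in. */
    p = malloc(sizeof(*p));
    p->pgno = pgno;
    fseek(dbfile, (long)pgno * PAGE_SIZE, SEEK_SET);
    fread(p->data, 1, PAGE_SIZE, dbfile);
    p->next = buckets[b];
    buckets[b] = p;
    return p;
}
```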
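The B+ tree lookup for CMU is then a descent starting from page 0: fetch a page through the cache, search it, and follow the child page number it gives you until you reach a leaf. Another rough sketch, continuing the one above; search_page() is a hypothetical stand-in for the in-page binary search described in a moment.

```c
/* Continues the cache sketch above (uses its cached_page and get_page).
 * Page 0 is always the root; interior pages hand back a child page
 * number, leaf pages hold the actual record. */
struct search_result {
    int is_leaf;          /* did we land on a leaf page?             */
    unsigned child_pgno;  /* if not a leaf: which page to descend to */
    void *record;         /* if a leaf: pointer to the key/data pair */
};

/* Hypothetical in-page binary search, sketched further on. */
extern struct search_result search_page(struct cached_page *p,
                                        const char *key, size_t keylen);

static void *btree_lookup(FILE *dbfile, const char *key, size_t keylen) {
    unsigned pgno = 0;                        /* root lives on page 0 */

    for (;;) {
        struct cached_page *p = get_page(dbfile, pgno);
        struct search_result r = search_page(p, key, keylen);
        if (r.is_leaf)
            return r.record;                  /* e.g. "awesome"       */
        pgno = r.child_pgno;                  /* e.g. 0 -> 13 -> 3000 */
    }
}
```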
So let me draw you an example page — say, a random page in Berkeley DB. It starts with metadata at the top: various bits of metadata used by the page. Next, it has an index, which contains offsets into the page where the keys are located. And finally, it has the keys themselves, along with their associated data. Now, the keys are actually not sorted on the page — we just stick them wherever they'll fit. However, the offsets to the keys are sorted, with the first offset pointing to the smallest key and the last pointing to the largest. So when you do a binary search, you start in the middle of the index, follow that offset to the key, compare it with your search key, and continue from there, just as you normally would.

That gets into off-page storage for large items. We have a special structure where the slot at that offset, instead of holding the key, holds a page number that points to another page holding the key — which in turn could hold just part of the key plus more metadata pointing to another page, and so on and so forth, until you've got the entire key.

Yes? Let me repeat the question for the video. Your question is — let me see if I get this straight — basically, what do you do when you insert a key, since this index has to remain sorted? How do you deal with that? You know, I'm sorry, I actually can't answer that right at the moment, because I'm not 100% sure how we deal with it. I know it's something clever. If you give me your email address at the end of the talk, I'll try to get the answer to you later.

OK, moving on. So we have our data, and we know how to get to it. That leaves the important question of how we protect it. How do we prevent collisions and overwriting, and make it durable and consistent, and all that? Well, that gets into the locking, the logging, and the recovery systems.

I'll begin with how Berkeley DB handles locking. Our locking system uses page-level read-write locks. Going back to our example: let's say that instead of reading "CMU, awesome," we want to insert it. Our process would start by requesting page 0. Before it can read that page, it needs a read lock on it, so it requests a read lock from the locking system. If page 0 has no write locks on it, the read lock is granted and we can continue. If there is a write lock, the request is put in a queue and the thread of control is forced to block until the read lock can be granted. It does the same for page 13, and the same for page 3,000. Once we're there and we've found where we want to put the record, the process asks the locking system to upgrade its read lock into a write lock. If there are no other read locks, the upgrade is granted and we can write our data to the page; once the transaction is resolved, all its locks are released in one fell swoop. If there are other read locks on the page, then again the upgrade request is put in a queue and the thread is forced to wait until it can get the lock.

Now, page-level locking is great — a lot more concurrent than the database-level locking a lot of systems use. But it has a major downside. I guess everyone here knows what it is: deadlock. That's the big problem with it. Berkeley DB handles that by implementing its own deadlock detection system.
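To make that page layout concrete, here's the in-page search sketched out: a small header, a sorted array of offsets, and the keys sitting wherever they landed. This is a simplification of the idea, not the actual on-disk format; it also assumes keys are stored as NUL-terminated strings, which the real format does not (it stores explicit lengths).

```c
#include <string.h>

/* Simplified page picture: a metadata header, then a sorted array of
 * offsets, then the keys themselves stuck wherever they happened to fit. */
struct page_header {
    unsigned pgno;          /* this page's number                 */
    unsigned nentries;      /* number of keys on the page         */
    /* ... the page LSN and other metadata also live up here ...  */
};

/* Binary search over the offset index: the offsets are kept sorted by
 * the keys they point to, even though the keys themselves are not. */
static int page_search(const char *page, const char *key) {
    const struct page_header *hdr = (const struct page_header *)page;
    const unsigned short *index =
        (const unsigned short *)(page + sizeof(*hdr));
    int lo = 0, hi = (int)hdr->nentries - 1;

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        const char *entry = page + index[mid];  /* key lives at that offset */
        int cmp = strcmp(key, entry);
        if (cmp == 0)
            return mid;             /* found; its data sits with it */
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;                      /* not on this page */
}
```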
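That read-then-upgrade flow can also be driven by hand through Berkeley DB's standalone lock API — roughly like this, as I recall the interface; the page-number-as-lock-object encoding is just for illustration, and normally the access methods do all of this for you.

```c
#include <string.h>
#include <db.h>

/* Rough sketch of the read-then-upgrade flow using the standalone lock
 * API. The "object" being locked is an arbitrary byte string -- here,
 * a page number, purely for illustration. */
static int lock_then_upgrade(DB_ENV *env, unsigned pgno) {
    u_int32_t locker;
    DB_LOCK rlock, wlock;
    DBT obj;
    int ret;

    memset(&obj, 0, sizeof(obj));
    obj.data = &pgno;
    obj.size = sizeof(pgno);

    if ((ret = env->lock_id(env, &locker)) != 0)
        return ret;

    /* Read lock first: queues and blocks if someone holds a write lock. */
    if ((ret = env->lock_get(env, locker, 0, &obj, DB_LOCK_READ, &rlock)) != 0)
        goto out;

    /* Found where the record goes; now ask for the write lock. It is
     * granted once no other readers remain, otherwise we wait in line. */
    if ((ret = env->lock_get(env, locker, 0, &obj, DB_LOCK_WRITE, &wlock)) == 0)
        env->lock_put(env, &wlock);

    /* At transaction resolution all locks go in one fell swoop; here we
     * just release what we took by hand. */
    env->lock_put(env, &rlock);
out:
    env->lock_id_free(env, locker);
    return ret;
}
```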
Whenever certain locks are requested, or at certain intervals, Berkeley DB will go through all the currently granted locks and pending lock requests, build a matrix, and use that matrix to determine whether there's a deadlock. If there is, it selects one of the transactions involved in the deadlock to be the sacrifice: it releases all of that transaction's locks and forces it to roll back, so the other transactions in the deadlock can move forward. We actually let you configure the victim policy. I think our default is random, but you can choose the longest-running transaction, the shortest, whoever has the most locks, whoever has the least, anything like that.

How often do your customers tune that? That seems like a hard thing to do, right? They don't really tune it much. We don't really see that much deadlock in practice, partly because of how the pages get used, and also because people tend to use Berkeley DB to insert a single record at a time, or just a few — not really long, intricate transactions. So it usually doesn't become a problem. And the deadlock detection isn't a separate thread that has to stop everything; it just runs as part of lock requests, or periodically as a daemon.

How do you handle the splitting of nodes in the B+ tree when there's a conflict — when a page has no space? I'm sorry, I don't quite follow. When you're inserting your record in the B+ tree, there's a possibility a page will overflow; how do you handle that split, given that going down the tree you only acquired read locks? You keep the read locks, which are different from the write lock, right? Oh — the read locks we keep. So if we have to start splitting and going up the tree, we just go back and try to upgrade those read locks. Yeah.

So now I'll get into how, when a transaction is sacrificed, we roll back its operations. That, of course, brings us to the logging system. Before I go into detail on it, I want to explain how Berkeley DB is different from a lot of databases. Many of them use a journal file. They use write-ahead logging — which Berkeley DB does too — and write the log records to the journal until it's time to resolve the transaction. Then they flush the journal to disk, flush the database pages to disk, and then just delete the journal, because now everything's done: it's consistent, we're good. Berkeley DB doesn't do that. Our logs are not a journal. Instead of a single journal, we use an ever-increasing number of fixed-length log files. They start at log.0000000001 — a bunch of zeros and a one — and once that fills up, we go to log.0000000002, and so on and so forth. And these log files can stick around forever, unless you do a checkpoint and specifically decide to delete the old ones.

Am I missing what the distinction is from the other systems? Is it just that these files are fixed-length? They're fixed-length, and they never get deleted just because a transaction commits. We don't delete them; they stick around, theoretically forever, and some people do keep them all. Once I get further into the logging, you'll see why.

Again, back to our CMU example. Say we've found our page and we've gotten our locks; we're ready to write the record onto page 3,000. But first, before that can be done, we have to write a log entry for the update.
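Before getting to what goes in a log record: choosing that deadlock victim policy is a one-line environment configuration. A rough sketch — DB_LOCK_YOUNGEST here is just an example choice, not necessarily the default.

```c
#include <db.h>

/* Sketch: pick how the deadlock detector chooses its victim, and run
 * one detection pass by hand. Detection normally happens as locks are
 * requested, or periodically (e.g. via the db_deadlock utility). */
static void configure_deadlock_policy(DB_ENV *env) {
    int rejected = 0;

    /* Abort the youngest transaction in the cycle. Other policies
     * include DB_LOCK_RANDOM, DB_LOCK_OLDEST, DB_LOCK_MINLOCKS,
     * DB_LOCK_MAXLOCKS, and so on. */
    env->set_lk_detect(env, DB_LOCK_YOUNGEST);

    /* Force a detection pass right now; "rejected" reports how many
     * lock requests were rejected to break deadlocks. */
    env->lock_detect(env, 0, DB_LOCK_DEFAULT, &rejected);
}
```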
Now, every log record in BDB begins with a log sequence number — we usually call them LSNs. The LSN consists of the number of the log file and an offset into that file where the log record is stored. Next is the transaction ID, which is exactly what it sounds like: a unique ID used to identify the transaction, so that if we have to roll back, we know which log records apply to that transaction. Next, the page number. All operations are logged and performed at the page level; if you have, say, a data record that has to be inserted on multiple pages, the insert is broken up into separate operations, and each operation gets its own log record. Next is the previous LSN. This just points to the previous log record of this transaction, so that when undoing a transaction, it's easy to read one record and then skip straight to the next one belonging to the same transaction. Since we've got a bunch of transactions going all the time, the record after this one in the log could belong to a different transaction, and we don't want to have to read through all of them to figure that out — we just jump, jump, jump, undo, undo, undo. Finally, we have the operation data. This is another thing Berkeley DB does differently from many systems: not only do we log enough information to undo the operation on an abort, we also log enough to redo it. In fact, for any Berkeley DB database, if you've kept your log records back to the beginning of time, we can recreate the entire database from scratch just by reading through the logs. We've got automated systems that do that, and some of our users actually do do that: they'll create in-memory databases that never hit disk, and if there's ever a crash, they just go back to the logs and continue.

So we have our log record for the insert. We stick it in the log buffer, and then we write our data onto the page. And now, this part is actually really important: we write the LSN of that log record onto the page itself. Since log numbers are always increasing, this tells us which log record was the last to update this page. Once this is all done, we flush the log to disk, and we're done. Actually, sorry — we resolve the transaction first, of course. If it's a commit, we create another log record that specifically says this transaction has committed, put that in the log buffer, and flush all of that to disk. And then, as I was saying before, we leave all the dirty pages in memory. We don't have to flush them to disk to guarantee consistency, because if there were a crash right now, right after the commit, we could still read back through the logs and have enough information to write everything back onto the pages. So how do we do that? Well, that's the recovery system.

Is fsync equally reliable everywhere, or are some systems better than others — say the disk controller has issues? Charles would be a lot better at answering that. I think we have had some issues with fsync, and on certain systems we've had to call other methods to get the data onto the disk. I do know there's been a specific problem where it didn't actually sync immediately, but I'd have to talk to Charles, because he's the one who's run into those problems.

Oh — well, actually, before I get into recovery, there's one more thing I have to describe, and that's checkpointing.
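To make that record layout concrete, here's a simplified picture of a log record header as a C struct. The field names and sizes are purely illustrative; this is not Berkeley DB's actual on-disk log format.

```c
#include <stdint.h>

/* Log sequence number: which log file, and the byte offset within it. */
struct lsn {
    uint32_t file;      /* e.g. the 7 in log.0000000007 */
    uint32_t offset;    /* where the record starts      */
};

/* Simplified log record for one page-level operation. */
struct log_record {
    struct lsn lsn;       /* this record's own LSN                          */
    uint32_t   txnid;     /* which transaction this belongs to              */
    uint32_t   pgno;      /* the page the operation touched                 */
    struct lsn prev_lsn;  /* previous record of the same transaction, so an
                             abort can hop backwards record to record       */
    uint32_t   op;        /* what was done (insert, delete, split, ...)     */
    uint32_t   datalen;   /* length of the payload that follows             */
    /* payload: enough information to both undo AND redo the operation      */
};
```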
So, checkpointing. As I said, we don't flush our dirty pages on commit, so they just keep collecting in memory, and eventually you'll probably want to sync them to disk. We use checkpointing to do that. How it works in Berkeley DB is: first it flushes all the dirty pages to disk, then it writes a checkpoint log record whose LSN is guaranteed to be larger than the LSN of every operation that's part of a committed transaction. We put that checkpoint record in the buffer, flush the buffer, and we're good. Now the database is at a point where, if we crash, we don't have to read all the way back to the beginning of the logs — we can just go back to the checkpoint, because we're guaranteed that everything before it is already on disk. Is the checkpoint blocking? I believe it's blocking, yes.

So this brings us to the final part of core BDB, the recovery system. Say we've inserted our CMU record and committed the transaction, but we haven't bothered to do a checkpoint yet, and then our application crashes. BDB recovers from that automatically when you go to open the database after the crash. What it does is, before you can perform any other operations, it looks into the log files and finds the last log record. It reads that record, finds the number of the page it operated on, and reads that page into memory. It then checks the LSN written on that page and compares it to the record's LSN. If they match, or the page's LSN is greater, it knows that page was flushed to disk with that update, so it proceeds to actually undo the operation — believe it or not. If the page's LSN is smaller, we know the page was never flushed after it was altered, so we just ignore that log record. After undoing or ignoring the operation, we read the next log record back and do the same thing — keep undoing or ignoring — while also keeping a record of all the transactions we find that have committed. We keep doing that until we get back to a checkpoint, or to the beginning of the logs. At that point, we know everything earlier than that has been committed and flushed, so we turn around and walk forward through the logs, but this time only redoing operations that were part of committed transactions. We walk forward until we reach the end, and once that's done, we're guaranteed that all uncommitted transactions that were in flight when the crash happened have been undone, while all committed transactions have been redone, and the database is in a consistent state.

That is the core of BDB. As you can tell, it's a very simple, ACID-protecting paging system, and as I alluded to before, because it's so simple, it's easy to build things on top of it. In the two decades it's been out, we've added a bunch of features on top. The biggest, most interesting one is replication. Berkeley DB supports a single read-write master with multiple replicas, and it's all built on the fact that every operation can be redone just by reading through the logs. Whenever the master node commits a transaction, it flushes the logs, and then, before reporting that the transaction has committed, it sends all the log records in that transaction to each of the replicas. They take the logs, redo the operations, and respond back: we're done. When the master gets a quorum of responses from the replicas, it can declare the transaction committed. And we do actually support partial replication.
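In the API, checkpointing and recovery are each a single call: recovery is requested as part of opening the environment, and a checkpoint — optionally followed by removing log files you no longer want to keep — looks roughly like this. A sketch: error handling is omitted and the environment home directory is whatever you pass in.

```c
#include <db.h>

/* Recovery happens as part of opening the environment. */
static DB_ENV *open_with_recovery(const char *home) {
    DB_ENV *env;

    db_env_create(&env, 0);
    /* DB_RECOVER runs the backward undo pass and forward redo pass
     * described above, if they're needed, before the open returns. */
    env->open(env, home,
        DB_CREATE | DB_RECOVER | DB_INIT_MPOOL | DB_INIT_LOCK |
        DB_INIT_LOG | DB_INIT_TXN, 0);
    return env;
}

static void checkpoint_and_trim_logs(DB_ENV *env) {
    char **old_logs = NULL;

    /* Flush all dirty pages and write the checkpoint log record. */
    env->txn_checkpoint(env, 0, 0, 0);

    /* Only if you don't want to keep the logs forever: remove the log
     * files that are no longer needed for recovery. */
    env->log_archive(env, &old_logs, DB_ARCH_REMOVE);
}
```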
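The master's commit path in that replication scheme looks, in outline, like the sketch below. This is purely illustrative: flush_local_log, send_to_replica, and wait_for_acks are hypothetical helpers standing in for what the replication machinery does internally, not Berkeley DB functions.

```c
#include <stddef.h>

struct replica;                     /* opaque handle for one replica */

/* Hypothetical helpers for the sketch. */
extern int  flush_local_log(void);
extern void send_to_replica(struct replica *r,
                            const void *logrecs, size_t len);
extern int  wait_for_acks(struct replica **replicas, int n, int quorum);

/* Flush the transaction's log records locally, ship them to every
 * replica, and only report the commit once a quorum has applied them. */
static int commit_on_master(struct replica **replicas, int nreplicas,
                            const void *txn_logrecs, size_t len) {
    int i, quorum = nreplicas / 2 + 1;

    if (flush_local_log() != 0)     /* logs must be durable locally first */
        return -1;

    for (i = 0; i < nreplicas; i++) /* ship the transaction's log records */
        send_to_replica(replicas[i], txn_logrecs, len);

    /* Each replica redoes the operations from the logs and acks; the
     * master declares the commit once a quorum has responded. */
    return wait_for_acks(replicas, nreplicas, quorum);
}
```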
On partial replication: Berkeley DB works by having an environment, and any number of databases in that environment, which all share the same locking, logging, and cache. While any database you replicate has to be replicated in its entirety, you can choose to skip some databases. The recovery system is actually written so that if a database is missing when it does recovery, it just ignores it — it assumes, OK, that one's not supposed to be there.

Another feature we're adding, which will come out in the next release in a few months, release 6.2, is something we're calling slices. Remember how I just said Berkeley DB works by having multiple databases in a single environment? Well, slicing will allow you to split one database among multiple environments. So when you've got hardware with lots of CPUs, you can take advantage of that by not having to share the locking or the logging system. The early version of this feature unfortunately doesn't support transactions across slices, because it's a very early version that we had to knock out in five months, but future versions will include cross-slice transaction protection using XA transactions.

Now I'm going to finish this talk with a bit of a sales pitch. As I've been saying over and over, Berkeley DB is flexible and simple. We're written not only so that we can take any data, but so that our APIs let you, say, replace the locking system with your own locking system very easily. Want to do your own logs? We'll let you do that — through the API, you can write your own logs. Have you actually seen people do either of those things? Yes. I said we have page-level locking for our B-tree; we had one customer who really wanted record-level locking, and they said, screw it, I'm throwing out your locking system and writing my own. And the logging hooks do get used sometimes by companies that just want a special audit trail. So it does happen — not very often, but it does.

But yeah, it's very flexible. It's open source, free to download any time. The code is written in C, very easy to read, very well commented. So basically, I've found it's the perfect database for experimenting and research — and I'm not just saying that, I implemented my PhD thesis using Berkeley DB. If you want to implement your own locking system but don't want to build everything else in a database, you can do that; your own logging, all that; or just build something on top of it. It's wonderful for that. If you want to download Berkeley DB, I'd give you the link, but it's long and weird-looking — just Google "Berkeley DB download," we're right at the top. Download us. Yeah, no, it is, I admit — but he doesn't take my phone calls. So download it, open it up, start playing with it, and even ask questions on our forum. When I was doing my PhD, I couldn't have done it if the folks on our public forums hadn't been so helpful. Mike Feingold — he was working on Berkeley DB XML at the time, the XML database — answered all my questions really promptly. It was great.

So that concludes the talk. Are there any questions? What's the history of Oracle's business interest in Berkeley DB? Since it's open source, and there are many other database companies, why did they invest in it?
Well, one thing: since they bought us, it seems like they really wanted a back end for a lot of their other products and didn't want to write it themselves or depend on anyone else, because we're becoming the back end to a lot of their stuff. As for how Sleepycat made money before the purchase, it worked in two ways. People would buy support — that's pretty common. But also, our license works so that if you're open source, you can have us for free. If you use us internally, we're also free, although I think they may have closed that loophole. But if you're a commercial product and you want to include us in your commercial product, you have to pay. So that's been a big revenue stream for us: many people wanting to use us, again, as the back end to their commercial product.

I've heard rumors they bought it to squelch the MySQL back end, so MySQL couldn't have transactions. Oh yeah, I know we were the only transactional back end for a while. That happened right around the InnoDB acquisition — you guys bought it and killed it. Really? I thought that happened earlier than 2006, because that's when we were bought. But it could have.

Any other questions? Could you talk about how you handle multiple databases, and multiple processes accessing the same databases? Is there anything special you're doing with shared-memory locks, or are the processes just writing to the same memory space and coordinating with some IPC mechanism, or is it all polling for the locks? What's the architecture for the multi-process case? Yeah, we do make use of shared memory, and yes, we do use locks. Sorry, this is more someone else's area, so I can't get into too much detail, but I know recently we moved to a feature called latches. So shared-memory locks and latches are how we handle that. That's basically the architecture.

All right — big thanks to Lauren for keeping us straight on all of this, and thanks to all our guests who've come to give talks. Maybe we'll try this again next year. Thanks for coming, everybody.