Welcome, everyone, to the second session in this seminar series on embedded databases. I can't even begin to tell you how excited we are to have this speaker today. Richard Hipp is the creator of SQLite, which, if you don't know this, is the most widely deployed database system in the world. It's on a billion Android devices and however many hundred million iPhones are out there. So every one of you that has a cell phone here is running Richard's code. It's in everything. It goes way back, too: it was sent out on 200 million AOL installation CDs in the 2000s. Richard didn't tell me that; I actually looked it up last night. The success Richard has had with SQLite is something the rest of us can only dream of, even with the systems we build here at CMU. Richard's code is actually running in space right now. Two years ago they sent up some Android devices that had to be certified by NASA, and they're on the International Space Station right now. So Richard's code is running in space. That's the craziest thing you can think about. I didn't know that. Yeah. Without crashing. What's that? Without crashing. Yeah, without any crashing. So again, I'm super, super excited to have Richard here today, and he's going to talk to us about SQLite. So thank you.

Thank you very much, Andy. It's a great pleasure and an honor to be here. Delighted to be with you today. Listen, I've got a bunch of slides. I think a lot of it is going to be stuff you're familiar with, so I'm going to click through them pretty quick. You might want to download a copy of the slides from the URL you see on the screen, because I'm going to be clicking through some things fast, and maybe you'll want to go back and look at them later. I'm probably going to go too fast for you to take notes. And the video will be on the website as well. Okay, yes, the video will be on the website as well.

So what is SQLite? It's different from most other database engines. It's a library, not a server. It's not a separate process. If you've worked with any other database, it's probably a server running in a data center somewhere. SQLite is a library that just links into your application. It's a very different thing. It's delivered as a single file of ANSI C code. Now, this is a large file, and we don't actually develop by editing that one big file. We have hundreds of files, and part of our make process concatenates them all together in the right order. But that makes it very easy to deploy. It compiles down to roughly 500 kilobytes. It's very small and compact, with a small footprint and minimal dependencies. It's designed to run on embedded systems. I've got here the minimum set of dependencies, and it's just standard library things. You don't have to import a bunch of other libraries to get this thing to work. So it can work on embedded systems.

A complete database is stored on disk as a single file, which is an interesting thing. Most other database systems store the content in a directory full of files, and in a lot of systems that directory is known only to the system administrator; it's some weird place that nobody knows. But with SQLite the entire database is just an ordinary file. That means you can take that file and put it on a flash drive or email it to a colleague. It's really easy to move around.
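To make the single-file, serverless point concrete, here is a minimal sketch using the sqlite3 command-line shell (the database and table names are made up for illustration):

    $ sqlite3 demo.db "CREATE TABLE t(x INTEGER); INSERT INTO t VALUES(1);"
    $ ls demo.db                  # the entire database is this one ordinary file
    demo.db
    $ sqlite3 demo.db "SELECT x FROM t;"
    1

Copy demo.db anywhere, onto a flash drive or an email attachment, and it is still a complete, working database.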
It's a full-featured SQL implementation with things like common table expressions, partial indices, full-text search, and R-tree indices. It features power-safe, serializable transactions. By power-safe I mean you can be in the middle of a transaction, or in the middle of doing a commit, and the system can lose power, and, under the assumption that the hardware behaves as advertised, which is not always an accurate assumption, but assuming the hardware does what it's supposed to do, you won't lose any work. Very simple API, easy to program to. It's designed for application developers to be able to take this code, drop it into their application, and have a full-featured database with no other work. And the source code is in the public domain, so you can grab the source code and do whatever you want with it. You can grab a copy of the source code, relabel it, and go with it, if that's what you want to do.

It's called SQLite, so people think, oh, it's just a little tiny toy database. It does have its limits, but they're pretty big. We support multiple concurrent readers and one writer, all at the same time. Now, that's not huge, but it's usually enough for your embedded device. We take strings and blobs up to a gigabyte in size, which is actually more than a lot of large-scale databases will do. A single database can be up to 140 terabytes. We've never actually tested that limit, because we've never come up with a file system that could give us a 140-terabyte file, but that's the theoretical limit. 64-way joins, 2,000 columns per table. There aren't really any arbitrary limits other than these. It's a full-featured database engine. What is the forcing function for the 140 terabytes? What limits it? The limiting factor on the size of the database file is that we use a 32-bit integer to count the pages. Actually, it's a signed integer, so we have to leave the top bit off. So we have 31 bits, and the maximum page size is 64K. So you do the math.

So how did this get started? I didn't start out as a database person. I was doing some application development, solving some hard problems, and I was writing these client programs that were doing a really interesting theoretical calculation, but I had to get my data from a database engine. The client that I was working for provided the database engine, and it was Informix. And hey, it worked great when it worked. But sometimes they would power-cycle the machine and the database engine wouldn't come back up. And when that happened, my application couldn't do its job, because it couldn't read its data; it had to talk to the database engine. And so my application would bring up this dialog box. That's what an error dialog box actually looked like in the late 1990s; that's an actual screenshot. They look better now, don't they? Yeah, a little bit grainy. Back in 1998, we thought this was so cool, so state of the art. Do you remember that? So I had this idea: look, this particular application is not doing a lot of heavy transaction stuff. In fact, it's read-only. And it's not doing any elaborate joins or queries. Why can't I just read this data directly from the disk? Why do I need this server sitting between me and my data, which is just another potential point of failure? So after that project was finished, I had a couple of months off, and I wrote SQLite.
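(Picking up the "do the math" aside from the size-limit discussion above: 2^31 pages × 2^16 bytes per page = 2^47 bytes = 140,737,488,355,328 bytes, which is where the roughly 140-terabyte figure comes from.)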
That started in 2000; May of 2000 was the first code, and the first release was a few months later. So that's how it got started, and that sort of shows you the motivation behind it. So it's a little bit different from what you're used to seeing, and it has different use cases. SQLite is not trying to replace all these other databases that you're more familiar with. SQLite ends up being very useful in embedded devices and the Internet of Things. It's in your cell phone, it's in smart thermostats, it's in your microwave oven, it's in your TV set. These sorts of devices don't need to be doing a bazillion transactions. How many such devices is it in? It's hard to count, because it's public domain and a lot of people don't tell us. Until a few minutes ago, I did not know that SQLite was flying on the space station. You missed the intro. You missed the intro. Did you know it's on AOL installation CDs? Yeah, AOL was an early adopter. Do you remember, back in the early part of this millennium, where you'd get the AOL install CDs? You know, $10 a month or something. Do you remember those? SQLite was always on those.

But SQLite is also good as an application file format. If you're doing a traditional desktop application of some sort, then instead of doing file-save and writing out a bunch of XML, make it a database. You get all this powerful query language and transactions. It's great as what I call the lingua franca for a federation of programs. You've got a bunch of graduate students working on a problem: you're writing in Python, you over here are doing C++, you're doing Ruby, this guy over here is doing PHP, and oh, you're doing Lisp. All these people are writing different programs, because that's what they want to do, but they all have to talk to each other. Why not use an SQLite database file as your common mode of communication? It's often used as a local cache for an enterprise database. So you're on a device and you want to download the data that's relevant to the device, so that the device can continue to operate while you're off network — you're on your phone and you're going through a tunnel and you've lost reception — and it works well for that. It's also good as an interchange format. So, for example, the program guide on your cable TV comes down to the set-top box, and in a lot of cases it's being beamed from the satellite to your set-top box as an SQLite database. And the little window that shows you what's on? That's just an R-tree query.

So, here's your decision checklist about what storage engine to use for your project. Is the data remote? If the data is on a different device from the one your application is running on, use one of the traditional client-server databases. Is it big data? If it's more data than you're comfortable putting in a single file, use one of the traditional client-server databases. Have you got concurrent writers? If you're trying to do a gazillion transactions per second, use a client-server database. That's what they're for. And these are all very important problems, and these client-server databases are all very good at solving them. But there are a lot of problems that don't fit any of those categories. And for all those other cases, just use SQLite. And where people mess this up is that they do the first part of that checklist right, and then they get down to the bottom and say, oh well, I don't have any of these problems.
I'm just going to open a file and write a bunch of JSON into it, or some binary format that I made up. And this happens a lot. And that's the use case for SQLite. SQLite is not really competing against these other database engines that you study all the time. It's competing with fopen(). That's its competition.

So SQLite is found in lots of different things, as Andy was saying. It's in all the phones. It's in your Mac. It's in your Windows 10 machine. It's in the web browser. It's in lots of applications. It's built into a lot of programming languages, and it's just appearing in devices all over. We're pretty sure — we can't measure this, because it's public domain — but we're pretty sure it is the most widely used database engine in the world. I think it is probably the number-two most used piece of software in the world; I'm thinking number one is the zlib compression library, which probably has more deployments than SQLite. But other than that, I'm not aware of anything that has more.

Okay, I've got one marketing slide. Actually two; this is the first of two. There's a question in the back — I've got a slide coming up, so hold the question. So this is a graph of Apple's stock price from when they originally went public in 1980 up until about 2012, and from there it's gone up. This is before the split; adjusted for the split, it would now be at about 1,000. And you can notice how the stock stayed around $10 or $15 per share for over 20 years, and then it suddenly started this rise up to 1,000. Okay: absolutely true, SQLite was introduced into the Mac platform right there. I'm just saying, I'm just saying. Did you buy a bunch of Apple stock then? Unfortunately, no.

So what I'm going to talk about here is how SQLite is implemented, because you're all database people, I assume. And even if you're not — even if you're working on one of these systems that does a gazillion transactions per second, and SQLite is really not what you're into — it's an important database, and you need to understand it. So I'm going to give you an overview of the implementation so that you understand what it's doing, and so that if you want to go look at the code, you've got a kind of roadmap. The code is available online; here's how you get it. There are two places you can get it, and it's readable code. We put a lot of comments in the code, and we see tweets about how, oh, you should read the SQLite code, it's really readable. So you can actually study this. My goal is to give you a roadmap so that when you pull out a random bit of code, you have some inkling of what it's trying to do.

So, any database engine — I like to think of them in terms of a compiler and a virtual machine. You've got SQL that comes in, and you've got a part of the database engine that's going to compile the SQL into a prepared statement. I think of every statement of SQL as a miniature program, and the prepared statement is the executable. So this part is just like GCC: you get a prepared statement, and then you run the prepared statement, and that's like executing the binary. That's the way we like to think of it. This is the stack of SQLite: there's a parser, a code generator, a virtual machine, a B-tree layer, a pager, and an OS interface. I'm going to talk about each of these in turn.
At the top there's a parser. It's a standard kind of parser. The tokenizer is written by hand; it's only a couple hundred lines of code. You know, when you're studying compiler construction, there are these big chapters about lex and lexers and stuff. I've never understood that, because it is so, so easy to write a tokenizer in a couple hundred lines of C code that is at least two orders of magnitude faster than anything Flex will ever generate. So I don't know why they do that. I've never heard anybody... No, no — this is just the tokenizer; this is just splitting the input up into tokens. Now, the parser is a traditional LALR(1) parser. It doesn't use yacc. I wrote my own parser generator when I was a graduate student, and it has the advantage over yacc that it generates a parser that is re-entrant and thread-safe. Back when I was doing this, Bison and yacc parsers were neither of those things. They may have fixed that now; I don't know, I haven't kept up with it. But it's a traditional parser. These are the files where you can find this stuff if you want to look at the code. The source code to the Lemon parser generator is included in the source tree, and documentation is included in the source tree, so you can learn about Lemon. The structures that define the abstract syntax tree are in that header file, and the tokenizer is in that file there.

Moving on down, we have the code generator, which does semantic analysis of the parsed SQL, transforms the parse tree in ways that try to make things more efficient, does query planning, and then generates byte code. The output of this is the prepared statement. So these two things implement the compiler. Think of these two steps as GCC: they take raw program text and turn it into something a machine can understand. And of course the rest of the stack is going to implement that machine.

The virtual machine in SQLite is a byte-code interpreter. A lot of other database engines just walk the parse tree, and that's how they execute. But I wanted to do a byte-code interpreter. The original byte-code engine was a stack-based machine, where you push things onto a stack and then operate on the stack, just like the JVM and all the others. It seems like every virtual machine starts out as a stack-based machine. But we changed it to a three-address machine, because that turned out to be more efficient and much easier to generate good code for. And it's really simple: it's a big for loop — program counter equals zero; program counter plus-plus — and inside the for loop there's a switch. It switches on the opcode, and there's a case for every opcode. And in the virtual machine I also include the implementation of the built-in SQL functions, so they're in there too. I'll talk more about that later.

So here's an example of what the byte code looks like. I won't walk you through this, but you can look at the byte code SQLite generates for any SQL statement by just putting EXPLAIN up front, like this. And the documentation for the opcodes is available online if you want to try to decode it. This one is doing a full table scan, so it's pretty simple and it fits on one slide. For a join with subqueries and lots of conditions, this might go on for hundreds or even thousands of instructions. But a simple table scan is pretty quick. If you want to study this and look at the byte code SQLite generates, let me tell you how to do that. You need to do a custom build.
So you download the tarball to get the source code. Do configure, and I like to do --disable-shared, because if you don't do that, autoconf does all this freaky shell-script stuff to build shared libraries. You can do that if you want to — it'll work — but I find it confusing, so I always disable it. But before you do make, there's an extra C preprocessor define that you need to give it, and that's going to add these comments. Let me go back a slide: over here on the far right, we've got comments that help explain what each opcode is doing. By default those are not generated, because they take up space and it takes CPU cycles to generate them, so in a production environment we don't want that. But if you're debugging, they're very useful for quickly seeing what's going on, so you'll probably want to include them. So once you've got that compiled, what you've compiled is a command-line shell. It's just a simple program that reads SQL statements and sends them to SQLite to be executed. And in this command-line shell, sqlite3, if you give it a line that starts with a dot, a period, that's special: the shell doesn't send that to SQLite; it does some special processing. The .explain command sets up the output formatting so that you have nice neat columns, and it automatically indents loops so you can spot the loops more easily. So that's a thing you want to do — be sure to type that. And then you just type EXPLAIN followed by the rest of your query, and you get to see the byte code.

Moving on down the stack, this is kind of the boundary between the upper half and the storage engine. A lot of places I go and talk to people about databases, and when I say databases, to them the storage engine is the database. That's their focus: how can I get as many writes to disk as possible? My view is a little different. I think this whole stack is the database, and the bottom part is just the storage engine. If you've got just a key-value-pair type thing, you only have half a database. That's my opinion. Really, in my view, the interesting stuff is happening up above: all the query planning and the analysis of a declarative programming language. That's where the interesting parts are. The bottom is just the storage engine. Now, there is a reasonably clean interface for this in SQLite, and what a lot of people have done — including some people who will be speaking to you in this lecture series — is take the default SQLite storage engine out, just strip it out, and plug in their own. So they've got a SQL system on top, and then they put their own whiz-bang storage system underneath. Because, as you'll see in a minute, my storage engine is not cool with the latest algorithms. It's old tech. It's old school.

So, the B-tree layer. We support B-trees and B+-trees. There are multiple B-trees per disk file: one B-tree for each table, one B-tree for each index. Variable-length records use a very efficient encoding. It's accessed via cursors: when you're working with B-trees, you open a cursor into a B-tree, and then you can seek the cursor and advance the cursor backwards and forwards. It allows multiple cursors to be open on the same B-tree at the same time, including cursors that are changing the B-tree, and it automatically takes care that they don't change things out from underneath each other. This is just the 30,000-foot view.
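To make that custom-build recipe concrete, here is a sketch. SQLITE_ENABLE_EXPLAIN_COMMENTS is the compile-time option in current SQLite sources that turns on the opcode comments he describes (stated here as an assumption about the build, since the slide names the define rather than the transcript); the table name is made up for illustration:

    $ # download and unpack a source tarball from sqlite.org, then:
    $ ./configure --disable-shared CFLAGS="-g -O2 -DSQLITE_ENABLE_EXPLAIN_COMMENTS"
    $ make
    $ ./sqlite3
    sqlite> CREATE TABLE fruitsforsale(fruit TEXT, state TEXT, price REAL);
    sqlite> .explain
    sqlite> EXPLAIN SELECT price FROM fruitsforsale;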
I'm going to go back over all of this in more depth after I finish the quick overview. Next on down is the pager layer, and this is the part that implements your transactions, your atomic commit and rollback. The pager views the database as a bunch of pages. They're numbered starting with 1, because page 0 is like our null pointer. The page size is a power of 2 between 512 bytes and 64K. The pager has no idea what the content of a page is; it doesn't care. It's just managing the pages, handing them to the B-tree layer, and dealing with transactions. It also provides concurrency control, because with SQLite you can have multiple processes talking to the same database file at the same time, and there's no server controlling this. They're all peers, so there has to be a mechanism to make sure they don't step on each other, and that's handled by the pager layer.

And then at the bottom we have the operating system interface. This is the portability layer. This is how we allow SQLite to operate on Windows, on Mac, and on various embedded operating systems, including some custom operating systems. You can plug in new OS interfaces at runtime, which is an interesting feature that I'll talk about on the next slide. And when I say an OS interface: people have substituted — we have an example of this — an OS interface that talks directly to hardware. It bypasses the operating system completely. So you can actually buy commercial off-the-shelf devices, little gadgets, that are using SQLite with their own OS interface that talks directly to the flash controller, and they use SQLite as their file system. There's no file system on the device; you only have the database. Interesting concept. Now, because it's runtime-pluggable, we have this concept of a shim, where you can plug in your own OS interface that doesn't really implement a complete OS interface: it just changes the calls around a little bit and then passes them down to the real OS interface. And this can do things like encryption or compression, or logging. We use this a lot for testing, because we can plug in a shim that simulates hardware failures, so that we can prove SQLite is going to recover gracefully from a hardware failure. And there are some example implementations of this sort of thing if you want to play around with it.

So, there was a question in the back about what the percentages were. Here is the chart. The parser is the little green slice. This is source code; if you looked at the compiled binary, the ratios would be a little different — in particular, the parser would grow a bit, because the LALR(1) grammar notation is very compact — but not a whole lot. The code generator is the bulk of it. The code generator, the virtual machine, and the parser together are over half of it. The B-tree layer is this thin little slice right here; it's really not that much. So that is the 30,000-foot view of SQLite.

Now, my plan is to go back through this whole thing again, but this time from the bottom up, and get into a little more detail about how things work, so that you better understand what SQLite is doing behind the scenes. So let's start with the pager. Again, this is what handles power-safe transactions and concurrency control. This is the thing that makes sure that you can roll back your transactions, and that if you crash in the middle of a transaction, your database is consistent.
It makes sure transactions are atomic across power failures. It also handles concurrency control and provides an in-memory cache of the disk content. So, you're getting ready to read the database. Here's a little diagram I've got. I guess it's labeled disk — nowadays that means NAND flash, right? Because I don't think I even own a computer that has spinning rust anymore. But you know what I mean. So all the data is on disk. There's an operating system cache, but it's empty right now; there's no content there, the cache is cold. And you want to read from the database. So the first thing you have to do is get a shared lock. And that's drawn in RAM, because if you think about these locks, they don't persist: if the system crashes, all the locks go away. So you get a shared lock, and that prevents other processes from changing the data out from under you while you're trying to read it. And then you read the few pages you need in order to do your work, and everybody's happy.

Now suppose you want to make some changes — you want to write, you want to insert some data. The first thing the writer does is get a reserved lock on the database file, which says: you know what, I'm getting ready to write. We can only have one writer at a time; I call dibs. It's not writing yet, and other people can continue to read, but this guy has dibs — he's claiming the reservation to make a write. And once he does that, he stores the original, unchanged content of the database pages he's going to change in a rollback journal. This is a file in the same directory as the original database, but with -journal appended to the name. It's a rollback journal; this should be familiar to all of you, I'm just giving you the details. And after he's done that, he's free to make changes to the individual pages in user space. Nobody else can see this yet; other processes continue to read the old, original data.

Now we're ready to commit. The first thing we have to do is fsync, or flush, the rollback journal to disk. This is important, because if you lose power, that stuff has to be there in order to recover. Sometimes people don't care about recovering after a power failure, in which case you can turn that step off. And you know, to a first approximation, this step of forcing it out to disk is what takes all the time; everything else is free. This is what costs time. So you can turn that off and it'll make things run a whole lot faster — many, many times faster. But if you lose power and the writes to the hardware occurred out of order, it could corrupt your database. So anyway, it flushes the journal out to disk. Then it gets an exclusive lock on the database file, which prevents anybody from reading the database file, because we're getting ready to write to the database, and we can't write to it while somebody else is reading it, because that would change it out from underneath them. Then we do the write system calls, and then we have to flush that out to disk with another fsync. And then, the moment of commit: we delete the rollback journal. Or maybe we truncate it, or zero its header — but somehow we make the rollback journal unusable, and that's what causes the commit to occur. Before this point, if we lose power, the rollback journal is always sitting there. So here we've lost power in the middle of writing to the database file; we didn't get the complete write done.
We've lost power, and now power's been restored, and we're coming back up. Somebody gets ready to read, and they get the shared lock, and they immediately notice: oh, we've got a hot journal over here — a journal that didn't get processed correctly. So it immediately upgrades to an exclusive lock and rolls the whole thing back, restoring the database to its original condition. And this will always happen until you delete the rollback journal. So the delete is when the transaction commits. That's rollback mode. It's the default, and it's the most reliable. But it's kind of slow: you can have multiple readers or one writer, but you can't have a reader and a writer going at the same time.

So we have another way of doing this, called write-ahead log mode. This is not the default, and I'll explain why in a second. With write-ahead log mode, it starts out the same: you get a shared lock on the database file and you read pages into user space. But it's okay to go ahead and make changes in user space right away. You don't have to log the old, original content, and you don't have to upgrade your lock from shared; other people can continue to read. And then, when you want to commit, you just write the changes out to a write-ahead log, which is a file with the name of the database file plus -wal appended. And you're done. You didn't even have to fsync. You're finished. This is not durable — I'll get to that — but it is atomic. And I've got a little dot here on the last record: all of these pages in the log have a header, which says which page number it is, and there's a checksum and some other things, and one of them is marked as the last record of the commit. So now I know the transaction is finished.

So that's great. And then another process can come along and it wants to read, too. It reads some fresh pages out of the database file that the first guy didn't touch, but it also wants to read one of the pages the first user changed, and it has to read that change out of the write-ahead log. You see? This page came out of the write-ahead log, so it reflects the change, whereas these others were unchanged, and it read them directly from the database file. And you can see that we have some readers that are reading entirely out of the database file: they're looking at a snapshot in history, whereas this guy is looking at the current version. And then this guy might want to make some changes, and he just appends to the log as well. So you can have multiple readers going at the same time, reading from different points in the write-ahead log and the database file, and getting snapshot isolation.

The reason we don't do this by default is that when you're trying to get a page, you have to know whether that page is in the write-ahead log first, before you read it from the database. And the way we do that is a little hash table that's in shared memory. But if the hash table is in shared memory, and you have multiple processes accessing the database, and it's on a network file system and they're on different computers, that's not going to work. So while the other scheme works fine if you've got processes on separate computers accessing the database over a network file system, this one does not. It also doesn't work on some operating systems that have dodgy memory mapping, because we use memory mapping for the shared memory. Which ones? I'd better not say on tape. It's the ones that you've heard of, actually. They claim to have fixed it.
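As an aside (an assumption about the standard build, not something on the slides): both journaling modes and the fsync behavior he mentions are selected at runtime with pragmas, for example:

    sqlite> PRAGMA journal_mode=WAL;         -- switch from the default rollback journal to the write-ahead log
    sqlite> PRAGMA synchronous=NORMAL;       -- in WAL mode, fsync only at checkpoints; commits not individually durable
    sqlite> PRAGMA synchronous=FULL;         -- fsync on every commit, restoring durability
    sqlite> PRAGMA wal_checkpoint(TRUNCATE); -- force a checkpoint and truncate the -wal file

PRAGMA synchronous=OFF is the "turn the fsync off" option mentioned earlier for rollback mode: much faster, but out-of-order writes during a power failure can corrupt the database.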
We've reported the bugs, and I think they may have fixed it, but whatever. So this is an option. A lot of people use it. If you're using Firefox, if you're using an iPhone, if you're using Android, they normally enable this, because it works a lot better. But it's just an option. Now, this WAL file can grow without bound, so at some point you've got to get that data back into the original database file. So we have a checkpoint operation that runs automatically — or you can set up a separate thread or a separate process to run checkpoints, but if you don't do that, it'll happen automatically when the WAL file gets big enough. To do a checkpoint, the first thing we have to do is make sure the WAL gets persisted to disk. This is what gives you durability. You can also set it up so that it does this fsync automatically after every transaction, if durability is important to you. It turns out that for most applications, if you lose power and you come back up missing the last three transactions, most people don't care; they'd rather have the performance. But if you really have to have durability, you can set it up to flush to disk after every transaction. So then, periodically, we take the content of the WAL file and roll it into the database file, and then we truncate the WAL file and start over. That's a checkpoint operation.

Question in the back of the room. The question is: how do we handle multiple computers reading the same database file over something like NFS? Yeah — your network file system has to support POSIX advisory locking, or the Windows equivalent. And if it doesn't, and a lot of them don't, you run the risk of corruption. The locking is POSIX advisory locking. Actually, we have the pluggable OS layer, and there's a different OS layer that substitutes dot-file locking in place of POSIX advisory locking, for cases where that's not supported. But that's a lot heavier and slower, and if you crash in the middle, the lock doesn't automatically go away, so you've got to go back and clean it up. Anyway — I'm running behind, I'm running badly behind, so I'm going to go faster. So that's the discussion of the pager. We didn't talk about nested transactions, the pluggable page cache, or how we test this thing for crashes. How do you know that this really works — that it's really going to recover from a power failure? That's an interesting problem, and I haven't talked about it.

The B-tree layer is next. A question: it looks like all the transactions are logged in units of full pages. Is that a problem for any of your users? They say, oh, I'm only changing a little bit... So the question is: when we're logging, in either the rollback journal or the write-ahead log, we're logging complete pages rather than just the change. And I know that some others, like Berkeley DB, just log the change. We benchmarked it. It's not a big performance hit, and it sure makes things a whole lot simpler and more reliable to have the complete page there rather than just the change. The Berkeley DB people will be here in a few weeks, and you can ask them; they might have a different opinion on this, I don't know.

So, in the B-tree layer there are multiple B-trees per file. We use B+-trees for tables, with a 64-bit integer key, and regular B-trees for indexes. A table looks like this: it's a key with arbitrary data. The B-tree layer doesn't know the format of that data; it's just binary to it.
The format is actually interpreted by the next layer up. And of course you've got a root page, and it points to child pages. By pointer, I mean just a page number — a 32-bit page number. And it's a B+-tree, so all the data is in the leaves of the tree. The keys can appear more than once in a B+-tree, but because they're small integers, that's not a problem. Because this B+-tree is used for the tables, it's optimized for appending rather than for arbitrary inserts. Oh, and I did want to point out: amazing fan-out. I'm only showing a fan-out of three in this simple diagram, but you really get fan-out on the order of a thousand, so the tree ends up very wide and shallow.

We use variable-length integers, and this is the variable-length integer encoding, where it reads bytes until it finds one with a zero in the high-order bit, using the seven lower bits of each byte — or it uses the entire ninth byte — to construct your integer. And this was a mistake. I give you this as a failure. If you're ever doing something like this and you need variable-length integers, don't do them this way. Instead, do them like this, where the first byte tells you the magnitude of the integer somehow or another. So here's an example: if the first byte is 240 or less, then that's just the value. If it's between 241 and 248, there's a simple formula that gives you larger values. And so on, up to: if the first byte is 255, then it just takes the next eight bytes, and that's your value. The reason for doing it this way is that the first byte determines the size of the variable-length integer, which is very important for efficient parsing. The other thing is that you can actually compare two of these integers using memcmp, without having to decode them. So the way I did it was a mistake; always do it this second way.

The other thing I wanted to point out is how the pages are laid out. There's a header on each page, and then I have a section in here of two-byte offsets to each row within the page, and then down over here I have all the rows. I did this backwards as well. If you're ever doing a B-tree, let me suggest that you flip it around the other way: put the header at the end, then the offset pointers, with the content before them. And the reason for this is that you've got variable-length rows in here, and you're having to parse this stuff out, and you could potentially have been handed a corrupted database file — because, remember, SQLite is used to pass information around on flash drives and such, and somebody could have deliberately given you a corrupted database file in an attempt to crash your system. So when we're parsing these variable-length fields, we don't want to do a buffer overrun. And the way I've got it now, because the content area goes all the way to the end of the page, I have to be very, very precise about making sure I don't overrun the buffer. Whereas if I had done it the other way around, and put the header and the offsets at the end, I'd have a kind of overrun area — a little bit of slop. You could save a lot of overhead that way. Just some hints. Okay — it's not so much for performance as for safety? Right: it's for performance in an environment where we want to guarantee safety, because SQLite right now has to spend a lot of CPU cycles making sure that we never overrun that buffer, when in practice you never do, unless somebody's trying to break into your program.
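For reference, the recommended format he's describing matches the varint design from his later SQLite4 notes (quoted here from that spec as an assumption, since the slide only gestures at the formula). With a first byte A0 and subsequent bytes A1, A2, and so on:

    A0 <= 240:           V = A0
    241 <= A0 <= 248:    V = 240 + 256*(A0 - 241) + A1     (values 241 through 2287)
    ...
    A0 = 255:            V = the next 8 bytes, read as a big-endian integer

So the first byte alone tells you the total length, and unsigned values compare correctly with a plain memcmp.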
So those are essentially wasted cycles. Okay, so indexes are stored as regular B-trees. We treat an index entry as just a key; there's no separate data for an index, just the key. And it's binary data: the B-tree layer doesn't know how to sort these things, because to it they're just binary, but the next layer up hands it a pointer to a comparison function that allows it to sort them. We'll talk about that in a minute. In a B-tree, the keys are distributed throughout the tree. The key is the data, and the keys are not duplicated — remember, in a B+-tree the keys are duplicated; here there's only one instance of each key in the tree — which matters, because now the keys are much larger. But you have reduced fan-out: the keys are larger, so fewer of them fit on one page, it doesn't fan out as fast, and searching takes a little longer. We've got SQLite set up so that there's always a minimum of four keys on every page. I'm going to skip that slide in the interest of time.

Now, we've got a bunch of B-trees in the same file, and the individual pages can be interleaved all through the file. The only thing you need to know is the root page of each B-tree. If you want to see where the pages are, you can download the source code, do configure, and do make showdb. showdb is a little utility we wrote that decodes the file format. You run showdb with the database file and a page index, and it will show you what each page of the database file is used for. And we can see here that the first page is both a root and a leaf, so that particular table fits entirely on a single page. And you can see a bunch of other tables. Down here at the bottom I want to point out that we've got overflow pages. I mentioned earlier that SQLite handles up to a gigabyte of content in a row, but the pages are like 1K — how do we do that? Well, if a row doesn't fit on one page, it puts a little bit on the original page and then a pointer to an overflow chain, which is just a linked list. And when I was designing SQLite, I looked around at all the SQL databases I could find, and I didn't find any that really did large blobs or strings. So I thought: you know, I'm going to support this, but it's rarely used; it doesn't have to be efficient. I didn't try to make this fast. But amazingly enough, it turns out to work very well, even for megabyte-sized strings and blobs.

Adobe discovered this for us. In the Adobe Lightroom product, they have to store a lot of thumbnail images, and they use SQLite as their application file format. And they were wondering: do I store thumbnails directly in the database, or do I just store the file name and write the image out to a separate file? And they measured it. It turns out that for blobs less than about 50 or 100K, it's actually faster to write them into the database than to write them to the file system. I believe this is because if you write to the file system, you have to open and close the file, and it's the overhead of the open and close system calls that slows you down, whereas the database file is continuously open. So we're faster than that.

All right, the B-tree interface. Again, you access the B-trees by cursor: you open a cursor, you seek on the cursor, step forward and back, ask for the data or the key, close the cursor. How do we find out where the root pages are for each of the B-trees in the file?
So there's a special table in every SQLite database called sqlite_master, and its schema looks like this. It's there by default, and you can't change it. It has the type, which is table, index, view, or trigger; the name of the thing; the original SQL text that created it; and it also has the root page. And this particular table always has its root at page one. So we can go to page one — there's a B-tree there, which is this thing — read that B-tree, and find the root page of every other table in the file. And here's an example where you can actually see that table in action. Right — so this means that if the first page gets clobbered, it's going to be really hard to recover much from that database? Yes, because you've lost the schema. The schema is stored on the first page, or of course in overflow pages as well. So that's the B-tree layer. We didn't talk about freelist management, auto-vacuum, or shared cache. I'm looking at this clock here, and I'm running way over, so I'm going to skip through this.

The virtual machine layer also defines the format of the records, and I'm going to go over the record format very quickly. SQLite has this interesting property that it kind of ignores column types. You can put anything you want in any column; you don't even have to declare the column types. SQLite derives from the scripting-language community — from Tcl. Anybody heard of the Tcl programming language? You've used it? SQLite is, in fact, a Tcl extension that escaped into the wild. This is the truth. So it's kind of a dynamically typed, Python-like thing. So we store the data type with every value: we've got a bunch of variable-length integers that define the data types, and then we have the actual data. Here's how the type codes decode: a code of zero means it's a null; a code of six means an eight-byte signed integer. For strings and blobs it's these values here, and there's a little formula that gives you the length. So most of the time these type codes are a single byte.

Here's an example encoding. Create a table with three columns a, b, c — notice I didn't put any type information in there, and you can do that in SQLite — and I'm inserting 177, a null, and 'hello'. So here's the record header in these four bytes, then here's the two-byte integer for 177, then the null takes no data at all, and then there's the string 'hello'. So that's the encoding.

The code generator is in these files. In the interest of time, I'm going to skip over this really quickly. If you're going to work with the code generator, I suggest you enhance your shell by downloading the tarball, doing configure, and adding these extra #defines to the makefile before doing the build. Then you get some extra command-line tools that allow you, for example, to print out a parse tree in kind of ASCII art. The clock is just spinning around so fast, so I'm going to flip right over this, and you can read it at your leisure. But we've got really cool tools: if you're in the debugger single-stepping and you want to look at a parse tree, you can call some routines and they'll print it out for you. There's tracing that prints out each virtual machine opcode as it runs. So I want to get on to how a query actually runs. Here's an example of how the data is stored in SQLite: you've got the row ID and then all the data, for a simple little table here.
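As a quick sketch of inspecting that master table yourself (the output rows here are made up; real root-page numbers will vary):

    sqlite> SELECT type, name, rootpage FROM sqlite_master;
    table|fruitsforsale|2
    index|idx_fruit|3

Below, the sample table being queried is a little fruit/state/price table along these lines.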
If you want to query this table — give me the price for peaches — one way to do that would be to seek to the beginning of the table, step through each row, pull out the fruit field, see if it's equal to 'peach', and if it is, output the price. That's a full table scan. It always works. But if there were, like, seven million rows, that would be really slow. Now, you could also ask for a row by row ID, and then it does a binary search, and that's very fast. But if you're running a grocery, you don't want to have to remember that the secret code for peach is four. That's crazy; you really want to ask for 'peach'. So for that we have an index, and the way an index works is that it creates this key over here that contains the value being indexed plus the row ID. And because the row ID is unique, this guarantees that each row in the index is also unique: in cases where the fruit itself is not unique, the row ID disambiguates it. Once you have that, to query for peach you do a binary search for the entry that has 'peach', you read off the row ID, you use that to do a binary search in the original table, which gives you the whole record, and then you pull out the price. And that's great. And you can do the same thing for orange: it goes to the first index entry with 'orange', gets the row ID, and fetches the row; then the second entry, and fetches that one as well. And this is just stepping through the B-tree.

But if I say orange in California, it does the lookup for 'orange', gets row one, and after it fetches row one it has to check: well, is it in the state of California? No, it's in Florida, so I have to reject that row. I did that B-tree search, that binary search, for naught. And we hate that. That's extra work; we like to avoid that. And you think, well, maybe I'll do an index on state instead, but that doesn't really help either: you can look for California, but then you have to check the fruit, and you get a mismatch there, too. It's the same amount of work. So what you need there is a two-column index, where you have both the fruit and the state in the index, and where the fruit is a tie, the state breaks the tie. It's the same type of thing, but now you can look for orange and California together, get the row ID, 23, and immediately look it up. That's a lot faster — a much better way to do that sort of thing.

You can do even better than this, though, if you build an index that contains all the columns: the fruit, the state, and the price. Then, when you do select price where fruit is orange, you do a binary search to find the first entry with orange, California — and the price is already right there in the index. You just read it straight out of the index; you don't even have to look at the original table. This is called a covering index. Now, if the condition is an OR, you can always do a full table scan: start at the beginning of the table, read row by row, check the condition in the WHERE clause to see if it's true, and output the price when it is. But we'd really like to use indexes. And in this case it takes two separate indexes, one on fruit and one on state: do the lookups for the row IDs, take the union of those, and then pull the prices out that way.
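In SQL terms, the strategies he just walked through might look like the following sketch (table and index names are assumed for illustration; the fruit/state/price columns follow the slides' example):

    sqlite> CREATE INDEX idx_fruit ON fruitsforsale(fruit);               -- single-column index
    sqlite> CREATE INDEX idx_fruit_state ON fruitsforsale(fruit, state);  -- two-column: state breaks ties in fruit
    sqlite> CREATE INDEX idx_cover ON fruitsforsale(fruit, state, price); -- covering index: never touches the table
    sqlite> CREATE INDEX idx_state_fruit ON fruitsforsale(state, fruit);  -- queried on fruit alone, a skip-scan candidate
    sqlite> EXPLAIN QUERY PLAN
       ...> SELECT price FROM fruitsforsale WHERE fruit='Orange' AND state='CA';

EXPLAIN QUERY PLAN (a real SQLite feature, higher-level than the byte-code EXPLAIN shown earlier) reports which index, if any, the planner chose — the quick way to check these trade-offs, including the skip-scan case that comes up next.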
Another thing it can do: if you've got a two-column index on state and fruit, but you want to look up by fruit, you'd think, well, I can't really use this, because I can't do a binary search on the second field of an index. But SQLite will do this. The way it does it is that it recognizes that there are not many states — not many distinct values. So it will iterate through all the possible values for state, and then within each state look for the fruit orange. There's no official name for this as far as I'm aware; we call it a skip-scan. It will try to do this if it knows that there are only a few distinct values for the first column.

So, sorting. A sort is probably the most expensive thing you can do in a database engine. If you give an ORDER BY clause in SQLite and it knows the results are going to come out in the right order anyway, it just throws the ORDER BY clause away; it doesn't force a sort. So be generous with your ORDER BY clauses. If you say ORDER BY fruit and you've got an index on fruit, it walks the index, pulls out the row IDs, does the seeks, and produces the rows in the correct order. And you think, well, that's still N log N — I haven't saved anything. Actually, you have, because it uses a lot less memory. And also, if you're looping through this and you take only a couple of entries and then say, okay, I'm going to quit, it didn't have to do all those other lookups and a big up-front sort. So this is a much more efficient way of doing it in practice. And of course, if you have a covering index, it just spins right down the covering index. If you have a covering index that's almost right but not quite — here we want to sort by fruit and price, but we've got this pesky state column in the middle, which messes up our sort order — it will read down the index as far as it can: it reads all the entries for each distinct value of fruit, gathers them, and sorts them separately. So it does lots of little sorts, which is more efficient, because it can start outputting rows immediately, before it has scanned the whole table, and the sorts are smaller. If you've got something like a UNION of two queries with ORDER BYs, it will actually break this up into two separate queries, run them as co-routines, and take the union of the output.

With all these indexes, who decides? Are you planning over all of these potential choices, or is there some decision-making? And doesn't this require a lot of maintenance — with a simple update, now it's not only the one table you're changing. Well, the indices themselves are maintained automatically. But you, the programmer — I'm skipping over some slides here because we're out of time — you, the programmer, have to come up with the indices. SQLite is not going to do that itself. But you know, this is the key advantage of having a query language, and this is an important slide; it's probably the last one I'll get to. If you've got a query language like this, you can design your application, you can build it, you can spend weeks coding it up, you get down to the end, and you've got a performance problem. Oftentimes you can fix that performance problem just by doing CREATE INDEX. Suddenly you've got completely different algorithms being used. And you can do this the night before you release your product.
Whereas if you're using a key-value store — something that doesn't have a query language like this — and you get down to the end and it's not performing well, then you've got to spend some serious time recoding. You can't do that the day before you release; you've got to recode and retest. That's the beauty of it; that's why the top half, the part above the storage engine, is so important: it gives you that flexibility.

So the question is: do I provide utilities to help programmers identify bottlenecks and do the tuning? No. That is a frequent request, and it's on our to-do list. I mean, there are things there — we provide mechanism — but it's not an especially intuitive thing to do; you have to kind of know what you're doing. What we need is an automatic tool that looks at your schema, looks at your query, and says, hey, you should consider this index. We don't have that yet. It's on the to-do list.

So, the code generator. We didn't talk about join-order selection, and I skipped some slides; there are a lot of other cool things in here. The query planner stability guarantee: this is an important part of an embedded database. There are all these different algorithms the planner can choose from, and in a commercial database in a data center, you want it to adjust its algorithms as the data changes, to select the best algorithm for the current state of the data. But if you're doing an embedded product and you're shipping millions of these things, you don't want the query planner changing its mind for some small percentage of the users out in the field — because usually it will do a good job, but sometimes it might choose a bad plan, and then you're going to get bug reports. You want the query planner to always do the same thing, and SQLite is set up that way by default: it's always going to choose the same query plan for a given schema. It won't change that unless you run ANALYZE or change the schema around. No, it's actually a cost-based system, but the costs are fixed; they don't automatically recompute except when you run the ANALYZE command. So if you rerun ANALYZE, it's going to gather different statistics, and then it might choose different plans. So don't do that if you're concerned about query plans changing out from under you.

Okay, other topics. I haven't talked about virtual tables. I haven't talked about the full-text search engine — and that's a really cool thing, because it actually implements an LSM-tree-style index on top of the B+-trees. It's a really cool idea, and it's actually very efficient; it's faster than Lucene. R-trees, memory management, how we test this thing — we've got a really impressive test suite, and that's very important to us. I'm slightly over time, and I'm happy to talk about any of this. We've gone down to a slightly lower orbit here, and I've given you sort of an overview of how the system works; there are a lot of details. I'd love to hear your questions and your feedback, and you can also go to the mailing list and visit us there. Thank you very much for your attention. I'm sorry for running five minutes over. If people have to leave, you're welcome to step away. Otherwise, if there are questions, I'll be happy to take them. Any other questions? Yes, ma'am. [A question from the audience, partly inaudible, about the trade-offs in SQLite's design and whether it strikes the right balance.] Alright, so I understand the question.
You're asking... yeah, there are trade-offs in any system. Of course, a full-scale, data-center-oriented database has a lot more power and does lots of fancy things, and we don't do that. The trade-off is that we use less power, it's easier to maintain, and we want it consistent across millions of devices. Have we found the right balance? Well, we're constantly adjusting that balance. We take a lot of input from people who are actually using this stuff, and we keep adjusting. I talked about the query planner stability guarantee and how it doesn't recompute plans based on evolving data; there's a compile-time option that will make it do that. So if you have a very specific need where you want it to violate that guarantee and work more like a mainframe database, you can make it do that. We provide a lot of options that way. Have we found the best blend? I think so, because there used to be a lot of these embedded SQL databases, and now there's just SQLite and maybe SQL Anywhere, and the rest of them have kind of gone away. What about the Java ones, too? Oh yeah, the Java ones — does anybody ever use those, really? Go ahead and say it on tape. Yeah, I think the market seems to be saying that we're doing well. Maybe we're just lucky, I don't know. But you're right, there's a balance here; you've got to find the right mixture of features and capabilities. I think we've got a niche, and we're trying to serve it.

Okay, so the comment was: SQLite is known for its testing philosophy. The way this came about was a couple of things. One, I was doing some work with some avionics companies who wanted to put SQLite into safety-critical systems, and that made me aware of this spec called DO-178B. That's a design spec the FAA uses for validating software that performs safety-critical functions in aircraft. I learned about it, and that took a lot of time. And then also, when SQLite started to go out in a lot of phones, we started to get back a lot of bug reports. You think you write really good, solid code, but it's amazing, if you put your code out on a billion devices, how many bugs will come up. It will truly amaze you. So we spent about an intensive year converting all of our processes over to this DO-178B style of working. And we made this test suite, which is proprietary — that's our intellectual property, the company's intellectual property. The code is public; the test suite is our IP. That's our company's value. And we've got this massive test suite that does 100% MC/DC — modified condition/decision coverage — of the entire thing. And it's been a huge, huge benefit. Since we put that in place, bugs do come up, but they're infrequent; we really haven't had many bugs. It also gives us enormous flexibility to change things. I had this conversation with the Postgres people, because they also have a very reliable database engine, and I asked them: how is it that you're so reliable when you don't have this big test suite like we do? Because we depend on ours. And what we really decided was that they keep their reliability by not changing anything: the code works, leave it alone. Whereas we now have the flexibility to rip out entire subsystems and completely recode them on a point release. And we do this fearlessly. So it takes a lot of work.
I spend more time writing test cases and debugging test cases than I spend writing code. But in the long run it saves you work, because we just don't have a lot of bugs to fix. So it's worked out really well. When I was a graduate student — when I was your age — I thought, oh, I don't need all this process; it's just paperwork, this is nonsense. But what I've come to realize is that it actually does work. So I hope that helps you. Any other questions? Yes, sir? Keep it quick. Yeah. And I guess people can find me around; I'm here the rest of the day, right? You can find me. I have a quick question: is there a special version of SQLite for nonvolatile memory? No. I mean, you could plug in your own little OS layer down at the bottom that's somehow specialized for nonvolatile memory. And remember — I was thinking of NAND flash; you're talking about this thing where it's memory all the time, operating at full speed, it just doesn't forget when you lose power. PCM? Yeah. Right, exactly, it works at full speed. NOR flash was the old technology. Yeah. No, we haven't really looked into that. It's not really optimized for that; it's old school. I guess we could. All right.

Thank you, Dr. Hipp. Yeah, thank you for having me. Thank you. So there's no database talk next week; the next one, I think, is in October.