The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Hey, let's get started. Welcome to another semester of the Vaccination Database Talks series at Carnegie Mellon. This is our second dose: we love databases so much we had to come back for more. Today we're excited to have Philip O'Toole come give a talk about rqlite, which is a distributed database built on top of SQLite. Philip is an engineer at Google here in Pittsburgh. Prior to that he was at InfluxData and a bunch of other places, including TiVo, which I think is pretty exciting; I would bring that one up more often. As always, if you have any questions for Philip as he gives the talk, please unmute yourself, say who you are and where you're coming from, and ask your question. We want this to be interactive for Philip so he's not just talking to himself for an hour. And we'd also like to thank OtterTune and the Steven Moy Foundation for Keeping It Real for sponsoring this event. So with that, Philip, the floor is yours. Thank you so much for being here.

Great. Thanks, Andy. Hello, everyone. It's great to be here. I'm going to talk today about rqlite, which is a lightweight distributed database built on the Raft consensus protocol and SQLite. As Andy said, I'm an engineering manager at Google Pittsburgh, where I've been working on very large scale logging systems for the last three years. But most relevant to this talk, I spent a couple of years on the core team at InfluxDB, where I worked on everything from the HTTP API and SQL parser to the storage engine and query layer. It's an open-source time-series database written in Go, and my experience there definitely influenced my work on rqlite, as you'll see today. And as Andy mentioned, I did work at TiVo once. I didn't realize that would be interesting. If there's somebody out there with a high-definition TiVo: every time you change the channel, that's my device driver kicking in. So there you go.

So what are we going to talk about today? Say that again? Does TiVo run SQLite? No, it doesn't; I just happened to be writing kernel device drivers all the way back then. TiVo does run SQLite, just maybe not his part. Not my part. You're absolutely right.

So what are we going to talk about today? The agenda is: we're going to explain what rqlite does, purely from a black-box point of view. What are its goals? The goals have influenced my work on this a lot, partly because this is mostly a solo project, so my goals are what I execute on, which is kind of refreshing. We'll talk a lot about its design and implementation. It can be difficult to understand how rqlite works without a basic understanding of distributed systems and the Raft consensus protocol, so I'll actually spend a few minutes on that. For many folks out there this may be stuff they've heard many times before, but before we get into the meat of the rqlite software itself, we'll spend a few minutes on distributed consensus. It's good to set the table for what exactly this piece of software is trying to solve. And then we'll finish by discussing some of the practicalities.
How the software is tested, how I measure its performance, where the performance bottlenecks are, and some known use cases and applications.

So what does rqlite actually do? Like I said, I believe it's important to make as clear a statement as I can, so that people understand what I'm trying to do here, even if that means going over some areas that some people are familiar with. rqlite is about replicating a set of data that is written to it using SQL. The data is replicated because you believe your data is so important that you want multiple copies distributed in different places, or because you want to be able to query your data even if some of the machines fail. And these different places could be different machines on a rack, different machines in different buildings, or even different machines on different continents. That's the first thing: it replicates the data. The second thing it does is provide very strong guarantees about what the state of any copy of that data is with respect to a special node called the leader, and that's where Raft comes in. It prevents diverging copies of the data and ensures that there's an authoritative copy of that data at all times. So it's those two things: it replicates the data, and it allows you to make very strong statements about the state of the replicas of that data.

I wrote it in Go, though it statically links the SQLite C source code. So I'm not running patched SQLite or rewritten SQLite; it is the actual C source code. It exposes an HTTP API. This does mean it's not a drop-in replacement for SQLite, but it doesn't hide the database: all the power of SQLite is available if you can work with the HTTP API. It's open source. Development started in 2015, so it's actually been developed for many years, though the first simple implementations were available quite early on, and the current release is 6.0.

So that summarizes what the system does. I like to go top-down at the start. We all use databases, so instead of getting straight into the details: what's it like to use? This is what it's like to use. Its primary interface is the HTTP API, and this gives you a feel for it. You can POST data to it, where the body of the request is a SQLite statement, and you get a response back. And you can also query the API through a GET request. There's a lot more to the API than this, but fundamentally this is how it works. In choosing and designing this API, I was heavily influenced by two things: the success of Elasticsearch's HTTP API, because I've built a lot of systems with Elasticsearch, and my experience with InfluxDB, because InfluxDB's primary query language is actually SQL-like. So I wanted to replicate the best of what I had seen in both systems. And this ties into making rqlite as simple as possible to use.
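To make that concrete, here's a minimal sketch of driving the two main endpoints from Go. The paths (/db/execute and /db/query) and the default port 4001 are as rqlite documents them; the table and data are made up, and error handling is trimmed:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Writes go to the execute endpoint as a JSON array of statements.
	stmts := `["CREATE TABLE foo (id INTEGER NOT NULL PRIMARY KEY, name TEXT)",
	           "INSERT INTO foo(name) VALUES('fiona')"]`
	resp, err := http.Post("http://localhost:4001/db/execute",
		"application/json", strings.NewReader(stmts))
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	// Reads can be a simple GET with the SQL in the q parameter.
	resp, err = http.Get("http://localhost:4001/db/query?q=" +
		url.QueryEscape("SELECT * FROM foo"))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON results envelope
}
```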
So what are its goals? The goals of rqlite are the following. The interesting thing is that the goals were not immediately clear when I first started development in 2015. I actually just created it for fun, just to answer the question: could I take SQLite, combine it with a consensus protocol, and see what would happen? And it actually worked. It was only as implementation proceeded over the first couple of years that I began to realize what the goals actually were, and those goals drove the design and implementation of later versions. It might seem kind of obvious, but you'd be surprised how often people don't let goals drive their software; they let feature creep, or just something else like fun, drive it.

But the goals have become very clear in my mind and have actually helped me make explicit choices. The first three goals are the following: it's got to be trivial to install, easy to operate, and easy to use. Why are these goals so important? They're important for one reason: because in the real world, in my own personal experience developing databases, databases must be operated. And I've seen this as a user of databases: if your end users cannot operate and deploy your database quickly, get it working, and get some value from it with minimal hassle (especially at startups, where I've spent a lot of time), they'll just move on to another technology when they find something else that works better for them. Also, in any real-world application, database installation and management is at least as important as the ability to read and write data. If users can't back up your database, restore it, and monitor it, it's not actually suitable for real-world applications. This is why rqlite supported backup and restore almost from day one, even when it was a much simpler DBMS than it is now. Because if people don't feel comfortable that you won't let them down, or if they don't have a backup plan, they won't use your database. So that's why trivial to install, easy to operate, and easy to use are its first goals.

A second goal is that it's got to be high quality. That seems kind of obvious (I mean, it's a database), but it doesn't just mean data integrity. It means that it's reliable, stable, and predictable, because all those attributes make it easy to use. I'm mostly a solo developer on this, so I've been able to keep focus on quality, even if it means new features are introduced slowly. I only introduce new features after I'm sure they will improve the database and are coherent with the larger vision. Now, this does mean that rqlite may be too simple for some applications. But I would rather the database do something very specific, perhaps limited, and do those things very well, than do a larger set of things only somewhat well. Let me give you an example. To this day, rqlite still doesn't support transactions in the way another type of distributed database might offer them, because it would introduce a lot of complexity into the code. I haven't yet figured out how to introduce distributed transactions while still keeping the system easy to operate and reliable, operating well at that particular sweet spot I think it's best for.

The explicit goal of high quality also influences the implementation and testing, and I'll have a lot more to say about testing later on. But in my experience, testing is so important that if a new feature or a potential code refactor cannot be tested properly, I take it as a strong sign that the design is poor, or that the feature just doesn't make sense for the system. You'll see the emphasis I put on testing later. And I think this actually is a pretty good system to study if you're trying to understand distributed consensus systems and databases today.

So I'm going to get into the design, and this discussion is probably the meat of the presentation today. There are two phases to it. First, I'm going to take a very basic look at the distributed nature of rqlite and how that melds with SQLite. There are two very distinct concepts in this database: distributed consensus, and a high-quality database.
And through the magic of software composition, we have managed to put two things together that have actually proven to work really well over time. But to understand the distributed nature, I'm going to go through a simple explanation of what the core problem in distributed systems is.

So why do we have distributed systems? Distributed systems can do one of two things, or even both. They can provide reliability, because your data is located in multiple places. And the computation is also available in multiple places, which means a distributed system may be more scalable, more powerful. However, distributed systems are often a bit more difficult to operate, and that does introduce tension with the easy-to-operate goal of rqlite. This is why I think it's important to remember why we build distributed systems. They can be kind of cool, but in practice they're often a hassle, a necessary evil to actually solve problems. No one builds distributed systems in the real world if they can help it.

And why pick SQLite? It's a rock-solid relational database. It's in a single source file. I have used it to build systems before. If I replicate a SQLite database, then I get reliability, because the SQLite databases are copied in multiple places. And thanks to the Go programming language, I can just link in the C source code, and I've got a standalone binary that has the Raft consensus system and SQLite all statically linked in one executable. This makes deployment trivial, as we'll see later on. So I take two things, a well-written, well-understood consensus system and a world-class database, package them together in literally one binary, and we're good to go.

Before I get into the details of the rqlite design, I always like to talk very briefly about the core challenge of distributed systems, so everybody can understand what problem we're trying to solve. The core problem is this. It's very easy to set a value on a single node. I want to set x to 5 on a computer; we all know how to do that. A piece of software says x = 5 and so forth. But often our data is so important that we want to replicate it. That can be to avoid data loss, or perhaps you want to horizontally scale for reads or something. So the question becomes: what part of the system should do the replication?

One answer is that the system takes care of it itself. You write x = 5 to the first node, and node A transparently copies it to node B, and we're done. Great. But what if the nodes can't communicate? If the two nodes can't communicate, what should we do? Should node A respond with a failure? Should it inform the client? Should it automatically take care of it later on? What if node B never comes back online? It actually gets a little bit complicated what we should do in this situation.

Okay, let's try this. Another approach is to push the responsibility to the client. Now the client is responsible for replicating the data. This does mean more work on the client, but it gives the client maximum flexibility and information about what has happened at each stage. The interesting thing here is that the nodes are now much more similar; each one must be able to take an update. Okay, so what happens if there's a disconnection between the nodes, a network partition?
In this case, there's nothing stopping two clients writing different values for x to each node. And in the face of a network partition, what should happen when the cluster is repaired? Now, I know no system actually operates as simply as this in the real world, but the core problem at the center of distributed systems is not too difficult to understand, yet surprisingly difficult to solve. That is the problem known as distributed consensus, and this is what Raft does for my system. It provides a solution to the problem of what the authoritative state of a variable is at any point in time in a clustered system, even in the event of failures of that system. And that state, x = 5: x in my case is an entire SQLite database. That is how I bind the two together.

Specifically, the protocol I use is called Raft. There have been a couple of others. Paxos is very famous, and also famously difficult to understand. If anybody's used ZooKeeper, that's another system built to solve this problem. Raft is a more recent solution that came out of Stanford in the last ten years.

A couple of core Raft concepts before we get into how this works in rqlite. By definition, every node in a Raft cluster agrees on the events and the sequence of those events, and that's called the Raft log. That's it. In fact, the Raft consensus protocol is specifically about ensuring that this log is the same on every node. It says nothing about what you do with the events that are in the log. It says nothing about how you should apply those events to your state machine. It doesn't even make a statement about the structure of those events. All the Raft protocol says is: here's a piece of software, and if you pump in an event, I will make sure that it is in the log on every other node that is registered with the same piece of software, and in the same sequence. And that turns out to be an incredibly powerful construct: log replication. Raft also does one other thing: it defines the leader election protocol, which ensures that there is one and only one leader, and the leader is an important node within the cluster.

So as I said, the Raft log is a sequence of events, persisted to disk, representing the series of changes to a state machine. The leader is the single node in a cluster responsible for coordinating replication of the Raft log. And a follower is one of the one or more nodes in the cluster that maintain copies of the Raft log; any follower can be elected leader if the current leader fails. That's a very high-level summary, without going into the protocol details, of what Raft does for us. Now we have a system that says: I will make sure that this sequence of events, which in our case are actually SQLite statements, will be the same on every node. So let's see how rqlite applies this to a SQLite database and builds a real-life distributed database.

Conceptually, this is how it works. Raft provides a solution to the consensus problem. It abstracts away the details, allowing programmers like me to choose the state to be agreed on, and in the case of rqlite, the state is a SQLite database. You can send any SQL statement you like to the Raft consensus system, create a table, whatever, but only those SQL statements that reach consensus will ever be sent to the SQLite databases that are sitting under each node. And that's the key thing, right? Think of the Raft consensus protocol as one black box into which clients can send statements, and only statements that get through the distributed consensus filter will be applied to each of these SQLite databases. Each of these databases, one sitting on each node, is exactly the same copy on every node. This is not partial replication, where one table is on one node and another table is on another node. It's the exact same database under each, because the same sequence of SQL statements is applied to each.
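That determinism argument (the same ordered log of statements, applied to the same starting state, yields identical copies) is the whole trick, and it fits in a few lines. A toy illustration, not rqlite code, with a map standing in for SQLite:

```go
package main

import "fmt"

// A trivial "state machine": a key-value map. Applying the same
// ordered log of commands to two independent copies must leave
// them identical. This determinism is what lets Raft replicate
// any state machine, including a SQLite database.
type kv map[string]string

func apply(m kv, log [][2]string) {
	for _, entry := range log { // order matters; Raft guarantees it
		m[entry[0]] = entry[1]
	}
}

func main() {
	log := [][2]string{{"x", "5"}, {"x", "7"}, {"y", "1"}}
	nodeA, nodeB := kv{}, kv{}
	apply(nodeA, log)
	apply(nodeB, log)
	fmt.Println(nodeA, nodeB) // identical: map[x:7 y:1] map[x:7 y:1]
}
```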
In theory, for the detail-oriented folks out there, it's actually a quorum. Clusters can be anywhere from one to N nodes, and what the Raft consensus protocol actually says is that only those SQL statements that have been committed to n/2 + 1 nodes will ever be applied to your state machine. In practice, clusters are usually three, five, or seven nodes in size; after that it starts to get a little bit tedious to run. But there is no theoretical limit to how large you can make these clusters, and the bigger your cluster, the more node failures you can tolerate. For example, with a three-node cluster I can lose one node and the cluster still operates fine. So I hope that makes sense.

So basically what you're doing is, you have your consensus layer written in Go, and somebody starts issuing queries, say within a transaction, and when they go to commit, you then run Raft, and if the nodes agree, let's do this, then you apply the changes and acknowledge back to the client? Exactly. That's exactly it. I'll go into a little more detail, but that's basically it. And it's a successful example of composition, because neither the Raft consensus protocol nor the SQLite software needed to change. rqlite is all about the management software, the glue software, the testing, the performance, the correctness; neither component had to change for this to work.

This is another way of looking at the system, a more cluster-oriented picture. You have the leader; the client talks to the leader by default and sends changes to it. The leader will ensure that any statement reaches consensus, and once that statement reaches consensus, it's applied to the SQLite database. rqlite is a CP system in terms of CAP. That means that if there's a network partition, one side of the cluster will continue to offer consistent reads and writes, but the other side will be unavailable. In other words, rqlite gives you consistency before it will give you availability. The key thing is that each SQLite database down there is exactly the same under each node, assuming all nodes have caught up with the leader. You will never, ever get divergent copies. You may get some SQLite databases that are further behind than others, but you'll never get conflicting copies of the data.

Very briefly, for the programmers out there, this is the interface to the Raft system. There is an Apply call: when I have something that I wish to apply to the state machine, I call Apply. Go programmers will recognize the code here. I don't tend to go into it too much, but I think some people like to see the code.
The Raft log entry in this context is basically like a void pointer; Raft doesn't care what it is. It's: apply this lump of bytes to the state machine. In rqlite that encapsulates a SQLite statement, but the Raft consensus system doesn't care. There is Snapshot, which is for getting a full copy of the SQLite database; there's a thing called log truncation, which I won't go into, though the distributed-systems people out there may know it; and there's Restore, for bootstrapping a node from the Raft log. And then there are some other utility functions. But this interface was all I had to implement to connect the Raft consensus layer to SQLite. I won't go into too much detail, but I think some of the programmers may be interested in seeing this, because they may be interested in applying the Raft concept to their favorite state machine.
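For reference, here's roughly what satisfying that interface looks like with hashicorp/raft, the library discussed later in the talk. Only the raft.FSM method signatures are the library's; the Store type and its stubbed method bodies are illustrative placeholders, not rqlite's actual code:

```go
package store

import (
	"io"

	"github.com/hashicorp/raft"
)

// Store is a sketch of a state machine satisfying raft.FSM.
// In rqlite the state machine is the SQLite database.
type Store struct{ /* would hold the SQLite connection */ }

// Apply is invoked once a log entry has been committed by the
// cluster. The payload is opaque bytes (in rqlite's case, an
// encoded SQL statement) which is only now applied to SQLite.
func (s *Store) Apply(l *raft.Log) interface{} {
	_ = l.Data // decode and execute the SQL statement here
	return nil
}

// Snapshot returns a point-in-time copy of the database, which
// Raft uses for log truncation.
func (s *Store) Snapshot() (raft.FSMSnapshot, error) {
	return &snapshot{}, nil
}

// Restore rebuilds the database from a snapshot, e.g. when
// bootstrapping a node.
func (s *Store) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	return nil // would load the snapshot into SQLite here
}

// snapshot is a stub raft.FSMSnapshot.
type snapshot struct{}

func (s *snapshot) Persist(sink raft.SnapshotSink) error { return sink.Close() }
func (s *snapshot) Release()                             {}
```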
So let's go into this slide. There's a lot here, and I think we may have questions about it, because this really gets into how the two things work together. What are we looking at? We're looking at an architectural diagram of the software inside an rqlite node. I'm going to show the lifecycle of a couple of executes and queries, and I think that will help people understand what's going on. You've got the HTTP API on top. We've got some authentication functionality in there, in case people want to secure access, plus backup and restore, and stats and diagnostics, which are very important in the real world: databases must be operated, managed, and scaled, so that's very important functionality. We've got a lot of functionality out on the right-hand side for cluster communications, and I'll explain how that works in a second. You've got the Raft subsystem, which is the same code that runs inside Consul. You've got a translation layer that goes from the Go code into the SQLite source code, and then SQLite itself, which either uses disk or RAM. rqlite in its default mode uses RAM, because the Raft log is on disk, and the Raft log is the authoritative record. So it doesn't matter that the SQLite database is in RAM, because everything can be rebuilt from the log, and it has performance benefits.

Let me go through this; there are some interesting flows that people might take from it. There are two terms I'm going to use, because they're the terms used in the Go code: a SQL statement that changes the database is called an execute, and a SQL statement that just reads is called a query. So let's look at the lifecycle of an execute, something that actually changes the SQLite database, in a clustered system. The write comes in and hits the Raft subsystem. The Raft subsystem first communicates with the other nodes in the cluster (theoretically it needs only a quorum, but it sends it to all the other nodes; this node is the leader) and ensures that every other node, or a quorum, can accept it. Accepting basically means they can write it to their log. It doesn't have to mean that it makes sense to the SQLite database. In fact, this execute, this write, may actually be nonsense; it may be SQL junk. It doesn't matter. The protocol just makes sure that everybody has agreed that they have received this command, this execute, in sequence. If they say yes, it's committed, and it is written to the Raft log. Now Raft consensus is done. That's it. This node could go away and we haven't lost any data.

After it has gone to the Raft log, after whatever that execute was has reached consensus, only then is it applied to the SQLite database. Now it actually makes a change to the state machine, but that's just a nice downstream side effect of the Raft system. And now we have a SQLite database that has been updated on all the nodes in the cluster. You might say, okay, that's interesting. Now compare queries. Because queries do not involve consensus, they can go straight to the database. If you understand that difference, you'll understand a lot about how the rqlite system works. Is there a question?

Yeah, so Joseph, do you want to unmute yourself and ask your question? Yeah, sorry, thanks. Really interesting so far. My understanding is that the Raft subsystem is not doing any SQL interpretation at all. So you can send any old thing that perhaps looks like, or perhaps doesn't even look like, a proper SQL statement, and they all get written to the Raft log? That is correct. I am relying on the code in the SQLite system itself to do all the validation. I do support parameterized queries, to protect against SQL injection and so forth. But Joseph, you're absolutely right: the Raft subsystem doesn't care. It just promises that a given sequence of bytes in a given command will be in the Raft log, in the same position, on every node. So you are right: if somebody wrote a ton of junk SQL statements, they would all be in the Raft log, until the Raft log was truncated, which is a Raft thing people can read about. But in some ways that's a strength of the system too. All the checking of the SQL statements is done in a single place, and I rely on that high-quality code. I don't have to replicate or copy any of the SQLite logic; it's all done in SQLite. But the answer to your question is yes, that is true.

Does that also apply to the distinction between writes and reads? Is that made at the API rather than by Raft itself? Well, as I show here, a read normally just goes straight to the SQLite database, because you're not changing anything. There's no quorum to be concerned about, because a read by definition cannot change the state of the database. So it doesn't need to go through the Raft consensus system: since it can't change the state, it can't make the state conflict with another node. It's just a read, and whatever is there is there.

Now, how is that classified? Are those two separate API calls? They are, yes. Very good question. There are two distinct HTTP endpoints, one for queries and one for executes. In client libraries you can do things like check whether a statement is read-only and automatically redirect it to the right endpoint, but I haven't done that yet. But you are correct, there are two distinct endpoints, depending on whether you're doing a query or an execute.

What if you send an insert to the read endpoint? If you did that, what happens today is that the query endpoint uses a read-only connection to the database, so that prevents user error in that case. Earlier versions of rqlite did not do that, and yes, you could actually change the database, and you could get it into an inconsistent state if you did that by accident, because you would have changed the SQLite database directly without going through consensus. And that's why it's very important to protect people from doing that.
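A sketch of that guard in isolation, using the mattn/go-sqlite3 driver and SQLite's standard mode=ro URI parameter. This is a hypothetical illustration of the idea, not rqlite's actual wiring; it assumes an existing app.db containing a table foo:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// The query path hands out a connection opened read-only, so a
	// write that sneaks in via the read path fails instead of
	// mutating state outside of consensus.
	ro, err := sql.Open("sqlite3", "file:app.db?mode=ro")
	if err != nil {
		log.Fatal(err)
	}
	defer ro.Close()

	// A read works fine...
	var n int
	if err := ro.QueryRow("SELECT count(*) FROM foo").Scan(&n); err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows:", n)

	// ...but an INSERT through the read-only connection errors out
	// ("attempt to write a readonly database").
	if _, err := ro.Exec("INSERT INTO foo(name) VALUES('x')"); err != nil {
		fmt.Println("write rejected:", err)
	}
}
```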
Does that protect you with, say, a stored procedure that at one point was read-only, and then its definition changed so that it actually modifies state, and you didn't update which endpoint you call? Well, I don't think SQLite supports stored procedures. I could be wrong, but whatever SQLite supports, as long as you use the API endpoints correctly, it will generally work. I think we have some more questions.

Yeah. Do you want to unmute yourself and ask your question? Yeah. So, is the write acknowledged back to the client after writing the Raft log to disk, or after that write is reflected in SQLite? Good question. The Raft subsystem offers programmers like me the ability to signal back when the state machine has been updated. So the write is actually acknowledged to the client after the actual SQLite database has been updated. Okay, thank you. Sorry, I have one more question. Sure. How do you protect against non-determinism in write queries? If the write query depends on time, or has some randomness in it, does that mean the replicas might get a different view, or might apply a different transaction?

It does, and this is called out in the rqlite documentation. If I'm understanding your question, there is a very small set of SQLite statements that are not supported. Time, for example, is one of them, and random is another, because random would be evaluated after the consensus system, which means it could return a different value on each node. The technical way to say this is that rqlite does statement-based replication; it doesn't do binlog-style replication. So if you have a SQL statement that can be evaluated differently each time, that could cause problems with this system. There are ways to get around this: you could parse the SQL statement before it hits the Raft subsystem and resolve things like time to the actual time. I haven't chosen to do that yet. It could be done in practice, but the software would suddenly become a lot more complicated. VoltDB has to do that, because they do active-active replication and the stored procedures run independently, so you have to trap time, random, and other stuff. Right; these are insightful questions, you're right on.

Do I keep going, Andy, or are there any others? I'm not watching the chat. So, one other thing that might be interesting to people, to understand this difference a little more: I have just explained how executes must go through the Raft subsystem, but reads don't. In fact, there is a use case for a read to go through the Raft system anyway, and that is what's called a strong read. There are certain applications and use cases, often made famous by the Jepsen test system, where you must be able to guarantee that the read took place (I don't have quite the right terms) in a very strict linear manner with respect to the entire system. Strict serializability, or strong consistency. Same thing. So in that case, you can actually send your query through the Raft consensus system, and that guarantees that when the query executes, it executes in a very specific relationship to all the other changes that were made to the database.
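From the client's side this is just a per-request choice. rqlite exposes it as a level parameter on the query endpoint, with values none, weak (the default), and strong. A sketch, again assuming a node on the default port and a table foo:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

// query issues a read at the requested consistency level.
func query(level, sqlText string) (string, error) {
	u := "http://localhost:4001/db/query?level=" + level +
		"&q=" + url.QueryEscape(sqlText)
	resp, err := http.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	// Fast path: served from the local SQLite database.
	weak, err := query("weak", "SELECT count(*) FROM foo")
	if err != nil {
		log.Fatal(err)
	}
	// Slow path: goes through Raft, guaranteeing the read reflects
	// all previously committed writes.
	strong, err := query("strong", "SELECT count(*) FROM foo")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(weak, strong)
}
```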
Why don't I do this all the time? Because, as you'll see later, performance goes way down the moment you go through the Raft subsystem. Most people don't need this. But rqlite does have configurable read consistency. Nothing comes for free, and a strong read does involve Raft consensus. The level is selected by the client, and it defaults to weak, which basically goes straight to the database.

So that brings us to the end of the design section, and we're going to get into some practicalities. But just to wrap up: why do I think this design and implementation has met rqlite's goals? I think the choice of Go makes it easy to install. It's a statically linked binary, literally a single binary. You can drop it on any machine and spin it up; you can have an rqlite cluster in like five seconds. Go has a great unit-test framework, which has made testing really easy. It has an HTTP API, which checks the easy-to-use box. SQLite is statically linked into the binary, so there's no need to install it, no "libsqlite3 is missing from your machine" problems, and it's obviously very well known and rock solid. And Raft is designed with ease of understanding and clarity in mind, and the code I am using is actually the same code that runs inside Consul. So the choices have met a lot of the goals, and it's been great to see.

Andy, I think I have a question. I was just posting in the chat about the different isolation levels, a page where people can read more about strict serializability. Nice, very good. I would love to have someone find the time to run rqlite through the Jepsen test. I assume it would pass, but I am not a Clojure person, so I have never written it. Is your Raft implementation your own, or are you taking the open-source one? I take the open-source one from HashiCorp. So if Consul passes, mine should pass; it's unaltered.

Okay, so that was the design phase, and the last part of my presentation is to talk about practicalities. This means: let's take a look at rqlite in the real world. I talked a lot about testing, so we'll spend a few minutes on it. Testing is very important to me. Testing, specifically the ability to test, has significantly influenced the implementation. So let's take a look at each level of testing in this database.

Unit testing is the fundamental test approach, right? When it comes to unit testing, if I can't unit test a component cleanly, it's a strong sign that the component's design or interface is wrong. Or if the component I want to unit test is bringing in too many other components, then it's not focused enough on one task, or maybe it needs a little bit of dependency injection, and so forth. This is very important: most database-level testing takes place at the unit-test level. And this works very well, because rqlite is a shared-nothing system when it comes to the database layer. A query never needs to be satisfied except by querying the local SQLite instance. There's no reason to query two nodes at once, bring the data together, map-reduce it, and figure out the result. So unit testing almost completely covers the requirement to make sure that the database, and my thin layer on top of it, is working well.

Next comes system-level testing. System-level testing, which you could also call integration testing, focuses mostly on the Raft consensus system and how it interacts with the SQLite database.
Things like making sure that a SQL statement is actually replicated to each node, that the various read-consistency levels actually work, that leader election works, that if I take two nodes out of the cluster the cluster can still continue, and that when those nodes rejoin they get all the changes that were missed, and so forth. That's what system-level testing is, and I do single-node and multi-node testing here.

And then finally, there is very limited end-to-end testing, done by spinning up actual rqlite clusters using a Python-based test suite. This is really to ensure that the system basically works, right? That I don't release a system where all the unit tests and system integration tests pass, but I embarrass myself by releasing something that doesn't start because it required a read-only file or something silly. Tests are added at the end-to-end layer very deliberately, and only if I can't test the behavior at any other layer. Anybody who's been a programmer will know this about end-to-end testing: while it can seem impactful at the start, it's very difficult to debug after a while, and very costly. So the end-to-end testing is basically a small sanity check.

I'm proud of the fact that I stick to the testing-pyramid philosophy. Let me check my numbers here; these are for version 6.3, which is two releases ago. There are about 36,000 lines of source code in the whole thing, excluding the SQLite source code and the HashiCorp code, so just the code I've written. 10,000 of those lines are unit tests, 7,000 are integration tests, and less than 1,000 are end-to-end. That pyramid is very important to the quality of the testing, and to making sure the testing runs quickly. Programmers out there may recognize this approach; I have found it to work very well.

How fast can you run all your tests? About seven minutes, and the unit tests alone in about four. So yeah, that's great. I will often run them during development, and my continuous integration and deployment system up in CircleCI runs them with every PR. It's very, very important.

I actually only recently came across a test scenario which I have not yet figured out how to automate. Well, I know how to automate it, and actually Richard will know about this: people finally started pushing this thing over two gigabytes of SQLite database size, and there are some intrinsic limits in SQLite that I need to work around, and I need to be much more clever about memory management. But it takes a few minutes to load up a two-gigabyte SQLite database, so I'm thinking about how to test these scenarios. Do I keep a standalone SQLite database up in a GCS or S3 bucket that I can pull down, and so forth? Except for that particular case, I have managed to get extensive coverage of all the known use cases in a test suite that runs in less than 10 minutes, running three different test suites in parallel. But the very large SQLite database test is not something I can run very quickly right now, and I'm trying to think of approaches for that.

I mean, SQLite has generate_series, right? So you can generate a synthetic database very quickly. Oh, well, I didn't know about that, so I'll look into that. Yes, I'm looking for something I can generate very quickly. But the thing is, Andy, I also have to write it to the Raft log.
So there's a lot to do, and as you'll see when I talk about performance in a minute, it can take a few minutes to load up very large databases. I have some ideas, but that's the only test case I've come across so far, very-large-scale testing, which cannot be run in the context of CI all the time.

Performance testing. A minute ago I talked about correctness testing; let's talk about performance testing. What do I do here? It's pretty simple. I want to measure the maximum insert rate and determine the current query rates. I want to look for system issues during load testing: memory leaks, excessive disk usage, and so forth. And then just general system stability, right? When I run this thing for a day, hammer it, really load test it for one day, is it still running the next day, and is everything okay? I do this usually after every major release, and for intermediate feature releases when I have a change that I think could affect performance. The last time I did this was 6.3; I didn't do it for 6.5.

What does my test harness look like? It's an rqlite cluster. Because of my experience at InfluxDB (since I was part of the development team, I still have a weakness for that great software), I use Telegraf for metrics collection, pump the metrics into InfluxDB, and use Grafana for analysis. My test runs are normally 6 to 12 hours in length, and this gives me the basic stability and resource testing. I do it with a single node, and if a single node looks good, then I'll spin up a cluster.

What kinds of things have I found? I have found memory leaks. You could have a small memory leak that might not be noticeable at the start, but shows up when you run for a long time. I have links to blog posts about this in the speaker notes, which I can share afterwards. Performance testing has also shown me, as one might guess if you go back to the architecture diagram, that reads are hundreds of times faster than writes. A read just goes into the node, hits the SQLite database, and comes back out again, so reads are very, very fast; thousands of reads a second, of course.

The disk is usually the bottleneck. The disk always saturates, and it is the Raft log that is saturating the disk, because the Raft log must do an fsync after every write, even for a tiny piece of data, and that quickly saturates the disk. This is why I run the SQLite database in memory by default: it's one less load on the disk, and by taking the SQLite database out of the picture, the entire disk is available to the Raft log.

So, a little more detail. Disk I/O is always quickly saturated, but a single node running SSDs can easily sustain hundreds of inserts a second, five or six hundred, and 20,000 queries a second. Once you move to clustering, you'll see the insert rate drop off; you'll see that in a second. The only way to improve performance, if people aren't happy with the six or seven hundred inserts per second on a cluster, is to scale vertically. rqlite doesn't scale horizontally for writes; you can scale it vertically with better disks and better networks. But the insert rate is in the same range as HashiCorp Consul, which is a production system built on the same Raft code. So it's perfectly usable, but it's not thousands of inserts per second; it's hundreds, on high-end machines. I mean, that's the limitation of Raft, not... Exactly.
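To get a feel for why the per-entry fsync dominates, here's a small self-contained Go sketch, nothing rqlite-specific, comparing synced and unsynced appends. Numbers will vary wildly by disk:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

// Every committed Raft entry must be fsynced before it's
// acknowledged, so write throughput is bounded by how many
// fsyncs per second the disk can do, not by bandwidth.
func main() {
	f, err := os.CreateTemp("", "raftlog")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())

	entry := make([]byte, 256) // a tiny log entry
	const n = 500

	start := time.Now()
	for i := 0; i < n; i++ {
		f.Write(entry)
		f.Sync() // durability barrier after every entry
	}
	fmt.Printf("synced:   %.0f entries/sec\n", n/time.Since(start).Seconds())

	start = time.Now()
	for i := 0; i < n; i++ {
		f.Write(entry) // no fsync: the OS page cache absorbs it
	}
	fmt.Printf("unsynced: %.0f entries/sec\n", n/time.Since(start).Seconds())
}
```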
There's nothing for nothing. One thing that I did introduce, I believe in 6.0, is that I compress the Raft log. Again, Raft doesn't care what's in the entries, so I gzip the SQL statements once they're above a certain size, to reduce the I/O load. It didn't actually help the insert rate that much, and I'm not sure why; I haven't dug into the details. But it did reduce the disk requirements of rqlite enormously, and that was very helpful to keeping it a lightweight database.

Gzip is pretty unusual for a database, right? It's usually LZ4. Yeah, maybe I should try something else. Basically, I went to the Go standard library, which has gzip, and the improvements I got were a factor of 10 sometimes, depending on how verbose the SQL statements were. It also helps because the compression kicks in not just on the way to disk, but also between nodes, so the rqlite nodes are now sharing compressed state messages as well, and that was very helpful. I also moved to Protobuf encoding; the insert rate improvement was disappointing, but the footprint of the system reduced drastically, and that was helpful.

It seems like you'd get much better compression if you had a... Oh, I am sure there is; people with more expertise could potentially help a lot, actually. But I'm less interested in digging down this path more, because, maybe I'm missing something, but the insert rate did not go up as much as I hoped. I thought with a five-times reduction in size I would see an equivalent increase in speed, and I did not. You're spending CPU time on gzip, right? So the experiment you want to do: if the network delay is long, you want to gzip, because you're trading CPU time against transfer time, but if the two nodes are on the same box, gzip is just going to take forever. Quite possibly, but it was still a win, because people run this on small systems, and they don't want to use a huge amount of disk. Everything happens quicker, especially if people are loading large statements: snapshots, which is a technical Raft term, happen quicker; log truncation happens quicker. Everything just happens quicker when there's less to write to disk at the end of the day. So it was still a win. It can be turned off if users don't want it, but I have found it always to be a win.
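The mechanism is simple enough to sketch. Here's an illustrative Go version of compress-above-a-threshold using the standard library's gzip; the threshold value is made up, and this isn't rqlite's actual code:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// Below this size, compression isn't worth the CPU. Illustrative only.
const compressThreshold = 1024 // bytes

// maybeCompress gzips a statement before it goes into the Raft log,
// but only past the size threshold. The second return value reports
// whether compression was applied.
func maybeCompress(stmt []byte) ([]byte, bool, error) {
	if len(stmt) < compressThreshold {
		return stmt, false, nil
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(stmt); err != nil {
		return nil, false, err
	}
	if err := zw.Close(); err != nil {
		return nil, false, err
	}
	return buf.Bytes(), true, nil
}

func main() {
	// Verbose SQL compresses well; Raft doesn't care either way,
	// since log entries are opaque bytes.
	stmt := bytes.Repeat([]byte("INSERT INTO foo(name) VALUES('fiona');"), 100)
	out, compressed, err := maybeCompress(stmt)
	if err != nil {
		panic(err)
	}
	fmt.Printf("in=%d out=%d compressed=%v\n", len(stmt), len(out), compressed)
}
```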
A little bit more: some practical results. These are results I ran recently with 6.3 on Google Cloud Platform. I chose reasonably good machines: four virtual CPUs, 16 GB of RAM, 10 GB SSDs, running Ubuntu 20.04. Let's look at the numbers; I ran this test a couple of times. For comparison, with just a single node, no replication, loading it with as many inserts as possible through the HTTP API, I get more than 500 inserts per second. Now move to a single zone. What's a single zone? A single zone is where all the VM instances are running inside the same data center in Google Cloud Platform. The ping time between the nodes is 0.03 milliseconds. Insert rates immediately drop by half. But it's not just the network latency; it's the clustering overhead that just happens in the HashiCorp Raft code, which I haven't dug into enough, but there's overhead there in the clustering system. The reason I know it's not the ping time causing that first drop is the next result, where the ping time goes up by a factor of 10 but the insert rate doesn't drop much. And if people are interested in digging into the HashiCorp code, they can find out even more than I understand right now about why these numbers are what they are; I ran them twice to make sure.

So when I deploy three VMs, each in a different data center within the same GCP region, where the ping time goes to 3 milliseconds, my insert rate doesn't drop that much. So clearly it's something intrinsic in the clustering code, not just the network latency. Second to last: deploy three VM instances in different parts of the continent, in this case Oregon, Iowa, and South Carolina, where the ping time goes up by another factor of 10. Now the insert rate really has started to drop, and I can only get 26 inserts per second. And for fun, I then deployed three rqlite instances, one in London, one in Tokyo, and one, I think, in Iowa. Now the ping time is 90 to 200 milliseconds, and I get 10 inserts per second. I will say one thing: I managed to spin each of these clusters up in about 30 seconds. This goes back to the thing I am proud of: I can give you a distributed relational database in 30 seconds, and it will just work. It's kind of amusing, but it is real, and it is a nice thing to be able to spin up, for fun, a globally distributed database in about 30 seconds.

You need a baseline to say: if you don't go through your layer, how many inserts on a single node can you get, and then you add your layer on top. If you bypass Raft and just go directly to SQLite? You would get enormous insert rates; SQLite is very fast. But there's nothing for nothing, right? For real systems, this is the sweet spot: a single region, where every VM is in what is called a different zone. These are GCP concepts, but the concepts are the same in AWS. What this gets you is a decent insert rate, a couple of hundred inserts per second. This is not going to be a high-volume transactional database where you're storing Twitter feeds, but if you need to store very important data in a relational sense, this will do it. And by putting it in a single region, with each VM deployed in a different zone, most cloud providers promise that every zone in a region fails independently. So that's the spot where you get decent, reasonable insert performance, but also reliability, because only one node should go down at a time. That's what my testing showed.

So, the final part of the discussion: lessons learned, some practical things I wanted to share, and some use cases. First, over seven years of database development: there's a famous quote from a guy called Fred Brooks, who wrote The Mythical Man-Month: "plan to throw one away; you will, anyhow." And I have found this to be true. This is the idea that you don't really understand a software problem until you've solved it at least once. I didn't really understand how to build rqlite in a way that was usable until version two, or even version three. Very early versions of rqlite had bad data modeling and a poor API, and it was difficult to use. Is there a question? Yeah, we'll cover that at the end.

Next lesson: you can have significant bugs for many years. The 6.0 series fixes a long-standing rqlite connection-handling issue, where the code wasn't quite using the SQLite connection in a thread-safe way. It was only when some folks in Israel were really loading the system that they brought out this issue.
I still don't quite know how they did it, but they brought out a race condition in how it was handling connections. 6.0 brings much more polished connection handling and uses the SQLite interface very well now, I hope. But it's just interesting: if you're looking for a robust database, nothing replaces using one that's been around for a long time, because there are always some deep bugs. And so there was one here, even with all the emphasis on testing.

There are a lot of misconceptions about what I've built. People assume it's a drop-in replacement for SQLite. They think it's a peer-to-peer system. I've had quite a few people say, I want to run rqlite on all the cell phones in AT&T's network, and they'll just come and go, and I'm like, it's not that kind of system. So it's interesting; there are a lot of misconceptions about rqlite. Andy, maybe you've hit this: there are a lot of misconceptions about databases in general. People hear database and SQLite and distributed and replicated, and they assume a lot. A fair few people don't appreciate the difference between distributing something for reliability and distributing it for performance. Systems are often distributed for both reasons, but not always. Databases are a hot topic, there are a lot of misconceptions in the field, and it can be interesting to talk to people, especially in a world of frantic development where people want to get something up and running quickly, only to discover later it's not actually what they want. So that has been interesting too.

So, known applications. Just for the record, I think the niche use case for rqlite is the storage of a small amount, a few megabytes to tens of megabytes, of really critical data that you want to model relationally, where you want something that's easy to use, rock solid, and just works. It's not suitable for a lot of things, but I think that's the sweet spot. But in practice, some people just use it on a single node, because they love having an HTTP API in front of SQLite. There's an open-source project, I've linked to it there, for creating networks through firewalls; they use it for storing configuration. There is a bank in Tennessee that stores log data in it. Again, they just like storing log data in SQLite, but they also like being able to reach SQLite over the network. So many people just use it for the API, which to me is a compliment that the API works well for folks. Some people like to use it for data that they store canonically somewhere else, but want to present to other people in a relational model.

There's a start-up in Israel that reached out to me; they run multiple nodes to store configuration information in their Kubernetes-based system, and they want it to be reliable, so they like the clustering feature. But ultimately they don't care if it goes away, because they rebuild it from etcd any time there's a complete failure. I thought that was kind of interesting: it's a redundant but ephemeral system at the same time. They are using it, and they have helped me improve the software a lot.

And finally, for some reason, a lot of people who are interested in building Bitcoin applications want to use rqlite. I think it's because Bitcoin developers are often very interested in getting something running quickly; they want something fast for the system and the idea they have.
But whatever they do, they run it themselves, because they feel that what they're doing is sensitive, and they don't want their information outside of software they have built and run themselves. So they seem to be attracted to the fact that rqlite is written in Go and that they can spin it up quickly, and I've just noticed that people like to store Bitcoin information in it. I'm not a Bitcoiner, so I don't know.

So that is it. What did we talk about today? We described rqlite, which is a lightweight distributed database that uses SQLite as its database engine. We discussed its design, with particular reference to the Raft consensus system. And we finished with an overview of testing, performance, and some known applications. That's it, and I have some reading materials there that I can distribute with the slides afterwards. That finishes my talk.

Okay, I'll applaud on behalf of everyone else. Thank you, Philip, this was very interesting. All right, so we have time for questions. If you have a question for Philip, please unmute yourself and ask your question. Amol, I see you first. Do you want to go for it?

Sure. Thanks for that, that was really interesting. I guess I might have missed this in the beginning, but reads are not going through Raft, right? Not by default. So then which SQLite database do they go to, among the ones available? They just go straight to the translation layer, and then straight to the statically linked SQLite source code, and SQLite serves the query and returns the results straight back up. I guess what I'm trying to get at is that one of the issues a system like this faces is the read-your-writes consistency issue, right? Whether a client gets to see its own writes. I was trying to get at whether that is something you handle, or don't worry about.

So, I don't pretend to be an expert in this area, but I can tell you what rqlite does. In its default configuration it has two distinct SQLite connections, one for the executes and one for the queries. And we have the SQLite expert on the call to check me here: once the transaction is committed on the connection that changes the database, those changes will be immediately visible to any other connection. So if you make a change to the database through the Raft consensus system, once you get the ack back from the cluster, any other query should see that change from that point on. Does that help answer your question? That's the question I was getting at, yes.

So, do you expose BEGIN TRANSACTION and COMMIT? I don't. This is what I meant by the statement that rqlite does not yet support transactions the way another type of distributed database might. If you send a BEGIN and try to start managing the transaction yourself, it's not that it won't work. The issue is this, and it's a problem I would really like to solve while still keeping rqlite easy to operate: if you issue a BEGIN and then the system crashes, the cluster goes down, that BEGIN is in the Raft log. So when the whole system starts up fresh again, it's back in that transaction state, and I want to make sure users understand that. Unlike other databases, simply restarting an entire rqlite cluster would not clear the transaction, because it's in the Raft log.
Now, I have some ideas about how to make it easy to understand that this is happening, but fundamentally, rqlite doesn't roll back or abort a transaction if you restart the whole thing, because the BEGIN is in the Raft log already. That is why, if you do explicit BEGINs and COMMITs, behavior is undefined in the event of failure. If you did do a BEGIN, none of the changes would be visible until commit, or rather, whatever visibility SQLite offers is what you will see. The main thing to know is that the execute endpoint and the query endpoint have their own dedicated SQLite connections, so whatever the SQLite documentation says about the visibility of changes across connections applies to rqlite.

So here's the case I'm kind of curious about. You have a write statement that goes through, it passes, but because of partitioning, or whatever delays, there is a SQLite instance that doesn't get that write. It still has the old version, it's running behind, and now your read goes to that particular replica. That is absolutely true. Say, for example, there was some network delay: yes, there is a small window there whereby, if you're doing weak reads, you can't be sure that everything in the Raft log has quite been applied to your SQLite instance. And that's where strong read consistency comes in. When you issue a strong read, you can be sure that every statement you issued before it that changed the database will be reflected in the database when you read it. Got it, thanks. So strong consistency is what you choose when you want to be absolutely sure the database is at a certain point before you do the read. But that has a cost: the read will now involve the leader contacting a quorum of nodes before it performs the query for you, to make sure every node has applied the changes. There's nothing for nothing.

Amol didn't say this, but he's a distinguished database professor at the University of Maryland, College Park. Okay, good. So, yes, I have documentation about this in the GitHub repo, and I'm obviously happy to change it if it's not clear, but there are various read-consistency levels. But in practice, your reads are then sharing the same Raft consensus system with the inserts, so your read performance will take a hit too if you're doing a lot of inserts at the same time. Yes, it will take a hit too. Thank you.

All right, other questions or comments? Yeah, I've got a question. My name's Nate and I'm just a rando off the internet, very excited to be here. Great talk. I actually work on a time-series database at Bloomberg, but I'm new to that team, so I'm learning everything I can about databases. Anyway, my question is: early in the talk, Philip, you showed a diagram with a leader and two followers. I didn't note down which number the slide was. Yeah, I know which one. This one, right? My question is: when do the followers communicate directly with each other? I didn't hear anything about that. Actually, they don't. Do you know what, looking at this slide, it's a little bit deceiving. I'm just representing that there are network connections between them. They only communicate directly with each other in the event that they don't hear from the leader anymore, and then they start talking to each other? Exactly. That more just represents that they all have open network connections to each other,
All right, other questions or comments? Yeah, I've got a question. My name's Nate and I'm just a rando off the internet. Very excited to be here. Great talk. I actually work on a time series database at Bloomberg, but I'm new to that team, so I'm learning everything I can about databases. Anyway, my question is: early in the talk, Philip, you showed a diagram with a leader and two followers. I didn't note down which slide number it was. Yeah, I know which one. This one, right? My question is, when do the followers communicate directly with each other? I didn't hear anything about that. Actually, they don't. Do you know what, looking at this slide, it's a little bit deceiving. I'm just representing that there are network connections between them. They will only communicate directly with each other in the event that they don't hear from the leader anymore, and then they will start talking to each other. Exactly. That more just represents that they all have open network connections to each other, ready for use should the leader suddenly stop heartbeating to them. So in this example, if the leader suddenly died, the two followers would see that, because they'd lose network connectivity, and then those two followers would start what is called a leader election, which will result in one of them becoming the leader. Thank you.

Another rando from the Internet, got another question for you. First of all, great presentation, thank you very much. When you were talking about performance and insert speed, I noticed that all of your numbers were for the three-node cluster case. Can you share anything about what happens as you scale up to five and seven nodes? I didn't do any testing of that. I don't believe it changes a huge amount, and let me tell you why. The leader communicates with the nodes in parallel. It doesn't go to the first follower, then the second follower, then the third follower; it does it all in parallel, and I know this from practice. So I have not run the numbers, I don't have them at hand, because it's a certain amount of work to spin the clusters up, but I would not expect a big fall-off as you add five and seven nodes and so forth. However, as you add more nodes there's always a chance that you're going to have some slower node out there. The responsiveness is bounded by the slowest node of the quorum, and with more nodes you're more likely to have a slow node, one whose disk happens to be heavily loaded at the time, and so forth. But I would not expect it to take a huge hit in theory, because the leader contacts the followers in parallel. In the case of a five-node cluster, the moment the leader hears back from two other nodes, it itself counts as one, plus those two other nodes, and it can commit the entry to the Raft log. So it doesn't have to wait until it and all four other followers return. Does that make sense, Joan? Yeah, thank you. That makes sense. So it's not the slowest node, it's the middle node, the speed of that node. It's the slowest member of the quorum that responds, which probably won't be the slowest node, because in the case of a five-node cluster it'll be the leader, which is very fast because it's local, plus the next two. Exactly, you're right actually, the middle node, very good. So it'll be the three fastest nodes; as soon as they respond, the entry can be committed.
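To make that quorum arithmetic concrete, here is a small model in Go. The function and names are mine for illustration, not RQ Lite's; it just captures the claim that commit latency tracks the slowest member of the quorum, not the slowest node in the cluster.

```go
// Sketch: modelling why commit latency is bounded by the slowest member
// of the quorum rather than the slowest node. Names are illustrative.
package main

import (
	"fmt"
	"sort"
	"time"
)

// commitLatency estimates how long the leader waits to commit an entry,
// given round-trip latencies to each follower. The leader counts itself
// toward the quorum, so it needs acks from (quorum - 1) followers and
// commits as soon as that many of the fastest have responded.
func commitLatency(followerRTTs []time.Duration) time.Duration {
	n := len(followerRTTs) + 1 // followers plus the leader itself
	quorum := n/2 + 1
	sorted := append([]time.Duration(nil), followerRTTs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// The (quorum-1)-th fastest follower ack completes the quorum.
	return sorted[quorum-2]
}

func main() {
	// Five-node cluster: leader plus four followers, one of them slow.
	rtts := []time.Duration{
		2 * time.Millisecond,
		3 * time.Millisecond,
		5 * time.Millisecond,
		250 * time.Millisecond, // a heavily loaded straggler
	}
	// Quorum is 3 (leader + 2 followers), so the straggler doesn't matter.
	fmt.Println("commit latency ~", commitLatency(rtts)) // ~3ms
}
```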
Okay, so I have one quick question before we go. How would you compare RQ Lite with DQ Lite, and then there's also Bedrock from the Expensify folks, which is also built on SQLite? So, I am not really familiar with Bedrock, though I have heard about it, so I can't really comment on it, but I can comment on DQ Lite, and there's a little bit of backstory there. RQ Lite existed before DQ Lite, and one day I got an email from Mark Shuttleworth, who was, or is, the CEO of Canonical, and I was really amazed by this, because I knew that Mark Shuttleworth had been to space once. They actually had a problem where they needed transaction support, which I didn't have at the time. RQ Lite was also a lot less mature, so they went and built DQ Lite, and it's a good contrast. DQ Lite is different in a couple of ways. It's written in C. They wrote their own Raft consensus system in C. It operates at a very different level. RQ Lite is a fully fledged, if somewhat simple, complete system: you don't need to do anything, it's right there, you spin it up and you've got a database. DQ Lite is a library that you link in with your code, so you have to be a programmer to use it. However, the advantage there is that, at least as I understand it, it is a drop-in replacement for SQLite, because they present the same programming interface. The network layer still happens; you have to configure their library code so that it sets up the network connections between the nodes. But it's not a database management system, it's a library. So in that sense, it depends what you want. If you're a programmer and you want to build a very custom solution and you want something similar to this, DQ Lite actually might be a better solution for you. You do have to use a patched version of SQLite; I don't think the changes they have made are that huge, but you do have to use a patched version to make it work. As far as I know, what they do is replicate write-ahead log frames, so they do lower-level replication, whereas I do statement-based replication. But fundamentally it's the same concept: similar systems, just replicating at very different layers of the stack.

Awesome. So again, Richard, do you have any final comments, or are you just going to give it your blessing? Oh, I'll just give it my blessing. That was a great talk, Phil. And Phil, we need to get together later and talk about, you know, what APIs I can give you to make RQ Lite better. Oh, I'd love that, I would. Thank you very much. I'll drop you an email. Can I find your contact details somewhere? Or, why don't you drop me an email? It's on the slides here. My website is on the slides, and it'd be great to hear from you.
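As a final illustration of the distinction described above between RQ Lite's statement-based replication and DQ Lite's write-ahead-log frame replication, here is a hedged sketch in Go. The types are invented purely for illustration; neither project defines them this way.

```go
// Sketch contrasting the two replication granularities discussed above.
// These types are hypothetical, defined only to show the layer difference.
package main

import "fmt"

// StatementEntry is what a statement-based replicator (RQ Lite's approach)
// might put in the Raft log: the SQL text itself, replayed on every node.
type StatementEntry struct {
	SQL  string
	Args []interface{}
}

// FrameEntry is what a WAL-frame replicator (DQ Lite's approach, as
// described above) might put in the log: a raw page image from SQLite's
// write-ahead log, applied below the SQL layer.
type FrameEntry struct {
	PageNumber uint32
	Data       []byte // one page-sized WAL frame
}

func main() {
	stmt := StatementEntry{SQL: "INSERT INTO foo(name) VALUES(?)", Args: []interface{}{"fiona"}}
	frame := FrameEntry{PageNumber: 3, Data: make([]byte, 4096)}
	fmt.Printf("statement-based log entry: %+v\n", stmt)
	fmt.Printf("frame-based log entry: page %d, %d bytes\n", frame.PageNumber, len(frame.Data))
}
```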