Databases — a database seminar series at Carnegie Mellon University — is recorded in front of a live studio audience. Funding for this program is made possible by OtterTune. Hi guys, welcome to another talk in the database seminar series. We're excited to have Ben Johnson from Fly.io. He is the author of Litestream, which he's going to talk about today. Prior to this, Ben also built another popular database called BoltDB, a key-value store written in Go which, even though he's since discontinued it, has taken off on its own. So we're super happy to have him give a talk today about Litestream and SQLite. If you have any questions for Ben as he gives the talk, please unmute yourself, say who you are, and ask your question anytime. We want this to be a conversation for him and not him talking to Zoom by himself for an hour. So Ben, thank you so much for being here. The floor is yours. Go for it.

Awesome. Yeah, thank you so much for putting this on. These talks are great, so I really appreciate coming on here. So yeah, again, my name is Ben Johnson, I'm at Fly.io. I wrote a tool called Litestream a couple years ago, and it's centered around SQLite and making it available in a lot more use cases. Specifically, I want to push this idea of a single-node deployment. When I started in the industry maybe 20 years ago, running on a single server was not a crazy thing, whereas these days it seems like everybody has 20 nodes in their deployment for a blog. So I'm trying to make this simplified single-node deployment kind of cool again. That's my hope.

If you've watched any of the other talks in this series, you know they're fascinating and there's a lot of really awesome, cool stuff people present: high performance, people pushing millions of queries per second through databases; scalability, where you might have a thousand-node cluster; or fancy data structures like CRDTs. And this talk is going to have absolutely none of that, unfortunately. But there are some other interesting things going on in the industry that are worth looking at that aren't centered around that stuff, so I hope it gives another perspective.

So first of all, Litestream: what is it? It's streaming replication for SQLite. It'll take your SQLite database and continuously back it up to Amazon S3, or to the durable object storage from Azure or Google; you can also go to a network disk, or to SFTP. You're basically just continuously streaming your SQLite data out to another location for durability. And that high durability really gives you the ability to safely run a single node and minimize the window of data loss, even in the worst-case scenario where the node catastrophically fails.

If you're not familiar with SQLite, it's a great little tool. Its motto is "small, fast, reliable: choose any three." It's an embedded SQL library, and a pretty full-featured one: it's got CTEs, you can do all kinds of joins, and there's a ton of stuff you can do in it, honestly, for such a small little library. And the databases are really just individual files on disk, so they're pretty simple to reason about.
If you want to send someone the database, you don't have to run any kind of special fancy commands — you can just send them the file and they have the database. Contrast that with something like Postgres or MySQL, where you have a network communication layer and have to go over to another machine and talk to it. With SQLite, there's none of that network communication. You don't even have IPC, really, except for some locks. You can do multi-process, but generally it's a single process, and it's really just read and write calls to a file system, essentially.

As for the "small" part of SQLite: it's about 130,000 lines of code for the core database. There are other things you can add on as extensions — typically built in, actually — like full-text search and some JSON support. And that core database really does quite a bit: all the transaction layers, paging data in and out. It's pretty fascinating what they've done. Compare that to something like Postgres, which is millions of lines of code: it has a ton more features, some more esoteric, and it's more robust in places, but SQLite gives you a lot in that little package.

And on top of that small package, in addition to having all the features a lot of applications use, they have this massive test suite — this is the "reliable" part. It's about a million lines of code of tests, and those tests are parameterized, so each one is running multiple tests within it. I think it takes days to run when they do a release. So it's pretty extensive.

Finally, SQLite is considered the most widely deployed database in the world. It runs on your phone, on your laptop, in appliances, on airplanes. It runs on all kinds of weird, obscure platforms — there's one, VxWorks, that I've never used, but apparently it runs there. You can run it on ARM, on AMD64, on x86; the sky's the limit, really. It's a great tool to have around if you need a database.

Before we get to the actual Litestream part, it's really good to understand how SQLite is structured internally, and it has a really simple layout, so this shouldn't take too long. The physical layout, again, is just a file on disk, and the file is separated into fixed-size pages — by default four kilobytes, though you can change that anywhere from 512 bytes up to 64 kilobytes. Each of these pages is just laid out sequentially: page one, page two, page three. Logically, though, they're structured as a B+ tree. In a B+ tree, if you remember, all the actual data is stored in the leaf nodes at the bottom — so, say, pages two and five would hold your actual leaf data — and the structure above is branch nodes pointing down; it's a sorted tree, essentially. This layout works great for read-optimized workloads, which is what SQLite is used for a lot. If you look at something like an LSM tree — if you're familiar with those, like LevelDB — that's a more write-optimized layout. So yeah, a really simple layout.
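To make that physical layout concrete, here's a minimal Go sketch — mine, not from the talk, with a hypothetical file name — that reads the page size out of the fixed 100-byte database header and then fetches a page by pure offset arithmetic:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("app.db") // hypothetical database path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// The first 100 bytes of a SQLite database file are a fixed header.
	hdr := make([]byte, 100)
	if _, err := io.ReadFull(f, hdr); err != nil {
		panic(err)
	}
	fmt.Printf("magic: %q\n", hdr[:16]) // "SQLite format 3\x00"

	// Bytes 16-17 hold the page size, big-endian; the value 1 means 65536.
	pageSize := int64(binary.BigEndian.Uint16(hdr[16:18]))
	if pageSize == 1 {
		pageSize = 65536
	}

	// Pages are numbered from 1 and laid out back to back, so seeking to
	// page N is pure arithmetic: offset = (N-1) * pageSize.
	const pgno = 3
	page := make([]byte, pageSize)
	if _, err := f.ReadAt(page, (pgno-1)*pageSize); err != nil {
		panic(err)
	}
	fmt.Printf("read page %d (%d bytes)\n", pgno, pageSize)
}
```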
Litestream is a physical replication layer, so we really just care about pages. We actually don't even care about your underlying rows and schemas and stuff like that; we're just copying out pages.

Another important thing to understand about SQLite is where it came from, its use cases, and why those are evolving — it's good to understand some history. SQLite originally started in 2000, and for the first 10 years it had just one transaction mechanism, called the rollback journal. It works pretty simply: if you want to make a change to the database transactionally, you first copy the old version of each page out to a file called the rollback journal, and then you make your changes inside the main database. This is really simple — if anything fails, you can always copy the pages back from the rollback journal into the main database and you get your state from before the transaction. The problem is you get really bad concurrency: if a writer is mucking around in the pages, you can't have other readers reading those pages at the same time, so you have either a single writer or multiple readers. That really limits it to single-user software. Think of the Messages app on macOS, which stores its data in SQLite locally — that's really single-user; you're the only person using it on your laptop at a time, so you don't need that concurrency. Or phones, for example — it's just you using it.

What I'd call modern SQLite is maybe the last 10 years, when the write-ahead log was added, and this massively increased read concurrency. The way the write-ahead log works — and please, again, stop me if you have any questions — is basically the opposite of the rollback journal. Instead of copying the old version of a page to another file, you write the new version of the page to another file called the write-ahead log, and it's an append-only log: every time you change a page, you tack the new version onto the end of that log. This gives you better read concurrency, because you can write and, at the same time, have readers reading old versions of those pages, since multiple versions exist. You get better performance from the append-only writes; you have a single writer and multiple readers all at the same time; and each reader gets its own snapshot as of the point in time when it started. It's super flexible and easy to use. It's serializable isolation, if you're familiar with isolation levels; in terms of distributed consistency models it's really what would be considered strict serializability, where you have a total order over the database as a whole, plus real-time constraints on when readers see transactions once they're applied.

So all in all, even though it's just a little library running locally, this can handle a ton of concurrency. It can usually handle small to mid-size web applications doing on the order of hundreds of requests per second. It's just great for those — which is honestly probably, like, 99% of applications out there.
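As a quick aside before moving on: turning WAL mode on is a single pragma, and the setting persists in the database file. A minimal sketch in Go, assuming the common mattn/go-sqlite3 driver and a hypothetical path:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/mattn/go-sqlite3" // one common CGo SQLite driver; others work too
)

func main() {
	db, err := sql.Open("sqlite3", "app.db") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// journal_mode=WAL returns the resulting mode as a single row.
	// The setting is stored in the database file, so it sticks across opens.
	var mode string
	if err := db.QueryRow("PRAGMA journal_mode=WAL;").Scan(&mode); err != nil {
		panic(err)
	}
	fmt.Println("journal mode:", mode) // prints "wal" on success
}
```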
For all of the Facebooks in the world, there are so many more people who have an app that gets, you know, 20 hits an hour or something like that. That's more of what we're focusing on with this.

Finally, in addition to the improvements in SQLite itself and its concurrency, the cost of computing has just plummeted over the last 20 years. I tried to pull a little data on this. Back in 2000, you might have a server with a 32-bit CPU at maybe a gigahertz — might be a 700-megahertz processor — single core; we're not even talking multi-core. You might have half a gigabyte or a gigabyte of RAM, and you're running spinning disks getting maybe 75 megabytes per second. And you're probably paying ten grand for that server — it's really expensive. So the idea with the n-tier architecture is that you get this really beefy, optimized database and put it on the most expensive server you can afford, and then you have a bunch of smaller, cheaper application servers that offload the HTML rendering and some of the business logic from the database server.

Over the course of 20 years, though — I just pulled up Dell's website, and their baseline now seems to be something like a six-core, three-gigahertz, 64-bit processor. You're getting gigabytes and gigabytes of RAM; on AWS you can find machines with a terabyte of RAM, which is just mind-blowing, honestly. And you have SSDs — NVMe drives — getting gigabytes per second of throughput. We're in a completely different world from two decades ago, and that server only costs you maybe a thousand or two; it's relatively cheap. And then the other alternative that's popped up over the last five or ten years is virtual private servers, where you can get a machine beefier than what $10,000 bought you in 2000 and rent it for five bucks a month — DigitalOcean does that; I work at Fly, we do that kind of stuff as well. It's amazing how much computing power we have now, what we can do with it, and how it affects our tools.

Now that we have faster computers, the database isn't necessarily the bottleneck for performance anymore. As cool as it is to have a million transactions per second, most applications don't need that. Really, we're getting to the point where the limiting factor is the speed of light. So instead of making your machine or your code faster, you need to get your data closer to your users — or, if you want to process the data, closer to your application, not in another database on another machine. For example, if you have a Postgres server your application connects to, even running within the same AWS region you can see up to a millisecond of latency per query — that's not even query processing time, it's simply getting to the database server and back. Whereas with SQLite there are really no layers between your application and the data, so the overhead is about 10 microseconds per query.
So there's just orders of magnitude of difference there. And if you're doing a point query — looking up by primary key — or a simple range query, the actual request processing time might be in the tens of microseconds. It's amazing how much of your time budget is spent just reaching the database and coming back.

Some other important features as well: operational simplicity. I used to be a DBA in a past life, and the manuals for Postgres or Oracle are thick — massively thick. There's a lot to know about those; they're really complicated. A lot of times you have to set up users and roles and grants, open up ports — there's a lot of stuff that's just kind of a pain. If you can remove all that and just have a simple storage primitive like SQLite, that can be great. Another thing is speed of development. With something like Postgres you have an issue called N+1 queries, where one query gets 20 rows back but then you need to make another query for each of those rows, and over a network that can really be a performance hit; with SQLite, per-query overhead is so small that the pattern mostly stops mattering. And the last thing people want is safety: you want your data to just keep existing and not get destroyed if, say, you have a fire in your data center. That's where SQLite falls over — not that it doesn't store data correctly or that it corrupts things, but if your machine dies, it's a single-node database. And that's where Litestream comes in. It's meant to be the missing piece, the reason a lot of people couldn't use SQLite for their applications: it just wasn't safe enough.

Okay, any questions so far? I'll take a quick drink — I'm talking too fast. What's that? Sorry, did someone say something? — I think Andy said "I think we're good," which is what I was saying. — Yep. Okay, cool.

So, let's jump into actual Litestream — how it works and what it does. SQLite's WAL, the write-ahead log, is the only mode Litestream will work with, and once you see a bit of how it works, you'll understand why. SQLite's WAL journaling mode writes the new pages of a transaction to a separate file: if your database is called db, then your WAL file will be db-wal. For example, say our database file has four pages inside it, and right now there's just the database file. If we start a read transaction, we're just reading pages straight from the database file. If we then create a write transaction, update pages one and three, and commit, a read transaction that started before that write still reads only from the database file, but one that starts right after it reads pages one and three from the WAL, and pages two and four from the database file. So each reader gets its own snapshot — the version of the database as of the moment its read transaction started. Then say another write transaction updates page two: readers that start after it read page two from the WAL; readers that started before it read page two from the database. Again, each transaction gets its snapshot. And finally, another transaction can even update pages that are already in the WAL.
Depending on when your read transaction starts, it reads whichever version of each page was current just before it started. So that's the basics of the WAL.

One issue you might notice here is that the WAL just keeps growing, and that's a problem — I mean, disk is cheap, but it's not that cheap. So SQLite does a process called checkpointing, where the last version of every page gets written back into the database file. It basically blocks new write transactions, waits until any existing transactions are done, does this copy real quick, and then processing resumes. So with SQLite it's usually best to have very short-lived transactions; it doesn't work very well if you hold transactions open for a long time. Essentially, this version of page one goes back to page one in the database, page two goes to page two, and this version of page three goes to page three. Once it's all in the database, we can truncate the WAL and it starts over.

I think we have a question from Krithi. Do you want to unmute? — Yes, hi. I'm Krithi, a master's student at the University of Waterloo. So with the WAL, every transaction is creating page snapshots and adding them to the WAL. My question is: how do scans work with this journaling strategy? — Sure. You can actually think about it without the WAL first. Looking at just the database, logically it's a B-tree structure, so if you wanted to scan — say over a primary key — you'd seek to the key by traversing the B-tree, then scan forward along the leaf pages to read the range. The only thing the transactions in the WAL do is provide updated versions of individual pages, so that newer transactions, instead of reading a given page from the database file, read whatever version is in the WAL. If you updated a name in a row, that might only touch one leaf page in your B-tree, and your scan just picks up that one page from the WAL instead of the database. — Thank you so much. — Good question. Any other questions?

All right, cool. Yeah, thanks for calling that out. So that's the checkpoint: we copy the last version of each page back into the database, delete the WAL, and it starts over from the beginning. We keep doing this over and over, and that's how the size of the WAL gets managed. By default the WAL grows to about four megabytes — roughly a thousand 4 KB pages — and then SQLite starts trying to push the data back into the database. You can adjust that, but that's where it kicks in.
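For reference, you can also trigger that checkpoint by hand rather than waiting for the auto-checkpoint threshold — a small Go sketch, under the same driver assumption as above:

```go
import "database/sql"

// checkpoint forces SQLite to copy the latest WAL frames back into the
// main database file and truncate the WAL, rather than waiting for the
// auto-checkpoint threshold (PRAGMA wal_autocheckpoint, ~1000 pages by default).
func checkpoint(db *sql.DB) error {
	// wal_checkpoint returns one row: busy (1 if readers blocked it),
	// the number of frames in the WAL, and the frames copied back.
	var busy, logFrames, ckptFrames int
	return db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE);").
		Scan(&busy, &logFrames, &ckptFrames)
}
```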
So we're getting these new generations of the WAL every time it truncates and starts over. What Litestream does, in the current version, is check every second for any changes to the WAL. If you do a write to the WAL, and maybe another write within that same second, Litestream checks in, says "oh hey, there are two new transactions," and copies them out to what it calls the shadow WAL, which is basically just a directory off to the side of the database. Then another transaction comes in on the WAL; Litestream detects that and copies it over too. The WAL has checksumming within it, so we can actually verify consistency as we go. Essentially we just keep copying to the shadow WAL, and if the WAL gets checkpointed and truncated, we can detect that, start a new shadow WAL generation, and keep copying. So we're basically replicating every version of the WAL into the shadow WAL, and the shadow WAL acts as a staging area for what we later push out to durable storage.

From the shadow WAL — where we have all these versions of SQLite's WAL — we just start pushing out to S3 or whatever storage you want, and once segments are safely there we delete them from the shadow WAL. It's really not meant to store a lot of data for long; it's very temporary. We're really separating two concerns here. On one side, we're trying to move stuff off of SQLite's WAL as quickly as possible, because we don't want the WAL to grow while a lot of transactions are going on. But on the S3 side, we might not want to push as often, because there's a cost every time we push to S3. So we can change that — push every second, or every 10 seconds — depending on what the end user actually needs and how much they want to spend.

— Hi, I have some questions, maybe related to that. I think you just answered the first, which is: how periodic is periodic? It seems that's up to the end user. — Yeah. By default it originally started at 10 seconds, but I actually dropped it down to one second. — Okay. And my second question is: when you're doing the copying to S3, are you batching things together, or copying as-is? — When it copies up to S3, it does it as WAL segments. Every second, when we copy out from SQLite's WAL, we take a chunk — a subset of the WAL — and push those chunks up. So it's not so much a batch; we'll concatenate some together, but we don't batch a whole lot. — Thank you. — Good question.

Another thing Litestream does is, by default, take a snapshot of the whole database every day and push it up to S3, separate from the WAL segments. That means that to restore, you simply pull down a snapshot and then replay every WAL file that occurred after it — basically running the checkpointing process for each WAL back into the snapshot — and the result is a byte-for-byte exact copy of your database at the point in time you replayed up to. So given the one-second delay going up to S3, you really only have a window of data loss of one second, which is pretty good for a lot of applications. It's not quite synchronous, but it's close.

— Yep. Or I could read it out. Sorry — when the WAL fragments get to S3, are pages entirely contained within them, or could you have a page split across two segments? — Sure. The segments that go up are contiguous, whole transactions, so you'll never have a transaction split across segments.
And all the pages are contained within those transactions, so you'll never have a partial page go up either — always full pages and full transactions, and possibly multiple transactions within a segment. — Yep. Good question.

So, long and short: what Litestream is doing is nothing super complicated. A lot of it is just copying chunks out, pushing chunks up, and pulling chunks down later. The goal was really simplicity — if something goes wrong, it's pretty easy to figure out what happened and debug it. There are always more complicated things we could do, but this actually seems to work for a lot of use cases out there, and people seem to have taken to it.

Any question from Krithi? — So I might be making some assumptions, which may be wrong, but my question is: the logical tree contains pointers — page addresses — right? So when these pages are moved to S3 and used as a backup, how is the SQLite database restored from it? — Sure. The page pointers within the B-tree, that's right — they're pointers to a page number, not to a specific version of a page. When you restore from S3, you're copying these pages back into the same positions. So if you backed up pages one and three for a transaction, when you bring them down and checkpoint them back into the database, you copy them to where pages one and three live in the database file. The file is positional: since pages are all fixed size, finding a page is just arithmetic — pages are numbered from one, so page three starts at the four-kilobyte page size times two. I don't know if that makes sense; I can try to explain further. — It may — I do get it, but I have questions; you can move forward for now. — Yeah, go for it.

So, continuing to the last piece of Litestream: all this data builds up over time. One nice thing about Litestream is that you can do point-in-time restores for any data you have — if you dropped the database by accident, or someone corrupted data in the database two days ago, you can go back and restore up to the point just before that corruption. But you don't want to hold your data forever, so there's retention enforcement: by default, every day it deletes anything more than a day old. It always keeps at least one snapshot around after deleting, so it should be safe. A lot of people will run, for example, seven days of retention with a snapshot every day. And the idea with snapshots is that the more often you take them, the fewer WAL segments you need to replay afterward, which speeds up your restore time. If you have a lot of writes and a lot of WAL files, it could take seconds or minutes to restore, because it has to download all these little files from S3 and replay them all. So more snapshots just means fewer WAL files to replay.

So that's the long and short of how Litestream works. I can jump in if you have questions.
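Before moving on, here's a tiny sketch of the positional restore described in that earlier answer — the types and names are invented for illustration, not Litestream's actual code:

```go
package main

import "os"

// Page is one backed-up page from a snapshot or WAL segment:
// a 1-based page number plus its contents.
type Page struct {
	Pgno uint32
	Data []byte // exactly pageSize bytes
}

// applyPages writes each page into the database file at its fixed
// position: pages are numbered from 1, so page N lives at (N-1)*pageSize.
// Replaying segments oldest-to-newest leaves the latest version in place.
func applyPages(path string, pageSize int64, pages []Page) error {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	for _, p := range pages {
		off := (int64(p.Pgno) - 1) * pageSize
		if _, err := f.WriteAt(p.Data, off); err != nil {
			return err
		}
	}
	return f.Sync()
}
```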
A big piece of Litestream — and honestly of most database software — is how it fails: making sure it fails well, and knowing what those failure modes look like. As I mentioned before, Litestream is an asynchronous replication tool. That means a transaction isn't necessarily replicated before the client gets its commit acknowledgement, so you have a data-loss window if Litestream can't communicate with S3, or just from that one second of delay between pushes. In a catastrophic failure, you can lose up to however many seconds your sync interval is. If you're doing a deploy and shutting down cleanly, though, it does make a best effort to push out any outstanding data before it exits.

On the restore side, we're not really doing any optimizations — it's meant to be a dead-simple tool where the WAL is pushed up as-is. Even if there are duplicate pages in the WAL segments, it doesn't try to merge or compact them. And again, as I mentioned, more WAL segments means a longer restore time. We do compress — LZ4 is what we use right now, and it seems to work pretty well; I think we're on the lowest compression setting, trading compression ratio for speed. We'll get to some of the costs that push us in that direction too. Yeah, I think that covers all of this — and yes, snapshots improve restore performance.

One quick comment or question from Jared. — Is that how I put it? Yeah, yeah, sorry, I can unmute. Ben, maybe for the benefit of others who are thinking about these things as well: how would you compare the use of virtual memory and memory-mapped files with WAL mode versus the rollback journal, and how those modes operate differently? The reason I ask is, if you're trying to map this to the browser or other places — since SQLite is so portable — you have places where it does not fit into the VFS architecture, right? — Sure. The mmap stuff, I can't remember exactly; I know there are some things you can turn on and off, and the rollback journal and the WAL can be separate from that — it tends to be its own concern apart from the journaling mode. I haven't done a whole lot with WASM and the browser. I've seen people do some cool stuff getting SQLite to run in the browser, and I think there's really interesting stuff in running WASM on the edge. Litestream itself doesn't lend itself well to that — actually, the next slide will tell you why. But there's some stuff we're working on for the future that we're hoping does work well there. We do want to run this on the edge for people running WASM — if you look at Deno, I don't know if you've heard of Deno, or others like Vercel, they run a lot of JavaScript code on the edge, and we think there's a place for SQLite there too, even though it doesn't seem like it naturally fits. I don't know if that answers your question, Jared.
No, no, that hit it exactly — literally down to the same people and the same problems we think about Litestream and other tools for. So yeah, that's great. — Yeah, cool.

So, another failure issue with Litestream is that you can't run multiple instances of it pointing at the same destination. That might seem very obvious, but there's a caveat people don't think about. Litestream is uploading snapshots, uploading WAL segments, and enforcing the retention policy; if you have two instances replicating to the same place, they run over each other and start deleting each other's data. Obviously you shouldn't run two instances at the same time, but modern deploys often do rolling deploys, where two instances of your application run at the same time for maybe a few seconds — and that can corrupt your backups. So this doesn't run on certain platforms, like Google Cloud Run, and even at Fly we have ephemeral instances it doesn't work very well on. You typically need persistent storage attached to your instance, and a guarantee of only one instance at a time.

Another issue is that restarting your instance incurs downtime. Litestream was built before this new ephemeral model — Kubernetes, and the kind of thing we do at Fly.io, where you spin up nodes, deploy new ones, the old ones disappear; everything comes and goes, and you don't have a single server you're deploying to. Litestream doesn't work great for that ephemeral stuff right now. It's really built for the older VPS model — DigitalOcean, for example — where you spin up a VPS, run it for a while, and a new deployment just means updating your code and restarting your server process, not replacing the entire box; these ephemeral models essentially restart the entire server. So that's a caveat with Litestream.

One thing that drives Litestream, and makes it kind of interesting, is the cost model. When people hear "continuous streaming backup to S3," that sounds really expensive, but it's shockingly cheap — and that's kind of the whole reason this thing exists. S3's whole model is that they want you to put your data up there and then have a bunch of people download those assets. So they don't charge an ingress fee to upload the data — that bandwidth is all free — but they charge you an arm and a leg to download. That's actually great for backups, because you basically only ever write your data to S3; if you have to download it, it's because your server crashed. They do charge for PUT requests, so every WAL segment we push up costs us about half a cent per thousand requests, which is super cheap. If we're replicating every second, and we actually have writes constantly for the entire month, the maximum you'll pay is about $13 a month.
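As a sanity check on that number (my arithmetic, not from the talk): one PUT per second for a 30-day month, at roughly \$0.005 per thousand PUT requests, comes to

$$\frac{86{,}400 \times 30}{1{,}000} \times \$0.005 \approx \$12.96 \text{ per month},$$

and at one PUT every 10 seconds it drops to about \$1.30.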
So that's really the high end. You can drop down to replicating every 10 seconds and you're at more like $1.30 a month for backups. And in practice, people aren't getting writes constantly every second — Michael Lynch wrote up his experience with Litestream a while back, and he pays three cents a month to back up his web app. It's pretty cheap. So it's a nice alternative for a lot of apps that don't need full synchronous replication and don't want to pay a lot of money — especially with the free tiers a lot of these platforms have; we have one at Fly, and a lot of places do — you can spin stuff up super cheap and run it there.

So that's Litestream. How it works overall is fairly straightforward, I think — I hope. It's physical replication, and we've learned a lot of lessons from it, so we're doing some future work now — not on Litestream itself, necessarily. Litestream runs as a separate process, and that's great because it stays an operational concern: you don't have to build it into your application, and you can run legacy applications without changing them at all — they just run alongside Litestream, and Litestream ships the data off automatically. The problem with all that is that a separate process doesn't have fine control over when transactions happen. And there are really two big features people wanted from Litestream. One is read replication: they want to replicate their data out to different parts of the world, or to different servers, for high availability and redundancy. The second is high availability itself: they want to do deploys without downtime, which is pretty critical for a lot of apps.

So we started a thing called LiteFS — "lite file system" — and that's a lot of what I work on these days. I'm not trying to squeeze in two presentations here, but it borrows a lot of ideas from Litestream, repackaged in a way that lets us distribute SQLite databases naturally. It's a FUSE-based file system — a user-space file system, not compiled into the kernel. There are all kinds of FUSE file systems out there, and they're kind of fun to make, actually, if you ever want to try writing one. Essentially we want a cluster of nodes that all share the same SQLite database or databases. We still keep the restriction of a single writer at a time, because that's how SQLite works — it has no concept of overlapping transactions or optimistic locking or anything like that — so we have a single writer elected via a Consul lease. And among the nice features we've added with LiteFS are point-in-time restores that you can do very quickly. So there's some interesting stuff we've pulled out of Litestream and made into this new system.

I'll touch on a couple of these points. People wonder: why FUSE specifically?
SQLite does have a virtual file system interface built into it, but that really requires people to load up an extension, and that's one extra step. It doesn't sound like much, but if you log into a server and need to query your data, you have to remember to load that extension every time you want to run queries. And we're moving into a world where developer experience has become a bigger thing — often more important than performance, and honestly sometimes even than cost. FUSE sits at the file system layer, and the file system is really how SQLite communicates, so we're just injecting ourselves in there. We get a lot of fine-grained control: if someone tries to write on a replica, we can prevent that; we can redirect writes; we can do all kinds of fancy stuff. It also gives us multi-process support, because we're at the file system layer rather than inside the process. This thing also needs to run a single API server per machine, and with a VFS embedded in each of multiple processes on a machine, you can't really do that easily — so that's why we use a FUSE file system.

People also wonder why we'd even bother clustering SQLite — that seems kind of crazy; why not just use Postgres? And again, I don't want to knock Postgres, I think Postgres is awesome, but it's very heavy compared to something like SQLite. We want to cluster because we want to increase availability, we want to do deploys without downtime, and we really want replicating to the edge to be not just possible but dead simple. We don't want people thinking, "oh, now I've got to read up on all this stuff and set all these things up" — we basically want people to flip a switch and, more or less, it just works. There are a few caveats, but it should be relatively simple, and we think we can do that.

The reason to replicate to the edge is that a target for a lot of web apps is 100-millisecond request time — 100 milliseconds is roughly the threshold of human perception for what feels snappy and real-time. For most of you at CMU, you're right by AWS's us-east-1 region, so I'm sure a lot of websites feel really snappy to you — but people around the world have a much different experience. In my head I keep these rules of thumb: going from the US to Europe is about 100 milliseconds of latency — no query processing, no request processing, just literally the speed of light from the US to Europe. Go from the US to Asia and you're looking at a quarter of a second, and things start to feel very sluggish once you add actual request processing on top. If we can push out servers and make it cheap for people to run their data in a lot of places, the web will just feel a lot better for a lot more people in the world. So yeah: latency to the edge sucks, and we just want to improve that.

Another piece people are curious about is why we chose to do leases with Consul.
There are a lot of different ways you can do this. There's Raft, which has been really popular over the last 10 years. We wanted to make it really simple. I've written a Raft library — the original Raft library in etcd, I wrote it; it wasn't great, so I'm not bragging — they actually took it out since. But I've had experience with Raft, and it's a great protocol if you really need strict distributed consensus guarantees. A lot of applications, though, can use looser guarantees, and what people really want is operational simplicity — loose membership that fits this ephemeral world of Kubernetes and things spinning up and down all the time.

So we use what I think is a somewhat novel way to do this. We have a transaction ID that's monotonically incrementing — for the most part — and with each transaction ID we associate a checksum of the data in the database: the actual full state of the database, checksummed on every transaction. That sounds expensive, but we do it very incrementally: when a new page comes in, we compute a CRC-64 of the page number and its data, and XOR that into the checksum. When we update a page, we take out the old checksum for that page and add in the new one. So we always have a rolling checksum giving us the exact state of the database — thanks to Andy, actually; I didn't know the name for this construction and had to ping him and ask — and we can do constant integrity checking between all the nodes.

There are failure cases with asynchronous replication where a primary accepts a transaction but fails before pushing it out to the replicas, loses leadership, and another node comes up — now you have divergence between nodes. This checksum ensures that when a node rejoins under another node, it can see that it has diverged and reset its state to the new primary. You'll lose that data, but at least you won't corrupt your database — which is a win in my book.

Any questions so far? And lastly — what's that? — Raphael has a question. — How big can the database be for this replication to work well, if you want to replicate your database to edge nodes around the world? — Sure. We typically target around one-to-ten-gigabyte databases. Beyond that — especially since SQLite has a single writer, so you start limiting actual write throughput — I find it usually makes more sense to start sharding out your database. The nice thing about SQLite, again, is that there's not really much overhead to having an additional database, so sometimes people build models with a database per customer. As long as you don't have a million customers — if it's more like a thousand — you can get some great tenant isolation from having separate databases, and those tend to work better. I wouldn't recommend this for a terabyte database by any means; that's going to hurt. Good question, though. Any other questions?
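Here's a minimal sketch of that rolling-XOR checksum idea — names invented, and surely simpler than LiteFS's real implementation, but it shows why updates are cheap: XOR is its own inverse, so replacing a page's contribution is just two XORs:

```go
package main

import (
	"encoding/binary"
	"hash/crc64"
)

// CRC-64 table; the ISO polynomial is an arbitrary choice for this sketch.
var table = crc64.MakeTable(crc64.ISO)

// pageCRC hashes a page number together with its contents, so identical
// data on different pages still yields different CRCs.
func pageCRC(pgno uint32, data []byte) uint64 {
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], pgno)
	h := crc64.New(table)
	h.Write(buf[:])
	h.Write(data)
	return h.Sum64()
}

// DBChecksum maintains a rolling whole-database checksum: the XOR of
// every page's CRC. Updating a page removes the old CRC and mixes in
// the new one, so the cost is independent of database size.
type DBChecksum struct {
	sum   uint64
	pages map[uint32]uint64 // pgno -> CRC we last mixed in
}

func NewDBChecksum() *DBChecksum {
	return &DBChecksum{pages: make(map[uint32]uint64)}
}

func (c *DBChecksum) UpdatePage(pgno uint32, data []byte) {
	if old, ok := c.pages[pgno]; ok {
		c.sum ^= old // remove the previous version's contribution
	}
	crc := pageCRC(pgno, data)
	c.sum ^= crc
	c.pages[pgno] = crc
}

// Sum is the checksum of the full database state after the last update.
func (c *DBChecksum) Sum() uint64 { return c.sum }
```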
So yeah, point-in-time restores. Whereas before, with Litestream, we did raw WAL-file replication, here we introduced a new file format that's similar to the WAL but lets us do a couple of interesting things. These replication files — they're called LTX files — support streaming merges of multiple files, taking the latest version of each page, so you can roll them up into higher-level files that are faster to restore from later. You get these nearly instant restores for your database, which is nice for recovery, but it's also nice for deployment: if you have a staging environment and want a fresh staging area for each of your PRs, say, you could spin one up, do a point-in-time restore of your database, and just start testing things out against it. There's a lot of really interesting stuff you can do with that from a development standpoint.

Finally, we've got some other interesting pieces, like encryption, which we're going to be adding soon. If you push stuff up to S3, you don't necessarily want S3 to have your data, so if you can encrypt it transparently, that's great. We're currently looking at AES-GCM-SIV, which is kind of the hip new kid on the block for encryption, I guess, but it has some benefits around nonces that make it pretty attractive for us. Another thing with LTX is associated event data: a lot of times we're replicating out to edge nodes for real-time applications that need notifications of data changes — it's not enough for the data to just change underneath the application — so we're going to be attaching event data to transactions. So yeah, some interesting underlying file-format changes.

I think I'm getting up on time here, so I'll try to go through the rest fairly quickly. There are alternatives — I don't want people to think Litestream is the only way and the best thing ever; everything has trade-offs, so I just want to go through some of them. Regular backups, honestly, are probably what most people need: an hourly backup of your database, especially if it's a small database, really does wonders; it's really easy to do and really hard to mess up. You can also run them in addition to Litestream, at the same time, for some additional safety — we even have instructions for this on Litestream's website. There's Raft-based replication: tools like rqlite — I know you had Philip O'Toole on, I think last year, to talk about it; he's great. That gives you really strong consistency, but there are trade-offs: it's a bit more complicated to run and set up, and you can't do that loose membership with ephemeral nodes — every time you add or remove a node, there are some commands you need to run; it's a little more complicated, that's all. And there are VFS-based tools out there. There's one from Backtrace Labs whose name I butcher every single time — Verneuil? I don't know how you say it, but I think it's French.
There's another one called litereplica, which has an AGPL and a commercial license. I haven't used it before, but it's out there. These use SQLite's virtual file system, so they have a layer where they can intercept writes — but again, you have to load it up and actually compile it into your application, so it doesn't work for legacy applications. And then finally, Postgres and MySQL. I've done a bunch of apps with these, and they're great. It's that traditional n-tier architecture, and there's really nothing wrong with it. I think SQLite has its place, but I really don't want to knock these other databases. They're a little more complicated, but you get a lot of other interesting features — different concurrency models you can use, where not everything is serializable, so you can get some performance. And lastly: don't rewrite an existing app in SQLite just for fun because you saw this presentation. There's nothing wrong with legacy applications and legacy tools — keep on trucking if that's what you're doing.

Okay, the last points, just to cover everything real quick. Litestream provides disaster recovery for small-to-mid-size apps — that's really its focus, and it's going to stay its focus. We're building LiteFS to create this distributed SQLite ecosystem, and there's a lot of interesting stuff you can do around real-time replication and high availability. And again, there are the alternatives we covered — Raft-based, VFS-based, and of course Postgres and MySQL are still great. So yeah, that's my talk on Litestream today. Any questions?

— Thanks for the talk — virtual applause button. I have a question, but if anyone else has questions, they can go first. Can you hear me? — Yep. — So Litestream is kind of blindly copying pages — the write-ahead log pages — you're not actually parsing and interpreting them on the way out. Have you thought about what optimizations you could potentially do if you actually parsed the WAL? — Sure. So it does actually do some parsing of the WAL; I should have covered that. There's a WAL header on the whole file, and a frame header on each frame. We can delineate transactions by those frame headers — we can see which page is the last one in a transaction, and that's how we get transaction boundaries. As far as actually parsing the pages inside the WAL, there's some really interesting stuff you can do there; it just gets really complicated. There's something called CDC — change data capture — where, if you wanted a stream of all the data that changes in your database, on an insert or an update you could parse the page and technically detect that a row was updated, along with the version before and after the update. Almost like a trigger, but external to the database. There are tools like this for PostgreSQL and MySQL out there; I can't remember the name of them now — — It's Debezium; they gave a talk last semester. — Yeah, that's the one. A lot of times people will pipe those change data capture streams out to something like Kafka.
Then you can keep a record of all this stuff, and it becomes kind of an enterprise service bus — I don't know if that term is still used anymore; I feel like that's a Java term. So there's some really interesting stuff if you do actually parse the data. The problem you get into is that you really need to reconstruct the row — and if it had moved pages, actually find where it was in the previous page. So there's some complexity there, but it could definitely be a fun little exercise. — Awesome, thanks. I don't know how many people are actually streaming change data from their SQLite database out to Kafka; that seems a bit heavy-handed. But speaking of heavy-handed: do you have some sense of how much load running Litestream introduces onto the machine? I imagine it's not actually that much, but maybe you have some estimates. — Yeah, it's pretty minimal, I would say. I've never seen it take more than maybe 5% of the actual processing time. The vast majority is going to be I/O, and the pages were all recently used, so they're going to be in the page cache anyway — that's super fast. Beyond that, there's the LZ4 compression; I'd say that's the biggest CPU component. But actually reading through the WAL — honestly, you're basically reading the frame header, which is like 24 bytes, and then you have a 4K page that you're just kind of skipping over; you're copying it out, but you're not really analyzing it. So it's typically pretty minimal.
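To make that last answer concrete, here's a minimal Go sketch of walking a WAL file's frame headers, assuming the standard SQLite WAL layout — a 32-byte file header with the page size at bytes 8–11, then a 24-byte header per frame, where a nonzero "commit" field marks the last frame of a transaction. The path is hypothetical, and real code would also validate the salts and checksums:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

const (
	walHeaderSize   = 32 // fixed header at the start of the -wal file
	frameHeaderSize = 24 // header preceding every page in the WAL
)

func main() {
	f, err := os.Open("app.db-wal") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	hdr := make([]byte, walHeaderSize)
	if _, err := io.ReadFull(f, hdr); err != nil {
		panic(err)
	}
	pageSize := binary.BigEndian.Uint32(hdr[8:12]) // page size, big-endian

	frame := make([]byte, frameHeaderSize)
	page := make([]byte, pageSize)
	for n := 1; ; n++ {
		if _, err := io.ReadFull(f, frame); err != nil {
			break // clean EOF at a frame boundary: done
		}
		pgno := binary.BigEndian.Uint32(frame[0:4])
		commit := binary.BigEndian.Uint32(frame[4:8]) // nonzero only on commit frames
		if _, err := io.ReadFull(f, page); err != nil {
			break // torn tail; real code would verify frame checksums too
		}
		if commit != 0 {
			fmt.Printf("frame %d: page %d (commit; db now %d pages)\n", n, pgno, commit)
		} else {
			fmt.Printf("frame %d: page %d\n", n, pgno)
		}
	}
}
```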