Databases, a database seminar series at Carnegie Mellon University, is recorded in front of a live studio audience. Funding for this program is made possible by OtterTune and Google.

We're excited today to have Heikki, who is a co-founder of Neon and a longtime Postgres hacker and committer. He's here to talk about the new Neon database service that he started with Nikita, who is also a friend of the Carnegie Mellon database group and a co-founder of MemSQL, now SingleStore. As always, while Heikki is giving the talk, if you have any questions, please unmute yourself and say who you are. Feel free to do this at any time; that way he's not talking to himself for an hour on Zoom. And with that, Heikki, the floor is yours. Go for it. Thank you for being here.

Thank you, Andy. So my name is Heikki Linnakangas, and I've been a Postgres hacker and committer for many, many years, and now I'm a co-founder of Neon. So what is Neon? It's three things. First of all, it's a startup that we founded in 2021. From a technology point of view, it's a new storage system written for Postgres. It's open source, and it separates compute and storage; I'm going to go into more detail on that. And the third thing that Neon is, it's the name of the cloud service that we run. There is a public cloud service: you can go there, click a button, and get a Postgres database at neon.tech. But today I'm going to talk mostly about the storage engine, the storage system we wrote, and the cloud service, and how they work together.

In a nutshell, the core idea is that we separated storage and compute. One of the original ideas when we founded the company was that we wanted to build something like an open-source Amazon Aurora. Amazon Aurora was the first database that pioneered separating storage and compute for OLTP systems. Of course, many others are doing that now for OLTP, and for OLAP systems too; there's Snowflake and so forth. But we took that idea and ran with it.

It is a single-writer system, so there's only one primary node generating log and processing updates at any given time. We don't try to solve the multi-master problem; we don't try to solve conflicts across regions or anything like that. It's a single-writer system. You can have multiple readers, for high-availability purposes or for scaling out your read workload, but there's only a single writer.

We have a multi-tenant storage system. So we have one storage layer that we share across all of the customers, all of the databases. They're stored in the same storage system, and we combine the data from different databases in memory, we have a shared cache for them, and so forth. But everyone gets their own compute. So when we run Postgres for you, you get your own VM that it runs in. In the future, we would like to let you bring your own image; we haven't implemented this yet, but the idea is that you could run your own version of Postgres with whatever extensions you want, because it's a VM.

The unique features we have are copy-on-write branching and time-travel queries, and we make those very cheap.

Before we dive into the architecture: we love Postgres. I've been involved with the project for a long time. So when we started, very early on, the decision was that we would try hard not to modify Postgres where possible.
In fact, we would like to get all of the changes we have made into the upstream project; the goal is that you could run unmodified community Postgres on top of Neon storage. We're not quite there yet. We are currently carrying a patch that is about a thousand lines of code, but that's pretty small compared to the overall size of Postgres and the overall size of the storage system.

Can you quickly say what that patch actually contains?

Yeah, sure. There are a few bits and pieces here and there, but the main part is actually hooking into the storage system. Normally, when Postgres needs to read a page or write a page, we hook in there, pretty close to where the actual read or write system call happens. So when you read, we instead send a request to our storage. But there are a few changes here and there to support that. There are a few other places where Postgres relies on local storage, or on being able to read files in a way that bypasses those APIs, so we had to patch them. There are a few changes we had to make in the buffer cache to track the LSN numbers of evicted pages; I'll go into more detail on that. But it's all really related to the storage. What we have not changed is the planner, the executor. We have not changed anything with the index types. All of the Postgres index types just work, because they all work with these eight-kilobyte pages, like Postgres always does, and we hook in at that level: we deal with the pages and the write-ahead log; we don't know what's in those pages, and we don't pay attention to the content.

It also means that Postgres handles MVCC. In Postgres, MVCC is based on transaction IDs that are stored on the tuples, on individual rows, and again, we don't care about those; we don't pay attention to them. This also means that you still need to vacuum, as always with Postgres; we haven't addressed that problem.

So the high-level architecture of Neon is, again, separation of compute and storage. For us, the compute means Postgres: a VM that runs your Postgres instance. And the storage is our storage system, which we wrote from scratch. It's written in Rust. That was a fun learning experience for me personally; I hadn't done anything with Rust before, but it's been a good experience. It consists of page servers, safekeepers, and cloud storage, and I will go into more detail on those.

But first, the changes to Postgres. Postgres has support for streaming replication; that's what you normally use if you want to set up a replica, so that it streams the write-ahead log from the primary to the replica. We hook into that, and instead of streaming the log to a replica, we stream it to our storage system. And instead of reading a page from local disk, we changed that so that we send a request over the network to our page server to fetch a particular page. One interesting thing is that for us, the write operation, where normally Postgres would write a dirty page back to disk, is a no-op. We just throw the page away at that point. If the page is needed again, the WAL replay will be handled by the storage system, and we will get the page image from the page server after the WAL replay.
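To make that hook boundary concrete, here is a minimal Rust sketch of the idea as described: reads become network fetches of a page at an LSN, and writes of dirty pages become no-ops. All names and types here are illustrative assumptions, not Neon's actual interfaces (the real hook is a patch inside Postgres, in C).

```rust
// Illustrative sketch only: names and types are hypothetical, not Neon's real API.
// The idea: page reads become network requests to the page server at a specific
// LSN, and writes of dirty pages become no-ops.

/// A log sequence number: a position in the write-ahead log.
type Lsn = u64;

/// Identifies one 8 KB page: which relation it belongs to and its block number.
#[derive(Clone, Copy)]
struct PageId {
    relation: u32,
    block: u32,
}

const PAGE_SIZE: usize = 8192;

trait PageStore {
    /// Fetch the page image as of `lsn` from the page server over the network.
    fn read_page(&self, page: PageId, lsn: Lsn) -> std::io::Result<[u8; PAGE_SIZE]>;

    /// Evicting a dirty page is a no-op: the WAL already went to the
    /// safekeepers, so the page can simply be dropped and reconstructed later.
    fn write_page(&self, _page: PageId, _image: &[u8; PAGE_SIZE]) {
        // intentionally empty
    }
}
```

The point of the sketch is the asymmetry: durability comes entirely from the log stream, so the "write" side of the storage interface has nothing left to do.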
That also means that for Neon, we don't need to do checkpoints in the traditional way. Since writing pages is a no-op, there's no need to flush everything to disk in Neon. We still run checkpoints; there's still a function called checkpoint, and we let Postgres run it, because it does a lot of other stuff besides the core flushing of dirty pages to disk. But we don't need that flushing-dirty-pages part. So, all in all, in Postgres the local disk is only used for temporary files, for sorting and that kind of stuff, and when we restart Postgres, we can just wipe all of that away.

So what is the point of separating compute and storage? This is very popular nowadays; I said Amazon Aurora pioneered it, but there are so many other systems that do it, for different reasons. For us, the big reason was that it allows us to be serverless. What we mean by serverless is that we can very quickly shut down the Postgres instance and very quickly start it up again. When you connect to Neon, our current start-up time is about four seconds, and that includes launching the VM or Kubernetes container, setting up all the connections to the storage, and replying back to the client.

Another advantage of the separation of compute and storage is that the same storage can be shared by multiple read-only nodes. Normally you could scale Postgres by having multiple read-only nodes, but that would mean duplicating all of the data. With Neon, they can share the storage layer: all the read-only replicas can be pointed at the same storage, so you don't need to do that. And finally, you can scale these parts independently: you can add more read-only nodes, or change the size of your compute nodes, independently of the storage. And cloud storage is very cheap. Our storage system uses Amazon S3, or something compatible, as the final backing store, and that is very cheap, which is good.

And maybe you're getting into this, but when you boot up, do you provision anything in the buffer pool? What do you do to have a semi-warm cache, or do you have a complete cold start every single time?

Currently, nothing. It is a cold start. The storage system has its own cache, so the data will be in there if you just recently restarted, but we don't have anything special at the moment to warm the Postgres cache. We probably will need to do something like that; that is a known problem. There are extensions to... no, go ahead.

The storage server cache, though, that's the WAL, right? It's not materialized pages?

It's both. In the storage system we do cache the materialized pages too. There are Postgres extensions, like the pg_prewarm extension, which I think is actually included with Postgres itself, that you can use, and it does work with Neon as well. So if you want to, you could use that extension.

That blindly brings in all the pages, though, right?
You'd want something like what InnoDB or MySQL does, where you bring back into the buffer pool the pages that were in the buffer pool at the last shutdown.

Right. It doesn't do that. I think there is a function to remember what was in the buffer pool before and bring that back in, but we haven't integrated it in any way, so that would be up to you to set up.

Okay. Awesome, thanks.

So I'm going to walk through the write path. When you modify a page in Postgres, the change goes to the log, and we stream the log to what we call the safekeepers. We have three safekeeper nodes running at all times, and there's a consensus algorithm, based on Paxos, and the idea is that it keeps your recent transactions safe, keeps them durable. When you commit a transaction in Postgres, it goes to the log, and we wait until it has been acknowledged by a majority of the safekeeper nodes before we acknowledge the commit to the client. The safekeepers have local SSD drives, and we assume that we won't lose more than one at a time, from a durability point of view. But the safekeepers don't do much other than that; they're just there to make sure that we don't lose the recent transactions.

I'm going to make a little detour into the Postgres write-ahead log format, because this is pretty critical to all of the things we do. The Postgres write-ahead log is a very classical, ARIES-style implementation. It's physical: we don't store the insert or update statements. We store eight-kilobyte pages (that's the page size in Postgres) and modifications to those pages as WAL records. Postgres doesn't have an undo log, and that makes some things easier for us: we never need to worry about aborting transactions or anything like that. The log only moves forward in Postgres. And as I said earlier, Postgres handles MVCC at the higher level, so we don't need to worry about that either. That turns out to be a very nice property for what we are doing.

Continuing with the write path: the next step, after we have made this piece of log durable in the safekeepers, is to stream it to what we call the page servers. That's where we have most of the code, and most of the complexity and design problems. The page server is really the key of the storage system. The page server digests, processes, the incoming WAL into a format where we can quickly find all of the WAL records for a particular page, and then it writes it, in this preprocessed format, into immutable files on disk. I'm going to go into more detail on that exact file format; there are some interesting problems and ideas there. I actually hope that some of you will have ideas on how we could do it better, but I'm going to go through what we have. The basic idea is that we write the data into immutable files on disk, and we also upload them to cloud storage. We keep a copy in the page server too, for caching purposes, for fast access, but everything is also uploaded to the cloud storage. And now, if you look at the durability, what this means is that durability is handled by two components: the safekeepers for the recent log, and the cloud storage for all of the older data that has already been processed by the page server.
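As a concrete illustration of the commit rule mentioned above (a commit is acknowledged only once a majority of the three safekeepers has the WAL), here is a minimal Rust sketch; the function and types are hypothetical, not Neon's actual code.

```rust
// Hypothetical sketch of the commit rule: a transaction's WAL is durable once a
// majority of the three safekeepers has acknowledged it.

type Lsn = u64;

/// Given the highest LSN each safekeeper has acknowledged, return the LSN that
/// is durable: the largest LSN acknowledged by a majority of nodes.
fn committed_lsn(acked: &mut [Lsn]) -> Lsn {
    // Sort descending; with n nodes, the value at index n/2 (0-based) is the
    // highest LSN that at least a majority (n/2 + 1 nodes) has reached.
    acked.sort_unstable_by(|a, b| b.cmp(a));
    acked[acked.len() / 2]
}

fn main() {
    // Three safekeepers: two have the WAL up to LSN 150, one lags at 120.
    let mut acked = [150, 120, 150];
    // A commit at LSN <= 150 can be acknowledged to the client, even though
    // one safekeeper is behind: losing any single node cannot lose the commit.
    assert_eq!(committed_lsn(&mut acked), 150);
}
```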
The page server itself is not a critical component. If we lose the page servers, that's okay: we can just spin up a new one, and it will download the stuff it needs from the cloud storage. That will take a while, of course, if you have a lot of data, but you don't lose data. So the durability is handled by the safekeepers and the cloud storage.

Now, that was the write path, what happens when you make modifications. But the critical functionality of the page server is that, using this log it has stored, it can reconstruct any page in the system at any point in time. We keep all of the log, well, up to some retention period: disk space is cheap, but it's not infinitely cheap. This replaces your traditional base backup and WAL archive. We keep all of that in the page servers and the cloud storage, in a format that we can access randomly. So the page server can reconstruct any page at any LSN, at any point in time.

Now, when Postgres needs to read a page, it sends a request to the page server: get page number 100, or get page number 200, at this point in time. The page server will find the last image of that page, replay the log over that single page, and send back the replayed version of the page.

All right. And so the LSN is sufficient to know... Postgres says, I need page 123, and you would have an LSN of what you think should be in that page, and if the page server doesn't have that log record yet, it has to wait for the safekeeper to send it, or actually go request it?

That's a great question, or statement. Yes, that's exactly how it works. When Postgres requests a page at a particular LSN, if the page server doesn't have that LSN yet, it will wait. And that's how we make sure that we have read-after-write consistency: when Postgres modifies a page, the change goes to the log, and when you need to read the page back, we know that we have the most recent version of the page, or the version that you requested.

Okay. And, I don't know the full details of how Postgres does things inside: does Postgres maintain, for page 123, "I expect this LSN", or is that something you guys are adding?

That's something we had to add. Actually, it wasn't really needed for correctness: from a correctness point of view, the primary node could always request the latest LSN that it wrote, and that would be correct. But that would have a performance problem, because then pretty much every time you read anything, the page server would need to wait until it has received the latest log, and most of the time there were no changes to that page.

All right, so that's one of the patches you've added.

Yeah. We had to add this little cache that tracks the LSN numbers of pages that have been evicted from the buffer cache. I think we track the last thousand pages that have been evicted, in an LRU fashion, and what the LSN of each of those pages was. So if you request back one of those pages that was recently evicted, we will know what the last LSN that had changes on it was.

For those of you listening, that's the secret sauce right there. Awesome, thanks.

Yes.
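For the curious, that evicted-page LSN cache could look roughly like the following. This is a simplified Rust sketch with made-up names and capacity, not the actual patch (which lives inside Postgres's buffer cache code, in C).

```rust
// Sketch of the "last evicted LSN" cache described above: when Postgres evicts
// a page from its buffer cache, remember the LSN of the last change to it, so a
// later read can request that exact version instead of waiting for the newest
// WAL. Capacity and names are illustrative assumptions.

use std::collections::VecDeque;

type Lsn = u64;
type PageId = (u32, u32); // (relation id, block number)

struct EvictedLsnCache {
    entries: VecDeque<(PageId, Lsn)>, // most recent evictions at the back
    capacity: usize,
}

impl EvictedLsnCache {
    fn new(capacity: usize) -> Self {
        Self { entries: VecDeque::new(), capacity }
    }

    /// Record an eviction, dropping the oldest entry if the cache is full.
    fn on_evict(&mut self, page: PageId, last_change_lsn: Lsn) {
        self.entries.retain(|(p, _)| *p != page);
        if self.entries.len() == self.capacity {
            self.entries.pop_front();
        }
        self.entries.push_back((page, last_change_lsn));
    }

    /// If the page was recently evicted, the LSN to request it at.
    fn lookup(&self, page: PageId) -> Option<Lsn> {
        self.entries.iter().rev().find(|(p, _)| *p == page).map(|(_, l)| *l)
    }
}
```

A hit means the primary can ask for the page at the exact LSN of its last change, so the page server never has to wait for newer WAL that can't affect that page anyway.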
Putting it all together, finally, there is the control plane and the proxy. I mentioned that the storage is multi-tenant and that everyone gets their own compute node or VM, so we needed something that manages all of this. When the client connects, it first connects to a proxy. The proxy intercepts the connection and performs authentication; it all uses the same Postgres protocol, so it's transparent to the users. Then it asks the control plane: where is this Postgres instance running, where is this database running? And if it's not running (we actually shut these instances down after five minutes of inactivity), it will launch it, and then it will connect you to the instance.

Is that proxy, sorry, I'm asking a lot of questions, is that written from scratch, or did you take something like PgBouncer? PgCat didn't exist yet, I guess.

Yeah, we wrote it from scratch. We don't use the proxy for connection pooling; we only use it as a pass-through, just for this purpose of making sure that the instance is running.

Awesome, thanks.

The control plane also provides the web user interface and the user-facing APIs. So when you go to the website and click buttons and create databases, that's what handles it. It puts all of these pieces together; it tells the Postgres instance where the data is, and all of that.

All right. That was the overall architecture, but now I want to dig a little deeper into the storage engine: as Andy said, the secret sauce. How can we reconstruct any version of any page, how do we store that, and how do we make it perform?

As a preface, think of a traditional point-in-time recovery setup. Normally with Postgres, and many other databases work the same way, the typical scheme is that you take a backup, say every night or every week, and you also archive all of your logs: you stream your log files to cloud storage or some other durable storage, and that way you can recover to any point in time. The classic example is that you accidentally drop a table, and now you need to do point-in-time recovery to the point just before you dropped it. So what you do is take the last backup before that point in time, restore it, probably on a different server, and then replay all of the log up to that point. That works; that's how you normally do it. But it's a bit tricky, and people don't do it very often, because it's a pretty painful operation. One problem is that if you accidentally go one second too far, you have to start from scratch again, and that's kind of painful.

With that in mind, what we do in Neon is the same idea, but at page-level granularity. We keep images, like backups, of individual pages, and we store the WAL records for those individual pages. Now, when you request a page at a particular point in time, we find the last image of that page, we find all of the WAL records on top of that for that single page, and we replay them. But in order to make this actually perform: the write-ahead log is in the wrong format for this. It's written sequentially, so you can't easily find all of the WAL records for a single page. That is the magic that the page server does: it re-orders the log and builds a little index on it, so that we can quickly find the records for a particular page.
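The page-level replay just described, finding the last image at or before the requested LSN and applying that page's WAL records on top, can be sketched like this. The types are simplified stand-ins, not Neon's real data structures.

```rust
// Sketch of page reconstruction: to serve a page at some LSN, find the most
// recent materialized image at or before that LSN, then replay the WAL records
// for that one page on top of it. All types here are simplified stand-ins.

type Lsn = u64;

enum PageVersion {
    /// A full 8 KB materialized image of the page.
    Image(Vec<u8>),
    /// A WAL record describing a change relative to the previous version.
    Delta(Vec<u8>),
}

/// `history` holds every stored version of one page, ordered by LSN.
fn reconstruct(history: &[(Lsn, PageVersion)], at: Lsn) -> Option<Vec<u8>> {
    // Walk backwards from the requested LSN, collecting deltas until an image.
    let mut deltas = Vec::new();
    let mut base = None;
    for (lsn, version) in history.iter().rev() {
        if *lsn > at {
            continue; // newer than the requested point in time
        }
        match version {
            PageVersion::Image(img) => {
                base = Some(img.clone());
                break; // images are self-contained: stop the search here
            }
            PageVersion::Delta(rec) => deltas.push(rec),
        }
    }
    let mut page = base?;
    // Apply the collected WAL records oldest-first (stand-in for real WAL redo).
    for rec in deltas.iter().rev() {
        apply_wal_record(&mut page, rec);
    }
    Some(page)
}

fn apply_wal_record(_page: &mut Vec<u8>, _record: &[u8]) {
    // In the real system this is Postgres WAL redo applied to a single page.
}
```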
The other thing we need to deal with is that the Postgres log contains a mix of page images and WAL records. Postgres has its own reasons for this; sometimes it prefers to write a full image of a page. For example, if you're bulk loading data, it will typically create full images of those pages, or if you're building an index, or for many other reasons, it just copies a full copy of the page to the log. But most of the time it just includes these WAL records that store the delta between the old and the new version of the page. So in the Postgres log, you could have a million records on top of the last full image of a page, and that would mean that when someone requests that page, we would need to replay a million records on the fly. And that gets slow. So what we need to do is also pre-materialize and store some additional images of these pages, and much of the heuristics in the page server is about when to do that. When exactly do we need to do that, and when do we want to do that?

For one given page, do you only store the latest version in the page server, or can you materialize multiple ones?

We can materialize them all. And that's the key thing that I think makes this special: we can reconstruct any page version, up to the retention period. The idea is that this will replace your traditional backups and WAL archive. Instead of taking backups every day and keeping them for a month, you would just set the retention period to a month and let the page server keep all of that history, and still be able to reconstruct any version of those pages. That's the idea.

And on the page server, on the physical storage, are you doing any compression or deduplication of the pages?

Currently, no. That's something we definitely need to do; it's kind of an obvious optimization.

Yeah, thanks. I think someone raised their hand.

Hey, can you hear me? Okay. Yeah, go for it, Matt.

Does this mean... so if you don't set fillfactor, normally that's just an interaction with autovacuum, where updates could span multiple pages and you need autovacuum to come along and clean that up. But with this system, where you're keeping page versions, does that mean you just get a bunch of write amplification if you don't have your fillfactor knob set right and you have a bunch of updates?

Yeah, that's true. We haven't done anything magic, or anything special, about vacuum, so vacuum will create new versions of these pages. That's kind of okay; those WAL records are pretty small, they just contain the list of tuples that were removed from the pages. But yeah, we haven't done anything special about that; vacuum will create new versions of these pages.

Okay. So, whenever Postgres needs to read a page, it sends this request. And this is something that Andy already asked: what LSN number do we use in this request? In the primary, we track this last-evicted LSN. But the other case is a read-only node. If you want to do a time-travel query, for example, if you want to recover to the point just before you dropped a table, you just launch a new Postgres node, point it at the page server, and give it the LSN that you want to read the data as of.
In that case, Postgres will send all of its requests at that specific LSN, and you see the data as it was at that point in time. The third option is a read-only replica, a read-only node that follows the primary. In that case there's a cache invalidation problem: a read-only replica that follows the primary still needs to process the original write-ahead log from the primary, to figure out which pages are being modified, because it might have versions of those pages in its cache that it needs to throw away. But from the page server's point of view, it simply requests pages at an increasing LSN as it tracks the primary.

So let's move on to the storage engine. How does it work; what is the internal on-disk file format that we use, and so forth? First of all, we model this as a key-value store. We store pages and these LSN numbers, but internally we model it as a key-value store where the key is the block number and the relation ID (which table does this block belong to), plus the LSN. So it's kind of a two-dimensional key-value store, and that is a bit special: I didn't find a good existing implementation that has this concept of a two-dimensional key in that sense. The value is the eight-kilobyte page, or the WAL record needed to reconstruct it from the previous page version. We also store some other metadata keys that track, for example, the relation size: when a table is extended, we keep another key-value pair that just tracks the size of the table, and there are a few other things like that.

Sorry, what's special about the multi-dimensional key?

Well, the LSN dimension is kind of special. First of all, we do range queries on it frequently: when you request a page at a particular LSN, what we actually need to find is the last version of that page before that LSN. It's not a point lookup. Of course, there are systems that can do range queries; but the other point is that the LSN keeps incrementing as we ingest new WAL, and we don't replace the old values with the new ones. We just keep adding to it, and we keep the history too.

Okay. So we took inspiration from log-structured merge trees. They have a lot of the characteristics that we need. One is that we want to use cloud storage, and log-structured merge trees have this nice property that they are based on immutable files: you never update a file in place, you just accumulate data into new files, and sometimes you run background operations to rewrite them, but you never update them in place. That was really important for being able to use cloud storage, because cloud storage really works by uploading and downloading whole files; it doesn't make it easy to update individual records or pages in place.

So we looked at log-structured merge trees, and what we do is a similar idea. We buffer the incoming log in memory first, and we index it there in a little B-tree, just to hold it. And when roughly one gigabyte of log has accumulated in memory, it's written out to what we call a new layer file; in log-structured merge tree terms, this would be an SSTable. And, yeah, we never modify these files in place. Then we start accumulating the next batch, and we write that out.
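A rough Rust sketch of that ingest loop follows, with the two-dimensional (key, LSN) ordering and the roughly-one-gigabyte flush threshold. Names, types, and sizes are illustrative only, not Neon's actual code.

```rust
// Sketch of the ingest path just described: buffer incoming WAL in memory,
// indexed by the two-dimensional (key, LSN) pair, and when roughly a gigabyte
// has accumulated, write it out as an immutable layer file (an "SSTable" in
// LSM-tree terms).

use std::collections::BTreeMap;

type Lsn = u64;
type Key = (u32, u32); // (relation id, block number)

const LAYER_TARGET_BYTES: usize = 1 << 30; // ~1 GiB per layer file

struct OpenLayer {
    // An ordered map gives ordered iteration over (key, lsn), which is what
    // both range lookups and the eventual on-disk layout want.
    buffer: BTreeMap<(Key, Lsn), Vec<u8>>,
    bytes: usize,
}

impl OpenLayer {
    fn ingest(&mut self, key: Key, lsn: Lsn, record: Vec<u8>) {
        self.bytes += record.len();
        // New versions are only ever added at higher LSNs; nothing is
        // overwritten, because the history is kept.
        self.buffer.insert((key, lsn), record);
        if self.bytes >= LAYER_TARGET_BYTES {
            self.flush_to_immutable_layer();
        }
    }

    fn flush_to_immutable_layer(&mut self) {
        // Write `self.buffer` out as a new immutable delta layer file and
        // schedule its upload to cloud storage (elided), then start fresh.
        self.buffer.clear();
        self.bytes = 0;
    }
}
```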
So the obvious question is: why didn't we just use RocksDB, or some other existing LSM tree implementation? There is a lot of similarity here, and we thought about it. I actually looked around and googled a lot, trying to find something that would fit the bill, but I didn't quite find anything that ticks all of the boxes we have. If I missed something, if anyone out there has ideas on what might work here, I'm all ears; I'd like to hear your feedback.

The big stumbling point, the thing that didn't work well for us, was the accessing-history part. In a very early prototype, we actually used RocksDB, and we used the combination of the block number and the LSN as the key. But it didn't behave very well. When you keep accumulating new versions of a page, what that means in that scheme is that you keep inserting new key-value pairs. But then, when compaction runs, it keeps moving those existing keys to the next level, and the next level, and so forth. That's not what we want, because we know we're never going to modify those keys: there are never any tombstones, and we never remove stuff as such. So that didn't really work for us; the write amplification was pretty bad.

The other thought is that many of these LSM tree implementations natively have support for snapshots, and they have the capability to read older versions of key-value pairs. They typically do that for MVCC and snapshot isolation: you can take a snapshot and read many keys as of that snapshot. Many of the key-value stores we looked at have that capability, but they don't expose it the way we wanted. Many of them wouldn't allow us to use our own LSN number as the key for that dimension. Or they would only allow you to take a snapshot now and then read all of the data as of it, but not take a snapshot in history, after the fact: they only keep snapshots while the system is running. So I couldn't find anything that fits well there.

The other thing with the history is that we store these images and we store these deltas, and they are two very different kinds of records. When you're doing a search, you need to find all of the deltas, meaning all of the WAL records for that page, until you hit the last image. So it's not enough to find the latest key-value pair for that page; you actually need to collect all of the key-value pairs, all the WAL records, until you find the last image of that page.

The third point is that we wanted to control the materialization: we want to be able to create new materialized versions of pages when we reshuffle the data in background operations. Some implementations we looked at might allow you to hook into the compaction and rewrite some of the keys at that point, but not all of them, and we really wanted control of that process.

Then there's the upload and download to cloud storage. Again, some key-value stores do that natively, but not all of the ones we looked at.

Branching is another feature: we want these cheap copy-on-write branches. That might actually have been okay. The way we implemented it in our own storage engine is at a higher level anyway: we create a new storage for each branch, and if you fall through the bottom of that storage without finding the version of the page, then you look at the parent.
So that might have actually worked with the existing ones.

Finally, we wanted to find something written in Rust, or some other memory-safe language. Our storage system is multi-tenant: we share the buffer cache in the storage across different databases, possibly belonging to different customers, and we really don't want a segfault that causes an outage for everyone because one database has problems. Or, worse, we definitely don't want to accidentally leak data from one database to another; that would be way worse. So we really wanted a memory-safe language, especially if we were going to modify something: if we took an existing system as the starting point and modified it to do all of the things we wanted, I wouldn't trust myself to write that code in C or C++.

And finally, we already have a write-ahead log; we're storing the write-ahead log in this format. Many of these other key-value stores come with their own write-ahead log, and we don't need that, we already have one. Although many do allow you to turn it off.

So what did we do? Our storage consists of these immutable files; again, in LSM tree terms those would be SSTables, but we call them layer files. There are two kinds of layer files: delta layers and image layers. An image layer contains a snapshot of all the key-value pairs in a particular key range at one particular LSN; that's like the baseline snapshot, or backup. And then there are delta layers, which contain all of the key-value pairs in a particular key and LSN range, but only for keys that were actually modified: we don't make copies of pages that were not modified in that range. This is better explained with this picture here.
We think of the storage format as two-dimensional. At the top you can see the delta layers. A delta layer is a rectangle: it contains all of the WAL records for a particular key range over a range of LSNs, a range of time. And then there are these image layers. An image layer is a snapshot at one particular LSN, so it doesn't cover a range of LSNs, but it contains all of the key-value pairs at that point in time. We try to maintain things so that every file is roughly the same size; we're aiming at roughly one-gigabyte files, which seems to be pretty good for dealing with cloud storage.

Now, the basic read operation: when we get a get-page request, you start from the requested point in time, the requested LSN and the requested key. Thinking of this as a two-dimensional problem, we search downwards. We look into the layer file here, and we collect all of the WAL records for that particular key in that file, if there are any; then we continue the search in the next file, and so forth, until we hit the image layer. The image layer contains images of all of the pages in its key range, so we can stop the search there. Now we have accumulated the last image of that page, from the image layer, and all of the WAL records on top of it, and we can replay them. There might be images of this page in the delta layers as well, and then we can stop the search earlier; but in the worst case, we have to scan until we hit the last image layer. So that puts a backstop on how far back we need to scan.

When new incoming WAL is processed, we create new delta files at the top. Those are basically the original write-ahead log, just reordered and stored in our format, which makes it faster to look up. Then we have background operations that create new image layers; that runs every 20 seconds or something, I think. The point of creating these image layers is, first, that it speeds up the searches: as I mentioned, the image layer acts as the backstop. Even if there are no images of the page earlier, by the time you hit the image layer, it's definitely going to be there, and you can stop the search. It also enables garbage collection later: if any pages down here have been modified later, the image layer allows us to garbage collect these older files, because we know there's going to be an image here.

And then we have compaction. This one I'm not 100% sure actually makes sense for us; this is something I'm still thinking about, what the right strategy would be and when to do it. The basic idea of compaction is that we take these delta layers and reshuffle them into differently shaped rectangles that contain all of the changes for a larger LSN range but a smaller key range. This is equivalent to the merge operation in LSM trees, where you merge into the next level of the tree. The reason I'm not 100% sure it makes sense for us is that we usually can't actually get rid of any data here, because we're still keeping the history; we're just rewriting the data in a different order. It can help with the searches, because when you're doing a search from up here, you don't need to visit so many layers, so many files. But maybe we could get the same benefit with something like a Bloom filter, which is typically used with LSM trees; we haven't done that yet.
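Here is a sketch of the layer search just described: walk down through the layers collecting WAL records for the key, and stop at the first full image, which in the worst case is the image layer. The layer representation is a simplified stand-in for the real layer map.

```rust
// Sketch of the read path through the layer files: starting from the requested
// (key, LSN), scan downwards through the layers, collecting WAL records for
// that key, and stop as soon as an image is found.

type Lsn = u64;
type Key = (u32, u32);

enum Layer {
    /// Covers a key range over a range of LSNs; holds WAL records and
    /// occasional full images for keys that changed in that range.
    Delta { lsn_range: std::ops::Range<Lsn>, records: Vec<(Key, Lsn, Record)> },
    /// A snapshot of every key in its key range at one particular LSN.
    Image { lsn: Lsn, pages: Vec<(Key, Vec<u8>)> },
}

enum Record {
    Image(Vec<u8>),
    Delta(Vec<u8>),
}

/// Collect everything needed to reconstruct `key` at `at`: the WAL deltas,
/// newest first, plus the base image to replay them on top of.
fn plan_read(layers: &[Layer], key: Key, at: Lsn) -> (Vec<&Vec<u8>>, Option<&Vec<u8>>) {
    let mut deltas = Vec::new();
    // Layers are assumed ordered newest-first here.
    for layer in layers {
        match layer {
            Layer::Delta { lsn_range, records } if lsn_range.start < at => {
                // Walk this layer's records for our key, newest first.
                for (k, lsn, rec) in records.iter().rev() {
                    if *k != key || *lsn > at {
                        continue;
                    }
                    match rec {
                        // An inline image ends the search early.
                        Record::Image(img) => return (deltas, Some(img)),
                        Record::Delta(d) => deltas.push(d),
                    }
                }
            }
            Layer::Image { lsn, pages } if *lsn <= at => {
                // The image layer is the backstop: the page must be here.
                let img = pages.iter().find(|(k, _)| *k == key).map(|(_, p)| p);
                return (deltas, img);
            }
            _ => {}
        }
    }
    (deltas, None)
}
```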
But one thing compaction will do is aid the garbage collection later. If you have a skewed workload, where there is one part of the database that you load once and never update again, and another part that you keep updating frequently, this reshuffling operation allows us to garbage collect the frequently updated part more often.

Finally, I mentioned we have garbage collection. Conceptually, there is a line, a retention period of, say, one week or one month: how far back do we need to be able to reconstruct these pages? If we have the images, and the data for all the later versions, we can remove the old files that we don't need anymore.

So that was the tour of that. I included a picture here; a colleague of mine actually wrote a little tool that extracts diagrams like this from the actual files we create. This was a dump from some test workload, and you can see the files at the top, and then the differently shaped delta files and image files. I'm not going to try to explain this any further, but it serves as a nice visualization of what it actually looks like on a real cluster.

So that's basically how the storage format works. The one thing I haven't mentioned yet is branching. We support cheap copy-on-write branches, and that's also what we rely on for backups: this replaces your traditional backup and restore operations, because we keep the history. The way it works is that when you create a branch, we just start a new, empty storage at that LSN, and then whenever you search, if you don't find the page you're looking for there, you continue the search from the parent, at the point where the branch was created. That kind of fell out of the storage format pretty easily, as a bonus, when we started to design this thing, and it turned out to be a very nice and pretty unique feature.
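That branch fall-through can be sketched as a child timeline that delegates misses to its parent, capped at the branch-point LSN. Again, these are hypothetical types, not Neon's actual code.

```rust
// Sketch of copy-on-write branching: a branch starts as an empty storage
// pinned at the branch-point LSN; reads that find nothing in the child fall
// through to the parent.

use std::collections::BTreeMap;

type Lsn = u64;
type Key = (u32, u32);

struct Timeline {
    /// Parent timeline and the LSN at which this branch was created.
    parent: Option<(Box<Timeline>, Lsn)>,
    versions: BTreeMap<(Key, Lsn), Vec<u8>>,
}

impl Timeline {
    /// Find the latest version of `key` at or before `at`, falling through to
    /// the parent branch (capped at the branch point) when the child has none.
    fn get(&self, key: Key, at: Lsn) -> Option<&Vec<u8>> {
        let found = self
            .versions
            .range((key, Lsn::MIN)..=(key, at))
            .next_back()
            .map(|(_, v)| v);
        match (found, &self.parent) {
            (Some(v), _) => Some(v),
            // Nothing on this branch: look in the parent, but never past the
            // point where the branch was created.
            (None, Some((parent, branch_lsn))) => parent.get(key, at.min(*branch_lsn)),
            (None, None) => None,
        }
    }
}
```

Creating a branch is just allocating a new empty map with a parent pointer, which is why it is cheap: no data is copied until the branch writes its own versions.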
There are some open questions with this that we haven't really addressed. For example: how often does it make sense to materialize these pages preemptively? We don't really need to materialize anything, to do the WAL replay, until the compute requests a page. If you have a small database, it's quite possible that you load it with data, do a lot of updates, and then delete all of the data, and we would never need to replay the log in the page server; we just need to store it. If the compute never requests a page, we never need to replay it. On the other hand, if the compute does request these pages, say you have a workload that doesn't fit in the Postgres cache, then it will keep requesting pages that it wrote a while ago, and now you really want low latency. It's kind of bad if we have to do the replay at that point, because it does show up; it takes a few milliseconds to collect the records and do the replay. So when exactly do we want to materialize preemptively, before we get the request? We haven't really solved that problem; I don't know what the optimal strategy would be.

What about this: because you're serverless, when, after what, five minutes, you decommission a compute node, do you send any signals down to the page server to say, hey, by the way, this guy disappeared, so either clean things up right now, or maybe give it a lower priority and clean it up at a later point?

We don't currently. That might make sense. On the other hand, even if a database has been inactive for five minutes, it's probably still more likely to become active soon than some other random database is. So I'm not sure; that might be something we should collect stats on and see what it actually looks like.

Yeah, there's a question.

I have a question: how do compute servers know which page server to read from for a specific page?

The control plane keeps track of that, of which page servers contain the data for which database. So when it launches the Postgres instance, it passes that as a parameter, the hostname and port to connect to.

So would it be like a range of page numbers goes to one server, or is it more like...

No. We don't currently do sharding at all; currently, the way this works is that there is only one page server for one database. That's something we will be working on in the future, sharding and spreading the workload across different page servers. This architecture would allow us to do that pretty easily; it's very shardable. But we haven't done it yet.

That makes sense. And then another quick question: for reading from page servers, do you find yourselves more limited by network bandwidth, or disk bandwidth, or something else?

We're kind of having problems with all of the pieces at the moment. One thing we ran into very early, and are actually only now starting to address, is sequential scan speed.
Postgres normally relies heavily on the operating system cache for sequential scans: when it's scanning, it just requests page number zero, then page number one, then page number two, and so forth. In that workload, what came up very quickly is that if you do a round trip to the page server for every individual page, it's very slow. So in that case it was the network round trips that were the problem. We added prefetching support, so now we're smarter about that: we send batches of requests, and that eliminates the problem. In other workloads, what shows up is the overhead of reconstructing the pages, doing the WAL replay, if there were a lot of changes to a page. In yet other workloads, we had some test cases where there are a lot of layer files, if you have a lot of history; we had a dumb algorithm for keeping track of which layer files exist, and that showed up as consuming a lot of CPU.

But you're addressing that with a different data structure there.

Yeah. So it depends on the workload: some workloads have issues with the network; in others, we just have overhead that's not low-hanging fruit, which we will need to fix in the page server.

Thank you.

I don't want to take too much attention here, but can you merge WAL records? If they're contiguous, say LSN 123 and LSN 124 both update the same record, or the same page, then rather than applying them incrementally, one after another, you could combine them into a single log record that's batch-applied. But I guess if someone then wants the page as of 123, you have no way to go back.

Yeah. First, we haven't done that, but we've definitely thought about it. What we could do is merge records between commits: no one cares what happened between two commits, that's invisible to the users, so between commits we could definitely do it. The second thing is that I'm not sure how much it would really save. If you insert one tuple, and then you insert another tuple on the same page, and you merge those into a single bigger record, it would still take roughly the same amount of space, and probably roughly the same amount of CPU work to apply. So I'm not sure it makes sense, unless you can actually eliminate some steps. What might make sense is to record the after-image, what the image of that page looked like after you replayed all of these records; that might help.

I'm actually pretty much at the end of the presentation. Some of these open questions, again: when do you materialize pages, when do you create these image layers, and, as I already mentioned, does it make sense, and when, to merge these delta layers together, so that this can scale. Those were the open questions I have, but I'd love to hear more questions from all of you.

Okay, awesome. I'll clap for everyone here. So we have nine minutes for questions. I've already asked a ton, but I'll give everyone else a chance first. If you have any questions, go for it: just unmute yourself and fire away. You're all fools. All right, fine, I'll take the time.

Thank you. So, again, going back to the serverless thing. At any given moment in time, and I realize this varies during the day, what percentage of the databases in Neon are swapped out? How much are you saving your customers? Can 95% of the databases be swapped out at any point in time, or is it a lower number?
Right now, we try to keep everything in the page server, so we would actually only rely on that swapping out if we had to kill a page server and reload it. Currently we still don't have so much data that we would need to swap things out; we've had this service running for a few months now, and that has not been enough time to accumulate enough data to actually have to swap anything out. But I'm thinking of just doing a least-recently-used scheme, and then it's up to us how much local disk space we want to have in the page servers, and we keep as much as we can fit.

There's that, and then there's also the compute layer: what percentage of the Postgres instances in Neon are decommissioned at the compute layer at any given time?

I do have numbers on that. We have roughly between 50 and 100 active computes at any given time; remember, we shut them down after five minutes, so these are the ones that are actually active. And I think we have about 3,000 registered users who have created databases, so you can do the math from there. But it varies. There's a power-law distribution: some databases are active; some people have already put something in production, even though we're still in beta (some people are brave), and they have applications running pretty much all the time. A lot of people just sign in, kick the tires, and never come back. And then there are people in between, who run daily jobs. We just had a customer case where they were running a batch job twice a day: they would load a bunch of data, do some processing on it, and then the database would go idle, twice a day.

Got it, okay. And everything is running on Kubernetes, so when someone connects, you just spin up a pod, say "here's your compute layer now", and it talks to the page server and gets whatever it needs on demand. You mentioned it takes four seconds when someone connects to a decommissioned instance and you bring up that pod: do you know how that compares against, say, Aurora Serverless from Amazon?

I don't think Amazon shuts things down completely; it only allows you to scale up and down. I don't think they scale down to zero. And I don't know how long it takes for them; I think there's a manual operation, but I don't know how long it takes. From what I know from our numbers, where is that time spent? It's between one and two seconds to spin up the pod, before any of our code runs.
After that, it takes a few hundred milliseconds to download what we call a base backup, which is a Postgres data directory that doesn't have any of the data, because that's downloaded separately; but we still need the Postgres data directory so that Postgres can start up and find everything that isn't a table or an index. So that takes a few hundred milliseconds to download, extract, and get Postgres running. After that, we run a few queries; I think we do run a couple of queries to check that it works, and that takes a few hundred milliseconds. Then, in our control plane, when we start this operation, we add a record to an internal database to remember that we started this pod, so a few hundred milliseconds go into our internal bookkeeping in the control plane, basically. And then there's the round-trip latency to the client, too. All in all, that adds up to roughly four seconds. We definitely want to get that down to one second or so. Four seconds is okay for many applications, but it's not quite fast enough that you wouldn't care: at four seconds, many applications would try to artificially keep the database active, which we don't want people to do. We do want them to shut down when they're inactive.

Okay, a question from the chat: do you have any benchmarks against Postgres hosted in the cloud? So I guess that would be RDS or Cloud SQL or something like that.

We run some pgbench tests internally. It can be competitive; we can get pretty close to the performance of RDS, but of course it depends on how many resources we throw at it. We're probably spending more CPU time and resources to achieve the same level of performance currently.

Maybe a better question is: is there a workload where you would actually end up doing much worse on Neon, where the application would be better off running, you know, stock Postgres on RDS or something?

I'm sure we still have silly mistakes and things like that, but discounting those, looking at the architecture, I don't think there's anything that would be particularly harmful. I mean, you can't beat a local SSD; that's just not going to happen. But if we do all the caching and all of that in a good way, I don't think there is a particularly hostile workload.

I was just thinking of customers we've come across who were running Postgres on RDS and switched to Aurora because they somehow magically thought it would be faster, and it turned out to be worse, plus the pain, you know, a 20% overhead on top of that. Anyway. My last question: since you're sort of cutting out the storage layer of Postgres, and Postgres famously relies on the OS page cache for caching pages, is there anything you've had to change to deal with an assumption Postgres makes about having an OS page cache? Or has there been any surprise, something that turned out to be way easier than you thought it would be? I'm assuming there's no...

There's no issue I can think of, but yeah, the sequential scans were one thing: Postgres depends on the operating system to do the read-ahead for you, and that behaves differently here, so we had to implement that prefetching ourselves. It also means that we don't get the benefits of the operating system cache, so you want to tune the compute instance differently than you would tune vanilla Postgres.
With vanilla Postgres, you would want to use only part of your memory for the shared buffer cache and leave the rest to the operating system cache, but that doesn't work here, because there is no operating system cache. Something we've been thinking of is to actually implement another level of caching ourselves, a local in-memory or on-disk cache in the compute, but we haven't done that yet.
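That compute-side cache doesn't exist yet, so purely as an illustration of the idea, a small LRU of materialized pages sitting in front of the page-server round trip might look like this; every name here is hypothetical.

```rust
// Purely a sketch of the idea mentioned above (not implemented in Neon): a
// small LRU cache of materialized pages in the compute, in front of the
// network round trip to the page server.

use std::collections::HashMap;

type PageId = (u32, u32);

struct LocalPageCache {
    pages: HashMap<PageId, (u64, Vec<u8>)>, // page -> (last-use tick, image)
    capacity: usize,
    tick: u64,
}

impl LocalPageCache {
    /// Return the cached image, if present, and mark it recently used.
    fn get(&mut self, page: PageId) -> Option<&Vec<u8>> {
        self.tick += 1;
        let tick = self.tick;
        self.pages.get_mut(&page).map(|(used, img)| {
            *used = tick;
            &*img
        })
    }

    /// Insert a page fetched from the page server, evicting the LRU entry.
    fn insert(&mut self, page: PageId, image: Vec<u8>) {
        if self.pages.len() >= self.capacity {
            // Evict the least recently used page.
            let victim = self
                .pages
                .iter()
                .min_by_key(|(_, (t, _))| *t)
                .map(|(k, _)| *k);
            if let Some(victim) = victim {
                self.pages.remove(&victim);
            }
        }
        self.tick += 1;
        self.pages.insert(page, (self.tick, image));
    }
}
```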