So we're super excited today to have Thomas Neumann, the Thomas Neumann, who's a professor at TU Munich. He's also the chief architect of HyPer, which was sold to Tableau, and he's been working on this new system called Umbra that he's here to talk about for us today. Again, we're super excited to have him here because it's a whole new semester, a whole new seminar series, and Thomas was the first person I immediately wanted to get to talk, because I think this stuff is fantastic. So as always, if you have any questions for Thomas as he's giving his talk, please unmute yourself, say who you are and where you're coming from, and feel free to do this anytime. We want this to be a conversation with Thomas and not him talking by himself for an hour. And we thank Thomas for staying up late, because he is in Germany, because he is the best database researcher in Germany, and of course that's where he is. So it's 10 o'clock his time, and we thank him for staying up late. All right, Thomas, the floor is yours. Thank you so much for being here.

Yeah, thanks for the introduction. So I'm happy to talk about our system. First, some background on what I want to talk about. As Andy said, about 10 years ago we started building the HyPer system, which eventually went commercial. This was a pure in-memory system, and it worked fine, great performance, everything was good. But at some point we realized that being purely in-memory has some problems; I will talk about those problems on the next slides. That is why the title of this talk is what it is: the next system that we built, the Umbra system, is what we call a disk-based system. To be honest, it's an SSD-based system, but it's a disk-based system that aims to give you the same performance as an in-memory system. Okay, now why do we care about this? If you look at the landscape of potential systems, here you see a sketch that compares the performance and the cost of in-memory systems and disk-based systems, and we're talking about classical disk-based systems. We see that in-memory systems are very, very fast, exceptionally fast, much faster than the others, but they are also expensive, because main memory is expensive compared to disk; there's no question about this. And what worries me even more than just the price is that the size of memory does not grow that quickly anymore. Ten years ago there was still very fast growth of memory sizes, and it looked like a few years later everybody would have multi-terabyte memory systems, but this didn't happen. I mean, you can buy systems with a few terabytes, but they are quite expensive, and memory sizes don't grow significantly any longer. So this means we have some kind of limitation for in-memory systems, and this made us nervous. The second argument, of course, is simply the cost: is it cost-efficient to put everything in memory? And furthermore, the hardware has changed, too. We no longer have just slow rotating disks; we also have these fast SSD-based storage devices, and they sit somewhere in the middle between the two. First, they are much cheaper than main memory, and they can also be quite fast, not as fast as main memory, of course, but you can read several gigabytes per second from an SSD. That is serious performance.
In particular, if you consider that you can put multiple of these in one machine, then suddenly you have a combined bandwidth that is really something. So you can do something with these devices. And the cost, of course, is much, much lower. We said the cost of two terabytes of main memory is something like $20,000, while two terabytes of SSD is just $500. So this is drastically cheaper, and we can scale it to very large sizes. Therefore we said, okay, this is an attractive device to use: it has much better latency than a rotating disk, and the bandwidth is quite good. On the other hand, we still have to keep in mind that these are fast, but they're still slower than memory. So if possible, we would like to operate in memory if we can. And in practice we often can, because even though the data set can be large, the working set is usually not that huge. So you have reasonable chances that, if you have a large server, your working set will often fit in memory, not always, but often it will fit. And we would like that, at least in the case where the working set does fit in memory, we get the same performance as an in-memory system, or nearly the same performance. But if you take a traditional architecture, this doesn't happen. If you use a traditional buffer manager, the buffer manager itself imposes overhead on you, and therefore at some point your performance is limited by the buffer manager, even though in practice everything is in memory. And this is, of course, a bit silly. Therefore we want our system to use a non-traditional buffer manager, and I will talk about how this works in a moment; this gives us good performance if things fit in memory. Of course, if things don't fit, then we are bound by our device, that's clear; we cannot be faster than that. Okay, so this is where we place ourselves. We say that with the Umbra system we want the reasonable cost of being SSD-based plus a large, but not super large, memory. So we combine large in-memory buffers with very fast SSD devices, and we also use a different buffer manager architecture, different in multiple dimensions, and this gives us good performance for a wide range of use cases. So that is basically the story. In this talk today, I would like to show you what has to be done differently to really be a disk-based system. You don't want to use a traditional disk-based architecture because this is slow, and you cannot just use a pure in-memory architecture either, so you have to change things. I just want to give you a short overview of what we did to reach this point. Okay, so to summarize, and I hope this is the only marketing slide that I have, we said the Umbra system is basically an evolution of HyPer towards an SSD-based system. We get comparable performance to HyPer if the working set fits in memory, and if it doesn't fit, then we scale to whatever the IO device gives me. Now, many aspects are similar: like the HyPer system, we compile queries into executable code, it is a compiling engine, and so on. But I don't want to talk about these parts because they are the same as in my previous system; I only want to talk today about what's different. In particular, I want to talk about some key additions and adaptations we made to get there. And we start with the buffer manager. It is a buffer manager that scales pretty well, is pretty fast, and has an unusual property.
It supports variable-sized pages, which systems usually don't. Usually they have one fixed page size, but we built variable-sized pages, and I will talk about why. Then our compilation model is also a bit different from what we did in HyPer: we have adaptive compilation, and we have an execution engine that splits the execution plan into modular steps, for reasons I will talk about. Then we had to do something about statistics maintenance, because if your data is on disk, you don't want to look at it with random IO; even on SSDs that costs you something, so you prefer not to. Therefore our statistics maintenance is different, and we also handle strings a bit differently. I think I will skip the strings, but I will talk about the other three parts.

Okay, so let's now start with the more technical things. What about buffer managers? I think all of you know how buffer managers are implemented: a traditional buffer manager organizes data in fixed-size pages, like 4 kilobytes, 16 kilobytes, or 64 kilobytes, depending on the system. And we do this for good reasons, of course, because the buffer manager is much simpler if you have fixed-size pages: recovery is easy, memory management is easy, you have no fragmentation, all pages are the same, so everything works nicely. But the price you pay for this simple buffer manager is that the rest of your system has terrible complexity, in particular if you happen to have data structures that are larger than a page. And this can happen quite easily. The stupid example is that the user inserts a string that is larger than a page, which can happen; a user can insert large strings, and this requires complex logic. Now you could ask how often that happens, but there are other things that you would like to store. For example, dictionaries for dictionary compression: you would like to store them nicely, consecutively in memory, but there's a high chance that these are larger than your memory, sorry, than your page. And then we need some complex logic that maps these data structures onto fixed-size pages, what we call partitioning logic here. So there is this mapping from your data structures to your pages. And this is a problem, because it makes your life much more complicated, and it also costs you performance. Just think about a lookup in a dictionary for dictionary compression, I think this is the best example: you want resolving one value to take nearly no time, you want to access an array and that's it, as in the small sketch after this passage. But if you split this across multiple pages, suddenly the logic is much more complex. So if you use fixed-size pages, you get a simple buffer manager, but the rest of your system is complicated because you have to deal with all these corner cases. And I think this is a bad trade-off, because you implement the buffer manager only once, but you have many data structures in your database system. That's why I think you don't want this. What you want is a buffer manager that supports variable-sized pages, so that we basically have a one-to-one mapping from data structures into the buffer pool. The data structure is stored as it is, and if some object is larger than the traditional page size, then it is larger than the traditional page size; your buffer manager should do the right thing. So we can store large objects contiguously, and then we have a nice representation that we can access directly.
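To make the dictionary point concrete, here is a tiny illustration with hypothetical names, not Umbra code: if the dictionary lives contiguously on one variable-sized page, decoding a code is a single array index, whereas a dictionary chopped into fixed-size pages pays page arithmetic and an extra indirection on every decode.

```cpp
#include <cstdint>
#include <vector>

// Dictionary stored contiguously on one (possibly large) variable-sized page:
// decoding a value is one array access.
struct ContiguousDict {
    const uint64_t* values = nullptr;                   // points into the page
    uint64_t decode(uint32_t code) const { return values[code]; }
};

// The same dictionary partitioned across fixed-size pages: every decode needs
// page arithmetic plus one extra pointer hop.
struct PartitionedDict {
    std::vector<const uint64_t*> pages;                 // one pointer per fixed-size page
    uint32_t entriesPerPage = 1;
    uint64_t decode(uint32_t code) const {
        return pages[code / entriesPerPage][code % entriesPerPage];
    }
};
```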
Also, perhaps another motivation: remember, we want to have the same performance as an in-memory system, and of course the in-memory system that we compare against, our own, had objects that were larger than a page and could be accessed immediately. So if you want to beat that system, you have to have similar lookup performance, basically. Therefore we said, okay, we want to support these structures. The good thing is that if you do this, you need no complex partitioning logic, everything is easy; the bad thing is that the buffer manager becomes complicated. But as I said, I think this is a worthwhile trade-off, because you implement the buffer manager only once and then the rest is fine. So yes, it is more complicated. Now, the most difficult problem of this buffer manager, or the most severe problem, is actually fragmentation. Because if you have data objects of different sizes and you allocate and release them in arbitrary order, you can get holes that are difficult to fill, what we call external fragmentation. So how do we get rid of the holes in our buffer? Otherwise it can happen that numerically we still have space, but in practice all the space is unusable. We solved this problem with a trick: basically, we distinguish between virtual and physical memory. In the end, the physical amount of memory that you have in your system is limited to whatever your memory is. You have, let's say, 512 gigabytes or 256 gigabytes, that is the space you can use on the physical level. But on the virtual level, in the virtual address space, we have much more space available; the operating system will happily give us more as long as we don't populate it all. Therefore we map the memory that we have available repeatedly, multiple times, with different size classes. Here you see a small sketch that says, okay, let's map the same memory range with 64 kilobytes, 128 kilobytes, 256 and 512. Of course, you need exponential growth to limit the number of size classes. But the total amount of virtual address space that we need only doubles if we use power-of-two growth. So we don't need a lot extra, and this gives us basically the whole space in multiple granularities. Now, if I need memory in a certain size class, say I have space immediately available for small pages, but now I need a large page and I don't have consecutive space for it, then what I can do is deallocate two small pages, that is, tell the operating system to cut the link between the virtual address space and the physical memory. You release the 64-kilobyte pages and then you have physical space available again for 128 kilobytes, and then you can touch the corresponding 128-kilobyte block and you get it, and everything works. So basically you tell the operating system to give you the same physical memory under a different virtual address, and then you don't have the fragmentation problem: fragmentation is fixed because we migrate pages to a different address in the virtual address space. Therefore the mapping between virtual addresses and physical addresses is only guaranteed for pinned buffer pages; if somebody is accessing a buffer page we guarantee it, otherwise we are free to remap. Okay, so that was memory allocation. Now, while this is handy, because it allows me to store complicated things in my buffer, this alone is not sufficient to get good performance.
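A minimal sketch of that size-class trick, assuming Linux mmap and madvise and using hypothetical names; this is not Umbra's actual code. The same physical budget is mapped once per power-of-two size class as anonymous virtual memory, and releasing a page with madvise(MADV_DONTNEED) hands the physical memory back so it can reappear under another size class.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Each size class maps the same physical budget as its own anonymous virtual
// region; pages are released with madvise(MADV_DONTNEED), so the physical
// memory can later back a page of a different size class without creating
// holes in any virtual region.
class SizeClassArena {
public:
    SizeClassArena(std::size_t bufferBytes,              // physical budget (buffer pool size)
                   std::size_t minPageBytes = 64 * 1024,
                   unsigned numClasses = 8) {
        for (unsigned c = 0; c < numClasses; ++c) {
            void* base = mmap(nullptr, bufferBytes, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (base == MAP_FAILED) throw std::runtime_error("mmap failed");
            regions_.push_back(Region{static_cast<std::byte*>(base), minPageBytes << c});
        }
    }
    // Virtual address of page `index` in size class `c`; touching it populates physical memory.
    std::byte* pageAt(unsigned c, std::size_t index) const {
        return regions_[c].base + index * regions_[c].pageBytes;
    }
    // Give the physical memory behind one page back to the OS; the virtual range
    // stays reserved, so the virtual address space never fragments.
    void releasePage(unsigned c, std::size_t index) const {
        madvise(pageAt(c, index), regions_[c].pageBytes, MADV_DONTNEED);
    }
private:
    struct Region { std::byte* base; std::size_t pageBytes; };
    std::vector<Region> regions_;
};
```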
Remember, our opponent here is a pure in-memory system. The traditional implementation would be that we acquire a latch, a reader-writer lock, for every page that we access, pin it, unpin it. This is terribly expensive, because every access grabs a lock. And even worse, if you think about the B-tree, the root is highly contended; a lock on the root is highly contended. We don't like this. Therefore we use two tricks to improve the performance here. The first is pointer swizzling, and the second is versioned latches; we don't use traditional latches, we use versioned latches. Okay, now how do these techniques work? So, pointer swizzling. Can you hear me? Yes, your mic is working. We have questions from the audience. Do you want to unmute yourself and go for it?

Yeah, sorry, I didn't want to interrupt. Are there any issues with changing the virtual-to-physical mapping very often, where you can increase the size of kernel data structures or blow out the TLB cache? I've heard of using mremap to do a fast realloc, and this kind of trick can cause other memory accesses to be slower.

Yes. So therefore we don't use mremap, because that is indeed a bad thing. We just tell the operating system to unlink the physical memory, this is madvise with DONTNEED, I think, to tell the operating system to forget it. This keeps the mapping simple: it is still a linear mapping in the kernel's address space structures, so you don't get crazy mapping objects, the mapping stays linear. Now, there is still a cost, namely if you change the size class, then you have to shoot down the TLBs of the other cores, and therefore you don't want to do this very often. So we actually prefer to keep memory at a certain size, and we only move it to a different size if we have to, because you are right, there is a cost associated with this. And there are some tricks you can play in the kernel to make this cheaper; there is a professor from another database chair who told me how you can improve the performance, but this requires patching the kernel, and we don't do that by default. By default we just use vanilla kernel calls, and then we prefer not to change the size if we can. But it's not terribly expensive: you can do this a few hundred times a second, that's okay, but you don't want to do it millions of times.

All right, the next question is from Nevin. Hi, professor, I'm wondering, how does the flush work with variable page sizes backed by a file? Oh, Nevin from Google BigQuery, hello. So you mean how to write it to disk, is that right? Right. It doesn't work that differently from a regular write-out; you do the same. The only thing is you have to be a bit careful with recovery. Also on physical storage you don't want to change the size of a page, because otherwise it could happen that a large page is overwritten by, which direction was it, no, the other direction: two small pages get overwritten by one large page. This is unsafe, because then you are not sure that the header with the LSN is in the correct place, and you have to be a bit careful there to guarantee that recovery works. But otherwise you just treat it as a regular page, you write it to your file, and that's it. So is 64 KiB like a page in itself? Yes. Okay, cool, so you just write a 64 KiB page. Got it, thank you. Yeah. Or larger.
So basically our page ID is a combination of the physical offset and the size class, encoded in the page ID. So we always know how large each page is, of course, and then we can read and write it.

All right, Todd, do you want to ask your question? A quick question on that: do you use no huge pages for your mapping, so that when you actually want to unmap a 64K slot it can actually be remapped elsewhere? Or are you using two-megabyte pages and doing page splitting on that? No, we don't use transparent huge pages for this. Okay, so you just pay the extra TLB pressure and you get the finer granularity. Okay, thanks. Yeah. I think you could improve the efficiency of this with some kernel changes, and some people have done this, but it is not something the mainline kernel offers you, so we just use regular pages. And basically our trick is that we try to change the size as rarely as we can, because changing sizes has a cost. All right, awesome, keep going.

Okay. As I said, we want to keep the cost of accessing a page down. Because the traditional solution is: you use a hash table, you acquire a lock for the hash table, then you look up the page and acquire a latch on the page, and it's all terribly expensive. Therefore, to get in-memory performance, we use pointer swizzling. Basically, on a page you have a reference to another page; we call this a swip, a potentially swizzled pointer. The swip is either a page ID, if the page is not in memory, or, if the page is in memory, a real pointer to the page that we have. So when you follow a swip on a parent page, we check: did we already resolve this? If yes, we just jump there; it's one if, and then you go to the page that you have to access. If not, then of course we have to acquire a lock and do something, but in that case we do IO anyway; if it's not in memory we do IO, and then we can afford to pay some price, because the IO costs you something. But if the page is already in memory, or if we already traversed this pointer before, then we can just follow the pointer, and that's it. Of course, the price you pay for this is that when you evict a page, you must invalidate the incoming swips, because otherwise somebody would think that the page is there, but it's not; you must invalidate the swips. And our current implementation therefore allows you to have only one incoming swip. This has some consequences: basically, your data structures must all be trees. If you have a tree, then everything is fine. There is no strong reason not to allow two, we could have two or so, but you want to have a low number of incoming swips because we must invalidate them when we evict the page. Otherwise, the swip encodes the address, and then we can very, very cheaply access a page without a hash table lookup; a small sketch of this follows below.
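A minimal sketch of the swip idea, with an assumed bit layout rather than Umbra's actual one and hypothetical names: a single 64-bit word is either a real pointer (swizzled) or a page ID packing the size class and offset mentioned above, and the hot path is one branch followed by a pointer dereference.

```cpp
#include <cassert>
#include <cstdint>

struct Page;  // a buffer-managed page; its contents are irrelevant for the sketch

class Swip {
    uint64_t word;                          // tagged union: pointer or page ID
    static constexpr uint64_t kPidTag = 1;  // low bit set = unswizzled page ID
    explicit Swip(uint64_t w) : word(w) {}
public:
    static Swip fromPointer(Page* p) {      // swizzled: aligned pointer, tag bit 0
        return Swip{reinterpret_cast<uint64_t>(p)};
    }
    static Swip fromPageId(uint64_t sizeClass, uint64_t offset) {
        // unswizzled: 6 bits size class, 57 bits offset, low bit = tag
        return Swip{(sizeClass << 58) | (offset << 1) | kPidTag};
    }
    bool swizzled() const { return (word & kPidTag) == 0; }
    Page* pointer() const { assert(swizzled()); return reinterpret_cast<Page*>(word); }
    uint64_t sizeClass() const { assert(!swizzled()); return word >> 58; }
    uint64_t offset() const { assert(!swizzled()); return (word >> 1) & ((1ull << 57) - 1); }
};

// Traversal helper: the slow path loads the page (buffer manager fix + IO,
// supplied by the caller here) and swizzles the swip so that the next access
// is a plain pointer dereference.
template <typename LoadFn>   // LoadFn: Page*(uint64_t sizeClass, uint64_t offset)
Page* resolve(Swip& s, LoadFn&& load) {
    if (s.swizzled())                       // hot path: already in memory
        return s.pointer();
    Page* p = load(s.sizeClass(), s.offset());   // slow path: fix the page, do IO
    s = Swip::fromPointer(p);               // swizzle for future traversals
    return p;
}
```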
Okay. And the second thing that we use to get good performance is a different kind of synchronization. Of course, in a disk-based system you need some way to synchronize, because a page can go away due to memory pressure. So we need some synchronization mechanism, but we don't like the traditional one. Just think about the worst offender, the root of a B-tree: whenever you do a lookup, everybody goes through the root. The traditional solution would be: you grab a reader-writer lock, the latch on the root, find where you have to go, then latch the child, release the root, and so on. But if the data is in memory in practice, all your time will be spent acquiring this stupid mutex, in particular the mutex on the root of the B-tree. That is highly, highly contended even though everybody is only reading, because acquiring a mutex is an implicit write, even if there is no logical contention. Therefore we don't want this. What we do instead is use what we call a versioned latch, which is just a 64-bit integer counter, and it can be locked in different modes. The easiest mode is just an exclusive lock, as you would expect. You can also take a shared lock, where we remember how many readers we have. And the third mode is that we can take an optimistic lock. What does an optimistic lock mean? In the versioned latch we remember how many exclusive unlocks we did: basically, whenever you unlock from exclusive mode, you bump a version number. This allows you to detect whether there was a change, and it means that we can optimistically read a page without doing anything. You go to the page and read the version latch; at that moment we know whether somebody holds an exclusive lock. If yes, of course, we have to wait until that one goes away. But if nobody holds an exclusive lock, we just read the page as it is, with no locking at all, optimistically. The catch is that before we can act on what we have read, because it can change at any point in time, of course, we read the version latch again and check whether it has changed. If not, we say, okay, we are safe, we read something consistent; otherwise we had a race and we start again. This optimistic mode means we do not write to memory at all; we only read in this case, and this scales beautifully, even with dozens of cores, hundreds of cores even; it's extremely fast and scales very well. Of course, we may fail if there is contention, but that is usually not the case, and we rarely have these restarts. And of course you have the other modes, shared for read-only and exclusive for read-write access. Now the crazy thing is: if you have optimistically latched a page, the page might in fact go away while you are reading it. But this works because of the way we implemented the buffer manager: if a page is evicted, you read the zero page, so you are just reading zeros. Of course this is garbage; your reader must be a bit careful and must cope with the fact that it might read garbage, but we detect this afterwards during validation, and then everything works fine. This gives you extremely good scalability across cores.

Okay. And now, how do we use this to implement a traditional database? Basically, our relations are stored in B+-trees, and the key of the B+-tree is just a synthetic tuple ID; it is not visible to the user, internally we organize the relation with these tuple IDs as keys. And then, as we said, we have a PAX layout within the leaf pages. Technically speaking we also have a pure column store layout, but this is the OLTP-oriented one, if you wish, a PAX layout within the leaf page; we can scan it reasonably well, and we can navigate in it to the tuple that we need. And you have only one incoming pointer per page.
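A minimal sketch of the optimistic mode of such a versioned latch, assuming C++ atomics; shared mode and the eviction handling are omitted and all names are hypothetical. The version is bumped on every exclusive unlock, and readers validate it after reading.

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

class VersionLatch {
    std::atomic<uint64_t> word{0};   // bit 0 = exclusively locked, upper bits = version
public:
    void lockExclusive() {
        uint64_t v;
        do { v = word.load() & ~1ull; }                  // expect "unlocked" state
        while (!word.compare_exchange_weak(v, v | 1ull)); // spin until we set the lock bit
    }
    void unlockExclusive() {
        // Only the exclusive holder calls this: bump the version, clear the lock bit.
        word.store((word.load() + 2) & ~1ull);
    }
    // Optimistic read: returns the observed version, or nothing if a writer holds the latch.
    std::optional<uint64_t> tryOptimisticRead() const {
        uint64_t v = word.load(std::memory_order_acquire);
        if (v & 1ull) return std::nullopt;               // exclusively locked right now
        return v;
    }
    // Validation: true iff no exclusive unlock happened since `version` was observed.
    bool validate(uint64_t version) const {
        std::atomic_thread_fence(std::memory_order_acquire);
        return word.load(std::memory_order_relaxed) == version;
    }
};

// Usage pattern: read optimistically, copy what you need into locals, then
// validate; on failure, restart or fall back to a pessimistic latch.
// auto v = latch.tryOptimisticRead();
// if (v) { /* read page contents */ if (!latch.validate(*v)) { /* restart */ } }
```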
Okay, and we use different access modes depending on what you are doing. If you have just a traversal, like a point lookup, or you go through the inner nodes of the B-tree, then we use optimistic locking, because we don't modify anything; we just read and go to the next page. When you reach a leaf page, what we do depends on the mode. If you just do a single-tuple point access, we also use optimistic locking. If we do a scan, then we grab a real shared latch. Now, why do we do this, why don't we always use optimistic locking? The problem with optimistic locking is that you must validate after every read. In particular, what about strings? A page can go away at any point in time. For a point access, we just copy the string and check afterwards whether it was still valid. But we don't want to do this during scans; therefore we latch the page in shared mode, and then nobody evicts it, and this allows us to avoid copying strings. Okay, now there were also some other difficulties. What about multi-versioning? Of course you want multi-versioning, because we want transaction isolation and so on. But multi-versioning is a bit unpleasant: it's a complex data structure, we have to update it, and after the transaction commits we don't like to write it to disk; we prefer not to write it to disk because of the high overhead. Therefore we actually use a mixed solution. First, in the regular case, for most transactions, we just use an in-memory representation of the transaction's versions, and the argument is that most transactions are small: if you just touch a few hundred tuples, it happily fits in memory. So we have a per-transaction version buffer that records the changes of my transaction, and we attach these version buffers to pages. Basically, we have a mapping table that tells me, for this page, these are the corresponding version buffers. We could look this up per page, but for performance reasons we store it in the buffer frame: when you load a page, we check for its corresponding version buffer and attach it to the page directly with a pointer, so we can go to it directly. And if we evict a page, we keep these version buffers around off to the side, and when the page comes back again, we reattach them, and that's it. So this gives me very fast access to the version buffer. This works beautifully if your transaction is small, but of course it could be that your transaction is large. It's not often the case, but somebody can do a bulk load, or a delete of a whole relation, or some other heavy operation. Then you could touch every tuple in the database, and of course we don't want an in-memory representation like this, because it would be very expensive. Therefore, if we do have a bulk operation, if we notice that you are doing one, then we do something else. The reasoning here is that bulk operations can happen, but they are not very frequent: nobody runs high-frequency bulk operations, or at least if they do, they have a performance problem. So we say they happen from time to time, but not all the time. Therefore we allow at most one bulk operation to be active on a given relation at any point in time. Of course there can be another one afterwards, but we have only one running at a time.
And then for these bulk operations, we store the versioning information just as bits that we materialize directly on the page. Basically, on a page we remember to which bulk operation it belongs, so we store an epoch number on the page, and then we just store a created and a deleted bit per tuple if the bulk operation needs it. The advantage of these mini versions, just two bits, is that we can write them to disk, so we can handle arbitrarily large updates. You cannot have more than one concurrently, but each one individually can be really large. This allows us to properly isolate even these heavy operations with multi-versioning. Okay, how is my time? Ah, 50 minutes, I think I might have to speed up. Let's see. You're doing great, keep going.

Okay. Then there are a few other things that you have to do differently because this is a disk-based system. In HyPer we just built a sample of the data. Why? Because we need statistics. We need statistics all the time for estimation purposes: the query optimizer needs statistics, we have to know how many tuples qualify for a predicate, how many distinct values we have for a group by, and all that stuff. In HyPer we used a sample for this, and when the data had changed too much we simply recomputed the sample. That is what we did in memory. But on disk you don't want to do this, because random access to tuples is problematic: on a rotating disk it's a disaster, and on an SSD it's okay-ish, but you still get read amplification, because you might need just one tuple from a page but you read the whole page. So retrieving a large sample from disk is expensive, even on SSDs, and therefore we don't want to do it; we don't want to compute samples on demand. What we do instead is reservoir sampling. Basically, while you insert, update, or delete data, we maintain a sample all the time, so the sample is always available and we never have to recompute it. On insert, we use Vitter's algorithm to predict the skip length between hits in the reservoir sample, and then most of the time we do nothing; we just figure out which tuple would be relevant for the sample and insert that one into our sample. One difficulty here is how to do this multi-threaded, because we have multi-threaded inserts. I hope you can see the small sketch at the top, I'm not sure it's understandable. Basically, we have the skips between tuples that we can pre-compute, and we hand the skips out to threads: we say, okay, you are responsible for skipping so many tuples until one qualifies, and we give this out to threads on demand. Notice that the length of the skips increases over time, because the larger the relation becomes, the more unlikely it is that a tuple is relevant for the sample. We hand out the skips, and each thread locally counts down until the relevant tuple is encountered. This is very cheap, very low overhead, and only when a tuple is relevant for the sample do we insert it into a properly synchronized data structure. So this allows us to maintain the sample all the time at very little cost; there is a small sketch of the skip-based insertion below. Now, samples are great if you want to estimate a predicate, but they do not work well for estimating distinct counts, like group-by sizes, unfortunately.
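A minimal, single-threaded sketch of skip-based reservoir insertion. The talk mentions Vitter's algorithm; for brevity this sketch uses the simpler skip-based variant known as Algorithm L (Li, 1994), which has the same key property: most inserts only decrement a precomputed skip counter, and only the rare hit touches the reservoir, which the real multi-threaded system would synchronize. All names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

class ReservoirSample {
public:
    explicit ReservoirSample(std::size_t capacity) : cap(capacity) {
        w = std::exp(std::log(uniform()) / cap);   // Algorithm L bookkeeping
        computeNextSkip();
    }
    void insert(uint64_t tupleId) {
        if (sample.size() < cap) {                 // reservoir not full yet: take everything
            sample.push_back(tupleId);
            return;
        }
        if (skip > 0) { --skip; return; }          // cheap common case: just count down
        // Hit: replace a random reservoir slot, then draw the next skip length.
        sample[std::uniform_int_distribution<std::size_t>(0, cap - 1)(rng)] = tupleId;
        w *= std::exp(std::log(uniform()) / cap);
        computeNextSkip();
    }
    const std::vector<uint64_t>& contents() const { return sample; }
private:
    double uniform() { return std::uniform_real_distribution<double>(0.0, 1.0)(rng); }
    void computeNextSkip() {
        // Number of tuples to skip before the next one enters the sample;
        // grows as the relation grows, exactly as described in the talk.
        skip = static_cast<uint64_t>(std::floor(std::log(uniform()) / std::log(1.0 - w)));
    }
    std::size_t cap;
    std::vector<uint64_t> sample;
    std::mt19937_64 rng{42};
    double w = 0.0;
    uint64_t skip = 0;
};
```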
To solve this, we maintain another data structure, also while inserting, namely HyperLogLog sketches. For all columns in our relations, we maintain HyperLogLog sketches as you insert, and then we can give you a pretty good estimate of the number of distinct values for all columns without reading anything afterwards, because you don't want to read afterwards. I know some systems try to solve this IO cost problem by sampling not tuples but pages, but don't do this: you get statistical bias. Rather prefer a proper statistical solution, like reservoir sampling, for example.

Okay. A few more words about what else we changed: our compilation model is a bit different from what we did in HyPer. In HyPer we just took the query, compiled it into one function, and then you call the function, that's it. But this is unpleasant, because in the IO case you might want to de-schedule the query, you might want to pause it, and if your query is just one function that you call, that is quite difficult; you don't get rid of a function that is currently executing. Therefore, what we do instead is compile the query into a state machine, basically. You see a small sketch here, I hope it's understandable. Say we have a small query that reads supplier, groups by nationkey, and then computes a count. This results in two pipelines: one is the scan of supplier and the aggregation, and the second pipeline reads the aggregation result and produces the output. On the right-hand side you see the small state machine. In this case it's just a simple walk from bottom to top, you just hop one level up each time: you see we set up pipeline one, create thread-local storage, and then we do a thread-local aggregation, in parallel; notice the bar around the boxes, that is how the parallel steps are drawn. Then we allocate a global hash table and merge, again in parallel, into the global hash table, and so on. So this is how we execute the query, and at any point in time you can stop this and then resume the state machine from wherever you are. This is quite handy for handling IO and also for handling long-running queries: you don't want your long-running query to kick out all your short queries. Somebody might run a very cheap query, and you don't want them to wait until the long-running query is done. So we use this both for scheduling and for IO handling.

Now, about this mechanism: is this implemented as coroutines, or is it little functions that run? It behaves a bit like a coroutine, but from a technical perspective it is just functions with explicit state, and this is generated anyway. It looks a bit like a coroutine, but technically it is a regular function with explicit state; we keep track of the state ourselves. Got it. Did you consider, I'm thinking of the SQL OS stuff where they have these little functions that run for at most six milliseconds, they always come back and you can decide, or they basically run their own coroutines themselves; is that sort of the way I should think about this? Yes. We could have used, for example, standard C++ coroutines to do this. We intentionally did not, because when we looked at this, standard coroutines do memory allocation for their own state, and we don't want that, we want to control memory. Therefore we generate this ourselves. But note that our query compiler is a real compiler anyway, so we are generating a coroutine ourselves, if you wish; it's all generated anyway. Yes, awesome, thanks.
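A minimal sketch of the compile-into-a-state-machine idea, with hypothetical names; Umbra generates code like this rather than writing it by hand. Each step is a small function over explicit query state, and the scheduler can stop after any step and resume later, for example when a page has to be fetched from disk or a higher-priority query needs the core.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

enum class StepResult { Done, Yield };   // Yield: de-schedule this query, resume it later

struct QueryState {
    std::size_t step = 0;                // which state-machine step runs next
    // ... the generated per-query state would live here (hash tables, scan cursors, ...)
};

struct CompiledQuery {
    // One entry per step, e.g. "set up pipeline 1", "thread-local aggregation", "merge", ...
    std::vector<std::function<StepResult(QueryState&)>> steps;

    // Run until the query finishes or a step asks to yield (IO wait, scheduling decision).
    bool resume(QueryState& q) const {
        while (q.step < steps.size()) {
            if (steps[q.step](q) == StepResult::Yield)
                return false;            // suspended: caller re-enters resume() later
            ++q.step;                    // step finished, advance the state machine
        }
        return true;                     // query complete
    }
};
```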
Okay. So the nice thing about this modular architecture is that we can easily suspend it, and we can schedule things at a fine granularity. Also, these steps allow us to not compile everything at once; we can compile pieces of the query individually, since each step is a function, so you can compile each function individually. What we do is first lower our code into a custom intermediate representation that looks a bit like LLVM IR; it's not the same, but it looks a bit like it. Then, once we have this, we can lower it with different backends. When we run a query, we first translate it using a very low-latency backend that is very cheap to compile, and then we start executing. Then we watch how expensive each step is, because for a step like "allocate a hash table", come on, that needs very few instructions, there is no point in optimizing it. Only when we see that a step is expensive do we start a more heavyweight compilation. And note that this happens while the query is still running; we do not stop the query for this, we compile in the background and then we switch while it is executing. In a compiling engine, compile time does in fact become a problem. We started by always using LLVM, but this has problems, because LLVM unfortunately has super-linear compile time. I spoke to some of the LLVM developers and said, hey, your compile time is super-linear if the function is large, and they said, okay, don't make large functions. And I said, excuse me, it is machine-generated, I cannot. Okay, so they don't have sympathy for our problems. Therefore we do something else: we directly emit machine code ourselves, we have our own machine code emitter. The difficult thing is actually not generating the machine code, writing it out is easy; the difficult thing is register allocation, and that requires loop detection and lifetime computation. These are nasty things, and unfortunately you need some pretty advanced algorithms for them. Fortunately, Tarjan described how to do this, so you can steal them from the literature, but they are pretty advanced techniques. The good thing is that if you implement them, you get guaranteed almost-linear compile time, and almost linear means, in practice, linear: the inverse Ackermann function is, for practical purposes, less than six, so this is linear in practice. So then you get very, very good compile times. Here you see compile-time comparisons; notice that the scales are very different in these three plots. If you compile a query with between zero and 2,000 joins, with LLVM the compile time goes up to 150 seconds. If you force LLVM to do it as cheaply as it can, you still need four seconds. And if you use our own emitter, we need, I don't know, 35 milliseconds or so. So this is much faster than what the others are doing.

Yes, I understand. So you're not generating LLVM IR, you're generating your own bytecode; and then are you emitting assembly and running the assembler, or are you still running LLVM? No, we have different backends. After we have generated this IR, we have different choices of what we call a backend; we can lower our own intermediate representation into different backends. One is a backend that directly emits machine code, and note that we skip the assembler: we really emit machine code, the byte stream.
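For context on the inverse-Ackermann remark: that near-linear bound comes from the union-find (disjoint-set) structure used in Tarjan-style loop and interval analyses. A minimal sketch of that structure with path compression and union by rank follows; it is the source of the bound, not Umbra's register allocator itself.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

class UnionFind {
public:
    explicit UnionFind(std::size_t n) : parent(n), rank(n, 0) {
        std::iota(parent.begin(), parent.end(), 0);   // every node starts as its own set
    }
    std::size_t find(std::size_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];            // path halving keeps trees flat
            x = parent[x];
        }
        return x;
    }
    void unite(std::size_t a, std::size_t b) {        // union by rank
        a = find(a); b = find(b);
        if (a == b) return;
        if (rank[a] < rank[b]) std::swap(a, b);
        parent[b] = a;
        if (rank[a] == rank[b]) ++rank[a];
    }
private:
    std::vector<std::size_t> parent, rank;
};
```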
This one has super low latency, and it has quite okay performance, but it's not as fast as the code that LLVM generates with O3; that gives better code. So you have a trade-off here between how much time you spend on compiling and the execution speed. Therefore, if we realize that this part of the query really does a lot of work, then we start the LLVM backend: we lower our own representation to LLVM and compile it there, but only if we think it is worthwhile. And for most parts of a query it is not worthwhile, because of this trade-off between execution speed and compile time, so we only do it where we would benefit from it.

All right, Madden, you want to ask your question? Hi. So I'm a huge fan of vectorized runtimes, so I'm just wondering, would a vectorized runtime do any better in this context? From my understanding, we're trying to optimize for IO stalls in our query plan, so if we had a vectorized engine, would that be in some sense easier or better to use here? I think for IO it does not matter a lot; perhaps it's a bit easier because you can predict what will happen next. But note that one reason why we want to compile instead of just vectorizing is that you also want to run OLTP workloads, and for OLTP workloads compiled engines are really significantly faster than vectorized ones. Vectorization is fine if you are doing OLAP; you can build a beautiful OLAP system with vectorization, no problem. But if you try to run OLTP workloads, then the vectorization overhead becomes noticeable, because often you touch only a handful of tuples, and then you don't want to pay the price of vectorized execution. And even though I said we want to handle the IO case well, we still assume that most of the time we will be in memory; we assume that IO is an outlier. It should still work and still give good performance, but we optimize for the in-memory case. Right. And so do you cache the query plan to avoid recompiling? Actually, compile time went down so much that we no longer do this. We did this at some point, yes, but really, we need 35 milliseconds for a query with 2,000 joins, and very few queries have 2,000 joins. So compile time is not a serious problem for us, and for this reason we actually stopped caching plans. Fascinating, thank you.

Okay. So, nearly out of time, but just a few numbers to show you. This is a comparison between Umbra and HyPer, so our own systems, and MonetDB. The left one here is the Join Order Benchmark: we see that on average we are about three times faster than HyPer and four times faster than MonetDB. And on TPC-H we see that on average we are 1.7 times faster than HyPer and 1.9 times faster than MonetDB, but with larger error bars. Note, as full disclosure, that the reason why we win over HyPer is not because our runtime system is faster; it is perhaps a bit faster, but the real reason we win is that we have better statistics and we generate better plans. HyPer is a very fast system, and if the data fits in memory it gives excellent performance, but we also did a lot of work on the optimizer, and in fact we happen to construct better plans. So the numbers here should not be over-interpreted; it is not easy to compare systems. Basically, I just want to show you that we get essentially the same performance as an in-memory system. Note that this is measured in memory, with a warm cache, so this is the in-memory case, and there we get the performance of an in-memory system.
This was our goal. Now, what about when the data does not fit in memory? Then it is easy to figure out where we are: how fast are we, did we basically reach the bandwidth that we could get? We did a comparison with an mmap-based approach, just to see what overhead we have. If you run single-threaded, we see that we get basically the same performance as mmap; on average we pay something like a 6% slowdown single-threaded. And if we measure the cold-cache case, then we get basically the full read bandwidth of the SSD in both cases, if you use just a handful of threads. Note that in practice mmap, even though it looks good here, is not a viable option. Mmap is a terrible choice, and Andy has written extensively on this. In particular, if you run the same experiment not with 16 threads but with 200 threads, performance is awful, because they are all fighting for locks in the kernel; performance is terrible. So you cannot use mmap for highly parallel reads, it just does not work. Here I just want to show you that, in the end, we reach the performance of the IO device if we are in an out-of-memory scenario. Okay, so to conclude: I hope I could give you at least some idea of how Umbra tries to offer the best of both worlds. We achieve the performance of an in-memory system if the data fits in memory, and we are as scalable as a disk-based system if we have to go beyond memory. Some key things we had to get right: one is of course the buffer manager, the second is our execution plan model and the adaptive compilation framework, and finally we did a lot of work on getting statistics and string handling reasonable even if the data is beyond memory. Okay, thanks for your time.

Okay, awesome. I will clap on behalf of everyone else. You know, I told Thomas he had 45 minutes, and he's at 45 minutes on the dot; very German. So we have plenty of time for questions. If you have a question, unmute yourself and go for it. All right, Todd, go for it.

Hi, I had a question about the buffer manager and your B-tree implementation. Do you use a fixed page size for the internal and leaf nodes, or a fixed tuple count? And do you try to target being just under a size that's a power of two in order to reduce internal fragmentation? How do you manage those trade-offs? Yeah, in fact, in the B-tree, I'm sorry, I should have had a picture of the B-tree. In the B-tree, for the leaf pages we aim for 64-kilobyte sizes; that is our preferred page size, and we fill it up to 64 kilobytes. Now, if you give me a tuple that is larger than 64 kilobytes, then of course I will use a larger page, because there is no choice. So you can insert a string of two gigabytes if you want, and then you get a two-gigabyte page. But we prefer 64-kilobyte pages at the leaf level. For the inner nodes we always get 64-kilobyte pages, because our keys are tuple IDs and we can guarantee that we never overflow. So inner pages are in fact fixed size, but leaf pages are 64 kilobytes if we can get away with it, and otherwise as large as they have to be for your tuples. Okay. And underfull pages are still 64K, you don't use the smaller size classes for a... No, but it's not very likely that you get underfull pages, because we fill, well, perhaps a few... On deletions, do you do merges on deletion? Or...
Yeah, we merge on delete. So of course, if your tuples are, let's say, 33 kilobytes or so, then we get underfull pages, but otherwise we put as many tuples on there as we can; we fill the page with tuples. Okay. And then when navigating to tuples, following those pointers, do you use any prefetching tricks or anything, or do you just take the cache misses as you're traversing and binary searching? Currently the logic there is not very sophisticated. If I remember correctly, the first time you go down, you just go down and take the cache misses. But if you do multiple lookups, we remember where we expect to go next and prefetch that, and then the next page you access is cheaper to access. Yeah, okay, thanks.

All right, Chi, question. Yeah, so I have a question about query execution. I'd like to know whether you have a fixed number of threads running the queries, or whether you allocate new threads when more queries come into the process. We have a fixed number of threads. Note that we always have as many threads as you have cores, because I think it makes no sense to have more threads than cores; clearly they cannot run concurrently. So we have exactly as many threads as we have cores, and then we schedule queries onto threads. We have a scheduler that decides which thread works on which query, and then we hand that out. If you have only one query, of course, it gets all the cores, but if you have multiple queries, you share. Yeah, that sounds good. And then comes my next question: if you have a fixed number of threads, there may be some OLAP queries that occupy the CPU cores for a long time, and then some TP queries arrive. How do you decide the scheduling of these queries, so that the TP queries have low latency and the AP queries can still execute fast? Yeah, we use a decaying priority, which means the longer a query runs, the lower its priority gets. Therefore, if a new query comes in, it will take cores away from the long-running OLAP queries, and that gets the short queries through. Yeah, so when can you kick out a query? Do you place some explicit pause points in those state machines, or do you just randomly stop some threads? So basically, I didn't want to overload my slides too much, there is already too much on them, but if you look at these multi-threaded steps, they internally use morsel-driven parallelism, which means they process a few thousand tuples and then fall back into a kind of scheduling function. Most of the time, of course, you just pick the next morsel and that's it. But if somebody decides that your thread should now yield, then at the next morsel boundary you stop processing and switch to another query. So you have to wait at most... actually, the morsels are variable size, they aim for a time budget; we try to pick the morsel size so that they all have roughly the same execution time. But usually you have to wait, let's say, 500 tuples, and then you get the core. Yeah, thank you.
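A minimal sketch of yielding at morsel boundaries as just described, with hypothetical names: a parallel step grabs one morsel of work at a time and, between morsels, checks a flag the scheduler sets when another query should get the core.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <optional>

struct Morsel { uint64_t begin, end; };        // a small tuple range, sized for ~constant runtime

class MorselSource {                            // hands out morsels of one pipeline step
    std::atomic<uint64_t> next{0};
    uint64_t total, morselSize;
public:
    MorselSource(uint64_t totalTuples, uint64_t size) : total(totalTuples), morselSize(size) {}
    std::optional<Morsel> grab() {
        uint64_t b = next.fetch_add(morselSize);
        if (b >= total) return std::nullopt;    // nothing left to hand out
        return Morsel{b, std::min(b + morselSize, total)};
    }
};

enum class StepResult { Done, Yield };

// Worker loop for one parallel step: process morsels until the step is done or
// the scheduler asks this thread to yield; the step is resumed later.
template <typename ProcessMorsel>
StepResult runParallelStep(MorselSource& src, const std::atomic<bool>& shouldYield,
                           ProcessMorsel&& process) {
    while (auto m = src.grab()) {
        process(*m);                            // e.g. scan + thread-local aggregation of this range
        if (shouldYield.load(std::memory_order_relaxed))
            return StepResult::Yield;           // stop exactly at the morsel boundary
    }
    return StepResult::Done;
}
```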
You said you use only one thread per core. Does that mean that none of these threads ever block on IO? How do you make sure the device stays saturated? Yeah, this is indeed a thing; with this part we are not 100% happy yet. What we started with was just saying, hey, we are using very fast NVMe SSDs, it is an unlikely event that you block on IO, and if you block, you block; because these have pretty low latency, you can block synchronously. But if your device is not a very fast NVMe SSD, as we learned painfully, then you don't want to do this. Then you need another thread that does the prefetching and hands out the loaded pages to the worker threads, so that the workers never block for a long time. So if your IO device is very fast, you can block synchronously; if it is slow, then you have to have somebody else do that for you and just wake you up when the work is there.

Hi, Jason, question. It was along those lines: if you're using NVMe PCI Express SSDs, could you redirect the IO to the threads that are running on the cores directly attached to those particular PCIe slots? Oh, that's a good idea, actually; we didn't implement this. We did something similar in the HyPer system, where we prefer to read memory from the local NUMA node; we schedule threads so that they prefer to read from the local node. We could do the same for disk, of course, but we haven't implemented it. But it's a good idea.

I was wondering, how does the database know whether to use LLVM or Flying Start to compile a query plan? Is there some statistic that helps predict what the speedup of optimized LLVM would be before compiling? Basically, what we do is, within each pipeline, we keep track of the progress. The driver, for example the scan of supplier, must tell us which fraction of supplier it has read, at least an estimate of it. So we know how far we are, and this allows us to predict how long we will need to finish. And then we have a model of the compile time of LLVM, basically a polynomial function that tells us how expensive it is to compile this, and also the expected speedup with LLVM; that was measured, of course, and there are some inaccuracies there, but we have a model that tells us how much faster we will be if we do this. If it is worthwhile, then we start the compilation and switch, and if we say no, the remaining running time is not significant enough, then we never switch and we just stay with our Flying Start backend. That's pretty cool, thanks. I mean, this is riffing on the adaptive execution stuff you were doing with HyPer, right, in the ICDE paper? This is an extension? Yeah, HyPer had this already, but in HyPer it was a bit different: there we just switched between a virtual machine and two LLVM modes. We stopped doing this, we now have a different starting point, so our initial backend is different from what we did in HyPer. I understand, okay, cool. Gavin.

This might be a dumb question, but I'm curious: if you have async IO capabilities, like io_uring, or maybe the Windows IoRing stuff, does that impact your ability to schedule things and do async work? In fact, we use this for the background thread that does IO; it uses io_uring to schedule the IO, and when the page is there, we can hand it to the worker threads. Oh, super cool, thank you.
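A minimal sketch of that switch-to-LLVM decision, with a deliberately simplified and hypothetical cost model; the talk mentions a polynomial compile-time model, a linear one is used here for brevity. The idea: extrapolate the remaining runtime under the cheap backend from the observed progress, and switch only if compiling plus the faster remainder is expected to finish sooner.

```cpp
#include <cstddef>

struct AdaptiveDecision {
    double expectedSpeedup;        // estimated ratio cheap-backend time / LLVM-code time (> 1)
    double compileCostPerIrOp;     // rough linear model of LLVM compile time (seconds per IR op)

    // elapsedSeconds: time already spent in this pipeline with the cheap backend.
    // progress: fraction of the pipeline processed so far, in (0, 1].
    // irOps: size of the step's intermediate representation, drives compile cost.
    bool shouldSwitchToLlvm(double elapsedSeconds, double progress, std::size_t irOps) const {
        double remainingCheap = elapsedSeconds * (1.0 - progress) / progress;  // extrapolate
        double compileTime    = compileCostPerIrOp * static_cast<double>(irOps);
        double remainingFast  = remainingCheap / expectedSpeedup;
        // Compilation runs in the background, but it only pays off if the pipeline
        // still runs long enough afterwards to recover the compile cost.
        return compileTime + remainingFast < remainingCheap;
    }
};

// Example: 2 s elapsed at 10% progress, 5,000 IR ops, 4x speedup, 0.5 ms per IR op:
// remaining ~18 s cheap vs ~2.5 s compile + ~4.5 s fast, so switching pays off.
// AdaptiveDecision d{4.0, 0.0005};
// bool doSwitch = d.shouldSwitchToLlvm(2.0, 0.1, 5000);   // true in this example
```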
All right, Todd, go for it. I had a quick question, switching gears a little bit, about the PAX encoding that you use on leaf pages. I'm curious how you navigate the tension between dense encodings and low runtime CPU overhead for decoding those encodings. Do you find yourself using very lightweight ones, or denser but heavier-to-decode formats in memory at any point? Note that this PAX layout is really optimized for an OLTP workload, so updates and so on, everything has to be cheap. So it is basically not compressed, but of course the layout allows you to do cheap scans; it is not a row layout, it does allow for cheap scans, but nevertheless it's not heavily compressed or anything like that. Okay, so you're not using any run-length encoding or anything within the PAX page? We have a different backend for OLAP; there we do this, but not for OLTP, no. Got it, okay, thanks.

And then Robert. Thank you. Yes, since you mentioned OLAP, what do you think about non-CPU execution resources? The CPU is kind of tapering off in frequency, and even core counts have their limits; what about GPGPU, or even the CPU SIMD instructions? This is difficult. First, these are different things. SIMD instructions, of course, we use ourselves, too: if you do a scan over a table, we use SIMD instructions for the filter; that is regular stuff. Now, if you are asking about using GPUs for acceleration, we tried this a few times, but we always had the problem that if the data is larger than your GPU memory, which usually is the case, we have to transfer data in and out of the GPU, and this tends to ruin the performance advantage that you get from the GPU. That is not true in all cases; sometimes it is still a win, but in other cases the gains were not impressive enough for us to do this in our main system. So we have some experience with this, but we don't use it in production.

All right, cool. That was a lot of great questions. I guess my last question would be: is Umbra still Postgres compatible, and what is the level of compatibility? Is it just the SQL grammar, the catalogs, other functionality? How compatible is it with Postgres? First, it uses the same grammar, and it uses the same wire protocol, so you can connect to it with a Postgres driver. Now, if you are talking about the meta tables, that is another question. I think we expose pg_catalog and some other tables too, but of course these are synthesized: if you query them, you get data that looks like what you would expect, but we don't expose everything, we only expose what we needed for some drivers. If there were a need, we could of course add any tables that you need, but I think currently we expose only three or four tables. Got it, okay. All right, Robert, one last question. I only had the one, sorry. Go for it, are you all done? Okay. All right, so I think we're done here. Thomas, thank you so much for doing this, I really appreciate it. This is fantastic stuff.