Hi, everyone. Welcome to the most politically incorrect session of this conference. My name is CK, and we founded Vitesse Data about six months ago. So the company is six months old, and we have just released the first GA of our product. What I want to do is first give some background on the technology landscape, then talk about TPC-H Q1, which is the query we base our performance claims on, just to familiarize you with what it does. Then we'll talk about why we are 100% compatible with Postgres. And finally, we'll go through a few iterations of how we got the performance up to 180 times.

First, the background. There are two founders at Vitesse Data, and we are ex-VMware, ex-Greenplum, Microsoft SQL Server, ex-Informix people. I think the only thing missing here is some Oracle experience, but otherwise we are covered.

We all know that Postgres is amazing. A database system is not just about SELECT queries; it has a whole lot of stuff surrounding it: authentication, authorization, and so on. Postgres provides all of that, so we don't need to redo it. If we had to redo the whole thing, it would take us 10 years. Instead, we can focus on just one thing: making it go faster.

What is interesting recently is that the hardware has changed a lot over the past few years. Before, you were always optimizing for disk performance, because it makes no sense to compile and run the query much faster in memory if you are limited by disk. But these days, you are no longer limited by disk. Probably many people here have databases that are smaller than their main memory, so there's essentially no disk I/O going on in those systems. The bottleneck has completely shifted: it's no longer the disk controller, it's now the memory controller, between the processor and RAM. So that's one.
The other thing we see is a lot of cores coming online. We're going to see probably 100 cores this year; I don't know, by 2020 we'll probably have 500 cores, and you want to make sure you are able to utilize those cores, otherwise you're going to run into a problem. The only way out of it then is probably running virtual machines: you run a few instances of Postgres on top of a 500-core machine and try to get performance that way. But if you can make use of the hundreds of cores, why not?

So this is TPC-H Q1. For those who are not familiar, it's basically a scan on the lineitem table, which has 6 million rows for the 1 GB benchmark and 600 million rows for the 100 GB benchmark. It has a filter, a hash aggregate, a sort, and then a lot of projections and aggregates: a lot of sums, a lot of averages, and some counts. This is what I see data scientists do day in, day out. They just want to count features: look at the features and try to see which feature in the data set affects the outcome of what they're predicting. And they do this live; they sit in front of the terminal and type it. So if this query takes two hours to run, you're just wasting your time. You need to make this really fast for the data scientists.

So first the numbers. With VitesseDB Standard Edition, we're able to run this query eight times faster than standard Postgres, and on average we run about 3.7 times faster across the whole benchmark. We'll go into how we do that later. With our Enterprise Edition, we are able to go up to 180 times faster for this particular query, and on average it's about 27 to 37 times faster, depending on whether you use the column store or not. Questions?

Okay, so next let's talk about how we are 100% Postgres compatible. We base this claim quite simply on the algorithm we use.
You submit a query to Postgres and it goes through the parser. The parser passes the AST to the planner, which comes up with a plan. We take this plan and determine whether we want to JIT it with LLVM. There are a few cases where we decide not to. First, if the query is very simple: there's a cost associated with compiling the query, so for a very simple query it may make sense to just let Postgres run it, and we throw it to Postgres in that case. Second, we handle reads only, not writes. So if it's an INSERT, UPDATE, DELETE, or CREATE TABLE, we also give it to Postgres; we don't change your data, only Postgres changes your data. And finally, if the query uses some construct that we don't support yet — for instance, we don't support GIS data types or queries yet — we also give it back to Postgres. Otherwise, we compile the query and run it to completion. So in this sense it's 100% compatible, because anything we don't run, Postgres takes care of.

Yes? [Audience: On the previous slide, you said you have column storage.] We're using our own implementation; we'll talk about it later. So in that case, we are not 100% compatible when you use that.

Now, if it's 100% compatible, you can imagine a use case like this. You run your OLTP system with Postgres, you don't change a thing, and Postgres gives you a logical replica: you run VitesseDB on another machine with logical replication. Now you basically have the best of both worlds. You have the OLTP system you use for your everyday queries, and you have this other system on the side that you can give to your data scientists, so that when they run a huge query, it won't impact your production. So it's very important for us to maintain 100% compatibility.
Or, if you don't want to do replication, you can basically replace your primary. You move your Postgres binary aside, rename it, put our binary in its place, restart, and it just runs. You won't even feel a difference. We're doing very well on time.

So let's see how we get to two times faster. Imagine you don't compile the whole query; you compile only the expressions. If you look at the query tree, you're doing scans and joins and selects, but there is a lot of expression evaluation in there, and you compile only the expression evaluation. Look at this expression: just evaluating A > B + C. For those familiar with Postgres internals, this will look familiar. Postgres is going to call int4gt for the comparison. Then it's going to do an ExecEvalScalarVar, where it goes and retrieves the value of A, and then it checks whether A is null; if it's null, you just back out. Then it makes another call, to int4pl, where it adds up B and C. Again, it calls another function, an indirect call, to retrieve B, another call to retrieve C, and then it checks for overflow and does the plus. And finally, it does the comparison. Make sense?

So there are five indirect calls, and this is how Postgres makes an indirect call: it's a direct function call that performs the indirect call for you. You pass a pointer to the function and it calls it for you. So now you have one function call that makes an indirect function call, and those five indirect function calls become ten calls. And it has to fill in a structure before it can make the indirect call. And here is int4pl and what it does: it simply takes two arguments, checks for overflow, and returns the sum.
If you do it in LLVM, this is what you do. You load A, and you load whether A is null or not; you go through the whole thing for A, B, and C. Then you do the addition, and the overflow is checked by the machine, by the hardware, in assembly. And you don't even need a conditional to check for nulls: you just OR the null bits. If any one of them is null, the result is null. So you get the sum, then you do the comparison against A, and again you OR the null flags, and you just check whether you overflowed. If you didn't overflow, you return whether A is greater than B + C, together with the null flag; if the null flag is set, the caller takes care of it. You just do the operation, whether it's null or not. That's how it goes a lot faster than the C code, because the C compiler cannot make these assumptions.

So in summary: we use LLVM to JIT expressions only. We still interpret the plan tree — we'll go through that more later. A lot of the indirect function calls become direct calls, or we simply inline them. And when you're compiling, a lot of variables become constants: things that are variable in the Postgres plan are known constants at compile time. For example, you know this variable is an integer, so you know the length is four; if you want to skip over it, you just add four. You don't need to say "if this is an integer, add four" during evaluation. Also, the Postgres executor has a lot of if-conditions in expression evaluation: if this is an integer, do this; if this is a string, do this; if this is a numeric, do this. But when you're compiling, in our case, you know exactly that this is an integer, so you can skip all the others.
You evaluate only the code for the part that matters to your data type. What this does is make the code a lot more concise, and the machine code you generate is very small compared to what Postgres has to go through; it may well fit into the L1 or L2 cache. And finally, for overflow and things like that, the hardware does a lot for you, and you can also use int128 and so on. Wow, in 16 minutes we got to eight times. Okay, good.

So how do we get from two times to eight times? When you submit the Q1 query to Postgres, you get this kind of query tree: a scan node, a filter, a hash agg, and finally you project and emit the answer. The way Postgres evaluates it, each of these is a pull: a get-next. It's interpreted; it goes through the tree from top to bottom in an interpretive fashion. Control goes to the emit node, which does a get-next to its child, the hash agg, which in turn does a get-next-tuple to the filter, which in turn does a get-next-tuple to the scan. So you have already made four calls, and then the tuple goes back up the other side.

On the right, you can see the pseudocode that describes how Postgres evaluates this query. The emit node does a while loop to get the next tuple from the hash agg, and for each tuple it gets back, it emits it. The hash agg's get-next, in turn, does a while loop over the filter, and for each tuple it gets back, it hashes it and advances the aggregates.
And finally, when there are no more tuples coming in from the filter, it goes through the hash table, finalizes all the aggregates, and returns them one by one. Same for the filter: it gets a tuple from the scan, qualifies it, and returns it. And the scan gets tuples from the disk. It's always get-next; it's a pull model.

What we do instead: if you compile it, the whole query can become one function. You just call it and it runs from beginning to end. Instead of a pull model, it becomes almost a push model. If you look at it here, the blue part runs first: the scan runs, gets tuples from the relation, and pushes each one up to the qualify step, the code you use to filter. The filter in turn pushes it up to the hash agg, which builds the hash table and advances the aggregates. After you have gone through the relation completely, you have a hash table of all the aggregates; you go through the hash table, finalize each aggregate, and call emit.

Now, if you think about the execution model, it's very different. In the first model, you go from emit down and then back up; in this model, you start at the bottom and the tuple gets pushed up. One important thing we do here is in the scan: when it grabs a tuple, the first thing it does is deform it, and at the moment it deforms the tuple, all the columns are actually in registers. We know exactly which column is in which register.
And while they are inside registers, you call the qualify code, which looks at the registers immediately. You don't need to put the tuple in memory and make a call into Postgres to go to the next node; you just take the register, do a compare, and kick the tuple out if it doesn't qualify. That's where some of the speedup comes from.

The other thing is the scan itself. Because we compile the query, we know exactly which table you are scanning. In Postgres, you have one general heap-deform function that works for all tables. In our case, for each table — actually, for each scan — we make a special deformer just for that scan. It doesn't make any decisions in there. It knows exactly that the first field is an int, the second field is a float, the third field is an int, and it just goes through them. There's no while loop with "for each column, if this is an int, take it out and advance four bytes; if it is a float, take it out and advance eight bytes." It's just straight-line code.

[Audience question.] No, LLVM doesn't do any of that for you. LLVM is basically an assembly-language processor; you have to write the program that generates the assembly code. Any more questions? No? Okay.

So if you do this, you get eight times. And another thing: if you look at these two models, the first one is a software model. People designed it because it's easier for software programmers, database programmers, to think about a query: you make it a tree and each node does its own thing. But there's no reason you need a software model when you are running on a processor. You basically need to dissolve this model and generate code that works better for the processor. You understand?
Right, so the thing we do there is dissolve those boundaries, and things just run much faster and smoother for the CPU.

The next two steps are the easy ones. What you do here is just use multi-core. You got eight times, and if you have 16 threads, you're going to get more than 100 times faster. How we do this comes from our Greenplum experience; it's almost like a mini Greenplum inside each Postgres. Each thread is like a segment in Greenplum, and you basically run a parallel query inside Postgres.

First you do a threaded scan. The table is this big; you let each thread take a portion of it, and they scan the table sequentially but in turns. Each thread grabs a slice, the next thread grabs the next slice; with 16 threads, they compete and take 16 slices at a time, and then you go again. Each thread looks at its slice, takes each tuple out, does the filter, does the hash agg. And here, after you advance the aggregate, you must write to memory; there's no way around it. That's why there's a line there — it's a synchronization line. Every thread hashes and writes to memory, making a hash table, and at the end one thread takes over from the hash agg to do the top hash agg: every thread has its own hash table, and the top hash agg takes the hash table from each of these threads and merges them into a final hash table. And then you can thread again on the way up: you can go through the hash table by again launching 16 threads, each thread looking at a distinct section of the hash table, and go up again in parallel.
So in summary — oh, one thing that can go really, really wrong if you're not careful: threads needing to share things. Threads must be share-nothing for you to go fast, because if they share anything, you need synchronization, and when you synchronize, you don't really know where the threads are running. You may have four sockets, four processors, and if a thread on processor one needs data from a thread on processor four, you invoke a very high-cost transfer from the cache on processor four over to processor one. So they are all share-nothing threads, you pin them to a socket so they don't jump around, and each one works on a distinct bucket of the table's tuples. There are a few sync points that you need to worry about, but not for hash join: once the hash table for a hash join is built, each thread just takes a tuple, probes, and if it joins, sends the tuple up. The threads never need to sync; it just goes straight up. Questions? Okay.

To go to 180 times, again, it's just physics. Now we add the column store. For Q1, instead of going through all 16 columns — one of which is a comment column that is quite big — you know at compile time exactly which columns you need. You mark them as needed and access only those columns; you don't touch columns you don't need. That essentially cuts down the number of bytes you have to push through the processor, and that's why the column store is physics.

We're done. Okay, I want to leave you with this picture: if you have a Ferrari, don't make it go through a lot of speed bumps. Right?
So it's no wonder that today, when you buy a more powerful machine with a faster processor and put it in, you don't see any performance gain: it's because it's choking in Postgres. Here are some references that we used to make the product; if you're interested, check them out. That's it. 30 minutes.