The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Welcome to another Vaccination Database Tech Talk. We're excited to have Vlad Ilyushchenko with us today. He is the CTO and co-founder of QuestDB. QuestDB has been on my radar because a lot of the cool things we teach in the advanced database class, they're actually implementing in their system. So I'm really excited to have him talk about QuestDB today. Before you start, Vlad: if anybody has any questions during the talk, feel free to unmute yourself and fire away at any time. We want this to be a conversation with Vlad, not him talking to himself for an hour in the dark. Vlad is in London right now, where it's 9 p.m., so we appreciate him staying a little later in the office with us tonight. Vlad, the floor is yours. Go for it.

Thank you very much, Andy. Thank you for having me. I'm really excited to be here. Today we are going to talk about how we built QuestDB, which is a fast database for time series. A little bit about myself: as Andy mentioned, I'm the CTO and co-founder of QuestDB. I am from Ukraine and I live in London, UK. My background is low-latency software engineering; I've been writing code for 25 years and that's still going strong today. QuestDB started as my passion project in 2014, in a shed, so to speak. The dream was, and still is to this day, to build the fastest real-time analytics database for vast and unbounded amounts of data, which is what time series are. QuestDB first went into production in 2018, and in 2019 we started a company, and the dream became to build the best company that builds the best product. By best company I mean a place where everybody can thrive and realize their wildest ideas, in a database sense.

So why is QuestDB interesting? I'll start with a bit of controversy here and say that we set out to write a high-performance time series database in Java. You wouldn't expect that, right? Everybody these days would go Rust, I'm sure, or C++ if they're a bit older. Thanks to my background in banks, I learned that Java can be made to run quite fast. It is a little bit tricky to do, but this is exactly what we have done. And if you've already learned something about Java, you will have to unlearn it when you look at the QuestDB code base on GitHub, because it really is unlike any other Java project. It's zero-GC. It doesn't have any Java dependencies; it's a very lean, C-like code base. We use off-heap data structures throughout, we have really low-overhead integration with C++ code and libraries, we use SIMD instructions, and we actually wrote our own JIT compiler. It's not the Java JIT, it's our own. Of course we do use some C++ libraries, but not that many: the Vector Class Library by Agner Fog, which is a pretty cool library for SIMD, and the no less cool asmjit for the JIT.
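To give a flavor of the zero-GC, off-heap style just described, here is a minimal sketch, assuming access to sun.misc.Unsafe. This is my illustration of the general technique, not one of QuestDB's actual classes:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Minimal sketch of an off-heap long "array": memory is allocated outside
// the Java heap, so the GC never sees it and never pauses to scan it.
public class OffHeapLongs {
    private static final Unsafe UNSAFE = getUnsafe();
    private final long address;   // raw pointer; can be handed to C++ via JNI
    private final long capacity;

    public OffHeapLongs(long count) {
        this.capacity = count;
        this.address = UNSAFE.allocateMemory(count * Long.BYTES);
    }

    public void set(long index, long value) {
        UNSAFE.putLong(address + index * Long.BYTES, value);
    }

    public long get(long index) {
        return UNSAFE.getLong(address + index * Long.BYTES);
    }

    public void close() {
        UNSAFE.freeMemory(address); // manual lifetime management, like C
    }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```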
So what is the problem we are solving? It is that of providing best-in-class ingestion performance while not compromising read performance at the same time. It sounds like a bit of an oxymoron, but this is what we set out to do. All ingested data is available for queries in near real time. What areas is this important for? It's important for nuclear reactors, as it turns out. It's important for crypto markets and trading, and not limited to crypto; finance, fintech, that sort of thing. It's useful for geospatial tracking, and of course it's very useful for metrics, the classical use case for time series databases.

To illustrate our ingestion performance: this is the TSBS benchmark, and we score quite high, as this chart illustrates, but at the same time our query performance is not bad either. What you can see here is a benchmark that specifically benefits from the storage layout we have. This benchmark uses the JIT and shared-nothing parallel execution; in particular this applies to the WHERE clause filter. To be fair, the shared-nothing parallel execution is not yet released; it's sitting on a branch and we are stress testing the living hell out of it for now. ClickHouse, one of our main competitors performance-wise, seems to do something similar. I suspect they also employ parallel execution, but perhaps their JIT is slightly less efficient than ours; it's hard to tell. What's telling here is the TimescaleDB behavior. That benchmark uses a composite index, so this is the index performance of a typical RDBMS versus performance on QuestDB. That's the kind of performance comparison we're into.

Let's dive a little further into what QuestDB looks like at a glance. This is a 10,000-feet overview of our instance structure. As you can tell, we are proponents of component-based architecture. Here ILP stands for InfluxDB Line Protocol server; this is an ingestion endpoint we have. REST is an HTTP server that provides the REST API and other HTTP server functions. Postgres Wire is our implementation of the Postgres wire protocol. All of these are written from scratch; there are no libraries used anywhere, it's all dependency-free. Perhaps each of these components deserves a deep dive of its own, but in this talk we're going to focus on only two of them: the storage engine and the SQL engine. The storage engine we are going to look at from the viewpoint of data ingestion, whilst the SQL engine is egress: it illustrates the read-only algorithms and how we work with our data model on the way out. But on to...

Maybe you can answer this: where is the division layer between the Java and the C++ parts? For both the SQL engine and the storage engine, is the API coming in Java, and then you go down through JNI to C++?

It's throughout; it's nowhere in particular. For example, for networking we implemented our own scheduler for connections that wraps epoll, kqueue, select, that kind of stuff. That's C++, just on the network layer. On the storage engine, all of the file access is abstracted so that we don't use anything from Java to access any of the disk functions. In the SQL engine, some of the implementations, which I'll touch on later, do use C++, because we use off-heap memory, so we effectively pass pointers from Java to C and the other way around. Some of the algorithms, particularly those that use SIMD and the JIT, are C++; some of them are Java. It's a mixture. We have the freedom to move pretty much any algorithm to C++ or Java, depending on where it works best.
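As a rough illustration of that pointer-passing pattern across JNI (hypothetical names and library, not QuestDB's real API): the Java side owns off-heap buffers and hands raw addresses to C++, so no Java objects cross the boundary.

```java
// Hypothetical sketch of the Java/C++ boundary: a native method that takes
// raw off-heap addresses. The C++ side can cast the address to (double*)
// and run a SIMD loop over it; nothing is copied across the boundary.
public final class NativeSum {
    static {
        System.loadLibrary("nativesum"); // assumed native library name
    }

    // C++ sees roughly: jdouble sumDoubles(JNIEnv*, jclass, jlong addr, jlong count)
    public static native double sumDoubles(long address, long count);
}
```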
But Java is a little easier from that standpoint; there's a little more structure to it, generally. For example, in the SQL engine, and I don't want to bang on about this too much, but just one more second: we have an expression parser, so we can evaluate arbitrary arithmetic expressions, functions, things like that. That part is implemented mainly in Java. The JIT implements slightly simpler arithmetic: it doesn't call functions, but it can operate arithmetically on vectors, basically arrays. It can add arrays, subtract them, that kind of thing. So the Java/C++ split is everywhere, drawn arbitrarily as we need it. Does this help? Yes.

And should I think about the whole box, all three layers, in a typical deployment of QuestDB, as a single JVM, or...?

Yeah, for now this is a single instance. I'll touch on our 2022 plans to build horizontal scalability, where this data is going to become replicated, but for now it's just one process. Thanks. Cool.

Right, storage. I don't know a better way to start talking about storage than types. The key thing is that the types we store on disk have the same representation in memory. Think of it this way: if you have a C++ array, that array would look pretty much identical on disk. Why is this important? This is how we avoid type conversion between the end-user API and the storage itself. Data is stored in a tabular format; columns are stored separately from each other; values in columns are stored densely. And although QuestDB is a time series database, it's important to note that we think of data conceptually much like a relational database does. Tables are internally, and optionally, partitioned by time. This is done to help the SQL engine reduce IO. It's a common problem with time series workloads that old data becomes less important and has to be offlined; some people offline it to /dev/null, some offline it to cold storage. Time partitions help by becoming the units of data movement in the system. Partitions also help with ingestion: in most cases writers only have to concern themselves with the last partition, which helps the writer keep a constant footprint during intense 24/7 ingestion workloads. Data ordered by time helps identify time intervals faster, which also reduces IO. And our storage supports a snapshot feature to provide read consistency; the snapshot units are typically either partitions or columns.

Speaking of types, and jumping slightly ahead of myself: we store column data in separate files, and columns never share the same file. There are three kinds of data types, or that's the way we look at them. Fixed-size types make it easy to compute offsets into the column files to find values; they use a single file per column. Variable-size types use two files per column, and I'm going to talk about that a little later. And we also have a symbol type, which is a dictionary-based type. You can think of it as a dimension table with an ID and a name; QuestDB inlines such a table, storing the IDs on disk and displaying the names to SQL queries. So let's dive a little deeper into what these types look like on disk.
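Before we do, a hedged sketch of a table declaring all three kinds of columns; the table and column names here are mine:

```sql
-- Hypothetical table: 'price'/'qty' are fixed-size (one file per column),
-- 'note' is variable-size (two files: data + offsets), and 'ticker' is a
-- dictionary-encoded SYMBOL stored on disk as integer IDs.
CREATE TABLE trades (
    ts     TIMESTAMP,
    ticker SYMBOL,
    price  DOUBLE,
    qty    LONG,
    note   STRING
) TIMESTAMP(ts) PARTITION BY HOUR;
```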
So this is effectively the directory structure on the file system. Directories are time partitions; files inside directories are columns. These vertical boxes, if you can see my mouse cursor, are files on disk, and the files contain arrays of data. In this example we have both fixed- and variable-size columns. Fixed-size columns, as I said, make it easy to locate values by offset. Variable-size columns are a little more involved, but not that hard to deal with either. A variable-size column is a pair of files, and the two files share the same name. The i-file contains offsets into the d-file: the d-file contains the data pieces, and the i-file records where each piece begins. We also use an n-plus-one trick, where n is the number of rows in the column: the extra entry in the index file stores the append offset of the data column, so lengths can be calculated quickly. The data itself is also length-prefixed, which helps reduce some random disk reads.

Let's look at partitioning a little closer. Assume we have a trades table, and this is actual syntax: it's partitioned by hour. On the left is the new data coming into the table, and for simplicity the timestamps are mixed; we show just the hour, and it's a mix of 7 p.m. and 6 p.m. When data goes in, we sort it by time, we partition it, and the data is sent to the on-disk partitions after that.

Now let's have a look at exactly what we do when data goes into the on-disk partitions. There are two data paths, one fast and one slow. The fast data path is append. It's by far the simplest in terms of implementation; you can imagine how hard it is to append to an array, so it's pretty trivial. The other data path is merge, which is a lot less efficient. To choose between these paths, the new data is analyzed, and what we're looking for is to establish how the new data overlaps the existing data. The goal of this analysis is to eliminate most of the merge, or all of it if possible. After the data is analyzed we create what we call copy tasks, which can be either merge or append, and these tasks are executed concurrently. To understand the level of concurrency here: during a merge we usually take a single column and split it into three tasks per column, within each partition. So if you update multiple columns across multiple partitions, that's a swarm of small tasks, and all of them are executed on a shared-nothing basis. What's interesting to note here, and this goes for pretty much all of our concurrency code, is that our concurrent code does what's called work stealing. In essence, under pressure, if CPU is not available, the code that is about to publish tasks will also process them if they're not being consumed from the queue. So we never waste time waiting anywhere in the code; there's never a situation where we publish something and sit there waiting for the published task to be picked up.
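Picking up the i-file/d-file layout described a moment ago, here is a minimal sketch of reading one value back using the n-plus-one offsets trick. This is my illustration, not QuestDB's code, and it reads via heap buffers for simplicity rather than memory mapping:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: read value #i of a variable-size column stored as
//   name.i -> (n + 1) little-endian longs: byte offsets into name.d
//   name.d -> concatenated value bytes
// The extra (n+1)-th offset is the append offset, so the length of
// row i is always offsets[i + 1] - offsets[i].
public class VarColumnReader {
    public static byte[] readValue(Path iFile, Path dFile, long i) throws IOException {
        try (FileChannel idx = FileChannel.open(iFile, StandardOpenOption.READ);
             FileChannel dat = FileChannel.open(dFile, StandardOpenOption.READ)) {

            ByteBuffer two = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
            idx.read(two, i * Long.BYTES);   // reads offsets[i] and offsets[i + 1]
            two.flip();
            long start = two.getLong();
            long end = two.getLong();

            ByteBuffer value = ByteBuffer.allocate((int) (end - start));
            dat.read(value, start);
            return value.array();
        }
    }
}
```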
There's a question in the chat; you can unmute yourself. Can you hear me? Yes, go ahead. Okay, there were several questions; let me find them. One is about the performance numbers you showed: are they on a single node? Dual-socket, I assume?

Let me just go back a second. Yes, this is a single node, and we use all available CPUs to compute, for both ingestion and queries. Next question.

The next question: it's a single JVM process, but it's a multi-threaded process? That's right. All right. Then there was the question about handling out-of-order timestamps.

Yeah, so what about out-of-order writes? Say some of the rows show up too late, after you have already ingested the previous partition.

Out-of-order data is dealt with; this slide alludes to the out-of-order handling. Merge is what happens when you have out-of-order writes. We heavily optimize for data being slightly out of order, and by slightly I mean, for example, if you're publishing from multiple sources, within the clock skew of the sources: a few seconds out, or milliseconds, that kind of thing. If data comes in like that, we sort it in memory before it hits the disk, and the write converts to an append. If data comes in heavily out of order, for instance you receive data today from yesterday or the day before, it's still going to go in; it's just going to go down the slow path, and we're going to have to merge the existing data, or part of it, with the new data you ingested. So in a nutshell, any sort of out-of-order data is supported; it's just that if the data is too old and falls outside the optimization pattern, it will be slower to process. If we do around 1.2 million rows a second on the append path, on a really heavy merge path we could be doing 50K or something like that. It slows down, but it's still reasonable. Okay, cool. I don't know where I was.

I was going to ask one more question, actually. When you do the merge, is it copy-on-write or merge-on-read? That is, do you keep the delta, merge during the read, and then redo the partition in the background, or do you write it right away?

It's a good question, thank you. We do merge-on-write. The data that lands on disk and is available for read is already sorted and merged. So the performance penalty is on the write. Think about it this way: you pay this performance penalty once, when you ingest the data, and you never pay it again. Cool.

Right, so where was I? Yeah, this slide talks about out-of-order processing, a little simplified, but this is the idea. It's a heavily concurrent process; we try to do as many things as we can to saturate the disk or memory, and we try to optimize the merge away, by pointer manipulation pretty much.
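Going back to the fast and slow paths for a moment, the core of the decision can be boiled down to a sketch like this (a simplification under my own assumptions, not the actual implementation):

```java
import java.util.Arrays;

// Sketch of choosing the ingestion path: if the incoming batch (after an
// in-memory sort) begins at or after the last timestamp already on disk,
// the whole write degenerates into a cheap append; otherwise the
// overlapping range has to be merged with existing partitions.
public class IngestPathChooser {
    enum Path { APPEND, MERGE }

    static Path choose(long[] newTimestamps, long maxTimestampOnDisk) {
        Arrays.sort(newTimestamps);        // slightly out-of-order data is fixed here
        long minIncoming = newTimestamps[0];
        return minIncoming >= maxTimestampOnDisk ? Path.APPEND : Path.MERGE;
    }
}
```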
Now, versioning and snapshots. This is really simple. When we ingest data out of order, for example, we create a new partition directory, with the current transaction number as a suffix at the end of the directory name. When the partition is fully constructed and ready, we update the transaction file atomically, and readers also read the transaction file atomically. I do mention column versions here: they are not used by out-of-order ingestion, but they are used by updates. When there's a SQL UPDATE, we snapshot a single column and a new version is created.

I guess an eagle-eyed viewer will notice that we can't keep creating copies of these partitions forever, and they would be right: we would run out of disk space. So there's a storage engine feature to purge unused partitions. It's an asynchronous job that runs in the background and picks partitions to purge. How does the purge job know that a partition is unused? We use a fairly tricky shared-memory data structure we call the transaction scoreboard. The idea is that the structure holds the transaction numbers used by SQL queries in flight. As soon as the SQL query ends, and, for example, nobody needs a second pass over the same result, the transaction number is released and the partition becomes eligible for purging. Sorry, the light has turned off on me here; I don't know if you can still see me. Fine. Cool.

Do you support multi-statement transactions? Could you repeat that? Do you support multi-statement transactions? No. Not yet.

The whole idea is that we operate a single writer per table. Right now there's a one-to-one relationship between a table writer and the actual table the user perceives. In future versions the table writer will become a shard of a bigger table.

But can you call BEGIN, then do a bunch of reads and a bunch of writes, and then commit them all together at the end?

It's kind of half implemented. It's not available to use yet, but we do have a system in place to ensure that's possible. Multi-table transactions are done via a separate transaction manager. You enlist with the transaction manager the tables you want to conduct the transaction over, and the transaction manager uses two-phase commit: you call commit on the transaction manager, and it calls the two phases of commit on each table involved in the transaction.

I'm not talking about multiple tables. Can I have, you know, UPDATE table foo in three separate queries and then commit them atomically at the end? Like, call BEGIN, update, update, update, and then commit?

Yeah, of course. Sorry: multi-statement update on a single table is supported out of the box. Multi-table is not yet, but it will be; one small step at a time. It hasn't been requested in anger yet, but it can be supported.

I think the versioning of columns is slightly unique. The multi-versioning on columns is interesting; that seems unique. I don't mean it doesn't make sense for you guys, it's that nobody else does it this way, which I think is kind of cool.

Thank you. Multi-column versioning was really introduced by updates. If you have a fairly trivial UPDATE statement where, I don't know, you want to increment a single column, it just does not make sense to make a copy of an entire partition for that one column. That's where it came from.
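Returning to the purge job for a second, here is a toy version of the transaction scoreboard idea. The real one is a lock-free shared-memory structure; this sketch only illustrates the bookkeeping:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy transaction scoreboard: readers register the txn number they are
// reading at, and the background purge job may only delete partition
// versions older than the minimum txn still in flight.
public class TxnScoreboard {
    private final ConcurrentHashMap<Long, AtomicLong> inFlight = new ConcurrentHashMap<>();

    public void acquire(long txn) {
        inFlight.computeIfAbsent(txn, t -> new AtomicLong()).incrementAndGet();
    }

    public void release(long txn) {
        // drop the entry entirely once the last reader of this txn leaves
        inFlight.computeIfPresent(txn, (t, c) -> c.decrementAndGet() == 0 ? null : c);
    }

    // A partition version is purgeable once no reader can still see it.
    public boolean isPurgeable(long partitionTxn) {
        long minActive = inFlight.keySet().stream().min(Long::compare).orElse(Long.MAX_VALUE);
        return partitionTxn < minActive;
    }
}
```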
But actually, doesn't that mean that if you do a merge, and you change the relative order of a logical tuple, you have to update all the other columns at the same time? Because you're relying on offsets: if I want to get to the 100th tuple, then no matter what column I'm looking at, if I jump to the 100th position it has to be the same tuple.

Yeah, we don't have tuples as such, but here we're talking about updating columns that do not change the order of records in the table. Just that situation. When you update, say, the timestamp column, which defines the physical order of the records, that would involve a partition move; there's no two ways around it. Right. Cool.

Just one note about snapshots here. Even though all of our snapshotting machinery was born out of out-of-order writes and updates, we actually found a good use for it in cloud-native backup. If you're in a cloud, the best way to back anything up is to take a snapshot of the file system, right? So we introduced SQL in QuestDB that you can run to effectively freeze all transaction numbers, trigger the file system snapshot, and then release the transaction numbers after the snapshot system has confirmed the start of the snapshot. We use the same snapshotting mechanism for backups. That's just one interesting side effect of having it. Cool.

On to the next, seemingly unrelated, topic: how we add a column to your table. It's interesting because it's part of the high-throughput promise we make. I'm going to show you the low-touch operation we employ to add a column. Let's assume the table on the slide is the one you have, and the red arrow points to the moment in time where you decided to add a column and write some values to it. By the way, the InfluxDB Line Protocol by definition supports adding columns to your table without having to put the database into a maintenance window or introduce unnecessary risks. Let's see how it works. What happened here is that we did create the column, but only in the partitions in which there is data for it. We also introduced a marker, which we internally call the infamous "column top": a marker that records the row number at which the column started its life. In this particular example the column started life at row number two, so its column top would be two. And as you can see, the column is not backfilled: there's no IO of any description, and the operation is essentially instant.

Okay, but what if there's a default value for the new column, like 0.0? Is the engine smart enough to recognize that the column top is two, so anything before that doesn't have the value, but I know the default value and can fill that in?

It's kind of yes and no. The default is not configurable; it's a fixed default value of null for the column.
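The read side of the column-top marker can be sketched like this (hypothetical code; as Vlad says, the default is currently always null, which for a QuestDB DOUBLE is NaN):

```java
// Sketch: a column added later carries a "column top", the row at which it
// started life. Reads below the top never touch disk; they just return the
// default (currently always null, per the talk).
public class ColumnWithTop {
    private final long columnTop;        // first row for which data exists
    private final double[] data;         // stand-in for the on-disk column file
    private static final double DEFAULT = Double.NaN; // null for a DOUBLE column

    public ColumnWithTop(long columnTop, double[] data) {
        this.columnTop = columnTop;
        this.data = data;
    }

    public double valueAt(long row) {
        if (row < columnTop) {
            return DEFAULT;              // no backfill ever happened; no IO here
        }
        return data[(int) (row - columnTop)];
    }
}
```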
But come to think of it, it's actually quite a smart idea to put in the metadata what value to return. In essence, internally these columns are topped with an implementation that returns a constant null, so we could potentially top them with something that returns a value configured in the metadata. The logic is there: if the column data is not present, you just return the same value for it. Currently it's not configurable. Okay. Cool.

Storage abstraction API. Before we change subjects somewhat and talk about the SQL engine, I'd like to take a moment to reflect on the design choices we made when building the storage engine. One of them, and it's an important one, is the separation of read and write concerns. To be fully honest, the first iteration, circa 2014, had a single storage API with no separation between reads and writes, which was a mistake and led to the realization that these APIs should be separate. These APIs also provide a sufficient abstraction layer over physical storage so that the rest of the code base is not affected by changes in the storage itself. The read API is used by queries, but it's not the only API they use: there are a few other abstraction APIs, particularly ones that facilitate row-based access and page-frame-based access.

The other important design choice is that the whole code base is lock-free. There are no locks anywhere. We are very big fans of shared-nothing execution and we try to stick to it. The unavoidable coordination between reads and writes is done via what we call optimistic reads; a sketch of the pattern follows below. An example: say a reading process needs to read transaction state from a file. The reading process takes a snapshot of the transaction file, just reads whatever data is in it, and then it verifies that during this read there were no writes to the snapshot. It does this by means of version numbers at the start and end of the snapshot, plus memory barriers, to make sure reads and writes are not reordered in a way that would miss an update. Also, as a byproduct of using shared memory, reads and writes can coordinate across processes: you can run another instance of QuestDB as a read-only process against the same data files, with no risk of that process ever gaining write access to the tables, because the coordination is done via shared memory.
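That optimistic read is essentially a seqlock; here is a minimal sketch of the pattern using VarHandle fences (my illustration, not QuestDB's actual code):

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicLong;

// Seqlock-style optimistic read: the writer bumps 'version' to an odd value
// before writing and an even value after; a reader retries whenever the
// version changed (or was odd) during its read, so nobody ever takes a lock.
public class OptimisticTxnFile {
    private final AtomicLong version = new AtomicLong();
    private volatile long txnNumber;     // stand-in for the transaction state

    public void write(long newTxn) {
        version.incrementAndGet();       // odd: write in progress
        txnNumber = newTxn;
        VarHandle.releaseFence();        // data write may not sink below this point
        version.incrementAndGet();       // even: write complete
    }

    public long read() {
        while (true) {
            long v1 = version.get();
            long snapshot = txnNumber;   // read whatever is there
            VarHandle.acquireFence();    // data read may not rise below this point
            long v2 = version.get();
            if (v1 == v2 && (v1 & 1) == 0) {
                return snapshot;         // no concurrent write observed
            }
            // version moved: a writer interfered, retry the read
        }
    }
}
```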
So, that concludes my piece about writes, and if there are no questions, we're ready for the SQL engine, which is reads. Thanks, go for it. All right, cool.

SQL engine it is. This image will be quite familiar to you if you're aware of the Volcano model, or the Volcano pipeline model rather. We use a very similar model because it does make sense: we have reusable operators that can be arbitrarily combined into a pipeline. Our operators are capable of both row-first and column-first execution, and the latter, column-first execution, can be implemented as a concurrent map-reduce relatively generically. Column-based execution allows us to leverage the existing storage model pretty much one-to-one and use SIMD instructions to accelerate computation. And as I mentioned, our JIT compiler additionally compiles WHERE clauses into AVX assembly, SIMD assembly, to accelerate execution. We also have a handwritten parser, which allows us to customize the SQL syntax and introduce interesting shortcuts for common time series queries. We can talk about the shortcuts in a second, but first let's touch on SQL optimization.

Your SQL dialect, what was the starting point: Postgres?

There's a bit of history to it. The starting point was, I think it's called ANSI SQL-92, or something to that effect; the basic one. But then we built the Postgres wire protocol, which forced us into the Postgres dialect, pretty much, because there are tons of tools that run their own metadata queries, and Postgres is such that if you want to list the columns of a table, there is no one way to do it; every single tool does it in its own way. I don't fully understand the reason for that, but it's really diverse. So we are aligning our dialect with Postgres. It's not fully there yet, but we quite like our SQL extensions, because I think they help with time series, and sticking with the pure Postgres dialect would just be an inconvenience to people. So we keep them, but the rest of the SQL is Postgres. Right.

So, optimizations. Our optimizer is rule-based for now: we collect some statistics, but we don't use table stats in planning. The goal of optimization is to have predictable query performance, and a big part of that is avoiding processing data that does not need to be processed; you'll see in a minute how we do that. There's perhaps too much on this slide, but it's a summary of the examples we're going to go through on the next slides. I just wanted to stop by and talk about this dictionary-based symbol data type, which proved to be quite a hit with our users. It provides two benefits I can talk about; there could be more, but that's all I could think of. One is that it dispenses with the pesky joins that relational databases use to resolve IDs to names, which declutters the SQL queries by a huge margin. There's also an unexpected way we use symbols: symbols are a known set, so they act as enums, which gives us the ability to stop query execution after enough results have been selected against the symbol. I'll show you how we do that in a minute.

So let's have a look at some of these optimizations. This query demonstrates data reduction. It's one of our SQL semantics, notably where it says timestamp IN: this syntax selects all the rows in the table within the 11th of April, 1 p.m. UTC, so we select an hour's worth of data. It's a pretty good illustration of how we use partitions, data ordering and lazy data access to cull a large chunk of data and reduce IO. Here we have three partitions: the 10th, 11th and 12th of April.
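A hedged example of the interval syntax being discussed, reusing the trades table from earlier:

```sql
-- Selects one hour of data: all rows whose designated timestamp falls
-- within 2021-04-11 13:00:00 .. 13:59:59.999999 UTC. The partial
-- timestamp string expands to the whole interval it prefixes.
SELECT * FROM trades
WHERE ts IN '2021-04-11T13';
```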
The optimizer knows to produce a chain of two operators. The first operator uses interval search on the timestamp column only, so two out of the three partitions are ignored right away, because we're looking for the 11th of April. The timestamp column of the 11th's partition is then binary-searched for the interval start and end. In essence, the output of the first operator looks like the red blip on the second step. Then this subsection of the partition is searched with the WHERE predicate, in parallel, using SIMD. That results in a bunch of row IDs, which are then used to fetch the values of all the other columns. So that shows how we try to reduce IO, pretty much, by interval filtering.

Another example here uses data ordering and partitioning slightly differently. This is also our own syntax. So perhaps...

Quick question: there are no sub-level or secondary indexes? In your last example, there's no secondary index on price?

No. We treat our timestamp column as the primary index; it is the order of the rows in the table. None of these SQL examples use any other index.

Let me explain this syntax. LATEST ON is a clause that names the timestamp column of a table, and PARTITION BY assumes the table holds multiple time series; the column named there is the one that contains the keys of these time series. From our previous example, the ticker column would contain two symbols, BTCUSD and ETHUSD. Because it's a symbol column, we know there are only two values in the column. So the operator that searches the column does not search from the top: it goes from the bottom upwards, and the bottom is where the most recent timestamps are. The first row is definitely going to be a hit: it's the latest for whichever ticker it belongs to. And if the data is relatively well distributed, the second row is going to be nearby. The net result is that this operation finds the row IDs in, you know, half a millisecond or so, pretty much irrespective of how big the table is. It will perform the same regardless of table size. The only thing that might affect it is the distribution of values: maybe the second value was last seen a year ago. That's possible, but highly unlikely in normal scenarios.

And the last query. This is a mouthful, but I just wanted to draw your attention to the SAMPLE BY keyword. It's a relatively simple syntax for running aggregates over a time window, in this case 20 seconds, and the SQL returns the aggregate values for each time window it finds. In this case it's an OHLC chart: open, high, low and close prices over the time series. This is possible to do in a relational database as well; it's just going to be a little more syntactically involved to do the same thing, and a relational database, more likely than not, would have to reorder the trades table on every read, which we don't do here. That's the advantage. Cool.
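For concreteness, the two shortcuts just described look roughly like this (column names assumed from the earlier example):

```sql
-- Latest row per ticker, scanning from the bottom (newest) up:
SELECT * FROM trades
LATEST ON ts PARTITION BY ticker;

-- 20-second OHLC bars over the trade stream:
SELECT ts, first(price) AS open, max(price) AS high,
       min(price) AS low, last(price) AS close
FROM trades
SAMPLE BY 20s;
```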
Is that ALIGN TO CALENDAR? Is that specific to QuestDB?

Yeah, it's specific. Well, SAMPLE BY and everything after it is specific to QuestDB. ALIGN TO CALENDAR, yeah, it's a bit quirky, but apparently when you resample values, depending on what sort of time series people are dealing with, they need different alignments. Say you align to 20-second windows; the question is, where does a 20-second window begin? Some people want to align it to the first observation: from the first occurrence, then another 20 seconds, and so on. Other time series need to be aligned to the calendar: regardless of your data, windows align to midnight. And some people actually need to align to the end of the period, which we don't do yet; you'd have to calculate these intervals in the other direction, so it's a little more complex. We haven't done it, but people want it. The latter one, to help you understand it: there are financial instruments that expire at a particular time, and people want to align the sampling to the expiry date of the instrument. That's where it stems from. Very interesting, thanks. Yeah, it's fascinating, really; the world of finance could be a data science of its own.

Anyway, a quick recap. What did we learn today? QuestDB is a column-based, append-only storage engine. Ironically, the whole storage model is optimized for reads, even though we don't do too badly on writes either; that is in part thanks to the accumulation of smart handling of out-of-order writes, which is how we managed to achieve this kind of throughput. The whole thing, storage and SQL engine, is built from scratch, and we insist on keeping it that way, because I believe performance is a function of control: if we have control over the components, we can achieve better performance. And we use multi-threaded, SIMD-based execution anywhere we can.

So, we're coming to the end of it. What do we expect in 2022? First, second and third is replication. We want to build replication, and we would like to do it right the first time around, so we're focusing mainly on the building block of replication itself: a table being able to maintain a copy of itself, synchronously or asynchronously, across the network. We are building a cloud, which is a managed service. We are building cold storage; a lot of storage engines have it, and it's mainly interesting for data warehouses: the ability, transparently to SQL queries, to move data to and from cold storage via a local disk cache, to save costs, pretty much, and store an infinite amount of data at an affordable price. We have a parallel execution framework and we want to implement more queries using it, to make more queries run a lot faster. And there are also cool features, or highly requested features, that we usually implement along the way, so there might be some surprises coming this year. Yeah, and this is it. Thank you very much for listening to my late-night gibberish here.
So if you have any remaining questions, I'm happy to take them now.

I'm not sure why you're calling it late-night gibberish; that was fantastic. So, over to the audience: if there are any questions for Vlad, go for it. Hamid, I see you're ready to go.

Yeah. Any comparison with InfluxDB, particularly the new engine they are building with the Arrow format and columnar storage?

No, unfortunately. We haven't benchmarked the new engine. It's fascinating; I have a sneaking suspicion they're trying to copy existing database engines with that model. But no, we haven't tested it. We did test against Influx, I think 1.8 or 2.0 or something like that, with the old engine. We do have different models. But anyway, yes, we know about it.

Hi. Can we go back to your alignment case? What are the different things you can align with? We saw calendar; what are the other possibilities?

Where are we... I think it was after this. No, not the partitioning slide, the one after that. Yes, this is it. Got it. So we can align, as I mentioned, to first observation. You can align to the calendar; if you don't specify anything, the calendar is UTC. You can align to a particular time zone: you specify the time zone you're aligning to. And you can also specify a custom offset to align to. Those are the things we can align to. What we can't do right now is align to the end of the period; that specifically we haven't implemented yet. Does this help?

Do you see any requirement where the alignment is a user-defined predicate?

Can you give me an example?

Yeah, sure. We see a case with healthcare data where people are looking at IoT values coming from sensors, and sometimes they want to align things to, let's say, the first jump of the day: if you see more than a 2x jump, I'm making this up, but something like that within a day, you align to that, then take n steps forward and n steps backward to do some operations. The reason to do it is, this being healthcare data, they think that between sleep time and awake time there is a difference in performance, or some such. So the condition that determines the alignment would differ from case to case. Have you seen this?

Well, the short answer is no, this is the first time I've heard of it; nobody has asked for anything like that. That's one thing. Second thing is, I understand what you want to do, definitely, and I think it can work. It's a multi-pass kind of situation, but it works. Anyway, I'll shut up here.

Prometheus, that's what I was thinking about: Prometheus has this rate function that automatically understands the resets in counters. So I'm thinking this is some sort of align-to function, I suppose, almost.

Jesus Christ, I need to think about this. But look, the good thing is that we own the parser and the implementations, so we can add syntax for it. This is definitely possible.
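For reference, the alignment options enumerated above look like this in SAMPLE BY syntax (a hedged sketch; the end-of-period variant doesn't exist yet, as Vlad says):

```sql
-- Align 20s windows to the first row in the result:
SELECT ts, avg(price) FROM trades SAMPLE BY 20s ALIGN TO FIRST OBSERVATION;

-- Align to the calendar (UTC by default), to a time zone, or with an offset:
SELECT ts, avg(price) FROM trades SAMPLE BY 20s ALIGN TO CALENDAR;
SELECT ts, avg(price) FROM trades SAMPLE BY 20s ALIGN TO CALENDAR TIME ZONE 'Europe/London';
SELECT ts, avg(price) FROM trades SAMPLE BY 20s ALIGN TO CALENDAR WITH OFFSET '00:30';
```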
It's even possible because it's an open source project, so anybody can contribute it. But yeah, first time hearing it; let me talk to the team and see what they think, and maybe do a bit of research. Do you have an idea where else this could apply, other than medicine? I'm sure we can think of some.

Basically, it's anomaly detection: something happened, and you measure around that event. Right, so people do before-and-after analysis: this happened, and for n time steps before and n time steps after, we want to see the impact of something.

Okay. You know what, I didn't want to talk about this, but we're working on something else which might help. I'm not sure how you currently do it, but I strongly suspect you do it in Python, or you might. What we're trying to do is build a Python layer on top. I just explained that we have vector storage; we store everything in vectors, and incidentally this is almost identical to how Pandas expects its data to be. The goal of the project is to be able to run arbitrary Python code inside the VM, so you can pull a vector via a memory reference, run Python on it, and do anomaly detection that way. That's a bit of an expanded answer, but on this SAMPLE BY alignment function: join our Slack and we can talk about it in a lot more detail. If you seriously consider using it for this use case, we'll build it.

Okay, we have a lot of questions to get through. Subhasis, you want to go next?

Yeah, I'm also from UCSD. I have a question: do you have any benchmark on a high-rate, small-record data stream, like sensor data or wearable-device data, where each data point is very tiny but the volume is very high?

Let me just understand: the data is very tiny in what sense? Like a short?

Yes, tiny. By sensor data I mean, I'm talking about ingestion and query capability when the data is very small: sensor data where for each timestamp you have only one value. You need to ingest that, or index on that type of data, like healthcare data from a wearable device.

Yes, we are built for this. The performance numbers in benchmarks are relative to the number of values you send to the database. The benchmark I showed, I think, has 10 tags and 10 values per row, I forget exactly; basically a lot of columns. On a pair of values, where you've got a timestamp and one value, the rows-per-second figure is going to be much higher. So yes, we handle this.

All right, we have one more question from the audience and then a question from the chat. Nick, you want to go next?

Yes, thank you. My question is actually similar to what Subhasis was asking. If I'm working with IoT data or stock data and there's high cardinality, I might have 100,000 sensors and I really only want to query a couple of them, but the time period is very large, maybe two years of data. With the current setup, if the data is all sorted on timestamp, I'd have to scan all of the partitions for that whole period. Is that the case right now?

That's a good question, Nick, thank you. Well, I also didn't mention here that we do support indexes on symbol values.
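For reference, a symbol column can be indexed at creation time or afterwards; a hedged sketch of the syntax, with assumed names:

```sql
-- Index the symbol column at table-creation time...
CREATE TABLE readings (
    ts     TIMESTAMP,
    sensor SYMBOL INDEX,
    value  DOUBLE
) TIMESTAMP(ts) PARTITION BY DAY;

-- ...or add an index to an existing symbol column:
ALTER TABLE readings ALTER COLUMN sensor ADD INDEX;
```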
So in essence, there are a couple of things. If there's no time interval in the query, there's no other way but to scan the whole table. Though when I say whole table, I mean we touch every partition; whether we actually open and read the data in a partition is another matter. Indexes are per partition. For example, if you're looking for two symbols across the entire data set, the index is built to answer the question of whether a given partition has the value you're looking for; we'd have to open the index and read the top of it.

Gotcha. So that would be an argument for sorting the data by that symbol column and then the timestamp. Because if I'm doing IoT or trades, I almost always care about one identifier and then the timestamp; I don't care about any of the other signals coming in.

Yeah, it's a fairly generic use case that we're looking at. The way I see it, it's a case of storing multiple time series in one table, and we do have a plan, although not an immediate one, to effectively partition within the table by the value of a symbol. This way we would cut down on IO substantially, especially since you mentioned 100,000 symbol values. I don't think we have an immediate idea of how to do it best, but there's a plan to do it eventually.

Okay. I was thinking, if you sort first by the symbol and then by the timestamp, then you can keep an index of where each symbol's data starts and look it up really quickly.

Can we take this one offline? Hit up Vlad afterwards; sorry, Nick. Let's talk offline about this. That's okay, thank you very much.

Okay, the question in the chat: do you have any plans to support in-engine, close-to-data custom functions, basically user-defined functions?

Yeah: Python. The plan is not to invent a language; we want to run native Python on the data. That's the plan.

One last question from the audience. Can you hear me now? Yes, go for it.

Can you handle multiple distinct time series that you can join? This is very important, particularly for data scientists, because we've got multiple devices and we need to find correlations across them. Mine is only healthcare data, but...

Well, healthcare sounds like a really good area. Yes, we support what's called ASOF JOIN. We genuinely support, and I didn't mention this in the talk, all of the traditional relational joins. Sorry, I accidentally hit my keyboard and changed the slide. We also support ASOF JOIN, which is specifically for correlating time series. So you can do ASOF JOIN, LT JOIN, that sort of thing.
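As a closing illustration, a hedged sketch of ASOF JOIN with assumed table names: each trade is paired with the most recent quote at or before its timestamp, matched on ticker.

```sql
-- For every row in trades, pick the latest quotes row whose timestamp is
-- at or before the trade's timestamp, matching on the ticker symbol.
SELECT t.ts, t.ticker, t.price, q.bid, q.ask
FROM trades t
ASOF JOIN quotes q ON (ticker);
```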