Databases, a database seminar series at Carnegie Mellon University, is recorded in front of a live studio audience. Funding for this program is made possible by OtterTune and Google.

Hi guys, welcome to the last talk of the semester. We're excited to have our friends from CrateDB come talk about their database system. CrateDB is a distributed database system that comes out of Europe, and we're always excited to hear about what people are building. Today's speakers are Marija and Marios. Marija did her PhD at TU Darmstadt in computer science, and Marios is a senior software engineer at Crate.io. As always, if you have any questions for our speakers as they're giving the talk, please unmute yourself and ask your question at any time; we'd rather this be a conversation with them than them just talking to themselves. And with that, we thank you guys for being here. I will say, Marios, it is 11:30 p.m. in Greece where you are, thank you for staying up, and Maria, it's 10:30 where you are in Germany. It's not the record — our record is 4 a.m. — but it's still pretty late, so we appreciate you guys staying up with us.

Thank you very much. Thank you for having us. Yes, thank you for having us, and very nice introduction. So, today we are going to talk about CrateDB, which is a distributed SQL database built on top of Lucene. And as Andy already mentioned, let's introduce the speakers. My name is Marija Selakovic, and I hold a PhD in computer science from TU Darmstadt in Germany. I have been living in Germany for the last eight years, so quite some time, but originally I'm not from Germany; I come from Serbia, and I did my undergraduate studies in Belgrade. Today I work as a developer advocate at Crate.io. We are also going to hear from Marios Trivyzas, who is joining from Greece. He did his master's degree at Athens University, and currently he's working as a senior software engineer at Crate.io. He's also one of the first engineers at Crate.io, so he has a lot of experience with building CrateDB, because he has been there since the very early years.

The agenda for today is going to be relatively easy to follow. Can we go to the next slide, please? First, we are going to introduce CrateDB. Then we are going to talk about the main architectural points, then go a little bit deeper into the storage layer, which, as we already discussed, is completely based on Lucene. Then comes the core and probably the most interesting part of CrateDB, the query execution. We are also going to mention logical replication and how we implemented it. And finally, we are going to say a few words about deployments and the ecosystem, because when you have a database, it's crucial to have a well-established ecosystem that works with it.

So, what is CrateDB? CrateDB is one of the leading database solutions for real-time analytics. It's distributed, it's open source, and it supports SQL as a query language, which is quite important: it means developers work with a language they are already familiar with, and the learning curve is not that steep. But let's go into more detail. We like to say that CrateDB combines the best of SQL and NoSQL, and it also supports the PostgreSQL wire protocol for easy onboarding and compatibility with many tools that work with Postgres.
When we say it offers the best of NoSQL, we usually mean that CrateDB also supports dynamic objects, storing objects in JSON format, and dynamic schemas. You can use CrateDB to store and analyze a wide variety of data, including relational data, object data, time series data, geospatial data, blobs, and of course full-text data. Sometimes you would need two systems, for example if you want to combine different data varieties, but with CrateDB you can use one database. Scalability is something CrateDB has been known for: it supports horizontal scalability, so you can scale your cluster by just adding new nodes. And of course performance, which is always a moving target for us. Since we are based on Lucene, we not only aim for, but also claim, that queries can run pretty fast, and we also support very fast aggregations by using columnar storage, which we will talk about a little later. CrateDB can also be seen as a distributed search engine, since it's built on Apache Lucene. Currently there are two deployment options: there is the open-source version of CrateDB, and there is also a cloud version available on more or less all major hyperscalers.

You already touched on this — the PostgreSQL wire protocol. Is it the PostgreSQL dialect, or is it something else?

Yeah, CrateDB follows more or less the PostgreSQL dialect. There are still some features that are not fully implemented, but we are working on compatibility with the PostgreSQL ecosystem in general.

So it's the PostgreSQL grammar?

Not only the grammar, but also the wire protocol. That's why PostgreSQL tools can connect to CrateDB; even the psql client can connect to CrateDB.

In general, CrateDB stores data in a relational format: when you create a table, it looks like a relational database. But it also allows storing objects as JSON documents, as we can see in this example, and the nice thing is that objects can have very different schemas, with an arbitrary number of attributes and nesting levels. On top of this, we also store the original documents in JSON form: for each row stored in CrateDB, we keep the JSON representation of that row.

Talking about schemas, there are three ways to declare objects in CrateDB, depending on what type of schema you want to follow. The highest flexibility comes with dynamic schemas, where you can add a new column dynamically: if you want to store a new object with a different number of columns than was defined, you can do this, and the new columns will also be indexed. If you want the object to follow a strict schema, you declare the object as strict; in this case, every added object needs to follow the column definition. And finally, there is a third way to declare an object, which is ignored: you can add new columns, but these columns are ignored from the indexing point of view, so they are not indexed.

Another nice feature of CrateDB is that you can do database administration using pure SQL. There are tables like the pg_catalog tables, but also the information_schema tables, where you can find data about existing tables, partitions, constraints, and columns.
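To make the three object policies just described concrete, here is a minimal sketch assuming CrateDB's documented OBJECT(...) column-policy syntax; the table and column names are made up for illustration:

```sql
CREATE TABLE IF NOT EXISTS sensor_readings (
    ts TIMESTAMP WITH TIME ZONE,
    -- DYNAMIC: undeclared keys are accepted, added as new columns, and indexed
    meta_dynamic OBJECT(DYNAMIC) AS (device TEXT),
    -- STRICT: inserts carrying keys outside the declared ones are rejected
    meta_strict  OBJECT(STRICT)  AS (device TEXT),
    -- IGNORED: undeclared keys are stored with the row but not indexed
    meta_ignored OBJECT(IGNORED)
);
```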
Not only that, there are also system tables you can query, for example to find out what the running jobs are, what the finished jobs are, and information about your cluster and shards. And there is a simple way to stop a single job, or all jobs, on a node. This is a cool feature of CrateDB: you don't need any new knowledge to administer the database, you can just use SQL for that as well.

Finally, I think it's also important to mention what we consider a good use case for CrateDB and what we don't. In our experience, CrateDB is a very good database for time series data, especially IoT data or industrial time series data, and for cases where you need fast queries, aggregations, full-text search, SQL as a query language, or volatile data. Compared to many other databases, we also support logical replication, and we support hybrid deployments that can be implemented on a Kubernetes cluster running, for example, in some specific region. CrateDB also comes with the promise of high availability and scalability. But it's not a good database if you need transactions — transactions are not supported — or if you have highly hierarchical, normalized data models: CrateDB supports joins, which is a good feature, but joins in this setting are very expensive to execute, and we are going to see how we tackle this issue. And in case you need foreign keys, CrateDB is not a good database either.

So let's go to the next topic, which is the CrateDB architecture and the main overview, and I'm handing over to my colleague, Marios.

Hi from me as well. Thanks a lot for joining this session, and let's talk a bit about the general CrateDB architecture. As we have stated already, CrateDB is very easy to scale because it has a shared-nothing architecture. Every node of the cluster is equal to every other and can perform exactly the same tasks, so there is no single point of failure: if any node fails, some other node can take over its jobs and the cluster can continue working in a healthy state. Apart from that, you can scale as your data grows, or as your requirements grow regarding performance or the number of clients querying or inserting into the database, by just adding more nodes to add more storage, more memory, and more CPU power. The moment you add a new node to the cluster, the existing tables you have defined are automatically rebalanced across the cluster using the newly added nodes. So you don't need to do anything manually to rearrange your data when you double the number of nodes, for example, or even if you just add a couple more.

Just to say a bit more — I won't go into details — there are much more complex configurations you can do. There are nodes that have the master, or coordinator, role in the cluster. You can have all nodes be eligible for that role, but you can also configure only a subset of nodes, which maybe don't even have much storage space, to just do the coordination and take the master role, and then you have data nodes. You can also have client-only nodes, whose responsibility is basically to act like a load balancer servicing the clients in front of the actual cluster performing the operations.

Great question from the audience — Kevin, go ahead.

Yeah.
So my question is: within a single node, is the architecture also shared-nothing? Does every physical CPU core and OS thread get its own independent buffer pool and memory, or is there a shared-memory architecture where, on a single node, all of the memory is shared across them?

Every node has its own memory, but this memory is shared across all the operations the node does regarding clustering, inserting data, or executing queries. There are, of course, some thread pools, and each thread pool has its own responsibility — for example, internal scheduled jobs, inserts, and searches use different thread pools — but that's it.

Got it. So it's one big buffer pool and one big application-level scheduler per node, and that's shared amongst all the threads and CPU cores that run on that node?

Yeah.

Okay. Understood, thank you.

Yeah. You can also influence the number of threads in the thread pools, so you can configure some things, but in general it's a shared thing, as we said.

Yeah. So shared-nothing architecture was sort of the way to build things 10 years ago, but certainly in the cloud now everyone is moving to shared disk. And I think this would make sense in analytics, right? Because you're trying to store a lot of data, although you obviously want to keep the things Lucene has indexed local. Have you thought about — maybe you'll get to this — is there a cloud version of CrateDB where it becomes shared disk, where you're storing things on S3 or even EBS, or does using Lucene somehow require CrateDB to be a shared-nothing architecture?

Yeah, to have good performance you have to have local disks, and actually SSDs, or nowadays even NVMe disks, to get the best out of it. You can use cloud storage like S3 for snapshots, and we have this idea to also be able to search on those snapshots on S3 — to have slow searches, of course, on, let's say, archived historical data — to give this opportunity. But you definitely don't get the performance you get with Lucene on a local disk.

Maybe Maria said this and I missed it: you're trying to do real-time analytics, so what is your notion of real-time? What's the target P99 that is the sweet spot for you guys?

Can you repeat that again, please?

You're trying to target real-time analytics. So you're not trying to replace Redshift or Snowflake, where they can just do big sequential scans over large chunks of data; you're going to be using Lucene as the index, so you're targeting queries that are fast. I don't understand: when you guys say real-time, is that in the hundreds of milliseconds range, or something less? I'm trying to understand the target SLA you're trying to provide. Obviously it depends on the query and the data, but I'm just curious.

Yeah, as you said, it depends on the situation. First of all, what is the throughput of inserts and updates you want to support together with real-time? Because when you increase the throughput, you don't see your data immediately. There is this concept in Lucene of the refresh rate, the refresh time: the data that you insert becomes available after, let's say, one second, or 10 seconds, or 100 milliseconds.
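(For reference, this refresh interval is a per-table setting. A hedged sketch, assuming CrateDB's refresh_interval table parameter and REFRESH TABLE statement, with a made-up table name:)

```sql
-- Refresh interval in milliseconds: lower values make new rows visible
-- to queries sooner, at the cost of insert/update throughput
ALTER TABLE sensor_readings SET (refresh_interval = 1000);

-- Force a refresh so freshly written rows become visible right away
REFRESH TABLE sensor_readings;
```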
So it's something that you configure, but the lower you set it, the more you decrease the performance of inserts and updates, of course.

Okay, awesome, thank you.

So, as an overview of the architecture stack, this is what you will see on each of the CrateDB nodes; let's start from the bottom. You have your clients that connect to the cluster through either the HTTP endpoint or the PostgreSQL wire protocol. Each statement — whether it's a DML, DDL, or administration statement — goes through the parser. The parser parses the SQL, validates the syntax, and creates an abstract syntax tree, which is then passed to the analyzer, which knows a bit more about the tables that are available, the columns, the privileges, and so on. That is the semantic processing: checking data types and all the things you can do before runtime, before the actual execution, I mean. Then it is passed to the planner and optimizer, which has several stages: first you create — we will see it later — a logical plan, you optimize it, and then you create an execution plan. And then you actually send it to the distributed execution engine of the node you have connected to, which will in turn use its local Lucene storage to store things locally, and also transport data and requests to other nodes in the cluster to fulfill your statement.

So regarding — yeah?

Sorry, going back: the query planner and optimizer, is that written from scratch, or are you guys relying on Calcite?

No, it's written from scratch.

Got it, okay. What is it like — I mean, are you doing cost-based search?

We do some basic things at the moment. We do some analysis on the table to get statistics regarding the number of rows and the average length of every row in bytes, to decide some things. It's not sophisticated; it has some decisions and some optimizations, especially in the subquery area and joins. But we will also see a bit later about filter pushdown and limits, so maybe you can hold the questions a bit and ask me then.

Hi, thanks. Thanks. So, when it comes to node-to-node communication, there is one more protocol, a network protocol for CrateDB called the transport protocol, and it's dedicated to communicating requests and data between the nodes. You can also enable encryption for each one of those protocols: only for Postgres, only for HTTP, only for transport, for two of them, or for all of them. The configuration is done via host-based authentication, and you can use trust and certificate authentication methods. Of course, when you enable encryption for one, two, or all three of those protocols, this applies to the whole cluster. You cannot have two nodes working with unencrypted HTTP and all of the others encrypted, or have transport encrypted only between specific nodes; it applies to the whole cluster.

Next, we will talk just a bit about the storage layer, which is Lucene. When you define a table and start inserting data into it, all of your data gets indexed by default. You have the possibility to disable indexing for certain fields if you're never going to search by them, to gain some speed for inserts and updates and also some space. For text values, Lucene uses an inverted index — we'll see a few more details next — and all numerical values, including types like timestamps, geo-points, and geo-shapes, are internally handled as numerical values using block KD-trees for the indices.
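As mentioned, indexing can be turned off per column. A minimal sketch, assuming CrateDB's INDEX OFF column option; table and column names are invented:

```sql
CREATE TABLE log_events (
    ts TIMESTAMP WITH TIME ZONE,
    level TEXT,
    -- never searched by, so skip indexing to save space and speed up writes
    raw_payload TEXT INDEX OFF
);
```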
On top of indexing every field, you also have the column-based data storage of Lucene, which is called doc values, and this is what provides very fast sorting and aggregations. Currently, you can disable these doc values — the columnar storage — only for text fields, not for numerics; we're planning to add that soon, it's on our radar, to save even more space for fields where you don't need it.

So let's talk a bit more about the text index, and first about the analysis for the full-text index, which comes first. When you have a large document, before you start indexing it, you need to split it into tokens, then maybe apply some filters on those tokens, and maybe also some character filters on top. All of these components comprise — I was missing the word — an analyzer. There are some standard analyzers that you can use, but you can also define your own tokenizer and combine the filters you want to do your job. I will show just an example of the standard analyzer for English: first you split the text into tokens on the whitespace, so you get the words; then you lowercase everything; and then you remove the stop words like "the" and "and", words like that which are not useful when you're searching.

Next, we will see how we index this data. Before we see the tokenized version, let's see what happens when you have defined a normal text index, without the analysis step. In our table you can see this model field, with the values "almond milk", "almond flour", and "milk". With the plain index, each text gets indexed as a whole phrase, the whole string, and it just points to the document IDs in which this phrase can be found. But when you use the full-text index and you have done the tokenizing, analysis, filtering, and everything, you can see that the tokens are split: the token "almond" can be found in document IDs one and two, "milk" in one and three, and "flour" only in document two. In this way, when you start searching for terms like that, you can very quickly get to the documents where those values can be found. And on top of that, with Lucene you can do much more complicated things, with phrase searching, fuzzy logic, wildcards, prefixes, suffixes, and so on, and scoring: you can define how the matched documents are returned, ordered by how relevant they are to your query.

The doc values, as we said, are sort of the columnar storage of Lucene, and every numeric value gets its doc values index. Basically, you can very easily and quickly access all the values of a certain field, like the quantity here, and you have them related to document IDs. What makes it fast is that you can have a query that filters things out, so you decide that this is the subset of document IDs matching the query, and then you can directly get a very fast iterator over the values of the quantity field for the IDs you have pre-selected, pre-filtered. And you can also do very fast aggregations over numeric fields, like averages, sums, min and max, with that.
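To make the full-text part concrete, here is a small sketch of how a full-text index with an English analyzer and a MATCH query can look in CrateDB SQL, mirroring the "almond milk" example above; the table and index names are made up, and exact options may differ:

```sql
CREATE TABLE groceries (
    id INTEGER,
    model TEXT,
    -- full-text index over "model" using the built-in English analyzer
    INDEX model_ft USING FULLTEXT (model) WITH (analyzer = 'english')
);

-- Term lookup goes through the inverted index; _score reflects relevance
SELECT id, model, _score
FROM groceries
WHERE MATCH(model_ft, 'almond')
ORDER BY _score DESC;
```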
So, for all the things like sorting and query execution, is that being done by Lucene, or is that done by Crate? Is Crate going to take a query and, say, do a sequential scan or something, where the data is in Lucene and you're just reading it out of Lucene as a bunch of vectors and then doing the query processing in your code, or can you push those filters and everything down into Lucene?

Yeah, it's both, actually. There are many things that can be done by Lucene, and we will see next that we try to push down as much as possible into Lucene to take advantage of the index and avoid doing processing in the CrateDB nodes. But of course, when you have custom logic and operators on the fields and you don't have simple filterings, then you need to do all that work in CrateDB.

Regarding compression: of course, you cannot have all these indices for each field, plus the doc values on top, plus the plain storage, without compression — otherwise you would need tons of disk space. CrateDB uses two compression algorithms. The default is LZ4, which is more or less a good trade-off between storage efficiency and performance, both for queries and inserts; and there is also DEFLATE, which is a much more aggressive compression, but it comes at a cost for both inserts and queries. On top of that, several techniques are used for the doc values, depending on the numeric type, the numeric ranges, and so on: there is delta encoding if you have many values that are very close to each other, there can be bit packing, or greatest-common-divisor compression. We ran some tests in the past to compare CrateDB with Postgres regarding storage requirements: we had a table with a variety of different fields and 120 million rows, and only two indices enabled, on time and device ID — also for Postgres, of course, to make the comparison fair. And here is the comparison of the two CrateDB compression algorithms against PostgreSQL.

Next, we will talk a bit about the query execution engine and about plans. As we said before, after the analyzer we go to the planner, and the first step of the planner is to create a logical plan. Let's take the very simple query here, SELECT name FROM users. For every query and statement, you can add the EXPLAIN prefix and get the logical plan for it. So we just select a field, name, from the table users, and this translates to a logical plan which is a Collect operation: go to Lucene, to the users indices, and retrieve the name; the "true" here is the filtering — there is no filtering, so basically everything is returned. If we just add a plain filter in the WHERE clause, where name equals a literal string, then you see it being added here, and this is the query expression that is pushed down to Lucene, to the indices, and translates into a Lucene term query in order to return the documents that you want. As we said before with your question, we try to push down all the filters we can to the Lucene level to take advantage of the indices. A simple example: when you have a subquery, a nested query, with a filter on the age in the inner one and a filter on the name in the outer one, the logical planner can combine them with an AND, create a Lucene term query again with an AND, and take advantage of the index. After the logical plan and these optimizations, you then have to create the physical execution plan, which is what actually gets executed by the execution engine on the nodes of the cluster.
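For illustration, the statements described above would look roughly like this; the exact plan output text is not shown here, and the table and column names (users, name, age) are taken from the slide description:

```sql
-- No filter: everything is collected from the shards
EXPLAIN SELECT name FROM users;

-- The WHERE clause becomes part of the Collect operator and is translated
-- into a Lucene term query, so the filtering happens inside the index
EXPLAIN SELECT name FROM users WHERE name = 'Alice';

-- Inner and outer filters can be merged into a single AND-ed Lucene query
EXPLAIN SELECT name
FROM (SELECT name, age FROM users WHERE age > 30) AS u
WHERE name = 'Alice';
```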
The logical plan we have seen before, even with the optimizations, doesn't know anything about the data distribution at this point. It doesn't know where your shards are located, because normally, for every table, you will have at least three nodes in your cluster, let's say, for redundancy and to sustain a failure: you will cluster the table into a minimum of three shards and get one shard per node. The planner that creates the execution plan now goes to the routing table of the cluster, which holds all the information about which shards are located on which nodes, creates the execution plan, and asks those nodes to retrieve data. Keep in mind that, apart from the primary shards, in CrateDB you can also configure one or more replica shards. If you define, for example, three replicas, then you have three times three — nine shards in total — and there is also a round-robin algorithm for your queries, so that not every client running a query on the same table hits the same shards. You get round-robin behavior, and some statistics about the current pressure on the cluster, regarding CPU and threads, are also used to decide where to forward the requests.

For every execution plan, the nodes that participate in the execution are divided into three roles. There are the collect nodes: the nodes that hold the Lucene indices, the shards, where the data is gathered, with the filtering and ordering that are pushed down. There are the merge nodes: the nodes that retrieve results from the collect nodes and merge them; there can be one or several merge nodes, depending on the operation and the number of nodes and shards. And there is only one handler node: the one that communicates with the client, the one the client originally submitted the query to, and it's responsible for gathering all the data from the merge nodes and returning it back. For simple cases, the merge node can also be the handler node, so you only have one merge node and maybe several collect nodes to retrieve data, but the merge and handler node are the same one, the one the client connects to.

Let's now see a bit more regarding the execution, the physical plan. We saw the logical plan for this simple query, and if you use EXPLAIN ANALYZE as a prefix, you get the actual execution plan, where you also get timings for each operation, because it actually executes the query for you, so you can see which part of the query is the bottleneck. For this simple one, we see that it's routed to two nodes, node 0 and node 1: node 0 holds shards 0 and 1 of the table users, node 1 holds shards 2 and 3, and from both of them you request the value of the name field. The merge phase is done, for example, on node 0, which can also be the handler node where the client has connected. It has two upstreams; in this simple case, one of the upstreams can be the node itself, because it holds shards of the table we have queried. So when it receives data from the other node and from itself, from the executing operator, it merges them and sends them back to the client.

Now let's look at a slightly more complex query, with an ORDER BY and a LIMIT. Notice that the logical plan already becomes a bit more complex. We start from the bottom, which is again the first operation executed in Lucene for the users table.
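(For reference, the slide query is not visible in the transcript; a hypothetical statement of the shape being walked through here would be:)

```sql
EXPLAIN
SELECT id, name
FROM users
WHERE name = 'Alice'
ORDER BY id
LIMIT 50;
```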
Now you request this special internal field called fetch ID, together with the id; you push down the filter you want on the name, and you ask to order by id, which will also be done by Lucene — this is the ORDER BY we have — and then you apply the limit, and then the final fetch.

Let's talk a bit about this fetch thing. Since you can have tables with many fields, or even a few fields but very large ones like texts or geo-shapes, it's important to avoid sending around data that is not needed. So when you are executing a query like that, instead of returning the requested fields right away, you ask Lucene to generate this internal fetch ID for each of those documents. The fetch ID is a combination that encodes the shard together with an internal document ID — and I don't mean the id column we have defined, but an internal document ID. Once you have applied the limit — because you don't want to return 10,000 or 10 million rows, only 50 — you have the fetch IDs you want, and that is when you go back to Lucene, to the nodes, and execute the fetch operation to retrieve the actual data for those fetch IDs. And just to say a bit more about this limit: when you have a limit of 50, since you have many shards, you push this limit down to all of the nodes, so you get 50 rows from each of them, already ordered, and in the merge node you just keep the top 50 to return back to the client.

And since we are a bit over time with the questions, I will go a bit quicker through the joins. CrateDB supports joins, and the default algorithm is nested loop. I don't need to explain much about the nested loop join, it's very clear. Why is the default join algorithm nested loop? Because we haven't invested a lot in the joins — we will see the hash join next. The default has to support all cases where you have complex join conditions, with greater-than, less-than, or not-equal joins, so you have to go to nested loop; we haven't implemented any sort-merge or similar algorithms up to now. Even for the nested loop we have distributed logic: several nodes execute the nested loop, and for each node that does a portion, you have to broadcast shards from other nodes in order to do that portion. If a node holds shard one of the left table, it has to receive all the shards of the right table in order to go through them and execute the nested loop, and at the end you have the merging of those results.

For equality joins, quite some time ago we implemented a block hash join algorithm. First there is the build phase, where — using the stats, as we said earlier — you scan the smaller table in terms of the memory it will use, you create hashes, and you store them in a hash table; then you probe all the rows of the other table against this hash table. You use blocks because your table can be small, but maybe not small enough to fit in the available memory, so you might need multiple blocks and multiple hash tables so that the table finally fits in memory. The block hash join can also be executed in a distributed fashion: beforehand, you compute a hash and a modulo on the rows, and you distribute them to the nodes you have decided will execute the block hash join.
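As a small illustration of when each algorithm applies, with made-up tables and columns:

```sql
-- Equality join condition: eligible for the (block) hash join
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

-- Non-equality condition: falls back to the (distributed) nested loop join
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.order_total > c.credit_limit;
```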
Every node then receives the rows that could potentially match — with this initial hashing it cannot miss any rows — performs its portion of the block hash join, and then there is the merge phase, where the results are merged.

For all these operations, CrateDB supports two methods of communicating data between the nodes: direct response and paging. It automatically selects whatever seems ideal for the situation, regarding the nodes and shards participating, the amount of data expected, whether you have limits, and so on. You can see it in this graph. With a direct response — let's say node 0 receives the initial query from the user and needs to send a request to node 1 to get data or execute something — you just get the result as the response to your job request from node 1. But when you have larger amounts of data, you cannot transfer it that way, so instead you use paging, and the steps in the communication become a bit more complicated: node 0 will first send the job request to node 1, node 1 will just acknowledge with no data, and when it has some data available, in an async fashion, it will send the first page to node 0; node 0 will process it and ask for the next page from node 1, which will send it, and so on. All these communications are asynchronous, so you have listeners and futures, and at some point you get the response from node 1 with data and can continue processing.

So, if you have some questions, maybe now, before we proceed with Maria to the next chapters — or we can also do it at the end.

Yeah, there is not much left actually, so we wanted to mention logical replication, since we were already talking about replication. What do you prefer, a short Q&A now?

Keep going, and we'll do the Q&A at the end.

Okay, awesome. So, logical replication. We were talking about replication at the shard level, but CrateDB also allows you to replicate tables, or a subset of tables, from one cluster to another. It follows a publish-subscribe model, where publishers are clusters that make data available through publications, and subscribers are clusters that actually receive the data. And not only is the data replicated: operations such as inserts, deletes, updates, and schema changes are also replicated to the subscriber cluster. The syntax is relatively simple and also follows SQL — can you go to the next slide? You create publications by specifying whether you want to make the data available for all tables on the cluster or only for a subset of tables; during this process you cannot do aggregations on the other side. You create a subscription, which makes a connection to a publishing cluster, and you can also specify, currently, one parameter called enabled, which says whether you would like the subscriber to be actively replicating. Currently, the data on the subscriber cluster is read-only, so it goes only one way.
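A minimal sketch of the publish-subscribe syntax just described; the table, publication, and subscription names are made up, the connection string format is illustrative, and the enabled option is the one parameter mentioned in the talk:

```sql
-- On the publishing cluster
CREATE PUBLICATION metrics_pub FOR TABLE sensor_readings;
-- or, to publish everything:
-- CREATE PUBLICATION all_pub FOR ALL TABLES;

-- On the subscribing cluster
CREATE SUBSCRIPTION metrics_sub
    CONNECTION 'crate://publisher.example.com:4310?user=repl_user&password=secret'
    PUBLICATION metrics_pub
    WITH (enabled = true);
```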
We have seen a couple of interesting use cases for this feature, which has been available since, I believe, May or June of this year. One use case aims to improve data locality: imagine you have one central reporting cluster and you want to bring the data closer to the applications that are actually using it. You can, for example, replicate the data to clusters residing in regions closer to the applications, and in this case you can achieve much lower latency, because the data your applications need is actually closer to them. Another interesting use case is related to analytics — can we just mention this quickly on the next slide? In this use case, we allow replication of data from different, distant regions to one central reporting cluster. This is a very interesting and common use case, because you sometimes want to aggregate all the data you need for your analytics in one central cluster that is closer to the analytics, instead of querying the data across different regions: you don't need to query across multiple sources, you do it from a single source, and you also control which data you would like to replicate to the central reporting cluster, so you don't need to query all of it.

Before finishing for today, we would like to say a few words about the deployment options we currently have and the ecosystem in general. I already mentioned at the very beginning that we have two deployment options. The first one, open-source CrateDB, is available under the Apache License 2.0, and as it's open source, you can run it on whatever hardware you like. We also have a cloud version that is available on more or less all major hyperscalers, and we also recently released, under the Apache License, the Crate operator, which is a piece of software you can use to deploy CrateDB on a Kubernetes cluster residing in the region of your choice. And if you're one of those people who like to contribute to open-source software, I would like to invite you to check out these GitHub repositories.

We mentioned the ecosystem: CrateDB supports the PostgreSQL wire protocol, but there are some limitations that I briefly mentioned. Among the features that we still do not support — which doesn't mean we won't — what is currently available in Postgres but not in CrateDB is TIME WITHOUT TIME ZONE. We also do not have the option to declare and use multi-dimensional arrays, and another feature that is quite often used in Postgres but still not supported in CrateDB is interval input units, for example milliseconds or microseconds. On this slide you can also see how we implement the Postgres interface, and there are a couple of other limitations we are still working on. So it's Postgres-compatible, but not 100 percent.

Let's go to the next slide to say what is next for us. CrateDB has many interesting features, but of course there is a lot still to do. First of all, we believe the ecosystem needs to improve, and we are working hard to facilitate CrateDB integrations with other systems and tools — not only Postgres-compatible tools, but others as well. We also didn't mention in this presentation, because there was no time, that CrateDB has functionality to read data from an external or local file system, and to write it out as well; this is done through the COPY FROM and COPY TO statements, and we are currently working on improving the user experience and functionality of these statements.
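A brief sketch of those two statements, with an invented table name and file paths; the source also accepts remote URIs such as S3:

```sql
-- Bulk-load JSON records from a local file
COPY sensor_readings FROM 'file:///tmp/readings.json';

-- Export the table contents to files in a directory
COPY sensor_readings TO DIRECTORY '/tmp/export/';
```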
And, as Marios already mentioned, there are a lot of optimizations on our checklist. We would definitely like to improve insert performance: queries are quite fast, and inserts are doing relatively well except in some cases, but there is also this notion of a Lucene flush, so it takes some time until data is available on disk. And when it comes to query execution, especially joins, further optimizations are planned. With this, I would like to thank you all for your time, the invitation, and your attention. If you have any other questions — not only now but also after the Q&A — feel free to send us emails or to join our community and ask questions there. So maybe now we can go to... thank you.

I'll clap for everyone else. All right, perfect, awesome. We actually have time for a few questions, so if anybody in the audience wants to go for it, go ahead, unmute yourself and fire away; otherwise I'll be selfish and use all the time for my own questions. Okay, cool. I guess my first question would be — and this is me not having any experience with Lucene — how do you enforce integrity constraints? Is that all done at the Crate level, or will Lucene do that for you? For example, you showed the object type where there's no strictness: is that being done by Lucene, or do you actually have to look at the data as it comes in before you hand it to Lucene?

Yeah, you have to do it at the CrateDB level. You can also define constraints on every field, like NOT NULL, or even complex check expressions with operators — like, the modulo of this plus this is greater than whatever. All of these are checked before you send the data to be indexed in Lucene.

And then Lucene has its own sort of type system, right? So you have to make sure CrateDB maps to its primitives, and the more complex things you guys handle up there?

Sorry, can you repeat that?

So there's a type system, right? Lucene probably has its own internal type system, but you guys —

Yeah, you have to translate it. You need to translate from the SQL data types, what Postgres supports regarding big decimals and things like that, to what Lucene supports. And in some cases you have the Java types in between, because CrateDB is in Java, so sometimes you don't even have exact compatibility. A special case is the interval, for example: we support intervals, not by storing them in Lucene, but for doing calculations and having an interval type as a result, like subtracting timestamps. In SQL, interval is very specific, and you don't have this in either java.time or Joda-Time, so you somehow need to make the translation in Java for that.

Got it, okay. So what is something you wish — well, basically, you guys are not a hard fork of Lucene, right? You're following along their main line, so you get all the optimizations and new things they ship. Is that right, or did you guys do a hard fork?

I didn't get that, sorry.

Sorry — for Lucene, did you do a hard fork of their repository, of the source code, or are you embedding it and always using whatever the latest version is that they put out?

No, we are using the latest version of that.

Okay.

Yeah, we use it as a library, as a dependency. Of course, we have some classes that we override when it comes to this data type mapping thing you mentioned, but mainly we just update, and in most cases you have to adjust
just small parts of the code to adapt to the new version and the new features.

Got it — so it's quite backwards compatible.

Then you need to see whether you can somehow use the new features, to push down more of the queries, or... yeah.

So, if you could change one thing about Lucene to make CrateDB better, is there something you wish it had that it doesn't have now?

There is one thing that is very tricky, yes, and it's the null concept. Lucene doesn't have the null concept, and whenever you do queries with IS NULL or IS NOT NULL, this is filtering you need to execute with CrateDB code: you go to Lucene, and then Lucene has a callback into the CrateDB code to execute the filtering, which is of course much less efficient than doing the filtering in Lucene itself. So the null thing, which is very, very common in the relational database world, is not cheap — not because of restrictions exactly, but because of the way Lucene didn't support it from the beginning as a concept. And this flush — the Lucene flush that makes data available on disk — sometimes takes quite some time. We have seen a couple of customers actually complaining about insert performance; it's not for all types of data, but there are sometimes use cases where this doesn't work as expected, and it takes some time until data is available on disk. This is maybe not something Lucene has or doesn't have, but it's something that could maybe be improved in the future.

Got it, okay. And as my last question — you guys are the only database system we've had this semester that's written in Java — actually, Victor, you have a quick question?

Yes — what's the best practice, the best method, to improve a nested loop join? I put it in the chat. Because executing the nested loop join over a large amount of data takes a long time, so what is the key design principle to improve that in CrateDB?

No, no — he's asking how you make a nested loop join run faster. The answer is: don't run a nested loop join; it's the worst one, right? There's nothing you can do, and especially don't run joins across multiple tables, that could be very slow.

Yeah. I mean, Victor, the answer is: a hash join is almost always superior, especially if you have to go out to disk. Go watch the class lecture, we cover this. Okay, so my last question. Again, you guys are the only database system we've had this semester that's written in Java; it's been all Rust, C++, and somebody in Zig. And you guys are almost 10 years old now. Is there anything over the last decade of Java iterations — anything that has improved in the JVM, the JDK, or the Java ecosystem — that has made things better for CrateDB, either in terms of performance or software engineering?

For the JVM, definitely garbage collection: when G1 GC became the standard and stable, that made an impact for sure. And now with Java 19 there is this preview feature for memory mapping via the foreign memory API, which we haven't tested so far, but Lucene recommends starting to use it, or at least to test it, so we are planning to benchmark with it. We haven't done it yet, because it was only about two weeks ago that we moved master to Java 19, so we haven't had the chance to test it so far.

Okay. And
code-wise there are some improvements, but in general they are not the ones that make the big impact. And there are many things, like the streaming interface — the Streams — that in many cases turn out to be much slower, when you do the micro-benchmarking, than just writing, you know, verbose for loops and doing some filtering there.