All right, so the next talk is Max talking about CrateDB. Welcome, everyone. Thank you for coming to my talk. My name is Max, as you already know, and I'm a software engineer at Crate.io. You might not have heard about CrateDB yet, so I want to tell you a bit more about it. CrateDB is an open-source distributed database. And it's kind of special because it's built on top of Elasticsearch and Lucene. Some of you might wonder: why would you even do that? Does it even work? I can already give you a hint: it works like a charm. But let's see why and how it works. So why are we talking about another database? You're all very familiar with traditional SQL databases. They're very well researched and very well executed. There are open-source databases like Postgres and MySQL, and of course there are also commercial, closed ones like Oracle. But what you typically experience when you run these at large scale is that scalable search can be quite tricky with them, although they are certainly making progress in that direction. On the other side, there are search engines, which are kind of like databases, but optimized for search and scale. We have libraries like Lucene for very efficient local searching of documents, and we have distributed frameworks like Solr and Elasticsearch, which use Lucene to provide distributed search capability. But the downside is that you typically can't use SQL with these search engines. And one might ask: why not? Because SQL is a pretty mature standard. It's been around since the 80s. Everybody knows SQL. I mean, maybe the kids nowadays not anymore, but everybody knows SQL pretty well, and it makes sense. So why drop that? And that's where CrateDB comes in. CrateDB is a scalable SQL database that is optimized for search, but it doesn't have all that NoSQL bullshit. That's a very broad statement.
But basically, what you see in CrateDB is that there are no funny APIs like you have in other scalable, distributed databases. It's very easy to configure. And it actually works: it's built with tools that have proven to work in production. It's not like some other NoSQL databases, and I don't want to name any names, which break completely in production. So, CrateDB in a nutshell. It's a product that's been around since 2014, and the GitHub address is here: crate/crate on GitHub. The community edition is Apache 2.0 licensed. It is built using Elasticsearch and Lucene, as I already told you, and we'll see in a bit more detail what that means. It's SQL-99 compatible, which sounds really outdated, but it's probably all the SQL you will ever use. And it has various ways of connecting to it: a REST and a Postgres interface. Postgres means that you can actually use a Postgres client and the Postgres protocol, so if you have existing infrastructure that connects to Postgres, you can pretty much just plug in CrateDB. And it has a REST interface, of course, but that only accepts SQL statements. Because I want to be honest, since I made this broad statement, I want to tell you what CrateDB is really useful for. As I said, it's easy to set up, and it doesn't have funny APIs. It has great scale-out for massive reads and writes. It is container-aware, which these days is really important. It's not so great if you want to use transactions, because it doesn't support transactions, or stuff like foreign keys. So, sorry? Who would want that? Yeah, we're talking about scalable search. So what is it like to use CrateDB then? In a way, CrateDB is just like a normal SQL database. We can use SQL statements to create a table for some speakers. We can use primary keys. We have, of course, data types like integer and string. And if we have a speakers table and a talks table, we can of course insert data into both tables.
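As a sketch, the two normalized tables described here might look like this in CrateDB's SQL dialect (the exact table and column names are illustrative, not taken from the slides):

```sql
-- Two normalized tables, like in a traditional relational schema
CREATE TABLE speakers (
    id INTEGER PRIMARY KEY,
    name STRING
);

CREATE TABLE talks (
    id INTEGER PRIMARY KEY,
    speaker_id INTEGER,
    title STRING
);

INSERT INTO speakers (id, name) VALUES (1, 'Max');
INSERT INTO talks (id, speaker_id, title) VALUES (1, 1, 'CrateDB');
```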
And then, for example, in a typical relational database I would do a LEFT JOIN on the talks table and join in all the speakers on the speaker ID. But there's more. As you already know from typical relational databases, if you want scalable, high-performance queries, you usually denormalize your schema. And that is what a lot of people do when they use CrateDB. We can retain the same information as in the previous slide with the two tables by having just one speakers table with the speaker name and an object column containing a title string and an abstract string. Inserting data works the same way. And the beautiful thing is that you can just select from these objects: talk is our column and title is a field inside that column, and you can select from it directly. It looks like an expensive operation, but it's not; it's the same as having another column. This has to do with how CrateDB actually realizes these tables using Lucene under the hood. I will get back to that in a second. So the bottom line is that CrateDB is also very good if you have unstructured data like this, because you can pretty much just ingest it this way. It's very flexible. Now, when you create a table, because we're dealing with a distributed database, if you have a four-node cluster, for instance, you can additionally specify a clustering scheme. You can cluster into four shards, which means that the table you are creating, the speakers table, will be split into four parts, which we call shards. You might recognize this term from Elasticsearch, for instance. All rows which have the same clustering value, the same name, will be on the same shard, but there's hashing involved, so your data will be evenly partitioned among the nodes. And of course, we also support replication, which you specify with number_of_replicas.
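A sketch of that denormalized variant with an object column and the clustering clause (the names are illustrative; the object literal and subscript syntax follow CrateDB's documented dialect):

```sql
-- One denormalized table: the talk is embedded as an object column
CREATE TABLE speakers (
    name STRING,
    talk OBJECT AS (
        title STRING,
        abstract STRING
    )
) CLUSTERED BY (name) INTO 4 SHARDS
  WITH (number_of_replicas = 1);

INSERT INTO speakers (name, talk)
VALUES ('Max', {title = 'CrateDB', abstract = 'SQL on top of Elasticsearch and Lucene'});

-- Selecting a field inside the object is as cheap as selecting a plain column
SELECT talk['title'] FROM speakers;
```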
The good thing about replication is, of course, that the replicas are never on the same node as their primary. They are arranged so that if one node fails, you can recover from a replica on another node. But the cool thing about replicas is not only that they provide fault tolerance, but also that they can actually increase your search speed, because the data is available in multiple places. So that makes sense. What you can additionally do in CrateDB, which is really neat for time series data, for instance, is collocate your data further. That is called a partitioned table. So if you have this table here, like we had before, you might have a year column. Like before, your data is spread into shards with CLUSTERED BY, but you can additionally partition your table by one or multiple columns. Here we have three years, so we get three partitions for our table, but you can have as many as you want. Every time a new value, a new year, comes in, CrateDB creates another set of primary and replica shards for the table. This is what we call a partitioned table. And there's more. There are, of course, all kinds of aggregations. There's geo-search, so you have a geo data type for your columns, and you can search by distance or use other geometric methods. There are text analyzers: columns are always indexed by default in CrateDB, but if you want something more, like full-text search or stemming on a string column, you can use one of the built-in analyzers, or even write your own, to efficiently search for some value in that column. We have user-defined functions, snapshots to back up and restore your data, user management, schema privileges, encryption, MQTT support, and so on. But I cannot go into all of this in detail, of course.
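Two of those features can be sketched in SQL. This follows CrateDB's documented syntax rather than the exact slides: a table partitioned by year, and a column with a full-text index using a built-in analyzer, queried via the MATCH predicate.

```sql
-- Partitioned table: a new set of shards is created per distinct year
CREATE TABLE speakers (
    name STRING,
    year INTEGER
) CLUSTERED INTO 4 SHARDS
  PARTITIONED BY (year);

-- Full-text index with a built-in analyzer
CREATE TABLE talks (
    title STRING,
    abstract STRING INDEX USING FULLTEXT WITH (analyzer = 'english')
);

SELECT title FROM talks WHERE MATCH(abstract, 'distributed search');
```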
But what is more important now is to look a bit at how these features are realized in CrateDB, so let's look at the architecture. Like a lot of projects nowadays, we are standing on the shoulders of giants. We are a distributed SQL execution engine, but we have a number of important dependencies, and these are the most important, I would say. We use ANTLR, which is a parser generator, to parse a statement and generate an abstract syntax tree. We have Netty for all communication: the REST interface, Postgres, and the web interface. We have Lucene for storage, indexing, and the internal query format that we generate from the SQL query. And we have Elasticsearch, as I mentioned, for transport, clustering, and routing. This lets us concentrate on the upper SQL execution part, which is really nice. I'm not sure who of you has used Lucene before. Yeah, some people. So, a very basic introduction to Lucene. Lucene stores documents, and in CrateDB these documents are pretty much our row data. This is how rows are translated into documents. It's completely transparent, and you would never know it actually happens, but you wanted to learn more, so I'm telling you now. These documents have fields. For instance, we always have the `_id` field, which is an identifier for the document on our node and our shard. And for each column of the row there is a corresponding field, like the ones listed here. As I mentioned, these fields are indexed by default, so if you want to find Bob, you can do that very efficiently. You don't need to create an index yourself. It's tunable: you can turn it off if you want, but by default it's on. And thanks to Lucene, we also have a column store integrated. So if you have a column that you want to aggregate, say sum up some numbers in a revenue column, whatever, you can do that very, very efficiently due to the column storage, because you can just aggregate over the whole column.
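For example, an aggregation like the following (the `sales` table and its columns are hypothetical) can be answered by reading the `revenue` column store directly:

```sql
SELECT region, sum(revenue) AS total
FROM sales
GROUP BY region;
```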
You don't have to go through each row and find the value. Elasticsearch: who has worked with Elasticsearch? OK, it's about the same number of people, but different people than with Lucene, I saw. In Elasticsearch, you have core concepts that revolve around indices, shards, and replicas. There was a really good talk yesterday from one of the Elastic guys which also explained this, but I'll just quickly go over it. We have these shards in an index, like I mentioned before, and each shard has replicas, of course. These replicas are also useful for search performance, as I already said. So how do tables relate to indices and shards in CrateDB? A CrateDB table is always represented by at least one Elasticsearch index with a mapping. If the table is not partitioned, then it's just one index. Each index in Elasticsearch has a mapping, which kind of looks like this. Well, it actually looks like this; I copied it from the source. For our speakers table, you have, for every column, the name and some attributes (keyword is a simple string type, for example), and you can have nested properties as well, which is how the objects we saw previously are represented. So if you have a normal table like t1, which is not partitioned, you have one index for this table, called t1, and it might have the four shards we saw before. You might see another table whose index has three shards; these are not the same shards. And then you have a table t2, which is partitioned, and we create these partitions. Here it's by day, but before we had year. So if a new value for your year comes in, and you are partitioned by year, thank you, then you will basically create a new index for every year, or every day. And this is quite handy if you want to operate on data exclusively on a per-year or per-day basis, like you would in time series analysis. So how does this all come together?
So when we have a SQL query, it will first be submitted by one of our clients, which can be our web interface, or psql, or crash (our command line utility), or JDBC, Python, whatever. It will arrive on one of the nodes, and it doesn't matter which node; you can just pick any one. Depending on whether you use REST or Postgres, we will receive your statement. Then it goes into the parser. This is all built on top of the layers I mentioned. The parser makes sense of the SQL statement syntactically. Then the analyzer does the semantic analysis, gathers all the relational information, looks up, for example, the Elasticsearch indices, and basically does a semantic check. The planner then generates the execution plan, so it figures out which nodes are going to be involved. Then the executor uses the Elasticsearch transport layer to submit the work to all the involved nodes. The nodes themselves, of course, have their own execution logic again. The necessary data gets looked up on the storage layer, and in the end it's sent back and merged. There can be multiple phases involved, depending on the SQL statement. If you have a subquery, you might first do some other operation before you do the final operation. So what are some highlights of this architecture? You have distributed storage and distributed execution. It is masterless, which doesn't mean there is no master; there's always one master to coordinate, but any node can potentially be the master, so if the master fails, another one takes over. We have, obviously, replication, which is really neat. We don't depend on any storage system; we use only local storage. So if you use containers and you want to shut down a container and bring another one up, you can totally do that without worrying about data persistence, if, of course, you have replication turned on.
And it's highly optimized for search, with Lucene indexing all the tables by default. I wanted to do a bit more hands-on, but I realized I probably don't have that much time. So what can you do with CrateDB? Obviously, monitoring is a big thing, where you receive sensor data, IoT data, large amounts of data, or network events, and you want to make sense of this data in real time, which you do when you do monitoring, or intrusion detection, or something like that. Some people use it for stream analysis. It's not streaming in the sense of Apache Flink, for example, but you can ingest raw data and then perform aggregations on this data at very short intervals. So it is streaming-like functionality, which a lot of people use. Then, of course, text search, obviously, because we have the analyzers in Lucene; it's used very often. Time series I mentioned, with the partitions. And spatial queries are also used very often. So how does it look? This is the admin interface. You get a very good overview of your cluster here. It's a general overview page where you can see whether everything is replicated, how many records we have, and some basic load information. We also made that cgroup-aware, so you can use this load information in a container environment as well. Then we have a couple more pages here. Our time's up? I thought I had 25 minutes. OK, I'm done, basically. You have your tables here, and you can see how many shards and partitions you have. And node information. We have some sharding information as well, to see which shard is currently a replica. OK, so the bottom line is: Elasticsearch uses Lucene heavily, and Netty, among others. And we use Elasticsearch, Lucene, and a bunch of other stuff to build a distributed SQL database on top. I think CrateDB is really perfect when you want or have to use SQL, and when you want to store large amounts of structured data and run many thousands of queries a second. So I would invite you to try it out for yourself.
You can go to the download site or use this fancy curl command, if you trust us. And docker run crate is also a possibility. You can check out the docs. If you want to contribute, there's some information on how to contribute, and Stack Overflow. And we have a Slack channel as well. So thank you very much. Thank you, Max. Any questions? Do you use parent-child relations in Elasticsearch? If we use parent-child relations from Elasticsearch? I'm not aware that we use that. So all the relations are built on top of Elasticsearch? Yes, we have our own relation management. Do you have any performance numbers? I explicitly didn't go into performance or benchmarking, because this is a difficult topic. On our website we have some benchmark results, if you want to check them out. But as in the Elastic talk, benchmarks are always subjective, I would say. So: it's very fast, but it depends on the use case. If you're joining relations all the time, it's going to be a bit slower. If you have a denormalized schema, like I showed, it's going to be very fast. Because I know that Elastic is switching away from parent-child relations, because of the performance issues, among others. That was the main goal of dropping type support and not using the relations at all. They wanted to switch away from relational schemas and everything, because people mistakenly took it for a relational database and tried to do joins, which is impossible to perform well on those data structures. So I was wondering how you resolved that. OK. Joins are not impossible, but yeah, I agree. All right, thank you for the talk. I just want to ask, have you considered Sphinx before you started CrateDB? Because Sphinx has been out there for a decade and has all the functionality CrateDB has. So you're asking if I know Sphinx? Have you considered Sphinx? I don't know what that is, actually, so I guess I've not considered it. Sorry.
Since you use the PostgreSQL wire protocol, how consistent would your SQL dialect be with the PostgreSQL dialect? And would it be considered a bug if there was a significant difference? In our code base, we have the SQL logic tests, which help us keep track of exactly how much of the SQL dialect we cover. I mentioned there's no support for transactions, so obviously we don't match 100%. But we have a lot of people who plug in CrateDB instead of Postgres, and it works. We have information_schema and all this stuff you would expect from Postgres. I cannot give you all the details; it depends. But I would say it's pretty good coverage for most use cases. OK, that's all we have time for. Thank you, Max. Thank you.