The Carnegie Mellon Quarantine Database talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. So we're super excited today to have Matthias J. Sax, who's a software engineer at Confluent. Matthias is going to talk to us about ksqlDB, which is a relational database built on top of Kafka. Matthias has been at Confluent working on Kafka and Kafka-related projects since 2016. Prior to that, he got his PhD at Humboldt University, where he studied distributed stream processing systems, which is perfect for working on Kafka, right? We'll organize this like we normally would: if you have any questions, please unmute yourself at any time and interrupt Matthias, because we want this to be interactive. But say who you are and where you're coming from, and then ask your question. Okay, Matthias, the floor is yours. Good morning. Okay, thank you very much for the introduction, Andy. Very happy to be here and have the opportunity to speak about ksqlDB. I call it a stream-relational database system, which, I have to confess, might be a little bit of a stretch; I tried to have kind of a catchy title. It's also kind of unclear what a stream-relational database system actually should be, and I will try to give you our take on this and explain what we have built so far and what ideas we have in mind for future development. So maybe let's start with what this should actually be. We also call ksqlDB a streaming database, and the idea of a streaming database, in our mind, is to have a SQL abstraction over mutable tables and immutable append-only streams. So what does this actually mean? If you compare to traditional relational databases, they have two main use cases. You have OLTP use cases, where you have a lot of transactions modifying your state and concurrent updates, and you need to make sure everything is consistent. And on the other hand, you have OLAP, where you have those complex ad-hoc analytical queries. And ksqlDB is not built for either of those use cases. ksqlDB is built for use cases that you could call online stream processing. I've heard other people using this term, so maybe we can establish a new OLSP category here. Let's see. And the idea is really to say, well, we want to work over streams and tables, but we really focus on continuous queries. In a regular database, you issue a query, the query is protected via ACID guarantees, and it basically works against a snapshot of the database. Continuous queries are different: we know that tables are changing over time, they're evolving, and we have those streams where new data always comes in, and we want to continuously process it. Besides those continuous queries, ksqlDB also allows actual lookup queries into tables. This is similar to the classic queries you know of; however, in ksqlDB it's very limited. We have only simple state lookups, and I will explain this a little bit later in more detail. And then we have a third type of query that allows you to subscribe to data streams. So when you're a client and you connect to the database, you can just say, I want to subscribe to this data stream, and whenever new data is appended to this data stream, it is sent to the client in a continuous manner. And one of the two main use cases we have seen so far is this kind of streaming ETL, where you might have some upstream producer that is publishing data.
So you want to, you know, transform this data and then eventually maybe land it in a data warehouse for further ad-hoc and OLAP queries, while ksqlDB is really just doing the streaming ETL part. And the other use case is materialized view maintenance. So you can basically say, I have my query, my query computes a table as a result, and I want to have this table continuously updated while new data is arriving in the system. And then I want to do some simple lookups into the materialized view from my client. And if you do this in a streaming fashion, then of course you get very low latency updates to your materialized view, and that is exactly one of the sweet spots of ksqlDB. So before we go into many details of ksqlDB, maybe first a clarification: as Andy said already, ksqlDB is built for Apache Kafka, so it only works with Apache Kafka; it's not a standalone system. The high-level architecture is really that we have two clusters of servers. On the one hand, we have the broker cluster, that is Kafka, and the broker cluster can be considered the storage layer of the system. And then we have a cluster of ksqlDB servers, where we do the actual computation, and both clusters communicate over the network. So it's decoupled, and you can also scale both clusters independently. And because everything is built on Apache Kafka, some things from Apache Kafka are inherited in ksqlDB's design. For that reason, I would like to go a little bit into the background of Apache Kafka, what it actually does and how it works, to allow us to understand how ksqlDB, which is built on top of it, works. So if you look at Kafka as a storage system, the first thing you have to understand is what Kafka actually stores. And in Kafka, we store logs. Logs are append-only sequences of messages. We have this logical abstraction that we call a topic, and topics are partitioned, so every partition is a log by itself, you could say. And because it's an append-only log, if you write new messages, you always append to the end of the log, to the tail. So it's a very, very simple thing; nothing can be updated. And when you read from the log, you usually do a scan of a partition, or multiple partitions, or multiple topics, from old records with lower sequence numbers (indicated here with the numbers; we call those offsets) to higher sequence numbers. The main difference to a regular messaging system is that in Kafka those logs are persistent and replicated. And that allows clients to say, at some point in time, I would actually like to go back in time and read old messages; as long as they're still stored, and that's configurable, you can do this. But the main access pattern is really linear scans over the partitions. That is kind of important: there are not really random lookups here. So this is actually a quite simple model. And the next question is, well, we store those messages, but what are those actually? Kafka is really a very primitive system in this regard, because messages are just plain bytes to Kafka. It uses a key-value model, so a message has just those two components, but Kafka doesn't really know what is stored in there. So when a client is writing something into a Kafka topic, or appending something, it has to serialize the data client-side, and then it just sends over the bytes, and Kafka puts them into the log and is done with it.
The advantage of this is that Kafka is very high-performance because it doesn't do a lot of stuff, and that makes it very, very scalable. The key-value model is important because the key is usually used for partitioning. So even if the client decides into which partition a particular message is written, it usually uses the key to compute, let's say, a hash, or does range partitioning, to pick a partition and then write to it. And we will see that this notion of a key becomes very important for some applications. And then, besides key and value, we also have an associated timestamp with every message. That's also important for stream processing, because we want to do temporal processing and consider the timestamp when the message was actually written; actually, the producer is setting the timestamp, not the broker. So this also becomes important. Then the other important thing is, well, those logs are append-only, as I said. So if you cannot delete anything, they would grow unbounded, something we don't want, obviously. And so topics have a so-called cleanup policy that allows you to actually get rid of old data in the topics you're not interested in anymore. And there are two policies here. The first one is the retention policy, and it just says: as you append new data to the topic at the tail, at some point old messages are ready to be purged; we're not interested in them anymore. And the retention policy is either time-based or size-based. So you can say, store data for the last month or for the last day, or you can say, store, whatever, a gigabyte per partition or something like that. The retention policy is quite straightforward. The other one, the compaction policy, is actually much more interesting, and here the key-value model also comes into play. When you configure a topic with compaction, you basically want to apply some kind of update semantics to your data, where you say, well, I treat the key of the message as kind of the primary key, if you wish. And every time I write a message with the same key, I consider it to be an update to the old message. And then the compaction policy allows that older versions of the message with the same key can be deleted from the log. So in this case, we have this example where we have five records with keys and values; these are the letters. And we see that we have three records for key A. And so when compaction runs, compaction is allowed to delete the first two and only keeps the latest one. That also implies that if you put a record into the topic and you never touch it again, you never write another record with the same key, it will stay there forever. So that's also important: you would never lose data or lose state for a particular entity. And what you can also do here is delete stuff. For that we have so-called tombstone messages, where we say, well, if you write a message with a particular key and you set the value to null, then we just treat it as a delete, and then we can delete the entry from the topic entirely. Well, as database people, you know, this log compaction looks actually pretty neat. And what it allows us to do is to talk about what we call the stream-table duality. If you have a table and you modify the table, then you can record all those changes in a so-called changelog, and then you can reuse this changelog to actually recreate the table.
So here on the left-hand side, we have an input table, and we say, well, we do an insert with Alice. That is something we can record with Alice as the key, and we just say we have seen Alice once; we just do a count here. Then we have a Charlie; we append it to the table. And now we do an update of Alice from one to two. In this case, in the changelog topic, we just append a new record, Alice 2, and so forth. And now when we reread this, we can turn the changelog back into a table and recreate the table state. And the interesting part is, if you're only interested in the latest version of the table, in this case the one on the very bottom right, and we run compaction on this changelog stream, even if we delete the first record, Alice 1, you can still recreate the table. So topic compaction really gives us this property: if something is not updated, we never delete it; but if it's updated, we can purge old state, old versions of the table that we are not interested in anymore. And this is a very important concept that is used in Kafka Streams and ksqlDB a lot, so just keep it in mind. It's all really based on this key-value model and update semantics. So now we go a level higher into ksqlDB. I mentioned already, we want to talk about streams and tables here. A table is very similar to a database table: we have a CREATE TABLE statement, we define our schema, so all records are structured, and we require that all records follow the same schema, like in a regular database. The main difference is really that we have this WITH clause, where we say, well, where is the data actually stored? And this refers to a Kafka topic, because that's the storage layer. And we also have the notion of the primary key again, the user ID. In this case, it tells us that this attribute, the user ID, will be stored in the key of the message, while all the other attributes are stored in the value. So far, so good. And now, in addition to tables, we also allow people to say, I actually want to create a stream, an event stream, over a Kafka topic. Similarly, we have all those attributes, it's structured, and we use the regular database types that are available. But now, we basically say, well, we have the same situation, data in a Kafka topic, and creating a stream or a table gives us a different semantic interpretation of this data. And usually, going back to the retention policies we just explained: if you have a data stream, you usually configure the topic with retention time, and if you have a table, you usually configure it with compaction. So there's kind of an alignment with those configurations. Does this matter when you're actually declaring the table or the stream? Well, it doesn't; it's just something you should keep in mind about how the system works. And very often, the topics actually already exist. That is kind of similar to what Hive is doing: you have HDFS with some data in it, and you create a table over something that already exists. So we create streams and tables over topics that already exist in the Kafka cluster. Does that make sense? So what changes, if I say I want to create a table versus create a stream, what changes in the query API? Is it just the underlying semantics of how things will be computed? Well, queries over streams and tables are executed quite differently, so you get different operator semantics for sure. Okay. We will see this later.
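For illustration, this is roughly what those declarations look like in ksqlDB's SQL; the topic names, column names, and formats below are invented for the sketch, but the WITH clause and the key versus primary-key distinction follow what was just described.

```sql
-- A table over an (ideally compacted) topic: user_id is the PRIMARY KEY,
-- so it is read from the message key; the remaining columns come from the value.
CREATE TABLE users (
  user_id VARCHAR PRIMARY KEY,
  name    VARCHAR,
  region  VARCHAR
) WITH (
  KAFKA_TOPIC  = 'users',
  VALUE_FORMAT = 'JSON'
);

-- A stream over a retention-based topic: every record is an immutable event,
-- and records with the same key are NOT interpreted as updates.
CREATE STREAM clickstream (
  user_id VARCHAR KEY,
  url     VARCHAR,
  status  INT
) WITH (
  KAFKA_TOPIC  = 'clickstream',
  VALUE_FORMAT = 'JSON'
);
```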
Because, as I said, when you have a table, you actually say every record with the same key should be an update to a previous one. But in a stream, you treat everything as an append-only sequence; there are no updates. But how are you expecting things to get updated, through the Kafka API, or are they getting updated through ksqlDB? So the most prominent use case in production is where some upstream application, and I have a slide on that later, is actually appending new data to the topics. We also allow INSERT INTO statements, but they also work a little bit differently. Okay, fair enough, keep going. Yeah. Okay. So that is the data model. We usually assume we have existing topics, and we want to process the data. So we say: treat this topic as a stream, or treat it as a table. And now we issue queries. And the bread and butter of ksqlDB is really those persistent continuous queries. They take one or multiple streams or tables as input, and they compute a result stream or table. And now, actually, the write path comes into play. When this query is executed and we create a result stream or table, the result is published back into a topic in Kafka. We use these CREATE STREAM AS SELECT and CREATE TABLE AS SELECT statements to express this. The SELECT is really expressing your actual computation, and then we just say CREATE STREAM or CREATE TABLE to write it back into a Kafka topic. And as I said, those queries are deployed on the ksqlDB servers and run forever, until you say DROP STREAM or DROP TABLE, and then we also terminate the corresponding query. And what is interesting, because we are already in a data-parallel system (I will explain more details later about how this works), those queries are executed data-parallel, and that makes the whole system scalable and fault-tolerant. You can always scale out queries, and you can also scale out the cluster by adding new compute nodes, and we will just use them. And if any server that is executing a query goes down, another server will take over and we will fail over; I will explain a little bit later how this works in detail. So these are the persistent queries, and that is really the main use case we see. However, we also support transient queries. We have a regular CLI where you can interact with ksqlDB, and we also have a Java client library. There you can easily say, I want to deploy a persistent query into the server, obviously, but you can also write those transient queries, which we call pull queries and push queries. A pull query is really like a classic query. It goes against a single table, and you can do state lookups and say, give me the current value of this row. At the moment it's very limited; we don't support aggregations or joins, on purpose, because if you want a more complex aggregation, you would always submit a persistent query to compute the output table you want to query, and then you just do lightweight lookups into that table to actually query the state. And then we have those push queries that I mentioned, where you can say, I actually want to subscribe to a result stream. And you can think of it like this: as I said, when we compute the result of a query, we pipe it back into a Kafka topic. So you could basically say, well, I want to read this Kafka topic, conceptually, and get all those messages that are updating my result.
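A rough sketch of the three query kinds just described, using the made-up stream and table names from above (the persistent query at the top is the one that keeps running on the servers):

```sql
-- Persistent query: continuously maintains a table, materialized on the
-- ksqlDB servers and written back to a Kafka changelog topic.
CREATE TABLE clicks_per_user AS
  SELECT user_id, COUNT(*) AS click_count
  FROM clickstream
  GROUP BY user_id
  EMIT CHANGES;

-- Pull query: a one-shot key lookup into the materialized table.
SELECT click_count FROM clicks_per_user WHERE user_id = 'alice';

-- Push query: subscribe to the result; rows keep streaming to the client
-- as new data arrives, until the client closes the query.
SELECT * FROM clicks_per_user EMIT CHANGES;
```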
And I only issue this query once and it just runs forever, and the results are just streamed to the client; the client doesn't do anything more than that. Excuse me, can I ask a question? Sure. Yeah, by the way, I'm Ling, I'm a PhD student here. So I'm wondering, is there any fundamental reason that you cannot do aggregations or joins in ksqlDB pull queries, or is it just something that does not have enough need to be supported yet? There's no fundamental reason. I mean, it's really something that we don't really need at the moment, and we're also not sure if we will ever need it, because that's not the use case we really optimize for. I see, I see. Yeah. And besides that, with our current architecture, because we never optimized for it, it would also be very hard to do. Not impossible, but... and we will see later how it works. Right. I don't know if you're going to talk about this, but it sounds like you don't have a query optimizer, or you do have one, or how should I think of it? Not really, at the moment. We do some very simple optimizations like filter pushdown that are rule-based and stuff like that. Got it. But general query optimization, I will talk about it at the very end of the talk. It's a very interesting topic for us, because we need to optimize, basically, those continuous queries that are deployed on the servers, and then completely different questions are interesting than in a relational database system. And so also, your SQL syntax is different because... No, no. I mean, the CREATE STREAM versus CREATE TABLE, that's different. But the SQL itself, there are no semantics that are specific to ksqlDB, right? We'll see examples later on. Okay. And then your network interface, for the client, is that based on something else, or is that written from scratch for you guys? Did you base it off of some existing database or something like that? So it depends. For the Java client library that is directly talking to the ksqlDB servers, that is something we built from scratch. Got it. But of course we have the communication from the ksqlDB servers to the Kafka brokers, and there we just use the Java clients that ship with the Kafka project. Okay. All right. Thank you. So yeah. The overall high-level deployment of ksqlDB is usually something like this: you have the Kafka cluster that is storing topics, which we then register in ksqlDB with those CREATE STREAM and CREATE TABLE statements. You usually have upstream applications that are writing into those topics. And then you have the ksqlDB cluster that is consuming them with those continuous queries and is publishing results back into the Kafka cluster. And in addition to that, while ksqlDB is executing those queries, it also accumulates state for tables. This table state is served and exposed via those pull queries, where you can say, do a lookup into this table and give me the current value for this primary key. Or the application can say, I actually want to subscribe to a data stream, and then the ksqlDB servers push out all those results automatically.
And because all those results are published back into topics in the Kafka cluster, there can also be downstream consumers that consume those topics for further processing. So going back to the streaming ETL pipeline: we said you might have some upstream database where you do change data capture to get everything into Kafka, you use ksqlDB to crunch your data, to modify your data, and write it back, and then you use another downstream consumer to pipe it into a data warehouse. That would be a classic ksqlDB application. Okay, so with this in mind, and Andy asked about this already, let's talk a little bit about querying. What we try to do here is to say: if we query tables, we want to behave like regular SQL; however, one difference is really that we want to have temporal semantics. SQL also has temporal extensions, and it's kind of going in this direction, so there's some overlap. And the temporal operators we have should be event-time-based. As I said, every message in a Kafka topic stores a timestamp, and we will always use this timestamp to apply our temporal operators and compute the result. This of course allows us to build a deterministic system. We would never use something like wall-clock time, because then everything would be non-deterministic. So if you think of a table, you really think of the table as something that is changing over time, and conceptually you can associate a version with the table for every point in time. And then, of course, we have our streams, and the question is, how do we want to query streams? Let's start with a simple stateless query. We have filters and projections, and that should be rather simple. Let's assume we have our clickstream as input, and we say, well, we want to apply a filter on some user ID here, then we project some fields, we also call a UDF here, and then we pipe the result into an output click stream. So what does it actually do? Well, we have our clickstream input, and every time a new record is appended to the clickstream, this record is piped through the query, and if it passes the filter, we apply the corresponding projections and expressions, create an output record, and write it into the output stream. That's rather straightforward, I would say. But now the question is: if you have a stream, and a stream is conceptually infinite and unbounded, how can I actually aggregate a stream? And here we follow a little bit of a different approach than many other stream processing systems. What we say is, if you aggregate a stream, what you actually compute is a result table, and while we get new input data in the stream and process this input data, we are continuously updating the result table. So here we have an example with this count. We have an input clickstream, we have an additional filter here, and we do a GROUP BY on the user ID: we want to count, per user, how many URLs this user visited. And we can do this in a windowed fashion or a non-windowed fashion, it doesn't really matter; the result will always be this table, which the ksqlDB server will store in the server itself, but all the changes to the table while we are executing this query will also be written out into a Kafka topic as the changelog stream of the corresponding table. And so you can think of it like this: the grouping attribute here is the user ID.
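As a sketch of such a stateless persistent query (the names and the filter value are invented; UCASE, a built-in scalar function, stands in for the UDF mentioned above):

```sql
-- Stateless persistent query: filter + projection + a scalar function call.
-- Each record that passes the WHERE clause produces exactly one output record.
CREATE STREAM user42_clicks AS
  SELECT user_id,
         UCASE(url) AS url_upper
  FROM clickstream
  WHERE user_id = 'user_42'
  EMIT CHANGES;
```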
So basically the user ID becomes the primary key of the result table. And while new input records are coming in, we can just update the corresponding row for this user in the table, computing, updating, and refining the result as new data comes in. And then you can apply a window, like here; and this is one of those extensions that is not standard SQL, this WINDOW TUMBLING clause. We say, well, actually we don't want to count until infinity; we want to restart the count every 30 seconds, start at zero again, and chunk the input stream into those windows. In this case, the table would basically create a new entry for this user every time we start a new window. And that is quite unusual, because most stream processing systems, when they do an aggregation over a stream, produce a result that is also a stream. And then, of course, they always force you to have a windowed aggregation, because otherwise they cannot do the computation at all. In our case, because we say the output is a table that we continuously update while new data arrives, we support windowed and non-windowed aggregations and can represent both as a table, actually. What's a good example where someone would want this choice? I understand the windowed one; is there a use case, an application, where you actually need both? Because you're deviating from the SQL standard. Yeah, I mean, as I said, the WINDOW TUMBLING is actually a non-SQL-standard clause, and that is what people usually want to do, because they say, well, I have an aggregation and I'm only interested in a certain amount of time. So let's say I do fraud detection: I'm only interested in the transactions for this user that happened in the last hour. Everything else doesn't really matter to detect fraud. All right, so I thought the SQL standard had tumbling windows; this makes sense. No, it does not. The SQL standard has a window clause, well, an OVER clause that is part of the SELECT statement, but this WINDOW TUMBLING clause is non-standard. That is something we added. And there's actually a big debate going on in the community about whether this is the right approach. Other people have other ideas for how a tumbling window could be represented without the need for this syntax, but we think adding the syntax actually makes the language easier. They've been debating this for 15 years, so it's not new. Sure, sure. But there is actually an effort at the moment, and Confluent is also involved in it, to maybe define a streaming SQL standard. So there's a lot of debate going on there. I mean, Stonebraker tried to do streaming SQL 10 years ago. But if there are enough companies now, I think you guys can pull it off, whereas before there was only StreamBase and a few others. Yeah, exactly. That's such a big difference. I mean, in the last five to ten years, stream processing has become kind of mainstream, so now defining a standard becomes much more interesting. Before, if it's just academia, nobody really cares and it's not gonna happen anyway. So let's see how far it goes. I mean, it's very early, and maybe there will not be any standard; who knows, that could also happen. It's an interesting discussion. I don't wanna spend time talking about your competitors, but for Spark Streaming, is their syntax widely different from yours? I'm not totally sure at the moment, because I'm not very familiar with Spark Streaming.
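A sketch of the windowed variant of the same count, using the non-standard WINDOW TUMBLING clause discussed here (names invented):

```sql
-- Windowed aggregation: the count restarts for every 30-second window,
-- so the result table gets a new row per (user_id, window start).
CREATE TABLE clicks_per_user_30s AS
  SELECT user_id, COUNT(url) AS url_count
  FROM clickstream
  WINDOW TUMBLING (SIZE 30 SECONDS)
  GROUP BY user_id
  EMIT CHANGES;
```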
So I know that some systems, especially Flink, try very hard to be SQL-compliant, whereas we give ourselves a little bit more wiggle room and prefer a clear, easy-to-use syntax, even if it breaks compatibility with the standard, because it's just better for users. Famous last words. We will see. Yeah, all right, keep going, sorry. Okay, yeah, anyway, so I think this is pretty interesting: streaming aggregations basically produce a result table, and always keep in mind that the result here is kind of a dual. We have the actual state that we store in the ksqlDB server, and we also have the changelog topic that we write into Kafka. So next, we want to do joins. If you have streams, how do you join streams? Well, because streams are conceptually unbounded and infinite, we need to force users to apply a window, a sliding window, to bound the join. Otherwise, you would need to buffer the whole history of the streams, and that is something we cannot do. And this is also non-standard. So here we have this additional WITHIN clause, where we say left stream join right stream WITHIN five minutes. That means records can be at most five minutes apart from each other. And then we have some join condition here to execute the join. So if you look at an example, we have two input streams, and here I basically only show the timestamps. The numbers and letters are just for illustration, to give every record a unique identifier so we can see the output result. And we apply this time window, and here we would get three output records. A is joining with 1; they are three minutes apart, falling within the window. B and 2 are joining; they're five minutes apart; and so forth. What is interesting is the last output record. If you pay close attention to the first input stream and the timestamps, you see that the last record has a descending timestamp. Because in Kafka topics the timestamp is set by the producers, there's no guarantee that the records in a topic are ordered by timestamp. We call those records out-of-order records, and the engine needs to take care to process those records correctly. That's kind of important; I just want to mention it on the side. And what we also need to do, for this reason, is define precise operator semantics to compute the timestamp of the output records. Because now I have two input records with two timestamps, and the question is, what should be the timestamp of the output record? What we do here is define the semantics to always use the maximum of both inputs. So for this last result record, even if we might process record 3 with timestamp 14:08 later in wall-clock time than the B record, we would still give the output the B record's event timestamp. And that gives us deterministic semantics, and that's super important. And you know, we had a lot of discussions internally about how we do those things, and now we have an agreement. We're not sure if it's the right thing to do, but that is what we do, and we think it works great so far. This is so esoteric. I can't imagine any of your customers even knowing... Well, I guess many customers don't know, that's true. But the point is still: if they issue a query and they get a result back, and they issue the same query, they want the same result back. Yeah, that makes sense. They don't really care, but it must be deterministic.
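Roughly what such a windowed stream-stream join looks like; the stream and column names are invented, but the WITHIN clause is the non-standard extension described above:

```sql
-- Windowed stream-stream join: records join only if their event timestamps
-- are at most 5 minutes apart; the output timestamp is the max of the inputs.
CREATE STREAM impressions_with_clicks AS
  SELECT i.ad_id, i.user_id, c.url
  FROM impressions i
  INNER JOIN clicks c
    WITHIN 5 MINUTES
    ON i.ad_id = c.ad_id
  EMIT CHANGES;
```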
And getting determinism in streaming systems is not straightforward, but that is why I'm pointing it out. This is one of those things that is different, because everything is temporal. Another thing that's interesting is n-way joins, because if you want to join multiple streams, those joins are not associative anymore. If you have a proper n-way join over three input streams, it actually gives you different results than if you use two binary joins and chain them together. And the other interesting part is that there are multiple permutations of how you can chain those binary joins, and all of them give a different result. So let's look at why this actually is. Here again, we have three input streams, and we again assume the time window is five minutes. In this case, we would get a single output result: only 2, Y, and B are joining; they are within five minutes and fall into one join window. If we now execute this join as binary joins, and just pick the first permutation as an example, then we first join the first two input streams, applying the same window of five minutes, and we get an intermediate result. In this case, we get three intermediate join results. As a second step, we join the third input stream with our intermediate join result, again applying the same five-minute window. And what we can see now is that we are getting two output records, not just one. The first one is new; the second one is the one that is also in the proper n-way result. And the main reason is really how we set the output timestamp, which I explained before. In the intermediate result, the first output record gets timestamp 14:06, so at this point we're losing track of its original input record 1, in the very top corner, with 14:01. And because we are losing track of this timestamp, we now join it with the third input stream and produce a new output record. What is interesting is that we know all those permutations of binary joins produce a superset of the n-way join. And if we would preserve the timestamp information, then we could also apply an additional filter after the second join to actually filter those records out. This observation actually gives us some opportunities for join optimizations. We haven't implemented anything yet, just to clarify, but I hope that we can exploit this in the future, because if you try to do re-ordering optimizations, then you can exploit those things. But it's really important to understand that because of these time semantics, such joins are not associative anymore, and if you want to re-order them, you need to play more tricks in the optimizer to make sure that you compute the right result. But again, I think this is one of those things that's so low-level, I don't think your users would know. Maybe; again, I guess it depends. If they write a join, at the moment we don't do any join re-ordering, so the user specifies the join order for us, and if they write it differently, they get a different result. The users need to be aware of this; you need to tell them, at least, even if you don't do anything. I mean, that's also gonna make it super difficult to debug your optimizer, because it's gonna depend on what events are showing up and the ordering of those events, not just what the data looks like. Yeah, yeah, absolutely. I don't envy you.
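One hypothetical way to see the non-associativity is to write the chained variant explicitly as two queries over invented streams a, b, and c; because the intermediate record only carries the maximum of its two input timestamps, the second WITHIN check is done against that, which is how extra results can slip in compared to a true three-way window over the original timestamps:

```sql
-- Step 1: binary join of streams a and b; the output record's event
-- timestamp is max(a's, b's), so the two original timestamps are lost.
CREATE STREAM a_join_b AS
  SELECT a.k AS k, a.v AS a_v, b.v AS b_v
  FROM a
  INNER JOIN b WITHIN 5 MINUTES ON a.k = b.k
  EMIT CHANGES;

-- Step 2: join the intermediate result with stream c; the 5-minute window
-- is now checked against the intermediate timestamp, not against a and b.
CREATE STREAM a_join_b_join_c AS
  SELECT ab.k AS k, ab.a_v, ab.b_v, c.v AS c_v
  FROM a_join_b ab
  INNER JOIN c WITHIN 5 MINUTES ON ab.k = c.k
  EMIT CHANGES;
```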
Yeah, I mean, with re-ordering optimizations we are still in the very early phase, but yeah, that's gonna be an exciting topic in the future for sure. Okay, so let's talk a little bit about table queries. Table queries are kind of straightforward, but also not, because you always need to keep in mind that we always consume a changelog topic and then maybe, or maybe not, materialize it into a store for processing. And so while we do have regular SQL and everything should behave like regular SQL, it's still slightly different, because, as I said, you don't really query the table; what you actually do is query the changelog stream. And because we have those persistent queries that we deploy and that operate over the Kafka topics, materializing something in an actual store on the ksqlDB server is either a necessity to compute the query or just an optimization. So the semantics are really driven a little bit by the changelog topic, but of course we try to align them with regular SQL semantics. When you have a regular filter or projection query, or even an aggregation over a table, you can think of it like this: you have this input table, the input table might be updated from the changelog, it's evolving over time, and the query executes over those changes and produces an output stream, and this output stream can again be materialized into a result table. So the result table kind of becomes a materialized view of the input table, and the query runs continuously in the background over the changelog to keep the materialized view in sync with the changes applied to the input table. And here, of course, we try to have the exact same SQL semantics, so if you write a SQL query, the result table should be the same as in a database; however, the execution is quite different, because we operate over the changelog stream. Another interesting thing is if we look into joins: because we have streams and tables, why not join a stream with a table? And the idea here is that you have a stream that you want to enrich with data that is stored in a table. If you compare it to data warehouse use cases, very often you have a fact table and then you have all your dimension tables that you join against it. So we would say: this should be a fact stream, with all your dimension tables around it. For every record we get in the stream, we look up into the table to enrich it. You can think, for example, of credit card transactions: every time somebody makes a purchase, you look up the current balance of the credit card, enrich the stream, and then later on make a decision whether to approve or decline, because the account might be overdrawn or something like that. And at the same time, keep in mind that the table is not static. The table itself is always updating in the background while we're doing those lookups into it. And here again, temporal semantics come into play. Let's assume we have a changelog stream at the very top that is updating our table, and at the bottom we have the event stream coming in that we want to join against the table. Now what we need to do is interleave processing between both, because we need to make sure that the table is in the right state when we do the join. The table should not be in the future of the stream record, but it should also not be in the past of the record.
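A sketch of such an enrichment join, with invented names for the transaction stream and the balance table:

```sql
-- Stream-table join: each transaction event is enriched with the balance
-- that is valid at the transaction's event time; the table keeps updating
-- in the background as its changelog is consumed.
CREATE STREAM enriched_transactions AS
  SELECT t.txn_id, t.card_id, t.amount, b.balance
  FROM transactions t
  LEFT JOIN card_balances b
    ON t.card_id = b.card_id
  EMIT CHANGES;
```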
So while we are interleaving here, we always look at the timestamps, and we say: either we apply an update to the table, or we take a stream record to do a lookup into the table. Because, again, going back to the credit card example: if I have two credit card transactions, and the first one looks up the balance, and the second one looks up the balance an hour later, well, I need to look up a different balance. And if we reprocess the data, I always need to look up the same balance that was valid when the credit card transaction actually happened. So we need to make sure that the table is always in the right state, and we get real temporal semantics here. So while we are processing the input stream here, and for example you see that we update the first row at the very end with A (we only show the primary keys), and we update B at some point, we always need to make sure that the lookups, based on the event times, correlate to the corresponding table version. And this temporal semantics now goes beyond stream-table joins; we also apply it to table-table joins. When we have tables and we update those tables over time, we also want to make sure that we join tables with the content that is valid at the same point in time. So if we have a table users and a table details, and we join them together and compute the result, then conceptually we want to join one table at version one with the other table at version one. In this case, if nothing changes for, say, the user table, then of course version two of the user table is the same as version one, so we can just join against that version. So we basically get a result version for each version of the input tables as they are updated, and again we need to make sure that we do this correctly with regard to those temporal properties. So it's like SQL, but always with time in mind. And again, if you run a query like this while the users and details tables are updated in the background, you would always update your profile table, the join between those two, to maintain this relationship of inputs and outputs. Okay, now I would like to switch gears a little bit and talk about the runtime and how this is actually implemented; so far this was all more at the semantic level. ksqlDB is using Kafka Streams as its runtime, so let's look at what Kafka Streams actually is. Kafka Streams is a Java client library. It's part of the Apache Kafka project, and it allows you to write data flow programs over topics, and again, the topics here are represented as KStreams and KTables, which corresponds to the stream and table abstractions we also use in ksqlDB. So the user can say, I want to read a stream, and then we have all kinds of operators like map, filter, groupBy, windowBy, aggregation, join, and they chain them together to build their data flow program. And that is exactly what ksqlDB does. We get a query, we parse the query, and then we compile it down into such a data flow program, reusing the operators from the Kafka Streams runtime, to actually deploy and execute the query. And the scalability and fault tolerance I mentioned before actually come from Kafka Streams. Kafka Streams has those things built in, and they can equally be realized just with Kafka Streams to get those properties. So first and foremost, I always talk about persistent queries here.
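Before moving on to the runtime, a quick sketch of such a table-table join (names invented); the output is itself a continuously maintained table:

```sql
-- Table-table join: the result table is a continuously maintained join of
-- the two input tables; conceptually, version N of the inputs produces
-- version N of the output.
CREATE TABLE user_profiles AS
  SELECT u.user_id, u.name, d.address
  FROM users u
  INNER JOIN user_details d
    ON u.user_id = d.user_id
  EMIT CHANGES;
```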
So as I said, in Kafka Streams we have the DSL program, and we compile it down into a data flow program that we call a topology. We have the operators represented as nodes and the data flow as edges. And when we have this program, we can execute it, and now the parallelism comes into play. As I mentioned in the very beginning, Kafka topics are already partitioned, so we inherit this partitioning of the Kafka topics in our runtime. When we have a data flow program, we can instantiate it multiple times and then apply each instance to process the data of a single partition. And of course, very often we need to take care of things like data co-partitioning and co-location, but Kafka Streams takes care of those things. And then we have those tasks, which are our units of parallelism that can be executed in parallel. What it basically means is: every time we get an input record in a topic, we consume it, we pipe it through the data flow program, and then we append the output records to the corresponding output topic. To go into a little bit more detail, and this is something I mentioned before already, internally Kafka Streams uses the Kafka consumer and the Kafka producer client libraries that also ship with Apache Kafka. So it's just a higher-level abstraction. And for state stores, to store state or tables, we use RocksDB internally. It's an embedded key-value store; it's from Facebook, and it's written in C++. There might even have been a talk about that here, I'm not sure. We had Rockset, not RocksDB. Okay, yeah, okay. But they're also built on RocksDB. Yeah, of course. Yeah, exactly. And an important part here is that Kafka Streams inherits a couple of properties from the underlying clients, especially the consumer clients. The consumer clients have a feature called a consumer group. You can deploy multiple of those consumers at the same time, they become aware of each other, and with that they can do load balancing and also give you fault tolerance. Those consumers need to heartbeat on a regular basis, and if a consumer dies, or if a new consumer joins the group, then the assignment of the input topic partitions to the corresponding consumers is recomputed. And we always ensure that every partition is assigned to exactly one consumer, and that allows us to process stuff in parallel. The other important part is that, as you can see here, we create four tasks, and we create four tasks because we have four partitions. So we always have this one-to-one mapping: a partition maps to a task. And the state that we maintain for a table is also sharded. So here we see four state stores; each state store will be its own RocksDB instance that is associated with its task. And when we need to move tasks around for scaling or fault tolerance, we have those units of parallelism and the corresponding shard of state that we can move around together with the input partition. And here we also see that we're running with two threads, and each thread can execute one or multiple tasks. That is our whole execution and parallelism model. What is also important is, as I mentioned: let's say you do an aggregation, and your data is partitioned by a certain attribute in your schema, but you want to aggregate by something else.
In this case, you need to repartition the data, because when we do the aggregation, we want to compute it locally in a single task. So we need to make sure that all records with the same grouping attribute land in the same task, which means we must write them into the same partition. For this, we actually split topologies into sub-topologies, and those sub-topologies are connected via topics. They don't interact with each other directly; they only exchange data, again, via Kafka topics. So you would have an upstream sub-topology that maybe applies a filter and removes some unnecessary fields, and then uses the grouping attribute to redistribute the data and write it into an intermediate topic. And then the downstream sub-topology can just consume one partition, and it knows that all records with the same grouping attribute are in this partition, so it can compute the aggregation for this particular grouping attribute, or group, locally and update its corresponding state. And that has very nice properties, because those intermediate topics that we use for repartitioning are basically infinite buffers. So we don't have to handle problems like backpressure or stuff like that. If something is slower downstream, upstream can still process and just write into the Kafka topic, Kafka will buffer everything, and later on we can catch up. Another important thing is what happens when we update state. What we always do is create a dedicated topic in the Kafka cluster and write all the changes of the state into this topic's partitions. And that makes the state that we maintain in the ksqlDB server kind of ephemeral, because we can always recreate it from the corresponding changelog topic that is stored reliably in the Kafka cluster, where it's replicated and highly available. So here we are playing with the idea of the stream-table duality again: we do those local updates in RocksDB, but we also push everything out into the changelog topic. And if we need to recreate state, we reread the changelog topic, which is hopefully aggressively compacted, so we don't have to replay too much data for recovery. So this is the high-level architecture here. And then, of course, output data is also written into output topics. Having this fault tolerance mechanism with changelog topics gives us a couple of other things we can actually do with it. If we say we execute a part of a query on machine one, and machine one fails, and we need to migrate the state to machine seven and reread the changelog topics, that could take a lot of time, so we would not be highly available. But what we can actually do is deploy a hot standby on machine seven. While the query is executing on machine one and we are writing the changes into the changelog topic, we can eagerly read the changelog topic on machine seven and pre-populate a copy of the state on the side. So now when machine one fails, you can immediately fail over to machine seven without any downtime, because machine seven already has a copy of the state. So this is a very, very quick failover mechanism that we have, and we can just build it on top of Kafka. Besides that, having such standbys also helps us because, as I said in the beginning, we also allow people to do those regular queries, or state lookup queries, into the state.
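A sketch of a query that would trigger exactly this repartitioning, assuming the clickstream from before, keyed by user_id but grouped by url (the internal repartition topic is created by the engine and is not visible in the SQL):

```sql
-- The clickstream topic is keyed by user_id, but we group by url.
-- Because the grouping attribute differs from the partitioning key, the
-- engine inserts an internal repartition topic keyed by url between the
-- upstream sub-topology (filter/projection) and the downstream aggregation.
CREATE TABLE clicks_per_url AS
  SELECT url, COUNT(*) AS click_count
  FROM clickstream
  GROUP BY url
  EMIT CHANGES;
```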
So if we did not have hot standbys, and somebody says, give me the current value, and they want to go to machine one, but machine one is down, and machine seven takes like a minute to recover the state, or maybe even five minutes or longer, then they cannot issue the queries: the system becomes unavailable. But having those hot standbys, we can allow querying either machine one or machine seven. We can also load-balance those queries. And if people are okay with it, we can also allow them to query slightly stale state. If I say, hey, machine one has too much load, but a couple of queries are okay with reading state that is, say, a second old (because of course machine seven will always lag a little bit behind machine one, which is actually executing the query), then we can still allow people to query machine seven, and it becomes a more scalable and available system. So that is hot standbys and fault tolerance, and it's basically the same mechanism. And we also use the same mechanism to do dynamic scaling. We can say, well, because we pre-sharded our state already with those tasks, even if a single thread is executing all our tasks, when we add new instances it's easy to migrate the corresponding state, because it's all pre-partitioned anyway, by replaying the changelog topic. Of course, one challenge here, especially in the scale-out case, is that if we add a new machine, the new machine will first of all have no local state: it has an empty disk. So what we actually do is a two-phase scale-out process. When you add a new server, before it takes over any query load, we only deploy some of those hot standby tasks on it, and they reread the changelog topics to warm up the local state. Only after that do we do a second phase of rebalancing, where we actually migrate the query load over, so the query can keep executing without downtime. For the scale-in case, of course, we don't have this problem if we have hot standbys anyway, because those standbys exist, so we can fail over immediately. But overall, this architecture makes the whole system elastic, and we can dynamically add and remove things. So hot standbys, scaling, and fault tolerance are all the same mechanism; it's all built on this idea of changelog topics that you just deploy and exploit in different scenarios. Okay, then, wrapping up, I would like to point out a couple of challenges we have and also some work in progress. I want to highlight three things that we are looking into. First is streaming SQL by itself. We would like to enrich our SQL dialect to be more concise and more powerful. There are a couple of things we cannot really express at the moment that we would like to tackle. We also would like to make improvements to time and operator semantics; there are still a couple of gaps where we cannot express this or that. So there's still a lot of work to do, and we hope to push out a lot of changes in 2021 on this front. Then we're also working on consistency guarantees, because ksqlDB is inherently an asynchronous system, as we have seen. We have those changelog topics, and even if you do a write into a Kafka topic, there's a delay until the query reads it. And this goes back to your question about updates or inserts: if you do an insert in ksqlDB, for consistency reasons, or correctness reasons, the insert would always be an append to the topic first.
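For example, a minimal sketch of such an insert (stream and values invented); the point is that the row is first appended to the underlying topic and only later shows up in any state materialized from it:

```sql
-- An insert is first appended to the stream's underlying Kafka topic;
-- a table materialized from it only reflects the row once the persistent
-- query has consumed that append, so there is no read-after-write guarantee.
INSERT INTO clickstream (user_id, url, status)
  VALUES ('alice', 'https://example.com/start', 200);
```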
So if you do an insert into a table, we actually write into the topic, and then we need to read it back into the materialized state. So we don't really have read-after-write guarantees here. And we have ideas to use, for example, vector clocks to give more guarantees for those things, where you could say: you do an update on a table, and then we can reason, based on some vector clocks that we of course hide from you internally, and give you a guarantee that we only answer your query after your update was applied, and things like that. So there are a lot of interesting challenges here around using vector clocks in those distributed systems. And then, of course, we also would like to improve our transient query support. At the moment we never really focused on this, but we see some interest from people who want to build applications like that, because a classic ksqlDB application was always: you have your upstream producers, you have ksqlDB in between, and then your consumers read from the Kafka topics directly. But now we're seeing more people saying, no, I don't want to talk to Kafka at all, I just want to interact with ksqlDB. For this case we have those push and pull queries I mentioned already, but they are not really built for production workloads at the moment, or they are limited, let's put it that way. So there are a lot of challenges to make those queries scalable and also more powerful. And then also, as I just mentioned, insert, update, delete support and stuff like that: yeah, we have it, but it's maybe not what people expect. And so we're still collecting a lot of input from users to learn how they would actually like to build an application against a ksqlDB system, because it's so different from a relational database. And we also don't want to replace relational databases, so sometimes we also say: no, that's not a use case we will cover, just use your Postgres or MySQL server, or whatever you want. But there are a lot of use cases where ksqlDB is actually good enough, you know, sufficient, and then it's kind of: why do I need to export all the data into a relational database system in addition, just to build my application? And so there are a lot of use cases that we think we can serve here. And yeah, we're still figuring out how far we can push those things to become a real stream-relational system, because at the moment the relational part is there, but, you know, it's not a database. Let's put it that way. And then, as future work, what I personally have in mind, and hope we can do at some point, is look into query optimization. That is of course very interesting. As I said, currently we have a very simple rule-based optimizer that does some filter pushdown. We apply a couple of physical optimizations; for example, if we need to repartition data twice, we don't create two repartition topics, we only do it once. That was not there in very old versions; we would naively create multiple of those repartition topics. Then, when we have an output table, for example: as I said, for table state we always create a changelog topic, but if we write the changes also to an output topic, we get two topics. So we can merge those topics together and use the output topic for fault tolerance and recovery and things like that. We do this already for input topics; for output topics it's not implemented yet.
And then, of course, rule-based optimization is nice, but could we actually have a streaming cost model and a cost-based optimizer? That might be super interesting. A big challenge here is that we don't have any statistics on our data at all, because most topics are populated by upstream applications. Kafka itself does not collect any statistics, and it also cannot really, because it doesn't know anything about the schema of the data. So there are a lot of challenges in how we could do this. And then, because we have long-running queries, adaptive re-optimization at runtime is going to be a big part at some point, where, when we deploy multiple queries, we would also like to merge and split queries dynamically and stuff like that, to share load and to share state. So there's a lot to be done in the future. Also runtime improvements; there's a lot to do here. At the moment we support multiple data formats like JSON and Avro, and whatever the input topics use, we basically also use for all our intermediate state, and then we need to serialize and deserialize and all those kinds of things. So the idea would be that maybe we only want JSON and Avro on the input and output topics, and then we translate into a more efficient internal data format that we use in our changelog topics, repartition topics, and RocksDB instances, to be much more efficient. There are also a lot of interesting questions about how we could do this. And then the task assignment problem: because tasks in our case are stateful, we have the problem of preserving load balancing, but at the same time, with the standby tasks and stuff like that, we also want to be sticky. If you have a standby, we don't want to assign this task somewhere else on failover; we want to reuse the state. So we have to balance load balancing and stickiness, and maybe even say: we have a current deployment but we know it's unbalanced, so we want to lazily migrate state in the background in order to get to a balanced assignment later on. And so this task assignment problem is a big one as well. And then for SQL, looking into subqueries might also be very interesting, because we have this abstraction of streams and tables, and it's a little bit unclear how subqueries should actually execute and what their semantics would be, because in the SQL standard those subqueries are well defined over tables, but now we have this stream thing. So it's a little bit unclear how we should handle those things. Okay, with this, a couple of references. If you're interested, there are two papers that we published, one at EDBT and one at a virtual workshop. More books and papers about Kafka can be found on the Apache Kafka website. And of course I want to invite you to check out ksqlDB at ksqlDB.io. And one last sentence: if this is interesting to you, we are hiring, not just this year, hopefully also next year, for the ksqlDB and Kafka teams. And I'm happy to take any questions you might have now. Okay, awesome, I'll applaud on behalf of everyone else. All right, we'll have time for one or two questions from the audience, and then everyone else can ask theirs afterwards. Okay, so the Kafka Streams stuff, that's actually doing some computation, right? Or rather, that is all the computation, but are there certain things that you want to express in ksqlDB or KSQL that cannot be expressed in Kafka Streams, so you run them as, like, lambdas or whatever, outside of Kafka Streams? Is that correct?
Well, I mean, Kafka Streams is always the runtime, and at the moment we have not found anything in ksqlDB that we cannot do. But there's always the possibility in Kafka Streams to actually implement custom operators. Okay. So if that ever occurs, then ksqlDB would implement its custom operator, plug it into the data flow, and be happy. So that's certainly possible. And then you guys don't support indexes because... Not at the moment. That's also something we would like to look into, especially for our pull query support, where you say, well, I have this table; at the moment RocksDB is indexing it on the primary key, obviously, so lookups are efficient, but if people start to query this by other attributes, then yeah, maybe we want to build indexes. But then again, the problem that everything is asynchronous comes into play: how do we actually do this? I mean, you can almost fake it now, because if you make a, whatever, a queryable table that's keyed by whatever you want to join on, right, then you'd be able to query it that way. Yeah, sure, absolutely. But at the moment, we limit those queries to primary key lookups, because that's the only thing we can execute efficiently. The next step would be to say, okay, if you don't have an index, we just do a primitive RocksDB full scan; that's gonna be horribly slow, but at least we can execute it. And then the step after that would be: oh, actually we can also allow you a CREATE INDEX statement, and then we would repartition the data, materialize it a second time, and stuff like that. But then the main challenge is really: how do we keep the index and the table in sync? Yep. And then DDL changes, is that an even bigger nightmare in your world? Presumably you don't support add/drop column, right? Is that right, you don't support add/drop column, like ALTER TABLE? No, no, not at the moment. That's also something we would like to do. What we actually support is an update statement, CREATE OR REPLACE, basically, for a table, that allows certain changes to the underlying query that is populating the result stream or table. But that is of course also quite difficult: how do you do this in a backward-compatible manner? I mean, changing a filter or something like that is always easy, but if you start to change the columns you have, or stuff like that, it's actually pretty complicated. My last question is, what was the starting point for your SQL dialect? Did you take the Postgres grammar file, or did you guys write one from scratch? No, we started with Presto. Presto, okay, sure, sure, sure. And then we just modified it accordingly. Of course, yeah, okay. Okay, awesome. Well, thank you so much for doing this. This was super enlightening, and it's definitely a different way to think about databases than what the students are hopefully used to from our courses. Yeah, it's a different beast, especially with all those temporal semantics that are really giving us a hard time, and everything is distributed and asynchronous, which makes it super interesting to work on. Yes. And so let's see how far we can push this, but it's for sure a different beast. Yes, awesome, thank you.