The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Hi guys, welcome to the last talk in our Vaccination Database seminar series. We're very, very excited to finish off the semester with Benjamin Wagner. He is the lead of the query processing team at Firebolt, and an alumnus of Thomas Neumann's famous database group at TU Munich, which is where he's currently based. As always, if you have any questions for Benjamin as he gives his talk, please unmute yourself and say who you are, and feel free to ask questions at any time; we want to keep this interactive. With that, Benjamin, thank you so much for being here. The shirt is awesome. The floor is yours. Go for it.

Perfect. Andy, we'll get you a shirt. So yeah, thank you so much for having us, and thanks for the intro. We're super excited to be here and finally be able to drill down and go in depth on how we're building Firebolt. I'm Benjamin, team lead for the query processing team. As Andy said, if any questions come up, please interrupt me; I'm sure Andy will. Feel free to join in.

Good. Before we get started, I want to give you a quick outline of the overall talk and get you ready for what to expect. When you walk out of this talk, I want you all to know how we think data warehousing is changing, and, if you decided to build a cloud data warehouse today, which architectural choices you would have to make, which decision points you would hit along the way, which choices are interesting, and where you can go in general. So without further ado, let's jump right in.

Before we move ahead and actually talk about Firebolt, I want to take a step back and say that it's been an amazing decade for cloud analytics as a whole. Here we see four of the biggest cloud data platforms out there right now. All of them came out in the last 10 to 12 years and really changed the landscape, innovated a lot, and pushed the space to where it is today. At the bottom left we have Redshift, which was of course one of the first database-as-a-service offerings on AWS, the biggest cloud vendor. Databricks really showed the world how distributed, high-performance data crunching is possible. Snowflake innovated a lot when it comes to decoupled storage and compute, and to isolation of workloads through different warehouses. And finally, Google BigQuery changed the game when it comes to the pricing model: they abstract away from the actual resources and give you a more serverless type of offering.

Thanks to all of these great products, cloud data warehouses have become really commonplace across a massive variety of workflows. This includes things like ELT, ETL, ML, ad hoc queries, BI, reporting, and so on. One of the amazing things is that over the past years there was this push back towards SQL, which is amazing. Now you have this super wide ecosystem around SQL as a common language: ELT, ETL, and transformation tools like Fivetran and dbt speak SQL, and at the same time, visualization tools like Tableau, Superset, or Looker also speak SQL.
Going back, this allows the data warehouse to be used in tons of different contexts across a wide variety of workflows. What we're seeing at Firebolt, though, and this is where we start getting into the details of the talk, is that more and more companies are collecting massive amounts of data and increasingly want to build a new type of workload: customer-facing data applications.

What does that mean? Since I'm giving you this term, I want to give you an example. Say (I hope some of you are gamers) you build an RPG, an online role-playing game, and you have players running around your map, interacting, trading, talking to each other. There's a good chance you're generating massive amounts of data, different types of events, and ingesting them into some data pipeline, landing it in S3, a data warehouse, or something custom-built. And then you actually want to use all this data and re-expose it back to your players. You might want to show them a dashboard saying: with these types of items, this is how your trades improve; this is where you're doing especially well on the map; these are the players you're interacting with the most. If you're the gaming company building this RPG, this is super important for you: it drives revenue directly, because it helps you retain players and lets players dig into their data. All of you who have played games before probably know that this is actually quite fun to do.

These types of workloads, where you collect insane amounts of data and then want to re-expose it back to your customers, are gaining more and more popularity across a wide variety of sectors. It's not just gaming; there's also ad tech, fintech, and so on. Currently, these companies very often build their own data pipelines to run these workloads, because they have pretty hard workload requirements. But going back to the second bullet point here, our claim is: wouldn't it be amazing to also be able to run these customer-facing data applications on top of your data warehouse? Why? Well, you don't need to maintain two data pipelines, and there's a good chance you're backing up all your data into your warehouse anyway. And having this wide ecosystem, with SQL as a language that a lot of people can write, is super useful.

However, these workloads are pretty challenging for data warehouses at the moment. They require low latency: you don't want your users to wait three seconds for their dashboard to load. They require predictable latency: in our RPG example, you don't want the dashboard loading in 200 milliseconds at night and then in two seconds during peak hours. And tying into this, you also have high QPS (queries per second) and in general high concurrency, which is also pretty challenging in this setting, because you might have thousands of players looking at dashboards at the same time. All of this makes it a very challenging workload, and that's where Firebolt comes in. That's the workload we're currently trying to excel at: really being able to serve these low-latency, data-intensive applications.
While at the same time also solving the more traditional types of workloads: ELT, ETL, batch, and so on.

Before we drill into technical details (the RPG example is nice because it's very visual), I want to give you a real-world example: SimilarWeb, one of our customers. They are a market intelligence company collecting information about 100 million websites, and in total they have more than 200 terabytes stored in Firebolt. What they're doing is using this data to expose it back to their users and give them analytics interfaces into things like audience loyalty and so on. So this is an ad tech, real-world example which I think illustrates this nicely. And by using a data warehouse for these kinds of things, building new use cases, new visualizations, and so on becomes actually quite simple.

Good, let's start digging in. This is probably where things get more interesting for most of you. This is Firebolt at 10,000 feet, and over the course of the talk we'll drill down into each of these layers. At the very top, you have the REST APIs, the JDBC connector, data science connectors, BI tools, and so on, and these all talk SQL to a service endpoint we're providing. In a cloud data warehouse, you then have the service layer, containing things like administration, security, billing, metadata, and so on. Below that, you have the actual compute part of the system, and this is where things get really interesting for most of you, probably.

Firebolt is built on decoupled storage and compute. This means that all our data is stored durably on AWS S3 in our custom file format called F3 (pronounced "Triple F"). We have tons of sparse indexes, materialized views, etc., which are also stored there. Our units of computation are so-called engines. An engine is a cluster of between 1 and 128 nodes; it serves as a SQL endpoint and can execute a single query across the whole cluster, so one query can run on all nodes within an engine. What's nice is that these engines provide resource isolation. You might have one engine running for ingestion, which reads external data from Parquet or JSON, something like that, and writes it to S3; the other engines will then start seeing these files, loading them into their caches, and be able to run queries on them. By running, for example, one engine for reporting and one for customer-facing dashboards, you can make sure that a data engineer running a massive batch job is not going to slow down your customer-facing dashboards.

Good. So let's now drill down into each of these layers. If you're building a modern cloud data warehouse, you need to start off with a high-performance engine as a baseline. For us, the important thing was that we wanted to build on a modern, distributed, high-performance engine, and in this world you really have two choices: either you build from scratch, which is for example what Snowflake did, or you decide to fork your runtime from an open source engine. That is indeed what we did: our baseline for the runtime is ClickHouse. At this point, I just want to give a huge shout-out to all the open source contributors, the people working at Yandex, or now ClickHouse Inc., building the system and making it available to the public. For us, this choice was really a no-brainer.
ClickHouse is arguably the fastest distributed open source system out there, and the overall traction it has in the open source community around the globe, also in China for example, is a huge testament to that. Let me give you a quick ClickHouse 101 before we talk about how we're moving ahead with it as a baseline.

So this is a hard fork, right? This is not you trying to maintain your own distribution?

We'll get more into this, but yes, this is a hard fork, and we'll show you where we really moved away.

ClickHouse is an open source, high-performance OLAP database system which was originally developed by Yandex and has now spun out as ClickHouse Inc., so congrats on that. It's a vectorized column store with distributed processing capabilities. Those of you who are into ClickHouse will know it's not just vectorized: it also has some elements of just-in-time compilation sprinkled in, for example for expression evaluation, certain types of aggregations, and so on. Exactly. And distributed processing capabilities are a must in the space we're in, because if you have dozens or hundreds of terabytes of data, just scaling up on a single node simply doesn't cut it.

ClickHouse is a great starting point as a high-performance runtime, and ultimately for Firebolt, that's what ClickHouse is: our high-performance runtime. Sure, that's an important part of a cloud data warehouse, but you also have to solve a lot of other challenges: decoupled storage and compute, a metadata layer, query planning, support for very complex data warehousing queries; you have to build a service layer and an orchestration layer to make it a real cloud service; and then you have to integrate with the ecosystem. All of these are things we're working on at the moment at Firebolt, and in the remainder of the talk I want to go in depth on all of them and show you how we're building them.

Let's get started with decoupled storage and compute. Going back to our earlier architecture diagram, we've marked in red which parts these are: on the one hand the actual storage layer in S3, so our file format, indexes, etc., and on the other hand the local caches on the different engines, because you of course don't want to serve low-latency queries straight from S3.

Before we drill into our decoupled storage and compute layer, I want to quickly talk about ClickHouse's MergeTree, which is the backing storage engine within ClickHouse and actually the most stable one used there. Good. In this example, we're creating a table customers with four columns: a date, a user ID, a user balance, and a string column for the browser and browser version. The MergeTree then allows you to do two things: (a) it allows you to partition, in our case by month, and (b) it has this concept of a primary index, in this case the ID, and we'll get to that in a second. Now we're ingesting the rows here on the left; these are from March, and one is from April. ClickHouse's MergeTree will sort and partition this data, shown on the right, where we can for example see one file within the backing March partition; the April row is of course not part of it.
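As a concrete reference point, here is a minimal sketch of what such a table looks like in ClickHouse's own DDL; the column names and types are illustrative, mirroring the example above rather than Firebolt's actual syntax:

```sql
-- Minimal ClickHouse MergeTree sketch of the example table.
-- PARTITION BY creates one partition per month of `date`;
-- ORDER BY defines the sparse primary index files are sorted by.
CREATE TABLE customers
(
    date    Date,
    id      UInt64,
    balance Float64,
    browser String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY id;
```

With a schema like this, the March and April rows from the example land in separate partitions, and each file within a partition is kept sorted by id.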
Within this file of the March partition, we can see that the file is internally sorted by the ID, so by this primary index; this can also be multiple columns, expressions, and so on. Why is it called MergeTree? Because one partition consists of multiple files, and ClickHouse in the background, a bit similar to an LSM tree, compacts these over time and builds larger and larger sorted runs of data.

The second thing ClickHouse's MergeTree has, which is amazing, is support for sparse indexing. Let's take our one file on the left again and see how ClickHouse builds sparse indexes on top of it. ClickHouse splits the file internally into smaller blocks, which are also called ranges; in our case, these are blocks of two rows. For each of these blocks or ranges, you can then build one granule of your sparse index. For example, and this is maybe a well-known one, you can have a small materialized aggregate, a min-max index, on balance. If we now have a query with a predicate like balance < 250, we would be able to exclude the second range in this case. You can also build bloom filters, which are great for high-cardinality string columns, if you for example have an equality lookup on the browser and version. Or you can build a sparse primary index, which uses the ordering of the data to provide pruning. In this case, the sparse primary index on ID would say: the first row in the first block has value 12, the first row in the second block has 56, and the first row in the third block has 171. If we then have a range predicate on ID, we can also use this for pruning.

Exactly, and our file format F3 is actually based on ClickHouse's MergeTree. However, the MergeTree was originally built to store data durably on SSDs or HDDs, so on locally attached disks. We really ripped out large parts of the storage layer and made sure that data is stored durably on S3 instead, and then we built a multi-tier buffer manager on top of that to retain data locality. Multi-tier really means that the storage hierarchy in the cloud is getting insanely deep. If you're building a single-node database, you already have a pretty deep hierarchy: CPU registers and caches at the top, main memory, and then local disks, so SSDs or maybe rotating hard drives. In the cloud, you have on top of that something like EFS, a distributed file system (we're running exclusively on AWS, which is why I'll talk about their offerings for now), and then of course S3 at the very bottom. As I said before, you really don't always want to go to S3: in production, we're pretty often seeing S3 access latencies of a few dozen milliseconds, and if you want to build interactive dashboards, that's just unacceptable. So our buffer manager uses both local SSDs and EFS, and then S3, to keep data as close to the actual compute as possible.

Another thing that's important here is that our buffer manager doesn't really operate at file granularity. We said before that ClickHouse's MergeTree builds these files and internally splits them into ranges, and these ranges are actually the granularity of our buffer manager.
This means that we have a buffer manager which works on offsets within files and is able to cache only subsets of certain files. For us, these blocks, the minimal ranges we're able to access, usually contain a few ten thousand rows, so this is really extremely fine-granular. For each of these blocks of, say, 10,000, 20,000, or 30,000 rows, we'll store one of these index granules. At first glance this might seem excessive, and it's definitely a lot more fine-granular than what some other players are doing. However, we're seeing that it works well and that our index sizes don't become excessive: even with a granule every 10,000 rows, the index overhead is usually in the order of 0.1% or even less. So even if you have terabytes of data, your indexes will only be a few gigabytes, and you can of course split these across the nodes of the cluster.

On the point about the buffer manager operating on blocks or ranges: you're basically saying that you're tracking something like, I have this range and that range cached; but of course each range maps to a file anyway?

Yes, each range maps to a backing file, that's correct. So the buffer manager is, in a sense, layered over these files. I have a visualization in a second, so we'll dig into an example then.

Good. Our decoupled storage and compute layer is of course also heavily tied to our central metadata service, which we'll go more in depth on in a second. This is super important for transactional consistency and for making updates and inserts visible across the different engines.

Sorry, is it okay to ask a question right now? This is Steven, and I have a question regarding your metadata service. What is the system that stores your metadata? We know Snowflake uses FoundationDB; for you, how do you store your metadata?

I'll answer that question later, when I give you the rundown of the whole stack. Perfect. So let's keep that for later.

On top of that, we're also extending the file format, working on new index types and compression codecs, and here we really want to stay up to date with recent research. For example, we're currently investigating FSST, published by Boncz, Neumann, and Leis at VLDB 2020, as a compression codec for strings. And we're also implementing the Cuckoo Index by Kipf et al., also published at VLDB 2020. The Cuckoo Index is pretty neat for cloud data warehouses, especially if you have many of these ranges, because for medium-cardinality columns it saves a lot of space: instead of building, for example, a bloom filter on every block, it builds one cuckoo hash table and maps keys to fingerprints and bit vectors over the blocks. I encourage everyone to look at that paper, because it's a very nice read.

There's another byte-oriented encoding scheme you should look at as well. Nice. Awesome, thank you.

Good. So let's get to the example. Before that, I have a question. Sure, unmute yourself. I think you cut out. Okay, can you hear me now? So: who is managing the cache in the clusters?
And the second piece of that: is it a read cache, or is it a read-write cache?

Right, great questions. Who is managing the cache? Well, we are, and we're trying to manage it in a smart way so we can keep hot data in the local caches; we'll get to this in a second. But the user, if that's what you're asking, has no influence on the cache; it's completely transparent. On the second question, whether it's a read cache or a read-write cache: mostly it's a read cache, because if you're actually updating or writing data, you always have to write through directly to S3 in order to make the changes visible to the other engines. I hope that answers the question; otherwise we can of course drill down later in the Q&A.

Okay, but that will slow down your inserts, particularly because you have to wait until the object store acknowledges that it got the data.

Yes, definitely. But otherwise it's going to be extremely hard to be consistent, because if a node crashes before you wrote things to S3, the data is actually going to be lost; it's ephemeral.

Yeah, and that's because you're not doing any logging, right?

Right. That's another big topic; if we want to have a longer discussion, let's move it towards the end of the talk, because I would of course be happy to talk about that.

Perfect. So here in our storage layer on S3, we're storing things in this F3 format. We can see that we have two files, which are internally divided into five ranges each, and each file has sparse indexes attached, so these are really at file granularity. When we start up the engine, we load all these sparse indexes into our local SSD caches. This is important because access to the indexes has to be fast: they are later used for pruning. Now a SQL query enters the system. The query is parsed, turned into a logical query plan, optimized, and turned into a physical plan. Once we have the physical plan, we have our table scan nodes with attached predicates, and we can really start asking the question: which data do I need? So what do we do? We look up our sparse indexes based on these predicates, and in this case identify that from file A we want to read range A.2, and from file B we want to read range B.4. Pretty often in production, one of the most useful indexes for us is actually this sparse primary index using the ordering of the data. Now we can load these ranges into our local SSD caches and actually execute the query; I'll talk more about this in a second, bear with me for a moment. And if the query gets sent in again and we didn't evict anything from our cache, we'll do this pruning run with the sparse indexes again next time, see that the data is already cached locally, and then just execute on these ranges directly.

This is actually a bit of a simplified view, and it works nicely for a single table. But it's not a two-step process where we first load all the data into our cache and then do the execution; that would be wasteful. Say you have something like SELECT * FROM huge_table LIMIT 100: you actually want to do this in a pipelined fashion. As we're executing the table scan, we're reading ranges and feeding them directly into the execution pipeline, which in many cases lets us terminate early. Good.
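As an aside before the next question: the sparse index types from earlier have close open source analogues; ClickHouse calls them data-skipping indexes. Here is a hedged sketch on the illustrative customers table from before (this is ClickHouse-style DDL, not Firebolt's, and the granularity settings are arbitrary):

```sql
-- Data-skipping indexes in ClickHouse-style DDL.
-- A min-max granule per block of rows enables pruning on balance;
-- a bloom filter (1% false-positive rate) supports equality
-- lookups on the high-cardinality browser string column.
ALTER TABLE customers ADD INDEX balance_minmax balance TYPE minmax GRANULARITY 1;
ALTER TABLE customers ADD INDEX browser_bloom browser TYPE bloom_filter(0.01) GRANULARITY 1;

-- A selective predicate can now skip whole ranges without reading them:
SELECT count() FROM customers WHERE balance < 250;
-- Recent ClickHouse versions can show the pruning with:
--   EXPLAIN indexes = 1 SELECT count() FROM customers WHERE balance < 250;
```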
Did this answer your earlier question about how the buffer manager doesn't work on full files but only fetches ranges? Maybe to give a bit more context: in a sense, a range becomes your minimal accessible block. These ranges are usually heavily compressed, and they have to be loaded as a whole, because if you don't have a random-access compression format, you need to decompress the whole range in order to read it.

Perfect. So what's the size of the blocks that you bring in?

As I said before, one of these ranges will usually have a few ten thousand rows, and it's a column store, so you only read the columns of the ranges that you actually need. Say you have eight-byte integers: in that case a range is a few hundred kilobytes or so, but these will usually be heavily compressed as well, if possible, so the actual S3 reads for a certain range will be smaller. Very often you actually access consecutive ranges, because for example you're interested in a certain interval based on your sort key, and then you can do larger reads on S3. But in principle, if you have queries accessing tiny amounts of data, you can narrow this down very aggressively. This ties back into what I said earlier about how we want to be very, very good at these low-latency workloads, and it's a hard requirement: if you have to scan excessive amounts of data, no matter how fast your engine is, you're going to hit limits at some point, network throughput and so on, in how fast you can execute the query.

And this layer at the bottom of all this: is this ClickHouse, or is this something you guys have added?

This is built completely on top of it, yes. This multi-tier buffer manager using the different layers is entirely built by Firebolt.

I have a question in terms of MVCC. We also didn't really talk about whether multiple clusters can write into the storage, but I'm very interested in what happens in terms of MVCC when the data on S3 changes; how does that impact your caches?

Right, that's a great question. In general, what might happen is that you start getting cache misses. But if you interface with your metadata layer in a smart way and build something like snapshot isolation into this (and for context, we're not 100% there yet, but we're working very hard on it), then, to go a bit more in depth on the metadata layer: you would get your transaction timestamp, fetch from metadata the files you need to access, and push these down into your table scans. And if new files appeared in the meantime, written by some other engine doing ingestion, you would just fetch them from S3. Does that answer the question?

Yes, thank you.

Perfect. Good. So let's go on and drill down into the metadata layer; I love that there are so many questions. Metadata is here at the top in this diagram, but in our opinion, metadata might actually be the most important thing to get right in a cloud data warehouse. First off, what is metadata? Really, everything that is not table data, so not the data stored by the user in the warehouse, is metadata. This includes SQL objects like tables, views, and so on.
It includes things like user accounts, security policies, and roles, which are important for user management and access control. It's also responsible for transaction management: it holds the information on which files exist and which versions are currently active, for example. All of this has to be handled by the metadata service. At the same time, some of the hard challenges you're seeing here are that there's not just this wide variety of data, but also a massive variety of use cases. For every query, there are calls to metadata, fetching a timestamp or reading which files to access. At the same time, you might have, for example, a billing service which wants to subscribe asynchronously to certain events happening in the metadata service. And on top of that, you might store something like optimizer statistics in there to be able to do proper cardinality estimation. All of these things together give you this massive variety of use cases, and also of data structures, which all have to be supported by this layer.

Our current metadata layer is built on FoundationDB in most parts. The lesson learned here is that building this is actually extremely hard, and it's going to be a huge engineering focus for us in 2022. You also have things like multi-tenancy and isolation which have to be supported: if one user starts sending tons of queries, and you have this one global or regional metadata service, you don't want queries from other users to be slowed down in their metadata requests. So you have to be able to do workload isolation here, and so on. These are actually hard questions; maybe next year we can come back and talk just about these challenges.

The language you're using kind of sounds like you're using FoundationDB now, but it's a temporary solution and you're going to replace it, right?

So, we're using FoundationDB now, and I think for a while it's going to stay that way. But definitely, no one said you have to use FoundationDB for these types of workloads, so we're always investigating which other options there might be. I'm not the metadata expert at Firebolt, though, so I won't be able to go super in depth on very detailed questions about that.

Again, it just sounded like it was a temporary solution; I wanted to understand what you're actually trying to say. That clarifies it, thank you.

I have a follow-up question on the metadata. I don't know if you talked about this, but I think in Firebolt your customers bring their AWS credentials in, so the processing is actually using the customer's AWS account. I want to understand, for the metadata: is it the same? Do you launch the metadata layer in your customer's account, or is it multi-tenant in your own AWS account?

Okay, so there are two things here. The actual compute of the engines is also running in our own AWS account; these are our clusters, in a sense. However, to be able to ingest data from S3, we of course need to be able to access the data stored in customer buckets. That is the only part where we interface with the outside world; all the ingestion computation is done within Firebolt.

Got it, thank you. Perfect.
So maybe the second part of the question is: is it a multi-tenant metadata service, or is there one FoundationDB per customer?

Great question, and this ties into the isolation things I talked about. It's multi-tenant, and that is one of the things that makes it harder, because you have to make sure that different workloads coming from different customers don't interfere with each other and slow each other down.

Thank you. Sure.

There's also a question from Karen: she wants to know why you're deciding to move off FoundationDB, or considering moving off FoundationDB, for your metadata layer.

Nothing is decided in this case. FoundationDB is a great key-value store; it's used very successfully for these types of metadata services. It's available, it's distributed, it's super scalable. So we're definitely happy with FoundationDB; I didn't want to say we're definitely moving off it, not at all. It's just that if you're in this space, I think you have to be open to considering alternatives at any time and seeing what's out there, and that's something we're trying to do.

Maybe what she's asking is: what are the problems you're facing with FoundationDB that would cause you to consider something else right now?

Again, I'm not the big metadata expert at Firebolt, so I'm actually not the perfect person to answer this. Shoot me an email and I'd be happy to put you in contact with the right people so you can ask them. I just don't have a perfect answer for that at the moment.

Perfect, thanks. Good.

So, the next thing is the actual query engine; this is where I can hopefully answer all the questions coming up. In our case it has two parts, and the first one is the query planner. Here, we decided to completely replace the ClickHouse parsing and optimization layer. This is because ClickHouse has a custom SQL dialect, and also not a huge amount of optimizations within its optimizer. What we really wanted, and this was a hard requirement in a sense, was, first, a SQL dialect which is as close as possible to the Postgres dialect. This is important (a) for integrating properly with the ecosystem, and (b) for ease of use and adoption, because customers will usually have used Postgres and be familiar with that syntax. The second thing we really needed was a powerful query planner. This is a no-brainer for a cloud data warehouse, because usually the people writing queries will not be huge domain experts deeply into databases, and pretty often queries will actually be auto-generated by some application. In that case, there's a good chance that, for example, the join order specified in the SQL query doesn't match the smart join order that you actually want to execute.

Our optimizer was originally forked off Hyrise, and I saw some of the Hyrise folks, like Markus, in here earlier, so a huge shout-out to them for building that academic system at the Hasso Plattner Institute in Potsdam, close to Berlin, also in Germany.
Hyrise is a very neat standalone, single-node academic database system. It has its own SQL parser, an optimizer, and also a whole execution layer. We took the optimization layer and the parsing layer and turned them into a standalone, embedded optimizer library. For this, we had to fix up quite a few things. The first thing is that we had to heavily extend the parser to be closer to Postgres, which we talked about before. The second thing is that we put Hyrise, as I said, as an embedded library into our actual compute clusters, the things running the ClickHouse engine as the runtime. This is interesting because we then don't have a separate parsing and planning service at the top which routes plans to the compute clusters; rather, the planner is embedded. We made this choice, again, because of the low-latency requirement: we really want to minimize the network round trips happening in the hot path of a query. This is super important, because over time, as you build up more and more of these network round trips, they start accumulating and just add constant latency to every query, which is going to be hard to get rid of later.

So we turned Hyrise into this kind of stateless optimizer; you also don't want the single-node schema handling from the original Hyrise in there, but rather tie this into your metadata service. And then we built tons of different rule-based optimizations into it: for example, aggregation pushdown (sketched right below), view insertion, and things like common subtree elimination. Hyrise also has, which is very nice, basic cost-based optimizations like join reordering, which we worked on quite a bit. However, for these cost-based things, what we're really seeing is that in a sense the hardest part for us is collecting solid statistics for the optimizer and then feeding them back into the planner in an efficient way. At the moment we are doing things like cost-based join order optimization, but our estimates could actually be a lot better, and this is also going to be a big focus for the upcoming year: getting better cardinality estimates and so on.
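To make one of those rule-based rewrites concrete, here is a hedged sketch of aggregation pushdown on two hypothetical tables, orders and customers; it illustrates the general idea of pre-aggregating below a join, not Firebolt's actual plan output:

```sql
-- Original query: join first, then aggregate.
SELECT c.region, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region;

-- After aggregation pushdown: pre-aggregate the large side below the
-- join, shrinking its output from one row per order to one per customer.
-- The rewrite is valid here because customers.id is a key, so every
-- pre-aggregated row matches at most one customer row.
SELECT c.region, SUM(o.amount_sum) AS revenue
FROM (
    SELECT customer_id, SUM(amount) AS amount_sum
    FROM orders
    GROUP BY customer_id
) AS o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region;
```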
I'd like to ask a question here. In contrast with a competitor like Snowflake: one of the strengths they claim is that when the database handles ingestion, you have really good knowledge about the data you just ingested. So I want to understand what the challenge is. Since you are handling the ingestion, why are the statistics not just super accurate, based on watching the ingestion?

In general, yes, this is a very nice setting in which to build a statistics engine for your optimizer. At ingestion time, you're building these files and writing them to S3, and since you're sorting aggressively, partitioning, compressing, and so on, you've touched all the data anyway. The most important thing is that you have to keep these statistics consistent: you can't say, okay, I want to run a statistics refresh, go over everything in S3 and do sampling there; that's going to take insane amounts of CPU resources and also just be very slow. The really hard thing is keeping all your statistics consistent while interfacing with your metadata layer, and then making them quickly available to your optimizer, because if you're touching, say, thousands of files on S3, you can't collect statistics from all of them in the hot path of your query execution. That's actually quite a bit of an engineering challenge.

Thank you. Perfect.

Good. So, the next thing I want to talk about at the engine level is distributed query processing. ClickHouse, as we talked about earlier, has distributed execution within its runtime, and this scales well for some types of queries. Examples are distributed aggregations with small result cardinality, or queries where you can, for example, broadcast intermediate results to the different nodes in the cluster. However, general-purpose data warehousing, and remember this is also a use case we want to excel at, is full of extremely complex queries: aggregations with massive result cardinality, distributed joins between extremely large tables. So what we're currently doing at the engine level is completely reworking distributed execution. The idea is to really build a state-of-the-art distributed execution layer which takes a query and breaks it down into different stages, which are then scheduled across the cluster. As the backbone here, we're building a high-performance shuffle operator.

For those of you who haven't thought about distributed processing before: if you want to build a distributed join, for example, you need a shuffle operator, and the shuffle operator ultimately implements a partitioning function across your network. Say you're doing an inner join of two relations on two columns. What you want to make sure is that you're able to do these joins locally on every node. You can achieve this by, for example, hash partitioning on the join columns and shuffling the rows over the network. Because the partitioning function is deterministic and the same for both sides, rows that join will end up on the same node; then you can do a local join and feed the result either to the user, or into the next stage, the next shuffle, to, for example, do an aggregation on a different column.

One thing that's interesting about the way we're building shuffle and distributed processing, and which is maybe different from how some others approach this problem, is that our design is extremely focused on low latency instead of fault tolerance. At the moment, and this ties into what we talked about earlier, we want to really excel at these low-latency queries. For that, it can be unacceptable to write everything to S3 as intermediate results within your shuffle stages and read it all back again, just because you're adding so much latency on both the reading and the writing side. Our whole design tries to optimize for this.
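The real shuffle is an operator inside the engine, not SQL, but the partitioning function at its core is easy to illustrate. A hedged sketch against the hypothetical orders table from before, using one concrete hash function that ClickHouse happens to ship:

```sql
-- Every row is routed to hash(join_key) mod cluster_size.
-- On a 3-node engine, both join sides apply the same function to the
-- join column, so rows with equal customer_id values land on the same
-- target node, and each node can then run a purely local join.
SELECT
    customer_id,
    cityHash64(customer_id) % 3 AS target_node
FROM orders;
```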
Excuse me, do you by any chance write intermediate results into the cache, not to S3?

What's the cache in this case, the local SSDs for example? Yes, definitely, because you do have workloads which go beyond main memory, and you have to be able to support those; otherwise you're not going to be able to scale to dozens of terabytes. Even a very large cluster might have a few terabytes of main memory capacity, but that doesn't scale to infinity, so you do have to use your local SSDs for intermediate results, or for spilling parts of aggregations or joins during the actual execution.

So let's do a quick example of distributed processing. Again we have S3 in the background, at the bottom of the hierarchy, with two files, sparse indexes, and in this case six ranges per file. We then spin up an engine, which in our case has three nodes; this is not a nice power of two, but it was nicer to visualize. Each of these nodes has its local SSD cache, which again has all the sparse indexes for the files loaded, and also parts of the ranges. Here we have an injective mapping from the backing ranges on S3 onto the different nodes: not every range might be loaded into a cache at the moment, but if a range is loaded, it's loaded onto exactly one node. And we have the following query: A join B, where B has a very selective filter on top, so B is on the right side, which in this case is meant to be the build side. I know some people put the build side on the left; in our case it's on the right. And we have these two operators S1 and S2 in here, which are the shuffle operators. Going back to what I talked about: to be able to run this join in a distributed fashion, you have to put a shuffle before both sides of your join to align the keys, and then do a local join.

Exactly. And in our case, as we said earlier, our optimizer is embedded on the compute nodes, and for this query, node one serves as the primary. This means that in the beginning, we have to schedule the build side of this hash join. We send a task to each of the nodes which says: hey, scan the local ranges you have of B, apply the filter, and feed the rows which pass the filter into shuffle one. Once this is done, all the nodes report back to node one: okay, I'm done with this task, and we can go on to the next one. Note that node one also has an arrow to itself. This is of course a path you can optimize, but node one also has to take part in query execution, which is why we're visualizing it here. Now we get to the scan on A, feeding into shuffle two, and only once this is done can we actually start scheduling the local hash join. In the first two tasks, we wrote to the shuffles, where the partitioning was applied, so when we start task three, we can be sure that the data is already shuffled. We can read from the shuffles, do the local joins, and then pass the result to the user, either by uniting it at the primary, or maybe by exporting it to S3. And exactly this is the stack we're completely rebuilding in our engine; it's a pretty significant engineering effort.

Excuse me, on how you do your task scheduling: if two queries show up at the same time, or one after another, is it just first come, first served, or are you actually trying to be clever about it? Because if you're trying to do the low-latency thing we talked about, you do have to support concurrent queries.
Right. You don't want to run just one query at a time, because then you have exactly zero concurrency on your cluster, and your resources might not be nicely saturated.

Right, but I'm saying: when you decide what to send out next, do you give priority to the query that showed up first, or are you looking ahead, thinking, I know these tasks are going to be parallelizable versus single-node, and then doing some scheduling that way?

Right. There are a few things we're trying to do here. First off, we're trying to nicely isolate resources across queries. For example, we built a memory manager which tries to make sure that all queries have enough memory and that no single query can starve the others. And on the CPU allocation side, we're also trying to be as fair as possible. Usually, fair is a decent heuristic in this case, because if you're running customer dashboards, the dashboard queries will actually be pretty similar, so it's a reasonable policy to implement.

A follow-up question: in the context of a very distributed query, where you have lots of nodes working on many different parts of the query, if you have a straggler, how do you do the restart? And also, I want to understand: does this deviate a lot from the original ClickHouse execution model? That's my question.

Maybe let me start with the second question. Yes, this is a huge deviation from ClickHouse, and this ties into where we're headed. It's not about keeping ClickHouse as the runtime forever; it's about building on top of it and moving beyond it. We're keeping some of the vectorized operators at the moment, but this is completely changing the flow of a query through all parts of the engine. Going back to your first question on how to deal with stragglers: this is a pretty significant question which is not super easy to answer, and you'd rather not run into that case at all. The first step is to not have as many stragglers in the first place, and for this you need good partitioning functions. That's often not so hard with hash partitioning (sometimes it is, if you have heavy skew), but it's not so easy for range partitioning, because there, again, you need statistics and so on. Getting this right is actually pretty hard. I don't want to drill super deep into this because I have one or two more things I want to show you that I think are interesting, but I'll stick around in the Q&A for as long as I have to, so you can keep drilling holes into me.

I have a two-year-old, so we've got to cut it off at some point. Keep going.

Perfect. How much more time do I actually have? You said 45 minutes, but we also spent some time on questions, so I can either wrap it up now, or, I think I can do the rest in about four minutes. Is that fine? That's fine, we can go a little longer.

Okay, awesome. Good. So let's move on. The next thing is the service layer and orchestration layer. As I said, this is huge for a cloud data warehouse, because you have tons of services: admin, metadata, billing, security, and so on. And you also have to orchestrate and provision the compute resources in your clusters.
Our services are mostly built in Go and orchestrated through Kubernetes, and currently our engine provisioning happens through Terraform. We thought in the beginning that this was a great choice, because Terraform is great at spinning up instances and loading binaries onto them. But as we're moving along and maturing, we're seeing that Terraform really only solves the provisioning part of things; it doesn't help a lot with orchestration. If you want to orchestrate, do health checks, updates, recovery, and so on, you have to either use different open source projects which you start integrating, or try to roll your own on top of Terraform. What would be a lot nicer in this case is to use the mature Kubernetes infrastructure, which helps you solve a lot of these problems. So this is a lesson learned for us, and at the moment we're moving all of our engine provisioning over to Kubernetes, which will allow for things like better scaling, better recovery, better updates when we're deploying new versions, and also things like better telemetry.

The final thing I want to talk about very quickly before wrapping up is the ecosystem side of things. The only thing I really want to stress here is just how huge the ecosystem has become. Here you can see just some of the players in the ecosystem and how they integrate with your cloud data warehouse; some of these are still on our roadmap and we hope to get there soon, and FYI, this doesn't represent the whole ecosystem we support. You have tons of tools for ETL, ELT, and streaming: Fivetran, Rivery, Kafka. Then you have tools doing the actual transformations in your database, and dbt (formerly Fishtown Analytics) is the perfect example of this. You have tons of different tools consuming your data in different ways: BI and visualization tools like Looker and Tableau, and data science tools, like Python in general, or querying through Jupyter notebooks, which is of course big for ML workflows. You need to support all of this, either by building your own connectors or by building partnerships with these companies so they build connectors for you. And here as well, being close to Postgres SQL is pretty nice, because it makes it a lot easier, either for yourself or for the companies in the ecosystem, to build connectors.

So, this wraps up the talk. I want to go back and spend two more minutes just going over what we talked about and the lessons learned along the way. We first talked about decoupled storage and compute. The lesson learned for us here is that in the first step of building a cloud data warehouse, even more important than getting query performance and engine performance right is getting your storage and indexing layer right, because keeping data close to your compute, building proper buffer managers, and indexing aggressively is the most important thing you can do to get these low-latency queries right. Then we talked about metadata, and I'm sure there will be a few more questions on that later. The lesson learned here is that if you want to build this right, think about all the different use cases you want to support, think about multi-tenancy, think about all the different types of data you will store, and try to design a service which will be extensible enough to support all of this over time.
Then we talked about the query engine. Here, we said that you need to build a high-performance distributed runtime which is able to scale to, in a sense, arbitrary data sizes, and that you really want your own custom high-performance optimizer and planner in order to deal with arbitrary, and often auto-generated, queries. Finally, we talked about the service layer. A lesson learned here is that building a state-of-the-art cloud-native microservice layer running on Kubernetes is actually quite a nice fit. Of course, the actual engines doing the compute will be quite monolithic; I think this is inherent to a lot of database systems, and it's not a bad thing. But for the service layer, going in a microservice direction is pretty nice in our opinion.

So, this concludes the talk. I'm sure there will be some questions now. Thank you so much for joining in. And we are hiring, so either go to our website or write me an email if you want to learn more. Thank you.

Thank you for doing this. We have almost 10 minutes for questions. Chris has a question in the chat; you want to unmute yourself and go for it?

Yeah, thanks Benjamin for the talk. This is Chris from Databricks. I was curious: what's the typical service front-end latency?

Because I'm not super heavily involved in that part, I can't give you a super precise answer. But I can say that we are able to execute queries, ones that obviously don't touch insane amounts of data, in the order of a few dozen milliseconds, and this goes through the service layer, the metadata layer, and the engine layer. Overall, we're trying to keep the latency low in all parts, and with everything we're building, this is something we keep in mind in our testing.

Cool, thanks again. Sure. Maybe one more thing, actually: I want to give a huge shout-out on the ClickHouse side to Robert from Altinity, who previously gave a talk, I think in this seminar series. If any of you want to know more about that system, go and watch that great talk; I put the link in the chat.

Perfect. Okay, I've got a question. Yeah, go for it. So, you replaced the storage engine, you replaced the runtime, you replaced the optimizer. What percentage of your system is really the original ClickHouse?

Right, I mean, what's the metric here? How do you quantify a percentage?

I don't know, lines of code.

Lines of code, I actually think, is not a great metric for this, because there are some things which are just very hard to build, and those are the things we're trying to replace first: the distributed runtime, or the optimizer, I would say. At the moment, in the actual vectorized runtime, the operators for example, many of these are still pretty close to ClickHouse, but we are changing some, the joins for example. And I think, especially in 2022 as we're ramping up teams and so on, less and less will actually look like ClickHouse, or look like Hyrise. At the moment, for example, I can tell you that in our optimizer, which we originally forked off Hyrise, most of the code is actually from us; if you look at the optimizer now, you don't necessarily recognize large parts of Hyrise anymore.
So have you done any performance comparisons with ClickHouse?

I'd actually rather not get deep into benchmarks, because that's such a hot topic right now, and there's so much controversy around these things.

He's not on Twitter, he doesn't know. There's a fight between Snowflake and Databricks right now, and blood's been shed. We don't need to get into benchmarks right now.

Okay. There's a question from the chat; I can ask him to unmute, otherwise I'll just read it. He asks: what kind of query optimization do you do when you have more than 50 relations in your query?

Right, great question. First off, just the general rule-based optimizations: filter pushdown, aggregation pushdown, all these things, view insertion. And then the join order optimizer, which at the moment largely comes from Hyrise (I think it was actually built by Moritz and the others there), is a bottom-up DPccp algorithm. Over time we'll look to replace this; there are obviously fancy papers out there, for example by the Munich group, on how to make join ordering scale nicely even in these bottom-up dynamic programming approaches for large numbers of relations. At the moment, our production queries don't tend to have an insane number of joins, so building a super fast join order optimizer for 50 or 100 relations is actually not our most pressing problem in a sense.

I'm curious, since you see these real-time analytical workloads (we could ask the Rockset guys and the Druid guys as well): what is the typical number of relations you see in a single join query? It's the tail, I understand, but still.

This is a very interesting question. I would say most queries do have a few joins in them, so four, five, something between two and five. Sure, there are queries which just go over big tables doing massive scans, but you do see joins frequently. On the tail end, you don't actually see tons of huge queries at the moment. But I do want to be up front here and say that there's of course some bias in the telemetry you get, because you get telemetry on the things you're actually good at. As I talked about earlier, distributed joins are something we're heavily investing in right now, but we're not where Snowflake is, or where Databricks is, and I would assume that in their workload data, they actually see larger join queries.

All right, cool. Any other questions? I think my last question would be going back to the optimizer. I understand you ripped it out of Hyrise and worked with that. But at the time, did you consider, I mean, the ClickHouse optimizer, at least as of last year, was very primitive, right? It was rule-based. I understand why you wanted something different, but it didn't seem that the Hyrise one would be that much more sophisticated than what ClickHouse already had. And again, they may have fixed it or worked on it in the last year, I don't know, I haven't looked at it. I mean, pulling out the Postgres one is not really easy, but there's Orca from the Greenplum guys, there's Calcite, which a lot of people are using. Were there these other options?
Great question, and there are a few data points here I have for you. First off, the choice of Hyrise was about it being a neatly built parser and optimizer which you can turn into an embedded library and then extend. It was not about choosing an optimizer which would serve us forever; it was about choosing an optimizer we could actually build upon and turn into our own, and for that, Hyrise was a great choice. Then, on the alternatives-considered side: Calcite is a good example, but (a) we wanted to be a C++ shop. Calcite is built in Java and runs on the JVM, and we didn't want to have two languages on the actual high-performance database side. And (b) it would have meant splitting our processes, so our way of embedding the optimizer would have been very hard, and that's super important to us for the low-latency stuff. So really, this only leaves the C++ optimizers. There, another option nowadays would maybe be DuckDB, which is a great system, but three years ago it wasn't as mature as it is now; the DuckDB folks did an amazing job maturing that system over the past years. So that's why, back then, we decided to go with Hyrise.

All right, let's go to Sandeep, and then we'll finish up.

Yeah, I have a quick question. What is your biggest differentiator compared to Snowflake? How are you able to achieve much better latency and concurrency?

And not just Snowflake; I think BigQuery as well is another one to consider.

Right. So first off, I don't want to compare us too much to the competition in terms of performance numbers; again, this is hot ground to tread at the moment. But I hope that from my talk I was able to convey what types of workloads we're optimizing for. These are very different workloads from the traditional internal batch BI types of workloads, and that's what we're trying to excel at right now. That's how we're trying to be different, in a sense.

I mean, I guess the answer is that your competitor is not really the Snowflakes of the world; it's the Rocksets, it's the Druids, right? Materialize, in some ways.

I would actually partly disagree with that. We do want to build a cloud data warehouse, a full-fledged cloud data warehouse which is able to process super complex joins, super complex aggregations, and so on, while also excelling at these low-latency, customer-facing experiences.