Part of the reason we invited Robert to give this talk is that ClickHouse is one of these interesting systems out there that we don't know a lot about from an academic standpoint, but the feature list it supports is certainly very impressive. Robert is the CEO of a newish startup out of Berkeley, here in the US, that is the commercial representation of ClickHouse, the company selling and backing it outside of Yandex, and he'll explain what that means in a second. Prior to this he was at VMware, because he was at a startup called Continuent that was bought by VMware, and he was the CEO there for almost nine years. His technical background is non-computer science: an undergraduate degree in medieval studies and then a master's degree from the University of Washington in Japanese international studies, which is a very rare background, and that's awesome. You'd be surprised how useful it is in computer science; I read a lot of Cicero when I was in college, and I use that more than computer science sometimes. Okay, so the way we're going to do this: everyone stays muted so there's no background noise while Robert gives his presentation, but if you have a question, just unmute yourself and interrupt him, and please also introduce yourself so that Robert and everyone knows who you are. Sound good? Okay, Robert, the floor is yours.

All right, thank you so much, Andy. It's a total honor to be here. We've been looking for opportunities to introduce ClickHouse to people in the academic community, particularly in the United States, so this is a true pleasure, and as you'll see, I'm kind of a database geek. I'm going to put some constraints on that so you don't ask me questions I can't possibly answer, but I love databases, so if you have questions and you want to stay around and talk, I'm your man; I love doing this stuff.
What I'm going to be talking about, then, is ClickHouse, and the title is "The Fastest Data Warehouse You've Never Heard Of." We give it that title because ClickHouse is pretty well known in other parts of the world but is just beginning to really pick up steam in the United States and Western Europe. So what I'm going to do here is give a basic introduction to its capabilities and the things that are really causing people to get excited about it. Let me move forward; good, the slides are working. Just a little bit about me: as Andy said, I have this background, and really I've been a hacker since I was a kid. I started on a time-sharing service in Ann Arbor, Michigan in 1972. The science teacher at the school I went to thought we should learn how to program, so they bought a Texas Instruments teletype with an acoustic coupler and modem that ran at 300 baud, and a pink book that talked about this language called BASIC, and that's where I got started. I've been working on databases since 1983, and that's what I've done for most of my professional career. They've included everything from Model 204, the first one I started on, through Sybase, where I worked for many years; I've built a lot of apps on Oracle and SQL Server; and then all the way up to ClickHouse, which is approximately database number 20 for me. The company I work for is called Altinity. As Andy said, we're a startup, and our goal is to offer enterprise software and services for ClickHouse, basically to make enterprises, in the United States and Western Europe in particular, productive and successful on ClickHouse. Beyond the business side, we're a major committer and community sponsor in the US and Western Europe. At the end of last year we had 20 people on the team, actually about 19.
We participated in some 16 events, many of which we sponsored ourselves, and we're the number two committer, after Yandex, to the ClickHouse server as well as to ecosystem projects. So that's about us. For the rest of this I want to dive in and talk about ClickHouse, but before I start I'd like to do a really short demo, which I think gives a sense of why people get so stoked about it.

What I'm looking at right here is a screen SSH'd into a VM running up in Amazon. It's got eight vCPUs, 32 GB of RAM, and it's using local NVMe SSD storage, so pretty good storage underneath it, but not a particularly powerful VM beyond that. I'm going to jump in and talk straight to ClickHouse; this is a more or less bleeding-edge build we're talking to. I want to do a very quick demo that gives you a sense of the speed of ClickHouse, and the first thing I'm going to do has nothing to do with queries: I'm going to apply a setting that says, hey, use direct I/O. No cheating here, no reading pages out of the OS buffer cache. With that set, I have a data set with taxi data in it, and I'm going to count the number of records in it. The table is called trip data; the count takes a fraction of a second, about one millisecond, and gives us 1.3 billion rows. The fact that I can do it in one millisecond is not very exciting, because there are all kinds of ways databases can cheat on counts; all I wanted was the number of records, because I'm going to run a little experiment: I'm going to generate that many numbers and take their average. So what I'm running here is a query that generates numbers out of a special table called system.numbers_mt, where "mt" stands for multi-threaded, which means we can read them in a parallel way, and I computed the average. It's a monotonically increasing sequence: the first number is one, and it goes all the way up to 1.31 billion. The important thing to note is that it took 1.162 seconds to get this result. So the question now is: suppose I go back to that taxi data set and want the average number of passengers per taxi ride. Do you think it would be faster to do it the way I just did, purely in memory, everything on the CPU, or to compute that average by scanning the real data set? If anybody wants to throw in an answer: in memory, or go to storage?

A quick clarification from the audience: the second query, the average, is over a synthetic data set, just a bunch of generated numbers? Yes, just a monotonically increasing set of numbers coming off a function. So the question is whether it would be faster to read from disk or to do it the way I did here, from memory. Steven, an attendee, suggested that a traditional data warehouse would keep maps and statistics, so storage should be faster than actually generating all those numbers. Storage is indeed going to be faster, but not for that reason: one of the things I did at the beginning was force this not to use any caches, and the only cache that could even come into play here is the OS buffer cache, which direct I/O bypasses. Now I'm going to run that average, and it's a brute-force average: we go out and read the data directly off the data set. Here we go; bang, it gets done in 0.5 seconds. So you're absolutely right that it was faster, but the reason it was faster was twofold.
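In SQL, the demo ran roughly like this. This is a sketch, not the exact demo script; the table name `trips_mergetree`, the column `passenger_count`, and the exact row count are assumptions based on the public NYC taxi dataset.

```sql
-- Force direct I/O so reads bypass the OS page cache (no caching tricks).
SET min_bytes_to_use_direct_io = 1;

-- How many rows do we have? Returns ~1.3 billion almost instantly.
SELECT count() FROM trips_mergetree;

-- In-memory baseline: average the same quantity of generated numbers.
SELECT avg(number)
FROM (SELECT number FROM system.numbers_mt LIMIT 1310903963);

-- Brute-force scan of the real column: compact storage plus parallel
-- I/O make this faster than the purely in-memory run.
SELECT avg(passenger_count) FROM trips_mergetree;
```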
Reason number one: ClickHouse is incredibly parsimonious in the way it stores data, so this was well compressed. Passenger counts are generally small numbers, one, two, three, something bouncing around there, so we only use a byte for each, and it's compressed on top of that; we're reading a relatively small amount of data. The second thing is that ClickHouse is incredibly efficient at parallelizing I/O, and in this particular case it turns out ClickHouse is not so good at parallelizing the function that was generating the numbers. What happened was that for the scan I was able to use all eight vCPUs to go out, read the data, assemble the aggregates, and put them together, so going straight to the data took only half the time of working purely in memory. That's an illustration, and it's just a brute-force table scan that's doing this. The question we always ask in ClickHouse is: can we make it faster? And with ClickHouse, that's generally the case. This is a good speed if you're an analyst doing OLAP-type queries, slicing and dicing: you want to look at the data one way, then another way, and you're not going to have to wait very long. The interesting thing now is to see whether we can get this down to the level where it could actually drive something like real-time bidding for adtech; if you're familiar with adtech, you know the response time has to be 10 milliseconds or less. It turns out we can do that. I've constructed a materialized view, which I'll talk about later in this talk, that takes those 1.3 billion taxi rides and aggregates them by day. When I run off the materialized view, the answer, which you can see is exactly the same out to 15 or so decimal places, now comes back in two milliseconds. So with ClickHouse you not only get very high OLAP speed on completely random queries, results fast enough for humans; by pre-aggregating you can also get results fast enough to serve the needs of machines.

A question from the audience: are these query plans cached? Are there prepared statements, or is parsing, binding, and optimizing happening every time? Nothing is cached here; it's all brute force. In fact, as we look into the features of ClickHouse, it's really important to realize what kind of car ClickHouse is: it's a drag racer. It maybe doesn't have doors, it maybe doesn't take a lot of passengers, it maybe doesn't have gears, but it's incredibly fast, and it doesn't have to turn. To put it in database terms, we don't have a cost-based query optimizer, for example. We do planning, but it's very basic; if you want the plan done a different way, you have to tell ClickHouse what you want. It's a what-you-see-is-what-you-get kind of product. Another question: are your catalogs transactional? For that query you're doing a lookup to map the table name to the table object; does the 0.002 seconds include committing a transaction? Actually, the transactional overhead is very, very minimal in this system. We don't have a fully ACID transaction model; that's another place where we save. In fact, the notion of having transactions at the row level does not exist in ClickHouse, and that's a great segue: I'm going to dive into the talk and explain the features of ClickHouse that allow us to do this. So now you sort of understand why people care about ClickHouse; it's the speed that really gets people to say, "I've got to look at that."
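The kind of materialized view behind that two-millisecond number could look roughly like this; this is a sketch under assumed names, not the view actually used in the demo.

```sql
-- Roll 1.3B rides up by day as new rows arrive (add POPULATE to backfill).
CREATE MATERIALIZED VIEW trips_by_day
ENGINE = SummingMergeTree
ORDER BY pickup_date
AS SELECT
    toDate(pickup_datetime) AS pickup_date,
    count() AS rides,
    sum(passenger_count) AS passengers
FROM trips_mergetree
GROUP BY pickup_date;

-- Machine-speed query: scan a few thousand rollup rows instead of 1.3B.
SELECT sum(passengers) / sum(rides) AS avg_passengers
FROM trips_by_day;
```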
But there are a number of other things I think are interesting at the top level. First of all, it's a C++ binary with a very, very simple architecture: there's just one binary, you run it, and that's your server, kind of like MySQL. It talks SQL. It has a shared-nothing architecture, so compute and storage are bound together, particularly as you scale it out, sort of like the original Vertica architecture. It has column storage with vectorized query execution, built-in sharding and replication, and it's released under an Apache 2.0 license. One thing I should note about the C++: it's unusually well-written C++. I'm not a C++ programmer and did not write any of the code in this server, but when I get stuck on ClickHouse, I go read the code. It's excellent; you can figure out what's going on and reason about what it's doing. So that's the basic feature set.

Let me tell you a little about the history of ClickHouse. It's been around for a long time, and one of the reasons it's so capable is that it has the necessary decade of experience of people hammering on it. It started as a simple GROUP BY prototype written by Alexey Milovidov, who is the main committer, basically to solve the problem of doing funnel analytics on the Yandex website: to understand where people drop out and what interests people on the site. It was sufficiently successful that by 2011 they had it deployed, underlying a product called Yandex Metrica, which is kind of Yandex's version of Google Analytics. For the next five years or so it ran Metrica and a bunch of internal apps, and it just kept expanding out to more use cases, until in 2016 Milovidov managed to persuade his management to release it under Apache 2.0. The reason it was released was that Yandex felt, hey, we've got something pretty cool; we want people to see what kind of technology we can do. So they went ahead and released it, and from then on things happened pretty quickly. There was a Cloudflare article about how they were using it for their DNS analytics, followed by another one on web analytics. There was a 1.1-billion-taxi-rides benchmark by Mark Litwintschik, and things began to move forward pretty quickly thereafter. At that point it began to get some real mind share in the United States, and as of 2020 we're working to increase worldwide adoption, but in particular to build a really strong presence in the US. (I apologize, I've got a menu bar at the bottom of the screen; Windows is misbehaving today. It's fine, I'm just going to keep going.)

All right, so why is it fast? I think there are two types of things ClickHouse does that make it really great. One is that the code itself is optimized for speed: if you can make anything faster in the code, ClickHouse is going to do it. For example, instead of having abstract interfaces that point down to algorithms, the interfaces are all based on what the algorithms need; if you need initialization to make something go fast, well, that's got to be in your public interface. There's also a huge amount of specialization: for GROUP BY there are 14 different algorithms, and there are other types of specialization as well. Alexey Milovidov has a great talk where he describes how they think about optimizing the code. And finally, query execution is vectorized, so we need at least SSE 4.2 on the chips we run on, and execution is dispatched across all available cores. If you run this on a machine with 16 cores and you run htop, you're going to see them pretty much light up to a hundred percent when these queries run.

A question about the multi-armed bandit for LZ4: there are multiple algorithms for LZ4 decompression, and what ClickHouse actually does is some sampling beforehand, which it uses to pick an appropriate decompression algorithm. The differences are not huge, but again, I recommend Alexey's talk; it has a great example of how they go about that optimization, and it's dynamic, based on the data they're looking at.

Now let me go into how ClickHouse works internally. This is a SQL database: it has tables, and the workhorse table engine is something called MergeTree. As you'll see, ClickHouse has a bunch of table engines, but this one, and its variants, are the main ones people use. If you're familiar with SQL, you see CREATE TABLE ontime, you see some data types, and then you see this ENGINE = MergeTree clause that says, I want this particular way of managing data. You've also got a partition key that says, break it up; this is flight on-time data, as it turns out, and we break it up by month, and within these parts, as we call them, order by carrier and flight date. That's a typical table definition, and what it creates in storage is this MergeTree layout, which consists of a bunch of parts. The parts are chunks of data which are indexed: they have a sparse index, and by sparse I mean it only has entries for every 8192 rows or so by default. Then all the data is ordered in columns, which are both sorted and compressed. So when you look at a table on disk, you'll see each part as a directory, and then a bunch of files underneath that contain the innards.
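The slide's table definition had roughly this shape; columns are abbreviated, and the names follow the public on-time flight dataset:

```sql
CREATE TABLE ontime (
    FlightDate Date,
    Carrier String,
    Origin String,
    Dest String,
    DepDelay Int32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(FlightDate)  -- one set of parts per month
ORDER BY (Carrier, FlightDate);    -- sort order inside each part; drives the sparse index
```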
What I'm going to do next is look inside the layout of one of these parts, because this is really the heart of how ClickHouse is so fast. Here's the sparse index: it's in a file called primary.idx, and the idea is that every 8192 rows there's an entry in the index. The span between index entries is called a granule, and one of the properties is that you're always going to read the equivalent of roughly 8000 rows at once. Then, over in the columns, each column has two files. One is a .mrk file, an array of pointers that relates the granules to the data: granule number three, say, is related to a segment inside the .bin file, which is a compressed block. The marks let us quickly say: if we want to read granule number three, go look in the mark file, and here's the offset inside the .bin file where we need to start decompressing data. So that's the basic layout, and for almost all data you want to scan quickly, this is how it's organized.

One of the problems you get in these systems is that they hold very large amounts of data, and to get large amounts of data in, you have to put it in pretty fast. ClickHouse is optimized to let data be inserted as quickly as possible and reorganized later so that it's more efficient to query. If I insert data into this table, every insert takes whatever I inserted, say a block of a thousand rows, or ten thousand, or a hundred thousand, and creates a part, and that part is instantly queryable. As soon as the insert completes, and I can have two inserts completing in parallel, I can immediately do a SELECT count() and it's going to show the data. However, this isn't particularly efficient, because as you saw in the structure there are a lot of little files, and if I have to read a lot of parts to do a count, or whatever query I'm running, I'm going to spend a lot of time manipulating storage. So ClickHouse does what are called background merges: from time to time it examines the parts, puts them together into a new part, and in an atomic operation does a switcheroo, so that queries from then on use the collapsed part. The idea is you get a quick but half-hearted organization of the data going in, which means it's instantly queryable, and over time it becomes more and more efficiently organized, so your queries run as fast as possible. This is a really fundamental behavior of the ClickHouse server and one of the reasons you can have very rapid ingest but also highly efficient reads.

A question: so there's no global catalog state? It's shared-nothing, so every node has the information on what it contains, but there are no guarantees across the different nodes? No, and in fact, I'll talk about replication a little later, but the sharding and replication model is multi-master and eventually consistent, with ZooKeeper used as the source of consensus on which parts actually need to be replicated. That's a major design point in this system: we're eventually consistent. And since we're talking about transactions, as I promised: the transactional unit in ClickHouse is the part. Individual parts are written atomically, so a part either goes out entirely or not at all. There's no notion of doing a transaction at the level of a row, so when we change things, when we delete data for example, we typically have to rewrite the parts, and that's the unit of transactional work.
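The insert-then-merge life cycle can be sketched like this, reusing the illustrative `ontime` table:

```sql
-- Every INSERT creates a new part that is immediately queryable.
INSERT INTO ontime (FlightDate, Carrier, Origin, Dest, DepDelay)
VALUES ('2019-09-01', 'AA', 'JFK', 'LAX', 5);

-- Watch parts appear and later collapse as background merges run.
SELECT partition, name, rows, active
FROM system.parts
WHERE table = 'ontime';

-- Merges normally happen in the background; this forces one by hand.
OPTIMIZE TABLE ontime FINAL;
```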
Then a question: when you do the merge, is that when the more heavyweight compression happens? What actually happens is that the compression is the same; you just have longer ranges of data, and so more chance, based on your sorting order, for data to land in some favorable ordering that gets you even better compression than before. The difference between having a lot of parts and having a few can often be an order of magnitude in query speed, and that's pretty simple to demonstrate. Speaking of I/O, the base performance of ClickHouse comes from the fact that we fundamentally don't use a lot of tricks: we're basically doing brute-force reads for most things, so the more we have to read, the slower we go. What you're always trying to do in ClickHouse is take your query and minimize the amount of data you traverse when you run it. Here's a simple example that shows query response time against the number of marks we read. This is airline data, and it's a very simple example; I don't include a join. First we read all of the marks, every block in every column we need to look at to get this query result, and we can see it takes 0.4 seconds. If we restrict it to a year, it takes much less; if we restrict it to 40 days, even less. ClickHouse is fairly intelligent about this WHERE clause: it knows the predicate can be related to the partition key, and it's able to avoid even looking at partitions in this case, which further speeds up the queries. And you can see that the marks read and the query response are just one-to-one at this point.

Beyond that, since we have good parallelization of the I/O, if you can add CPUs it really helps your execution. Here's another simple example, a query with a GROUP BY and an ORDER BY, tried with different numbers of hardware threads. max_threads is a setting that says how many cores to use; anything you can see in /proc counts as a core. We run it at two, four, six, and eight: at two we get a certain level, and if we double it, performance pretty much doubles, but then it drops off asymptotically. You're beginning to see a sort of Amdahl effect at six and eight, probably because you're spending some time finishing the GROUP BY, doing your sort, and then spitting out the ten rows you found at the top. But generally speaking, if you want to speed something up, you reduce the amount of I/O and apply more CPUs, and you get faster throughput.

Up to now, I think this is fairly standard; there are a lot of databases that do this. But that's just the base performance, what you get by doing nothing more than defining a table and running the queries. What's interesting is that ClickHouse also has a boatload of performance features. This is kind of an eye chart, and I felt bad about it because there were more things I really wanted to put in here; these are just some of the big ones. There's LZ4 and ZSTD compression, with LZ4 as the default. We do our best to cache the primary key index as well as those mark files. We can do sampling. We can use skip indexes; there is in fact one that is based on Bloom filters.
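The two experiments above, restricting reads via the partition key and varying the core count, can be sketched like this (still using the assumed `ontime` schema):

```sql
-- Predicates on the partition key let ClickHouse skip whole partitions,
-- so the number of marks read drops along with the response time.
SELECT Carrier, avg(DepDelay) AS delay
FROM ontime
WHERE toYYYYMM(FlightDate) = 201909   -- one month instead of the full table
GROUP BY Carrier
ORDER BY delay DESC
LIMIT 10;

-- Re-run the same query at different core counts to see the scaling curve.
SET max_threads = 2;  -- then 4, 6, 8 ...
```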
Skip indexes try to knock things out of the read path; it's all devoted to reducing the number of marks we have to read and, within them, maximizing the amount of data we get out of each one. What I'm going to do is zero in on just two of these to give you a flavor for how they work: codecs and compression, and materialized views. These are two of the more popular features in ClickHouse that people use for speed.

Before that, a quick question: the specialized engines, is this like MySQL in the 2000s, where everyone had these sort of off-brand random engines? That's a great analogy, and I even wanted to bring it up: it's like MySQL, except the engines actually work. In MySQL the engines were sort of a disaster; they had this elaborate interface, and of course the one engine you use is InnoDB. ClickHouse has dozens of engines; it's fairly easy to construct new ones, so ClickHouse has a raft of them, all focused on different use cases. We'll look at at least five engines in this talk, and you'll see them as we go through; it's been a very successful model.

So, compression and codecs: they're really simple. Here's a test. Column a is a string, so the first variant is just going to get LZ4 compression; I can also apply what's called a low-cardinality codec to it, which does dictionary encoding. Then I have some numbers, that's column b, and I have various combinations: a Delta encoding, which encodes the differences between successive values; DoubleDelta, which encodes the differences of the differences, the slope of the steps; and you can combine those explicitly with your favorite compression. If you specify a codec and nothing else, it doesn't compress; you have to specify the compression explicitly.
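A codec test table of the kind described could be declared like this; the column names are illustrative:

```sql
CREATE TABLE codec_test (
    a String,                              -- plain LZ4, the default
    a_lc LowCardinality(String),           -- dictionary-encode repetitive strings
    b UInt64,                              -- plain LZ4
    b_delta UInt64 CODEC(Delta),           -- store differences; no compression on top
    b_delta_lz4 UInt64 CODEC(Delta, LZ4),  -- delta, then compress the deltas
    b_dd UInt64 CODEC(DoubleDelta, LZ4)    -- differences of differences, then LZ4
)
ENGINE = MergeTree ORDER BY tuple();
```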
But this is very simple to do, and you can just bake it into your schema; moreover, it's pretty easy to change. The important thing is how much of a difference it makes, and it can be huge. This is one of the first places we go for performance tuning: somebody has a string, and we say, what's your cardinality? If it's anything under ten or fifteen thousand, make it low cardinality. This is a typical example: I don't know the exact cardinality of this data, it was in the dozens, but by the time ClickHouse had reduced the data and then compressed it, it had knocked off 89 percent, which is a pretty good compression ratio. The numeric encodings can be even bigger. A Delta encoding applied to a time series, particularly if you've sorted things so they're monotonically increasing, got us 99.5 percent in this example, and DoubleDelta was even better, because these were integers increasing fairly gently. These have a huge impact, of course, on the amount of data you're going to read: for huge ranges of columns you might only read a few marks, and that's going to speed up your queries enormously.

The other feature I want to zero in on briefly is materialized views. Materialized views exist in many databases; I've used them in Oracle, for example, and they exist in part in Postgres. They have a kind of shaky reputation in OLTP databases because they're hard to maintain if you're constantly having transactions on your base data. In ClickHouse we have the advantage that stuff doesn't change very much: you usually write the part
and then, yes, it gets merged later, but it doesn't get updated very much. So the materialized views in ClickHouse are super efficient, and you can think of them as kind of like synchronous post-insert triggers. They're also very flexible. You can use them for aggregation, which is an obvious use for views; there's a table engine called SummingMergeTree that's designed to hold aggregates. It says "summing," but it can actually hold any kind of aggregate; these tables are optimized to hold partial aggregates in their parts, which can then be merged together later. Another thing we commonly use them for is reading from Kafka: we have a table engine for Kafka that makes the topics look like tables, and you can have a materialized view that reads from it at intervals and puts the data in a real table. You can build pipelines, because you can chain the views. Here's an interesting one: last-point queries are very common in time series, like "what's the current CPU usage across all my VMs in the entire data center?" You can solve that with a plain query, but it would require looking at the entire time series for every VM. There's actually syntax in the views that lets you create an incredibly efficient last-point view; in fact, the one I'm showing here, where the compressed data read is 0.0009 percent, is a last-point query. And finally, you can use views to change the sorting or primary key order. With column stores it sometimes just helps to have the data sorted a different way; this is something the Vertica and C-Store folks realized, so you can use materialized views to create the equivalent of Vertica projections. People love materialized views; they're used very, very widely in ClickHouse implementations.
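A Kafka-to-MergeTree pipeline of the kind mentioned can be sketched with assumed topic and column names:

```sql
-- The Kafka engine makes a topic look like a table.
CREATE TABLE readings_queue (
    ts DateTime, device String, cpu Float64
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list  = 'readings',
         kafka_group_name  = 'ch_readings',
         kafka_format      = 'JSONEachRow';

-- A real table to hold the data durably.
CREATE TABLE readings (
    ts DateTime, device String, cpu Float64
) ENGINE = MergeTree ORDER BY (device, ts);

-- The materialized view acts as the post-insert trigger that drains
-- the queue into the real table.
CREATE MATERIALIZED VIEW readings_mv TO readings
AS SELECT ts, device, cpu FROM readings_queue;
```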
A question: so it sounds like you have a ClickHouse-specific dialect for this, not just generic SQL? Yes, and this is one of the places where ClickHouse really departs, because they actually have a data type corresponding to what I call partial aggregates. Because we're distributed, and because of how vectorized query dispatch works, ClickHouse will take a chunk of an array, compute partial aggregates across that chunk, and pass them back to be fully aggregated, or as we say, merged, somewhere else; for an average, for example, the partial aggregate would be a value and a weight. ClickHouse exposes those partial aggregates as a data type. It takes a little getting used to, but you can do a lot of really powerful stuff. A follow-up: but there's no query optimizer, so it's up to the application to know that? Precisely, and in fact a lot of the optimization, because it tends to be around things like how much data you have and how heavily it's compressed, relies on ClickHouse's really great system tables. There's a table called system.parts, for example, that tells you how many parts you have and the overall amount of storage; system.columns tells you all the columns in all your tables, and you can see the codecs you're using and the exact levels of compression. So you go in there, look at it, and play around with it. That's materialized views; they're super popular.

The final part of the talk is scaling. Obviously you can scale vertically: everything I've shown so far, all these features, you can improve their performance by adding more CPU and more RAM.
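The system tables mentioned can be queried directly; for example:

```sql
-- Parts and storage per table.
SELECT table,
       count() AS parts,
       sum(rows) AS total_rows,
       formatReadableSize(sum(bytes_on_disk)) AS on_disk
FROM system.parts
WHERE active
GROUP BY table;

-- Per-column codecs and exact compression ratios.
SELECT name,
       compression_codec,
       data_compressed_bytes,
       data_uncompressed_bytes
FROM system.columns
WHERE table = 'ontime';   -- 'ontime' is the example table used earlier
```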
Obviously you can scale vertically: everything I've talked about so far performs better if you add more CPU; more RAM, which is like having a bigger buffer cache, so when we need to do things again we've got the data; or more storage. These all help the nodes grow, but at some point you need horizontal scaling. People deal with data sets that run into tens of petabytes on ClickHouse, so ClickHouse has sharding and replication built in. The way to think about it: the shards are disjoint pieces of your data set. You break your tables up and spread them across the shards, and then you can replicate between them. You do the replication mostly for HA: if you lose one of the hosts, you still have other copies of that part of the table. But also, if you have a lot of concurrent queries, ClickHouse is smart enough to spread the load across the replicas. To get this to work there are more table engines. We start with a Distributed table; it's like an umbrella. It reads the definition of the shards and the replicas, basically the hosts they're on, and uses that when it processes a query: to pick which replicas to go through, and to understand that there are multiple shards it needs to dispatch a federated query across. The other engine is ReplicatedMergeTree, along with other variants of it suitable for aggregation: tables that have a path in ZooKeeper and can share data automatically, so you can insert into any replica and it will propagate asynchronously to the other replicas. Those are the base table types. Here's a little bit about what happens: the processing of parts that I showed you now happens across replicas. For example, when we have to do a merge, we'll use ZooKeeper to do a leader election.
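The two engines just described might be declared like this (a sketch; the cluster name, ZooKeeper path, and macros are assumptions that would normally come from your server configuration):

```sql
-- On each node: a replicated local table. {shard} and {replica}
-- are macros usually filled in from the host's config.
CREATE TABLE events_local (
    ts      DateTime,
    user_id UInt64,
    value   Float64
) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY (user_id, ts);

-- The "umbrella" table: queries against it fan out across the
-- shards of my_cluster; inserts are routed by the sharding key.
CREATE TABLE events AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());
```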
At any given time one of the table replicas is the leader; it's responsible for deciding what to merge and when, everybody else gets told about it, and they just follow in lockstep. When we have to do a... yes, go ahead.

You made a big deal about it being a single binary, but ZooKeeper is a whole other service.

Yes, I did make a big deal about that, and this is the one place where I kind of wish it were part of the binary as well. On the other hand, I think this was a practical decision: ZooKeeper worked well enough. There have been murmurings over the years about replacing it with etcd, which I think would probably just give us a different set of problems. Long term, maybe you'd want consensus baked into the core server process, but as you know, that's not a change you make lightly.

I want to mention Kafka: in Project Metamorphosis they just announced that they're also moving away from ZooKeeper, to use the Raft protocol directly in Kafka, since the brokers already have the capability to store the metadata themselves.

Right, and I think over time this is a place where ClickHouse could use some optimization or improvement: how it manages consensus, and a closely related topic, which is the transaction model. I said that the fundamental unit you manipulate in a transaction is the part, but there are also cases where, for example, you have chained materialized views and something breaks halfway through writing to them, and you don't necessarily clean up completely. So there's definitely some work that can be done in that area.

Distributed queries: how do they work? The application comes in and goes to the ontime distributed table, and we call that the initiator node. The ontime engine is going to look
at the query, look up where the shards are that it needs to go to, and dispatch a federated query across them. Typically, if you have a bunch of joins, the leftmost join will get dispatched down to the bottom query; depending on how you write it, you can control how much gets pushed down. In the ideal case all the aggregate states get computed locally and then brought back up to the initiator, and the final aggregation is done there and passed back; that's where you merge the partial aggregates.

Is there any kind of scheduler or priority queues here, or is it just best effort?

Oh, it's best effort. There are practical things like: if we detect that a replica is lagging, for example, there's a setting that says what's the maximum lag you'll tolerate. And I think this is a really big thing about ClickHouse, which is why I said it's kind of like a drag racer: it doesn't have doors. ClickHouse doesn't have a cost-based optimizer, let alone a distributed one that shares state across the cluster, so it's basically going to try to do exactly what you tell it to do. I'll show you an example of that in just a minute.

As for performance, when you get the setup right, it's great. This is an example of something I ran for a webinar a while back. The performance is linear: if you can push most of the query work down into the individual shard tables, it just tears. This is basically an example of two runs, one cold and one hot, where I was allowing the caches to warm; it's linear scaling, and for very large data sets this type of
behavior, where you're mostly working on a single large fact table, is typical. For very fast queries, say if you're trying to get below 10 milliseconds, network latency and call setup are going to dominate these effects, and you wouldn't see a big difference. I said ClickHouse doesn't have a query optimizer, but it does do distributed joins, as long as you give it direction. There's a keyword called GLOBAL which you can use to say where to go get things. Generally speaking, the default behavior is: if a query like the one shown at the top hits the initiator, it's going to run something that looks like what's on the left. It breaks up the subquery, and for each shard it finds a replica and just pushes the subquery down to the local table. You can, however, force a global join: if the keys of your two tables are not aligned on each host, you can force it to evaluate the right-hand side globally on the node where the query is executing. And finally you can force a broadcast, where it runs that SELECT a FROM t2 first, the result gets stuck in a temp table and broadcast out, and then the nodes join against that temp table. No magic here, though: there are query settings as well as the GLOBAL keyword that let you control this space, so you have to know what you're doing. The good news is it's not too hard to figure out what ClickHouse is doing: it has a very good query log, so you go look there, and if you're not seeing what you like, you rewrite your query. I know you guys are only going until 5:30, so I'm almost done.
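The default versus GLOBAL behavior described above can be sketched as follows (hypothetical distributed tables; only the keyword differs between the two queries):

```sql
-- Default: the join is pushed down and runs against each shard's
-- local tables, so the join keys must be co-located per shard.
SELECT f.k, count()
FROM fact_distributed AS f
INNER JOIN dim_distributed AS d ON f.k = d.k
GROUP BY f.k;

-- GLOBAL: the right-hand side is evaluated once on the initiator,
-- stored in a temp table, broadcast to every shard, and joined there.
SELECT f.k, count()
FROM fact_distributed AS f
GLOBAL INNER JOIN dim_distributed AS d ON f.k = d.k
GROUP BY f.k;
```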
I promised I would talk about use cases. There are a bunch here, and I'm not going to read them all, but where ClickHouse really has a sweet spot is any case where you have very large amounts of structured data. A simple example: network flow logs. Those contain about 16 columns, but they're pervasive throughout content delivery networks as well as public and private clouds. Whenever you have very large numbers of records and you want results very quickly, either human-quick, meaning seconds for slicing-and-dicing queries, or machine-quick, where you need results in milliseconds, ClickHouse is ideal, and we see a huge amount of penetration. Just taking the network management case: at every single network company I've talked to, even if they said they weren't using ClickHouse, sooner or later we found somebody who actually was. It's very, very popular. These are typical examples, and there are many more, with lots of companies that have talked about using it this way.

Current and future work: there's lots of it. We have hundreds of people in the community committing; most of the commits come through Yandex as well as our team. These are just a few of the areas people are working on. Our company, for example, is actively working on object storage: long term, we feel very strongly that for cloud operation we need to decouple compute and storage, so there are kind of obvious things there. There's getting better SQL support, for example window functions, for which there's a cry going up in the community; merge joins, so we can do distributed joins between large tables; all the way down to machine learning and performance. A lot of our work at Altinity has actually been around cloud-native operation, basically leveraging Kubernetes as the cluster manager so
that we can effectively spin up and spin down clusters very quickly and use pools of resources to enable multiple clusters, as opposed to a single monolithic cluster. And there's much, much more: if you go look at the release notes for ClickHouse releases, it's amazing how much has been done. For resources, these are just some pointers, and I'll send these slides out. One of the decks is "Secrets of ClickHouse Performance Optimizations"; if you like database internals, it's a wonderful talk. There are a couple of really good blog articles that got ClickHouse on the map, and then our own humble contribution: earlier this year we showed how we were able to beat ScyllaDB. I think they had like 80 nodes in their cluster and were getting a billion rows a second; we did it on a NUC. That hit Hacker News and caused some interesting discussion. And that's it. I have to remember to say we are hiring: I've got about seven positions open, and if you like databases, you'll like what we're doing. I've worked on a bunch of databases, and of all 20 that I can think of, the two closest to my heart are Sybase, where I had the honor to work for most of the 90s, and this one. It's a really fun project to work on: great code, a really great community, and a lot of really interesting applications.

Is the HQ in the Valley, or are you guys distributed?

No, we're fully distributed, because we're open source folks. We're actually a Delaware company with a subsidiary in the UK, but basically it's split between my office in Berkeley and the Ural Mountains in Russia.

Okay, awesome. So people can email you directly to send their CVs?

Yes, yes.

Fantastic. All right, I'll thank you on behalf of everyone else; you can click the
clapping button, but that's again an empty gesture. Now we can open the floor if anybody has questions; we can go until the 5:30 panels.

Thanks for the interesting talk. You have stressed a couple of times that you don't use any cost-based optimizer, and yet you are getting this fantastic performance. But isn't it true that you actually have a human-based optimizer? What is the effort? Because it requires some sophistication even to come up with these partial aggregates and with the ways to build them toward final aggregation with joins.

It requires a lot of effort, that's true. Let me think how to answer this. At the current state of ClickHouse usage it hasn't been a huge problem, because a lot of people, particularly in the United States, are early adopters who are just interested in the technology, and ClickHouse gives you a couple of tools you can use to understand what's happening. One is the text query log, along with a query log table, which shows you the queries and a lot of statistics about them, for example whether I used the index; it's kind of like the equivalent of a query plan, only a bit more readable than a typical one, actually. The second is the system tables, which let you see the level of compression, the amount of space you're using, the number of concurrent queries. So yes, it's correct that people are solving these problems by hand. I also think a lot of the use cases involve very large fact tables with a relatively small number of dimensions, so in terms of the complexity of the queries, the difficulty is just less than what you would get in a typical Oracle application.
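The two introspection tools mentioned in that answer look roughly like this (system.query_log and system.columns are real ClickHouse tables; the filter values are illustrative):

```sql
-- Recent finished queries with basic statistics.
SELECT event_time, query_duration_ms, read_rows, memory_usage, query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10;

-- Per-column compression for one table.
SELECT name,
       data_compressed_bytes,
       data_uncompressed_bytes,
       round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = 'default' AND table = 'events';
```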
I can see that, but still, my gut feeling is that what you're building is a tool for application-specific development, where basically your database designer is your database administrator, who has to set everything up for your end users to become effective. So I was wondering, from your experience, what is the average time to develop such an application?

That's an interesting question. What I see happening in the field is something like what happened with MySQL in the early 2000s: a bunch of devs turned into database developers and wrote a bunch of crappy applications, which other people then in some cases spent years deconstructing. But I think ClickHouse is simple enough, because it doesn't have a query optimizer and doesn't do a lot of magic, that it's relatively easy and straightforward to set up and run, and it allows people who aren't necessarily deep database specialists to get something working. And the fact that you can set up a separate cluster for every application means the thing isn't a monolith. Our vision for how you should use this is that you set up a cluster for each specific service that needs analytics, and all you have to do is optimize for that particular application and no others. I think that simplifies the management problem to a sufficient degree that it makes it accessible to a lot more developers. At the same time, I think you're right that without some level of decent query optimization there's going to be a limit to the number of people we can reach, and one of our challenges is to develop that optimization in the future so we can reach that larger audience.

Thank you. I have
some more questions, but I'll let other people jump in, and then we can come back.

I've looked at ClickHouse in the past, and we've been exploring other options; we use some commercial OLAP systems, and there's a lot of talk about Presto, and certainly in the network environment about Spark SQL with their work on Delta Lake, where the emphasis is: since Spark is such a good in-memory processing system, store all the traffic logs and let Spark crank on them. How does ClickHouse compare to those types of systems in terms of approach and implementation?

Well, Spark is just a completely different architecture. I think one of the big distinctions is between systems that work off data lakes and systems that force you to organize the data along the lines of the product; we're in the latter case. What that means is, if you compare us against anything that does federated query, anything reading out of S3, we're not necessarily as efficient as they are, or not able to operate easily at the scale they sometimes need to operate at, but the trade-off is that we get much faster performance. One of the things we are looking at is the ability to read formats like Parquet, which would give us a consistent data format across multiple products; that's definitely something we're very interested in. But I think we are fundamentally different in the sense that we do require you to put the data into our system, and even as we move toward using S3, I don't see that changing, because so much of our performance depends on the organization of the data; the specific way we do compression,
for example, allows us to minimize IO.

Thank you. All right, Panos, you want one more question?

Yes, one comment and one question, actually. Following your discussion reminded me of my graduate student years, when we were trained to do physical design: to think about how you partition your data, which indexes and which materialized views you need to build in order to make the system run efficiently. In some respects that was the challenging part of databases, not how to construct nested queries but how to come up with the physical design that delivers the speed; we were competing between our teams on who could do the best design.

I think in the end that's always something you struggle with, and if I could put it in terms of comparing this with other databases: one of the interesting things about ClickHouse is that it actually has relatively few knobs you need to adjust; you get very good performance out of the box. Other systems, for example ones with cost-based optimizers, tend to have a lot of settings that must be set correctly for the optimizer to work. I've seen this on Oracle, where there's a great deal of detailed configuration that has to be done right or the optimizer won't give you good results. So there's a balance here. I like query optimizers, and I like having the work done for me, but there are always cases where I feel like, hey, if I just understood what I was doing I could solve the problem faster. It's an interesting balance, and as we go forward, what we're hearing from our users is not so much "we need a query optimizer" or "the physical implementation should be easier"; it's "just give us window functions, please," because it makes this
easier; it makes it easier to write queries in a way that people understand and are comfortable with from other products.

Interesting. So my other question, which is again a clarification: earlier you were asked whether you have a catalog, and you said no; you don't have a scheduler either, but you still try to do load balancing when you partition and replicate. I'm missing the connection: how can you schedule things without running into bottlenecks or skewed distributions?

Great question. Most of the scheduling in that sense is just done through heuristics. For example, when we're planning a query, to the extent we plan it, for a distributed table we have choices: if we can use the local replica, should we prefer it over a replica of the same data on another host? What load-balancing algorithm do we use? Typically just round robin. These are all just settings; they're heuristics, not decisions based on a catalog with any sort of analysis of the underlying data. One other thing that's important to say about ClickHouse, and in large clusters this is probably the single biggest issue people run into: we do not rebalance data automatically. If you use Cassandra and you pop a host into the rack, Cassandra, because it uses consistent hashing, is going to spread the data out. In ClickHouse, not the case. If you're inserting time series, load balancing will gradually fill the new host up, but you actually have to do work to make sure the data gets propagated, and if you take a shard away, for example because you want to reduce the overall number of hosts and VMs you're running, you actually have to move the data manually off that shard.
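The heuristics in that answer are plain session settings rather than planner decisions; a sketch (these setting names exist in ClickHouse, though the values here are just examples):

```sql
-- Pick replicas round-robin instead of the default random choice.
SET load_balancing = 'round_robin';

-- Prefer the replica on the local host when one exists.
SET prefer_localhost_replica = 1;

-- Skip replicas lagging more than 300 seconds behind.
SET max_replica_delay_for_distributed_queries = 300;
```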
That manual rebalancing, I think, is actually one of the big challenges we need to address, and one of the ways we can solve it is to stop using attached storage and instead use something like object storage, which isn't directly attached to a single host.

Questions? You're muted; we can't hear you.

Yeah, I'm an idiot, sorry. One quick question and then we'll go: can you roughly say how many of the operators are actually vectorized, or is everything vectorized?

No, it's pretty hit or miss. I was actually going through the code to see what's vectorized and what's not, and I could not say what the pattern is. A lot of the vectorization is actually done in libraries that we use; we have vectorized mathematical operations, things like that. I think this really gets to the point you've made in a lot of your lectures, that vectorization is something that has to be done carefully. A lot of the pattern is the result of Milovidov and other people in the community trying operations and finding out: hey, this could be vectorized; it's an array, and we can also minimize branch mispredictions, so we're going to go ahead and vectorize it. But I don't think there's a systematic pattern you could really point to; it's just wherever it seemed to make things faster.

Okay, cool, awesome. So again, we're very happy to have Robert come talk about ClickHouse. As I said, it's a system we wanted to learn more about, and this has been super insightful.