Hi everybody, this is Robert Hodges, and let's get this party started. I'll be talking about breaking out of the proprietary cage: real-time data warehouses come to open source. Before I go into the slides I'd like to give a big shout-out to the people at the Open Source Summit who made this happen. I've done a lot of conferences, and of course you've all attended many of them; it's just a huge effort to get something like this to work in the circumstances we're all facing right now. So thank you so much. It's a real pleasure to be here and I appreciate everything you've done. With that, let's dive in.

Here's my title slide, same thing. I'm going to be talking about data warehouses and how a new kind of data warehouse, which I call a real-time data warehouse, is now emerging in open source. Let me give you a little background, first of all about me. This is a picture of what I look like when things are going well with my SQL examples. I'm CEO of Altinity. We are a software and services provider for ClickHouse, which is the main data warehouse I'll be talking about in this talk. As a company we're major committers to the project, and we're also big community sponsors, in the US and Western Europe in particular. As for my own background, I've been working on databases since 1983, with a few breaks; I worked at VMware for a while on virtualization and security. When I go back and count, I think ClickHouse is database number 20, but after a while you tend to lose count. Databases have been the main focus of my career, it's a really great subject, and I hope that in this talk I can transmit some of the enthusiasm I feel about working with data, and particularly with data warehouses.

By the way, I see questions popping up, and I want to give you a heads up that I'm going to make sure there's time to answer them at the end of the talk. Because of the way this platform works it's difficult for me to see them coming in, and I don't want to derail the talk by trying to juggle them, so there will definitely be time. Please queue them up; I'll be delighted to answer as many as I can during the talk slot and pick up anything else afterward in Slack.
All right, what I'd like to do is dive in and frame what makes analytic applications special. In this track we've talked about different kinds of databases; if you saw Amanda's talk a few minutes ago, she scanned the whole panoply of database types that exist. Today we're focusing on databases specifically designed to do analytics. When they first emerged, these were answering rather general questions about business problems. I'm giving a simple example here of sales data. Imagine it's organized in a table, because we're dealing with relational databases and they like to have things in tables. You can ask general questions about sales data that help you drive company strategy and, in some cases, make real-time decisions about how to react to things going on in the market.

Let's take one of these questions: which kinds of companies are most likely to buy SKU 556? That's some product number, and what we're really asking is: why does somebody buy this, when do they buy it, can we understand their buying patterns? An accurate answer lets us, for example, pick companies to market to, give them special offers, or hold inventory in particular places.

When you go back to the table that contains the data, it quickly becomes apparent that answering this kind of question is very different from just querying a table in a database like MySQL. I'm going to give you three differences that make this problem qualitatively different and therefore require a different technical approach. The first is that for an individual sale there's an enormous amount of related data you might want to know. We have the part number, but what's the name of the product that corresponds to 556? We have the date, but we might want to think in terms of weeks or months or years, so we want different levels of time. We have the city, but what about geographic regions? We have the customer, but what about the industry they belong to, or the country where their headquarters is located? There's an enormous amount of data you need at your fingertips to adorn this simple sale with context about what was going on, where, and when. The next thing, looking down the table, is that the number of data values can be enormous. In sales data it tends to be fairly small because it's generated by humans, but in a lot of analytic applications the data is things like people's locations from their cell phones, or where they clicked on a web page; we're talking trillions or even quadrillions of rows. And the final thing is that to answer these questions we need to take this very long list of records, each with a very wide range of associated data, and combine it in any pattern imaginable. We call this slicing and dicing: we can't make any assumption about the access pattern, because the person asking these questions could look at the data in any way that pleases them.
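Just to give a flavor of the slicing and dicing, here is roughly what one of those questions looks like in SQL. This is not from the slides; the sales fact table, the customers dimension, and all of the column names are made up for illustration.

```sql
-- Hypothetical schema: sales(sku, sale_date, city, customer_id, units)
-- joined to a customers dimension that carries the industry attribute.
SELECT c.industry,
       count() AS orders,
       sum(s.units) AS units_sold
FROM sales AS s
JOIN customers AS c ON s.customer_id = c.customer_id
WHERE s.sku = '556'
GROUP BY c.industry
ORDER BY units_sold DESC;
```

Tomorrow the same analyst might group by region and month instead, which is exactly why you can't assume the access pattern in advance.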
These three things became apparent to people starting in the 1980s, and they led to some really interesting innovation to create what we now know of as data warehouses. Looking back at the history, there were basically three rounds of technical advances. The first was set up in the 1980s but really became apparent in the early 90s with a couple of products, Sybase IQ and Teradata. The first thing people did was say: hey, we've got a lot of data, and we'd like to process it on multiple hosts. That's what Teradata did: spread the data out and issue a query that attacks the data on each machine, taking advantage of more computing resources. That's MPP enablement. Along with that came things like column storage and bitmap indexes. I was actually working at Sybase when we acquired IQ, and I remember hearing a talk from a guy named Clark French about how this is a new kind of business problem and here's how we have to organize the data: load it in columns and use bitmap indexes, because it's a different kind of problem. We dimly realized at the time that this work really does require a different technology.

The next couple of advances came between 2000 and 2010, with products like Vectorwise and Vertica: vectorizing the execution using SIMD instructions, keeping different organizations of the same data (that was a Vertica innovation), and of course compression. The final set of technical innovations has come over the last decade with the advent of the cloud. To name just one, Amazon Redshift was a groundbreaking product. Data warehouses are complex software; you often had to wait months just to get the hardware, let alone get the software installed on it. The Redshift team turned that into a few clicks: you go to a screen, tell it what size of database you want, and within a few minutes you have a data warehouse spun up that you can begin playing around with.

These are all profound innovations, and each round demonstrated enormous creativity in trying to answer these questions more effectively. However, the solutions all had a single unifying characteristic, and if you're an open source person or a database person it may just jump out at you: these are all proprietary products. This innovation, extending even to today, has been driven in large part by proprietary products. In fact, when we look at open source competitors specifically for data warehouses (again, relational databases with a tabular model, designed to answer quite general, open-ended questions very quickly over large amounts of data), there's actually not that much out there that's a direct competitor. I'm going to give three examples that illustrate the kind of technology. There are others, of course, but I think these give you a sense of where people were going.
For example, there's Presto, which was originally developed at Facebook. It's designed to do query over data lakes: large collections of data living in object storage or on HDFS. That's one solution out there; not really a direct competitor to the data warehouses, but SQL-based and focused on the problem of large amounts of data. At the other end is Druid, a popular open source system designed to handle queries on large event streams, things running into trillions or quadrillions of records. It was particularly innovative because it was able to throw a lot of hardware at the problem and guarantee a certain level of latency. It's definitely an interesting technology, but it was not originally SQL. And in the middle, what I'm going to focus on for the rest of this talk, is a database called ClickHouse. This was a ground-up SQL implementation to get quick answers on structured data. It started out as a relational database; it was originally developed at Yandex, the first prototype was done in 2008, and it was open sourced under the Apache 2.0 license in 2016.

Now I'm going to shift away from history and look specifically at what makes these databases particularly powerful, starting with the key features of ClickHouse. The way I like to explain ClickHouse to people who aren't familiar with it is this: imagine that MySQL, a very popular open source database, and Vertica, one of the proprietary data warehouses I mentioned, got married and had a baby. That baby would be ClickHouse. From the MySQL side you get a simple operational model: ClickHouse is just a single C++ binary, and you can install it and get it running in about 60 seconds, roughly the same as bringing up MySQL. It has the SQL language, which we get from both parents, it is open source, and it's relatively simple to run. From the Vertica side of the house we get things like shared-nothing architecture: a bunch of computers, each with their own storage, no shared file systems, no complicated networking, a relatively simple and easy-to-understand design. We get column storage with high compression and codecs (we'll talk about that in more detail), vectorized query execution, and MPP enablement, meaning the ability to split the data up into shards and replicas. That comes conceptually from Vertica. And then the whole thing is really fast. With ClickHouse we're using column storage, we think of everything as either sequential read or sequential write, and there's a huge number of optimizations built into the product, along with product features, that allow us to get answers very quickly. That's what I'm going to jump into next.

So that's your basic overview, and now we can dig down into ClickHouse itself and understand how it works and what makes it fast. Before I do that, though, I want to talk a little bit about what it's not, because every time you solve a problem you're making choices about problems you're not going to solve. So what doesn't ClickHouse do well?
First, it's not an ACID-compliant database like MySQL or Postgres. ClickHouse has a transaction model, where the transactional unit, if you will, is a large chunk of data called a part, but it's not focused on things like isolation, although it does support that to a certain extent; so it's not completely ACID compliant. It also doesn't deal with updates particularly well, because it makes the basic assumption that data is immutable. What else is it not? It's not a distributed key-value store: if you have a large number of sensor values and you want to visit each one individually to see what it's doing, something like Cassandra is probably better. It's not a highly concurrent cache server like Redis, so if you're storing session data for users, ClickHouse is probably not the product. And finally, and this is really important when we look at direct competitors in the data warehouse space, complete SQL compliance is not the main design point of the system. The main design point is speed, followed by having enough SQL that you can get the job done and feel comfortable working with it. For example, window functions, which you may be familiar with if you work with analytic databases, don't exist yet in ClickHouse, although we're working on them.

What we do have, though, is speed. Let me talk about the code for a moment; this slide is a bit of an eye test, so I'll just focus on a couple of things. ClickHouse is some of the best C++ I've ever seen; it's actually readable. I'm not a C++ programmer, so don't anybody hammer me (I've mostly worked in Java, Go, and Python), but the code is really readable, very well written, and there's a huge emphasis on optimizations for speed. A simple example: for GROUP BY, the SQL construct for aggregation, ClickHouse has 14 algorithms that are specific to data types. We're always trying to choose the algorithm best suited to the type of data we're dealing with and its distribution, and you see that constantly throughout the ClickHouse code. There's no single way to do a hash table, for example; there are a bunch of different hash table implementations, and one gets picked according to the particular use case being solved. Another really important feature is vectorized query execution: we focus on breaking data up into pieces, farming the processing out to every core and every hyperthread on those cores as efficiently as possible, and applying SIMD (single instruction, multiple data) instructions so we can get multiple operations done in a smaller number of machine cycles. I'm not going to go deeper into this, but if you look at the code it's filled with interesting optimizations, and there are some great talks on how this works that you can check out.

Now, turning to what's visible to a user, we have a table type called MergeTree. One of the interesting things about ClickHouse, at least to me, because I worked with MySQL for a long time, is that ClickHouse uses table engines. If you've used MySQL you probably remember these: there was InnoDB, there was MyISAM, there was Falcon, a bunch of different table types.
ClickHouse followed that design pattern, partly because the folks who wrote it were very familiar with MySQL, but table engines are used much more broadly: there are about 40 different table engines, and they all do something useful, each tuned to particular use cases. The number one table engine is called MergeTree, and it's really a family of tables, of which this is an example. If you've used SQL before, this table definition looks fairly familiar. The data types are a little different from ANSI SQL, but then you have some extra clauses at the end of the table. First, the ENGINE clause, which says MergeTree. Then a way to partition the data: because this table type is designed for very large amounts of data, we want a way to break it up into parts, and what this SQL says is take the date (this is flight on-time data, as it turns out) and break it up by month. And finally, within those parts, how to order the data, and here we order by the carrier and the date of the actual flight.
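Here's roughly what a definition like that looks like. This is a sketch rather than the exact slide: the real demo uses the airline on-time dataset, but the column list here is trimmed and the names are illustrative.

```sql
-- Sketch of a MergeTree table for flight on-time data (column names illustrative).
CREATE TABLE ontime
(
    FlightDate Date,
    Carrier    LowCardinality(String),
    Origin     String,
    Dest       String,
    DepDelay   Int32,
    Cancelled  UInt8
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(FlightDate)   -- break the data into monthly parts
ORDER BY (Carrier, FlightDate);     -- sort order within each part
```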
That's what you see visibly; now let's look at what happens when the data is actually stored. If you go to the file system you can see parts, and each part consists of what we call a primary key index (a sparse index, and I'll show you its structure in a minute) plus all the data stored in columns. We store the data as highly compressed arrays, those arrays are sorted on the ORDER BY columns, and the index gives us the ability to find particular rows and group them together, so that if a query refers to, say, three columns, we have a way of finding where those rows are located in each of the individual column arrays and reading the data consistently. That's the basic high-level layout, and you can see it on the file system when you go to one of the directories that contains a part. One of the cool things about ClickHouse that I really like is that this is all visible. It's a bit like MySQL, not with the InnoDB table type of course, but with MyISAM you could just go and see all the data lying around on disk; ClickHouse very much follows that, and you can see all the structures. What you see when you go in is a primary.idx file, which as I said is a sparse index. It's not used to maintain data consistency the way a primary key would be in Postgres or MySQL; it's used to find things. It's sparse in the sense that we only keep one entry for every 8192 rows (you can change this if you want), which means we read data in chunks, and the lowest resolution we'll get in a query is about 8000 rows. That chunk is called a granule. Then, to find where the data for each column is located, we have .mrk files, which are arrays that match the primary.idx entries and contain offsets into the actual column data. Each of those offsets points to a chunk of compressed data, a compressed block that may also have some additional transformations applied. Those blocks are called marks; you can jump in, start reading at that point, bring the data into memory, and begin processing it.

This structure is really important. It's super efficient because it minimizes the amount of data we keep in storage, and I'll show you some examples of that. It also means that if we're only looking at two out of a hundred columns, we read only those columns, and only the parts of those columns we think are relevant for the query. Another important property is that data in ClickHouse is immediately queryable; you don't have to stick data in and wait a while for it to become available. When you insert data, you typically do it in blocks, commonly 10,000, 50,000, or 100,000 rows at a time, and ClickHouse creates a part, so by the time your insert returns, the part has been added to the table and is queryable. This is optimized to be quick; ingest rates are very high because all we're doing is creating the part. You can describe this as a fast but half-hearted organization, because the part might not be very big, and if we just left the data the way it was inserted, queries might have to read a lot of parts. ClickHouse takes care of that with what are called background merges, which is where the name of this table type comes from: over time, small parts are coalesced atomically into larger parts so that your queries run much faster. Aggregating those parts together can make an order-of-magnitude difference in performance, and it happens fairly quickly, so your initial data is not organized as optimally as it could be, but it's quickly merged into this more efficient structure. Once again, that's where MergeTree comes from, and it's fundamental to getting high performance.

And speaking of performance, the first thing you can do when you're trying to make things fast with ClickHouse is just add CPUs. It's very good about this: everything is parallel by default, and it will grab half the hardware threads it can see in /proc/cpuinfo. There's also a setting called max_threads. This is a simple example of a query on the flight data where I set max_threads (these are actually hardware threads) to two, four, six, and eight and then checked the query response. What you can see is that going from two to four threads pretty much cuts the response time in half. Adding more threads doesn't really help, because in this particular case you're seeing Amdahl's-law effects: we're not just scanning the data, we're also doing some aggregation at the end that has to happen in a single thread. But you can control performance very well this way, and it's the first thing I do when I have something I want to run faster: just add more CPUs, and ClickHouse will use them efficiently.
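The experiment looks roughly like this. The exact query from the slide isn't reproduced here, so treat this as an illustrative stand-in on the same kind of flight data.

```sql
-- Illustrative: run the same aggregation at different thread counts and compare timings.
SET max_threads = 2;
SELECT Carrier, count() AS flights, avg(DepDelay) AS avg_delay
FROM ontime
GROUP BY Carrier
ORDER BY flights DESC;

SET max_threads = 4;   -- in my test, going from 2 to 4 roughly halved the response time
-- ...rerun the same SELECT at 4, 6, and 8 threads and compare the elapsed times...
```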
The other thing you can do, and this is where you get the really big impact, is to minimize I/O. When you're trying to make a database fast, the less it has to go to storage, the faster it goes, and ClickHouse is quite efficient about this. For example, on the left is a query on flight data where I'm looking at cancelled and delayed flights; a little bit of the data is unfortunately missing from this picture. What the query responses show is that if I put no filter on the query and have to read everything, I end up reading all the parts and all the chunks of data (the marks, we call them) in the table, and I get a certain query response; that's the "no filter" part of the graph. If I restrict it to one year, I read less data, and if I restrict it to 40 days, I read even less. What the graph shows is that the query time and the number of marks ClickHouse thinks it has to read track pretty much linearly in this case. What's missing is a nice picture that showed ClickHouse reporting how many marks it was reading; that's actually how I collected this data. So when we optimize ClickHouse performance, we're always focused on reducing the amount of data we read.

How do we do that? A good way is to improve compression. Once again we're missing some data here, I apologize, so let me explain what you should have seen: we're adding codecs. ClickHouse has LZ4 and ZSTD compression, and you can choose between them. These compression algorithms generally have no knowledge of the data; LZ4 is the default, so if you don't tell ClickHouse anything, it will just try to compress whatever data it sees with LZ4. What you can do, though, is add codecs, which are type-specific transformations that change how the values are stored. For example, one thing I can do, shown in the column with the _LC suffix on the slide, is use dictionary encoding: if I have a bunch of strings like airport names, I can apply this transformation and instead of storing the airport name it just stores an integer, which vastly reduces the amount of disk used. Looking at the graph, this has a dramatic effect on storage size. The LowCardinality encoding I was describing reduces the data significantly, because it shrinks the data before we even try to compress it, and once it's fully compressed we end up with an 89% compression rate; that was using LZ4, and if you go with ZSTD you can get it down even lower, to something like 93%. Similarly, there are a bunch of numeric encodings you can apply, like delta encoding, where all we store is the difference between successive numbers (works great if they're increasing), or double delta, where we store the difference in the changes between numbers, which is optimal if they're slowly increasing. In terms of the actual storage you end up with, it makes a huge difference, so this is one of the key tools.
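In the table definition, these choices look something like the following. This is a sketch rather than the slide's exact schema; the column names and codec pairings are just illustrative.

```sql
-- Illustrative encodings: LowCardinality dictionary encoding for low-cardinality strings,
-- Delta/DoubleDelta codecs for numeric columns, and ZSTD instead of the LZ4 default.
CREATE TABLE ontime_encoded
(
    FlightDate Date                     CODEC(Delta, ZSTD),
    Carrier    LowCardinality(String),
    Origin     LowCardinality(String),  -- airport names stored as dictionary integers
    DepDelay   Int32                    CODEC(DoubleDelta, ZSTD),
    Cancelled  UInt8                    CODEC(ZSTD)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(FlightDate)
ORDER BY (Carrier, FlightDate);
```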
Another tool we use is materialized views. Once again, apologies, some content is missing from this slide; the published slides will show it. The basic idea with a materialized view is that we reduce the amount of data we read by applying a transformation on the source and putting the reduced data in another location. The example here is a materialized view where we're asking for the last value of CPU usage across a bunch of CPU measurements. Instead of scanning the whole source table to find the last value for each CPU, we collect that information in the materialized view, and we get an effective compression ratio that's enormous: we end up with far less than 1% of the data, and as a result the query is orders of magnitude faster. A very common pattern is to use this to aggregate data. For example, you have a website and you're doing web analytics: you want to track hourly unique visits and hourly sessions, and you keep those long-term. There's also a feature in the table definition where you can add a TTL (ClickHouse supports this; it's unfortunately not showing on the slide) that deletes the data after, in this example, seven days. So you keep the source data, your detail, for seven days, but you keep the aggregates forever. It's a really efficient way of using storage.

What we can also do is use a ClickHouse feature called tiered storage, which lets the source table span different types of storage. For example, we can use NVMe SSD for data that has just arrived, and put a TTL on the table that says: after a certain period of time, like two days, move it off my hot storage (NVMe is really fast and suitable for the small percentage of data that's queried most commonly) and onto hard disk. You can also group those disks; you can RAID them yourself, and ClickHouse also understands RAID-like patterns and can do it itself. This is an increasingly common pattern, and actually something we've worked on quite a bit over the last year or so to implement.
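Sketched in SQL, the pattern looks something like this. None of these names come from the slides: the events table, the hourly rollup, and the seven-day TTL are stand-ins for the web-analytics example, and the tiered-storage variant assumes you've already configured a storage policy with a 'cold' disk.

```sql
-- Raw detail: keep only 7 days of source data.
CREATE TABLE page_views
(
    EventTime DateTime,
    UserID    UInt64,
    URL       String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventTime)
ORDER BY (URL, EventTime)
TTL EventTime + INTERVAL 7 DAY;
-- Tiered-storage variant (requires a storage policy with a 'cold' disk):
--   TTL EventTime + INTERVAL 2 DAY TO DISK 'cold'

-- Hourly aggregates: kept forever, and tiny compared to the source.
CREATE MATERIALIZED VIEW page_views_hourly
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(Hour)
ORDER BY (URL, Hour)
AS SELECT
    toStartOfHour(EventTime) AS Hour,
    URL,
    count() AS views
FROM page_views
GROUP BY Hour, URL;
-- Unique visitors would use uniqState()/uniqMerge() with an AggregatingMergeTree.
```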
Beyond single ClickHouse servers, we can of course cluster them, and this gives you horizontal scaling. Sharding is built in; a shard is a portion of the data, so you can think of the data being divided into disjoint sets, like groups of tenants, for example. You can also replicate it, and that's built in too, with multi-master replication, which gives you more concurrency because there are more copies of the data you can use. How is this actually implemented? ClickHouse uses more table engines. For example, there's a Distributed table engine that understands how to take a bunch of replicas divided up into shards, and knows that when it receives a query it should pick one replica from each shard and then bring the data back together. When we set these clusters up, one thing we have to add to the architecture is ZooKeeper; that's something we're looking at how to ease, but it's an extra piece added to the system because you need consensus about which parts exist on which replicas and how they move back and forth. The common pattern is to have the distributed table on every node: an application can connect to any node and run the query against that table, and the distributed table automatically distributes the queries down to the local copies. There are various options you can use to make this more efficient, but the basic idea is a pushdown: you run as much of the query as possible against the local data, compute aggregates locally, and then the results come back to the initiator, which merges them together and hands them back to the application. This is a really powerful feature, and for well-behaved queries, when you set things up properly, you can actually get linear improvement by adding nodes. This graph shows a couple of runs on the airline data with a pretty expensive query; the cold data and the hot data are blue and red respectively, that is, without and with caching enabled, but in both cases you get essentially linear performance improvement by adding nodes. What this means is that for extremely large datasets you can, for example, split them over 50 nodes and get vastly better performance than you would on a single node.
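For concreteness, a cluster setup sketched in SQL looks something like this. It assumes a cluster named 'my_cluster' and the standard {shard}/{replica} macros are defined in the server configuration; all of the names here are illustrative, not from the slides.

```sql
-- Local, replicated shard tables on every node of the cluster.
CREATE TABLE ontime_local ON CLUSTER my_cluster
(
    FlightDate Date,
    Carrier    LowCardinality(String),
    DepDelay   Int32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/ontime_local', '{replica}')
PARTITION BY toYYYYMM(FlightDate)
ORDER BY (Carrier, FlightDate);

-- The Distributed table also lives on every node; a query against it fans out to one
-- replica per shard, aggregates are pushed down, and the initiator merges the results.
CREATE TABLE ontime_all ON CLUSTER my_cluster AS ontime_local
ENGINE = Distributed(my_cluster, default, ontime_local, rand());
```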
So that's ClickHouse internals. I'm going to talk about a couple of patterns of use to close things off, and then we can take some questions. A really common way ClickHouse is used is together with Kafka. ClickHouse can ingest data very fast: if you look at published articles, Cloudflare, for example, wrote a really widely read blog post about using ClickHouse, and they talked about how they get what is now about 10 million events per second coming into their cluster for all their web analytics, DNS, and service logs. They use Kafka to drive this, because Kafka, like ClickHouse, can scale horizontally; it lets you buffer enormous amounts of data, collect it, and ingest it quickly. A pretty common pattern is to just write your own consumer: some little Go programs that read off the topics, turn around, and hand the data to ClickHouse. But ClickHouse also has a cool feature: a table engine that makes Kafka topics look like a table, and this is a nice example of the creativity you get by combining table engines and materialized views. The Kafka table engine encapsulates the topic so that it looks like a table you can select from, and selecting from it reads from the topic. You don't want to do that manually, so what you do is create a materialized view on top of that table; it automatically selects and puts the values in a different location, in this example into a MergeTree table. Basically that gives you an automated transfer out of the topic and into the MergeTree table. That's another way of integrating with Kafka, and I would say overall probably 50% of all people who use ClickHouse are also using Kafka as a way of ingesting data.
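The Kafka-engine pattern looks roughly like this; the broker address, topic name, and event schema are all assumptions, since the real layout depends on your data.

```sql
-- A table engine that exposes a Kafka topic as a readable table.
CREATE TABLE events_queue
(
    EventTime DateTime,
    UserID    UInt64,
    Action    String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse-consumer',
         kafka_format      = 'JSONEachRow';

-- Durable MergeTree table that will hold the events.
CREATE TABLE events
(
    EventTime DateTime,
    UserID    UInt64,
    Action    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventTime)
ORDER BY (Action, EventTime);

-- The materialized view continuously reads from the topic and writes into MergeTree.
CREATE MATERIALIZED VIEW events_consumer TO events
AS SELECT EventTime, UserID, Action FROM events_queue;
```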
Another pattern is visualization using Grafana. The ClickHouse integration with Grafana is quite good: there's a Grafana plugin that we maintain and that I play around with constantly. What we're looking at here is actually some of our own data from our Amazon cost billing, which we of course stick into ClickHouse but serve up in Grafana. Grafana is written in TypeScript, and the ClickHouse plugin uses one of the two major interfaces: ClickHouse has an HTTP interface, so you can just do GETs or POSTs to run queries. There's also a wide range of other client types: Python, Golang, C++, and of course JavaScript, Java, curl, things like that. And I should mention there's a really great command-line client called clickhouse-client. It's super good, it's like mysql or psql if you're familiar with MySQL or Postgres, really handy to use and easy to load data with.

There's one final use pattern I think is interesting to talk about, and that's running on Kubernetes. ClickHouse is not exactly cloud native, but it's fair to call it cloud friendly. The reason it's cloud friendly is that it's a single process, so it runs really well inside Docker or whatever container technology you're using, and it has a relatively simple relationship with storage. One of the things we've worked on for about a year and a half now is building an operator for Kubernetes that allows you to set up clusters there. This is a very common pattern: just about every other major database, particularly in the open source world, has a Kubernetes operator. The ClickHouse operator lets you define your cluster in a single file, a custom resource definition; you feed it to the operator, and it looks at it and says, you need a cluster that contains, in this case, one shard and three replicas. It goes and allocates the storage, starts containers to access it, and puts a nice load-balancing service in front. This is an increasingly common pattern; we have a number of customers, and there are many people outside our customer base, using this. The cool thing about it is that it basically allows you to have a lot of data warehouses, because you can spin them up and blow them away pretty quickly on Kubernetes. It means each service can now have its own data warehouse, which is a really exciting development and something very different from what we see across the proprietary databases.

Just as a wrap-up: ClickHouse is really the first open source SQL data warehouse that meets the proprietary offerings in head-to-head comparisons, and the place where it shines is speed and cost efficiency. As I said, if you're looking for a complete SQL implementation it's probably not your first stop; it's better to go pay Vertica the money. But if you're looking for speed and cost efficiency it definitely fits the bill, and most people are using it for one of those two reasons. Second, the feature coverage is expanding rapidly. We're working on the SQL features; a big thing going in is role-based access control, a complete implementation that matches what MySQL does, plus nice things like object storage support. And the final thing is that because of the scaling, ClickHouse has this interesting property that it can deliver reliable real-time performance. There are a couple of metrics we commonly see: consistent one-second ad hoc query performance for those analyst queries, and, for things where you're driving online applications like martech applications, 10-millisecond predefined query response. There are a bunch of people using ClickHouse to get these response levels, and it does it very effectively.

Here are some resources; I'm going to pass over them: great documentation, a lot of good talks, and we write a blog that covers some of these general issues. I just want to say thank you, and once again a big shout-out to the Linux Foundation; you all have done really wonderful work and I'm really happy to be able to present. It wouldn't be a real startup talk if I didn't say we're hiring, and here are some ways you can contact us; in the middle is ClickHouse. If you haven't used it, don't believe anything in this talk: just go get it, try it out, and see if it works for you. This is open source, and if you see something you don't like, give us a PR or submit an issue. The community is already enormous and international, and we love to have new people. With that, I'm going to go ahead and look at some of the questions.
Okay, I had a question here, and I'm just going to take them in order and run through them until I run out of time.

Are those data warehouses I showed in the initial picture similar to SAP HANA? Yes and no. One of the distinctive things about HANA is that it actually stores data as rows first and then goes to columns later on, and it makes heavy use of in-memory processing. It is a proper data warehouse and many people use it as one. I'm not that familiar with the architecture, but the innovations I described at the beginning of the talk are well known and in many cases decades old, so you can be sure that most of them, like compression or encodings, are being used in HANA at some level.

What are my thoughts on Delta Lake, which sits on top of Spark? "It's complicated" is my thought. One of the big reasons people use data warehouses is that they're simpler than building things through pipelines. Spark is really powerful and has excellent ML integration, but the thing that's nice about ClickHouse is that you can connect it with Kafka, ingest your data, and just instantly answer questions about it, along with all the previous history you had. When you're trying to get quick answers to big problems on top of structured data, data warehouses are definitely the way to go, and in fact if you want guaranteed latency I don't think there's any substitute for having your own storage format that organizes the data in a consistent way. These are alternative technologies: if you have enormous amounts of data and you don't care how fast you get the answers, something like Spark works; if you have smaller amounts of data, say 20 petabytes as opposed to exabytes, then ClickHouse and data warehouses in general are a much better approach.

Next question: given that queries immediately reflect data you just inserted, is ClickHouse a good fit for the query portion of the CQRS pattern, where for interactive user experiences you read from ClickHouse and transact with the system of record? I don't know that pattern well, but actually I wouldn't recommend doing that. I think what you want to do is think in terms of databases that serve a single purpose. If you're doing transaction processing in MySQL, like processing sales, just have that live there; you wouldn't typically want an application to do something in MySQL and then read it back from ClickHouse, because there are too many ways that can get screwed up. Where ClickHouse is really going to help you is when you have a data source like behavioral data from marketing, what people have done on different websites: you stream that stuff straight into ClickHouse and it bypasses MySQL completely.

Which enterprises use ClickHouse? Cloudflare, Cisco, and, just trying to think of the people I can name, Comcast, Yandex of course, who invented it, MessageBird (I don't know if they're still using it, but they were early adopters), Spotify. A bunch of people, and I'm just naming the ones that have talked publicly about it; there are many, many more, including a lot of people in financial services.

Can the Kafka table engine enable us to serve real-time online use cases, as opposed to just analytics use cases? Yes. Typically you can achieve an ingest-to-query time of about 500 milliseconds; it really depends, and you have to make sure Kafka is well tuned, but that's one of the big reasons we like Kafka. It's not just the scalability, which of course is wonderful, it's also the fact that you can get data through really fast, so that is in fact an important use case. If I can contrast this with, for example, what Confluent is doing: if you talk to Confluent, they would say you should run KSQL on the stream as it comes through. You can do that, but in a way it's just easier to dump the data as fast as possible into the data warehouse and query the whole thing there, because you're not just looking at the stream, you have all your data, and moreover you can partition it and store it in ways that mean the most recent stuff is hot and you can get at it really fast.

How does ClickHouse speed compare to HANA? I don't have numbers on that. It's definitely much faster than Redshift for the use cases it's well suited to, but I don't have numbers on HANA.

When would it be a good idea to choose ClickHouse over databases like Postgres? Great question: when you have a lot of data. What often happens is what happened at Mux.com, an example of a company that started with Postgres. They started with one instance, then they made it bigger, then they brought in Citus and ended up with a big Citus distribution, and finally they gave up and just put it all inside ClickHouse. The reason is that with ClickHouse they could just throw the data in and run their queries without a lot of complex pipelines, aggregations, things like that.

Okay, I'm just looking back to make sure I got all the questions... I think I did. If there are no further questions, we'll go ahead and close this up. I am available on Slack if you want to post additional questions, and if you're interested in this I'd love to answer them.
I do hope you'll try out ClickHouse; it's open source. Just one last pitch: I think that, other than Sybase, which was probably my favorite relational database, ClickHouse is the most interesting database I've ever worked with, so definitely try it out. It's really accessible and has a very welcoming community. Once again, thanks for the third time to the Linux Foundation for making this talk possible. Hope to see you all in person soon. Thank you very much.