Hi everyone, thanks for coming to Open Source Summit, and specifically for coming to this talk: analytics at 1,000 QPS and beyond. I'm Gian Merlino, and I'll be talking about how we do this with Apache Druid. So, who am I? I'm Gian Merlino. I'm a committer and the PMC chair at Apache Druid, and I'm also a co-founder at Imply, a company that is all built around Apache Druid. What we'll talk about today: first, a brief intro to Apache Druid. Then I'll talk about the three levels of high-load analytics — we'll go through levels zero, one, and two — and then we'll talk about how to try this at home. So first off, a brief intro to Apache Druid. What is it all about? We like to think of Druid as an engine, a rocket ship. It's a database; it goes really fast. But what Druid really is, is a big calculator. It's a big, fancy calculator that scales to thousands of servers and does analytical queries really, really quickly. Druid's purpose, what it was originally created for, was to power analytical applications. I've got a screenshot here of an example. This is Imply Pivot, an analytical application that we build at Imply. And the thing that makes analytical applications what they are is that queries must respond very quickly.
There's going to be interactivity between the user and their data: lots of pointing and clicking, lots of very rapid-fire interactions. Think applications like Google Analytics. Some people use Druid with Imply Pivot, the app that we make; many people make their own apps. Apache Druid speaks SQL, and you can build your own app on top of that. Druid's used at a lot of different places. Here's a bunch of articles written in the last few years about Druid use: Druid enables analytics at Airbnb; Netflix uses Druid for real-time insights, ensuring a high-quality experience; Salesforce uses Druid for insights as well; and MoPub, a Twitter company, uses Druid to query terabytes of data in seconds. Some stats here in the top right of the slide about what's happening in real Druid clusters: real Druid clusters are doing a hundred-plus billion rows a day of ingestion, retaining over a trillion rows over a year of data, with hundreds of servers in a cluster (the largest cluster is actually over a thousand servers). And all this with sub-second to a few seconds of query latency, and a mix of streaming and batch ingestion. So Druid is a database — a very scalable database, a very fast database. And for these kinds of applications, it is super, super important to be able to do large numbers of queries and, at the same time, maintain high query speed and consistent query performance. So I want to talk now a bit about how we do that: what techniques we take in the software itself, in Druid itself, and also what techniques people that deploy Druid use on the deployment and configuration side in order to achieve what we're trying to achieve here. For this talk, I want to talk about three different levels of how this can be done, and each of these levels kind of goes out from the hardware into configuration. The first, level zero, is about CPU and IO, really focusing on the software itself and the hardware. Level one is about scaling.
So that's the next level up: how you take the software and scale it out to a cluster in a reasonable way. And then level two, handling heterogeneous workloads: once things are scaled out, you may have many different kinds of workloads running on the same data, and how do you handle that? We'll talk about each of these levels in turn; each one has some techniques I want to highlight.

The first one, level zero, the innermost level: CPU and IO. The goal here is to minimize the CPU and IO required to execute a query. This is all about local efficiency. These techniques apply equally to single servers or large clusters. The idea is: if you have one server, you want to do as much as you can on that one server and be as efficient as possible. And if you have a thousand servers, the way to get good performance at that thousand-server scale is to make sure each individual server is running as efficiently as possible. There are many techniques to achieve this, and I'll discuss some of them. But first off, why does this matter? It matters because efficiency at this level — at the level of CPU and IO, at the hardware level — maximizes the number of queries that can run per unit of time for a given CPU and IO budget. The idea is that less resource use per query means we can do more queries. A good way to think about this is sort of like a bin-packing problem. Imagine you have a hundred CPUs and each query takes ten seconds of CPU time. If you make the most efficient use of those CPUs possible, you can run up to ten queries per second, because ten times ten is a hundred. That means if you want to scale beyond ten queries per second, you either need to add more CPUs, or you need to reduce the amount of CPU time each query takes. That's what this efficiency is about. So I want to talk a bit about how we do that.

The first thing I want to talk about is the software. The most fundamental aspect of Druid is that it's a column-oriented and segmented database. What I mean by that is every table is split up into segments. So if you have a table of a billion rows, that might be split up into 200 segments, for example. Each segment tends to have a few million rows in it, and within each of those segments we store data in a columnar fashion, with indexes. What you're seeing here in front of you is a schematic representation of one segment of data. This segment has eight rows in it. You can tell because the data section of each column here — time, artist, city, price, and count — has eight elements in it. A real segment would have millions, but we can't show them on one slide. The first thing to see here is that each of the columns is stored separately. This is important because real queries tend not to use every column in a data set. They tend to use maybe two, three, four, or five columns out of a hundred-column data set, for example. And that means that storing the data separately, column by column, is very helpful, because it minimizes the amount of IO we have to do. It means we don't need to read any of the unused columns off disk, or even from memory to the CPU, so we're saving on IO and saving on RAM-to-CPU bandwidth. Let's walk through an example of how we would do a query using this columnar data. Here in the top left of the slide, we have an example SQL query we might run on this data set. This is a ticket sales data set. Let's say we want to see the total price of tickets sold, grouped by city, where the artist equals a particular artist — Justin, in this example. That's SELECT city, SUM(price) FROM sales WHERE artist = 'Justin' GROUP BY city. And I'm going to just focus on what happens on this individual segment.
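To make the layout concrete, here's a toy Python sketch of a dictionary-encoded, column-oriented segment and of the query we're about to walk through. The values, dictionary codes, and data structures here are purely illustrative — this is not Druid's actual on-disk format.

```python
# Toy sketch of a dictionary-encoded, column-oriented segment.
# All names, values, and structures are illustrative, not Druid's real format.

def encode_string_column(values):
    """Dictionary-encode a string column: a sorted dictionary, integer codes
    in the data section, and an inverted index mapping each code to the rows
    that contain it (a stand-in for Druid's bitmap indexes)."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    data = [code_of[v] for v in values]
    index = {code: [row for row, c in enumerate(data) if c == code]
             for code in range(len(dictionary))}
    return {"dictionary": dictionary, "data": data, "index": index}

# An 8-row example segment in the spirit of the slide (values made up).
artist = encode_string_column(
    ["Justin", "Justin", "Justin", "Kanye", "Kanye", "Lorde", "Lorde", "Lorde"])
city = encode_string_column(
    ["SF", "LA", "SF", "LA", "SF", "LA", "LA", "SF"])
price = [1800, 2912, 1953, 500, 700, 300, 400, 600]  # plain numeric column

def sum_price_by_city(artist_name):
    """SELECT city, SUM(price) FROM sales WHERE artist = ? GROUP BY city,
    resolved the way described in the talk: dictionary lookup, then index
    lookup, then a walk over only the matching rows of the needed columns."""
    code = artist["dictionary"].index(artist_name)  # dictionary: value -> code
    rows = artist["index"][code]                    # index: code -> matching rows
    totals = {}
    for row in rows:                                # touch only matching rows
        c = city["data"][row]
        totals[c] = totals.get(c, 0) + price[row]
    # Replace city codes with strings only at the end, via the dictionary.
    return {city["dictionary"][c]: t for c, t in totals.items()}

print(sum_price_by_city("Justin"))  # {'SF': 3753, 'LA': 2912}
```

Note how the artist column's data section is never touched, and the time and count columns are never read at all — only the dictionary, the index, and the matching rows of city and price.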
Of course, in a real query we'd be querying many segments and then merging the results of those different segments, but I'm just going to look at what happens at one segment for this example. The first thing we do is resolve the artist filter, artist = 'Justin', and we're actually not going to look at the data section of the artist column for this. We look only at the dictionary and the index. First, we look at the dictionary to see what the dictionary code — the ID — is for Justin. In Druid, we use a dictionary to assign an ID to every value of a string column like this. Justin has been assigned ID zero, which we find out from the dictionary. Then we go to the index, and we see that ID zero corresponds to rows zero, one, and two. The index has the same number of entries as the dictionary; it's one-to-one, so the first index entry is for the first dictionary entry, and so on. Once we've done this, we know that rows zero, one, and two are the ones that match this filter, and we found that out without ever reading the data section of the artist column, using these indexes. Next, we go to the city and price columns, and we walk their data sections together, looking only at those first three rows: first city 2, price 1800; then city 1, price 2912; then city 2, price 1953. We group those together, and finally use the dictionary of the city column to replace those dictionary codes with the names of the cities, LA and SF, giving the result at the bottom of the screen: LA 2912 and SF 3753.

You can see the advantage of storing things this way and doing the query in this fashion: we never had to read anything out of the columns that weren't being used. Count and time were not read at all — there was no need to copy their data from disk to memory or from memory to the CPU. We also did not need to use the data section of the artist column, which is very valuable. We did not even need to scan the artist column's index, because we were able to use the dictionary and then do a random access into the index to get the bitmap in question. Then, in city and price, we did not need to read any rows that didn't match the filter; we only read those first three rows. So this is being very economical about what data we access in order to resolve a query, and therefore minimizing the CPU and IO required to do it. There are many, many other techniques we use like this, but these are some of the most basic.

Next I want to talk about locality. This builds on the prior technique — the column orientation — to make things even better. Let's say we're doing a lot of filters on artist. This is a really common pattern in analytical applications: it's common to have a specific column that you are almost always filtering on. It doesn't always happen, but it's very common, and when it does happen we can take advantage of it. The way we take advantage of it is by sorting the data based on that column. So let's say we sort the data based on the artist column. What happens is that instead of all the artists being jumbled up inside each segment, all the zeros are together, all the ones are together, and all the twos are together, for the three artists. And every other column is sorted the same way as the artist column; we're actually re-sorting the entire data set based on the artist. When we do this, if we filter on artist — say, artist number one — with the unsorted data on the left, we actually have to read every block. The data set is split up into four blocks here for this column, and we have to read every block because there's a one in every block. With the sorted data on the right, we only have to read two blocks, because the first and the last blocks don't have any ones in them. This is really helpful: it further minimizes the amount of IO and CPU we need for queries that involve filtering on artist.

So here's an example using query profiling — a really powerful technique where we see exactly where CPU time is going — to show why locality is so powerful. I mentioned blocks on the previous slide, in that example with four blocks. In Druid, we store data in blocks within the segments. Each column is split up into blocks of a certain number of values; we read one block at a time, and each block is individually LZ4-compressed. What we're seeing here is a query profile where most of the time is being spent in one function, LZ4_decompress_safe. This is the LZ4 decompressor, and it means we're actually spending most of our time reading and decompressing blocks (this counts both the time to read the block and the time to decompress it). In this example, we're running a high-concurrency query workload, generally filtering on a specific column, on an IoT data set.
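As an aside, the block-skipping effect that sorting buys can be sketched in a few lines. The block size and values here are made up; real Druid blocks hold many values and are LZ4-compressed, so every skipped block is decompression work avoided.

```python
# Toy sketch of block-level skipping: why sorting on the filtered column helps.
# Block size and data are invented for illustration.

BLOCK_SIZE = 4

def blocks_touched(column, wanted):
    """Count how many fixed-size blocks must be read (i.e., decompressed)
    to find all occurrences of `wanted`, assuming each block carries a
    min/max summary that lets us skip blocks that can't contain it."""
    touched = 0
    for start in range(0, len(column), BLOCK_SIZE):
        block = column[start:start + BLOCK_SIZE]
        if min(block) <= wanted <= max(block):  # block might contain it
            touched += 1
    return touched

# Artist codes, jumbled: every block contains a 1, so none can be skipped.
unsorted = [1, 0, 2, 1, 0, 1, 2, 0, 2, 1, 0, 1, 2, 0, 1, 2]
print(blocks_touched(unsorted, 1))          # 4: must read all four blocks
print(blocks_touched(sorted(unsorted), 1))  # 2: the 1s are contiguous
```

With the data sorted, all the ones sit in two adjacent blocks, so half the read-and-decompress work simply disappears — the same effect the profile shows at much larger scale.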
So we're filtering on a particular time series, but we're not sorting by the time series. That means all the data for an individual time series is jumbled up, and so we're spending all this time reading blocks. On the next slide, we've rerun the test, and in this version we have sorted the data by the name of the time series. Now, instead of spending ninety percent of the profile on LZ4 decompression, you can see we're spending a much smaller percentage. You can't see it in the profile, but we measured this to be eight times lower CPU time. That's really powerful, and it shows why it's so important to ensure your data is sorted the right way to achieve the right locality: this 8x difference in CPU time — and 8x difference in IO as well — is an extremely powerful thing.

Another important technique in level zero is deferring computations. The key insight here is that data size generally decreases in the later stages of a query. Imagine you're doing a query that does a filter followed by a group by: the filter throws away some rows, and the group by generally reduces the size of the data set. The idea is that you want to move computations as late as possible. So here's an example. Let's say we're doing this query.
We're grouping by the lowercase of a user — take the user, convert it to lowercase — then counting the number of records for that user from a traffic table, and ordering by the count descending. So this is the top users by number of hits in a traffic table. In this case, we're grouping by LOWER(user) and selecting LOWER(user), so we're grouping on an expression, which requires computing the expression during the aggregation step. This example took six seconds to run. Now, one transformation you can do is, instead of grouping by the lowercase of the user, just group by the user. This defers the computation until after grouping. It's not done automatically, because the results are not the same if the function might map two different inputs to the same output: you could have two users that differ in case but have the same lowercase representation, and in that case the two queries return different results. But let's say you know that won't happen — you know your users are unique in a case-insensitive way, and you just want to lowercase them for display. Then you can group by the user column directly and select the lowercase version for display. The query now runs eight times faster, because that expression is deferred until later in the query, when far fewer rows remain. I don't have a query profile of this query, but if you were to look at one, you would see that the version that groups on LOWER(user) spends a lot of time computing that lowercase function. This is one reason why looking at query profiles is so powerful: it tells you exactly where CPU time is being spent and helps you reduce it.

So next, let's move on to level one, scaling: how we scale out from individual servers to a cluster, and the issues we run into along the way. With scaling we have two goals. The first is that we want scaling to be lightweight: we want the performance of distributed queries to be similar to local queries on a single server. The idea is that when we go from one server to two — say, splitting up a 64-core machine into two 32-core machines — we don't want performance to be very different. The second is that we want to scale linearly: we want 200 servers to handle twice as much load as 100 servers, if they're all the same kind of server. So what are the impediments to that? One of the main things that can go wrong with lightweight scaling — that can make scaling much heavier than it needs to be — is having a lot of data movement. The performance degradation from one to two servers can be high if data movement is high. There are many techniques to reduce data movement between servers during a query; I'll talk about a few. One is sketches: ways of doing approximation using much less data transfer than the exact solution. Examples are HyperLogLog and DataSketches. These sketches are useful for count-distinct, medians, quantiles, approximate joins, that kind of thing. Another kind of approximation (there's a bit of a theme here) is approximate top-N, which is useful for top-N-style queries, especially on columns with billions-plus unique values. Another technique, which is not an approximation, is to distribute joinable tables to all servers in advance of a join. This speeds up the join: one reason joins are famously resource-intensive in distributed systems is all the data movement they require, but if we distribute those joinable tables ahead of time, then there's actually no data movement required at query time, which makes the join just as inexpensive as a join on a single server. This makes the broadcast part of a broadcast join free, and it's a very powerful technique.

Next I want to talk about impediments to linear scaling, and the biggest one is the number of servers involved in each query. Why does this matter? It matters because of variance in performance between servers, which can come down to pure random chance. Imagine each query on each server has a 0.1 percent random chance of running slower than normal. With one server, that's no big deal: most queries are very fast, overall throughput is very good, and only 0.1 percent of queries run slower than normal. But with 50 servers, each individual query has about a five percent chance of hitting at least one slow server, because it has a 0.1 percent chance per server, and that adds up: the chance that no server is slow is not 99.9 percent, it's only about 95 percent. And with 500 servers it gets even worse: there's about a 40 percent chance that at least one server will be slow. Overall throughput is significantly reduced, because 40 percent of queries are getting hit by a slow server. So what can we do about this? Recall that in level zero, we took advantage of the fact that many use cases involve filtering on certain columns — things like time, user ID, and so on. In level zero, we used that to sort the data by the columns we filter on, improving locality. Here in level one, we're going to use it for partitioning.
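The segment-pruning effect of partitioning can be sketched like this. The layout here is invented for illustration — two-row segments, a "day" partition key — and is not Druid's actual segment format.

```python
# Toy sketch of partition pruning: when segments are partitioned on a column,
# a filter on that column only touches a few segments (and so, few servers).

from collections import defaultdict

def build_segments(rows, partition_key, rows_per_segment=2):
    """Group rows by `partition_key`, then cut each partition into segments."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    segments = []
    for key in sorted(partitions):
        part = partitions[key]
        for i in range(0, len(part), rows_per_segment):
            segments.append({"key": key, "rows": part[i:i + rows_per_segment]})
    return segments

def segments_for_query(segments, key):
    """Prune: only segments whose partition key matches can hold matching rows."""
    return [s for s in segments if s["key"] == key]

# Three days of data, four rows per day -> two segments per day, six total.
rows = [{"day": d, "value": v} for d in ("mon", "tue", "wed") for v in range(4)]
segments = build_segments(rows, "day")
print(len(segments))                             # 6 segments in the datasource
print(len(segments_for_query(segments, "tue")))  # 2: only Tuesday's segments
```

A query filtered on one day touches only that day's segments, so it lands on only the servers holding them — which is exactly how partitioning keeps the server count per query low.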
We're going to partition the data set on those columns. The idea is that when we partition on those columns — say we have a thousand segments in a data source — if we partition by time, then each day only appears in a relatively small number of segments. Each day of data will not appear in all thousand segments; it might appear in, say, five or six. Or if we partition by user ID, same thing for user ID. What happens is each query uses a subset of segments instead of all segments, which minimizes the number of servers that need to be involved, because the number of segments touched is low. Another approach here is speculative requests. The idea is that you send the same request to two servers, use whichever one returns first, and cancel the other. If you have one slow server and one fast server, and you don't know which is which, you send the request to both and take the fast one. This increases total load in the cluster, because you're essentially doubling the number of requests being made, but it does reduce variance. We actually have not implemented this technique in Druid, although it is used in other distributed systems.

Okay, now on to level two: heterogeneous workloads. In level zero, we talked about minimizing the CPU and IO required for an individual query, so we can pack more queries into the same unit of time. In level one, we talked about scaling from a single server to a whole cluster without introducing new bottlenecks. Now, in level two, we'll talk about what happens once you have a whole cluster: how do you run different workloads on it without them interfering with each other? The question here is: what do you do if your workload is not uniform?
Maybe you're mixing high-concurrency, app-driven work with low-concurrency but computationally expensive reports, exports, and ad hoc queries. The problem is how to avoid affecting the application-driven work, which in most cases is the work you really want to be high performance. There are a couple of techniques we can use; it's all about resource management. I like to think of resources in terms of releasable resources and non-releasable resources. A releasable resource is one that can be acquired and given up repeatedly while a query is running. An example is time on CPU: a query can take some CPU time, give up the CPU, grab it again, and give it up again, repeatedly. In Druid, any resources that are releasable are shared according to a query priority. Every query has a priority; it can be assigned by the user, or assigned automatically based on administrator policy — there are various ways of assigning it. Higher-priority queries get these shareable, releasable resources before lower-priority queries, and they get to preempt lower-priority queries.

With non-releasable resources, the situation is trickier. An analogy here is a car on a freeway. The space on a freeway is non-releasable: a car takes up space continuously until it leaves the freeway. It can't pop up into the air and then pop back down; it must take up that space once it enters, and it can only give it up once it leaves. The idea here is to reserve some resources for higher-priority work. Going back to the freeway analogy: some freeways have a carpool lane, a high-occupancy vehicle lane, reserved for vehicles carrying one or two passengers in addition to the driver. We want to encourage those kinds of vehicles — they're higher priority — so we reserve some space for them to ensure they get better service. We can apply the same idea to databases. In Druid, resources are reserved using query lanes, evoking that carpool-lane concept. The idea is that certain kinds of resources — certain buffers, certain space in thread pools, certain scratch space on disk — are not releasable: they must be reserved when the query starts and retained while it runs. For those kinds of resources, we keep some set aside, reserved for high-priority queries, so they cannot be starved by low-priority queries. In Druid, we can also use server tiers for the strongest isolation. With server tiers, you have a separate pool of servers dedicated to handling certain kinds of queries. Some people have a dedicated high-priority pool of servers that all their high-priority queries go to; others consider that their default set of servers and have a dedicated low-priority set of servers that the low-priority queries go to. These kinds of techniques guarantee resource availability for high-priority queries, but they can reduce overall efficiency. So whenever possible, it's better for more kinds of resources to be shareable, and we always endeavor to make as many kinds of resources as possible shareable in Druid. Okay, finally: you can try this at home.
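Before wrapping up — the query-lane idea above can be sketched with counting semaphores. This is a toy admission-control model in the spirit of the carpool lane, not Druid's implementation; the lane sizes are made up.

```python
# Toy sketch of query lanes: reserve part of a non-releasable resource
# (here, slots in a query pool) for high-priority work.

import threading

class Lanes:
    def __init__(self, total_slots, reserved_for_high):
        # Low-priority queries may only use the unreserved part of the pool;
        # high-priority queries may use any slot.
        self.low = threading.Semaphore(total_slots - reserved_for_high)
        self.any = threading.Semaphore(total_slots)

    def try_start(self, priority):
        """Try to admit a query without blocking; return True if admitted."""
        if priority == "low" and not self.low.acquire(blocking=False):
            return False
        if not self.any.acquire(blocking=False):
            if priority == "low":
                self.low.release()
            return False
        return True

    def finish(self, priority):
        """Release the slot(s) when the query completes."""
        self.any.release()
        if priority == "low":
            self.low.release()

lanes = Lanes(total_slots=4, reserved_for_high=1)
admitted_low = [lanes.try_start("low") for _ in range(4)]
print(admitted_low)             # [True, True, True, False]: one slot held back
print(lanes.try_start("high"))  # True: the reserved slot is still free
```

Even when low-priority queries flood in, the reserved slot stays open, so high-priority work can't be starved — the same guarantee lanes provide for Druid's non-releasable resources.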
Of course, Druid is Apache open source software. Stay in touch with us: you can check out the project website at druid.apache.org, you can follow Druid on Twitter (@druidio is our handle), and you can join the community — we're very active on Slack, GitHub, and the Apache mailing lists. Go to druid.apache.org/community. Thank you, and I hope to see you there. Bye-bye.
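If you do try it at home, here's a minimal sketch of talking to Druid over its SQL HTTP API, assuming the single-machine quickstart with a router listening on localhost:8888; the "wikipedia" table is the quickstart tutorial's example datasource.

```python
# Minimal sketch of querying Druid's SQL HTTP API (POST /druid/v2/sql/).
# Assumes a local quickstart cluster; nothing is sent until run_sql is called.

import json
import urllib.request

def sql_request(query, url="http://localhost:8888/druid/v2/sql/"):
    """Build the POST request for Druid's SQL endpoint: a JSON body
    of the form {"query": "..."} with a JSON content type."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})

def run_sql(query):
    """Send the query to a running Druid cluster and return the result rows."""
    with urllib.request.urlopen(sql_request(query)) as resp:
        return json.load(resp)

# With a quickstart cluster running and the wikipedia tutorial data loaded:
#   run_sql("SELECT channel, COUNT(*) AS edits FROM wikipedia "
#           "GROUP BY channel ORDER BY edits DESC LIMIT 5")
```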