and PMC member and I work for Hortonworks. Before joining Hortonworks, I worked as a back-end engineer at Metamarkets. Today I will be talking about scalable real-time analytics using Druid. First of all, what is scalable real-time analytics? I'll start with analytics. To me, analytics means exploring raw data, getting insights out of it, and possibly taking strategic decisions based on those insights. Now, what is real-time? Real-time has multiple meanings in different contexts. The first is fast response time: how long does a user need to wait before their query gets answered? Fast response time is critical for any interactive experience; if you want to provide an interactive experience in your application, you need to serve queries as fast as possible. The second aspect is data freshness: how long does it take for an event to become queryable after it occurs? Is it on the order of milliseconds, seconds, minutes, or does it take hours? For applications like analyzing firewall events or sensor data, recent data is much more valuable than old data, and the ability to query an event as soon as possible helps people build very good applications and automate their systems. The third thing is scalability. Here I am showing a small picture where the x-axis denotes the amount of hardware and the y-axis denotes the throughput you are getting. In the first system, throughput increases as you add more hardware, but only up to a certain point; after that, the gains drop off. That system is not scalable. The other system, where adding more nodes keeps giving you higher throughput, is a scalable system. I will be discussing how you can do scalable real-time analytics using Druid.

The agenda of the talk: I will first go through the history and motivation behind developing Druid — what the need was and why it was developed. Then I will show you a small demo. We will go through some details of the Druid architecture, what the different node types are and how they interact with each other. Then we will discuss storage internals and dive deep into what exactly makes Druid fast. Finally, we will look at some performance numbers from real-world clusters.

Starting with the history, Druid was initially developed at Metamarkets in early 2011. It was open sourced in 2012, initially under the GPL license, and the license was later switched to Apache V2 in 2015. The initial use case for which Druid was developed was the ad-tech analytics product at Metamarkets. What the team at Metamarkets really needed was a flexible query system with low-latency ingestion and low-latency queries. Some of the motivations behind developing Druid were to answer BI questions, to create interactive real-time visualizations on complex data streams, and to answer questions like how many unique visitors visited my website last month, or how many products were sold in the last quarter, broken down by some demographic or some other dimension value. The key thing about all these use cases was that we were not interested in dumping the entire raw data set; what we were really interested in was getting an aggregated view of the data set and visualizing it so that we could get some insights from it.
To give some sense of how big the initial use case was, here are some estimates of daily programmatic ad transactions: nearly 400 billion programmatic ad transactions occur in a day, which is almost 100 times the number of New York Stock Exchange trades. Not only is the volume of transactions higher, but a single programmatic ad event has nearly hundreds of fields, which is also quite high compared to stock exchange events. So the data we were dealing with was huge — lots of data that we needed to analyze and visualize in real-time.

The team first looked at the existing solutions we could use. The first was a traditional RDBMS; we experimented pretty heavily with Postgres. We created a star schema with aggregate tables and added a query caching layer in front to speed up the queries. The results: a scan rate of roughly 5.5 million rows per second per core, one day of summarized aggregates came to roughly 60 million rows, one query over one week of data on a 16-core machine took nearly 5 seconds, and one page load took almost 20 seconds, which was not an interactive experience for the user. So this was quite slow. Query caching helped a bit, but any query that was not in the cache was still slow, so the experience was still bad. On the scalability front, we didn't find it scaling well either; it wasn't very fast, and the real-time-ness was missing.

So we started evaluating another solution: NoSQL, which was pretty popular. We pre-aggregated all the dimensional combinations and stored them in a NoSQL store. Essentially, we pre-computed all the queries a user could make and stored the results, so whenever a query came in, it was just a simple lookup. The queries were blazingly fast, but there were still issues with this approach. Pre-computation was the biggest one: arbitrary queries were not possible, since we could only serve combinations we had pre-computed. If some combination was not pre-computed, it could not be served. On top of that, the pre-computation itself took a long time, so data freshness was missing. For example, with 500k records and 11 dimensions, it took almost 4.5 hours on a 15-node Hadoop cluster. So if an event happened now, it would be visible on the dashboard after 4.5 hours, which was not really real-time. Not only that, if we added three more dimensions, the pre-computation time simply doubled even after adding 10 more nodes to the cluster. So this approach was not scaling up; pre-computation was really the bottleneck, and data freshness was missing. But we did achieve fast queries.
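As an aside (not from the talk), one way to see why this pre-computation approach blows up: if "all the dimensional combinations" means one pre-aggregated table per subset of dimensions, the number of subsets doubles with every dimension added. A minimal back-of-the-envelope sketch, under that assumption:

```python
# Rough illustration (my numbers, not the speaker's): pre-computing "all the
# dimensional combinations" means one aggregate table per subset of dimensions,
# i.e. 2^d group-by sets for d dimensions -- the work doubles per added dimension.
for d in (11, 12, 13, 14):
    print(f"{d} dimensions -> {2 ** d} dimensional combinations")
# 11 dimensions -> 2048 dimensional combinations
# 14 dimensions -> 16384 dimensional combinations
```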
So next, we started developing Druid, an in-house solution to tackle exactly this problem. What is Druid? Druid is a column-oriented, distributed data store that can serve sub-second query latencies. It has capabilities for arbitrary slicing and dicing of your data: you can group on any combination of your dimensions, filter on any particular value, or do any kind of analysis with your data. It also supports real-time streaming ingestion, with the ability to ingest from the various streaming engines out there. It does automatic data summarization for you, and it has support for approximate algorithms, so if you do not care about an exact answer, you can ask for fast approximate results. It is also highly available, since it's designed for production workloads, and it scales to petabytes of data. So it fit all our needs: it fulfilled the requirements of real-time-ness and scalability.

Now let me show you a quick demo. The first demo is about the EC2 spot market. Have you heard of the EC2 spot market before? Okay, cool. EC2 provides instances to you: you can go to the Amazon cloud and ask them to provision some number of instances, and they allocate instances out of the pool they maintain. At any moment they also have spare capacity, and they offer that spare capacity on an open market in a bidding fashion. Say they decide the price of an instance is 0.01 cents per hour; you can go ahead and bid for that instance. Whoever has a bid greater than the current spot market price gets the instance, and as soon as the price rises above your bid, your instance is terminated. So what I have done here is scrape the AWS spot instance prices in real-time — you can see the dashboard is pretty much up to date to the minute — push all that data into Kafka, and through Kafka pull the data into Druid. I have placed a filter for r3.2xlarge instances on the spot market. Let's say we want to provision an r3.2xlarge instance and we don't care if our instance goes down; our application is good enough to handle early terminations and failovers. Then there are several questions I need to answer: What should my optimal bid price be? How much should I bid? Which availability zone should I select, since different availability zones have different prices? Which type of product should I choose — a Windows-based system or a Unix-based system, if my application can run on both? And should I even use spot, or should I go with normal on-demand or reserved instances? So let's try to answer some of those questions. Let's do a filter based on availability zone and see. I think I'm not — yeah, I'm not connected to the VPN. Let me just connect. It should be fine now. Yes. So I'm now comparing across different availability zones, and there is quite high variation across zones. The first graph shows the average price; the second graph shows the maximum price for each hour. As you can see, there was a spike last night, so if I had bid 1.5 dollars per hour for this instance in the us-east-1e zone, it would have been taken away from me. I can look at the history, see how the market is trending and how prices are moving, and decide which availability zone to spawn in. So let's say I went ahead with us-east-1b, which has the least fluctuation. My VPN got disconnected again. It's blue now; I hope it will reconnect. Yeah. So now I have selected my availability zone. It went again. I also prepared a video in case the internet doesn't work, so I'll use that.
I recorded it this morning because yesterday the internet was fluctuating a lot. Anyway, that helped. So here I just selected the instance; it's the data from this morning, so not fully up to date, but yeah. Now what I did is add a comparison between different operating systems — I'm looking at Linux versus Windows — and what you can see is that there is quite high variation in the average Linux prices, while the Windows prices are much more stable and predictable. So the variation in demand, or in spare capacity on AWS, is quite high for Linux instances compared to Windows. If you want to run short jobs, you would probably go with Linux because it's cheaper and costs you less. Another trend I observed this morning was that the price fluctuation was not always there: on some days there was fluctuation, and on other days there wasn't any. I was wondering why the graph wasn't trending the same way every day, and what I found was that the quiet period was a weekend, when there was very little fluctuation — and less fluctuation would mean either more spare capacity or less demand. To validate this, I looked back in the history and observed the same behavior every week: every weekend there was a drop in variation compared to the other days. Finally, I tried to see whether I would even save anything, by comparing prices between the spot market and the on-demand market. The yellow line you can see here is the on-demand price, which is almost 0.7 dollars per hour. The spot market was quite cheap on the weekends — almost 10 times cheaper — and the reserved ones were almost 30 times cheaper than on-demand. So that was the first demo.

Yesterday I also met a few folks from Sensors Without Borders, who had a stall outside this hall, and I talked to them. They also have some pretty good real-time data sets. Essentially, they deploy sensors at different locations; the sensors capture the amount of particulate matter in the air, and from that they estimate how clean your air is. I asked them whether they could stream me a sample data set in real time so that I could analyze it and see if there are any trends in the data. Last night I got about a week's worth of data from them, from some devices they have deployed in Chennai, and I tried to analyze it as well. The first thing I observed was that there were spikes in the data, and I was wondering what these spikes were and whether I could correlate them with anything. At daily granularity — from here to here represents one day — I observed that every day there was a spike in the evening. When people leave their offices, there is a lot of pollution emitted on the streets, and the peak pollution is roughly in the hour between 8 and 9. Then I selected one specific device and analyzed that; from 8 to 9 it was quite clearly at its peak.
The next thing I did was look at their different sensors and compare air quality levels across them, and surprisingly I found that one of the sensors was reporting really high values compared to the others — almost more than double. So this morning I asked the guy from Sensors Without Borders why that is, and he gave a pretty good explanation. He showed me where their sensors are actually placed. The first sensor, the one reporting lower pollution values, is placed here at this location. The one showing the maximum values is placed here, and I thought: both of them are almost on the same road, and this road seems busier, so it should have more traffic — yet it's showing lower values. How is that? His answer: as you can see, here is the airport, so there is a lot of airflow and open ground, and all the emitted particles get blown away with the air, which is why the readings at that spot are low. The sensor placed at the other location gives higher readings because it's surrounded by buildings on both sides, so the airflow there is much lower and the air stays more polluted. And all of this exercise I was able to do within a few minutes — I just ingested the data into Druid and did all the analysis in real time.

So let me go back to my slides and start discussing the Druid architecture. Druid has a couple of different node types, and each node type handles a specific set of tasks. Real-time nodes handle real-time data: they are capable of ingesting streams of data and also serving queries. Historical nodes are specifically designed to serve low-latency queries. Broker nodes split a query across the historical and real-time nodes, combine the results, and return them to the user. Coordinator nodes coordinate the data distribution across your cluster. Let's see how these nodes play together. First, the real-time nodes. They support streaming ingestion: they can pull data from your data source, or you can push data to them. You can also put an ETL layer in front of your real-time nodes to enrich your events, to join multiple streams into a single stream, or to apply any other ETL logic. Let's see how an event flows. Once a real-time node gets an event, it stores it in memory, and from that point on a query can be served from the real-time node — the event is visible on the dashboard as soon as it's present there. The real-time nodes maintain a row-oriented data store in memory and periodically hand the data off to deep storage. Deep storage is just a distributed file system — it can be S3, HDFS, or any other network file system — used as a permanent backup of your data. The real-time nodes write a read-optimized format of your data and store it in deep storage. From deep storage, that data is loaded onto the historical nodes: they notice that new data is available and load it. Once the data is loaded onto a historical node, it is dropped from the real-time node.
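As a concrete example of the streaming path just described — the same Kafka-to-Druid pipeline used in the spot-price demo — a client might push JSON events into a Kafka topic that a Druid real-time ingestion task is configured to consume. This is a minimal sketch using the kafka-python library; the broker address, topic name, and event fields are made up for illustration:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical Kafka broker and topic; a Druid real-time ingestion task
# would be configured to read this topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {
    "timestamp": "2016-08-13T10:15:00Z",  # Druid partitions primarily on time
    "availability_zone": "us-east-1b",    # dimensions
    "instance_type": "r3.2xlarge",
    "product": "Linux/UNIX",
    "price": 0.35,                        # metric to be aggregated
}

producer.send("spot-prices", value=event)
producer.flush()
```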
So now you have your data on the historical nodes, and you can serve queries from them. In certain cases you don't have streaming data; you might have your data sitting in batches in HDFS or somewhere else. In that case you can use Hadoop or Spark to create Druid segments — a segment is the smallest unit of data Druid generates, and I'll discuss segments in more detail later — and those segments can then be handed over to the historical nodes.

Going into more detail about how all these nodes talk to each other: all the nodes follow a shared-nothing architecture, and they talk to each other via ZooKeeper using announcements and watchers. The broker nodes watch for segment announcements. Whenever a node loads data, it announces: I am serving data for this interval, I am serving this segment. The broker node notices that announcement, and from that point onwards it redirects queries for that data to that node. Then there are the coordinator nodes, which coordinate the data distribution within the cluster. They use a cost-based distribution function over segment intervals to assign your data across the different historical nodes. They also manage data replication: you can specify how many nodes you want to replicate your data to, and they will do that for you. They also support a variety of retention rules, so you can specify, for example, that you only want to retain data for the last three months, and they will automatically drop data older than that. We also have a metadata store, which can be any of the usual SQL engines; it is used to store metadata about the segments, such as where exactly each segment resides in deep storage.

Now, I mentioned segments — let's talk about what they exactly are. To discuss that, let's take a small Wikipedia edit dataset as an example. As you may know, anyone can go and make an edit on Wikipedia, and whenever an edit is made, an event is generated. Here are some of those events, which have several columns. We can categorize these columns into three broad categories. First, the timestamp: when the event happened. Second, dimensions: the attributes of the event, like which page was edited, which country the edit was made from, and so on. Third, metrics: measures like how many characters were added, how many characters were deleted, how many lines were modified, and so on. Druid categorizes all your data into these three categories, and it does primary partitioning based on time: it creates segments for a specific time period, and that time period is configurable. What this gives you is that whenever you have a query — say you want to view data for the last week — it will only scan the segments for the last week; the broker nodes know exactly which segments to scan in order to serve your query. You can also specify a secondary partitioning scheme, based on a hash of some specific combination of dimensions or on specific dimension values, so within an interval you can shard your data across multiple segments.
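To make the time-based partitioning and the retention rules mentioned above a bit more concrete, here is a hedged sketch of the relevant configuration pieces, written as Python dicts (Druid itself takes them as JSON). The field names follow the Druid documentation as I remember it and may differ between versions:

```python
# Part of an ingestion spec: primary partitioning on time plus rollup granularity.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # one segment (or set of shards) per day
    "queryGranularity": "HOUR",    # finest time bucket kept; drives the rollup discussed next
}
# Secondary (hash-based) partitioning within an interval is configured separately.

# Coordinator retention rules: keep the last three months (2 replicas), drop everything older.
retention_rules = [
    {"type": "loadByPeriod", "period": "P3M", "tieredReplicants": {"_default_tier": 2}},
    {"type": "dropForever"},
]
```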
The second concept Druid has is data summarization, or rollup. What that means is: say I'm only interested in viewing data aggregated by the hour. I never want to see how many ads were displayed in one particular minute, because that doesn't give me as much value as the hourly data does. In that case I can specify that I'm interested in counts — how many characters were added in the last hour, how many characters were deleted, and so on — and ask Druid to summarize my data. What Druid will do is truncate your timestamp to the granularity you have specified and store that. For the first three rows, since they have the same combination of dimension values, it will store only a single row, recording that there were three edits made to the Justin Bieber page in that hour, rather than storing the three individual rows. In real-world data sets we have seen rollup ratios of roughly 10x to 100x, and this saves you a lot in storage, because this summarized form is what ultimately gets stored in Druid. So first, it helps reduce your storage. Second, it helps at query time: you are executing your queries on already-summarized data, so the number of rows that have to be scanned to answer a query is also reduced many-fold.

The next concept Druid uses is dictionary encoding. It creates an ID for each of your values and stores the IDs in a column-wise format. For example, in the page column we had three different values, so it will assign them the IDs 0, 1, and 2, and it will end up storing these IDs instead of the raw values, along with the dictionary itself. So if you have lots of repeated values in your data, they won't take much space. Similarly, for the city column, since we have only two cities, the stored column can be 0 0 0 1 1 1. The next thing Druid does is create bitmap indexes for each of your values. For example, Justin Bieber appears in the first, second, and third rows — rows 0, 1, and 2 — so Druid will store a bitmap for that value with a 1 for each row where the value occurs and a 0 where it does not. It creates these bitmap indexes for every value in your data set. Now, whenever you want to filter your data — give me results where value x is this and value y is that — Druid can work out the filter just by looking at these indexes: it does simple bitmap OR and AND operations to know exactly which rows to scan. And since it's columnar, it also knows that if your query only accesses three dimensions out of your 100, it only has to scan those three columns rather than your whole data set, which gives it huge benefits. These indexes are compressed with run-length encoding — Concise and Roaring are the two bitmap compression schemes we have integrated with so far. So those were the storage internals — some of the concepts Druid uses internally to speed up queries.
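To make the rollup idea concrete, here is a tiny, self-contained sketch (not Druid code, just the same idea in plain Python): truncate each event's timestamp to the hour, group by the truncated timestamp plus the dimension values, and keep only aggregated metrics.

```python
from collections import defaultdict

# Raw Wikipedia-style edit events: timestamp, dimensions (page, city), metrics.
events = [
    {"time": "2016-08-13T10:01:00Z", "page": "Justin Bieber", "city": "SF", "added": 12, "deleted": 3},
    {"time": "2016-08-13T10:07:00Z", "page": "Justin Bieber", "city": "SF", "added": 5,  "deleted": 0},
    {"time": "2016-08-13T10:52:00Z", "page": "Justin Bieber", "city": "SF", "added": 9,  "deleted": 1},
    {"time": "2016-08-13T10:20:00Z", "page": "Ke$ha",         "city": "LA", "added": 7,  "deleted": 2},
]

def truncate_to_hour(ts):
    # "2016-08-13T10:07:00Z" -> "2016-08-13T10:00:00Z"
    return ts[:13] + ":00:00Z"

rollup = defaultdict(lambda: {"edits": 0, "added": 0, "deleted": 0})
for e in events:
    key = (truncate_to_hour(e["time"]), e["page"], e["city"])
    rollup[key]["edits"]   += 1            # how many raw rows collapsed into this row
    rollup[key]["added"]   += e["added"]   # summed metrics
    rollup[key]["deleted"] += e["deleted"]

# The three Justin Bieber rows in the 10:00 hour collapse into a single stored row.
for key, metrics in rollup.items():
    print(key, metrics)
```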
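Similarly, here is a simplified sketch of the dictionary-encoding and bitmap-index ideas (Druid's real implementation uses compressed Concise or Roaring bitmaps, not Python integers): each distinct value gets an integer ID, the column stores IDs, and each value keeps a bitmap of the rows it appears in, so a filter becomes a bitwise AND/OR.

```python
# Columns of raw values, one entry per row.
page_column = ["Justin Bieber", "Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
city_column = ["SF", "SF", "SF", "LA", "LA"]

def dictionary_encode(column):
    """Return (dictionary, encoded column, bitmap index per value)."""
    dictionary, encoded, bitmaps = {}, [], {}
    for row, value in enumerate(column):
        vid = dictionary.setdefault(value, len(dictionary))  # assign IDs 0, 1, 2, ...
        encoded.append(vid)
        bitmaps.setdefault(value, 0)
        bitmaps[value] |= 1 << row        # set the bit for this row
    return dictionary, encoded, bitmaps

page_dict, page_encoded, page_bitmaps = dictionary_encode(page_column)
city_dict, city_encoded, city_bitmaps = dictionary_encode(city_column)

# Filter: page = 'Justin Bieber' AND city = 'SF'  ->  a single bitwise AND.
matching = page_bitmaps["Justin Bieber"] & city_bitmaps["SF"]
rows_to_scan = [row for row in range(len(page_column)) if (matching >> row) & 1]

print(page_encoded)   # [0, 0, 0, 1, 1] -- IDs stored instead of strings
print(rows_to_scan)   # [0, 1, 2]       -- only these rows need to be scanned
```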
Now let's see how it does in practice: what do production clusters generally look like, and what kind of performance can you expect? The largest known Druid cluster, hosted by Metamarkets, the initial developer of Druid, has over 50 trillion events, almost 50 petabytes of raw data, and over 500 terabytes of compressed queryable data. For real-time streaming ingestion, we are ingesting around 500,000 events per second on average, 2 million events per second at peak, and 10 to 100 thousand events per second per core. The events-per-second-per-core figure varies a lot depending on how many dimensions you are ingesting, the cardinality of your data, and so on. So we have seen the size of the cluster and the ingestion performance; now for query latency. These latencies were measured on the Metamarkets cluster as well — all the numbers I have mentioned so far are from the Metamarkets cluster. The average query latency over the span of a day was around 500 milliseconds; the 90th percentile was under a second, and the 95th percentile was under 5 seconds. There were thousands of simultaneous queries running on the cluster, coming from customers, and they were a mixture of topNs, groupBys, timeseries, and time-boundary queries — a real mixed bag. Another interesting thing: since Druid nodes have a shared-nothing architecture, you do not need any downtime to do upgrades. Each node has its own independent data set and can run a separate version as long as the API between the nodes does not change, and we also support rolling updates, so you can upgrade your cluster completely in a rolling fashion without any downtime.

Here is a list of some of the companies using Druid; I can talk about a few of them. I have talked about Metamarkets — they use it for ad-tech analytics. Yahoo, which is Verizon now, uses Druid for user behavior analytics and real-time application monitoring. eBay and Alibaba use Druid for e-commerce analytics. PayPal uses Druid for tracking events on their website and analyzing them. Cisco has an interesting use case where they use Druid for analyzing network flows in real time; they have a complete product suite that can run queries over Druid in real time and alert you if there are fluctuations or anomalies in your network flows. And there are other companies using it as well. So those are the sweet spots where Druid is really good, but there are some cases where Druid is not ideal. The first is if you have small amounts of data: with really small data you can use pretty much any database — you could even pre-compute everything and serve it, and that would be even faster. Second, OLTP use cases: Druid is specifically designed for OLAP and BI use cases, not OLTP. Third, frequent single-row updates: the data is partitioned at the segment level, and you can only update a segment, not a single row, so to update data you need to re-index all the data for a specific interval. And finally, dumping the entire data set: since it's a columnar data store, it's not designed for exporting your entire raw data set; it's designed for OLAP-style analytical queries on top of it.
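As an aside (not part of the talk), the query types mentioned above — timeseries, topN, groupBy — are expressed as JSON documents and POSTed to the broker. A minimal sketch in Python, assuming a broker reachable at localhost:8082 (the default port can vary by version) and a hypothetical "wikipedia" datasource:

```python
import json
import requests  # pip install requests

# A native timeseries query: hourly edit count and summed characters added,
# filtered on one dimension value.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "intervals": ["2016-08-06T00:00:00Z/2016-08-13T00:00:00Z"],
    "filter": {"type": "selector", "dimension": "country", "value": "India"},
    "aggregations": [
        {"type": "count", "name": "edits"},
        {"type": "longSum", "name": "chars_added", "fieldName": "added"},
    ],
}

resp = requests.post(
    "http://localhost:8082/druid/v2/",           # broker endpoint (assumed)
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
for bucket in resp.json():                       # one entry per hour bucket
    print(bucket["timestamp"], bucket["result"])
```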
Finally, we have a good community. If you have any questions, you can reach us on our user group or dev group, and this is our GitHub handle for the code. We are also present on IRC, where you can chat with any of the Druid committers. I'm open for questions.

Hello — am I audible? So I have a question on how Druid enables real-time aggregation. Suppose you have a dashboard where you are presenting aggregates over historical data, and now your real-time feed comes in. How will you factor that real-time feed into the aggregates appearing on the dashboard?

So I talked about the real-time nodes, right? They are designed specifically for this purpose. The historical nodes maintain a column-oriented format, which is very good for reads but not optimized for writes. The real-time nodes maintain an in-memory, row-oriented data store: as soon as your writes come in, they append and summarize the data. Whenever a query comes to the broker, the broker knows that the data is present both on the historical nodes and on the real-time nodes, so it splits the query and sends it to both. Real-time and historical nodes expose the same query API, so when the query reaches the real-time nodes, they execute it over the data they have in memory and return the results, and all the results are then merged in the broker nodes and given back to the user.

So my question is: your query broker is going towards both kinds of nodes — where does the higher-level aggregation happen? The real-time nodes are aggregating the real-time feed, and the historical nodes are aggregating the historical feed?

The historical nodes are not aggregating anything; their data is pre-aggregated. They just load the data that was already summarized by the real-time nodes. As I showed with the event flow: the real-time nodes get some data, hold it for a while, create a column-oriented format out of it, and hand it over to deep storage, which becomes a permanent backup of your data. From there, the coordinator nodes notice that there is a new segment that needs to be loaded onto some historical node; the historical node fetches the segment from deep storage, loads it, and serves it from there. The data on the historicals is immutable. But yes, at query time the user has to be shown the aggregated values, and that aggregation is done in the broker nodes: you get intermediate, partially processed results from each node, and the broker merges those intermediate results and gives them back to you. So it's two levels of merging: a historical node with, say, 100 segments merges those 100 segments into one result and provides it to the broker, and the broker merges the results from the different historicals as well as the real-time nodes and shows them back to you. Does that make sense? Yes.
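To illustrate the two-level merging described in this exchange, here is a toy sketch (not Druid code): each data node returns partially aggregated results for the segments it serves, and the broker merges those partial results key by key before returning the final answer. The node names and numbers are made up.

```python
from collections import Counter

# Partial aggregates returned by individual nodes (already aggregated per segment).
historical_partial = {"2016-08-13T10:00Z": 26, "2016-08-13T11:00Z": 14}  # older segments
realtime_partial   = {"2016-08-13T11:00Z": 3,  "2016-08-13T12:00Z": 8}   # fresh, in-memory data

def broker_merge(*partials):
    """Second-level merge done at the broker: sum partial results key by key."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values for matching keys
    return dict(merged)

print(broker_merge(historical_partial, realtime_partial))
# {'2016-08-13T10:00Z': 26, '2016-08-13T11:00Z': 17, '2016-08-13T12:00Z': 8}
```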
So it will show you a snapshot, right? It's of the order of — I told you about sub-second latency, right? Suppose it takes 10 milliseconds to merge the results and somewhere around 100 milliseconds to process the query and send it back to you — everything is real-time, sub-second.

So the real-time nodes really aggregate the data, and the aggregated data is what actually gets stored? Of course, that is configurable. Yes. But then what is the back-end engine doing — all those Spark, Samza, etc. things, which are essentially the compute pieces that typically do aggregations? If one is making use of Druid for storage, then what is that compute piece supposed to do, in your experience? How is that getting used? That is question one. And second, the UI that you showed — does that come pre-built with Druid, or is it a REST API that gets exposed so that we can see everything through the browser?

Cool, both are good questions. The first question is: what's the use of ETL, right? ETL is optional, but you may not want to store your raw data exactly as it arrives. The data that comes in might have only a few dimensions, and you might want to enrich it — do some look-ups, add some more dimensions. Or the data may be coming from multiple different streams, in which case you may want to merge those streams into a single stream and send it to Druid. Or you may have some other business logic where you want to pre-process your data before sending it to Druid. So it's not really for the aggregates — the aggregates Druid can do for you. As for the UI: it is developed by Imply, a company founded by Druid committers, and it's also open source. It's not bundled with Druid, but you can set it up quickly and easily; it's called Pivot.

Hey, I just want to know whether Druid supports continuous queries as well? Right now, no. You'll have to do that on the client side, by issuing queries again and again. Okay, yeah, thank you.