needs to be scalable and handle billions of events per second or per minute. To discuss this problem, I'm taking a sample data stream which is provided by Wikipedia. Wikipedia is a collection of documents, and whenever anyone makes an edit to any page, an edit event gets generated, and Wikipedia provides those events as a stream. Each event has some information about which page was edited, from which IP address the edit was made, and how many characters were added and deleted. What we are going to discuss in this talk is how we can take these events in a streaming way and create a dashboard on top of them.

To show you the dashboard, I have here a sample dashboard which is right now running live on the Wikipedia edit stream. What we are essentially interested in is a summarized view of the Wikipedia edits: for example, how many unique edits were made, the total number of edits made in the past week, how the edits are distributed over time visualized as a time series, from which countries the edits are being made, and how those edits are grouped. Here we can see that the United States is the country contributing the largest number of edits, followed by Italy, the UK, and so on. We are also interested in seeing how many characters were added from different countries, and what the top edited pages are. This sample table shows only edits filtered by location India, and we can see that people are editing pages like the list of banks in India and The Rise of Sivagami, and that these pages have been edited frequently in the last week. This is essentially the end goal we want to reach from the Wikipedia edit stream.

The first line here shows a sample event that Wikipedia provides us. The event has the title of the page, the URL of the page, the IP address from which the page was edited, and how many characters were added or deleted. Let's try to break this problem down into smaller steps. First, we are interested in consuming the raw events provided by Wikipedia. The second step is to parse and enrich those events: since we have the IP address, we can do an IP lookup to also add geographic information, like the country or city from which the page was edited. Next, we are interested in storing the events in some persistent store which can serve sub-second queries and power interactive dashboards. The final layer of the solution is a visualization layer, where we create dashboards to visualize those events and analyze them.

So essentially, to build this complete solution, we need four components. First is event flow: we need to transfer events from the source where they happen to another place in a reliable and guaranteed way. The second is event processing: we need to process those events one by one and enrich them. Third, we need to store those events, and finally visualize them in a visualization layer. I'm going to discuss all four components one by one. First is event flow. A general event flow looks something like this: you have a set of producers, and you have a set of consumers which are interested in consuming those events.
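Before going further into the event flow, here is a rough sketch of the shape of one such edit event as described above; the field names are assumptions for illustration, not the exact schema of the Wikipedia feed.

```java
// Rough shape of a single edit event; field names are assumptions, not the feed's exact schema.
public record WikipediaEditEvent(
        long timestamp,   // when the edit happened
        String page,      // title of the edited page
        String url,       // URL of the edited page
        String ip,        // IP address the edit came from
        int added,        // characters added
        int deleted) {    // characters deleted
}
```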
Between those producers and consumers, you need a message broker or some queuing solution in between which can facilitate this event flow. The requirements for this solution are: we need something which provides low latency and high throughput. It should be fault tolerant, so if there are any failures on the producer or consumer side, it should be able to handle them. It should provide message delivery guarantees, like ordering of the messages, for example which event happened before another, since ordering needs to be maintained in some cases, and also delivery guarantees like at least once, exactly once, or at most once; which of these you need varies from use case to use case. The last one is scalability: it needs to be scalable enough to handle billions of events per second.

The solution I'm going to discuss here for event flow is Apache Kafka, which is more or less the de facto standard in the industry these days and is used by many big companies. How Kafka works is that it has a set of brokers, and these brokers hold partitions for different topics; each topic is divided into multiple partitions. There are multiple producers, and each producer can produce events to multiple partitions. Events are stored in a partition in sequence, and each event is identifiable by an offset, which gives you an ordering guarantee within a particular partition. You can also have multiple consumers, which consume events from these partitions in sequence, and there can be multiple consumer groups consuming those events. Each consumer also tracks its own offset, so that if there is a failure on the consumer end, it can come back and start reading messages again from the last known good offset.

To summarize, Apache Kafka provides low latency and high throughput. It has at-least-once message delivery guarantees, and the Kafka team also introduced exactly-once guarantees in their latest release last month. It also has a reliable design to handle failures: there are message acknowledgments between the producers and brokers, you can configure data replication on brokers to survive broker failures, and consumers can read from any desired offset and keep track of their own offsets.

The next layer is event processing, where we want to process the events. There is some source where the events happen; we want to consume those events, process them, and then produce them to another sink, so a consume-process-produce pattern. What we want to achieve with this is to enrich and transform the event streams and apply some business logic, for example filtering out nulls in our Wikipedia edit stream case, and enriching the events by adding the geolocation information. You might also want to apply some windowing logic, or maybe join multiple streams into a single one. This layer also needs failure handling and scalability. There are many, many solutions out there in the market, for example Apache Samza, Spark, Flink, Apex, Kafka Streams, and Storm. The one I'm going to discuss here is Kafka Streams, which works very well with Kafka. Kafka Streams is just a lightweight streaming library that ships with the Kafka releases. It processes one event at a time. It also has operators for stateful processing, like windowing, joining, and aggregation operators. It uses a local state, backed by RocksDB underneath, to store the events.
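As a rough preview of what a Kafka Streams application for this pipeline can look like (the talk's own three-step code is described next), here is a minimal sketch. The topic names follow the talk; WikipediaMessage.parseAndEnrich is an assumed helper standing in for the parsing and GeoIP lookup, and the string serdes are an assumption for illustration.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class WikipediaEnricher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wikipedia-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Treat keys and values as plain strings for this sketch.
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Step 1: consume the raw edit events.
        KStream<String, String> raw = builder.stream("wikipedia-raw");

        // Step 2: parse each event and add geolocation from the IP address;
        // parseAndEnrich is an assumed helper that returns null for bad records.
        raw.mapValues(value -> WikipediaMessage.parseAndEnrich(value))
           .filter((key, enriched) -> enriched != null)   // drop unparseable or empty events
           // Step 3: produce to the enriched topic that the data store ingests from.
           .to("wikipedia-enriched");

        new KafkaStreams(builder.build(), props).start();
    }
}
```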
That RocksDB-backed local state is also backed by a Kafka changelog topic, which is used for failure recovery. If you restart your Kafka Streams application, or if there is any failure and it gets restarted, it can read that changelog from Kafka, replay those events, and rebuild the local state. You can also scale it out and run it in a distributed and fault-tolerant manner. If you compare it with a plain Kafka consumer, it is higher level, which makes it faster to build a sophisticated application, at the cost of a little less control over very fine-grained consumption; if you need that, you can use the low-level consumer APIs.

For the Wikipedia edit stream, here is a simple piece of code which does the processing in three steps. The first step reads the data from the raw Wikipedia stream we are getting from Wikipedia and builds a Kafka Streams stream object. In the second step, it parses those events, matches them against a pattern, and turns them into a Java object named WikipediaMessage. It then maps over these events to add geolocation information and filters out any events which are, for example, null or empty strings. Finally, it produces those events to a Wikipedia enriched topic, from where our data store can consume them.

The next piece is the data store, which is also one of the most critical pieces in this pipeline, because it needs to power an interactive dashboard. It has multiple requirements. First, it needs to be able to ingest streaming data, because we are talking about creating dashboards on data streams, so it needs to consume events arriving in a streaming way and make them available for queries as soon as they are ingested into the data store. The second requirement, because we are powering dashboards here, is sub-second query response times, so that the dashboards feel interactive. And because we don't know what the user is going to query, it needs to support arbitrary slicing and dicing of the data. Also, as we are visualizing the data, we are interested in providing summarized and aggregated views of the data. And the same requirements of scalability and high availability need to be there.

The solution I'm going to discuss here for this is Druid, which I work on. Druid is a column-oriented distributed data store which can provide sub-second query times. It has support for both real-time streaming ingestion as well as batch ingestion: you can do batch ingestion via Hadoop or Spark, and streaming ingestion from multiple data sources, either pushing data to Druid or pulling data from Kafka or any other message broker. It supports arbitrary slicing and dicing of your data. It uses techniques like automatic data summarization and approximate algorithms to give you fast query response times. It has been scaled to petabytes of data in production and is highly available. Now I'm going to discuss the suitable use cases, how Druid handles this kind of workload, how it works internally, and how it is able to provide those sub-second query times. The suitable use cases are powering interactive user-facing applications, which is what we are discussing in this talk, and arbitrary slicing and dicing of your large data sets.
They also include user behavior analysis: measuring distinct counts, like how many unique users visited my website in the last week or the last month, doing retention analysis, like how many of my users were retained this week compared to the previous week, doing funnel analysis, A/B testing, and those kinds of use cases. What it is not very good at is dumping out the entire data set. It is good if you are trying to get an aggregated or summarized view of your data in the form of queries, but if you are trying to dump the entire data set, then due to its columnar nature it is not very good at that.

Coming to the storage internals and how Druid stores the data internally: Druid stores the data in the form of segment files, which are partitioned by time. For example, in this figure I have my segments partitioned by day, so for Monday I have one segment and for Tuesday I will have a different segment. Ideally the segments are smaller than about one gigabyte in size; if your data for a particular day is larger than that, you can create multiple shards for that day. For example, on Friday we have two different shards. This time-based partitioning gives Druid the ability to prune segments based on the time range you are querying. For example, if I'm only interested in visualizing events which happened in the last week, it can quickly scan only those segments corresponding to that one-week time range and give you the results, rather than scanning all the segments.

Coming back to our Wikipedia example, this is how our data looks after being enriched, when it is sent to the data store. The data has a timestamp of when the edit happened, which comes first. Then it has some dimensions, for example which page was edited, in which language the edit was made, and in which city and country the edit was made. Then there are some metrics, like how many characters were added and how many were deleted in this particular edit event.

Now I will discuss the various techniques that Druid uses in order to provide fast query times. The first technique it uses is data rollup, or data summarization. What I mean by that is: if in your use case you are only interested in querying your data aggregated by hour, you can tell Druid at ingestion time that you are only interested in aggregating data by hour or any coarser granularity, and that you are never interested in querying data by, say, each minute. In that case, what Druid can do for you is summarize your data. For example, these first three rows all happened in the same hour, and since you are always going to query the data aggregated by an hour or coarser, it will summarize all three rows into a single row and store that there were three edits made, along with the sum of characters added and deleted, the min, the max, and so on; you define all the metrics that you want computed at ingestion time. This reduces the data size that needs to be scanned at query time many times over, depending on how well your data rolls up, and that helps give faster query times.
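To make the rollup idea concrete, here is a toy sketch in plain Java (not Druid code) that truncates timestamps to the hour and collapses rows that share the same hour into one summarized row. The sample rows are invented for illustration.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RollupExample {
    record Edit(Instant time, String page, int added, int deleted) {}

    public static void main(String[] args) {
        // Three edits to the same page within the same hour.
        List<Edit> edits = List.of(
                new Edit(Instant.parse("2017-06-27T10:05:00Z"), "Justin Bieber", 15, 2),
                new Edit(Instant.parse("2017-06-27T10:20:00Z"), "Justin Bieber", 7, 0),
                new Edit(Instant.parse("2017-06-27T10:50:00Z"), "Justin Bieber", 3, 1));

        // Group by the hour the edit falls into (a real rollup also groups by the
        // dimension values; they are identical here, so grouping by hour is enough).
        Map<Instant, List<Edit>> byHour = edits.stream()
                .collect(Collectors.groupingBy(e -> e.time().truncatedTo(ChronoUnit.HOURS)));

        // Each group of raw rows becomes a single summarized row: a count plus
        // whatever aggregates were declared at ingestion time (sums here).
        byHour.forEach((hour, rows) -> System.out.printf(
                "%s page=%s count=%d added=%d deleted=%d%n",
                hour, rows.get(0).page(), rows.size(),
                rows.stream().mapToLong(Edit::added).sum(),
                rows.stream().mapToLong(Edit::deleted).sum()));
    }
}
```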
The second technique it uses is dictionary encoding. Instead of storing the raw column values as they are, it creates a dictionary encoding for these columns: it assigns an ID to each distinct value in a column. For example, for this page column we have Justin Bieber, Kesha, and Selena Gomez, so it will assign the IDs zero, one, and two. Instead of storing the repeated strings, what it essentially stores is the dictionary plus the column data, which is just a sequence of those IDs, compressed using different compression techniques. Each of the columns is compressed using this dictionary encoding technique.

The third thing it uses is bitmap indexes. It creates a bitmap index for each value that is stored. For example, in this case Justin Bieber appears in the first three rows, so it will create a bitmap index having the value one wherever Justin Bieber is present and zero in the rows where it is not present. Similarly, it will create these bitmap indexes for all the values in your column. Now, when you try to query and filter your data, what does it do? For example, I query for any row which has either Justin Bieber or Kesha. Since it already knows in which rows Justin Bieber is present and in which rows Kesha is present, it can just do a bitmap OR to compute the final result set, and then it will only scan those rows which need to be scanned in order to answer your query. So data filtering is essentially just doing some bitmap OR and AND operations. And these bitmap indexes are compressed with Concise or Roaring encoding to reduce their size even further.

The next technique it uses is approximate sketch columns. Generally you have things like user IDs in your click data, or long page URLs, and in most cases you are not interested in looking at what those individual users are doing, but in getting some aggregate information, like how many unique users visited my website. For those unique-user counts, distinct counts, or retention-analysis kinds of questions, you need not store the exact values. What you can do instead is store approximate sketches, which can be HyperLogLog or any other data sketch which supports these types of computations. These data sketches reduce the size of your data even further. Instead of storing the raw column values, you just store a sketch object which can do computations like distinct counts; for example, if you store a HyperLogLog sketch, you can do distinct counts on it.

To go into a bit more detail about approximate algorithms: you can store sketch objects instead of raw column values, which also gives you better rollup, because instead of storing individual values you are storing a single aggregated object for that column, so you get a reduced data size. The use cases are approximate distinct counts, approximate histograms, and funnel or retention analysis. There are certain limitations to storing these sketch objects: obviously, since you are not storing the raw data, you cannot do exact counts in this fashion, and you also cannot filter on individual raw values if you are using approximate columns.
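As an aside, here is a small example of what storing and querying such a sketch can look like in code, using the DataSketches library that comes up later in the Q&A (originally released by Yahoo; recent versions live under the org.apache.datasketches package, older ones under com.yahoo.sketches). The numbers and sketch size are purely illustrative, and this is plain library usage, not Druid's internal code.

```java
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class UniqueUsersExample {
    public static void main(String[] args) {
        HllSketch monday = new HllSketch(12);   // lgK = 12, a few KB per sketch
        HllSketch tuesday = new HllSketch(12);

        for (int userId = 0; userId < 1_000_000; userId++) {
            monday.update("user-" + userId);
            if (userId % 2 == 0) {
                tuesday.update("user-" + userId);   // half of the users come back
            }
        }

        // Approximate distinct counts, without storing a million raw user ids.
        System.out.printf("monday  ~= %.0f%n", monday.getEstimate());
        System.out.printf("tuesday ~= %.0f%n", tuesday.getEstimate());

        // Unions work well with HyperLogLog; for intersections (retention-style
        // questions) theta sketches from the same library are usually used instead.
        Union week = new Union(12);
        week.update(monday);
        week.update(tuesday);
        System.out.printf("week    ~= %.0f%n", week.getResult().getEstimate());
    }
}
```

And going back to the dictionary encoding and bitmap indexes for a moment, here is a toy sketch of both ideas in plain Java. Real Druid compresses these bitmaps with Concise or Roaring, so this is only meant to show the mechanics of answering an OR filter.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BitmapIndexExample {
    public static void main(String[] args) {
        List<String> pageColumn = List.of(
                "Justin Bieber", "Justin Bieber", "Justin Bieber",
                "Kesha", "Kesha", "Selena Gomez");

        // Dictionary: each distinct value gets an integer id, so the column
        // stores small ids instead of repeated strings.
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        int[] encodedColumn = new int[pageColumn.size()];
        // One bitmap per distinct value: bit i is set if row i holds that value.
        Map<String, BitSet> bitmaps = new LinkedHashMap<>();

        for (int row = 0; row < pageColumn.size(); row++) {
            String value = pageColumn.get(row);
            encodedColumn[row] = dictionary.computeIfAbsent(value, v -> dictionary.size());
            bitmaps.computeIfAbsent(value, v -> new BitSet()).set(row);
        }

        // Filter "page = 'Justin Bieber' OR page = 'Kesha'" is just a bitmap OR.
        BitSet rowsToScan = (BitSet) bitmaps.get("Justin Bieber").clone();
        rowsToScan.or(bitmaps.get("Kesha"));

        System.out.println("dictionary   = " + dictionary);   // {Justin Bieber=0, Kesha=1, Selena Gomez=2}
        System.out.println("rows to scan = " + rowsToScan);   // {0, 1, 2, 3, 4}
    }
}
```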
Coming to the Druid architecture, this is a diagram showing how it fits together. First of all, we have streaming data coming in. Then there is an ETL system which is massaging, enriching, and transforming those events. Finally, those events are consumed by the real-time index tasks. What these real-time index tasks do is keep an in-memory state of the events as soon as they receive them: they build a write-optimized in-memory data structure which can be used to serve queries. So at this point, if someone asks for the data, it can already be returned in query results. Periodically, the real-time index tasks convert that write-optimized data structure into a read-optimized, immutable data structure, which are the final Druid segments, and they hand those segments over to deep storage. That is, they persist the segments to some distributed storage, which can be S3, Cassandra, HDFS, or any other network file system; it just needs to be accessible from all the nodes. From this deep storage, another set of nodes called the historical nodes load those data segments, memory-map them, and serve queries on top of them. So periodically, the data is handed over from the real-time nodes to the historical nodes, and the historical nodes then handle the historical data, which is immutable in nature, and serve queries on top of it. The historical nodes are the main workhorses of a Druid cluster and serve most queries. There is another set of nodes called the broker nodes, which keep track of all the segments and of which nodes in the cluster those segments are present on. What they do is scatter your query: they break your query up across the segments, send the pieces to multiple real-time as well as historical nodes, gather back the results from those nodes, and return the merged results to your query layer.

Just to follow the flow of an event: an event happens, it gets enriched and then sent to the real-time nodes. At this point, if a query comes in, the event will be served and visible on your dashboard via the real-time nodes. After a certain period of time, the real-time nodes hand that particular segment over to deep storage, from where it is loaded onto the historical nodes. As soon as a historical node loads the data, it announces that it has loaded that particular segment, and the real-time nodes drop the data from their memory. From now on, if a query comes, it can be served from the historical nodes themselves.

Some quick facts about Druid's performance and scalability and how it looks in production. The largest cluster is hosted by Metamarkets, which is also the company that originally started Druid; it ingests around 300 billion events per day. Jolata uses Druid for computing around one billion metrics per minute. That largest cluster at Metamarkets is around 200 nodes. And the largest ingestion rate in terms of data size is reported by Netflix, which is around two terabytes per hour. Here is a list of some companies which use Druid in production, at various scales and various sizes.
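As a rough illustration of how that broker layer gets used, a client (or the visualization layer discussed next) can POST a native Druid query to the broker over HTTP, and the broker scatters it to the real-time and historical nodes and merges the results. This is only a sketch: the datasource name, the interval, and the default broker address and port are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidQueryExample {
    public static void main(String[] args) throws Exception {
        // Native timeseries query: hourly edit counts and characters added over one week.
        String query = """
                {
                  "queryType": "timeseries",
                  "dataSource": "wikipedia",
                  "granularity": "hour",
                  "intervals": ["2017-06-20/2017-06-27"],
                  "aggregations": [
                    {"type": "longSum", "name": "edits", "fieldName": "count"},
                    {"type": "longSum", "name": "chars_added", "fieldName": "added"}
                  ]
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8082/druid/v2/"))   // broker endpoint (assumed default port)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // one aggregated row per hour in the interval
    }
}
```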
The next layer is the visualization layer. Now we have the data in a data store which can serve sub-second queries, and we want to create dashboards to visualize those events and analyze them. The requirements here are: we need a visualization layer with rich dashboarding capabilities. Generally, in an organization, we do not have all the data in a single data source; the data is spread across multiple data sources, so it should be able to query multiple data sources as well. And since it is a user-facing application, it needs security and access control. It also needs to be extensible enough that you can customize it for your own use cases.

The solution I'm going to discuss for the visualization layer is Superset, which is a Python-backed Flask application. It has authentication, uses pandas for rich analytics, and uses SQLAlchemy as its SQL toolkit. For the front end it uses React and NVD3, and it also has a deep integration with Druid, so it can generate optimized native Druid queries, get back the results, and serve them in the dashboard. Just to show some of the dashboarding capabilities: it has support for various visualizations, like maps, sunbursts, and many, many others. In fact, I can show you some of the other sample dashboards which use those visualizations. For example, this is a World Bank data dashboard with world health and population data. You can see there are map visualizations, a sunburst visualization of the breakdown of rural areas, different line charts, tables, stacked charts, bubble charts, lots and lots of visualizations, and it also provides an explore view where you can see all the visualizations that are available. These are all the visualizations that Superset supports today. It is also extensible, so if you want, you can add your own custom visualizations. And it is very easy to create dashboards with Superset on top of multiple data sources: you can define your data sources and connect multiple of them, so here I have one SQLite database, and these different data sources are connected to this particular Superset instance. So, coming back to this slide: it has lots and lots of visualizations which you can use and customize for your own use cases to create dashboards.

To summarize how this whole Wikipedia dashboard was working: we had the Wikipedia stream coming in, and we used a simple Java application using the Kafka Connect APIs to just dump all the raw events we were getting into a Kafka topic named wikipedia-raw. Then we used Kafka Streams, where we created an IP-to-geolocation processor which was reading these events from Kafka and writing them to another topic named wikipedia-enriched. From this topic, we were then pulling data into Druid using Druid's Kafka indexing service. And finally, we were querying Druid to visualize all those events in Superset. These are the links to the project websites; you can go there if you need additional information about any of these projects.
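As a small illustration of the first hop in that summary, a raw-event producer could look something like the sketch below. This is not the actual application from the talk: WikipediaFeed.readRawEvent() is an assumed helper standing in for however the Wikipedia feed is consumed, and the topic name just follows the talk.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WikipediaRawProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");   // wait for broker acknowledgment for durability
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                // Assumed helper: blocks until the next raw edit event arrives from the feed.
                String rawEvent = WikipediaFeed.readRawEvent();
                // Dump the event unchanged into the raw topic; the Kafka Streams
                // job picks it up from there.
                producer.send(new ProducerRecord<>("wikipedia-raw", rawEvent));
            }
        }
    }
}
```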
With this, I will end my talk. I would also like to announce that there is an off-the-record session on experiences and challenges in working with Druid. So if any of you are using Druid in production and are facing any issues, or are interested in knowing the future roadmap and discussing the challenges you have faced, please join me tomorrow at 3:25 PM in room one for further discussion. And now I'm open for questions.

Questions? Hello. Yeah, first of all, thank you for the session. Two questions really. You're not audible now. Am I now? Can you hold the mic closer? Okay, is it better now? Yeah. Okay, thanks. So Nishant, two questions on Druid really. You mentioned how you would decide on a rollup granularity as a design element, and the data gets rolled up on that slice that you've chosen, hourly in your case. Often there are scenarios where you want the rollup, but you also want to have access to the raw data. Are you able to model those types of scenarios with Druid as well?

So you want the rollup to happen, and you also want the raw data? That's correct, yeah. Yeah, so essentially it's not a cube-building solution where you build multiple cubes. What you can do is keep your data without rolling it up to any granularity, and then, as your queries come in, the brokers also have a concept of caching: they can cache your aggregated results per segment, and if you re-query that particular aggregated data set, they can look it up in that cache rather than actually processing your query again. This caching layer helps with these kinds of scenarios where you want to query at multiple granularities but do not want to roll up your data to one particular granularity. Though you will pay in storage size for the extra storage needed to keep those raw events.

Question: with Apache Kafka Streams and Flink, how do you choose between these two? What was the question about Kafka Streams? There is Apache Flink also. So how do you choose between these two? So actually, I'm not going to compare all these ETL engines here, and in particular I haven't worked with Flink, so I cannot comment on that. But talking about Kafka Streams, you can have your processing in multiple stages, it is pretty good and pretty lightweight, so if you are moving towards microservices or Docker or any of those kinds of deployments, you can use that. If you are already using Flink, I think it also provides very good APIs for stream processing, so you can use that too. Ultimately, the only thing that matters is that your ETL layer should be able to handle the event load, be scalable, and send those events to a data store, and the data store needs to be able to serve queries on that particular data size.

Questions on this side? Question there? Hi. Hello. Hello. Okay. Can you give an example of data sketching, a real-world example of how data sketching works? Yeah. So for example, you may have an e-commerce website and you have different users visiting that website, and you get those user IDs in each of the events, right? Those user IDs are of the order of billions in cardinality.
And what you are interested in is not knowing that this particular user visited this page; instead, what you are interested in knowing is how many users visited my e-commerce website in the last week. Those kinds of questions can be answered by these data sketches. One example is HyperLogLog: you can create HyperLogLog objects and store those. The second is another library created by Yahoo for data sketches. If I compare the two, HyperLogLog is not very good at doing intersections; it is good at doing unions, but not at intersections. For intersections, the other data sketches library, contributed by Yahoo, works pretty well. And intersections are needed for use cases like retention analysis and things like that.

Hi, I'm starting on a similar activity, so this session is really helpful for me. Now I have two questions. One is: how does it compare to the ELK stack, if I look at the whole framework?

So honestly, I haven't done any benchmarks myself, but there were some benchmarks done by the team at Netflix with the ELK stack. What they found was that for search kinds of use cases it was pretty good, but if you are interested in doing aggregations, the ELK stack needed much more resources and was getting much more costly in terms of both query latency and hardware requirements. There is one complete post on the Druid user forums on this topic, so you can search for that and for the exact numbers that they got.

Okay. And does Superset allow changing the query? How difficult is it to write a fresh query?

It's pretty easy, actually. Superset has this concept of slices. So you just select the data source, select your visualization, for example a pie chart, and then pass in some group-by keys; for example, I want to group by country name and then query that. So that is one way, and the second way is SQL Lab: Superset also has a SQL Lab where you can just compose your SQL query and get your results.

One last question, that's it. Is it good for mutable data? For example, transactional data, which has some columns changing.

It is good for OLAP use cases, not for OLTP use cases where you have transactional flows. Right now, what it does is treat your past historical data as immutable, and if you want to make edits or changes, what you essentially need to do is reprocess the data for that particular interval. Since the data is time-partitioned, you can replace segments for any selected interval. For example, say there was some bug and I want to reprocess events for the last week: I can just reprocess that in a batch fashion, via a Hadoop MapReduce job or via Spark, and then create newer segments which will replace the older segments.

That may be good for a one-off case where you know things have gone bad. But if something is changing forever, for example the last, say, two months of data always keeps changing, then maybe this might not be a good fit.

I mean, can you say a bit more about that?

So for example, I have a transaction which has different statuses: transaction initiated, then in process, success, failure. And then I want to know, kind of...
There are multiple ways to model this data set. You can model it in terms of event streams: a transaction at this time, with its old state and its new state. If you just have this as an event stream, it now becomes immutable data, right? And if you want to query the final state, you can just query the latest event, filter by the latest event, and you will get the latest state. With this model, you also get the historical states. So instead of just storing the current state, you can store the old state and the new state in an event-based way; there are ways to model such data sets as well. Thank you. And if there are any further questions, we can take those tomorrow in the off-the-record session. Yes, we can take this discussion offline. Thanks a lot, Nishant. So we will now take a break for tea.