All right, hey guys, I think we should get started. First off, I just want to thank everyone for attending this session, especially since it's right after lunch on the last day, so I appreciate you being here. My name is Fangjin. I'm a co-founder of a technology startup based in San Francisco called Imply, and also a committer on one of the open-source projects I'm going to be talking about today, called Druid. Today I want to talk to you about the idea of building an open-source stack to handle streaming data, which is the type of data you find very often when working with IoT devices. What I hope to cover during this talk is a little bit about the problems you face with high volumes of event data, and a couple of different technologies designed to solve various problems when dealing with high volumes of streaming data. Those pieces consist of a message delivery piece, a data processing piece, and a piece to serve queries. So we start off with the problem, and I'm sure by now you're very familiar with it: connected devices.
They're growing very rapidly. You can now tweet from your fridge, so the whole world can know what you're having for lunch or grabbing for a snack. That's kind of cool, but connected devices are everywhere, they're continuing to grow in popularity, and as a result data is growing very quickly as well. A lot of the problems people talk about in the IoT space come down to the fact that there's a ton of data being generated, and this data is often very valuable, because when you extract insights from it there are important decisions or optimizations you can make. So the problems I want to focus on today are really around collecting a massive stream of data and then making sense of it.

Now, connected devices generally emit a stream of events, and these events are often called messages or logs depending on the literature, but they're really just bits of information describing what's happening at a particular period in time. When I look at events emitted by devices, I see them typically being composed of three components. There's a timestamp indicating when the event was created. There's a set of attributes around the event: properties that describe the device, describe what's happening on the device, or describe something else of interest. And there's a set of measurements, which are the numbers of interest. When we get a bunch of these events, generally what we want to do is calculate a variety of statistics based on the measurements, and when we calculate those statistics we often want to group on the attributes or filter on the attributes. By doing so we gain more insight into the data, and once we have that insight we can make decisions based on the findings.

To give you an example of the analytics you can do on a stream of data, I thought it would be useful to show a short demo, and for the demo I'm just going to be using a UI here. Let me make this font a little bit bigger. I have a couple of different streaming data sets here. There's one for Wikipedia, just edits that are occurring on Wikipedia. There's your more standard IoT data set, which is air quality data collected from various sensors. That data set is actually kind of boring.
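To make that event anatomy concrete before the demo, here's a minimal sketch in plain Python. The field names and values are made up purely for illustration; the point is the shape (timestamp, attributes, measurements) and the filter-group-aggregate pattern just described:

```python
from collections import defaultdict

# A hypothetical stream of device events: each has a timestamp,
# descriptive attributes, and numeric measurements.
events = [
    {"timestamp": "2017-06-01T12:00:00Z",
     "attributes": {"device": "sensor-1", "city": "SF"},
     "measurements": {"temperature": 21.5, "readings": 3}},
    {"timestamp": "2017-06-01T12:00:05Z",
     "attributes": {"device": "sensor-2", "city": "NYC"},
     "measurements": {"temperature": 25.0, "readings": 1}},
    {"timestamp": "2017-06-01T12:00:09Z",
     "attributes": {"device": "sensor-1", "city": "SF"},
     "measurements": {"temperature": 22.5, "readings": 2}},
]

# Filter on one attribute, group on another, aggregate a measurement:
# average temperature per device, for SF only.
totals = defaultdict(lambda: [0.0, 0])
for e in events:
    if e["attributes"]["city"] == "SF":      # filter
        key = e["attributes"]["device"]      # group
        totals[key][0] += e["measurements"]["temperature"]
        totals[key][1] += 1

averages = {k: s / n for k, (s, n) in totals.items()}
print(averages)  # {'sensor-1': 22.0}
```

Every query in the demo that follows is some variation of this pattern, just run at much larger scale.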
So I'm going to pick one of these other data sets to demo instead. Really, with device data or any activity stream, there tend to be a lot of commonalities: as I mentioned, a set of attributes that describe an event, and a set of measurements. What I'm trying to showcase here is edits on Wikipedia. Every time someone makes an edit on Wikipedia, an event gets generated, and I collected these events and put them into a UI just to showcase some of the workflows you can do. So there are attributes around the edit, like the page being edited, the time the edit occurred, and the user doing the edit, and then there are your measurements, which are the number of edits, the number of characters added and deleted, and so on. As I mentioned, most of the time what you want to do with this data is filter it in some way, or group on the attributes in some way, to get some interesting information. So right now we're looking at the last... no, let's look at the last seven days. There were about 2.8 million edits on Wikipedia, with about 1.1 billion characters added and 37 million characters deleted. If we want to look at how these edits were trending across time, we can do that, and maybe we just want to pick some arbitrary time range here. If we want to look at the top pages being edited, that might provide some interesting insight, and we see there are a lot of different types of things you can edit on Wikipedia. Maybe we just want to filter on the articles being edited. So in this arbitrary range of time we picked, we see Deaths in 2017 is pretty prominent up here, along with the Champions League, North Korean dictators, and DeMarcus Cousins from the NBA. So maybe we want to break this down and understand a little bit better: who are the top users making some of these edits?
We can do that for Deaths in 2017, for example. This particular person has been doing a lot of edits in this arbitrary range of time we selected, and maybe we want to go a little further in our analysis. Say we want to filter down to users who are not bots; we can look at all the users who are not bots, and this one particular user editing Deaths in 2017 doesn't seem to be a bot. Maybe we want to look at what other pages this user has been editing over a span of time, and it seems like this is the only one. So this is the type of workflow I'm trying to demonstrate: taking an event stream, breaking it down, grouping it, filtering it, looking at various metrics, and trying to expose insights. It's something very common you do with really any activity stream, and for the end user, most of the time they access this data either through some command line or through some sort of UI or application.

So when we go about trying to make sense of a lot of IoT data, I think there are a couple of different problems to solve. When there's a low volume of data, the problems are very easy, and you don't see a whole lot of talks describing various technologies to use. But with device data, especially with the current growth of data, even problems that seem trivial at a small scale can become very difficult when you have a very large event stream. I think there are three main problems people generally try to solve.

The first is around event delivery. When an event gets generated by a device, or as part of some activity stream, you have to deliver it from where it is created to some place where it can be consumed and analyzed, and just getting an event from one place to another at very large scale can be a pretty difficult problem.

The second problem you face when dealing with large volumes of event streams is around processing the events. Raw data is often not very useful; there are a lot of imperfections in it, and a lot of caveats to working with it. To make that data useful and more consumable by analysts and users, you generally have to process the events: cleaning the data, adding business logic, potentially transforming it in some way.

And the third problem is taking this processed data and making it available for queries and for applications, so people can analyze it and gain insights from it. As I mentioned, I think each of these problems is very difficult, and it's very difficult to find a single solution that solves all three. So for this talk I'm going to talk about three separate systems, and why I think each is particularly good at solving one of these problems. In the model I described, you have data getting emitted from devices or activity streams, and at the end of the day you want applications or users making use of this data. In between, I've broken the problem down into three main pieces: the first is delivery, the second is processing, and the third is querying.

First, I'm going to talk about data delivery. This is the problem of getting events from where they're produced to something that can consume them and do something further with them. On one side of the delivery system you have a set of data producers, and on the other side you have a set of data consumers. There are a couple of different problems associated with data delivery at scale; most of them relate to maintaining high availability in the face of different types of failures, and to having a very fast and scalable system to deal with high volumes of events. Failures can occur for a variety of reasons. For example, if your network is out, how do you prevent events from being dropped? If the thing consuming the data completely fails, how do you ensure events are still delivered into your systems? And what happens when you want to have multiple data consumers?
For example, if you have some highly critical data set, maybe you want many different systems to get access to that data, and at scale these can all be pretty difficult problems. Thankfully, there's a pretty good open-source system called Apache Kafka that is very good at dealing with the data delivery problem, and nowadays I think Kafka has really become the open-source standard for this problem. If you've never heard of Kafka before, it was initially built and open-sourced at LinkedIn, and since then its open-source community has grown very rapidly, so there are many companies using it to handle vast amounts of activity streams. The way Kafka works is that there's a notion of producers and consumers. Kafka comes with a producer library that you can embed in your application, or in whatever produces data, and a consumer library that you can use to pull data from Kafka. In between there's a set of Kafka brokers, and the brokers store events in a distributed message log, or message queue. The idea is that producers write events to these distributed logs, and events are grouped logically as part of a topic. A topic is like a table or a data source; it's some grouping of events. Kafka is a distributed system.
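Kafka's model as described so far (partitioned topic logs, events keyed into partitions, consumers reading independently) can be caricatured in a few lines of plain Python. This is a toy in-memory model, not the real Kafka API; its only point is to show why keeping read offsets on the consumer side makes extra consumers cheap:

```python
# A toy model of Kafka's design: a topic is a set of append-only
# partition logs, and each consumer tracks its own read offset per
# partition. The "broker" stays stateless about who has read what.
class Topic:
    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, message):
        # Events with the same key always land in the same partition.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)  # consumer-side state

    def poll(self):
        batch = []
        for p, log in enumerate(self.topic.partitions):
            batch.extend(log[self.offsets[p]:])  # read everything new
            self.offsets[p] = len(log)           # advance our own offset
        return batch

topic = Topic()
for i in range(5):
    topic.produce(key=f"device-{i}", message=f"event-{i}")

# Two independent consumers each see every message, because offsets
# live with the consumer, not with the broker.
a, b = Consumer(topic), Consumer(topic)
print(sorted(a.poll()) == sorted(b.poll()) == [f"event-{i}" for i in range(5)])  # True
```

If a consumer disappears and comes back, it just resumes from its stored offsets, which is the buffering-and-replay behavior described below.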
It's designed to run across many servers: producers write events, those events get spread across different partitions on different servers, and consumers then pull the data from these partitions, read it, and do something interesting with it. What's a bit different about Kafka is that consumers are responsible for maintaining information about which messages they've read. Kafka itself doesn't keep track of how far a consumer has read, so in that sense there's very little overhead to adding many consumers of the data. All Kafka does, very simply, is buffer data and allow a bunch of different consumers to read it. So, as a summary: Kafka is a high-throughput event delivery system. It provides at-least-once delivery guarantees, which means that if you produce an event and transmit it, it's guaranteed to be delivered. It might be delivered more than once, so you might get a duplicate event here or there, but it will be delivered. I know the Kafka folks are working towards exactly-once delivery of events, which is a very difficult problem with streams. It has a very simple, straightforward design: it's basically just a distributed log that you write events to, and its main purpose is to buffer incoming data so consumers have time to consume it. In this model you have a logical separation between the things that produce data and the things that consume it, so if the consumers all fail, the producers can still write data to this intermediate buffer, and if a consumer comes back online it can read messages from any particular offset in these buffers. So I think Kafka is really great just for getting events from one place to another.

The piece I think makes a lot of sense after Kafka is the stream processing piece, and the purpose of stream processing is to transform or modify raw data in such a way that it's more easily consumable by other systems. Raw data often has many imperfections: it might have null values, it might have random IDs you need to replace with human-readable strings. There are a lot of things you have to do with raw data before it's usable, and what stream processing systems are designed to do is take a stream of events and transform it in some way. In the open-source world there are actually many different stream processors, all with various trade-offs. I'm not going to go into details about how all of them work; I'm just focused on the high-level idea. But among the stream processors out there, there's Spark Streaming, Apache Flink, Apache Storm, Apache Apex, Apache Samza, and Kafka Streams, and it seems like more are coming up every other day. The main challenges involved in processing and transforming a stream are the same as for any system dealing with a high volume of data: the system needs to be highly available in the face of different types of failures, and it needs to be scalable, able to handle massive event streams and do something interesting with them. The way most stream processors work is that you transform data in a series of stages, because it's actually very difficult to have one big job that takes your raw data and makes it something useful. So instead you have many small jobs, and each of them modifies the stream in a different way.

I've used a couple of different stream processors, but the one I like best architecturally is Apache Samza. Samza was also first developed and open-sourced by LinkedIn, and it's probably a little less popular than Spark and some of the others, but I think architecturally it has some really nice properties. The way Samza works is that you have an input stream of data, and this could be data in Kafka. Samza pulls that data and applies a series of transformations to it, and these transformations are called tasks. It's up to the user to write the logic for a task to do something interesting with the stream. You can have the same stream go to multiple tasks, and more than one task can write to an output stream, and that output stream might be stored in a system like Kafka as well. And instead of one transform task, you might have a series of tasks that modify the stream in different ways; later in the talk I'll give you an example of this for a real-world application. So Samza breaks processing logic up into a bunch of logical stages, or tasks, and what's really nice about Samza is that for each task you write to process the data, you can tune different resource requirements. Tasks involving simple operations don't require a lot of resources, while tasks that do more complex transforms of the data might require much more, and in that sense I think Samza has really nice operational properties for transforming data.

So the third and final piece I want to talk about is this: you've taken your data from where it's created, and you've delivered it.
You've transformed it, you've done something interesting with it, and now you want to query it. The querying system probably has the most complex requirements, and also the largest number of different choices you can use. Usually what I see people want to do with a very large stream of data is issue very interactive queries. If you're accessing this data through an application, if you're slicing and dicing the data, you generally want queries to complete very quickly. However, the data might have complexity that prevents queries from completing quickly. Complexity in the data might mean very high cardinality dimensions: you might have a dimension with tens of thousands or even hundreds of millions of unique values, and sometimes you want to do operations across those values that can be very slow. As I showed earlier in my demo, oftentimes you also want to do ad hoc analysis. When you're looking at a stream of data you might not always know what you're looking for; you might just be looking at a spike or a drop and trying to understand why that spike or drop occurred, analyzing the root cause of some pattern you've seen. Another challenge is that a lot of traditional querying systems, especially databases, tend to be designed for batch loads and are a little less designed for loading a massive stream of, potentially, device data. And once again, because we're dealing with very high volume event streams, high availability and scale are always challenges in the background. But I do think that once you're able to solve some of these challenges, you can remove a lot of barriers to people understanding their data. Not everyone might feel this way, but I think sub-second queries are very important, because they allow for iterative exploration of data. You look at one view, you might see something interesting; you look at another view based on what you saw previously. It's a very iterative process of asking questions, getting answers, asking more questions, and trying to rapidly iterate and find the root cause of a situation.

To address some of these challenges, the third open-source system I want to talk about is Druid. Druid is a system I work on, so that's why I'm probably going to plug it the most during this talk. Druid is a column-oriented data store, which just means your data is stored in individually typed columns: each column has a type associated with it, whether it's a string, a number, and so on. Druid is very much designed for sub-second ad hoc queries, and it supports both exact and approximate algorithms; some of the approximate algorithms are there to make certain workloads complete more quickly. Druid is designed to work with other streaming systems: it works directly with Kafka if you don't want to process your data before you visualize it, and it works with Samza and many other stream processors as well, so at the end of your stream processing job you can feed that data into Druid. And Druid doesn't just deal with streams.
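Stepping back to the column-oriented point for a moment, the idea can be illustrated with a toy sketch. This is not Druid's actual segment format; it only shows why typed columns, dictionary encoding, and an inverted index make filtered aggregations fast, since a query touches one compact numeric array instead of whole rows:

```python
# Row-oriented input (page names and characters added are illustrative).
rows = [
    ("wikipedia", "Deaths_in_2017", 1200),
    ("wikipedia", "Kafka",            45),
    ("wikipedia", "Deaths_in_2017",  300),
]

# Dictionary-encode the string column: store small integer ids instead
# of repeating the strings themselves.
page_dict = {}
page_ids = []
for _, page, _ in rows:
    page_ids.append(page_dict.setdefault(page, len(page_dict)))

# The numeric measurement lives in its own contiguous column.
chars_added = [r[2] for r in rows]

# A simple inverted index: value id -> row positions, for fast filtering.
inverted = {}
for row, vid in enumerate(page_ids):
    inverted.setdefault(vid, []).append(row)

# Query: SUM(chars_added) WHERE page = 'Deaths_in_2017'
target = page_dict["Deaths_in_2017"]
total = sum(chars_added[row] for row in inverted[target])
print(total)  # 1500
```

Real column stores add compression and bitmap indexes on top of this, but the access pattern is the same: resolve the filter through the index, then scan only the needed column.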
It doesn't just deal with recent incoming data; it can also keep years of historical data, and there are a bunch of different companies that use Druid in production, both large and small. All right, moving on. Here's a very high-level glance at how Druid works, similar to the technical overviews of the other systems. Druid partitions data first based on time, and these time-partitioned shards are called segments. These segments are actually immutable, so Druid is very much a system designed to deal with a constant stream of events. Druid maintains a global index of time interval to shards: each query in Druid has a notion of time associated with it, so if you query for a week's worth of data, that week's worth of data might correspond to a couple of different shards, and Druid maintains an index of how different shards map to different time intervals. Within each shard, the data is stored in a column-oriented fashion and then compressed, similar to other column stores out there, and each shard also contains several different types of indexes for very fast filtering. So in the demo I showed, when you want to filter on whether or not users are robots, or do very fast groupings, that demo is actually being powered by the stack I'm talking about right now.

One other nice property of Druid is that it supports different types of approximate algorithms. There are certain operations you might want to do with an event stream that are very difficult or very expensive to do exactly. Take distinct counts, for example: if you want to count the number of unique device IDs, or the number of unique users, having to store every single device ID or user ID is very expensive. Especially once you have tens or hundreds of millions of users, it gets very expensive very quickly to store all that information just to do a distinct count. There's a popular algorithm called HyperLogLog, which a lot of databases have nowadays, that allows for estimation of the distinct count without having to store every single unique ID. Druid also supports other approximation algorithms, like TopN, which is an approximate ranking by a chosen measurement; it supports approximate histograms and quantiles; and it supports approximate set operations, so you can do unions, intersections, differences, that sort of thing with Druid. The architecture of the system looks basically like this. The files part is not that interesting for this talk; the streams are really what we're focused on.
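As an aside, the HyperLogLog estimation mentioned a moment ago can be sketched quite compactly. This is a simplified toy, not production code and not Druid's implementation; real versions add bias corrections and sparse register encodings. The idea is that the maximum number of leading zero bits seen in hashed values reveals the rough number of distinct items, so a few kilobytes of registers stand in for millions of IDs:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct counts using m small registers."""

    def __init__(self, b=10):
        self.b = b                      # first b hash bits pick a register
        self.m = 1 << b                 # number of registers (1024 here)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def add(self, item):
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.b)                       # which register
        rest = h & ((1 << (64 - self.b)) - 1)          # remaining bits
        rank = (64 - self.b) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        e = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:                # small-range correction
            e = self.m * math.log(self.m / zeros)
        return e

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"device-{i}")
    hll.add(f"device-{i}")  # duplicates don't move the estimate

print(round(hll.estimate()))  # ~100000, typically within a few percent
```

With b=10 the standard error is about 1.04/sqrt(1024), roughly 3 percent, using about a kilobyte of register state instead of storing 100,000 IDs.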
So imagine that the input of this is the output of a stream processor. What Druid does with the stream is load it across a set of processes called indexers. These take in the stream of data and create those Druid shards, which we call segments, and the segments are then loaded across a set of processes called historicals. The idea is that the indexers only deal with a very small window of incoming data, maybe about an hour's worth, which they buffer up, turn into these partitions, and hand off to the historicals, and the historicals deal with anything older than an hour, up to years of data.

[Audience question] Right, so it varies from system to system, but in Samza, for example, if I go back to here, each of the tasks can be containerized, and they can run in a system like Mesos or Kubernetes, so basically your processing logic all lives within a container. Does that make sense? And that's actually within Druid as well.

[Audience question] Yeah, that has never been done. Most of the time the types are things like an integer versus a string versus something else. If one of the types was a container I would have to think a little more about whether that makes sense, but usually the types describe the schema of the data, what the underlying data is. If one of the types was something like MPEG, though, that might be possible.

[Audience question] Yes, I mean, the workflows I described are a lot more around statistical analysis, so finding averages, means, aggregations, that sort of stuff, off a stream of data.

But just to cover the last bit of the architecture here: within Druid there's a third process called a broker, and the brokers do query scatter-gather. Queries go to the broker, and the broker fans out queries to either the indexers or the historicals. The indexers hold recent incoming data and the historicals hold historical data, so the brokers have a merged view of both real-time and historical data, and that's what gets returned to a caller or an application.

So when you put these three systems together, as an overview, what you have is three separate open-source systems handling three separate problems in dealing with streaming data. Data comes in and goes to Kafka. Kafka can deliver it to the processing piece, which is Samza, and in the very last stage of your data processing you can tell Samza to send the output to Druid. Another way of doing it is that at the end of processing the data with Samza, you write it back to Kafka, and Druid can pull the data from Kafka as well; Druid is designed to ingest streams from a lot of these other streaming systems.

So I thought it might be interesting to go through a real-world example to cover how this works in practice. For this data set I actually did not use IoT data; I used data that's more commonly found in my field, which is advertising data, because, like half the companies in Silicon Valley, we basically make money through data that looks like this. With this data there are two streams: an impression stream and a click stream. Impressions, for those of you familiar with advertising, are people viewing an ad, and clicks are people clicking an ad. What we want to do with this data, through the stack we've built up, is create enhanced impressions, which means that for a given impression we want to know: did someone click on this impression or not? There are a couple of different steps required to process the data and get it into a shape that's a little more usable. The first step is to join our impression stream with our click stream. With a traditional database you would be able to do this join at query time; however, because we're dealing with massive event streams, these might be billions or trillions of events, and trying to do that join at query time can be extremely expensive. So we're going to do this join at the stream processing level, and after that there will be a couple of different steps to clean up the data and make it a little more user-friendly before we load it into our querying system. So the idea is: this impression stream and this click stream are server log data that might get generated on servers. We first write this data into two separate topics in Kafka, one topic called impressions and one called clicks, coming from our ad servers, for example. We want to take this data.
We want to enhance it, turn it into a single stream, and then make it visible in Druid. The raw data looks something like this: in your impression stream you have the ID of some ad, you have the publisher where that ad was shown, and your stream is divided into a set of partitions. If you recall what I said about Kafka, Kafka partitions data across many different servers, so we have many different partitions containing our impression stream, and also our click stream. The events we want to join are these two: one is someone viewing an ad, and one is, some time later, someone clicking that ad. This is where a stream processor comes in, and what our stream processor does is create a series of jobs to join these streams. The first stage is a task called the shuffle step, and the shuffle step loads data from the impressions and clicks streams. The idea of the shuffle step is to rework the data in the partitions so that the events to be joined end up in the same partition, because this is what makes it possible to actually do the join later on. So we do a shuffle, after which the impression and the click we want to join are both in partition zero. At the end of the shuffle phase we write another stream to Kafka, the shuffle topic, so now we're at three topics. The purpose of the shuffle topic is that we're going to create another job in Samza which reads from the shuffle topic, does something with it, and writes back to Kafka to create a new topic called the join topic. What's happening there is the actual join of the two events we wanted to join.
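The shuffle-then-join pipeline just outlined can be sketched in plain Python. This is a toy, not Samza code; the field names like ad_id and is_clicked, the hash function, and the two-partition setup are all illustrative, but the two stages mirror the ones described:

```python
NUM_PARTITIONS = 2

impressions = [{"ad_id": "ad-1", "publisher": "site-a"},
               {"ad_id": "ad-2", "publisher": "site-b"}]
clicks = [{"ad_id": "ad-1"}]  # only ad-1 was clicked

# Shuffle step: route every event to a partition keyed on ad_id, so an
# impression and its click are guaranteed to land in the same partition.
shuffled = [[] for _ in range(NUM_PARTITIONS)]
for kind, stream in (("impression", impressions), ("click", clicks)):
    for event in stream:
        p = sum(event["ad_id"].encode()) % NUM_PARTITIONS  # toy key hash
        shuffled[p].append((kind, event))

# Join step: within each partition, drop the click event and fold it
# into the matching impression as an is_clicked flag.
joined = []
for partition in shuffled:
    clicked_ids = {e["ad_id"] for kind, e in partition if kind == "click"}
    for kind, e in partition:
        if kind == "impression":
            joined.append({**e, "is_clicked": e["ad_id"] in clicked_ids})

joined.sort(key=lambda e: e["ad_id"])
print(joined)
# [{'ad_id': 'ad-1', 'publisher': 'site-a', 'is_clicked': True},
#  {'ad_id': 'ad-2', 'publisher': 'site-b', 'is_clicked': False}]
```

Because each partition is joined independently, the work parallelizes across partitions, which is exactly why the shuffle has to co-locate matching events first.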
We do the actual join by removing one of the events, in this case the click event, and adding a new field to our impression event: is_clicked. So the idea is there's now a new field called is_clicked, and if a join occurred we mark it as true; if it did not occur we mark it as false. Events that are joined are then written to Kafka under this new topic, called joined. After that, we take this joined stream and we can run additional tasks, additional processing, on top of it. We might add further business logic: let's replace the nulls with a default value, let's take IDs and convert them into human-readable strings. And then that single stream is the thing that ultimately gets pushed into Druid, and queries and applications go through Druid.

To summarize: all the technologies I talked about are open source. Each of these projects has its own project web page, and you can just download any of them. The three I talked about work out of the box with one another, so you can download them, install them, play with them, load your own event streams, and try things out for yourself. Okay, so what I hope you've gotten out of this talk is that managing IoT data requires dedicated components that are targeted at solving very specific problems. The three problems I talked about: one is data delivery, the second is data processing, and the third is a system for queries. I think Apache Kafka is a great system for event delivery, I think Apache Samza is a very useful system for stream processing, and I think Druid is one of the best-in-class systems for interactive exploration of streams. Cool, so that concludes my talk, and I'm happy to answer any questions at this time.

[Audience question] Yes, it is, yeah, definitely.
Yeah, it is so I actually have seen applications of this stack I mean this stack this run in production at a whole bunch of different types of companies I have seen Application of machine learning at some companies. I'm not sure if I can say their name or not But one of the use cases I've heard about is like just like behavioral analytics So let's say if you have customers using like 20 different products and you want to start doing correlations across like this Across different data sources understanding how one customer is using one product and another product and a third product And are these like all the same customers or not what kind of like what how are they using that the different products? My company offers that's one application of machine learning that I've seen Yeah Right So that's what I was getting out of seen this used where Say snap your window to show me everything that's happened in the last You know 90 seconds. Yeah, so what you're driving off to an alarm and as soon as alarms it then show me everything It's happened over an hour or find patterns. How many times it just happened in the last 24 hours and Literally run it through a machine learning algorithm to find a pattern, right? 
So yeah, with regards to the application of machine learning to this: the stack I described is all for real-time data. The latency from when an event is produced to when it's explorable is milliseconds, so it is all very low latency. The stack I described actually does both the streaming real-time component and the historical component, and that alert piece is something I have seen before. How it usually works is that people try to automate spike detection or anomaly detection. There's always the approach of "alert me if X value exceeds Y threshold", but that's a lot less interesting than some of the other applications I've seen, which is: let's take all my historical data, compare what's happening in the last 90 seconds against all that historical data, and if some factor is a significant amount above what I've seen historically, then immediately alert. There are interesting challenges there, because a lot of data patterns can be sinusoidal: the middle of the day can be very different from the end of the day, and the end of a quarter can be very different from the middle of a quarter. But that's why I think it's actually important to have the historical piece there, so you can look back over, say, the last five quarters and ask: is this actually an anomaly or not?
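The "compare the last 90 seconds against history" idea can be sketched as a simple deviation test. This is only an illustration of the concept: the z-score approach and the threshold of 3 standard deviations are my assumptions, not something prescribed by the talk, and it deliberately ignores the seasonality problem mentioned above.

```python
from statistics import mean, stdev

def is_anomalous(history, recent_value, threshold=3.0):
    """Flag recent_value if it sits more than `threshold` standard
    deviations above the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No historical variance: any deviation is anomalous.
        return recent_value != mu
    return (recent_value - mu) / sigma > threshold
```

To cope with the sinusoidal patterns the speaker mentions, `history` would need to be drawn from comparable periods (the same hour of day, the same point in the quarter) rather than from all past data indiscriminately.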
But I've seen pretty interesting work being done there in trying to automate anomaly detection.

Yes, it is being used in DevOps, so both, actually: alerting, and a lot of the solution design for IoT makes a lot of sense in the DevOps world. One example of why I think the stack is pretty good is that it can do very flexible, ad hoc slice-and-dice analytics. The application of that is when you see a weird spike in your DevOps data: the most immediate question is what's causing that spike, and it might not be immediately obvious. You need to break down your data and view it from a lot of different angles before you find the cause of that spike.

Exactly. Right. Right, cool, any other questions? There is actually an example here for GitHub. There's US EPA data, which is your classic sensor data, and then I've been loading GitHub events as well, so you can start looking at, for example, the top... these are all open-source GitHub events. But let me see if my... oh man, is the display not working? Let me unplug it and try plugging it back in.

Okay, so this is GitHub data I've actually been loading. What I'm showcasing right now is: over, say, the last seven days, who are the top organizations that have been contributing to open-source GitHub? Microsoft and the Apache Software Foundation are very high up here. You can break this down by different types of repositories as well. So Kubernetes is here, Google, Facebook; they're all pretty big contributors to GitHub. But taking an activity stream from GitHub and analyzing and breaking it down is, I think, pretty interesting. So here you can see Microsoft over the last week, looking at VS Code, AirSim, TypeScript; for Apache, the top Apache projects: here's Kafka, here's Spark, here's Flink.
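A "top organizations over the last seven days" view like the one in the demo could be expressed as a Druid native topN query. The query shape below (queryType, dimension, metric, threshold, aggregations) follows Druid's native query API, but the datasource name, dimension name, metric name, and interval are hypothetical; a real setup would POST this JSON to a Druid broker.

```python
import json

# Hypothetical Druid native topN query: top GitHub orgs by event count.
# "github-events", "org", "events", and the interval are illustrative names,
# not values from the talk.
top_orgs_query = {
    "queryType": "topN",
    "dataSource": "github-events",
    "dimension": "org",
    "metric": "events",
    "threshold": 10,
    "granularity": "all",
    "intervals": ["2017-05-01/2017-05-08"],  # "last seven days"
    "aggregations": [{"type": "count", "name": "events"}],
}

# Serialize to the JSON body that would be POSTed to the broker.
query_body = json.dumps(top_orgs_query, indent=2)
```

Breaking the result down further (say, top repositories within one organization) would be the same query with a filter on `org` and `repo` as the dimension.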
This is the stream processor, this is a batch and stream processor that can do some other things as well, and this is your event delivery piece.

Yeah, the UI, or the logic to... So Samza is open source; it's under the Apache Software Foundation. As for the logic, most of the time the organization that's loading these streams writes it themselves. I think Samza and some of these other stream processors probably have some default transformations that come out of the box, but yeah, oftentimes it's custom business logic that the organization writes.

Yeah, yeah, it is. But underneath this, and this is just an example UI, underneath the seams it's the same stack. So there's some pretty interesting stuff you can do with this. If you want to look at the Apache Software Foundation, you can do that; if you want to look at, say, Apache Kafka, you can see who's contributing to it. I know this person; I don't know who the other people are. But this is an example of how you can start analyzing and slicing and dicing what people are doing on GitHub, and it's a good way of following which open-source projects are popular.

Cool, any other questions? I'm not sure; I'd have to send it to people. Yeah, but definitely I will upload it, and it has no... Everything I talked about is open source, you can download it, and as I mentioned, even hooking the components up to one another should all work out of the box.

Cool, other questions? All right, thanks guys.