Thanks for coming. This is the second lecture in our time-series database lecture series. Today we have Karthik from Streamlio. Prior to starting that startup five months ago, he was the program manager of the Heron project at Twitter — that's the new streaming system that Twitter developed. And prior to that, he had a startup with Jignesh Patel from Wisconsin called Locomatix, which was bought by Twitter, and that's how he ended up there. Karthik has been well known in the database community for a while, because he did his PhD at the University of Wisconsin with Jeff Naughton and the database people there. So we're super happy to have him here to talk about this new system.

Thank you. Thanks, Andy, for the short intro, and also thanks for setting this up. It's a pleasure to be here. This is my second visit, I believe — the last time I visited was around 2014, on a recruiting trip, when I gave a talk in the CS department and grabbed a few students. Unfortunately, they didn't join. Anyway, the title of my talk is Autopiloting Real-Time Processing with Heron. Heron is one of the systems we developed at Twitter while I was there, before I started a company based on Heron and a few other things. This is joint work with Microsoft and Twitter folks: Avrilia was leading it on the Microsoft side, and Bill and I were leading it on the Twitter side. The talk is mostly about the practical experience we gained day to day operating real-time systems at Twitter — the problems we found and how we try to solve them in an automated fashion.

So what is autopiloting? (Sorry, the animation seems to be missing.) According to Wikipedia, an autopilot is a system used to control the trajectory of an aircraft without constant hands-on control by a human operator. Tweaking that a little for our setting, autopiloting a real-time system means its ability to adapt itself as its environment changes, without any manual intervention. That is the whole idea.

Why do we need autopiloting? In real-time systems, you have to look at how the value of data for decision-making decreases over time: it is highest the instant the data is produced, and as the data ages, the value goes down. When the data is at its highest value, you want to make time-critical decisions. For example, in a financial trading application, if a stock price exceeds a certain limit you want to execute an action — sell the stock or buy it. In order to execute these time-critical decisions, your systems have to run continuously in some kind of autopilot mode, without manual intervention, because the moment manual intervention is needed, there is a chance you lose the freshness of the data and the action is badly delayed.

The second reason for autopiloting, from an enterprise perspective, is loss of revenue. Think about the downtime impact during popular events like the Super Bowl or the Oscars: Twitter makes most of its money from ads during that four-hour window.
During that window, even a few moments of downtime can translate into millions of dollars of lost revenue. And of course there are SLA violations: Twitter typically has deals with its ad partners — the partners generating revenue — where if the system is down for some amount of time, you have to pay a hefty penalty. We want to avoid that. And finally, quality of life, which is actually very important: if pagers are constantly going off in the middle of the night, at 2 or 3 AM, or while an engineer is enjoying the Oscars, you want to reduce the incident rate, because that improves the quality of life of the people working on these systems. Related to that is increased productivity: when engineers aren't constantly firefighting incidents alongside the SREs, they can build good features and make the product better and better. So that's another reason an autopiloting system is required.

Before I go into what I mean by autopiloting and the issues we encountered in practice, I want to give an introduction to Heron. How many of you actually know about Heron? OK — so I think it makes sense to spend a few slides on what Heron is. Heron is a streaming platform for processing real-time data as it arrives, so that you can react to it: you read the data as it happens, act on it, and make whatever decisions you need based on the real-time data.

It provides a number of properties. First, guaranteed message delivery: in streaming systems there are three delivery guarantees we need to provide. At most once is best-effort processing. At least once guarantees the data is processed at least once, but it could be processed more than once. And exactly once essentially means the data is processed only once — which comes with its own costs. The second property is scalability: by the time I left Twitter, Heron was running on 1000-plus nodes, and if capacity is filling up, you just keep adding nodes and the software automatically takes the new nodes into account and starts spawning jobs on them. Third, it has to have robust fault tolerance, in the sense that in the presence of process failures and machine failures the software keeps going — again without any manual intervention. And finally, it has concise ways to express computation at multiple levels — a low level, a functional level, and a declarative level — so it becomes easy to write your computation logic and submit it.

To introduce a little terminology: a Heron streaming job is called a topology, and a topology is a DAG with two types of vertices. Spouts tap into the sources of data and inject that data into the topology. Bolts are the other type of vertex; they are the computational elements that take the input data, do some computation on it, and emit outgoing data.
Some examples of that computation include filtering, aggregation, and joins — typical database functions — or even arbitrary machine learning functions like clustering or association rule mining. The edges between the vertices represent streams of data flowing from one vertex to another. As an example of what a topology looks like: you can have a couple of spouts feeding a set of bolts in the next stage, and those bolts in turn transform the data in some form and pass it on to the next stage of bolts. One big difference between MapReduce and Heron topologies is that a Heron topology can have an arbitrary number of stages, whereas MapReduce is essentially two stages — map and reduce — and adding another map and reduce means a second job. Here you can have as many stages as you want.

Now, when this topology physically executes: as you all know, data growth is high — enterprises grow their data by roughly 40% every year — and because of that you can easily exceed what you can process on a single machine. That means you have to go distributed and do some kind of partitioning. When a Heron topology executes on the underlying hardware, each spout and bolt has a notion of parallelism: each one can have multiple instances running on different nodes so they can sustain the rate at which data is being generated. If data is being generated at, say, one terabyte per second, with enough partitions you can keep processing it. This is what we call parallelism; the spouts and bolts are sometimes referred to as components, and component parallelism is the number of instances you run per vertex.

With this parallelism in place, the next question is how data emanating from one spout instance gets to a particular bolt instance, and so on. That is where the notion of grouping comes in. Heron supports several groupings. Shuffle grouping sends a tuple to any downstream bolt instance. Fields grouping is similar to hash partitioning: you pick a few fields, compute a hash value over them, and the tuple goes to whichever destination instance that hash value maps to. All grouping replicates the stream to every downstream instance. And global grouping sends the entire stream to a single task. Going back to the physical execution and annotating it with groupings, you can have a topology where spout 1 feeds bolt 1 using shuffle grouping, spout 2 feeds bolt 3 using fields grouping, and so on, in any combination. This is what lets you route the data in different ways.

So how do you write a Heron topology? There are multiple ways. The first is the procedural, low-level API, where you write your code directly as spouts and bolts. This is useful in several situations — for example, people who process video data break the video into smaller segments and push them through multiple stages.
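To make the low-level API concrete, here is a minimal sketch of building and submitting a topology with explicit parallelism and groupings. It follows the Storm-compatible builder style that Heron exposes; the TweetSpout, SplitterBolt, and CounterBolt classes are assumed user-defined classes, not part of Heron, and the package names reflect the Twitter-era API.

```java
// Illustrative sketch of the low-level spout/bolt API (Storm-compatible builder).
// TweetSpout, SplitterBolt, and CounterBolt are assumed user classes defined elsewhere.
import com.twitter.heron.api.Config;
import com.twitter.heron.api.HeronSubmitter;
import com.twitter.heron.api.topology.TopologyBuilder;
import com.twitter.heron.api.tuple.Fields;

public class WordCountTopology {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();

    // Component parallelism: 2 spout instances, 3 splitter instances, 4 counter instances.
    builder.setSpout("tweets", new TweetSpout(), 2);

    // Shuffle grouping: each tuple goes to a random downstream splitter instance.
    builder.setBolt("splitter", new SplitterBolt(), 3)
           .shuffleGrouping("tweets");

    // Fields grouping: hash-partition on "word" so the same word
    // always lands on the same counter instance.
    builder.setBolt("counter", new CounterBolt(), 4)
           .fieldsGrouping("splitter", new Fields("word"));

    HeronSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
  }
}
```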
The second level is the functional API, which is more about maps, flatMaps, transform functions, and windows — a more Scala-like, functional style. Then there is the declarative level, which is SQL; we haven't finished it yet, but it's coming — you just say what you want and the system figures out how to run it.

With that, let me go over the high-level Heron architecture so the rest of the talk is easier to follow. Heron per se does not have a scheduler of its own. We designed it that way because in the open-source community the scheduler ecosystem is pretty strong and there are already a lot of good-quality schedulers available: Kubernetes has huge momentum behind it, Mesos is another scheduler with a lot of momentum, and YARN is another. So instead of reinventing the wheel and maintaining yet another scheduler, we decided to piggyback on an existing one, so that when you submit a Heron job it just looks like another job to whatever scheduler you are running.

One of the nice things about the Heron architecture is that it is extensible and completely modularized. The reason we modularized it is that in the big data ecosystem things change constantly: people come up with new schedulers, with new ZooKeeper-like coordination software, and the storage layer changes — today it's HDFS, tomorrow it's Ceph or GFS or some other file system. How do you make sure your code does not have to change in an environment that is continuously changing? Heron has this modularization so that you can write a plugin for each of these components and Heron automatically takes it into account. That way, unlike other popular open-source systems like Spark and Storm, which have different code bases for different schedulers, Heron has one single code base that accommodates all these plugins: you just turn the plugin on at installation time and it takes care of everything.

Anyway, coming back to the architecture: because Heron has no scheduler of its own, a Heron job looks like just another job to the scheduler. The advantage is that if you already have a big Mesos or Kubernetes cluster running, all you have to do is compile your job and launch it, and it runs like any other Kubernetes job.

Going into a bit more detail on the components of a running topology: Heron runs in terms of what we call containers. A Heron job is nothing but a set of containers, and the containers can be mapped onto Linux cgroups or onto Docker containers — both are possible. There is a special container called the master container, and within it there is a process called the topology master, which governs the entire running topology. The rest of the containers are called data containers, where the actual data is processed, and there can be many of them — hundreds, even thousands. We have pushed a single job up to around 1000 containers so far, and we could definitely push it to a few thousand.
The main limitation there is the N-squared socket problem: with 1000 containers, each container has to connect to all the other containers so they can exchange data.

Before going into those details, let me describe what is inside a container. The data containers consist of several processes. One of the critical ones is the stream manager, which is responsible for routing data between data containers. The metrics manager is responsible for collecting all the metrics originating from the container, so you can do troubleshooting and have visibility into what is happening in the topology right now — this is very useful for troubleshooting. And the instances, I1, I2, I3, are the processes that actually run your spout and bolt code — the spouts and bolts we talked about a few slides ago run in these instances.

The startup sequence is as follows. When the topology starts, the master container comes up first, and based on the resources the topology requests, it asks the scheduler: can you please spawn 1000 containers for me? Once the scheduler spawns those containers, each container starts the stream manager as its first process, and the stream manager immediately tries to find where the topology master is. It discovers the topology master using ZooKeeper, which serves as a kind of service discovery: when the topology master comes up, it records that it is available on host A and port B. Each data container's stream manager looks at a fixed place in ZooKeeper, based on the topology name, and then contacts the topology master, saying "I have come up in this container." Once every container has come up and reported that it is OK, the topology master forms what we call the physical plan. The physical plan tells you where all the containers are running; each container receives it and in turn knows where the other containers are so it can connect to them — because ultimately data exchange has to happen between all the containers, there has to be some way for the containers to discover each other, and that is the physical plan. Once the physical plan is broadcast to all the containers, they connect to each other and start transmitting data.

There are multiple paths data transmission can take. Data can come out of an instance and go to an instance in the same container. To keep the software simple and predictable, instances do not interact with each other directly; the data bounces off the stream manager even when it is local to the container. So if data from I1 has to go to I4, it goes to the stream manager in that container and then back to I4. The other case is data from I1 going to I1 in another container: it goes via the local stream manager, which sends it to the other container's stream manager, which in turn delivers the data to that I1.
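As a side note, the service-discovery step at the start of this sequence can be sketched with the plain ZooKeeper client API: the topology master advertises its host and port under a topology-specific znode, and each stream manager reads that znode to find it. This is a minimal, illustrative sketch — the znode paths and payload format here are made up, and Heron's real state manager has its own layout and abstractions (the sketch also assumes the parent znodes already exist).

```java
// Illustrative sketch: the topology master advertises its location in ZooKeeper,
// and each stream manager reads it so it can register with the master.
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class TopologyMasterDiscovery {
  private static final String ZK_CONNECT = "zk1:2181,zk2:2181,zk3:2181";

  // Topology master side: write an ephemeral node so the entry disappears
  // automatically if the master container dies.
  static void advertise(String topologyName, String host, int port) throws Exception {
    ZooKeeper zk = new ZooKeeper(ZK_CONNECT, 30_000, event -> { });
    String path = "/heron/tmasters/" + topologyName;   // illustrative path; parents assumed to exist
    byte[] location = (host + ":" + port).getBytes(StandardCharsets.UTF_8);
    zk.create(path, location, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }

  // Stream manager side: look up the master's location at the fixed, per-topology path,
  // then connect and report "container X is up" so the physical plan can be formed.
  static String discover(String topologyName) throws Exception {
    ZooKeeper zk = new ZooKeeper(ZK_CONNECT, 30_000, event -> { });
    byte[] data = zk.getData("/heron/tmasters/" + topologyName, false, null);
    return new String(data, StandardCharsets.UTF_8);    // e.g. "hostA:8899"
  }
}
```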
One of the basic differences between Heron and the other streaming frameworks is that it has a process-oriented rather than a thread-oriented architecture. A process-oriented architecture gives you a lot of flexibility because of the isolation: each instance can be restarted whenever you want if it has issues, and because everything is containerized, each container can be restarted separately and one topology does not influence the other topologies you are running. This is essential in an enterprise setting, where one job should not be able to take down another — you don't want to share anything at all. Any questions so far? This part is important. Yes.

"Does Heron actually optimize the topology to reduce data movement, or push down logic?" Currently it doesn't — let me repeat the question for the video. The question is whether Heron optimizes data movement based on the topology. Currently it doesn't, because this is a fairly low-level API, but if you wanted to optimize data movement it is possible. It also depends on what we call the packing algorithm: how do you decide that this spout and this bolt go into this container, versus that spout and another spout somewhere else? The packing algorithm has a heavy influence on how data movement occurs, and you could come up with an intelligent packing algorithm that depends on the topology DAG.

"And you could also push down the bolts, right? Like in SQL you push down predicates." Yes, you could push predicate evaluation as close to the source as possible — the sources being the spouts — and reduce the volume of tuples generated downstream. The DSL we have built currently does not do that optimization; optimization is a whole area in itself and we have not touched it, so if you folks are interested in working on it, go for it. Yes.

"Could you repeat what the ZooKeeper cluster is used for?" The ZooKeeper cluster is used for discovery. When the topology master comes up, it writes its location into ZooKeeper so that the other containers can discover where the topology master is. The second usage, which I forgot to mention, is that once the physical plan is formed — when everybody has reported in and the container locations are recorded — that information is also saved in ZooKeeper. The reason we save it there is that the master container can go down at any time; if it does, and the scheduler brings up a master container on a different node, the new master can discover its state and re-bootstrap itself from wherever the old one left off.

"Is this shared across jobs?" Yes, the ZooKeeper cluster is shared. ZooKeeper is typically used by several pieces of software — Hadoop and a bunch of others use it too — and typically a dedicated ZooKeeper team maintains that cluster. Any other questions before we move further? Yes.

"What happens if a container goes down, or is misbehaving — running very slowly due to some hardware issue?" OK, so the question is what happens if a container goes down. Yes, it's a valid question.
Let's take the two cases: either the master container goes down, or one of the data containers goes down. What happens if a data container goes down? There is a heartbeat mechanism between the stream manager and the topology master, so the topology master figures out that the container has gone down. There is another heartbeat mechanism between the container and the scheduler, so the scheduler also sees that the container is down and relocates it to another machine — or possibly the same machine, whatever the case may be — and the container comes back up. The container's first job is, again, to discover where the topology master currently is. Each container has a unique ID, and when a container dies and comes back somewhere else it gets the same ID, so the topology master — which already knows it has stopped getting heartbeats from that container — sees that this container has come up in a different place. It forms a new physical plan, stores it in ZooKeeper, and broadcasts it to everybody, because the other containers that were talking to this container don't know where it is now running; then they reconnect and resume the data exchange. While all this is happening, the data destined for the failed container is buffered on the other containers, so the moment the container comes back up, that data is delivered to it and it can process it. OK, any other questions before I move forward? Good.

One of the highlights of Heron is the notion of back pressure. What happens if one of the instances is running slow? Take a simple, linear topology: a single spout feeding bolt B2, which in turn feeds B3 and B4, and so on. The physical plan, when it runs in containers, looks like this: three instances of each, with all the stream managers connected to each other as a fully connected graph, because data has to be exchanged. Now suppose B2 is going slow. The stream manager on that container will identify that it cannot push data as fast as it wants to, which means it needs to slow things down. There are multiple ways to slow down. One is, if I know everyone feeding into bolt B2, I can slow them down; or I can go to the source, because if you look at a big topology, the spouts are ultimately the ones continuously injecting data into the whole thing. We took a simplistic approach: rather than slowing down the previous stage and letting the back pressure propagate stage by stage all the way to the source, we go directly to the source and slow it down. The moment one instance is known to be slow, the stream manager sends out a kind of broadcast message.
The stream manager that is experiencing — or detecting — the back pressure sends a message to all the other stream managers saying: hey, I have a slow instance here, can everybody slow down? When the stream managers receive the back-pressure message, they check the plan to see whether their container hosts a spout, because the spouts are ultimately the sources of data. If there is a spout, they stop taking any more data from it. That is equivalent to shutting off the flow of data into the topology, so the data already in transit keeps moving at the slow instance's pace and the buffers drain before any more data is let in. When do you reopen the spouts? The stream manager has buffers with watermarks on them: if a buffer crosses the high watermark, back pressure is initiated, and once the buffer drains below the low watermark, we relieve the back pressure. The moment back pressure is relieved, every spout starts sending data again. The reason for having a high watermark and a low watermark is that you don't want oscillating behavior: if you had only a single threshold — initiate above it, relieve below it — the system can oscillate. We found all of this once we got Heron into production; that is when we added the low and high watermarks, so there is some cushion before the back pressure is actually relieved. So remember the back pressure — it will come up again.

To summarize what Heron is at Twitter: Twitter runs close to 500-plus jobs on Heron all the time, processing over 500 billion events per day, with topology latencies ranging from around 10 milliseconds to 50 milliseconds. Sample topologies range from very simple ones to quite complex ones. Do people hand-code these complex topologies at the spout and bolt level? No — there are higher-level frameworks that generate these DAGs, and when they generate them they may not even optimize them; they just generate the DAG and dump it into Heron for execution, and we are able to execute it.

This gives an idea of the Heron visualization. The reason we provide this level of visualization is that we believe in a self-service model: there are 60 teams using Heron, and we are one small team that needs to support all 60 — we cannot scale that much. So we give the teams all the necessary tools to troubleshoot their topologies themselves, and the UI is geared toward that. You can visually see your topology, how it looks and how it is mapped onto containers at runtime, and there is a dashboard that shows whether everything is green or not. If, say, GC is causing issues, it can even show whether that happened in the last 10 minutes, the last hour, or the last three hours.
And when you click on a particular red indicator, it can show which container or which instance caused the issue — and it may not be just one, it could be several — so the team can drill down to that instance, look at the logs, and see what triggered the GC or whatever else. We had to build all of this because you don't want to spend a lot of time troubleshooting other people's problems; it just doesn't scale. The UI is geared toward exactly that.

OK, so with that introduction to Heron, I want to go into the common practical issues we face in operations. There are two sets of issues: developer-facing issues and operational issues. Let's look at the developer issues first. When you write your topology, the simple question is: how do I decide that I need 10 instances of this spout and 100 instances of this bolt? That is what we call parallelism tuning. How do we do it today? It is very manual. People write the topology and run it in development mode; remember the metrics manager I mentioned, which gives you hundreds of metrics. They look at how much CPU and how much memory is being used at every stage, they look at graphs and charts and all the various things, then they come back and say, no, this is not enough, increase the parallelism, and try again. Iterating like that, it takes a week or two before they settle on an optimized resource allocation, because there are data variations over the time of day and between weekdays and weekends, and you have to optimize for those as well. So figuring out the optimal resources takes a couple of weeks of manual work. After that, they multiply it by three or five before they launch the job, because of spikes — when a spike comes, the only options are to buffer or to over-provision. When you allocate 3x or 4x the resources, those resources are wasted: I have seen people using 100 cores who had allocated around 700, which is 600 cores' worth of computation sitting idle — and it is not only the cost of the computation, but the power cost and the maintenance cost; if you do the math, it is pretty expensive. So the first question is: can we automate this? As a developer, I should be able to focus on just writing my logic. Why should I have to figure out how many resources I need? Can we make it so that I just launch the job and the system figures out what it needs?

Then there are the common operational issues, which we deal with every day. Slow hosts: in a large cluster — Twitter's compute clusters were around 200,000 machines — there will be some bad apples somewhere, so sometimes hosts will be slow, and the network will experience issues.
Then of course you get into data skew: suddenly there is a spike, some data grows organically, and you have no clue why the skew started. Twitter is known for skew, especially because it is an asymmetric medium: when Trump tweets, 30 million followers get it; Katy Perry has, I think, the highest follower count, around 100 million, so when she tweets, it suddenly fans out to 100 million. So Twitter is known for data skew, and also for load variations. Over the course of a day, incoming data increases around 10 a.m., peaks around 10:30 or so, then starts going down; around four o'clock it picks up again. So there are intra-day variations and day-to-day variations, and during the Super Bowl and the Oscars we typically get a 4x to 5x increase in traffic. And there are occasions where we have seen more. There is a Japanese cartoon movie — Twitter is very widely used in Japan — and the story is that at one particular scene, when a character says a certain phrase, everybody tweets it at the same moment; they are all just waiting to hit the send button. It is effectively a broadcast. We measured a 100x spike during that. Is the character actually talking about Twitter in the movie? I don't know — it's in Japanese; for them it seems to have become more of a cultural thing, a tradition. We should find that out. Anyway, we used to experience all kinds of issues like that.

Then finally, SLA violations: we want to make sure we don't violate any of those SLAs.

So, slow hosts — why is a host slow? Memory parity errors; a disk that is failing, or showing signs of impending failure; or simply older hardware, because in a 200,000-node cluster there will be older machines and newer machines, and if part of your container set unfortunately gets scheduled onto an older machine, it may run slow as well.

Then the network: it could be network slowness or even a network partition. Network slowness delays processing, data starts accumulating because of the delay, and the timeliness of results is affected too — you can go from milliseconds to a few seconds, which might violate SLAs. Now take network partitions. There are cases here where I have no solution at all. Let's look at the different types of network partition: what happens if the scheduler cannot talk to the topology master? What happens if the topology master cannot talk to a stream manager? What happens when two stream managers cannot talk to each other? And what happens if the scheduler and a stream manager cannot talk to each other? Those are the different partitions that can happen in the system. Let's take them one by one. The scheduler and the topology master cannot talk to each other — what happens?
The scheduler thinks the master container has died, because it is not getting any heartbeats from it, so it tries to spawn a new master container. But the currently running topology master — the one the scheduler cannot contact — holds a lock in ZooKeeper, so acquiring mastership on the topology name fails on the new master container, and the new one keeps retrying and dying. That is actually fine, because this network partition does not affect the topology that is currently running. It would only matter if the real master container died and the scheduler had to know; in this case it doesn't even need to know, because it is already spawning new master containers anyway, and one of them will pick up mastership at that point. So if the master container dies, the scheduler goes into a mode of trying to schedule again and again, until the network partition is relieved or it gets to the point where operator intervention is required.

How many of you, by the way, have seen a network partition? We see at least one per week. It happens. I think only when you experience one do you realize it, and it shows up as different symptoms coming together, with no clue that a network partition has occurred — until you dig all the way down and realize, oh, there is a network partition happening somewhere.

Now what happens when the partition is between the topology master and a stream manager? In this case the topology master thinks the data container has died and waits for the scheduler to reschedule a new data container, which never happens, because the scheduler thinks that data container is fine. So nothing is affected in this case.

Now, what happens when the partition is between two stream managers, where the actual data exchange happens? They cannot talk to each other, cannot exchange data, and data keeps accumulating in the stream manager's buffers. At some point it is going to start dropping data because the buffers overflow, and after that it is chaos. We have to detect this network partition, but in this case I have no clue what to do about it.

Then the remaining case, between the scheduler and a stream manager: a new data container gets spawned, and the topology master sees two containers reporting "I am the same guy" — because the old data container can still talk to the topology master. When a new data container comes up claiming to be one that already exists, the topology master never accepts that connection, and that way we can keep the topology running. There are pros and cons to accepting the new one versus the old one, but we decided to keep the old one, because the old one may have accumulated state that you don't want to disturb.

Now, data skew. There are two different types. One is where several keys map to a single instance and the combined counts are very high — the multi-key skew case. The other skew is based on a single key, where one key maps to a single instance and its count alone is high.
For multi-key skew, one simple way to solve the problem is this: say bolt B1 is the one seeing a lot of data coming through it — you can simply increase its parallelism on the fly. If you scale up the number of instances of that bolt alone, you should be able to absorb the increased volume of data at that stage. Now, what do you do in the single-key case? Anybody? "Cache?" A cache is a partial solution. What you typically do is add a pre-aggregation stage: when a single key is funneling into a single instance, you pre-aggregate so that the number of tuples going from that stage to the next is reduced. And what happens when the skew is only temporary — do you introduce that extra stage on and off every time, or keep it there all the time? Again, these are interesting issues.

The second and final operational issue is load variation: you can get spikes, or daily, organically varying patterns. Those are the two kinds of load variation.

So now I have enumerated the problems — these things happen, we have seen them with our own eyes — and the question is how to come up with a solution, or a framework, that can solve these problems in some fashion. Again, this is very preliminary work that we have just started, and we call it autopiloting, or self-regulating. What do we mean by autopiloting? There are a couple of broad pieces. One is how to automate the manual, time-consuming, error-prone task of tuning — remember the developer problem of tuning and allocating container resources. And what do we mean by automation? Even to automate, you need some kind of goal to satisfy, and that goal is expressed as a service level objective: this topology has to maintain this throughput irrespective of environment changes, or it has to maintain this latency independent of what the environment looks like. That objective is what the system tries to achieve at all times, retuning itself in some fashion, and maintaining that objective in the face of unpredictable load variations and hardware and software issues is what essentially constitutes autopiloting.

There are three broad capabilities underlying autopiloting. The first is self-tuning: there are several tuning knobs, and we want to reduce the time that goes into that time-consuming phase; the system should take an SLO as its objective and then automatically configure the knobs. This is very similar to some of the research the database group here is doing on automating database knobs so that you don't have to worry about tuning the database — the same idea applies to streaming systems. The second is self-stabilizing, especially because streaming jobs are long-running: there are cases where people launch a job, forget about it for six months, and come back to find nothing has happened to it. We sometimes have to go to them and say: by the way, the software has evolved over time and we have fixed a bunch of bugs —
can you please relaunch your job with the new version? Load variations are also very common, both the predictable patterns in the traffic and the occasional unexpected spikes, so the system should react to external shocks and adjust itself accordingly. And finally, self-healing: in the presence of things like software issues and hardware issues, the system should be able to identify those issues and automatically correct itself. Those are the three things autopiloting should achieve.

So how do we achieve it? Working with Microsoft, we designed a system called Dhalion. Dhalion is essentially a policy-based framework integrated into Heron: you can describe a policy for each class of issue, and Dhalion keeps evaluating that policy every few minutes or so and readjusts the topology as conditions change. In other words, it executes well-defined policies that optimize execution toward some objective. I will show a couple of policies — how to scale up and scale down, how to identify a slow host, and a few things like that. We started with just two policies, but this gives you an extensible framework where you can write more and more policies and use them to solve the other issues you come up with. For example, you could detect skew and decide what to do to resolve it; that could be a policy on its own. It is a very extensible architecture.

What are the phases of a Dhalion policy? Dhalion takes the metrics we collect — remember the metrics manager in each container, which constantly measures how each instance is performing: latency, throughput, number of tuples emitted, queue lengths; pretty much any metric you can think of is there. These metrics are continuously ingested into Dhalion, and combinations of metrics are used to detect symptoms: what symptoms is the topology experiencing right now? The symptoms are then fed into diagnosers, which take a combination of symptoms and produce, with some confidence, a diagnosis. Once you have a diagnosis, you need actions you can take to resolve it automatically. A policy can be invasive or non-invasive. An invasive policy changes something in the structure of the topology during execution, so it affects the topology in some way; a non-invasive policy does not change anything — it just reports to the user what is happening. When an invasive policy takes a resolving or corrective action, we make sure only one invasive action happens at any given time, because if multiple resolving actions run simultaneously you don't know how they interact — they might even break the topology rather than correct it. That is why invasive policies are applied one at a time.
Between two invasive actions we also leave some amount of time, so that once a policy is applied we can see whether the corrective action it took was beneficial or not. If it was not beneficial, we move on to the next one. That way you don't clobber the topology.

How was Dhalion incorporated into Heron? It is pretty straightforward. In the topology master container we have a health manager, which is essentially the Dhalion implementation. All the metrics managers pipe their metrics into the health manager — Dhalion needs all the metrics — and the health manager runs these policies every two minutes, caching the metrics and looking at what happened over the last 10 or 20 minutes so that it does not react to transient, noisy data. Then it executes the policies and takes corrective actions, and it also logs what action it is taking at any given point, so we can see which action was taken because of which diagnosis, because of which metrics — we have an explanation of why it did what it did. That is very important for troubleshooting the topology and for troubleshooting Dhalion itself. And if an action did not lead to a correction that resolved the issue, we blacklist that action — this action did not resolve what it was supposed to resolve — so we know which actions worked and which did not under those conditions, and that in turn helps Dhalion improve itself.

Let's look at a concrete example, what we call dynamic resource provisioning. This policy reacts to unexpected load variations — remember, one of the problems I mentioned was how to deal with load variations — so this policy is supposed to correct the topology when load variations occur. The goal is to scale the topology's resources up or down as needed until the topology reaches a steady state where no back pressure is observed, because back pressure points to one of three issues: a slow host, a topology that is not provisioned correctly, or some kind of data skew. The dynamic resource provisioning policy takes the metrics and runs three detectors: a pending-tuples detector, which indicates that tuples are queuing up at a bolt, for example because an instance is slow; a back-pressure detector, which checks whether any back pressure is being observed in the topology at all; and a processing-rate skew detector, which looks at how fast each instance is processing in order to spot skew. Once the detectors report symptoms — and it is possible that all three occur simultaneously — we go through the diagnosers. The diagnosis could be resource over-provisioning, where we should scale the resources down because the topology is holding more than it needs; resource under-provisioning, where we need to scale up; data skew; or slow instances — for example, when there is some correlation between the processes running on a container and everything on it is going slow, so that container has to be moved to a different machine altogether.
Once we have the appropriate diagnosis, we apply a resolver: a bolt scale-up or scale-down, a data-skew resolver, restarting the instances, or restarting the container somewhere else.

Let me give an example. Take this simple topology running at a steady state, where each splitter bolt instance is processing around 100 tuples per second and its queue size is around 20. Now take the case where the topology is under-provisioned: the spouts are receiving more data than the topology can process, and one of the instances starts experiencing back pressure because it cannot handle the data. If all of its peers are processing at roughly the same rate and have approximately the same queue size, then you know that stage as a whole is taking in more data than it can process — the topology as a whole is under-provisioned — which means we need to increase the resources.

Now go back to the steady state and look at the slow-instance case. The instance that is going slow is not able to absorb or process the data as fast as it should, so it initiates back pressure. Remember, with back pressure we clamp down on the sources directly, so the rate of tuples ingested into the topology slows down. Everybody then receives tuples at the same reduced rate, but the instance that is actually slow will have a deeper queue, and that tells you that container is experiencing slowness. "Wouldn't the processing rate on that instance still be lower?" In practice, not necessarily: once back pressure has occurred, no more data flows into the topology, and whatever data is already there gets distributed across all the instances — that is why you see the 50-50 split on both of them, and that is why I explained back pressure up front.

Now look at how the numbers appear under data skew: there you see both a higher processing rate and a deeper queue, so you know some data skew is occurring. You don't know whether it is single-key or multi-key skew, but at least you know it is there. That is how the diagnosis comes about, and once you have the diagnosis you take the appropriate action — diagnoses and actions map roughly one to one.

This is an example of another policy, but in the interest of time I am not going to go into its details. It lets you declare, for a topology: maintain this throughput for me — say 100,000 tuples per second — and make it happen regardless of how the environment changes. That is what this one does. Now, to give an idea of how it works, we tried it with one simple topology with a spout, a splitter bolt, and a count bolt.
It is a tweet spout: the tweet data gets split by the splitter bolt, and then the count bolt counts the words. This was tested on Microsoft HDInsight. The throughput of the spout is the number of tuples emitted per minute, the throughput of the bolts is the number of tuples processed over one minute, and we also track the number of Heron instances provisioned.

In this graph you have the steady state S1, with throughput normalized so that S1 is 100%. We then scale the load down by 20% and see whether Dhalion reduces the resources. Whenever the topology is reconfiguring itself, you see the throughput dip — those small downward spikes mean the topology is reconfiguring so it can come back to its normal state; during those dips the number of instances is being increased or decreased and the topology is correcting itself. From the steady state, the scale-down took some time: as the 20% reduction in load was detected, the number of instances of the splitter bolt and the count bolt decreased until the topology reached the new steady state S2. It has to go in steps — once you scale down one stage, you may be able to scale down another stage as well, because the topology is a bag of stages and, remember, only one invasive action happens at a time. So you see a first dip as one stage is adjusted, the topology stabilizes for some time, and then if there is an opportunity to go down further, it goes down further. Overall it took around 10 to 20 minutes to reach a stable state.

Then we increased the load by another 20% to 30% and watched the topology scale up and reach a stable state again. That took another 20 to 30 minutes, and it went through three scale-ups to reach the new steady state S3. So the policy is able to keep adjusting the topology as time goes on — it can auto-correct and resolve bottlenecks even in multi-stage topologies. Remember, with long pipelines it takes multiple steps to correct the topology, because correcting one stage triggers back pressure at another stage, which you then measure and correct, and so on. And there is room for a lot of improvement: for example, if you could learn that within a DAG some elements are strongly correlated, then when you scale one up you could scale the others at the same time — but we are not learning any of that yet.

This last graph — I wish I had put these two side by side — shows how the number of bolt instances increased and decreased during those state changes. The red line is the count bolt: during the scale-down, before reaching S2, it went down by a couple of instances; during the scale-up, it jumped from around eight or nine to around 14. The number of instances is being adjusted automatically — gradually scaled up and down.
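To make the detect-diagnose-resolve loop concrete, here is a hypothetical, much-simplified skeleton of what a Dhalion-style policy evaluation could look like. The interface and class names are purely illustrative — they are not the actual Dhalion or health-manager API — but they mirror the phases described above: symptom detectors over windowed metrics, diagnosers that turn symptoms into a diagnosis, and resolvers that take one invasive action at a time with a cool-down in between.

```java
// Hypothetical, simplified skeleton of a Dhalion-style policy loop.
// All names here are illustrative; the real health-manager API differs.
import java.util.List;
import java.util.Map;

interface SymptomDetector {              // e.g. back pressure, pending tuples, rate skew
  List<Symptom> detect(Map<String, Double> windowedMetrics);
}

interface Diagnoser {                    // combines symptoms into a diagnosis with some confidence
  Diagnosis diagnose(List<Symptom> symptoms);
}

interface Resolver {                     // e.g. scale a bolt up or down, restart a container
  void resolve(Diagnosis diagnosis);
}

class HealthPolicy {
  private static final long COOL_DOWN_MILLIS = 10 * 60_000;  // observe effects before acting again

  private final SymptomDetector detector;
  private final Diagnoser diagnoser;
  private final Resolver resolver;
  private long lastActionMillis = 0;

  HealthPolicy(SymptomDetector detector, Diagnoser diagnoser, Resolver resolver) {
    this.detector = detector;
    this.diagnoser = diagnoser;
    this.resolver = resolver;
  }

  // Evaluated periodically (every couple of minutes) over a window of recent metrics.
  void evaluate(Map<String, Double> windowedMetrics) {
    List<Symptom> symptoms = detector.detect(windowedMetrics);
    Diagnosis diagnosis = diagnoser.diagnose(symptoms);
    // Only one invasive action at a time, with a cool-down so the effect of the
    // previous action can be observed (and blacklisted if it did not help).
    if (diagnosis != null && System.currentTimeMillis() - lastActionMillis > COOL_DOWN_MILLIS) {
      resolver.resolve(diagnosis);
      lastActionMillis = System.currentTimeMillis();
    }
  }
}

class Symptom { /* name, affected instances, measured values */ }
class Diagnosis { /* suspected cause plus confidence */ }
```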
That is pretty much all I had, but again, Dhalion is just an initial piece of work — there is a lot still to be done. If anybody is interested in working on this, let me know and I can give you pointers to whatever you'd like.

To conclude: autopiloting is important, and the key issues we are trying to solve are tuning, the slow-host problem, network issues, and data skew. And as you can see, we have not even touched the network partitioning issues, or even network slowness — network slowness can show up as all kinds of different symptoms. Especially with a cloud provider like AWS, the internal network is not very predictable: sometimes you get the throughput you expect, sometimes you don't. In fact, Google runs Heron in one of their teams — the Google fabric team — and they have made some adjustments, not at the system level but at the topology level, where if throughput is going down in one direction they route to another direction. We are trying to get that kind of work done in a more system-level way. Any other questions?

"Going back to the throughput-over-time slide — there are some dips in the throughput. My question is why the throughput goes down there." So the question is: in the graph I showed earlier for the dynamic provisioning policy, why did the throughput go down? The dips are the overhead of adjusting the topology. While the topology is adjusting itself, it is not processing any data, so that is why you see the throughput drop at that point — a corrective action is being taken, so data processing is minimal. It clearly marks the time when the action is happening. After the action finishes, throughput picks right back up to where it has to be; it just takes some amount of time to stabilize. I wish the width of those dips were much smaller — ideally milliseconds — but it is not there yet, partly because, for example, allocating a new container involves the scheduler's overhead. There are possible improvements that could make this very transient; we might not even have to dip at all. Currently, back pressure means we essentially stop the data during the corrective action before reopening — the spouts are stopped, the topology is corrected, and then the gates are opened again. We could potentially make the change while the data is still in flight, but we have not done that yet. OK. Yeah.

"One more question. You mentioned that only one policy is executed at a point in time —" One invasive policy at any given time. "Is that a fundamental limitation, or have you just not explored running multiple policies?"
No, we simply have not tried running multiple invasive policies at once. The reason we do one invasive action at a time is to learn from it: one action might be enough, and there is a lot of interrelated behavior — one action can cause other reactions — so we want to take one action, watch for the reactions, and then take the next one, instead of taking multiple actions and multiplying the reactions. "That makes sense. This looks like an expert system, and normally, as you increase the number of policies, the number of interactions could grow exponentially." It could, and we don't want that — ultimately we want to keep the topology running as well as it can, rather than make it worse. All right, nice talking to you. Almost done. Thank you.

All right, so the next speaker in the time-series database lectures will be on October 12th. We'll have Super Goal from Two Sigma. Two Sigma is a big hedge fund in New York City, and apparently they're building two time-series databases, so he's going to come and talk about at least one of them. All right, thank you.