So the title of my talk is Processing Billions of Events in Real Time Using Twitter Heron. Heron is a new streaming engine that was open sourced by Twitter a couple of months ago. We have been using Heron in production within Twitter for the last two-plus years, and we thought we would open source it so that the community can benefit from it. I'm going to talk in detail about Heron and how it is used within Twitter to process all those events.

My talk outline is as follows. First, I'll give a Heron overview and introduce the concept of a microstream engine in the context of Heron, and how Heron differs from a lot of the streaming engines out there in open source. Then I will highlight a couple of problems that we faced in production; one of them is stragglers, how you deal with slow machines and variations in machine speed. I will also give some performance numbers, how Heron performs in production and where the time actually goes when you write a streaming job. Finally, I will give some ideas in case you're interested in contributing, followed by a summary and conclusions.

So why real time? Twitter is all about real time, and there are several aspects of Twitter that are real time. A few of them include real-time trends: we continuously compute trends and bubble up the emerging ones out of the Twitter feed. Then real-time conversation: during sporting events like the Super Bowl, or any football or soccer game, or the Olympics, a lot of the chatter is about a particular touchdown or criticizing an umpire's call, and those kinds of conversations have to be bubbled up in real time as well. Then, since our monetization strategy is based on ads, as these real-time conversations flow through the Twitter feed, we need to understand which conversation is relevant so we can inject a relevant ad into it and increase the likelihood of the ad being clicked. And finally, we also do real-time search, where incoming tweets are indexed very quickly, within hundreds of milliseconds, so that search results on Twitter.com can be very current.

Analyzing all these billions of events in real time is always a challenge for us. We had been using a first-generation streaming system called Storm. How many of you know about Storm? We used Storm for a while, from 2011 to 2014 or so, and we ran into a lot of issues when running at scale; running streaming on a few thousand nodes is a big challenge. Storm ran into a lot of problems in terms of task isolation, debugging, performance tuning, and sustained GCs. We summarized all those issues in a paper that we published last year at SIGMOD, where we identified 19 problems in total. That's when we decided to write Heron so that we could solve all of those problems. I'm not going to go into the details of the Storm problems; I'm just going to delve into Heron itself.

To give you an idea of the Heron terminology: a Heron job is called a topology, and it's essentially nothing but a DAG. The DAG consists of vertices and edges; the vertices represent computation, and the edges represent streams of data flowing from one vertex to the other. There are two different types of vertices, and one of them is called spouts.
Spouts represent the source of data for the job, and they can tap into any source. It could be Kafka or Kestrel, or even MySQL, Postgres, or any other data source that you consider pertinent. For example, if you have a way to tap data from kernel counters, you can write a spout of your own and deploy the topology. The second type of vertex is called bolts. Bolts are essentially your computational elements: they take the incoming data, process it in a certain way, and emit the outgoing tuples. Some examples include filtering, aggregations, and joins, as well as arbitrary machine learning functions like clustering, association rule mining, and so on.

A Heron topology looks like the following. As I said, it's a DAG. You have a couple of spouts in this DAG, which feed the next-stage bolts, and those bolts process the data and feed the second-stage bolts. To give a concrete example, take the word count topology. In this case, we are trying to count the distinct words occurring in the Twitter stream. To accomplish that, we need a tweet spout, and that tweet spout is essentially tapping into the stream and getting the tweets out. Then we need a parse tweet bolt, which takes those raw tweets and breaks them up into distinct words. And finally, you have a word count bolt, which in turn counts the distinct words across the tweets.

From a database kernel point of view, we call this more of a logical plan: it describes what the computation looks like. The same thing gets realized as a physical plan, but not exactly one-to-one; it gets realized differently when it executes on the physical hardware. The reason we have to do it differently is the sheer size of the data you're processing: you may not be able to process it within the context of a single machine, or even within the context of a single process. Hence, when the topology runs on physical machines, you have a lot of instances of each one of those components. In other words, the tweet spout task might need 100 instances running to handle the brunt of the Twitter stream. Similarly, if parsing is costing some CPU, the parse tweet bolt might have another 200 instances, and the word count bolt might have another 50 instances. Those numbers are what we call the parallelism of each vertex. Once the parallelism is identified, which is part of the tuning phase for a topology, you launch the topology into the physical cluster, and these individual spout tasks and bolt tasks are packaged and run in a single cluster. We will see more about how they are packed and run in the later slides.

Now, one question is: when a tweet spout emits a tuple, where should the tuple end up among the downstream bolts, in this case the parse tweet bolt? To specify that, there are a bunch of groupings. One of them is what we call shuffle grouping, which means when the tuple comes out you can distribute it to anybody: you randomly pick any downstream bolt and send the tuple there. The second one is fields grouping, where you take one or more fields, hash based on those values, and send the tuple to the appropriate task based on that.
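To make the spout/bolt distinction concrete, here is a minimal sketch of the two bolts in that word count example, written against the Storm-compatible API that Heron exposes (the backtype.storm classes). The class names and field names are illustrative, not Twitter's actual production code.

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Takes a raw tweet and breaks it up into individual words.
public class ParseTweetBolt extends BaseBasicBolt {
  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String tweetText = tuple.getStringByField("tweet");
    for (String word : tweetText.split("\\s+")) {
      collector.emit(new Values(word));   // one output tuple per word
    }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}

// Counts how many times each distinct word has been seen by this instance.
class WordCountBolt extends BaseBasicBolt {
  private final Map<String, Long> counts = new HashMap<>();

  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getStringByField("word");
    long count = counts.merge(word, 1L, Long::sum);
    collector.emit(new Values(word, count));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word", "count"));
  }
}
```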
There is another grouping called all grouping, where you replicate the tuple to all the downstream bolts. And finally, there is a global grouping, where the entire stream is sent to only one single task. In practice, we have seen that a combination of shuffle and fields grouping is what gets used most of the time.

So coming back to the word count topology example, what groupings would you introduce? In this case, you use a shuffle grouping from the tweet spout, because when a tweet comes out of the tweet spout it doesn't matter where it lands, as long as the tweet gets parsed; shuffle grouping will suffice. On the other hand, when you go from the parse tweet bolt to the word count bolt, you need to send the same word to the same bolt instance so that the count is accurate. That's where fields grouping comes in: fields grouping directs the tuple to the appropriate task based on the word.

With this short terminology in place, let us see why Heron. As I said, Storm was the previous system that we used, and it had a lot of issues in terms of performance predictability. In other words, when multiple topologies were running together, or fragments of topologies were mapped onto a single machine, one was racing while the other was slowing down, and there was no clear control over how many resources a particular topology was supposed to use and how to constrain it. That led to a lot of issues with one topology trampling over the other, which led to a lot of troubleshooting and pages at 2 or 3 AM. And often during events like the Super Bowl and the Oscars, where the traffic in Twitter is pretty high, you end up with plenty of these trampling issues.

The second goal was improved developer productivity. We wanted to get to a model where other teams in Twitter can write a Heron topology or a Storm topology and debug it themselves. Storm was very poor at that, especially because you can't properly get the logs for a particular process while it's running, and you cannot profile it to identify your performance problems. Those kinds of issues were there.

And finally, ease of manageability. Storm encouraged people to run their own clusters, which means you have to manage a separate cluster, and that has its own overhead in terms of having folks maintain the cluster and make sure it is healthy. Meanwhile, there is already a big cluster maintained by another team, which we call the compute cluster, with hundreds of thousands of machines. Why can't we play nice and work in the same cluster in a multi-tenant fashion? So we wanted to make operations as easy as possible.

Some of the design decisions we made up front: Heron had to be fully API compatible with Storm, for obvious reasons, because Twitter had made a lot of investments in Storm, and all the topologies and all the business logic that had been running for a few years were already written in Storm. If API compatibility was possible, then all we had to do was pull Storm out from under the rug and put Heron in. Any topology should compile without any changes and just launch. So we had to make sure that worked.
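Putting the groupings together, wiring up that word count topology with the Storm-compatible API looks roughly like the sketch below, using the ParseTweetBolt and WordCountBolt sketched earlier. TweetSpout is a hypothetical spout tapping whatever source you use, and the parallelism numbers are just the illustrative ones from the example above.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();

    // 100 spout instances to handle the brunt of the Twitter stream.
    // TweetSpout is a hypothetical spout class, not shown here.
    builder.setSpout("tweet-spout", new TweetSpout(), 100);

    // Parsing is CPU heavy, so 200 instances; any instance may parse any tweet,
    // hence shuffle grouping.
    builder.setBolt("parse-tweet-bolt", new ParseTweetBolt(), 200)
           .shuffleGrouping("tweet-spout");

    // The same word must always land on the same instance for the count to be
    // accurate, hence fields grouping on "word".
    builder.setBolt("word-count-bolt", new WordCountBolt(), 50)
           .fieldsGrouping("parse-tweet-bolt", new Fields("word"));

    Config conf = new Config();
    StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
  }
}
```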
The second decision is that we wanted to improve developer productivity by using task isolation and containers, because we want every task to be debuggable and profilable, and to be able to say this is the amount of resources I need for this task and have it not take more than that, so that we can run in a multi-tenant cluster. And finally, we wanted to use mainstream languages like C++, Java, and Python. I don't know how many of you have heard this, but Storm was written in a functional language called Clojure, and the interaction of Clojure with the JVM had its own set of issues in terms of triggering GCs, sustained GCs, all kinds of issues. We wanted to avoid all of that and stick with tested and proven languages.

With that, the first major decision we made is that we got away from a custom scheduler for Heron. Mesos has a big community behind it, YARN has a big community behind it, and we have managed schedulers like Aurora running on top of Mesos, and similarly Marathon running on top of Mesos. So there are several schedulers already running and very stable, and we wanted to get away from writing our own. All we do in Heron is provide a mapping to whatever scheduler you want to use to launch Heron jobs, and you should be ready to go. A Heron job, a real-time job, looks like yet another job to a scheduler, except for the fact that these are not terminating jobs: they keep running until you go and kill them. A streaming job is always running, like a service, and only when you kill it does the job release its resources back to the scheduler. So in the case of Heron, you submit a topology to an existing scheduler, and it runs alongside other critical services in a multi-tenant cluster.

If you zoom in on what the topology architecture looks like: whenever you launch a topology, container zero comes up, and container zero contains a process called the topology master. The topology master is responsible for managing the entire topology. Once the topology master comes up, its first order of business is to advertise where the other fragments of the topology can find it, so it writes its discoverable host and port into the ZooKeeper cluster. Then it also identifies how many resources the topology requires: hey, I need 100 containers, and each container should be of a size like 4 to 5 CPUs and 10 GB of memory. Once it determines the resource requirements, it contacts the scheduler: hey, I need this many containers with this many resources, can you please find them and spawn them?

Once the scheduler gets those resources and spawns them, you get the other containers that you see in the lower part of the slide. Those are all called the data containers, and they are spawned by the scheduler. The moment those containers are spawned, a process called the stream manager comes up on each of them. The stream manager's first order of business is to figure out where its topology master is running. It uses the ZooKeeper discovery mechanism, where the topology master has written its location, and discovers it. Once it discovers it, it sends a message to the topology master: hey, by the way, I'm up. And once the topology master gets all the stream manager check-ins, it forms what we call the physical plan.
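The discovery step described here, the topology master advertising itself and each stream manager finding it, boils down to a small amount of ZooKeeper usage. A rough sketch of the idea follows; this is not Heron's actual code, the znode path is made up for illustration, and it assumes the parent znodes already exist.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class TopologyMasterDiscovery {
  // Illustrative path; the real layout is Heron's own convention.
  private static final String TM_PATH = "/heron/topologies/word-count/tmaster";

  // Topology master side: advertise host:port as an ephemeral znode,
  // so it disappears automatically if container zero dies.
  static void advertise(ZooKeeper zk, String host, int port) throws Exception {
    byte[] location = (host + ":" + port).getBytes("UTF-8");
    zk.create(TM_PATH, location, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }

  // Stream manager side: look up the topology master so it can check in with it.
  static String discover(ZooKeeper zk) throws Exception {
    byte[] location = zk.getData(TM_PATH, false, null);
    return new String(location, "UTF-8");   // e.g. "10.0.0.7:6001"
  }
}
```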
The physical plan essentially allows the stream managers to discover the other stream managers in other containers, so that data can be exchanged in a fully connected graph fashion. So once the stream managers check in with the topology master, the topology master forms the physical plan and also writes a copy of it into the ZooKeeper cluster. The reason we do that is that container zero can fail: at any given time it's running on some machine of its own, and if it fails, the container might be relocated to some other machine by the scheduler. In that case, the topology master has to rediscover where its data containers are, how they are running, and whether they're healthy or not. In order to do that, it has to recover its state, the physical plan and so on, and that is why it saves that state information, so it survives when container zero, the topology master, dies.

Now what happens when a data container dies? The scheduler will allocate a new container, and automatically that container's first process will be the stream manager. Again, it will try to find its topology master, and since all the stream managers and containers are assigned unique IDs, the topology master will see this new container reporting that it is the old container, whatever ID it used to have. Since the topology master has heartbeats coming from the stream managers, it knows that the old container is dead and the new container has come up, and it will reconstitute a new physical plan and send it down again so that the data exchange resumes. So it's built for fault tolerance: container deaths, process deaths, machine deaths, all of these are taken care of without any manual intervention at all.

I forgot to mention a couple of other items on the previous slide. In addition to the stream manager, there are other processes that run in the container. The actual work, remember the tweet spout and the word count bolt, runs on the instances, labeled I1, I2, I3 in purple on the slide. That is where the actual spouts and bolts run. And unlike Storm, where the spouts and bolts run as threads, this is completely a process-based system, where an individual spout task runs in a process of its own. That helped a lot in terms of debugging a process, taking a heap dump, or even profiling it, getting stack traces, and using all the various cool tools that come with that. The second process, the metrics manager, is responsible for collecting all the metrics: the number of tuples entering a particular process, the number of tuples exiting it, and how much latency it experienced during execution. All of that data is collected by the metrics manager, and it goes to an offline system as well as back to the topology master, so that the UI can be served and people can figure out the current number of tuples that have been processed, any failures that are occurring, and so on.

Now, if you look at how the physical execution happens: as you can see, there are four containers for this topology, and each one of them has a stream manager running. After the physical plan is shared, they form a fully connected graph, so every stream manager is connected to every other.
There are two sockets that go from each stream manager: one is the control socket, and the other is the data socket, where data is exchanged. I will explain the need for the control socket a little later, when we deal with stragglers. As you can see, the instances are the processes running in those containers, and the instances connect to the stream manager to send data. Even if there is a local exchange of data, from S1 to B2 in the top-left container, it still bounces off the stream manager. That made the architecture simpler and more deterministic.

The topology master, as I said, is solely responsible for managing the entire topology. It records the role and team of whoever launched it, so that we can aggregate capacity by team, how much capacity is used for Heron jobs, and so on. It also does monitoring, especially of container health, whether the stream managers and the other processes are working or not. Finally, it also provides metrics, which act as the serving ground for the UI.

The stream manager is essentially the heart of the entire processing infrastructure. It routes tuples: whenever a tuple is injected into a stream manager, remember, depending on the shuffle grouping or the fields grouping, the tuple has to be routed to the appropriate container, and within the container to the appropriate process. The stream manager does all of that. It also manages what we call back pressure: if one component is going slow, how do you throttle the others so that everybody goes at the speed of the slowest one? So it deals with back pressure, which I will describe a little later. And there is ack management, which is used for at-least-once semantics, where data entering the system is guaranteed to be processed at least once.

Now, the Heron instance is where the actual work really gets done. It runs only one task, as I mentioned, for isolation purposes. It also exposes the Storm API, and you can write your own instance if you want to expose your own API, so the architecture is very extensible. Finally, it gathers all its metrics and transports them out through the metrics manager. If you look inside a Heron instance, there are a couple of threads running. One is called the gateway thread, which is responsible for the input and output of data to and from the stream manager; the gateway thread also handles sending the collected metrics to the metrics manager. The spout logic or bolt logic that you wrote runs on the task execution thread. Data from the gateway thread is pushed onto a queue, and the task execution thread picks it up and runs it. These queues are dynamically adjusted in size to minimize GC, so that we don't keep a lot of objects hovering around in memory.

So that was a short introduction to Heron; now, how is Heron deployed at Twitter? We need some bells and whistles in addition to the core architecture. We have something called the Heron Tracker, which tracks all the jobs that are running on Heron, and we have Heron Web, which is the UI for Heron.
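Going back to the Heron instance internals for a moment: the gateway-thread to task-execution-thread hand-off described above is essentially a bounded producer/consumer queue between two threads. Here is a simplified sketch of that pattern; the real instance also resizes its queues dynamically and ships metrics out, which this sketch omits, and the tuple type here is just a placeholder string.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class InstanceThreadsSketch {
  public static void main(String[] args) {
    // Bounded queue between the gateway thread and the task execution thread.
    BlockingQueue<String> inQueue = new ArrayBlockingQueue<>(1024);

    // Gateway thread: reads tuples off the stream manager socket and enqueues them.
    Thread gateway = new Thread(() -> {
      try {
        int i = 0;
        while (true) {
          String tuple = "tuple-" + i++;   // stand-in for data read from the socket
          inQueue.put(tuple);              // blocks when the queue is full
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "gateway");

    // Task execution thread: runs the user's spout or bolt logic on each tuple.
    Thread taskExecution = new Thread(() -> {
      try {
        while (true) {
          String tuple = inQueue.take();   // blocks when the queue is empty
          System.out.println("executing user logic on " + tuple);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "task-execution");

    gateway.start();
    taskExecution.start();
  }
}
```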
Then there is Heron Viz, which essentially allows you to do offline analysis of all the metrics that we collect for a particular topology, because when some troubleshooting needs to happen, we need to look into those graphs and see what happened. In production we run on Mesos/Aurora, and we use Linux cgroup containers heavily; that has worked out really well.

The sample topologies range from very simple to quite complex, as you can see in the graphs: from simple four-node topologies to something that has 15 nodes. Some of the larger, more complex topologies you see are not handwritten; instead, they are generated by higher-level frameworks. You can even see some topologies that are disconnected DAGs, with no connection between the pieces. Those could be split up into multiple jobs, but people who want to manage a smaller number of jobs can write multiple topologies as one single topology as well.

Remember I mentioned that the metrics manager collects a lot of metrics. We have a dashboard that allows us to troubleshoot any topology, and a topology dashboard looks like this. This is just one fraction of the whole dashboard; I can tell you more than 200 metrics are being collected, and you will be scrolling pages and pages before you can figure out which is the right metric.

Officially, Heron has been in production for the last two years. I can't give out the exact numbers, but it is processing a large amount of data, and we run Heron on a few thousand nodes in a multi-tenant cluster. When I say a few thousand nodes, it doesn't mean the nodes are dedicated completely to Heron; it's all containerized, and it's running on Mesos/Aurora, which is the big compute cluster that we have within Twitter. The equivalent number of machines that we use is a few thousand.

The Heron use cases within Twitter are divided into seven buckets. The first is real-time ETL: extract, transform, and load of the data. Then real-time BI, where the data is picked up as people engage with Twitter, broken out into multiple dimensions, and aggregated on those dimensions so that you can look at how people are interacting with Twitter at any given time. For example, we have something called real-time active user interactions, meaning you can see people using Twitter broken down by region, by device, and by what operating system they are using, all in a real-time dashboard where you get an up-to-the-second view. We also use it for product safety, in terms of fraud detection, abuse, and tracking all of that. Then we compute all the real-time trends. Then real-time model building, as well as real-time model enhancement, for the data mining and machine learning folks. We also process a lot of image-related features to classify images that have similar features together. And we also use it for real-time ops: since we have a lot of machines, data from those machines is collected continuously in real time, and we process that data to identify which machine is going bad or going slow, and even do some prediction of when a machine has to be replaced.
So there are lots and lots of use cases that we use it for. Now I'm going to zoom in on a couple of topics that are highlights of Heron itself, and one of them is what we call componentizing Heron. In the current world, new systems continuously come into play, and old systems are continuously evolving. For example, in the case of schedulers alone, we have schedulers like YARN and Mesos, managed schedulers like Aurora, Kubernetes, the Amazon EC2 Container Service for Docker containers, and Marathon. And the HPC community has its own scheduler called Slurm. So a bunch of these schedulers keep coming up and going away. Similarly, for state management and synchronization, that can be ZooKeeper, or you can even use the local file system if you are running on a single node, and even Hadoop could be used as some kind of state manager as well. And when you upload your job, you can upload to S3 if you are going to execute in an Amazon environment, or to Hadoop if it is a dedicated cluster, or to the local file system if it is a single node.

We wanted to coexist in this environment, where the environment is constantly changing but the core stays the same, because the core has been debugged and it's in production. We don't want to keep changing the core, but we should be able to adapt to any environment. So how do we tackle this problem? We looked back in history, and we found microkernels. I'm sure a lot of people here know what microkernels are; one of the inspirations was the Mach microkernel paper. They went from a monolithic kernel to a microkernel that kept only the essential services, and the rest of the services became processes on top of those essential services.

That's what we followed with the microstream engine. The basic streaming requirement is IPC communication: essentially, how to transfer data between processes and how to transfer data across processes on different systems. That is one of the basic requirements. The second one is a scheduler abstraction: if you want to fit into any scheduler environment, how do you do it? Similarly, distributed state management, where you have to store the physical plan and handle the various other synchronization mechanisms; there are interfaces for all of these. Once we have that, once the processes and protocols are established, the system is already componentized. For example, the topology master is a process that uses IPC to interact with the stream managers, the instances, and everyone else, and it also interacts with the scheduler and the state manager whenever it needs those services. Similarly, the stream manager also runs as a process. So you can go and replace anything, as long as it satisfies the protocol of transferring data, receiving data, and processing the appropriate control messages. That's all we need.

This gave us a great advantage, because when we deployed Heron into production on Mesos/Aurora and then open sourced it a couple of months ago, people asked: can we run it on YARN? Can we run it for the HPC community on the Slurm scheduler? Can we run it on Marathon and DC/OS? People have been asking all these things, and we were able to turn each scheduler around quickly, in less than two weeks.
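That quick turnaround is possible because each environment sits behind a small interface. Heron ships pluggable scheduler and state-manager abstractions; the sketch below is a simplified illustration of the idea, and the interface and method names here are illustrative rather than the real API.

```java
import java.util.List;
import java.util.Map;

// Simplified view of a pluggable scheduler: the core only needs a way to
// ask for containers and to kill or restart them.
interface SchedulerPlugin {
  void initialize(Map<String, Object> config);
  boolean launchContainers(List<ContainerRequest> requests);
  boolean restartContainer(int containerId);
  boolean killTopology(String topologyName);
}

// Simplified view of pluggable state management: where the physical plan
// and the topology master location are stored (ZooKeeper, local FS, ...).
interface StateManagerPlugin {
  void setTopologyMasterLocation(String topologyName, String hostPort);
  String getTopologyMasterLocation(String topologyName);
  void setPhysicalPlan(String topologyName, byte[] serializedPlan);
  byte[] getPhysicalPlan(String topologyName);
}

// What the topology master asks the scheduler for.
class ContainerRequest {
  final double cpuCores;
  final long ramBytes;

  ContainerRequest(double cpuCores, long ramBytes) {
    this.cpuCores = cpuCores;
    this.ramBytes = ramBytes;
  }
}
```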
So currently it runs on native Mesos, as well as Marathon, Slurm, YARN, and of course Mesos with Aurora. We also have a local scheduler, which allows you to bring everything onto a single laptop and run it there continuously, so that you can do development and so on. So we are able to map to these schedulers as fast as we can.

Similarly, this architecture also allowed us to support Python topologies. Initially, when we started, we supported only writing Java topologies. Now we have just finished support for writing topologies in Python itself, because Python is such a nice language for speeding up development. Of course performance will be a little bit slower, but you can still write your topologies in Python and they run on a native Python interpreter. Unlike Storm, where everything goes through the JVM and then a multi-lang protocol that essentially spits out JSON into a Python process, there are no more complexities like that. All we did is, instead of Java instances, you get Python instances. We wrote the Python instances, and when you launch a Python topology, the system knows that a Python topology has been launched and spawns only the Python instances. And there have already been requests from people who want to write in C++, because they have legacy code and so on, and that's in the works as well. So any new language binding or any new API, you can go and add very easily. Another project that we are working on is SQL on Heron itself, and that will also have its own SQL instances, which in turn do aggregations and other SQL operators that can be efficiently implemented. The whole point is that this architecture gives the flexibility to do a lot of plug-and-play components.

So as I said, the microstream engine gives you the advantage of plug-and-play components: as the environment changes, the code does not change. Multi-language instances support multiple languages, as I said, Python and C++, and multiple processing semantics. We can even replace the stream manager for higher speed. For example, let's say some folks come in and say, I want even lower latency than what Heron provides. Yes, we can write a stream manager that works directly with InfiniBand, which has low latency, and we will be ready to go without having to change anything on top. And ease of development: because it's componentized, multiple teams can work on independent things, and it gives us faster development iteration as well.

Today, Heron within Twitter runs in a few contexts. The local scheduler is our development environment; for testing and so on, we use Mesos/Aurora with ZooKeeper and HDFS for uploading artifacts; and for production, we use the Aurora scheduler, the ZooKeeper state manager, and Packer, which manages the versioning of all the topologies. In open source, it runs on several other schedulers as well.

Now I'm going to highlight one production operations experience: stragglers, and how we deal with them. Stragglers are essentially the norm in multi-tenant distributed systems. There are three primary reasons why stragglers occur. One is a bad host; it's not a failed host, it's a bad host in the sense that it is processing slower than it should be.
The second reason is execution skew, where some process hits a highly popular key, and Twitter is known for that. For example, if some popular tweet gets retweeted everywhere, the same tweet keeps hitting the same process, and the sheer volume of data getting into that process is more than it can handle; that can be the reason for a straggler as well. The third reason is that you might not have adequately provisioned resources for the job itself.

So let's see what approaches you can take to handle stragglers. There are multiple approaches. One is for senders to simply drop data to stragglers: if I am sending data to a straggler and it can't take it as fast as I am getting data, I simply drop it. The second strategy is to slow down the sender: the straggler is not receiving as fast as I'm sending the data, so I will slow down and go at the pace of the straggler itself. The third one is to detect stragglers and reschedule them proactively.

If you look at the drop-data strategy, one of the problems is that it's unpredictable. In the whole DAG, in a distributed environment, it's possible that somebody is dropping data somewhere without any visibility, which means it affects accuracy quite a bit. This is the strategy that Storm adopted, and there have been instances where some of the data we were serving was 2x to 3x inaccurate. That was the reason we did not adopt this strategy.

Instead, we went with the slow-down-sender strategy. This one provides predictability, because everybody goes at the pace of the straggler. The whole data consumption rate might go down, and that in turn translates into a lag at the source. When the straggler recovers, the topology can process data at the maximum rate, which keeps recovery times as short as possible, and it nicely handles spikes; Twitter is known for spikes, especially during the Super Bowl and similar events, or during games, where whenever a touchdown occurs you see a 3x to 4x spike that dies down in a couple of seconds.

For the slow-down-sender strategy, let us take the example of a topology that is linear: we have a spout S1, followed by a bolt B2, followed by a bolt B3, which again goes to a bolt B4. Take the physical realization of this topology, with the containers and all the stream managers connected to each other, and let us say B2 is a straggler. The stream manager on that container will detect: hey, by the way, B2 is going slow, so I need to do something about it. What it does is broadcast an initiate back pressure message to all the stream managers. Each stream manager looks at its physical plan to see whether any spout is running in its container, because ultimately the spouts are the gatekeepers injecting data into the topology; if we can slow down those spouts, the stragglers will automatically be allowed to drain the data at whatever pace they want. Once a stream manager receives the initiate back pressure message, its spouts are completely disabled, in the sense that no more data is taken from the spout by the stream manager. The stream manager runs a libevent loop; all we do is take the spout's socket out of the loop and stop consuming, and the spout automatically slows down.
Then once the straggler is able to process at its normal rate again, we send a relieve back pressure message to all the stream managers, and automatically everybody opens up the spouts and starts processing again. Now, this kind of behavior could lead to flip-flopping between initiating and relieving back pressure. To alleviate that, the stream managers have a notion of buffers, and those buffers have a low watermark as well as a high watermark. Whenever you go beyond the high watermark, we initiate back pressure, and only when you go below the low watermark do we relieve the back pressure. The buffers essentially act as a cushion. So that's what we did.

In practice, if one of these containers goes into back pressure, what happens is you start falling behind at the source. When you are reading the data from the source, like a Kafka queue or a distributed log, your read position starts lagging while the write position keeps advancing, because the data is coming in at a faster rate than you can keep up with. This is what we call the lag, and this lag is continuously measured for every topology, so that you know at any given time: hey, this topology is lagging, which means there is a straggler in it. Then we can manually restart the container if it needs to be.

A couple of notes. In most scenarios, back pressure recovers on its own; during spikes and so on, it comes and goes without any manual intervention. But there are cases of sustained back pressure, because of irrecoverable GC cycles, especially if the topology is writing its output to some external system and that system is behaving badly, which leads to GCs. Or, alternately, it could be a faulty host, which is constantly triggering back pressure all the time. In these cases, there is a way to manually restart a particular container; the scheduler will relocate the container, redo the physical plan automatically, and restart it, and it will start processing again.

Then sometimes people will say: hey, by the way, I'm lagging so far behind that the real-time liveliness of the data is lost; I don't mind dropping the data and moving to the top of the queue. So we provide some kind of sampling mechanism or drop mechanism at the spout level. The spout constantly measures how far it is lagging, and when there is a lag, boom, it applies some rules and goes back to the top of the queue.

Finally, detecting bad and faulty hosts. Since bad and faulty hosts can affect topologies quite a bit, we have to proactively keep detecting them. Given this large set of hosts, what we do is run another topology that continuously collects data from these machines and looks at how well each machine is doing based on the metrics collected. Then we blacklist hosts with the scheduler, saying: by the way, please take these nodes out of consideration for scheduling. The scheduler will then automatically remove them, which keeps the cluster sane enough.
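To make the watermark mechanism concrete, here is a rough sketch of how a stream manager's buffer could drive the initiate/relieve decisions. The thresholds and the broadcast call are placeholders for illustration, not Heron's actual implementation.

```java
// Sketch of high/low watermark back pressure on a stream manager buffer.
public class BackPressureSketch {
  private static final int HIGH_WATERMARK = 8_000;  // illustrative tuple counts
  private static final int LOW_WATERMARK = 2_000;

  private int bufferedTuples = 0;
  private boolean backPressure = false;

  // Called whenever tuples are added to or drained from the outgoing buffer.
  void onBufferSizeChanged(int delta) {
    bufferedTuples += delta;

    if (!backPressure && bufferedTuples > HIGH_WATERMARK) {
      backPressure = true;
      broadcast("initiate back pressure");   // all stream managers stop reading from spouts
    } else if (backPressure && bufferedTuples < LOW_WATERMARK) {
      backPressure = false;
      broadcast("relieve back pressure");    // spouts are re-enabled everywhere
    }
    // Between the two watermarks nothing changes, which prevents flip-flopping.
  }

  private void broadcast(String controlMessage) {
    // Placeholder: in Heron this goes out over the control sockets to every stream manager.
    System.out.println("broadcast: " + controlMessage);
  }
}
```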
Finally, I'm going to give some performance numbers. One of the things that we want to understand is how the resources are being used in a Heron topology. In this example, we have 60 to 100 million tuples per minute being injected into the topology. A filtering step occurs within the spout itself, which filters the 60 to 100 million down to 8 to 12 million. After that, a blow-up operation happens, where each tuple is split about 5x, so 40 to 60 million tuples per minute come out of the spout. Those go to a next-stage bolt, which aggregates the tuples every second and outputs about 25 to 42 million per minute to a Redis cache.

These are the cores that we requested from the scheduler. Redis runs on a dedicated machine, which is 24 cores; the number of cores actually used is 2 to 4, and the memory used is 48 GB. On the other hand, for Heron, which is running on the Aurora cluster, we request 120 cores, but the actual usage, depending on the traffic, is around 30 to 50 cores. The memory requested is 200 GB, and the memory actually used is 180 GB.

Now, where do the cores actually get used? As you can see, 85% of the cores go to the spouts themselves, the bolts take barely 9%, and the Heron overhead is barely 7%. We were certainly surprised to see the spouts taking so much of the core resources. So we profiled the spouts; again, this profiling is all possible because of the task isolation that we did. We found that serialization and deserialization cost takes 63% of the whole time, Kafka fetching takes roughly another quarter of the time (16% plus 7%), and the other things, like the parse filter and the user logic, take the remaining time. When you profile the bolts, writing data into Redis takes the big chunk of the time, since you are going across machine boundaries; other than that, the dominating cost is again serialization. Overall, you can see that 61% of the whole resources are dedicated to fetching data from other systems and especially deserializing it, the user logic takes 21%, writing data takes 8%, and Heron takes 11%. There's room to improve Heron, to even cut that in half, like 5% or even 4%, but the other things have to be improved before improving Heron itself matters much.

To give some numbers in terms of throughput and so on, you can see the real-time active users topology, which is close to a production topology, where we run some logic to figure out how users are interacting with Twitter at any given time. From this one, we can see the reduction in resources consumed: it's about 370 cores in Storm, whereas in Heron we consumed only 30 or 40 cores, which is roughly a 10x reduction in resources. That's a huge deal, especially when you're saving a few hundred machines. Similarly, in the no-acknowledgement case, which is essentially best effort, the number of cores used is again much lower in Heron. And this is for the word count topology, the example that I showed you earlier.
This is the worst case for any streaming system, because a streaming system is measured by how fast data movement can be done. And again, we saw that we can do 10x higher throughput, and similarly, latency is also much lower.

If you are interested in learning more, we have published three papers that you can look into, and we have a few more papers coming out. We also did a tutorial on real-time analytics; that's 300-plus slides with a lot of information about which systems are out there and what their pros and cons are, so it captures all of that. Heron is open source, so if you're interested in contributing, please feel free to do so. You can follow us at Heron Streaming, a Twitter account, and get the latest updates from there. If you're interested in contributing, these are the active projects. We are working on auto-scaling; if there is something you're interested in, reach out to us. We are also using machine learning for root cause detection in Heron, so that we can troubleshoot effectively. Then open versus closed streaming systems, which care all about the latest data. Then sampling and dropping, and how effectively you can do that. Then tuning, and even C++ topologies. So a bunch of stuff like that is going on. That's all I had. Any questions?

Yes? Yes, that is the topology scaling project that I mentioned at the end. We already have a prototype working where the decision to change the parallelism can be made manually and given as a command. Just like heron submit, where you submit a job, there's a heron update, where you can update a job with a different number of instances. Once you do that, new containers will automatically be grabbed, whatever the increase in parallelism is will be scheduled in those containers, the rest of the physical plan will be rejiggered, and the topology will keep running.

Yes? It should be possible in just a few more weeks.

Yes? A spout essentially depends on how you write it; you can write any code in your spout yourself. Within the Twitter context, we also maintain the common spouts, because a spout is a very common thing that everybody wants to link against. We have a DistributedLog spout, and DistributedLog is essentially our messaging system, because we don't use Kafka anymore within Twitter; we used to have a Kafka spout as well. Those spouts have a lot of logic of the kind you mentioned. Sometimes we have a bit of queuing and a few things we can adjust, so that even after we get the data out, we can keep it around in the spout's memory before we inject it. For example, we can also do some kind of rate limiting from the spout to the bolts.

Yes? We are going to modify the Storm Kafka spout to take advantage of some of the things that we learned, and we will push those into the Heron repo as well.

Currently, we inform the scheduler team: by the way, these machines are not working very well. Then they blacklist them in the scheduler; that's how it works today. But what we want is to get to the point where the streaming job, since it's continuously running, is able to identify which machine is faulty and automatically switch the container, or probably kill the container and move on to the next one,
so that we don't have to step in manually. By the time the whole manual process completes, it's probably a couple of hours, which means the real-time liveliness of the data is lost by then. So we want to do it automatically. Thank you.