Hey, good morning, everyone. I'd like to start off by making the claim that your code is wrong. Really, everyone's code is wrong, including my own. So let's start with an example of what I mean by this, and the example I'm going to use is a small feature in Storm called the reportError method.

For those of you who don't know what Storm is, Storm is a real-time computation system. It's like Hadoop, but for real-time processing. In Storm, you run what are called topologies, which are infinite computations, whereas in Hadoop you run jobs, which are fixed computations. To understand this example and some of the later ones, it helps to understand the very high-level architecture of a Storm cluster, so let me spend a minute on that. There are three classes of nodes in a Storm cluster. On the left here we have Nimbus, the master node, which plays a role similar to the Hadoop job tracker. Nimbus is where you submit topologies for execution, and it takes care of launching workers around the cluster to run your topology. Nimbus also monitors your topologies, so if a worker dies, it reassigns that worker to another machine. In the middle there's a ZooKeeper cluster. ZooKeeper is not part of Storm; it's an Apache project. Storm uses ZooKeeper for cluster coordination, which is pretty much exactly what ZooKeeper was built for. And on the right we have a set of worker nodes. Each worker node runs a daemon called the supervisor, which communicates with Nimbus through ZooKeeper to determine what should be running on that machine, and then starts and stops workers as dictated by Nimbus. All right, that's all you need to know about how Storm works.

So let's get back to the reportError method. reportError is a function provided by Storm that a developer can use to display errors in the Storm UI. The Storm UI is a central place a developer can go to see what's going on with a topology: statistics, metrics, and also errors happening in the topology. So reportError is a quick and easy way to see what's going on with computations all across the cluster. The way we implemented reportError is that error information is stored in ZooKeeper, because ZooKeeper is really the only place we can store state in Storm. And even though ZooKeeper can't handle that much load, this was expected to be safe, because errors should be relatively rare.

It turns out there were some serious problems with this design. What happens when a user deploys code like this? In fact, one of our users at Twitter deployed code almost exactly like this; a sketch of its shape follows below. For every input tuple, the code throws a null pointer exception, catches it, and calls reportError on that null pointer exception. So reportError is called once for every input tuple. If you're processing 100,000 tuples per second, you're now calling reportError 100,000 times per second, which is a lot. The effect is a denial-of-service attack on ZooKeeper that brings the entire cluster down and kills everyone's topologies. What went wrong here was a mismatch between how we expected that method to be used and how it was actually used in practice.
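To make this concrete, here's a minimal sketch of the shape of that user code, written against Storm's Java bolt API from that era. The class name and the failing logic are my reconstruction for illustration, not the user's actual code:

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// Hypothetical reconstruction of the kind of bolt that caused the outage:
// a bug throws on every tuple, and the catch block dutifully reports every
// single exception, so reportError runs at the full input rate.
public class BuggyBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            process(tuple); // throws a NullPointerException on every tuple
            collector.ack(tuple);
        } catch (Exception e) {
            // At 100,000 tuples/sec this writes to ZooKeeper 100,000
            // times/sec, far beyond what ZooKeeper can absorb.
            collector.reportError(e);
            collector.fail(tuple);
        }
    }

    private void process(Tuple tuple) {
        String field = null;
        field.length(); // the blatant NPE
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}
```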
Or to put it another way, the input space we designed for was different from the input space we actually saw in production. By input space, I mean more than just the arguments you pass into your functions. I also mean how often those functions are called and the context in which they're used.

But that's not exactly what I mean when I say your code is wrong. At least, that's not the only thing I mean. I also mean your code is literally wrong: the logic itself is incorrect. Now, maybe I'm being presumptuous, because after all, I don't know what your software does, I've never seen your code, and I've never even met you. But I'm going to stick to it. Your code is definitely wrong. This is just a fact of nature. It's impossible for your code to be completely correct.

If you want to persist in saying that your code is correct, then I can reasonably ask: why do you believe your code is correct? You'll probably say something like, well, my code depends on dependencies 1, 2, and 3, and it combines them in just the right way to get the properties my code has. Then I might ask, how do you know dependency 1 is correct? And you'll say dependency 1 depends on dependencies 4 and 5, and it combines them in just the right way to get the properties of dependency 1. Then I might ask, how do you know dependency 4 is correct? We can go on and on like this, and eventually we're going to reach the hardware, upon which I can reasonably ask: how do you know the hardware is correct? At this point you'll probably pause for a second, but then you'll say something about how the electronics work. Then I can ask, how do you know the electronics are correct? And you'll say chemistry. And I'll say, how do you know chemistry is correct? Atomic physics. How do you know atomic physics is correct? Quantum mechanics. But Richard Feynman said nobody understands quantum mechanics. So how can you say anything conclusive about anything that follows from it? Now, maybe you're smarter than Richard Feynman; you understand quantum mechanics, and you understand the whole tree of reasoning. But I wouldn't put my hopes on it. So I think I'm justified in saying your code is wrong.

I'm being facetious here, but there's an underlying point. What underlies our software is an unbelievable amount of complexity. There's no way anyone can possibly understand all of it, and mistakes will be made all over this tree of dependencies. But forget all this theory; just look at the evidence. Any software you've used of even the most minuscule complexity has had bugs in it, including the software you've written. So it's arrogant to think that this time you got it right. I think the best you can say is that your code is sometimes correct, maybe even mostly correct. But the thing is, that's actually good enough, because we're not in the business of creating perfect software. We're in the business of providing value to our users, and software does not have to be perfect to provide value. That's a good thing, because otherwise we'd all be unemployed. The fact that things don't work perfectly is true of every machine we use, whether natural or man-made. Rockets sometimes explode. Our pipes break, our computers break, our headphones break, our bodies break down, our cars crash. We all know this, and software is no different. So given that your code is wrong, it's wrong to treat your code as deterministic.
It may be deterministic in theory, but in practice you have to treat it as non-deterministic, as probabilistic, as something that might work. And this brings me to the key point I want to make: when you embrace that your code is wrong, you can design much better software. By better software, I mean software that works a higher percentage of the time, software that's more robust, software where the input space you design for has greater overlap with the actual input space you see in production.

So let me give an example of what you can achieve, in terms of building better software, when you embrace that your code is wrong. What I want to talk about is a problem I saw in Hadoop's design and fixed when I was creating Storm. Now, Hadoop has gotten better since then, so not everything I'm going to say is still true, but it was certainly the case when I was designing Storm. The way Hadoop works is it has a job tracker that manages all the running jobs, assigns tasks around the cluster, and things like that. The job tracker does this by keeping a bunch of state in memory about those jobs. The problem was that when the job tracker died, all the jobs would die. And that really sucks when you've been waiting 30 hours for a job to complete. There were all sorts of things that could cause a job tracker to crash. I remember one that was particularly aggravating: if you submitted a job with too big a job configuration, the job tracker would hit an out-of-memory exception, crash, and bring down the entire cluster.

But we've already established that your code is wrong. So it's unreasonable to think that your processes will run forever. They will crash; there is some bug in there that will cause them to crash under certain conditions. So when I was designing Storm, I decided Storm's daemons would be process fault tolerant, meaning a process can die and restart with no effect on running topologies. If you look at Nimbus, which is Storm's equivalent of Hadoop's job tracker, if you kill it, nothing happens. The topologies just keep on running like nothing happened. The way this is accomplished is that all state for Nimbus is kept either on disk or in ZooKeeper, so when Nimbus comes back up, it recovers its state and resumes where it left off; a sketch of the pattern follows below. Of course, you need Nimbus up to do reassignments and to submit new topologies, but as long as the process is able to restart, you're totally fine. What I achieved with this design is that the software just works better. It works for a greater portion of the input space. We're now protected against stray errors that cause Nimbus to crash, things like unlikely race conditions. Of course, the software isn't perfect; it's just better. It just works more of the time.
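Here's a minimal sketch of that recover-on-startup pattern in plain Java, using a local state file; the file path is invented. This is just the shape of the idea under those assumptions, not Nimbus's actual implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

// Sketch of process fault tolerance: every piece of state the daemon needs
// lives outside the process, so a crash followed by a restart is
// indistinguishable from a long pause.
public class RecoverableDaemon {
    private final Path stateFile = Paths.get("/var/lib/mydaemon/state.properties");
    private final Properties state = new Properties();

    public void start() throws IOException {
        // Recover: on boot, read whatever the previous incarnation left behind.
        if (Files.exists(stateFile)) {
            try (InputStream in = Files.newInputStream(stateFile)) {
                state.load(in);
            }
        }
        // ...resume work from the recovered state...
    }

    // Every mutation is written through to durable storage before it is
    // considered done; in-memory state is only a cache.
    public synchronized void put(String key, String value) throws IOException {
        state.setProperty(key, value);
        Path tmp = stateFile.resolveSibling("state.properties.tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            state.store(out, null);
        }
        // Atomic rename, so a crash mid-write never corrupts the state file.
        Files.move(tmp, stateFile, StandardCopyOption.ATOMIC_MOVE);
    }
}
```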
I remember one time in the internal Twitter cluster, we deployed a new version of Nimbus that had a blatant null pointer exception in it. Literally, Nimbus would start up, do a round of reassignments, hit the null pointer exception, and crash. Then it would restart, do a round of reassignments, hit the null pointer exception, and crash again. And what's crazy is that the cluster was actually working fine. Topologies were totally okay, and even if a topology had a failure, it would still get reassigned properly. Now, of course, you couldn't submit new topologies, because Nimbus wouldn't stay up long enough to accept them. But at least the existing stuff was running perfectly okay. This was really cool to see, because it was the design of process fault tolerance really paying off in action.

Now, it's an obvious point, but in order to design better software, you have to consider all the possible impacts of your code being wrong. The bad stuff happens whenever your software sees an input that's not in your designed input space. On the right side of this diagram, we're talking about things like failures, where your code does the wrong thing or produces the wrong answer. We're talking about things like bad performance, or security holes. If you think about it, the entire computer security industry exists because of the mismatch between the actual input space and the designed input space of the software we build.

So there are a number of design principles that emerge when you embrace that your code is wrong, and I want to go through a few of them. The first is that measuring and monitoring your software are the foundation of solid engineering. You cannot possibly improve your software without an understanding of the conditions in which it works and in which it doesn't. By measuring, I mean answering the questions: under what range of inputs does my software function well? Under what range of inputs do my dependencies function well? Measurement is crucial to understanding how to avoid abusing your dependencies. And by monitoring, I mean answering the question: what's the actual input space of my software? Both measuring and monitoring are critical to doing good engineering, and personally, I don't see nearly enough of it happening, measuring in particular. You should be doing thorough tests of every dependency you decide to use, especially infrastructure, especially all the NoSQL databases you're using or thinking about using. When you look at other engineering disciplines, say electrical engineering, they measure the hell out of every component they use: every transistor, resistor, capacitor. The functional input range of every component is very carefully measured and understood. What's crazy is that it's actually way easier to measure and monitor software, yet we do it much less rigorously. I think part of the engineering process should be listing out every knob that affects your software and seeing what happens when you twist those knobs all the way.

And you should be monitoring every possible aspect of your software: latencies and throughputs, stack traces, buffer sizes, memory usage, CPU usage, and the million other metrics you can collect. I would actually say that how you monitor your software is as important as its functionality, because problems in production will happen. You will have downtime. Even Google went down this past week. The only way to quickly diagnose problems in production is by having really, really good monitoring, and that's the key to minimizing downtime. Think of Apollo 13. Those astronauts were in serious trouble, and one of the reasons they made it back alive is that they had amazing telemetry, so the engineers on the ground could quickly figure out what was wrong, deploy fixes, and get them home safely.
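As a sketch of that attitude, and only a sketch: even a few lines of plain Java get you call counts and latency totals on anything you care about. A real deployment would use a proper metrics library and export these somewhere visible; the names here are invented:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Minimal monitoring sketch: cheap counters and latency accumulation around
// everything, so "how often" and "how slow" are always answerable in production.
public class Metrics {
    private static final ConcurrentHashMap<String, LongAdder> counters =
            new ConcurrentHashMap<>();
    private static final ConcurrentHashMap<String, LongAdder> totalNanos =
            new ConcurrentHashMap<>();

    public static void count(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    // Time a block of work, recording both the call count and the latency.
    public static <T> T timed(String name, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            count(name + ".calls");
            totalNanos.computeIfAbsent(name, k -> new LongAdder())
                      .add(System.nanoTime() - start);
        }
    }
}

// Usage (illustrative): List<Row> rows = Metrics.timed("db.query", () -> db.query(sql));
```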
The second design principle I want to talk about is embracing immutability. Immutability is something that's definitely not embraced by the vast majority of applications. Most applications look something like this: an application talking to some sort of read-write database, whether that's MySQL, MongoDB, Riak, Cassandra, or any of the million other databases in existence. But as we've established, your code is wrong. So writes you didn't expect are going to happen, and your data will be corrupted. And you may not know why, because you won't detect the corruption until potentially long after it happened. These kinds of errors are the worst. They're incredibly hard and incredibly time-consuming to debug; I've lost weeks of time to just one of these problems. Plus, whenever this happens, you may have permanently lost user data, which should be completely unacceptable.

Now, there's an alternative way to design your data architectures, and it's based on immutability. The idea behind an immutable architecture is that your core data store is an immutable, ever-growing list of data, where the only write operation is adding a new piece of data. You then build views on that data that aggregate and index it so your application can do queries efficiently. In today's world, where storage is cheap, these architectures are very doable and very viable; it's not that expensive to just store all your data. And when your only write operation is adding new pieces of data, you can add redundant checks that make it really difficult for random errors to cause corruption. You can add things like permissions so that updates and deletes are literally not allowed on the master data store, which gives you a lot of protection from the random errors you'll see in production. Another benefit is that it makes debugging much easier, because you always have the exact inputs available for your outputs: if you see a problem with a view, you can see the exact inputs that went into producing it, and therefore fix it much, much more easily. Those of you familiar with my work know this is the basis of the Lambda Architecture, which I've talked about extensively. There are tons of details to actually implementing this, but that's the general idea.
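Here's a toy, in-memory sketch of that idea. A real system would keep the log on a distributed filesystem and compute views in batch jobs; the Event type and the page-view view are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an immutable architecture: the master store only ever appends,
// and queries are answered from views that can always be rebuilt from scratch.
public class ImmutableStore {
    public record Event(long timestamp, String url, String userId) {}

    // Master dataset: append-only. No update, no delete; corruption by a
    // stray write is structurally hard.
    private final List<Event> log = new ArrayList<>();

    public void append(Event e) {
        log.add(e);
    }

    // A view is just a function of the whole log. If the view logic was
    // wrong, or bad data slipped in, fix the problem and recompute.
    public Map<String, Long> pageViewCounts() {
        Map<String, Long> counts = new HashMap<>();
        for (Event e : log) {
            counts.merge(e.url(), 1L, Long::sum);
        }
        return counts;
    }
}
```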
The third design principle I want to talk about is minimizing dependencies. This is actually a little more subtle than it looks, but the general idea is that the less that can go wrong, the less that will go wrong with your software. As an example, I want to consider one aspect of how Storm uses ZooKeeper. One of the things Storm stores in ZooKeeper is the location of all workers in the cluster. If a worker dies and is reassigned to a new location, that information needs to propagate to the other workers, because every worker needs to know the locations of the others in order to know where to send messages. Now, there are two ways a worker might get location updates. The first is polling: the worker polls ZooKeeper every few seconds to refresh the location information. This works great, but it adds a few seconds of latency between when the location information changes and when it propagates. The second method is to use a feature of ZooKeeper called watches, so that as soon as the information is updated in ZooKeeper, it's pushed out and immediately propagates to all workers. Method two is much faster, of course, but it adds a dependency on another feature of ZooKeeper.

So what I decided in Storm's design is that Storm uses both methods. It uses the watch feature, but it also polls every few seconds just in case the watch feature doesn't work. This turned out to be a far-sighted decision, because it turns out watches had a bug in them that would have affected Storm, but didn't, because Storm had this redundancy built in. Now, I'm not saying you shouldn't have any dependencies, that you should re-implement everything you use. That's obviously ridiculous. Everything is a trade-off. In this case, eliminating the dependency was justified because of the very small amount of code required: we're literally talking about ten lines of code to remove the dependency on that feature.
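Those ten-odd lines look roughly like this sketch, written against the plain ZooKeeper client API. The znode path and class names are mine, and Storm's actual code differs:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Belt-and-suspenders location updates: a ZooKeeper watch for low latency,
// plus a periodic poll as a fallback in case the watch mechanism misbehaves.
public class LocationTracker {
    private final ZooKeeper zk;
    private final ScheduledExecutorService poller =
            Executors.newSingleThreadScheduledExecutor();

    public LocationTracker(ZooKeeper zk) {
        this.zk = zk;
        // Fallback path: refresh every few seconds no matter what.
        poller.scheduleWithFixedDelay(this::refresh, 0, 5, TimeUnit.SECONDS);
    }

    private void refresh() {
        try {
            // Fast path: re-register the watch on every read; if it fires,
            // we refresh immediately instead of waiting for the next poll.
            byte[] data = zk.getData("/assignments", event -> refresh(), null);
            applyLocations(data);
        } catch (KeeperException | InterruptedException e) {
            // Ignore and rely on the next poll; the redundancy is the point.
        }
    }

    private void applyLocations(byte[] data) { /* parse and update routing */ }
}
```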
The next design principle I want to talk about is explicitly respecting the functional input ranges of your components. Let's come back to that reportError method I talked about. The problem with it was that a user could abuse it by calling it too often, which would overload ZooKeeper and bring down the cluster. The way we solved the problem is we had the method throttle itself to avoid overloading ZooKeeper: any errors reported over the throttle rate are logged locally, but not written to ZooKeeper. This makes reportError work for a greater portion of the input space; it no longer fails when it's called too often. You see a similar pattern whenever you're dealing with log files. It's really important that your log files automatically trim themselves, or else you run into situations where you run out of disk space. The more general principle is that you can use self-throttling to respect the functional input ranges of the components you're using and prevent cascading failures. Because the only thing worse than something failing because it's wrong is something failing because something completely unrelated is wrong. Cascading failures are surprisingly common, and it's really important to prevent them.
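Here's a minimal sketch of that self-throttling fix. The interval, the names, and the ZooKeeper write stub are illustrative assumptions, not Storm's exact implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Self-throttling error reporting: a hot loop of errors can no longer turn
// into a denial-of-service attack on ZooKeeper. Errors over the throttle
// rate are logged locally instead.
public class ThrottledErrorReporter {
    private static final long MIN_INTERVAL_MS = 10_000; // at most one write per 10s
    private final AtomicLong lastWriteMs = new AtomicLong(0);

    public void reportError(Throwable error) {
        long now = System.currentTimeMillis();
        long last = lastWriteMs.get();
        if (now - last >= MIN_INTERVAL_MS && lastWriteMs.compareAndSet(last, now)) {
            writeToZooKeeper(error);   // within ZooKeeper's functional input range
        } else {
            logLocally(error);         // over the throttle rate: local log only
        }
    }

    private void writeToZooKeeper(Throwable error) { /* ... */ }
    private void logLocally(Throwable error) { /* ... */ }
}
```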
The last design principle I want to talk about is embracing recomputation. When I said your code is wrong, there were multiple meanings to that phrase. The first was that your designed input space differs from your actual input space. The second was that the logic of your code is literally wrong. And the last meaning I want to talk about is that the requirements for your software are constantly changing, so the code of today is wrong for the requirements of tomorrow. Recomputation is a technique that helps with all of these meanings of "your code is wrong."

Let me start with the last one. Obviously, you need to be able to change your code to match the shifting requirements of your software. But the problem is, that's not all you have to do: you may have to do a big migration of your existing data to make the new code work. For example, let's say you're building blogging software, and suddenly you have a new requirement you've never dealt with before: you need to be able to search across all articles on all blogs. So of course, you need to build a search index on all your data. Building that search index is much easier when you can easily run a computation over all your data and build it from scratch. Fortunately, in today's world this is actually pretty easy, because we have tools like Hadoop that are very good at running functions over all your data. So you can use Hadoop to build a partitioned search index in a massive batch job and prepare the deployment of your new code.

But recomputation gives you much more than just adapting to new requirements, and it works especially well in conjunction with immutability. Consider the immutable architecture again, and now consider some of the mistakes that can happen; recomputation helps with a lot of them. Say some bug starts writing bad, corrupt data to your immutable data store. That will corrupt your views, because the bad data propagates into them. What you can do is remove the bad data, or add code to ignore it, and then recompute your views from scratch, and everything is back to normal. Or if the code for generating your views is wrong, you can just fix the code, recompute the views, and again, everything is back to normal. In the toy sketch from earlier, recomputing is just rerunning the view function over the full log. Recomputation adds a lot of robustness to your data architectures, and I would personally question the robustness of any data architecture that's unable to easily do recomputation.

Now, I've gone through a number of design principles that result from your code being wrong. And the pattern I hope has emerged is that software engineering is really no different from any other engineering. The underlying challenges are the same, at least when it comes to making what you're building work better. Consider engineering a bridge. A bridge depends on the steel and the wires that comprise it, and even though it's a static structure, it has a lot of inputs: wind, rain, snow, the varying weights of the vehicles crossing, the occasional earthquake. There's always some magnitude of earthquake for which the bridge will fail, and it's the job of the engineer to make sure the bridge works under the appropriate input space. Or consider a jet engine. A jet engine will work fine in normal or even stormy weather. It's not going to work too well in a hurricane, or if a flock of birds flies through it. But that's fine; that's an acceptable functional input range. Engineering is about balancing your functional input range against how much it's going to cost to increase that range. And the questions you ask of all engineering, whether physical or software, are the same. What's going to break what I'm making? What are the limits of my dependencies? How much stress can this piece of metal handle, or how much load can this database take? How can I add redundancy to increase robustness? A bridge is redundant: if you chip a small piece of metal off a bridge, it's not going to collapse. Likewise, process fault tolerance is an example of adding redundancy to software to make it work better, to make it more robust. How can I isolate failures so that they don't bring down the entire system? All of these questions can be asked as readily of a bridge as they can of Cassandra or Riak or Hadoop or Storm. The only difference in software is that our raw materials are ideas instead of matter. Thank you.

So just raise your hand if you have a question. Okay, I'll take this one and then come back to you.

Regarding ZooKeeper usage, you said there's the watch as well as the polling mechanism, the watch to listen for changes as well as the polling. Isn't that too much resource usage in this architecture? Couldn't it be a different architecture with synchronous communication, something more lightweight?

So are you saying this strategy uses too many resources?

Yeah, I believe it's a redundant architecture, and it's going to take a lot of resources. For example, if it has to poll, there will be a background thread running, and on the watch side, ZooKeeper needs to keep a registry and keep looking for changes to push. If instead there were synchronous communication between ZooKeeper and the client node, it would be a more lightweight architecture, wouldn't it?

I don't quite understand your question. But if you're asking whether it uses too many resources, the answer is no, because Storm clusters clearly work in practice, and obviously it wouldn't be adopted if it used too many resources. We've scaled Storm to very, very large clusters, and this is not a problem. Why don't we meet afterwards and get to the bottom of it?

Hey, Nathan, quick question. I use Storm, and I think it's a great product. But my question is, what was the imperative to reinvent the wheel? Why didn't you just fork Hadoop and add the changes you needed there?

Well, what Storm does is fundamentally different from what Hadoop does. Hadoop is about running large, fixed batch computations, and Storm is about running infinite streaming computations. The only things I really would have wanted to reuse from Hadoop are task management and resource management, which I felt Hadoop didn't do well, for example the process fault tolerance issue I brought up. So it was better to re-implement them. When you talk about real-time computation, the requirements are fundamentally different. One of the things I've talked about in my other talks is that Hadoop had a lot of robustness problems, where it would crash or go down or your jobs wouldn't work. But that actually isn't that big of a deal for Hadoop, at least relative to its requirements, because it's a batch computation system, which means it's a high-latency system. If something goes wrong, that adds latency to an already high-latency system, and that's not a big deal. With a system like Storm, if you add latency to a real-time system, you're no longer meeting your real-time requirements. So your robustness requirements are much, much higher, and therefore the way you design your software needs to be much more rigorous, which is certainly what I did with Storm.

On your right. I've been using Storm, and I'm a great fan of Storm. My question is, where do you see the future of Storm, especially with the cloud? One issue I see is that we want to be able to scale up in a cloud environment and scale down as well, depending on need. But a topology in Storm is static, in the sense that you just can't add more nodes to a running cluster. Do you see that being improved in Storm going forward?

Well, first of all, you can add nodes to a running cluster. There are limitations on how you change the parallelism of a running topology. Now, Storm has ways to deal with that.
It's possible to deploy a new version of a topology with minimal downtime; I can talk to you afterwards about that. Now, in terms of the future roadmap for Storm, a lot of the stuff we're adding is things like better security features, especially being able to integrate Storm with the kind of security that Hadoop uses. We're adding high availability to Nimbus, so that you can have a cluster of Nimbus nodes instead of just one. In a cloud environment, I think having auto-scaling would be really cool, though I think we're a long way from that. Certainly, all the metrics work we've done in the past six to eight months is part of the foundation for it: you can now get very, very fine-grained metrics and monitoring on your running topologies, which of course is the basis for being able to make any sort of runtime decisions about how many resources you're using. That's the general future roadmap.

One more over here. You mentioned the Hadoop job tracker as an influence on some of these design principles. Were there any other projects you can point to as big design influences for Storm?

That's a good question. Probably, at a philosophical level, Clojure. Clojure is a functional, Lisp-based programming language for the JVM. Storm is actually written in Clojure, and Clojure embraces a lot of the principles I've talked about, especially immutability. More generally, Clojure embraces a philosophy of simplicity. The creator of Clojure is Rich Hickey, and the work he's been doing is absolutely brilliant. I highly recommend watching his talks. His philosophies have definitely influenced Storm, as well as me personally.

Okay, we're out of time, but thank you very much, Nathan. Let's give him a round of applause. Thank you.