mostly known, I think, in the Erlang community. I'm new to Elixir here, so I'm doing a bit of what OSA mentioned yesterday, which is to go advertise your stuff at conferences for languages you don't have anything to do with, and I'm doing this with Elixir. More frankly, it's because, well, hello, microphone. Is this cool? It feels like being a rock star, so I'm thinking next year I should be doing a dojo, so you can feel like a code ninja and you've done all the hipster stuff.

You might know me for Learn You Some Erlang for Great Good!, for Erlang in Anger; I recently wrote propertesting.com. I have a blog where I have opinions, and I'm also the maintainer of Rebar3, so for the Elixir people having issues with their mix builds, that's probably where you know me from the most. And if you have never heard of me before, my latest project was being on a t-shirt for this conference.

I'm a systems architect at a company called Genetec, and this is a bit of how I represent myself in my job. Really, my whole thing is making broad plans, and if they go well, I take all the credit, and if they go wrong, I blame the developers for not executing my vision well. More specifically, Genetec is the kind of company where we do security systems: video surveillance, license plate reading, access control with badges and card readers. There's something interesting in that it's deployed in all kinds of environments: mom-and-pop shops, coffee shops, department stores. We've had it in train stations, airports, even city-wide surveillance systems. And there's something quite interesting about that, because it's closer to the older model of doing things, where people have their own installations, you ship them software, they have to do the upgrades, and you don't even have access to the machines where it all happens. I used to work in cloud stuff, so this is all kinds of new to me, but there's a very fun challenge in it: if your software crashes, the person taking over for it is someone who needs to fly out for seven hours to get on site, and otherwise it's military dudes with submachine guns taking over the system. So you don't want that to happen, and you try hard for it not to happen.

That's kind of my job as a systems architect: coming up with broad principles that will prevent these issues from ever taking place, and hopefully they do, without crushing the hopes and dreams of the developers through a kind of technical micromanagement. Few of the plans survive the whiteboard; they kind of die early. But that's really where I come in. I even have the clothing right.

What I found out as a systems architect, and this is mostly a C# shop with a lot of Windows computers, but we're introducing Erlang in there, is that for all the things I do related to systems architecture, to make reliable systems in fields and areas we don't control, I always end up coming back to the vocabulary that OTP and Erlang give us. I wanted to present a bit of that today: how we can approach making a reliable system when we don't know what the hell is going to go wrong with it.

The way I like to describe it is based on the areas of knowledge we have. There's a little brown-orange circle in there: those are the things that we know. We know a bunch of stuff that is not related to code; we understand a part of the code. Most of the code, we don't even know what it is. It's going to be stuff in the operating system.
You're in charge of that if you ship a product. Hypervisors if you're in the cloud, the rest of the libraries, the virtual machine: this is technically all your responsibility, and most of us know only a tiny portion of it. That's what we ship, that's what we trust, and that's the area we know.

Fortunately, the overlap with the bug sections there is interesting, because the bugs in the things we know are super easy to handle. They're going to be things like the bugs you found in QA before shipping, if you have a QA process, where you decided: you know what, we don't have the time, we don't have the budget, we're going to ship a buggy system, to hell with it. Usually those are easy to fix, because you understand the system and you understand the code around them. If there's a problem, it's easy to figure out. It's probably related to a TODO someone left with bad error handling somewhere: it shipped with the TODO, it went through code review, and then it bites you in the ass, and you go, oh yeah, I know how to fix that. And then you just remove the TODO and pretend it's fine.

The other, more dangerous area is the stuff you think you know, the little purple circle in there. This is the stuff where you think you understand something; you have a mental model of how it works. You have no proof, but you're pretty sure of yourself. Yesterday, I think Rob had that thing like, are you sure about wanting to ship that? And the things you think you know tell you: yes, I'm sure. I'm sure of it. You know you're wrong when someone asks, do you have any proof or metrics for that? And it's like, I don't. But I know. Those are the things you think you know: you're not actually certain they're going to hold. In a system, these errors are really, really tricky. If they're near the top of the system, something shallow, like you didn't know your logging system would be truncating log lines, it's not that bad. It's easy to fix, easy to turn around. But if the Jenga tower of software we've built has the assumptions in the wrong place, for example if you assumed that all the testing you did against the loopback interface on localhost was representative of the deployment in the field, which is entirely untrue, because the TCP stack doesn't behave the same at all on the loopback interface as it does with an external computer, then all your error handling and error detection on connections is going to be screwed. And if you do that at the very base of your system, you have shipped a system that just plain doesn't work. Which is still a way to make money, I guess, but ideally you don't do that.

And then there's the stuff that is not in any circle: the things that we don't know in the known universe. You cannot prepare for these bugs or these kinds of issues, because you don't even know they are a thing. I think the best example we've had recently was the Meltdown bug: all the security measures are kind of useless, because people are able to read whatever memory they want and do whatever the hell they want. Another one that I like a whole lot is the thundering herd problem, where if you have a centralized server with a bunch of clients trying to connect and send data to it, and the server dies and comes back up, all the clients rush to send data at once. It's just like a denial-of-service attack, and it kills the server again and again.
And if you have that in a production system and you are not ready for it, you're kind of screwed: you have to find exceptional means to make everything work again. Those you just can't prepare for, and it's my favorite category of all of these, because the moment I describe one of these bugs, it's no longer valid as an example.

So the bugs we have fall into these categories. The stuff you know should be easy to fix: you have all the knowledge you need. The things you think you know are kind of hard: you have to be careful, you have to do some exploration and measuring, and it's kind of fine. And for the stuff you don't know, you have to gain knowledge, and possibly prayer is a solution. But the general approach I see in a system is: how do we shift the bugs into the easy category? If we have some kind of method by which we reduce all the unknown stuff to stuff we know how to handle, even when we don't know what it is, things get a lot easier. And of the development practices we have for that, some are going to be valid in any language, and some are only available to us in Erlang and Elixir, and I want to talk about them.

The easiest way is to increase the knowledge you have on your team. This is an actual complaint I have: why the hell are all the Elixir questions on Stack Overflow also tagged Erlang? But yeah, you have to increase the knowledge, and the first way to do that is: hire more senior developers, get good. That's not a really constructive one. Senior developers are a real solution to this, just because they're going to take that little orange circle and make it a lot bigger, and it's going to be great. But the problem is that if your senior developers work on a feature for a year and a half and then get the hell out, you still have the code to maintain and nobody knows how the hell it works. So you have to make efficient use of your senior folks, and that's usually done by fostering a good culture of mentoring, education, and communication within your teams. The sign of a healthy team is not that all the inputs are senior people and all the outputs are other senior people. A healthy team is one where the input is junior people and the output is senior people getting poached left and right. It's a happy thing if all that gets stolen from you is senior people you didn't have to hire in the first place: you're able to produce as many as you need. What's interesting about that is that you'll need fewer of these senior people on your team, and they'll help you increase the knowledge of everyone you have in there.

A more important thing, though, is who you pick to be on your team in the first place, and this is where diversity, I think, is interesting. The thing that happens is that if everyone with you has the exact same background as you, went to the same schools, lived in the same neighborhoods, has the same hobbies and passions, then the things-you-know circle for the entire team is probably the same for everyone on it, and everyone is likely to have the same blind spots. If you have people who come from different technical backgrounds, different disciplines, different economic or social areas, you're going to have much greater coverage across all the areas in your team.
To put it into a principle otherwise: if you hire a team full of people who really, really enjoy Bitcoin, you're pretty sure you're going to get a blockchain in your system, no matter what the system is. It's the same thing for distributed systems. Distributed systems engineers love doing that stuff. I need a single machine, and it's like, what if we put Paxos on there? I have a Raft implementation handy, it's going to be great. You put all that stuff in there, and you get that kind of problem going. If you have a more diverse team, and you know you've needed one if you've ever done internationalization with only English-speaking dudes on your team who have never spoken another language, where you hurt your foot on the corner of some door every five minutes, people with a different perspective prevent all these bugs just by being there, calling out the stuff they know because they have a different background than you do. That's one of the best ways to increase knowledge in your system. And what's interesting is that it increases the knowledge in all the areas not necessarily related to code, which means it fixes future bugs in features you haven't even developed yet. It's an investment in the future of your team, and if you do it, it's extremely worthwhile.

If you're building products, we have to be aware of this as well, just from the point of view that mostly everyone who works in tech is pretty educated, pretty literate. It's easy to forget, for example, that 15 to 20 percent of people in North America are functionally illiterate, not able to extract meaningful information from a text. So if you build a product that's just "read the manual," and 20 percent of the people in your target audience can't work through a manual, you're kind of screwed: you're going to have a very bad time in your support team.

So that's increasing knowledge overall. The other one that I like is exploratory testing, and I freaking love this painting: "It worked on my machine. Back up your email, we're going to production." Exploratory testing is the practice where you take an experienced manual tester, if you still have those, sit them in front of the system, and let them go hog wild: just do stuff and note it down, tell me what it can do and what it cannot do. These people are going to find a crapload of bugs; it's going to be super great. The problem is that this is an expensive way of doing things: it's time-consuming, and it's hard to do regressions. But we have more mechanical ways of doing it, and one of them is simply fuzzing. If you've ever used fuzzing, it's interesting. The best-known tool for it is American Fuzzy Lop. It expects you to compile your program with some kind of annotations in there that guide it about where it's going, and then it just generates garbage to throw at your program. So maybe creating a crash requires 5,000 capital W's followed by binary garbage, and after a couple of days it might figure that out and tell you how it crashes your program. Fuzzing is really, really great for throwing all the garbage you can at a program and figuring out: can it still run through the garbage, or does it die in there?
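As a rough flavour of what that idea looks like transplanted to our world (this isn't coverage-guided like AFL, and `MyParser` is a made-up stand-in for whatever decoder you own), here is a property that throws random bytes at a parser and only demands "don't crash":

```elixir
property "the parser survives arbitrary garbage" do
  check all input <- binary() do
    # Rejecting garbage is fine; raising or crashing is the bug we hunt for.
    case MyParser.parse(input) do
      {:ok, _parsed} -> assert true
      {:error, _reason} -> assert true
    end
  end
end
```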
One thing we have in the Erlang and Elixir communities that is related to that is property-based testing, where instead of just throwing garbage at a program to see if it crashes, we throw garbage at a program and check whether it still does the right thing. We've got a few examples of that, and frankly we've got great tools for it. There's QuickCheck, which is probably the best thing you can have for this; you've got PropEr, which is pretty interesting; Triq, which is kind of dead; and you've got StreamData, but StreamData doesn't do the fancy stateful stuff yet. If you can use property-based testing, it's absolutely great.

The big interesting stuff in property-based testing is stateful testing, which is basically equivalent to having that kind of exploratory testing done by a person, except you're automating it in code. Instead of just generating data, like, I have an image hosting service, can I generate usernames that are really funky and break the database, what you generate with these stateful tests is a sequence of operations. It might be something like: log my user in, log my user out, upload an image, view an image, delete an image. And you might have properties or rules like: a user cannot upload duplicate images; a user can only see the images they have themselves uploaded; you need to be logged in to upload an image. Then you run random sequences of operations, something like log in, log out, upload the image, get the image, upload another image, delete the image, upload the image, and maybe after 24 or 40 of these operations it finds a bug. Now you have that complex sequence of operations from A to G, and it's like, how the hell is this a bug? What property-based testing does is take that sequence of operations and remove every step it can while still reproducing the bug, and possibly what you're left with at the end is a sequence of: log in, upload an image, delete the image, and re-upload it again. And oh, surprise, you're not allowed to re-upload an image you deleted, because the duplicate check is done wrong. You might not have thought of trying that, but the computer is able to do it for you. It doesn't make us smarter; it just makes us better at understanding our programs.
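Real stateful testing (PropEr's proper_statem, QuickCheck's eqc_statem) generates commands, runs them against the real system, and shrinks the failing sequence for you. As a hand-rolled miniature of the idea in StreamData, where the "system" is just an Agent and the model is a MapSet, and every name is invented for illustration:

```elixir
defmodule ImageStoreOpsTest do
  use ExUnit.Case
  use ExUnitProperties

  # Stand-in "system under test": an Agent holding the set of stored images.
  defp run_op({:upload, img}, pid), do: Agent.update(pid, &MapSet.put(&1, img))
  defp run_op({:delete, img}, pid), do: Agent.update(pid, &MapSet.delete(&1, img))

  # Pure model of what the store should contain after the same operations.
  defp model_op({:upload, img}, set), do: MapSet.put(set, img)
  defp model_op({:delete, img}, set), do: MapSet.delete(set, img)

  property "any sequence of operations leaves the store matching the model" do
    op_gen = tuple({member_of([:upload, :delete]), member_of(~w(cat dog bird))})

    check all ops <- list_of(op_gen) do
      {:ok, pid} = Agent.start_link(fn -> MapSet.new() end)
      Enum.each(ops, &run_op(&1, pid))

      model = Enum.reduce(ops, MapSet.new(), &model_op/2)
      assert Agent.get(pid, & &1) == model

      Agent.stop(pid)
    end
  end
end
```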
All these practices really increase our areas of knowledge, the blue arrows in the diagram, and make things easier. The approaches that are more social help the team across all the areas overall, and the mechanical testing really helps within the code, which turns out to be critical, because if you don't have code, you don't have bugs, which is why writing code sucks. One thing you hope happens, though, is that as your team keeps gaining more knowledge and the little brown-orange circle grows, the purple one stays fixed. What you don't want is everyone going, we're so much smarter than we were before, we've got a great team, and then making even more assumptions than before and getting screwed. Ideally you learn from all the mistakes you've made not to trust yourself and to verify your assumptions first, and hopefully the things you think you know shrink proportionally to what you know.

The other really interesting and easy way to do things is to shrink the stuff you know nothing about, and the easiest way to do that is just to write less code. Don't write the code in the first place; you're not going to have the problems. If you can, you've also got formal proof assistants, all the formal methods of writing software that make sure nothing can happen that you haven't planned for, with a proof that it won't happen, and those are great. But for Erlang and Elixir, what we really have that's interesting is: use Dialyzer, which will prevent all kinds of weird states you had not planned for from getting in there. Use linters and code formatters; we've had talks about that already. They make stuff clearer, and it's harder to have bugs in things that are clear and easy to understand. And if you do have formal methods and the time, budget, and knowledge, by all means use them; they work.

The other trick is "let it crash," and I think everyone here is kind of aware of that one, but it's really, really critical. If you take the approach I love to hate in the Go language, where you check the errors by hand, and you forget one of them, you enter really quickly into that garbage area of your code where the input you massaged into working is no longer what the user expected nor what the system expected, and then the system has free range to develop emergent behaviors that cause surprising bugs nobody thought could be there. If you fail early and you fail often, you prevent all these unknown states, and things go much, much better.

The other thing that's super interesting is to have an observable system. Observability has been thrown around quite a bit these last few years in operations circles, and the reason I bring it up is that the worst bugs are in the things you think you know and in the things you don't know, which means that in practice you are not going to have logs or metrics about those areas. If you knew they were risky, you would probably have explored them and prevented the hard bugs from being there. So whatever issue you have that's really, really tricky, you won't have metrics or logs for it. And if you're working with airports or cities like we do, it might be two years before they deploy a new version of your software, so you have nothing to debug with. It's really critical, when we work on the Erlang virtual machine, to be aware of all the nice features we have for that: tracing, all the metrics, the system debug information, the memory layout we can explore. All of that stuff is gold, because when you have one of these tricky bugs, instead of spending a cycle of three weeks to a month of deploys to add metrics and then combing over logs like an idiot, you just log on to the machine, you check the thing, you trace it, and in 15 minutes you have an answer. You really want to learn all about that stuff, because it makes the difference between losing a month of work and just doing it directly. I care enough about this that I've written Erlang in Anger for that reason.
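To give a taste of what those built-ins look like from a remote shell on a live node (recon is a separate library, also by the speaker; the `pid` and `MyModule` here are placeholders):

```elixir
:erlang.memory()                                      # VM-wide memory layout
Process.info(pid, [:message_queue_len, :memory])      # one process's vitals
:recon.proc_count(:memory, 5)                         # top 5 processes by memory
:recon_trace.calls({MyModule, :handle_call, :_}, 10)  # trace 10 calls, rate-limited
```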
And then the other one: I just like this painting a whole lot, with the little caption. It's not really related to anything, but I like the look they're giving; that's how I feel whenever I go to the lunch area and sit down right next to the marketing folks. It's like, what are you doing here?

So yeah, the trick is making things irrelevant, which is the feeling you get when you sit next to the marketing people, and usually you do that through architecture. Architecture is one of the best examples of this. If you have redundancy, more than one server, it doesn't matter why the other one dies. You don't have to know whether it's hardware, or a really critical bug, or something like that. It doesn't matter: you took an external measure that says, all these categories of errors that result in a crash, I handle them without knowing what they are. So fail fast and fail often is great for that, and this is valid in all languages. What's really interesting is that in Erlang and Elixir specifically, with OTP, we have access to these architectural patterns within the language itself, where most languages only get them when dealing with hardware machines or virtual machines.

So, who here is really familiar with supervisors and how they work? All right, there are still a few hands not raised; either you're hungover from yesterday or you don't know about them, so I'm going to go over the stuff. There are supervisors, and there are really just three strategies to get through. There's the one_for_one supervisor as the basic one: if a process dies, it's the only one that dies, it gets restarted, and off you go. It's great. There's the rest_for_one supervisor, which is pretty fantastic; nobody uses it, but it denotes a linear dependency between components. That means that if process C on the right depends on B, and B depends on A, you put that supervisor in there, and if one of them dies, all the ones that come after it are restarted. It's a great, great pattern for these linear dependencies. And then you get one_for_all, which is for when all your processes interact with each other, and if one of them dies it's really tricky, a garbage task to try to repair the state, to track that a new one is coming, to reset and cancel what you were doing and go back. Forget about that: just kill everything and bring it all back up. It's much simpler to reason about.

Then there are the restart types, the way you deal with loss. Permanent children restart all the time. Transient ones only restart on an abnormal exit: if something failed accidentally, that's when you restart, and otherwise the task is done and it's great. And temporary children never restart. Initially most people are like, why do I need temporary? I could just not put it under a supervisor. But there's a great reason to use it, and I'll get to it further down.
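For reference, this is how those strategies and restart types are spelled in Elixir's Supervisor API; the worker module names are placeholders:

```elixir
children = [WorkerA, WorkerB, WorkerC]

# :one_for_one  - only the crashed child restarts
# :rest_for_one - the crashed child and everything started after it restart
# :one_for_all  - any crash restarts all children
Supervisor.start_link(children, strategy: :rest_for_one)

# Restart types are set per child:
Supervisor.child_spec(WorkerA, restart: :permanent)  # always restarted (default)
Supervisor.child_spec(WorkerB, restart: :transient)  # restarted only on abnormal exit
Supervisor.child_spec(WorkerC, restart: :temporary)  # never restarted
```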
This is the part where it really becomes a bit of a whiteboard session, and this is how I personally design all the Erlang systems I work with; I think more people should do it. I think I heard Joe Armstrong say at some point that he never has more than two layers of supervision in a system, and I'm like, that's crazy, I never have only two layers of supervision. So I'm going to try to sell you on this technique right here.

Let's assume I have this system. It's a very fancy one: it has tiny local storage, the cylinder right there; there's an IP device, which could be a camera, could be an escalator, could be an HTTP server you just route traffic to; and then on the left you have the inputs, the DBs and the queues, which is where the orders that go into the system come from. And you've got the root supervisor at the top.

So how do you build this system before having written any code? Well, I'm going to need something to talk to the databases and the queues and the storage and all of that, so I'm going to call that the reporting supervisor or something and put it on the left, because that's the first thing I want to boot in my system: supervisors always start from the leftmost child, depth first, synchronously, and that is great. I will need to talk to the SQL database, probably Postgres, so I start a supervisor for a worker pool in there; once the pool has started, all the workers are there and connected, and once the connection is established, I'm ready to go connect to the queue. Maybe it's RabbitMQ, maybe it's Kafka; I do the same thing. Now I know that whatever my system boots after that is already able to talk to these databases: they're going to be available.

The next thing I want is metrics, because I want to know what's happening, and the rest of my system will need to be able to publish metrics, so I start a worker for that. Then I want a local cache of the configuration: if I'm deployed at a customer site and the databases are down, I still want the system to work, so I put that in an ETS table, or a big copy of it, and dump it to disk through a worker or something. Now if the databases go down, I should still have a system that works.

And now I can start the things that poll or communicate with the other devices. I'm setting up my configuration, and I know that when I start this process, everything else is already set up and ready to go. So I have the database, I have the configuration, and I can start the worker that talks to the IP device, and I can decide to shape it however I want. Each one can be structured in a complex manner: it could be a little supervisor that itself has a state machine maintaining the state of what goes on within a connection, with the connection being another process linked to it, so that if the connection breaks, I can still keep my state, or something like that. And that's how it goes; you end up with a good system architecture.
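In Elixir terms, a sketch of that boot order might look like this; every name here is invented for illustration, not taken from the talk:

```elixir
defmodule MySystem.Application do
  use Application

  # Children start left to right, depth first, synchronously: by the time
  # the pollers boot, the DB pool, queue, metrics, and config are already up.
  def start(_type, _args) do
    children = [
      MySystem.ReportingSupervisor, # SQL worker pool + queue connections
      MySystem.Metrics,             # metrics publisher
      MySystem.ConfigCache,         # ETS-backed local config, dumped to disk
      MySystem.PollerSupervisor     # device connections, booted last
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MySystem.Supervisor)
  end
end
```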
But honestly, this is boring garbage. I don't care how it works; I care how it works when everything is on fire. That's where the real challenge is. Making a system that works when everything is going great is easy; it's not that interesting. What I want to know is, when stuff starts to fail and we've made mistakes, and the mistakes might be in the code, might be in the environment, might be anywhere, how do we deal with that? Those are the true and expected errors, and the luck we have is that if we assume we make mistakes all the time, it's much easier to plan for them.

So the big practice I have is that I run my own chaos monkey on the whiteboard. I kill the worker on the left and ask: if that worker dies, should the other workers right next to it also die? The answer is no, so I'm going to put a one_for_one supervisor there. I know I'm going to want the same thing for Kafka. Then I have the question: if the SQL supervisor dies, maybe with all its workers dead because the database is down, do I want Kafka to die at the same time? Probably not; there's no reason it should die with the database. So I put a one_for_one supervisor in there as well.

Then I can go work on the bottom right. What if the connection to the device is broken, what do I do? Do I want to drop the state in the state machine? If I'm an HTTP proxy, then yes, maybe there's no way to save it. But if what I have is a video feed and it just broke because of a bad connection or something, I might want to keep it, so I use a rest_for_one pattern and drop the link between the two of them, because if the connection breaks, I will know about it anyway and it's all handled. And then you ask the same questions: what if one of the pollers itself is broken, do I want the others to die? The answer is no, so it's probably a simple one_for_one supervisor, because I could be having thousands of them.

You repeat that process over and over, and eventually you get into a tricky area. What happens if I kill the entire subtree for the databases? What do I do then? Maybe the connection is broken on the entire left side: those are on two different networks, one is entirely unavailable, everything exploded. Is it reasonable for me to want to keep interacting with my devices? Maybe I have a local configuration; what do I do with that one? What if that configuration worker dies, should the pollers die? How do I deal with my root supervisor, entirely at the top, which is usually a one_for_one by default?

So how do I patch this system? Really, you don't patch it in code; you patch the supervision tree, and here's what you do: you add more layers of supervision. Maybe the reporting side and the metrics are kind of tied together. I don't care exactly what they are; I put them under a one_for_one supervisor. They keep the same semantics, but what happens by doing that is that if the entire reporting side dies, the other side still lives, because they now live in two different subtrees of the supervision structure. I have defined, through the structure of my code, that no matter what happens, I should be able to keep working with my pollers as long as I have my local configuration; it doesn't matter if the databases are not there. At least that's the intent.

And then you can start the exercise again. What happens if the IP device I'm connecting to goes haywire? It's garbage, it's not detecting lightning properly, and it's crashing all the time; what do I do with that? If it can die repeatedly, multiple times a second, the risk I run with a standard supervision structure is that it hits the maximum restart intensity and kills the supervisors above it until the entire system goes down. And this is really a bad thing: I don't want a single worker to take down the entire system. The entire reason I'm using Erlang or Elixir is for that not to happen. So that's where you really use temporary children. That little supervisor right there, I give it a restart intensity that says something like, you can fail ten times a second, and if you break that, it shuts down. But that supervisor for these pollers is temporary, so if it dies, it just vanishes. It's better to lose one device than to lose the entire thing.

And then what you do is add a brain to the supervisor. You're going to see people advocating all the time for smarter supervisors with backoffs in there and all that kind of stuff. I don't want any of that garbage. I want my supervisor to be as dumb as whatever you have in your house: if it's a doorknob that you turn, you can trust the doorknob, because it's so simple you can understand it fully. I want my supervisor to be so simple that I don't make mistakes about what I think it should do; I know what it does. What I do instead is have a little worker on the side that just acts as a brain. From time to time it scans the supervisor: do you have all the children you need? It compares that with the configuration file, and if someone is missing, it goes, oh, that one probably died violently, I can restart it in one minute. And then it tries; if the device recovers, the system recovers. You implement your backoff by adding a brain on the side of the system. You don't make a smart system; smart systems make stupid mistakes.
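A hedged sketch of that "dumb supervisor plus a brain" pattern: the pollers are :temporary children, so when one crashes it stays down until this janitor notices and re-adds it. The names, the interval, and the config source are all invented.

```elixir
defmodule PollerBrain do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def init(opts) do
    schedule_scan()
    {:ok, opts}
  end

  def handle_info(:scan, opts) do
    # Temporary children that died are gone from the supervisor entirely,
    # so anything configured but not running is presumed dead.
    running =
      for {id, pid, _type, _mods} <- Supervisor.which_children(PollerSupervisor),
          is_pid(pid),
          into: MapSet.new(),
          do: id

    for {id, spec} <- configured_pollers(opts), id not in running do
      Supervisor.start_child(PollerSupervisor, spec)
    end

    schedule_scan()
    {:noreply, opts}
  end

  defp schedule_scan, do: Process.send_after(self(), :scan, 60_000) # retry in a minute

  defp configured_pollers(_opts), do: [] # placeholder: read {id, child_spec} pairs from config
end
```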
And so, just by going through that exercise, I have this new, corrected supervision structure that should now encode all the things my system needs to be fault tolerant, and I have written zero code. When I work with teams at Genetec, when we have new Erlang developers who have never written Erlang before, we do this before they even touch the code for their systems. We just do it on a whiteboard, and they go, I have absolutely, one hundred percent, an idea of how to implement this system. The workers are simple to do; it's going to be a connection handler, and it doesn't matter much where you put it. But now you know how the system boots, you know how it fails, and you know how it should recover.

Of course, that's the nice picture in the sales pitch of the talk; it's not always going to work. You need to take measures to support your supervision tree in doing the things it needs to do. The basic principle is handling uncertainty, and the way to think about it (I've written a blog post in the past called It's About the Guarantees) is this: if I have an error that is expected, for example in my database connection handler, do I expect the connection to ever break? If the database is on localhost, then no, it should always be there, and so if the database goes away, it's fine for the worker to die. But if I'm talking to a database that is in the cloud, then I should really expect the connection to break from time to time, and if that's an expected condition that happens all the time, the database worker shouldn't die when it happens. It should be able to relay the information to the caller, and the caller can then decide how to handle it. If you choose a fail-fast way of dealing with all the errors, even the well-known ones, you take away the choice of the caller, the one that depends on you, to make a good decision about error handling. So: handle the unexpected by crashing, but what you know will happen, return it as a value. This is the ok and error tuples and all of that stuff; that's where you do it. What that means is that I don't make a local decision. I expected the connection to break, but the smart decision about what to do with that broken connection belongs to the caller. You absolve yourself of the responsibility, you bring it up to the level above, and they decide whether they want to crash or handle it, and the worker is fine. If you do that, you bring more support to your supervision tree.
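A minimal sketch of "crash on the unexpected, return the expected as a value"; DBConn and its query function are made-up names:

```elixir
def fetch_order(conn, id) do
  case DBConn.query(conn, "SELECT * FROM orders WHERE id = $1", [id]) do
    {:ok, rows} ->
      {:ok, rows}

    {:error, :disconnected} ->
      # Expected in the cloud: pass it up so the caller decides what to do.
      {:error, :disconnected}

    # Any other shape is a bug or an unknown condition: no clause matches,
    # the process crashes, and the supervisor takes over.
  end
end
```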
The other pattern is dead letter queues, and I love this painting. I think it's the one with the Cossacks writing a reply to the Ottoman Empire that's nothing but insults: dear server, choke on that. Dead letter queues matter especially when you have clients you cannot send information back to. If you're consuming from a queue like RabbitMQ or Kafka and you're expected to shut up and do the work, there will be times when someone sends you wrong data. Maybe it's a bug in the client's software, maybe it's an operator with fat fingers who just broke something, or maybe another part of the system is on a newer version of the software than the one you're on right now. This is one of the realities of distributed systems: you cannot know what the future version is going to be, but you have to expect that it's going to run right next to you at some point. So it's possible you get messages you don't know how to handle. The way to deal with that, usually, is that you try a few times, you let somebody else try, and if you have a counter that says it failed ten times, you give up. Either you throw the request on the ground, or, if you're handling something critical like, I don't know, financial transactions, you need something called a dead letter queue. The dead letter queue is your system saying, I'm not smart enough for this, I need to call an adult. It puts the thing in the queue, it pages someone, a human looks at the data that the stupid system obviously couldn't handle, and then they deal with it for you. So dead letter queues are your system saying "I need an adult" without stalling entirely. There are cases where you do want to stall entirely: if what you're writing is a database and you need to replicate all the operations in order, by all means don't drop a random one in the middle, you're just corrupting data; you had one task, one thing to do, don't break it. But if you're really working with a system full of uncertainty, then a logged trail of the things you had to drop is usually worthwhile.

The last one is slowing down, and slowing down usually means exponential backoff. I mentioned the thundering herd a bit earlier; this is what you plan for here. When the system starts going wrong, you try again really rapidly at first, maybe every millisecond, and then you start waiting longer and longer, maybe ten minutes, half an hour later. It makes recovery slower for some clients, but it makes everything so much easier for everyone involved. The thing to be careful about is something similar to sympathetic resonance: you know that thing where military people all walk across a bridge in step, and everything crashes and topples over? That happens with timers. If you set a timer to just back off by one second every time, even if they all start at different times and you have an even load on your system, you get a fun little phenomenon by which all the timers end up synchronized, even if they weren't at the beginning. So proper backoff libraries usually support a jitter mode, where each timer gets a little variation that makes it unpredictable but means they never all get synchronized, because otherwise, yes, you do totally get a thundering herd out of it again.
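A minimal exponential backoff with jitter, in the spirit described; real backoff libraries package this up, and the base and cap here are invented numbers:

```elixir
defmodule Backoff do
  @base 100              # first retry after ~100ms
  @cap  30 * 60 * 1_000  # never wait more than 30 minutes

  # Full jitter: pick a random delay in [0, base * 2^attempt], capped,
  # so independent clients never end up retrying in lockstep.
  def delay_ms(attempt) do
    ceiling = min(@cap, @base * Integer.pow(2, attempt))
    :rand.uniform(ceiling + 1) - 1
  end
end
```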
Then you have circuit breakers. Circuit breakers are really, really cool in that you can use metrics that are not outright failures to keep your stuff working. If you're doing queries to a database or a web service, you might go, every time the query takes more than one second, things are getting a bit tricky; and if too many of those happen in a given period of time, you go, you know what, we're probably starting to kill the service right now. Cool off, stop all the operations, stop calling the service; we'll try again in five seconds and see if it's better. Circuit breakers are a way to do that, and it's pretty interesting: it lets you regulate the system before it goes bad. I mean, it has the name of an electrical circuit breaker, which is there to prevent you from burning your house down, so it's probably a well-chosen name.

The other question is: how do you know that your supervision tree is right? We have a structure we think is fine, we're designing the system to make sure it fits what we have, but we don't know that we actually did it right. What I like for that is chaos engineering. This is the Chaos Monkey they run at Netflix, where it just kills random servers and sees what happens. The interesting thing with a supervision tree is that instead of just random servers in a cloud, we have a very strict structure, so we can have really precise expectations: if I kill that worker and that supervisor, all its workers should be dying; unless I've hit some rate limit here, the Kafka worker should stay alive; and I can probably lose some of the workers over here, but that supervisor for the pollers should still be alive. So to test that the supervision tree behaves the way I expect, there's nothing cooler than property-based testing.

So I have a demo; I worked very hard on this slide. It's going to be Erlang code, so, yuck. This is what a bit of property-based testing looks like; I did it in a hurry. These two lines, eight and nine, are just shutting up some of the logs, because crashing stuff repeatedly is always very noisy. It starts the application, runs some commands, and outputs the results; that's all it does. And this is what the state machine for the generation looks like: I have a bunch of commands I can run, and here I only have one, which marks a process as dead. I wrote a little library that takes a snapshot of the supervision tree, kills one process at random, and then shows you what it looks like.

Oh, that's not the right thing; I forgot to increase the text size, let me do that. Actually, it's probably better if you don't see it at first. It's going to be tiny, tiny text, but I'm going to show you the kind of thing it does when it takes a snapshot of a supervision tree. This is the 20th birthday of Erlang being open source, so I think it's apropos that I use a terrible, terrible terminal window to test this stuff. Yeah, no tab completion, great. It generates a really big snapshot of the tree, with all kinds of annotations. I don't want us to actually read it; it's pretty gnarly and terrible. Let me just change the size. I'm eating into the next talk's time; they're going to be glad about that.

All right. So if I just run it as it is, it's not a very fast test, but it starts showing a bunch of failures. I'm killing stuff; that's normal. It's really noisy output; I haven't figured out how to make it silent and pretty, and Erlang stuff is rarely silent and pretty, I guess, so it's not that big of a deal. It's going to run, and eventually what it finds in my little workers is that with this call here, when the DB worker dies too often, this other worker also suffers and explodes. And maybe it has exploded by now... oops, yes, it has. It dumps the state: I expected these children to be there, and they are not. There's a whole history you can comb over that I still have to figure out how to make readable, but it works. And so you could just decide to go in here, catch that case, and make it work.
The other approach we can support with the tooling is to tell it, for example, and I have the thing right here, to put in a filter and say: I don't want to kill the database. Instead of that, what I expect to happen is that the database, in case of trouble, will not die; it will be disconnected. So I've written that little thing where it filters on the database: if a process is not tagged as a database worker, you're free to kill it; if it is the database, I want it to call, from time to time, a mocking function that returns the error "disconnected." This simulates a more realistic failure of what I'd see in my system. I run it, with the system running a kind of small simulator that just keeps it busy, and it should die really, really rapidly. And there you go. If we look at some of the stack traces, which I know all of you find eminently readable, there's something like a badmatch on the error "disconnected," and I know it's in my worker on line 20. I go into the worker on line 20, and what I need to do is change it to handle that kind of failure, because now I know the database might fail that way from time to time. So where the result was just matched as ok: if I have ok with whatever, I go, this is fine; and if I have the error "disconnected," I go, well, retry later or something. It's not a real value; nobody cares. And if we run that system again, now the only thing that shuts down is the application itself when it's done. It's going to take something like five minutes to run the suite, because at some point it has sequences of 40 operations that need to propagate, so we can leave it running and see what it does when we're done.

But I'm doing chaos engineering on my own machine; I don't need to deploy it in the cloud. And, oh crap, that's not the right slide; it started from scratch, let's get to the right place. What are these slides... I don't like PowerPoint all that much; I'm using it because it's the corporate template. Thanks to Genetec for sending me here. So yeah: chaos engineering, we're running it right now, we don't need to deploy to production. We're doing fault injection, and the supervision structure lets us validate what is happening without knowing much of anything about the application. All I need to do in some places is put tags in the tree that let me say, I don't want that process to die, I don't expect it to die, it should be up all the time; or, if it shouldn't die too often, you can make PropEr or QuickCheck do something like kill that process only once in a while, and most of the time just simulate failures without killing it. And so you get a system that, hopefully, is super reliable.

What's interesting is what you can do now that you know you have a supervision tree that really represents the health of your system. What I like to do with that is take the circuit breaker pattern and apply it directly to the entire system. For example, all the stuff you depend on to boot: if you're writing financial transactions, you might want an audit log, and if something can't reach the audit log, you're not allowed to do any kind of buying. Or if you're in ad tech, because there are a lot of people in ad tech, you know that if you are not able to track the buys you make for an advertisement, you don't want to make them, because then you don't track the spend properly. So you might boot your system and say: if the database is not there, or my spend-tracking system is not there, I don't want the system to boot; I need that configuration to be there in the first place. If you're writing a router or a proxy, you might want something like the firewall rules you're supposed to implement loaded before booting: if you boot a firewall without any firewall rules, you're probably doing a terrible job of being a firewall. You can put all of that on the left side of the tree, put your app after it, and it will ensure that all the required stuff is there before you start.

Then on the other side, on the right, you can put all the stuff that communicates the health of your system: answering positively to a health check, or registering yourself with some kind of service mesh or discovery process. You can do it through the supervisor, and when the process dies, you automatically unregister yourself. What happens is that purely through the encoding of your supervision tree, you register yourself with other dependencies, you refuse to come up without your own dependencies, and as the system recovers, it automatically heals its status with the rest of the system. That's freaking cool.
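A sketch of that whole-system circuit breaker idea, with hard requirements on the left so boot fails without them, and health registration on the right so it comes up last and is torn down first. All names are invented:

```elixir
children = [
  MySystem.AuditLog,        # crashes the boot if the audit log is unreachable
  MySystem.FirewallRules,   # loads required config before anything serves traffic
  MySystem.Core,            # the actual work
  MySystem.HealthReporter   # registers with discovery in init/1; unregisters
                            # in terminate/2 whenever something above it dies
]

# rest_for_one: if Core crashes, HealthReporter is restarted too, so the
# node's registration always reflects whether the things it needs are up.
Supervisor.start_link(children, strategy: :rest_for_one)
```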
And then you can try to see how you could apply that pattern to other architectures. It works in Erlang, but it probably works for microservices too. This is a diagram from Cindy Sridharan, copyconstruct on Medium, of what the microservices looked like at a previous job of hers. If you flip that image 90 degrees, it looks like a supervision tree. In most cloud systems, what you have is just a list of nodes floating in the void that all have to be there; what they don't have is the kind of dependency ordering that the supervisors you build give you. And here's a really, really simple failure I've seen in the past: two systems end up depending on each other in a chain without anyone knowing about it, and if the entire cloud goes down, you cannot boot it back up, because over time you implemented a circular dependency. A supervision tree prevents that, because it doesn't have cycles, for example. But how cool would it be if, when B is supposed to have data, A automatically knows not to register itself to receive traffic until there's some baseline level of health needed to serve requests? That's what is really cool about applying these architectural patterns, which are entirely local to your programs, to a much broader extent and context.

So that's the TL;DR, or TL;DL, too long, didn't listen; you can get it when I publish the slides. Image credits: it's mostly paintings. And yeah, Genetec: we're hiring, but only if you're in Montreal, Quebec City, or Sherbrooke, so unless you plan on immigrating, chances are slim. But we're still open to people looking to relocate to Quebec, the province. Thank you.