Alright, originally there was supposed to be a talk here about the Metal Cubed project, but unfortunately the speaker couldn't make it. As a secondary option we had another talk lined up, and the second half hour of that slot was going to be covered by someone from my company. However, that person had family issues and couldn't make it either. Therefore, I would like to introduce our next speaker, Walter Heck, who will now present this session.

Thank you for the introduction. Before we get started, I would like you all to get up, please. Everybody stand up. Now, if you have never brought down a production infrastructure or a production piece of software, you can stay standing. Everybody else, get back in your seat.

My talk is about designing for failure, and this session is itself an example of failure: you can never know when or why failure will happen. As I said, I am the secondary backup for this session, so if it's a little rough, bear in mind that I was only notified yesterday that I would be giving it.

So, designing for failure. It seems like something very simple, something you would do by default, but many times when you try to explain this, the answer is: we're not designing for failure, we're designing for working software, so just make sure that it works. However, we all know the world is unfortunately not that simple, and the more complex your piece of software is, the more likely it is to fail at some point. I learned this lesson a very long time ago from someone I was doing freelance work for, Arjen Lentz from OpenQuery. If you're watching, Arjen, thank you for teaching me this lesson. Given enough time, failure is definitely going to happen. It's not an if question, it's a when question. And therefore, if you're not designing for that failure to happen, you are setting yourself up for failure. If you're lucky, however, you have a couple of ways to never encounter the failure yourself.
Option A: you move to a different job or a different company before that failure happens. Option B: the lifetime of your software, infrastructure, or whatever piece of engineering you've designed is shorter than the time it takes for it to fail. However, as we've just seen, there were maybe 10 people still standing after the question, so most of us have seen failure more times than we care to remember. I actually did `rm -rf /` on a production server somewhere in the beginning of my career. Not a proud moment, but a good learning experience.

Failure happens in real life lots and lots of times. Since I heard yesterday that I was going to give this talk, I've been looking at the real world around us and how many things are actually designed for failure. A very famous example of not designing for it is the Tacoma Narrows Bridge, built in the 1940s. They forgot to account for a specific wind angle and speed; right after the bridge was finished it started wobbling and eventually collapsed quite badly. On the right we see the Millau Viaduct in the south of France, a giant spanning bridge. As you can see, it crosses a valley, and it's designed for hurricane wind speeds that have never been seen in that area. So it's much more resilient, designed for failure. There are many examples of where we design for failure.

In software and IT in general we also have lots of things that are designed to at least partially fail without immediately causing problems. Two examples. RAID technology is designed for one or more disks to fail without any data loss. And on the right-hand side we see QoS, Quality of Service. These days, with the growth of bandwidth and the stability of networking, that's not something many of us encounter very often anymore, but it used to be much more important. Basically, it says we have different types of traffic, and there are different priorities for that traffic.
At the bottom we see web, email, and file transfer traffic. As important as it might seem that you can load your Reddit front page, it's actually much less important than the voice traffic at the top. So within the networking world, Quality of Service tries to make sure that at least audio can continue even when other types of traffic can no longer get through.

Recovery-oriented computing is quite an interesting project, started by some very smart people over at UC Berkeley. I didn't have enough time to fully dive into it, but the website is full of interesting papers about recovery-oriented computing, which means that instead of assuming everything will always work, we talk about how we recover from failure and how we make sure that a failure is not immediately a disaster. For instance, by having not one but two backup plans for this room for this half hour, you get a degraded user experience, which is me, but that's still better than looking at an empty room with no speaker. Recovery-oriented computing dives into the concepts related to this subject.

Designing for failure can be done in many different ways. Failure is, unfortunately, ubiquitous, and we have to deal with it. In code, we look at things like exception handling. There are some programming languages that don't have exception handling, but most languages allow you to handle exceptional situations. Very often this is not done correctly, but it's something you should definitely think about. Fault tolerance and isolation: how do we make sure that when something is not running the way we expected, the rest of our system is still functioning? Fallbacks and degraded experiences, which I just discussed. Autoscaling: what if we have more traffic than we originally assumed? How do we deal with that?
In our cloud computing world this is a much easier subject than in many other environments, but autoscaling is something that can be very useful for dealing with these things. Lastly, redundancy. We've all hopefully heard of the term single point of failure. Try to reduce the number of single points of failure you have by introducing redundancy. A fun exercise I always like to do is to look at an architecture and pinpoint the single points of failure. There will always be some, even if the entire system is a single point of failure; the question is how much it is worth to you to make sure that that single point of failure never fails.

A few examples. S3 has 11 nines of durability, which means that if you have 10,000 objects, you can expect to lose a single object roughly once every 10 million years. Which is great, except that it comes with only 4 nines of availability, so that doesn't mean your objects will always be available. It just means they will be persistent and eventually be available again.

From SRE theory: hope is not a strategy. I think we've all seen one or more of those situations. I work in consulting, infrastructure consulting, so I see a lot of places where people say: oh, we just hope this never happens. Unfortunately, hope is not a strategy; I quite like that motto. We cannot just hope that that one thing will never happen. Given infinite time, it will definitely happen; you just might get lucky. And to show that designing for failure is not easy: a company like Google has 100,000 employees and still suffers failures. So if you're a small company, don't feel too bad. Failure happens, and it's good to be able to deal with it.

The person who taught me about designing for failure also said that in that consulting company we didn't do any emergency support.
The idea was that we told clients: we don't do any emergency support, because we build everything in a highly available fashion, so that even if a failure happens we don't have to wake up and spring into action immediately. We can deal with it the next day. That was specifically MySQL consulting, and in the MySQL world it's relatively easy to make sure that when a failure happens, it's not immediately a disaster.

One of the things I never liked is people asking me: what kind of SLA do you guarantee? I don't know, something less than zero downtime? It is almost impossible to reach zero downtime, and if you do reach it, you are incredibly lucky. SLAs are important in the business world, but they mean very little to engineers. An SLA simply tells us how badly we're going to get scolded when we break it. Simply put, zero downtime is not reasonable. It's something you can strive for, and if you're lucky you can maybe achieve it over a short period, but in the long run it's nearly impossible.

In the infrastructure domain, to put it a little closer to the topics we're talking about today, we have lots of different examples of dealing with failure so that we can actually continue, and not have a disaster when a failure happens. Fault tolerance: you can easily deploy a load balancer in front of your servers, which should help with improving uptime and making sure availability is good. High availability: a load balancer in front of a single server can only do so much. It still doesn't allow that single server to fail, but it does allow for a degraded experience.
The load balancer will simply return a 503 or something similar to indicate that there are no back ends available to serve traffic, but that's still better than a service trying to reach another service and getting no answer at all. Resilience: how can we make sure that we adapt to a situation based on load, as I said before? In the cloud world this is very easy with autoscaling policies. But even when I say very easy, very often it's not actually that easy, because it implies a whole bunch of design constraints on the software you are trying to autoscale. Not the least of which are readiness and liveness probes, sometimes referred to as health checks.

When a server is up, that does not necessarily mean it is working. Those are two entirely different things. My laptop is up; it's just not working. For instance, and I hope you've never seen this, but I've seen it more times than I care to remember: a server that is technically still running, except some stupid log file decided to run away. Well, actually, some stupid administrator decided not to configure things properly, which made a log file fill up the disk. The server is still running, but it's not serving any traffic. A simple health check in the application running on that server can very easily determine: hey, is this server ready to serve traffic? And is it still able to handle incoming requests? Those are actually two different questions, especially when you're talking about autoscaling. When a server first comes up, the fact that the OS is up and network traffic flows does not necessarily mean the application is able to serve traffic. So you want a readiness check that verifies the server is ready to serve traffic. The liveness probe, on the other hand, matters over time, when you're wondering: is this server still able to handle incoming requests?
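To make the readiness/liveness distinction concrete, here is a minimal Python sketch. The function names and the disk-space threshold are illustrative choices of mine, not something from the talk's slides; readiness gates whether a new instance should receive traffic at all, while liveness catches the full-disk scenario on an instance that is already serving.

```python
import shutil

def readiness_check(app_initialized: bool, db_connected: bool) -> bool:
    # Readiness: is this instance fully started and able to take traffic?
    # Only then should the load balancer / autoscaler route requests to it.
    return app_initialized and db_connected

def liveness_check(disk_path: str = "/",
                   min_free_bytes: int = 100 * 1024 * 1024) -> bool:
    # Liveness: is this already-running instance still healthy?
    # Here we check free disk space, the runaway-log-file scenario:
    # the OS is up, but with a full disk the app can't serve traffic.
    usage = shutil.disk_usage(disk_path)
    return usage.free >= min_free_bytes
```

An orchestrator would typically poll both periodically: a failing readiness check keeps traffic away, while a failing liveness check triggers replacement of the instance.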
Maybe the disk indeed ran full and the server is no longer able to handle incoming requests, and the liveness probe can tell you: hey, this server is not healthy anymore, let's replace it with another one.

Some example technologies: Corosync and Pacemaker are tools I'm quite familiar with. MySQL Galera is quite an interesting tool for running multiple MySQL servers where one can easily die without the others being affected. Cloud-based infrastructure: in this example, DNS can fail our traffic over to a static site if the original back end is no longer working, which means a degraded experience, but at least we can show some kind of content to the end user. Multi-AZ and multi-region setups in the cloud also let you design for failure. I'll talk a little more about it in a minute, but basically it depends on how much money you want to spend for how much failure you want to account for. I always tell people not to overdo this. If you're Netflix, it's really cool that a whole region can go out and nobody has to wake up. If you're not Netflix, you're probably okay with some kind of simpler setup, because it will cost you a whole lot less money. I should be preaching to the choir about these things, but as I said, I work in consulting, so I've seen more than one environment that didn't make me super excited.

To give an example of failure in the software domain: at the top we see a snippet that connects to MySQL. If the MySQL connection fails, we store the fact that the database is not working at the moment, and if it's up, we store that too, so that in the snippet at the bottom we can check whether the database is up or down. The reason you want to store that is that if you just tried the connection on every request, it might fail slowly, because the connection attempt hangs. So instead, asynchronously check whether it's possible to connect to the database.
If it's not possible to connect to the database, then, very simply, designing for failure is sometimes nothing more than having an if statement: hey, I still give a response, but I don't actually try to connect to the database. Without that check the request would probably hang, causing a much worse experience than immediately responding with an error message when the database is clearly not working.

In software design there's also the circuit breaker pattern, which is relatively easy to understand for those of you who know something about electronics. The circuit breaker is normally in a closed state, which means electricity, or in our case traffic and logic, can flow freely. If something goes wrong, in the electrical engineering world we open the circuit breaker so that power can no longer flow; in the software world, you open the circuit breaker so that nothing can continue anymore. Then you can probe it with, for instance, a single request or query, depending on the system you're designing the circuit breaker for, and if that works, you close the circuit breaker and the application can continue to work again.

CI/CD, quite an interesting one. I've seen too many CI/CD pipelines where the CI part is easy, a bunch of tests, everything's fine, but the CD part of CI/CD can fail in 723 spectacular ways, and that's rounded down by a lot. I've seen more pipelines than I care to remember break in spectacular ways because, for whatever reason, the environment you think you're deploying to is not actually in that state at all. It's good to design for that in your CI/CD pipelines too, specifically if you want to be able to recover when a pipeline fails.
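The circuit breaker described above can be sketched in a few lines of Python. This is a generic illustration of the pattern, not code from the talk; the threshold and timeout values are arbitrary, and the half-open state is modeled by letting one trial call through after the reset timeout.

```python
import time

class CircuitBreaker:
    # Closed: calls flow freely. Open: fail fast without calling.
    # Half-open: after a timeout, let one trial call through; if it
    # succeeds, close the breaker again.
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, allow this one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None  # trial succeeded: close again
            return result
```

Wrapping the database call from the previous example in `breaker.call(...)` would give you the fail-fast behavior and the automatic retry-after-a-while in one place.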
A failed deployment has a tendency to leave things hanging in the middle, and if you don't have an easy way to recover from such a state, you're looking at manual work, which is not necessarily something you want to do.

On the right side we're talking a little bit about chaos engineering. If you've never done this before, try to break your non-production environment first; that's going to get you a lot fewer angry faces than breaking your production environment. Try to see how your environment deals with failure: turn something off and see what happens. Once you're more confident in breaking your environment, you can try this on production and see if it survives. There is a well-known suite of chaos engineering tools from the folks at Netflix that deals with deliberately breaking environments in production. Make sure you do this during work hours, because on a Tuesday at 11 o'clock in the morning you're around, you're prepared for the failure, you're able to deal with it. That's a lot better than on Christmas Eve at 3 a.m., when you hopefully want to sleep.

The notion of an error budget is something that can be quite useful. Instead of always imposing very strict requirements on when you can deploy, how you can deploy, and what you can change in your environment, it can be good to have an error budget, which basically says: as long as availability stays above 99%, we are free to do whatever we want to the production environment, because clearly we have a proper handle on the production infrastructure. The moment we fall below 99%, a bunch of additional checks kick in. These levels, and the constraints, are obviously up to you. You can say: when we get below 99% uptime, we will no longer deploy automatically, and all deploys need to be approved by a person.
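An error-budget gate like the one just described boils down to a comparison against the SLO. A minimal Python sketch, using the talk's example 99% threshold (the function name and the minute-based measurement are my own illustrative choices):

```python
def deploys_allowed(successful_minutes: int, total_minutes: int,
                    slo: float = 0.99) -> bool:
    # Error-budget gate: automatic deploys stay enabled while measured
    # availability is at or above the SLO; below it, the team falls
    # back to manual approval until the budget recovers.
    availability = successful_minutes / total_minutes
    return availability >= slo
```

In practice you would feed this from your monitoring system over a rolling window (say, the last 30 days) and wire the boolean into the CI/CD pipeline's deploy step.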
That way, during normal operation you won't put too many restrictions on your team; however, if things don't go well for a while, you fall back to a more cautious way of working.

If you're looking to get started with designing for failure, one thing you can do is look back at the last X times your environment failed, let's say 10, and ask yourself: could this failure have been a better experience for the end user? Instead of the user getting a completely non-working system, if the search engine failed, could we have just disabled search on the website? If the caching layer fails, is the user okay with a website that loads a bunch slower, instead of getting an error message because the caching service is not available? That can give you a good idea of where to start designing for failure. Another thing you can do is look at your biggest risks: what are the things that would cause the biggest disruption, and how can you design the end user experience around them? Because that's what you're really doing when you're designing for failure: making sure the end user experience is as good as possible.
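That cache-layer example can be sketched as a simple fallback chain. This is my own illustration, not the talk's code; `cache_get` and `db_get` are hypothetical callables that raise an exception when their layer is down.

```python
def render_page(cache_get, db_get, key: str) -> str:
    # Degraded experience: prefer the fast cache, fall back to the
    # slower database if the cache layer is down, and only show an
    # error page if both fail.
    try:
        return cache_get(key)
    except Exception:
        pass  # cache down: the page just gets slower, not broken
    try:
        return db_get(key)
    except Exception:
        return "sorry, this page is temporarily unavailable"
```

The point is exactly the one from the talk: a cache outage degrades the experience (slower pages) instead of destroying it (error pages).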
Let's say the database is not working. That would be a problem, depending on what your application is doing, but maybe you can still serve some kind of page that doesn't require a database. Or maybe only writes to your database are failing, so maybe your application can still run read-only. If you want to do this, you can work with increasing levels of detail: start with small things and iterate from there. It's basically a never-ending exercise, so don't expect that you can run a single project and have your failures dealt with forever. Designing for failure should be in your workflow, in every thought you put into an architecture, and from there you iterate.

Oh, I'm over time. ("Walter, you have minus three minutes.") Thank you for the notification. I'll skip this last slide then. Thank you very much for your attention.