And this is Architectural Patterns of Resilient Distributed Systems. I have a pug, and I really like him, so you may see some photos of dogs that look like mine. So who am I? Oh, that's one I found in the street; I live in San Francisco, and once I ran into a lady with a very small one and went crazy for it. My name is Inés. I'm a distributed systems engineer at a company called Fastly, I go by @randommood on Twitter, and I also help run the San Francisco chapter of Papers We Love. Here in Spain there's a Madrid chapter; there are chapters in many cities. The organization aims to bring academic research closer to practitioners, so I'd really encourage you to join your local chapter. Let's get started. I mentioned that I work for a company called Fastly. If you don't know what Fastly is: it's a content delivery network, and it works this way. You bring your content to us, we spread it all over the globe so it sits closer to your users, and the web experience gets much better. But we're much more than that: imagine a globally distributed cache layer that you can also program, with real-time logging, and it's very nice. I help make sure that your bits make it there. So give us a try. This is what we're going to talk about today. We'll start with a little motivation, what led me to explore this question; then what we can find in research; then how industry maps the concepts we find in research; and then conclusions, where I get to ponder about the world, and maybe you agree with me and maybe you don't. Okay, let's define resilience, or what I mean by it: resilience is the ability of a system to adapt and keep working when changes occur.
So when situations arise that are planned, or, most of the time, unplanned: how well does our system adapt to those situations, and how much progress can we make when things are not the way we planned them? Let me define this a little more. I'm going to toss a lot of other things into the resilience bucket as well: fault tolerance, the scalability of a system, failure isolation, and also complexity management. That's what I mean when I talk about resilience here. And why do I care about it? Because it's the thing that matters, right? We can have a very nice UI and a very nice product, but if our product is not up, if our system doesn't keep making progress, none of that matters. This came to me a few years ago, when I became a distributed systems engineer. I had been exposed to research before, but when my perspective changed, all of that research became new to me, and all of these questions became things I pondered on a day-to-day basis. The thing I want to know is: how do I construct more resilient systems now? Sometimes when you go to a new job you inherit pre-existing applications, and you can see some of the patterns and some of the problems those applications have. And sometimes you start a new system and make all of these decisions yourself, knowing that some poor soul is going to have to live with the consequences of everything you pick today. That is an interesting thing, right? You could be making somebody's life a living hell two years down the road if your system is successful.
So I wanted to make sure that my systems did not cause a person to swear at my past self; I'd like to be able to run into that person on the street and say hello. So how do we construct more resilient systems? Let's think about it in terms of the literature. Normally when I'm confronted with a problem, I go and see what I can read about it and how other people have solved it, then contrast that with what is happening right now and try to mesh the two together. From the literature, we're going to cover three models that really shaped my thinking about resilience. The first model is an oldie but a classic: the Harvest and Yield model. How many of you are familiar with this paper? Can I see some hands? One? Okay, cool. This is great, because I thought I'd be up here telling you something you already knew, so this is fantastic. This paper came out of Berkeley in 1999, and it formalized a lot of the concepts we use right now to build our applications. Some of this will sound obvious, but the patterns we use now were first formalized in this paper. It has two concepts used to describe the behavior of a system. The first is yield, which is the fraction of queries that are answered successfully. Also, I tend to speak very fast and I don't know if people are following along, so somebody here, give me a visual cue if I'm speaking too fast. Is everybody with me? Is it clear why I care about this? Okay, so we know why I give a shit about this.
So we know there are going to be models from the literature (and by literature I mean research), and the first one is Harvest and Yield, which addresses a specific aspect of system design. We have two important concepts. The first is yield, which is kind of a funny word, but it means how much information I get back from my system. You can think of it as uptime, but it's not uptime: it's focused on your experience as a user of the application rather than on how available the application is. Say, for example, that you're shopping during Christmas, because Americans do a lot of their shopping for Christmas; it practically defines who they are as people. Imagine a system like Shopify being unavailable for two minutes during the Christmas season, versus the same system being unavailable for two minutes at 3 a.m. in the middle of a weekend. This is why yield isn't uptime: the same downtime has very different impact on your user base. Everybody understands yield now, right? All right, let's see what harvest is. Harvest is a different perspective on the same problem: harvest is the fraction of the complete result that you actually return. Here's an example to illustrate the point. You know that I like pugs, and say I want to search for cute baby animals. In this particular application we have a sharded database: server A has things that have been tagged cute, server B has things tagged baby, and server C has animals. If I want the cute baby animals, I send a query to all three servers and get information back from the three of them.
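The yield-versus-uptime distinction can be made concrete with a toy sketch (my own illustration, not code from the paper): yield weighs each request, not each second of wall-clock time, so two outages of identical length can have wildly different yield.

```python
def yield_fraction(requests_per_minute, outage_minutes):
    """Yield = successfully answered queries / total queries."""
    total = sum(requests_per_minute)
    failed = sum(requests_per_minute[m] for m in outage_minutes)
    return (total - failed) / total

# One hour of traffic with a Christmas-style spike in the last 2 minutes.
traffic = [10] * 58 + [1000, 1000]

peak_outage = yield_fraction(traffic, outage_minutes=[58, 59])  # during spike
quiet_outage = yield_fraction(traffic, outage_minutes=[0, 1])   # off-peak

# Uptime is identical in both cases (58 of 60 minutes up), but:
# peak_outage  is about 0.22 -- we dropped almost 80% of the queries
# quiet_outage is about 0.99 -- barely anyone noticed
```

The two minutes of downtime are the same; what differs is how many user queries landed inside them.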
So say, for example, that the server holding the things tagged baby is down — the server, not the babies; the babies are fine. I still get something pretty good, right? I get cute animals, and maybe some of those animals happen to be babies. I can still make progress, and my system can still be responsive even though I don't have the entirety of my dataset. In this particular case it makes sense: we have 66% harvest, but I still respond and I still make progress. So we understand yield and we understand harvest now. What this paper gave us is two models for thinking about availability. The first is probabilistic availability: how do we degrade, how do we respond to failures? This is where the concept of graceful degradation came from. This paper is really nice; I think you should all read it. The idea is that we keep making progress, choosing either probabilistic availability or the next pattern. For the first approach, they give a few mechanisms. The first is randomness: in my earlier example, information is spread randomly across different servers, which is good because it reduces the chance of any particular piece of content being unavailable. The second is replication: you make copies, and if one copy is down you can still return another. This is why the authors call it probabilistic availability. And sometimes you can even degrade results based on client capability: if you have a bad internet connection and you're trying to watch a video, the video may come in at lower quality. That is less harvest — less information, less richness — but you can still watch it. All right, that's approach number one. How are we doing with the speed? Okay?
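The cute-baby-animals example can be sketched as a scatter-gather query that tolerates a dead shard and reports the harvest of the degraded result. This is my own minimal illustration of the idea, not code from the paper; the server functions are hypothetical stand-ins for the three shards.

```python
def search_cute_babies(shards):
    """Query every shard; degrade gracefully if some are unreachable."""
    results, answered = [], 0
    for shard in shards:
        try:
            results.extend(shard())   # fan out to each shard
            answered += 1
        except ConnectionError:
            pass                      # skip the dead shard instead of failing
    harvest = answered / len(shards)  # fraction of the complete result
    return results, harvest

def server_a(): return ["cute kitten", "cute puppy"]      # tagged 'cute'
def server_b(): raise ConnectionError("baby shard down")  # tagged 'baby'
def server_c(): return ["capybara"]                       # tagged 'animal'

hits, harvest = search_cute_babies([server_a, server_b, server_c])
# hits    -> partial results from the two healthy shards
# harvest -> 2/3: we answered with 66% of the data instead of erroring
```

The design choice is exactly the paper's trade: return a partial answer (lower harvest) rather than fail the query (lower yield).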
Yeah, good. All right, so normally what happens is that I start running out of time and have to pick up the pace at the end, so we're going to go through this together; we're doing well so far. The second approach is composition and orthogonality. This means you don't have to solve all of the problems in the same application: you can break it apart so your subsystems deal with problems in a much more isolated way, or handle concerns that are completely independent. The example they present in the paper is dealing with security, like certificates: maybe you can use a library that does encryption for you, so you don't have to build encryption into your application itself. This is what they call orthogonality. So that's nice: we have one approach that gives us probabilistic availability, and another where we break systems apart into components that each handle one specific thing. But I think the true contribution — and it's strange, because as I go along I keep going back to this paper and picking up things I missed the first time — the main takeaway for me is that whether your system favors harvest or yield is an outcome of its design. If you don't choose one way or the other up front, it's very hard to put in later. That, I think, is the contribution of this paper.
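Orthogonality by composition can be sketched as a wrapper: the store knows nothing about encryption, and the encryption layer knows nothing about storage, so either can be tested, replaced, or fail in isolation. This is my own illustration (with hypothetical class names), and base64 stands in for a real cipher purely to keep the sketch dependency-free; in practice you would delegate to a crypto library, which is exactly the orthogonality the paper describes.

```python
import base64

class MemoryStore:
    """A plain key-value store; it has no idea encryption exists."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data[key]

class EncodedStore:
    """Wraps any store with an orthogonal encode/decode layer."""
    def __init__(self, inner):
        self._inner = inner
    def put(self, key, value):
        self._inner.put(key, base64.b64encode(value.encode()))
    def get(self, key):
        return base64.b64decode(self._inner.get(key)).decode()

store = EncodedStore(MemoryStore())
store.put("pug", "very cute")
round_trip = store.get("pug")  # the wrapper is transparent to callers
```

Swapping `MemoryStore` for a disk- or network-backed store, or the toy encoding for real encryption, touches only one component at a time.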
All right, the second model comes from a completely different area; I'm going to call it the Cook and Rasmussen model. Richard Cook is a physician who talks a lot about system safety. I got a chance to meet him at a conference this year, and I was very weird about it — "I've seen all of your talks and read your papers" — and he was pretty weirded out, but I really like him. He also wrote a very good paper called How Complex Systems Fail. So here is the model. You have this space, and in the middle of it is what is called the operating point, and the operating point is always in motion. On one side you have the economic failure boundary: things that are not economically feasible for you to do. On another side you have the unacceptable workload boundary: if something is very tedious to do, you're likely to violate that boundary. And then there is the accident boundary, where things go really bad. Different pressures get placed on the operating point. There is pressure toward efficiency — we want to save money — which moves the point closer to an accident; sometimes we want to do less work, so we cut a corner. And sometimes, when those pressures apply, the operating point crosses a boundary and we have an incident. An incident is never fun, so you decide to get your shit together — "we need to make sure we do things correctly" — and that pressure to add more safety brings the operating point back toward the center. This happens over and over, and what it ends up creating is what's called a marginal boundary, which defines how close you are to having an error.
So this sounds very nice in principle, but we know we like to cut corners, and we know we sometimes don't test things as thoroughly as we should. There's something that happens when you're close to this boundary, and it's called flirting with the margin. Say we have an estimate, an intuition, of what we need to do to avoid another problem, keeping us on the safe side of the marginal boundary. Then one day we don't do as much as we thought we would, and we don't have an incident. Maybe we freak out about it — "tomorrow I'm going to enable my tests again" — but the last time a test didn't pass I deleted it, and we were fine, so we might as well keep doing this; I'm getting really comfortable in this area. What ends up happening is that I redefine how close I am to the error boundary without even knowing it. This is what normally happens when we start doing this: we have the illusion that everything is fine, but we never really know how close we are to the accident boundary, and we keep pushing the limit. The insight from this model — this notion of how close we are to an incident or an error — was very important to me, because it frames resilience as a factor of how you operate a system.
It's a factor of how you explain the system to people who are new, how you respond when an incident happens, what happens when you have to change parts of your application, how you learn from your mistakes, and how you keep redefining how close you want to be to this error boundary. Safety isn't only about what can happen; it's also about what you do when bad things happen. This is the operational perspective — the previous model was a design perspective — and Cook reminds us that how we operate our applications matters enormously for building resilience. He also tells us how we can engineer for system resilience. We want to build support for continuous maintenance: maintenance is going to happen the moment a system goes into production, and if the system is doing anything worth doing, it will become a critical system in your organization — something we need to keep safe in order to keep making progress. (How am I doing on time? 42. Okay.) The other main point is that we should give control of our applications to the operators. Operators are going to do crazy shenanigans with your applications, and you can't prevent that, so you might as well expose the knobs and dials, let the operators do whatever they're going to do, and then react and put safeguards in place as problems come up. This is how Cook and Rasmussen tell us to think about resilience in our applications. They also say we should think about configurations and interfaces, because people want to be able to interact with our configuration. It's explained much better in the paper, so I encourage you to read it as well.
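"Expose the knobs, but guard them" might look like the sketch below. This is my own illustration of Cook's advice, not code from any paper or product; the knob names are hypothetical. Operators can retune the system at runtime, while the system clamps each knob to a sane range and keeps an audit trail of who turned what.

```python
class Knob:
    def __init__(self, name, value, lo, hi):
        self.name, self.value, self.lo, self.hi = name, value, lo, hi

class ControlPanel:
    """Runtime-tunable settings with guard rails and an audit log."""
    def __init__(self, knobs):
        self._knobs = {k.name: k for k in knobs}
        self.audit_log = []
    def set(self, name, requested, operator):
        knob = self._knobs[name]
        clamped = max(knob.lo, min(knob.hi, requested))  # guard rail
        self.audit_log.append((operator, name, requested, clamped))
        knob.value = clamped
        return clamped
    def get(self, name):
        return self._knobs[name].value

panel = ControlPanel([
    Knob("max_inflight_requests", value=100, lo=1, hi=1000),
    Knob("retry_budget_pct", value=10, lo=0, hi=50),
])

panel.set("retry_budget_pct", 200, operator="ines")  # a wild shenanigan...
current = panel.get("retry_budget_pct")              # ...clamped to 50
```

The point is the posture, not the mechanism: assume operators will turn the dials, make the dials safe to turn, and keep enough history to learn from what they did.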
Okay: so we have design, and how we operate. This is the third model that informed my thinking. In San Francisco there's a person called Paul Borrill, and he's the best synthesizer of papers I know. This model comes out of a whole collection of papers, and there's a bunch of literature I've assembled for you if you want to go through Paul's work; I'm going to call it the Borrill model, after him. Paul tells us we can think about system complexity as a continuum, ranking things by probability of failure. It's explained in more depth in the literature, but since I don't have much time: he splits the space into three main areas. There are things we handle with traditional engineering; things that are reactive, based on operations; and an area called Unk Unk, which stands for "unknown unknowns." He says this last region goes on forever, and at some point everything that is unknown, if you could fold it into the first two categories, would have the same surface area as the first two. Cascading, catastrophic failure lives in the Unk Unk section, because if we had known those failures were going to happen, we would already have used the earlier strategies to catch them; anything to do with the unknowns is something we couldn't possibly predict, and this is where the tricky parts of a system live — it's also why building distributed systems is so hard. So Paul tells us there are different failure areas, and we need different strategies and different approaches to deal with each. I don't know how many of you are familiar with the work Kyle Kingsbury has done to test the correctness of databases (the Jepsen project), but in Paul's framing, what Kyle does deals with things that
attack the first two areas. Then we have another researcher, Peter Alvaro, who wrote a very important paper a couple of years ago called Molly, in which he reasons backward from a correct system execution to see where it could have gone wrong. Say a computation arrived at a good state: Molly chops up the computation and analyzes the execution path, checking whether the program still arrives at the same good outcome when things fail along the way. It's crazy that he does this, but he reasons from the outcome back up to find the things that could go wrong. So we have people attacking these two areas, and collectively we can build something more robust, more resilient. That's what is interesting about the Borrill model: we now have an awareness that, first, we can use design and everything traditional engineering gives us; then there are things we can do in operation to reduce the likelihood of a problem; and then there are things we can do nothing about, and at that point we're screwed. Okay, but there are things you can do. In classical engineering we have coding standards, programming patterns, and testing of the full system — and by the full system I mean that in a distributed system you have to test the client, the server code, and also the provisioning code, because most people don't test the provisioning code. If your system takes a configuration, the thing that configures it should be tested, as should what your system does when the configuration is wrong, because that is going to happen. With classical engineering we think about these systems as converging to some good state; that's what classical engineering helps us do. In reactive operations we have things like an inventory of things that could be
hazardous to your application — you need to know what they are — and redundancies. You can use feature flags to deploy code to production; you can do dark deploys, where the code is pushed out into the world but no traffic has been turned on; runbooks and documentation are very important; and you can have canaries. These are the things you can do within reactive operations to make an application much more resilient. And in Unk Unk you can do formal methods, fault injection, and things that give you some sort of system verification for what you couldn't have predicted. The goal across these three areas is the same — reduce the possibility of error — but you use different strategies in each. So this is what Paul tells us: when you think about building resilience, a single discipline is insufficient. Thinking about it from one single perspective is insufficient, and we need different strategies all at the same time, which kind of sucks, but it does follow our intuition. So we have to attack three different areas, and one of them can screw us at any time; we just don't know which one it's going to be, and that's not great. All right, I've bummed you out enough and told you there are different approaches, so let's see what happens with them in industry. Obviously, whenever we talk about how things are done at scale, you always end up with a handful of papers from a few companies, and Google — yes, I'm going to be a cliché here — but this paper, the Chubby paper, is really, really good. I really like it because they describe how they constructed their locking service; it's a seminal paper, and it's really fantastic. Here are the key insights I took from it. They set out to build something that managed distributed locking at company scale, and
they pondered two things: whether it should be a library for engineers to use, or a centralized service. They chose the centralized service, providing client libraries for people to use, so they would keep the control necessary to build in resiliency. They also limited the scope of the problem by offering only storage of small data files with restricted operations. And then they say something in the paper that is a little controversial, and it hits close to home, so I have a lot of feelings whenever somebody tells me this: they say that engineers don't plan for availability, consensus, primary elections, failures, their own bugs, operability, or the future — and that they don't understand distributed systems. And this is happening at Google: the monstrosity that is Google in terms of systems, applications, and shared knowledge, and we get this insight from them. So again, it's pretty depressing. But another thing they tell us: they have a centralized service to manage distributed locking, and it was hard to construct. These problems are inherently difficult; they're difficult to reason about and difficult to ship, and you can only dedicate effort to architecting them well if they are very well scoped. They also said that having a service everybody uses allowed them to pool resources and make sure that service was very, very resilient: they threw everything at the service to make it fault tolerant. That's pretty nice, right? If we're at a company of that size, some of these problems may be solved for us. Also, where the Cook model says people can do whatever the hell they want with our applications, Chubby says: no, you're going to do
these two things, and no more than these two things, and these are the primitives we give you. They say that by restricting user behavior they can increase resilience, because they can narrow down the API for this particular service. And yet, even having done that, they found corner cases that were completely unpredictable to them — so again the Unk Unk problem pops up. All right, now Netflix, because everybody has to talk about Netflix when we talk about resilience, right? Are there any Netflix employees here? Awesome, good. So, tons of things from Netflix. They started chaos engineering, and there are a lot of patterns of theirs that are very good. I'm not going to comment on them all, but there are a lot of links, and everything they say there still applies: it's still true, it's very good. The thing I find much more interesting these days is that, having started with Chaos Monkey and chaos engineering, they're now going beyond the things they originally told us to do, into something a little kooky and a bit more science-fictiony. They're saying: systems are complex, we can't fully reason about them, so we might as well develop an intuition. They have a tool that taps into all of their monitoring systems and shows, in real time, how traffic flows through their entire constellation of microservices, abstracted away. I don't have a demo to show you, but you can see how the Amazon regions fail over to each other, with little dots for packets — their requests — in different colors for healthy and unhealthy, and it was pretty amazing. Imagine if we could have something like that for all of our applications: we'd see them, get better at seeing them, and build intuition about the things we care
about: analyzing metrics, graphs, and all of the things we need in order for a system to work correctly. So that's pretty neat. Since we covered Netflix very quickly, I'm going to share a little of what we do at Fastly as well. We have a system called Powderhorn, and two people I work with, Tyler and Bruce, gave a talk in Barcelona a few years ago describing the story and evolution of how we created our instant purging system. First there was v1, then v2, and in v3 we got it right, after reading some literature: we got it right by using a gossip protocol called bimodal multicast. That talk is very nice because it describes the entire complexity, and our problem domain is interesting because we have to respond to everything that is awful on the internet. This particular system has to be very efficient and very fast: it has to go everywhere on the globe and invalidate your cache very, very quickly, and it does. In the talk you can learn more about how we do it with a gossip protocol. On the networking side — where we have ways to interact programmatically with the internet — we have a system called faild. You have the internet, in all its horribleness, and traffic comes into a POP, where a protocol routes it across different servers. The problem is that whenever one of the servers goes away, the protocol rearranges the routing table and things shift: if your content is on one of those servers, it gets moved, and you need to be able to handle that. The way we solve this is by hacking the protocol, using fake MAC addresses to route traffic around predictably. This is covered in a talk by one of my colleagues, João, if you're interested in how we do it. It's a trick we also use to say: all
right, the current infrastructure doesn't do this, so let's trick it into doing the thing we want, abuse these open configuration points, and actually get saner, more resilient behavior out of how we route traffic within our own POP. Now I'm going to talk a little about my team. For the last nine months — I think it's nine months; it's September, so yes, nine months — I've been trying to start something new, and there are different challenges. Making it resilient is very important to me, because now I'm responsible for a system I have to support, and other people are going to depend on it. We're in beta, which is great, and then we discover something, and now I'm here while everybody else is working on a problem I can't help with. So, great. Let me tell you what my system does. A user comes to our CDN and hits my service; we go to the origin, fetch the images, resize them fast, put them on our CDN, and return them to whatever device or client you have for that particular piece of content. So again, because I like pugs: say the picture of my dog lives over there and I want it resized to a particular size — our team has to figure out how to do that efficiently and very, very fast. One advantage we have is that this system is stateless: I don't have to keep a copy of the original, I only have to transform it quickly. That should make things easier, because when you don't have state to deal with you have a bit more room to maneuver, and your system can be more resilient. I don't have to keep anything: no coordination, no
database. I don't have to think about harvest and yield in the sense of how much data I give out, but I still have to be available, and data still has to come out. The most difficult thing for us was reasoning about the request cycle through all of our dependencies, and the dependencies are the hardest part because this application touches everything, like customer setup. If you messed up where your images live — say you screwed up your S3 bucket permissions — we can no longer access them, and that disrupts our system. If the caching layer is having a problem, we need to know it's having issues and stop sending traffic to it. The interaction between every point is complicated and complex, and this is what has cost us the most time: most of our design time has gone, up front, into figuring this out. Even the libraries we use can have problems. So at this point we're thinking about resiliency at every interface and every area our application covers, and it's hard. Another thing related to how we deal with resiliency is how you define your error types and what you decide to expose versus hide. Sometimes you could be hiding away information, and your system ends up less resilient than it should be simply because you're not exposing a problem you're having; thinking about those errors and handling them carefully is very important. So failure detection and system operability are ongoing concerns for us. We have many servers and many locations, and if one of our servers goes down, the interaction between the edge and us has to be synchronized: if we're in a state where we can't take traffic, we have to signal the edge and
whoever's talking to us, to not send us traffic.

Okay, this is getting boring again — in a sense, it's complicated and it's hard. Another thing that is very helpful when you're iterating this fast is to have intermediate versions for the things that you save. For example, we version the way we represent things internally, and the thing you want to know, if you're ever doing that, is that it's very good to have it from the get-go as part of your design — but you should also know that you're going to have to support mixed mode. That means that if I have version two on this server and version one on the other server, both of them need to be able to communicate. That gets missed very, very frequently; forgetting that you need to support mixed mode happens more often than not, and it's a huge threat to your availability and your resiliency. You want versioning for everything.

All right, so we covered a little bit of literature, and we covered a little bit of how companies think about it. Let's summarize — oh, the person that was asleep woke up, great. So let's summarize a little bit of this.

In our application, or anything else, redundancies are very important: you add resilience by just putting in more of it. We saw this before — you have redundancy of resources, of execution, even of messages. For example, our purging mechanism is gossiped, so more than one message goes out; that's redundancy of messages, and it helps you withstand circumstances you didn't plan for.

Something else that is very important: capacity planning is still an exercise worth doing. Right now I'm trying to do decent capacity planning down to the point where it's correlated to the power we have available in a rack — how many operations can I do for the amount of power that this server I installed in a POP consumes? That is also hard. And note that when you're making optimizations — in the Cook model, say you're trying to save some money — that particular set of optimizations may remove some of your redundancies and make your system less stable. So we want to have more of things, and we want to be able to test them and have them out there. Redundancies are key.

As we saw in the Cook model, operations are very important, and when you don't know how close you are to the error boundary, it means we're always guessing. If your operations are complex — if, for example, your deploy process takes 20 steps and they're done by hand — you're going to get it wrong, and it's going to cost you an outage, it's going to cost you a problem. The interesting thing is that how you operate a system is also something you design, so it's important. And yes, things are complex — we're building distributed systems; they're hard, they fail, they have all sorts of crazy shenanigans — but sometimes complexity is good. It can be very, very hard to make something simple to operate, and in that sense the complexity is good, because you've invested a lot of time and effort into making something simple to interact with. If it increases safety, then it's good. You should also know that resiliency will sometimes come at the cost of other goals: if you want to make a system resilient, it's not going to be cheap — you're going to have to have more of it — and it may just be time consuming.
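To make that power-correlated capacity exercise concrete, here's a back-of-the-envelope sketch. All the numbers — rack power budget, per-server draw, and operations per second — are entirely made up for illustration:

```python
# Back-of-the-envelope power-budget capacity planning, in the spirit of
# the rack-power exercise above. All numbers here are hypothetical.

RACK_POWER_BUDGET_W = 8000   # power available in the rack
SERVER_DRAW_W = 400          # draw of one server under load
OPS_PER_SERVER = 2500        # resize operations/sec one server sustains
REDUNDANCY_FACTOR = 0.75     # keep 25% headroom so losing a server (or an
                             # optimization gone wrong) doesn't push us
                             # toward the error boundary

servers_per_rack = RACK_POWER_BUDGET_W // SERVER_DRAW_W
raw_ops = servers_per_rack * OPS_PER_SERVER
plannable_ops = int(raw_ops * REDUNDANCY_FACTOR)

print(servers_per_rack, raw_ops, plannable_ops)
# 20 servers -> 50000 raw ops/sec, of which we plan for only 37500
```

The deliberate headroom is the point: it is exactly the redundancy that cost-saving optimizations tend to eat first.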
It takes time to make these things, so it will come at the cost of something else.

All right, so: we want to leverage engineering best practices. We saw a pattern emerging — the things we know we should be doing are actually helpful, and they're helpful on different planes. Testing and resiliency are correlated; you still have to do it, and I'm sorry, there's no magic bullet there. Versioning is important — providing an upgrade path is very, very good. The upgrades and the evolvability of a system are so tricky. Again, mixed mode is important; we recently forgot to account for it. We fixed it, but having been burned by this one, now I see it everywhere.

I also feel — and this is an interesting thing — that the way we prototype systems is fundamentally broken. We start creating these things and there are just too many areas of attack, and even if you push something out and follow these patterns, you still have to revisit every single aspect and constantly check how well you're doing. That takes effort and a lot of discipline, and we don't seem to have ways to construct applications better. That's my rant. I guess what I'm trying to say is: yes, it's laborious, it's worth doing, and there doesn't seem to be much mechanization for distributed systems — much that gets done for you — so that's a little sad.

So let's wrap it up, let's bring it home, because I'm probably running out of time — oh, cool. This is my TL;DR. We've seen patterns in academia that highlighted the importance of good design. We've seen patterns in academia, and in system safety, telling us that how we operate and how we interact with our applications is very, very important, and that what we expose, and what people are going to do with our systems, are things that we cannot expect. And we've seen, in our model of the world, that even if we're very good at all of those things, there are still going to be things that we miss — challenges we could never have predicted, where nothing past-us could have done would help us in the moment we're in.

This is where I think the future is going to help us the most: how we simulate and model systems before we even construct them. That's where we're going to see the biggest returns. There was a paper where Microsoft described a language where, as you made a change, it went to the cloud and ran formal methods on everything you have, to let you know if you'd violated an assumption. That's amazing. If we could do that as part of our development environment, that would be great — since we're talking about the future, that's the future I want to live in.

But all right, let's summarize the things we can do. In design: are we favoring harvest or are we favoring yield? We need to make this decision explicit. We can use orthogonality — different areas with different responsibilities — and composition; these are things that help you. Do we have enough redundancies in place? That happens in design. Are we resilient to our dependencies? That's much harder to do than it sounds. And theory matters, because sometimes the problem you're running into was studied, and a solution proposed, decades before you hit it, and it's nice to be able to borrow from that and start from a point that's well known to work.

Operability: are you providing enough control to your
operators or your users? A big question that will tell you whether your system is resilient or not is: would you like to be on call for it? Once you're on call, everything that is shitty in your system will wake you up at the craziest times of day, so that motivation is real — I think people should be on call for their own applications, especially if you're responsible for them.

You should rank your services in terms of what can be dropped — harvest and yield again. If you have things that are not important, and you can drop them and shed them when you're in a problematic situation, then your system can continue to make progress. Monitoring and alerting are very important; they should be in place, and they should be part of your MVP.

As for the unknown unknowns: their very existence stresses how important the first two areas are, and how, when you cut corners on the first two, you move closer to the error boundary — and on top of that you have this extra area, as big as the first two, that is coming for you and is going to ruin your life and your weekend. We run everything we can, and if everything else fails — we don't do this in San Francisco yet, we don't have enough interns — you can start doing human sacrifices; that works. Although maybe being on call is a version of a human sacrifice: you solve the unk-unk problem by putting the burden on a poor person, a poor human being.

More things you should do: tests and code reviews are good. SLAs — these stress claimed behavior — are good even if they're internal, because sometimes testing your own application from your own company is easier than having somebody else do it. Versioning is good (remember mixed mode), and checksums are very good. Then you have things like error handling and circuit breakers: again, if you can shut off traffic to certain areas, you can still continue to make progress. Backpressure is important, leases are important, timeouts are important — just make sure those things happen while you're in design, or at least leave yourself a note to go back and think about them.

Then operability and automation: anything that you failed to automate is going to be the thing you run into when you're drunk at 3 a.m. and you get paged, so automating is very important. Release stability is often tied to system stability: if it takes you forever to release your application or do a deploy, it's very common that you're going to get it wrong at some point. Playbooks are good — you should link your alerts to things that are actionable, so if something pops up, it has a link to how you monitor it, how you debug it, how you troubleshoot it. And how you configure your systems should be consolidated, especially if you have many of them. If in one system you do it with, for example, a data bag in Chef, and in another you have a config file, moving from one to the other should be much more straightforward — you should have common patterns for how you configure systems across your organization.

The other important thing to keep in mind is that your operators also determine how resilient your system is — the people using it, the people helping you run it, help determine how resilient your application is. I could have the most amazing system, but if it's a pain in the ass to monitor, nobody is going to go read through my logs and see what happens; it will be very hard.

All right, so this was me last year: we can't recover from lack of design; design is all of the things that matter; not minding either harvest or yield, or not being aware of everything, means that we're going to
sign up for a redesign the moment we finish coding.

I was very idealistic, and I feel I was very arrogant and stupid, because now that I'm building something new, this redesign thing has already happened in different areas — we got things wrong. So, okay: having a good design is hard. The unknowns are super hard to predict; even the dependencies are super hard to reason through. I feel my original apprehension that everything should happen in design was incorrect. I think redesigns are part of having a system that evolves. I think about it now in terms of: I touch this — is my assumption still valid? A few months down the road — is my assumption still valid? When I look at a problem now, I go through the checklist of everything I have: am I doing all of the things, and what did I miss? Maybe I shouldn't fight the redesign so much, and should think about it more in terms of adaptability. I think that's a better way — or at least a way I like living with more, a way that makes me happier when I think about resilience — because we know we're going to have to change; we know a lot of things will have to be reconstructed when our assumptions change, or when the world decides to throw us something that is completely bananas. Before, I thought everything should happen in design; now, this is an evolution of that. It's also why, when you read literature or things from big companies — who have decades of effort and labor behind figuring this out — we can start from there, but in our own applications we're still going to have to go through this kind of evolution, this kind of thinking and maturity.
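Since mixed mode came up a few times, here's a minimal sketch of what supporting it can look like: a decoder that accepts both the old and the new version of an internal record, so servers on different versions keep understanding each other during a rollout. The field names and version layout are entirely hypothetical:

```python
# Minimal mixed-mode decoding sketch: servers running v1 and v2 of an
# internal record format must still understand each other. All field
# names and the version layout here are hypothetical.

def decode_record(raw: dict) -> dict:
    """Normalize a record of either version into the v2 in-memory shape."""
    version = raw.get("version", 1)  # v1 records predate the version field
    if version == 1:
        # v1 stored a single "size" string like "640x480"
        width, height = map(int, raw["size"].split("x"))
        return {"version": 2, "width": width, "height": height}
    if version == 2:
        return {"version": 2, "width": raw["width"], "height": raw["height"]}
    raise ValueError(f"unknown record version: {version}")

old = decode_record({"size": "640x480"})
new = decode_record({"version": 2, "width": 640, "height": 480})
assert old == new  # both versions normalize to the same shape
```

The point is less the code than the discipline: the version check ships with the very first release, from the get-go, so that when v2 appears, v1 and v2 servers can coexist while the rollout is in flight.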
So that's all I got. I don't have time for questions, but all of my papers and all of my research are in a repo, so you can open an issue if you have a question, and I'll be around. I hope I didn't talk too fast — I feel like I did, so sorry. Thank you!