Thanks. My name is Iribanesh, I work as a tech lead at Kiwi.com, and I would like to tell you a few words about this topic. The problem is that over the years our systems are getting more and more complex. We develop more complicated systems, we have higher requirements — more features, more complicated features — and especially in the past few years we usually try to break big monolithic applications into smaller chunks. So apps have many more dependencies and are getting more complex in general. And the problem is that anything can break down: any of your dependencies, the network, a data center, whatever. Anybody can make a human mistake and the system can just crash. For example, in the past two months you may have noticed that one credit card scheme had a big outage in Europe and a lot of people couldn't pay with their credit cards. Two weeks later it happened to a different credit card company. And a few weeks after that there was a huge storm in Germany that caused an outage in one of the big data centers. These companies have really big budgets and do their best to stay available, but anything can happen and they can just break down.

So the goal of this talk is to present a few techniques that can help prevent your app from failing when one of its dependencies fails. We will see some examples in Python. The other goal is to have visibility into your app — to know what exactly is happening inside, not just to deploy an app somewhere and have no idea what is going on in there. For example, in my previous company we had been using a data center that was more expensive, but they promised 100% reliability. So we paid extra to have something reliable. But then one day, suddenly, all of our apps running there went down. So we called them and asked when they were going to fix it.
They asked: fix what? We explained — the servers. They asked which ones. We told them: all of them. They didn't believe us. So we advised them to go to their own website — not even that was running — and they realized they had a problem. And what had happened? They found out that some maintainer, a plumber actually, had broken a pipe in the building, and the whole data center was flooded. So they spent six days repairing the infrastructure and loading the backups, and they hadn't even known that anything had happened. They claimed to have 100% reliability and backups — but the backups were in the same building. So basically, you can't trust anybody, and you should know what is happening with your infrastructure.

So we'll go through some techniques you can apply in Python to make your app more stable, then through monitoring, and later some tips on what to aim for when deploying the infrastructure, testing, and so on.

When you start developing an app, you should think about what you are actually using in it: third-party APIs, databases, Redis instances, which servers you use, where they are, and so on. And you should realize the importance of each resource. Take a database: is it okay if it goes down for one minute? What happens if it goes down for one hour? Does it have an SLA — can you trust it? Do you have backups in different locations? How do you use these resources, and how often? When you start building, say, a Django app, it has only one database. But over the years more and more people will work on it, and suddenly you have dozens of dependencies, a lot of databases, third-party APIs, and it can easily turn into a mess. I don't think anybody originally designs it like this, but somehow it naturally happens that you end up with a big spaghetti infrastructure.
In one of my previous companies, again, there was a period where DevOps would, about once a week, find a server that nobody knew what was running on, or why we even had it. So: know what you are actually using.

The next thing is to think about how you use it. Let's say we have a service and a dependency, which can be some REST API. In healthy communication we send a request, receive a response immediately, on time, and everything is smooth. The problem comes when the dependency starts timing out. There is a delay before you get the response, and your requests start piling up and being handled in parallel. What are the consequences? Your users get slower responses. And your app starts eating resources: if you have a Python app that is not fully async, you may deplete your application workers, which sit busy-waiting and locking up other resources, and it may end up that new requests your app receives are just thrown away, because there is nothing left to handle them.

So what can you do? Look at every communication with other systems and decide how long it is worth waiting for the response. Say we have an eShop, and in the corner there is information about the weather in Edinburgh — nice to know, but not worth making the user wait a minute for the page just to see it. If the weather API starts timing out, we simply cut the request after, say, one or three seconds and just don't show the user that information. Most Python libraries have a `timeout` parameter where you set the number of seconds you are willing to wait for the response; after that, the request is cut. When you apply it, the communication can look like this.
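To make this concrete — this is my own sketch, not code from the talk. HTTP libraries such as `requests` let you pass the timeout directly, e.g. `requests.get(url, timeout=(3.05, 1))`, where the tuple separates the connect and read timeouts. For calls that don't expose such a parameter, a generic cut-off wrapper can be built with the standard library (the function name here is hypothetical):

```python
import concurrent.futures

def call_with_timeout(fn, timeout, fallback=None):
    """Run fn(), but give up after `timeout` seconds and return `fallback`.

    Note: the abandoned call may keep running in its worker thread; the
    point is only that *we* stop waiting and answer the user on time.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        pool.shutdown(wait=False)  # don't block on the straggler
```

So the page would call something like `call_with_timeout(fetch_weather, timeout=1, fallback=None)` and simply hide the weather widget when it gets the fallback.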
Even though the dependency's requests are piling up, you return your response on time — but you lose the information; you never see its response. A few things to be aware of. First, if you cut off a request that manipulates data — for example, one that stores something in a database — it may be executed anyway. So you may have stored something, or changed data, and you don't know whether you did or not. Second, if you have some Python component that talks to the resource through a library, make sure the timeout covers not just querying the result but also initializing the connection, because some libraries only let you set the timeout for the query itself. So even though you set up timeouts, you might still end up breaking your app because you didn't — or the library didn't let you — set them properly.

This is a really basic technique that, in my opinion, should be applied to any communication. A more advanced technique is applying circuit breakers. The philosophy behind it is: when a dependency starts failing, why keep talking to it? Take the eShop with the weather information: if several users hit errors loading the weather, you just stop displaying it entirely. If you know the dependency is failing, you don't even wait for any timeouts — you immediately return instant errors, so you don't pile up requests. It's also better for the third-party system's recovery: if the other system is failing because it's overloaded, and you stop calling it, you give it time to recover on its own. In Python there are libraries for this, usually implemented as a decorator. You just set up the decorator.
You specify how many errors in a row you allow for the resource, and an interval — how long you wait before trying to communicate with it again. You apply the decorator to a function, and if the function raises errors, say, five times in a row, you stop using it; in this example the decorator stops calling the function for 60 seconds.

How does it work inside? The breaker has three states. Closed means communication is okay, everything is smooth, and you call the dependency normally, as usual. Open means something is wrong and you stop calling the dependency at all. Half-open is the state when you have been open for a while and you let a few trial requests through to see whether the dependency is healthy again; based on that, you decide whether to wait more or close the breaker.

So the communication can look like this. The first request gets a 200 or some other success response, so the breaker stays closed. The second request is still okay, so you keep it closed. On the third request you get a timeout or some error response, but you see it's only the first error, so you keep the breaker closed. Then another request, another error — and since that's already the limit, you open the breaker. Then there is a window in which you stop calling the dependency entirely. After that you move to the half-open state and send a few trial requests. If they are okay, you close the breaker and continue; otherwise you add another waiting window. So once you start receiving errors or timeouts, you simply stop communicating with the dependency: you return instant errors, you don't deplete any resources, and you give the other system time to recover. And yes, there are a couple of Python implementations of this.
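As a hand-rolled illustration of those three states (in a real project you would rather reach for a library such as pybreaker, where the decorator setup looks something like `breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=60)` followed by `@breaker` on the function), here is a minimal sketch of mine, with an injectable clock to keep it testable:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: trips open after `fail_max`
    consecutive errors, goes half-open after `reset_timeout` seconds."""

    def __init__(self, fail_max=5, reset_timeout=60, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # let a trial request through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise RuntimeError("circuit open, failing fast")  # instant error
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_max:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # success: close the breaker and forget past failures
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open, callers get an instant error instead of waiting out a timeout, which is exactly the behavior described above.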
So far, we have been using timeouts and breakers just to cut off a dependency that isn't working. But when you have something really critical to perform — for example, handling a payment — and the other system, say the payment gateway, starts failing, you may really want that response, so you repeat the request until you get it. From this perspective, handling the request takes longer, because internally you are calling the service several times, but you do get your response. Again, implementations are usually decorators: you configure how many times you want to repeat the request — it should never be infinite — you can specify how long to wait between attempts, and usually you can add some jitter as well. Applied to a function: if the function raises an exception, it gets called again and again, the way you configured it.

But there are problems with this, too. First, if you call some API several times, you might end up changing data several times — you can store an object in the database multiple times if the request was actually executed internally. A bigger problem is when you start stacking retries, either inside your application on several logical levels or across dependencies — you may end up hammering some resource. Here we have one request to our service, but we make three requests to a dependency, and that dependency makes three requests to another one. In general, one request can turn into nine. This is really dangerous when the final dependency starts failing because, for example, it is overloaded — with this mechanism you can totally flatten it.

How do you prevent that? For example, your APIs or dependencies can return specific error codes that say: okay, I'm overloaded, don't repeat the request at all, it won't help.
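A sketch of such a retry decorator — my own illustration, not code from the talk; production code might use a library like tenacity instead — with a bounded attempt count, a wait between tries, and random jitter:

```python
import functools
import random
import time

def retry(attempts=3, wait=1.0, jitter=0.5, sleep=time.sleep):
    """Decorator sketch: call the function up to `attempts` times, sleeping
    `wait` seconds plus random jitter between tries. Bounded on purpose —
    never retry forever."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of budget, let the error bubble up
                    sleep(wait + random.uniform(0, jitter))
        return wrapper
    return decorator
```

Usage would be something like `@retry(attempts=3, wait=1.0)` on the hypothetical `charge_payment()` call — and note that the jitter is there precisely to avoid many clients retrying in lockstep against an already struggling dependency.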
Or: okay, this is a different kind of error, you can repeat the request and we'll see how it goes. You can also set up retry budgets. Say you allow ten repeated calls per minute to some dependency; if you use them up in the first ten seconds, you just stop repeating and give it more time to recover. Or you can set up an idempotency mechanism: for example, you add information to your request header that tells the dependency this is a duplicated request, and the app can handle it properly. This helps when you are manipulating data — the dependency sees you are sending the same request it has already handled, and simply responds that it was already resolved.

And one quick thing: if you don't have to do something synchronously — again, take the eShop: you receive an order and want to send a confirmation email — you can just put the task on a queue and have it handled asynchronously, and the system gets faster because it doesn't have to perform the action inline. On the other hand, it's important not to over-complicate the system. If you apply several of these mechanisms on several layers, some layer may swallow part of the traceback, or an exception may not bubble up for some reason — so you may get hidden errors, partial information, or the same error reported several times if you don't handle it carefully.

Okay, one more technique. In our company we have developed a system of diagnostics — or repairs, as we call it. It's for cases where you have communication you would ideally do synchronously, information you need, but where it's acceptable to get it asynchronously after a few minutes.
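Stepping back to the idempotency mechanism for a moment, here is a toy server-side sketch — purely my illustration, all names hypothetical. A real system would persist the keys (e.g. in a database with a TTL), and clients commonly send the key in a header such as `Idempotency-Key`:

```python
class IdempotentHandler:
    """Sketch of server-side idempotency: remember which request keys were
    already processed and return the stored result instead of redoing work."""

    def __init__(self):
        self._seen = {}  # idempotency key -> stored result

    def handle(self, idempotency_key, process):
        if idempotency_key in self._seen:
            # A retried duplicate: answer without executing again.
            return self._seen[idempotency_key]
        result = process()
        self._seen[idempotency_key] = result
        return result
```

The client generates the key once per logical operation (not per attempt), so however many times the retry decorator resends the request, the side effect — charging a card, creating an order — happens at most once.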
So we have a system that periodically checks for inconsistencies and, based on what it finds, automatically fixes them. For example, you call some dependency synchronously and it starts timing out — so you have a side job that polls the other system for the results of those actions and records the information properly on our side. It has also helped us a few times with buggy releases: we released a bug, but the system discovered the inconsistencies and automatically fixed them. So we saw the errors, we could handle them, and no damage was done because the system auto-healed.

Once we know what dependencies we have and how to handle them, it's also really important to monitor them. I once worked at a company where we had built an eShop, and somebody contacted us saying we were selling something really cheap. At first we were happy — great, we're beating the competition — but then we dug in and found we really were selling it too cheap. There was a service fetching currency rates from a bank; it had broken at some point, and the rates hadn't been updated for a few weeks. In one country there was a bad political situation and the currency rate had moved quite a bit, so we were selling underpriced in that market. It was a little painful at the time. So it's important to know what is actually happening in your system.

Say you have a system with a dozen dependencies, something crashes, and you have to go in and see what exactly went down. So monitor all of these resources: measure response times, the errors they return, connection counts, throughput of reads and writes. With SQL you can go deeper and debug slow queries in Postgres.
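The repair job described above could look roughly like this — purely my illustration, every name hypothetical: a periodic task that finds orders whose synchronous confirmation was lost (for example, the call timed out) and reconciles them against the source of truth:

```python
def repair_pending_payments(local_orders, fetch_remote_status, mark_paid):
    """Sketch of a periodic 'repair' job.

    local_orders: mapping of order id -> local status
    fetch_remote_status: polls the dependency for the real outcome
    mark_paid: fixes up our side of the record
    """
    fixed = []
    for order_id, status in local_orders.items():
        if status != "pending":
            continue  # nothing to reconcile
        remote = fetch_remote_status(order_id)  # poll the dependency async
        if remote == "paid":
            mark_paid(order_id)  # the payment went through; heal our record
            fixed.append(order_id)
    return fixed
```

Run from a scheduler (cron, Celery beat, or similar), a job like this turns "the call timed out, we don't know what happened" into an inconsistency that gets detected and healed a few minutes later.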
If you don't know the EXPLAIN command, you should definitely check it out — it gives you details about the execution plans of your SQL queries. You may find out that something is really inefficient, and under a bit of heavy load it can smash your database and deplete the CPU. You can also install an APM in your Python app, which for really little work gives you a really nice overview of what is happening: you get dashboards out of the box about request counts, what your APIs are doing, calls to the database, and so on. It can give you great insight into your application — you will usually be surprised by what is happening inside and by how you can debug it.

The next level of monitoring is to define a ping endpoint. Also a really simple thing. You use some third-party service to check whether your application is alive — and it's good that it's not part of your own infrastructure; this is something the guys from the first story could have used. When you build the ping endpoint, you should also define what it even means for your application to be healthy. The endpoint can query the database, Redis, or other dependencies, so it confirms your application is actually alive rather than just returning some dummy response.

The next level: say your application itself is okay, but you can go a little further and monitor the functionality — not just that your API calls are healthy, but that something that was supposed to happen actually happened. For example, when you process an order, you send a statistic somewhere: okay, somebody purchased a TV in Germany. It won't tell you the reason if you suddenly stop selling televisions, but it gives you an impulse to go and check for issues.
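Such a "real" health endpoint — one that exercises dependencies instead of returning a dummy 200 — might look roughly like this. This is my sketch; the check names are hypothetical:

```python
def health_check(checks):
    """Sketch of a /ping handler body.

    `checks` maps a dependency name to a callable that raises on failure,
    e.g. {"db": lambda: db.execute("SELECT 1")}. Returns an HTTP-ish
    status code plus per-dependency detail.
    """
    status = {}
    for name, check in checks.items():
        try:
            check()
            status[name] = "ok"
        except Exception as exc:
            status[name] = f"error: {exc}"
    healthy = all(v == "ok" for v in status.values())
    return (200 if healthy else 503), status
```

Wired into a web framework route, the external pinger then sees a 503 — and the detail tells you *which* dependency is the problem — rather than a reassuring dummy 200 from an app whose database is already gone.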
Let's say your application is 100% okay, but some firewall blocks the traffic, or a data center goes down and you are cut off from the traffic. For example, in our company we were selling flight tickets, and one evening this kind of monitoring told us we were selling maybe too many tickets. We investigated for hours, we couldn't find anything, and we kept selling more and more without seeing the reason. It turned out that one bank — I think it was in Indonesia — was charging customers only 1% of the price. We had correct pricing information, but the bank had shipped a buggy release and had started charging customers less. So everyone started buying, and we couldn't find the reason in time — but without this monitoring the damage could have been much bigger. The bank eventually charged the customers correctly, and we handled the refund process.

Also, once you have a set of these metrics — we currently have hundreds of them in the company — you can go further. We have a team of analysts building apps on top of them for detecting anomalies, which gives you an impulse, an insight, that something is going wrong.

And if anything in all of this fires, there should be something that reports the error to you. So when you have every dependency checked and you send information about it — when you have monitoring set up — the next step is to set up proper alerting. It's nice to know that somewhere there is information that Redis is timing out, but you should be alerted, so you can actually take action. So for your monitors, set up proper alerting, assign responsible people for the alerts, and pick the appropriate channel for each alert.
If it's something really important, set up a page or a phone call, and also escalation policies in case somebody is not able to respond to the alert. And do the really basic stuff: Rollbar or Sentry or some other error-reporting tool — you basically wrap your Python application, report the errors to the system, and you can be alerted based on that as well.

When you have monitoring and proper alerting, the next step is to check that you have proper logging. You may be alerted, but you should always be able to get the details of what is actually happening in your app. If you are handling communication with the resources we went through above, ask yourself: if communication with this API fails, will I have the information to get the details instantly, or not? Everything should be logged properly wherever it makes sense.

In one company we were building a CRM, and one colleague working on the core had a really central database table where he generated the primary key as the count of entries plus one — no sequence, no auto-increment. It worked fine until somebody removed a user from the database, which cascade-deleted a bunch of records, and there was a huge mess in the database. The first thing we wanted to do was restore a backup — but we couldn't, because our server provider had an outage and hadn't stored our backups for a week or two. So we spent several days parsing syslogs and reconstructing the database from them, which was a really painful thing to do. Our logging wasn't good enough to do it a better way, but at least we had something — otherwise we would have been really screwed. So yes, proper logging will make your debugging much easier.
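One cheap trick that makes debugging much easier is stamping every log record with a request ID, so you can join logs across layers. A small stdlib-only sketch (my own; the names are hypothetical) using `contextvars` and a logging filter:

```python
import contextvars
import logging

# Set once per incoming request (e.g. in middleware); read by every log line.
request_id = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request ID to each log record."""
    def filter(self, record):
        record.request_id = request_id.get()
        return True

def make_logger(stream):
    logger = logging.getLogger("app")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(request_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

With the same ID propagated in a request header to downstream services (and logged by the web server too), one grep over the ID reconstructs a single request's path through every layer.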
You may consider storing important info in a database to get better statistics over it and go into more detail, and also consider adding a request ID to your logs so you can group together logs from the different layers of logging.

The next step, once you have your Python app and infrastructure handled properly, is that you need to put it somewhere. That's usually a trade-off between price, laziness, and the features provided. You can go with your own servers, you can use some hosting or cloud services, and recently serverless solutions have been becoming more popular. For example, on AWS Lambda you can deploy a whole web app by writing a couple of lines and defining an API endpoint, and everything is handled for you — auto-scaling, servers, basically everything you need — which is pretty convenient for smaller services.

Also consider auto-scaling. What happens when the load on your application grows a bit and your CPU starts getting depleted? In cloud services you can usually set up an auto-scaling policy to boot up more servers or containers. Also consider what happens if one of the servers goes down: you can auto-restore it, or boot up replacement servers, for example in a different data center.

A few things that help the reliability of your app can also be done on the web server. For example, set up retries at the nginx level, so a failed request is automatically repeated and you don't have to touch anything in your app. Set up caching to help with load. You can put a web application firewall in front, which gives you more features that make your app a little more stable without you having to care about them at the application level.

Also, when developing, consider separate environments for development, staging, production, and testing. And you have more layers of testing.
In your continuous integration you can run smoke tests, integration tests, unit tests. You can send part of the traffic to an instance that is basically a release candidate, so if you are not 100% sure the new release is okay, you can test it partially on a slice of the traffic. You can run performance tests. And when it comes to monitoring — so far we have been talking about the application itself, but you need to monitor the infrastructure as well. What else can you do there? For example, set up monitors on top of nginx: you can be alerted when you receive a bunch of non-200 responses in a row. You have logs on the web server, so again you can propagate a request ID into your app to join all the logs together across the different levels of logging. And at runtime, bigger companies actually simulate outages: you can turn off a data center, or simulate that some dependency is down, and see how your app actually behaves.

If, despite all the prevention, your app does have an outage, it's good practice to write a postmortem after everything is resolved, to take a proper look at what actually happened — so you can learn from your failures. You will think deeply about what happened, and you can apply the fixes for the current outage to other parts of the system, because they can be vulnerable to the same issue. The key is to keep track and get to the real cause of the problems.

So, we went through some techniques for preventing failures that can be applied on different levels of your application. In Python code, mainly: timeouts, breakers, retries. We went through monitoring: you should know what you are actually using, monitor the servers and the application from the outside and from the different levels of your stack; you can use APMs and custom functionality monitoring, and play with it more deeply.
There are a lot of different ways to test that your application is actually stable. You should think about architecture — there is a lot you can do about reliability at the different levels of your stack. Also really important: proper logging. Again, you can apply it on several layers and join them together to have a really good overview of what is happening inside, not just randomly attach to your application when there is an outage. And on top of that, proper alerting: think through all the use cases for the alerts you set up, so that when you receive an alert you know in advance what to do with it and how to reach somebody who can handle it.

And I think it went a little faster than I expected, so if you have questions, please come to the microphone. By the way, one more important thing: our company is hosting a party tomorrow in a pub nearby, so if you want free drinks or to hang out with people, stop by the Kiwi.com booth and get the details. Just a second.

Hi, thanks for the talk. How do you know that your monitoring system works? You monitor it, I suppose?

Usually we have several different levels of monitoring — think of the pinging of your application from outside. All the layers of monitoring we have would have to go down at the same time, and the probability of that is really low, because they are a mix of our own self-hosted services and different third parties in different data centers. It's also about the consequences of your app not working. So: several different levels.

You mentioned serverless. Do you have any kind of metric, for example on AWS, of how many servers would be needed to run the same thing compared to serverless?

Sorry, I don't — I only recently started playing with it.
So I don't have any deep insight to help you with that. But if you want, we can talk after the talk, and I can ask some colleagues who have deep experience with it. — Cool, thanks.

Before doing Python, I worked 28 years in a bank, in IT. When you have a real incident — looking at the communication channels there — when you have a real incident, you really want to have a human in charge. So I would suggest that as long as you have not talked to a human, you are not sure the alert has been taken into account, especially when you're escalating. My experience is that if there's no phone call, you're not sure that somebody actually has it.

Yeah — it probably depends on how important the alert is and what response time you need. If it's something ASAP, definitely go with a phone call, anything. Sometimes it's just an alert on some secondary system, and it's okay if it's handled in 30 minutes, because it's not critical yet. But I agree: if it's critical, go with the phone, alert anybody you can, and do your best to reach the appropriate people.

Hi. I've got a question about mapping out dependencies, because documentation is all great, but do you have any automated tools that would let you build a graph of dependencies, with some calculations and so on?

Yeah, it's a little tricky. Currently I think some of our teams are working with CloudFormation and similar tools to pull all of this together in one place, and I think there is a solution being developed that could potentially be open-sourced, but it's definitely not at a stage where we could provide anything. And I don't know of anything else that could help with this — it's a little tricky, especially if you have services on several different hostings.

Thank you very much, everyone. Okay, thanks.