Okay, hello everybody, welcome to my talk on network retries. It's great that this room is called Testing and Automation, because this talk is useful for both testing and automation. A few words about me: I'm currently a software engineer at Red Hat and also a member of oVirt community infra. That's our project: we do continuous integration and release automation, and we are responsible for the infrastructure. Like probably a lot of people, we use primarily Python, and we have a lot of automation in Python. I also love DevOps, so I'm a strong advocate of the DevOps way of doing stuff. And if you don't know what oVirt is, it's a free virtualization management platform. There is a booth in building K, you can come by. About this talk: it follows a real story of mine. As I mentioned, we run a lot of automation and CI, sometimes several thousand job runs per day, and sometimes something fails for a lot of reasons. In our automation we use a lot of network services: we download packages from third-party repositories and from our own repositories, and we use some REST APIs along the way during build and testing. So there are a lot of network services, and when you run thousands of jobs, sometimes you get failures. Usually they are sporadic, so the next run is okay, and you don't really look into a single failure because you look for reproducible stuff. But it's still not good, because developers are frustrated, and we are frustrated, by these random failures. So I started this work on how we can retry automatically: if we can rerun it ourselves and it works, why not let the computer do that for you? And I needed a way to somehow test it, because if you just ask people about retry, they say: retry, that's easy, just try/except, some sleep and some counter, and that's it. And the funny thing is that if you test it on your laptop, it will probably work. You look at it and say, okay.
It does retry. So is that enough? That's the question I had, and I was looking for a way to really test how effective it is. Not just that it works, but does it really solve the problem we have, does it really retry effectively? That's how I came to Linux traffic control. There was actually a talk right before this one where they use the same Linux kernel traffic management to simulate poor networks. All of this should be reproducible; that was my primary goal. You can find all the code I used on my GitHub page. There is also a longer version of the presentation with more text, because otherwise it would be hard to present, so for this talk I cut everything possible. Let's start with some reasoning about why we get failures. When you came here by plane, you probably didn't run into overbooking, but sometimes when you have a ticket for a plane you come to boarding and they say: sorry, we don't have any space left on the plane. You say: how can that be, I have a ticket? That's a mathematical concept called statistical multiplexing. They essentially sell more seats than there are on the plane, in the hope that some people will miss the flight, maybe two or so, so they can sell more. It's a statistical thing: they try to predict how many people are typically late for this plane. But sometimes it fails and you are just denied boarding because there is no space, and that's perfectly valid. This statistical multiplexing is used in networks too. With circuit switching, for example, it's like a telephone, right?
You call somebody, you have a line, and it's all yours. But that's not the case in TCP/IP networks. You use packet switching, and you really share the bandwidth with somebody else. If for some reason he wants all the bandwidth, then you are out of luck and you may get failures just due to this. It's not really a problem, it's just built in; it's a feature. The thing is that packet-switched networking systems are just made to fail sometimes, when demand is too strong, and these are not real failures. You can't blame the network team for that if it doesn't happen often; it's just a feature. And the thing is purely mathematical: if something has a very rare chance of failing and you try it a few times, you won't see it, but if you try it a lot more times, you will. It's just normal. I must warn you that you see the spherical cow here: since we do simulations, we have to use spherical cows. It's from a joke about physicists studying a cow by assuming the cow is a sphere. So we make some assumptions in this talk; here, for example, I assume independence, but it's not that important. If you have automation, you probably have a lot of dependent parts in a chain, each of which depends on the previous job completing successfully, and even inside a job you may call one service and then another, and you really need all of those calls to be lucky. As you see here, you have a chain of dependent events and a probability f of each of them failing. Say you have 100 steps and the probability f is small, here it's 0.001, a very small probability. But the more steps and the more runs you have, the lower the chance that everything succeeds. And this is just one job of yours, and you have hundreds of jobs. You see here that if you do 10,000 runs of this job that consists of 100 parts, it's very probable that it will fail at least once.
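The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration, assuming independent steps that all fail with the same probability f; the function name and the example value f = 0.001 are assumptions made for this sketch rather than values confirmed by the slides.

```python
def prob_at_least_one_failure(f, steps, runs=1):
    """Probability that at least one of `runs` jobs fails.

    Each job is a chain of `steps` independent steps, and every step
    fails with probability `f` (the independence assumption).
    """
    job_succeeds = (1.0 - f) ** steps        # every step has to be lucky
    all_jobs_succeed = job_succeeds ** runs  # and every run has to be lucky
    return 1.0 - all_jobs_succeed


if __name__ == "__main__":
    f = 0.001  # illustrative per-step failure probability (assumption)
    print(f"one job of 100 steps fails: {prob_at_least_one_failure(f, 100):.3f}")
    print("at least one of 10,000 runs fails: "
          f"{prob_at_least_one_failure(f, 100, 10_000):.6f}")
```

Even a per-step failure rate that looks negligible adds up quickly once it is raised to the power of steps times runs, which is the whole point of the slide.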
So basically, given this overbooking, you will certainly see some failures if you have a lot of runs. It's just a fact, and that's what we consistently see in our systems: not all of them, but a few jobs a day do fail, and this is actually not a bug. It works as intended. In order to try it on my laptop I set up this test environment. It was running on this old laptop, and it uses containers. Basically, I created a container with nginx, which is just an HTTP server, put a file there, and tried downloading it with my Python code. Then I used this Linux kernel feature to simulate poor networks, tried different solutions to see how badly they fail, and tried to make something reliable out of it. And it's good that you can really test it on a laptop; you don't need infrastructure for that. This is basically how I did it: I created a JSON file, just a random seven-kilobyte file, and then created an nginx container from it. Our JSON file is then available at this URL on the local box, and we can try our Python code and just download it. The tricky challenge here is that sometimes you configure this network simulator and the network is so bad that basically nothing works, and you somehow need to know whether the network was okay and it was only your code that was not working. There is a protocol called UDP, if somebody knows it: it basically doesn't have any checks for whether a packet arrived or not, it just throws packets continuously. So I needed a check that the network is working.
For that I used UDP with netcat. It's basically like cat: it just throws random data over the network, and if you see UDP packets arriving, then you know that the network is working. If your request didn't finish then, it's kind of your code's fault, not the network's. But if UDP is not coming through, I think the network is so bad that probably nothing can be done. If somebody knows what to do, I will be happy to listen. I also collected network traffic on this simulated setup. I used dumpcap, the Wireshark utility; it just dumps packets. Since I needed to analyze it somehow, I exported it into CSV files, and once you have CSV files you can use some graphing software. I used Octave, just because I'm used to it. The thing is that here we basically record our TCP traffic, because HTTP and HTTPS run over TCP, and we also record our UDP traffic. Then we have the CSV files and can load them into some system. Okay, as a lot of people know, requests is a nice library, it's very easy to work with, and we use Python requests a lot in our infrastructure, our automation and our tests. I started with this very basic, naive code. It's just a GET request using requests. It doesn't have any retries or anything; it just gets this URL, throws an exception if one happens, and we get JSON from it, so it converts the byte stream to JSON. I ran it on my simulations using this sampler.py, which is available on GitHub. Basically, you specify the number of threads in the thread pool it uses, like ten threads, and how many times to repeat, and it calls your function in this thread pool the number of times you specify. This way I was simulating a lot of requests coming in. It also outputs CSV. It's nothing fancy, just a flag, zero if it was okay and one if there was some error, and the time in seconds that the request took, just to understand how effective it was.
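The naive download and the sampling harness described above could look roughly like the following. This is a sketch under assumptions, not the actual sampler.py from the talk's repository: the URL, the function names and the default worker and repeat counts are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/data.json"  # hypothetical address of the nginx container


def fetch_json(url=URL):
    """Naive download: no timeout and no retries."""
    response = requests.get(url)
    response.raise_for_status()  # turn HTTP error statuses into exceptions
    return response.json()       # decode the body from bytes to JSON


def timed_call(func):
    """Run func once; return (flag, seconds) with flag 0 = ok, 1 = error."""
    start = time.monotonic()
    try:
        func()
        flag = 0
    except Exception:
        flag = 1
    return flag, time.monotonic() - start


def run_sampler(func, workers=10, repeats=100):
    """Call func `repeats` times from a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(timed_call, func) for _ in range(repeats)]
        return [future.result() for future in futures]


if __name__ == "__main__":
    for flag, seconds in run_sampler(fetch_json):
        print(f"{flag},{seconds:.3f}")  # one CSV row per request
```

Keeping the harness independent of the function under test means the same loop can later measure the versions with timeouts and retries without any changes.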
You can check the script, it's not that important. Then I ran it, and from these CSV files I managed to generate these nice graphs. Maybe they are not readable, but you don't need the numbers. In gray we have our UDP traffic. I scaled it down, because UDP is always very effective at just throwing packets, so if you don't scale it, it fills the whole screen. And these are the TCP requests that we made. It was pretty fast. This is a histogram of the number of requests taking different times. You see that more than 20 requests finished in about 0.02 seconds. And this is a pie chart: everything was working, so 100 percent of the requests completed successfully. I believe this is how a lot of network software gets tested, right? Everything is working, and it's very fast. Now we need to somehow make it a real network, one that does not work well all the time. Just to give you a heads-up on how that works: when you use Python, this is the TCP/IP layer model. It's not that important, but of the Python libraries we use, requests in fact uses urllib3 inside, and urllib3 calls sockets, and sockets sit somewhere at the TCP level. For us, everything here is just the data layer. If you don't know the network model, it's basically just encapsulation. The important thing is that for our Python libraries, everything down there looks like either some data is coming, or no data is coming, or we get some kind of exception. There really are some mechanisms down here to make it somewhat reliable, but for us it's all the same. Another important thing is that I didn't hack any libraries; I just tried to use what is available right now. That's an important assumption for this talk. The Linux network emulator is a part of the Linux kernel, where it sits below the IP stack. There is a concept called queueing discipline.
It's what your ISP uses to limit the amount of bandwidth that you have, for example. Among these disciplines there is one called NetEm. It's a network emulator, and it's used exactly for simulating poor networks. It sits here, between the IP stack and the network device. The important thing is that it's applied only to outgoing packets, so if you want both directions, you need to set it up on the source and on the destination too. But in this talk we are not really working on IP stacks, so it's fine. It has a lot of capabilities. It can simulate network delays, it can simulate packet loss, it can corrupt packets, create duplicate packets, and reorder packets so that they don't arrive in the proper order. It can also rate limit, so you can simulate something like a GPRS network just by limiting bandwidth. The network emulator does all that, but for our purposes loss is enough, because, as I said, for Python it's either data coming, no data coming, or just an exception. So we are free to choose, and we choose just to drop some packets. This is probably the hardest part, but it's not that hard. The problem is that nobody really knows how to simulate real networks; they are just too complex to simulate. So we really need a spherical cow, some models that are good enough for us to simulate those networks. We don't pretend that it's very real, but if it's useful enough for us, then why not? The network emulator supports different models that you configure explicitly. What I chose is a simple model called the Gilbert-Elliott model. It's just a state machine. It has a good state and a bad state; you specify the probabilities of transitioning from one state to the other, and you also specify the success and failure probabilities in each state. When you configure the network emulator, it just follows this model with your probabilities.
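To make the two-state machine concrete, here is a small simulation of it. It is a sketch of the model as described, not of the kernel's implementation; the function name, the parameter names and the example probabilities are assumptions chosen for illustration.

```python
import random


def simulate_gilbert_elliott(p, r, loss_good=0.0, loss_bad=1.0,
                             packets=100_000, seed=0):
    """Return the fraction of packets lost under a two-state loss model.

    p: probability of moving from the good state to the bad state
    r: probability of moving from the bad state back to the good state
    loss_good / loss_bad: per-packet loss probability in each state
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    bad = False                # start in the good state
    lost = 0
    for _ in range(packets):
        # State transition happens once per packet.
        if bad:
            if rng.random() < r:
                bad = False
        elif rng.random() < p:
            bad = True
        # The packet is dropped with the loss probability of the current state.
        if rng.random() < (loss_bad if bad else loss_good):
            lost += 1
    return lost / packets


if __name__ == "__main__":
    # Illustrative values only; in the long run the chain spends
    # p / (p + r) of its time in the bad state.
    print(simulate_gilbert_elliott(p=0.1, r=0.3))
```

With the default loss probabilities (nothing lost in the good state, everything lost in the bad state) the loss rate converges to p / (p + r), and losses come in bursts rather than being spread evenly.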
Also, if you assume, for example, that in the good state everything works and in the bad state nothing works, then you've made a simplification and you need just two numbers: the probability of going from good to bad and the probability of going from bad to good. That's simple enough. It's called simple Gilbert; it's a simplification of this model, and it's enough for our purposes. This is why you need to understand Markov state machines a bit: when you use the Linux network emulator, you have to specify that you use the Gilbert-Elliott model. There is no magical switch saying "make me a poor network", because there is no single poor network; they're all different. So here you specify that you use the Gilbert-Elliott model, and then the probability of going to the bad state and the probability of going to the good state. You just have to specify the model parameters; there is no way around it. It was actually very tricky to find good numbers; you just play with it. One important thing: in research papers probabilities are usually given as decimals, and here they are actually in percent. At first I specified decimals. It was not clearly documented, and it wasn't behaving as expected, and that's because I had to specify percent. There is also a show command that gives you back your parameters. We specify just p and r, and since we assumed that in the good state it's 100% success and in the bad state zero, it substituted those values itself. If you have already added this qdisc and want to change it, you need to use the change command. If you download my slides, there are references to the documentation for everything you see here, so you can look it up. But basically, we specify that with 50% probability we go from the good state to bad, and with 20% probability back, so it stays in the bad state a bit longer.
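For reference, the traffic-control invocation described here can be assembled as below. This sketch only builds the command line and does not run it, since applying it needs root privileges; the helper names and the interface name `eth0` are assumptions, while the `loss gemodel` form with percent values follows the tc-netem manual.

```python
def netem_gemodel_command(device, p_percent, r_percent, action="add"):
    """Build a `tc` command that applies simple Gilbert loss to a device.

    p_percent: probability (in percent) of going from good to bad
    r_percent: probability (in percent) of going from bad to good
    action:    "add" for a new qdisc, "change" to update an existing one
    """
    if action not in ("add", "change"):
        raise ValueError(f"unsupported action: {action}")
    return ["tc", "qdisc", action, "dev", device, "root", "netem",
            "loss", "gemodel", f"{p_percent}%", f"{r_percent}%"]


def stationary_bad_fraction(p_percent, r_percent):
    """Long-run share of time the simple Gilbert model spends in the bad state."""
    return p_percent / (p_percent + r_percent)


if __name__ == "__main__":
    # "eth0" is a placeholder; use the interface your traffic leaves through.
    print(" ".join(netem_gemodel_command("eth0", 50, 20)))
    print(f"share of time in bad state: {stationary_bad_fraction(50, 20):.2f}")
```

The resulting list can be handed to `subprocess.run` on a machine where you are allowed to change queueing disciplines. Setting both values to 100% gives the "no network at all" switch used later for debugging.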
So I ran it, and it was actually running overnight. All the requests are here; this lower graph is just one part of it, the first 60 seconds. We had 45 requests that were okay, and then there were no requests at all. It was just stuck all night, and I terminated it when I woke up. As you can see, UDP traffic was coming through, so that's why I think the network was working. The funny thing is that we just forgot the timeout. By default there is no timeout in requests. As you saw, without the poor network it worked, but now it was running all night, just because we forgot this timeout. This is from the documentation, and it says there is no explicit timeout by default. We could just guess some timeout, but I checked: at the 95th percentile, almost all requests finished within 60 seconds. So why not take 60 seconds as the timeout? Here we go. This is how I should have written it in the first place, had I read the documentation: I had to specify the timeout here. If we do that, then at least it finishes; I think it finished in a couple of minutes. And now we do have failures: 26 requests failed, and all the others were okay. This is the histogram. You see that most of the requests, about 40, took less than 20 seconds, and this is our timeout at 60 seconds: those 20-something requests were aborted by the timeout at 60 seconds. Good, we see the timeout is working, because 60 is exactly what I expected. Now, does it make sense to retry at all? Does it really do any better for us? It's pretty trivial, right? Here, some requests, 20-something of them, were lucky and finished. If we just retry, we try our luck again with the same probability of being successful, so the more we try, the more likely we are to get lucky next time. That's why it makes sense to retry. Just a note that in TCP there is a retransmission mechanism, right?
It retransmits packets, but it doesn't work when the TCP connection is not established at all, because there is no TCP connection yet. So if there is a DNS failure, or you just can't connect, you are not using it. There are also some HTTP-specific failures that you can handle from code, but obviously TCP won't retry for them. Another thing we need to consider is whether it's safe. If we just retry everything, can we mess something up? It turns out that in HTTP there are requests that are safe because they don't change anything, like GET: we just get a page, so nothing changes. There are also idempotent requests. That's a bit harder, but you can check it here: they change something, but only once, and if you call them several times, nothing changes anymore. Say a was one; then we call "set a to two" several times, and it's two, and two again. And there are the other cases where nothing happened: for example, my request didn't reach the remote server, so nothing happened. The library that requests uses, urllib3, has built-in support for HTTP retries. It knows all this, and it's available for you inside. It's pretty much only requests and urllib3 that support retrying in Python; every other library does not. But as you'll find out, you can use urllib3's Retry in your own code even if you are not necessarily using urllib3. Here is how you can retry with the requests library: its HTTPAdapter accepts a parameter called max_retries. We set it to 3, and it will retry three times. That's probably good enough, right? So let's try. Fourteen requests still failed with just 3 retries implemented. That looked strange to me, because if you check this histogram: we enabled 3 retries and the timeout is 60 seconds, so we would expect that at least those 14 requests were retried. So they should take three times 60 seconds.
That's 180 seconds, but we don't really see anything there on this histogram. You see, they all failed somewhere here, but three attempts would get us over here. So something didn't really work, right? I enabled protocol debugging just to check what it does. This is how you can do it: it just enables the Python logger for all the libraries in your stack. I also disabled the network. With traffic control you can just specify 100% for both probabilities and you will have no network, so it's kind of a configurable network switch. I did that so I would get instant failures, and that's what I got. I really didn't see any retries there. It just doesn't retry, which was funny. One important thing I noticed: it says it converted the retries value 3 to some Retry object. So something was going on. I checked further, and it turns out that urllib3 has a nice concept of abstracting the retry functionality into classes. They have this Retry class, and apparently it has more configuration than the conversion sets: when it converts, it just sets the total value, but there are separate values for connect errors, which happen when you connect, and for read errors, which happen when you download. What happened is that it just didn't set them, so it wasn't really retrying connection-related errors, for example, and that's why it didn't work. As I said, urllib3's retry works for HTTP-compliant services. If somebody violates the HTTP standard, for example by changing something on GET requests, you are out of luck and can't use it. But if it's a compliant HTTP service, you can use it and it will be safe. Because there is non-compliant stuff out there, it is careful by default. On connect errors the request just didn't get to the remote server, so nothing happened. On read errors it retries only the safe methods by default, because otherwise something may break. But if it's your implementation, you know it. It also retries in cases like too many requests or the service not being available.
So, the cases where the request didn't reach the remote server. This is how you can use it: you create this Retry object from urllib3, set your retries for connect and read errors, and then pass it instead of your number. Then the object is used instead of the converted value. Let's see if it works. Actually, when I tried it, it didn't work, but now you see that it does the retries: there was one retry, then another; I skipped the rest to save space. I'm not kidding, this really happened to me: I just forgot to re-enable the network. But that's the tricky part about simulation, because with this model, and with others, it's really hard to simulate a network that is, for example, down for five minutes and then up again. In real life this can happen, and so it's a good time to talk about backoff. urllib3 supports that too. What is backoff? Before trying again, we just wait for some time. It makes sense, right? It's like a sleep. For example, imagine that the white area here is no network, and here we have some network. If you have no sleep, all our retries obviously happen here, and there is no value in that, right? We could have a constant, like sleeping for one minute each time. That's good, because eventually it will catch the network, but it's not that effective, because if the network starts working here, we will wait a long time. The best thing to use is exponential backoff: each next wait is exponentially longer and longer. urllib3 supports that; this is from its code, how exponential backoff is computed. That's inside the library, but for you it's just one value: you can specify a backoff factor. Basically, it generates zero, one, two and so on, and multiplies it by your factor. That's how you get a longer and longer delay. So, okay.
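Put together, the configuration described above might look like this. It is a minimal sketch following the approach in the talk: the function names and the default values are assumptions, and the exact delays produced by `backoff_factor` depend on the urllib3 version, so they are not spelled out here.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(retries=3, backoff_factor=1.0):
    """Return a requests session whose adapters retry connect and read errors."""
    retry = Retry(
        total=retries,                  # overall cap on retries
        connect=retries,                # errors while connecting
        read=retries,                   # errors while reading the response
        backoff_factor=backoff_factor,  # grows the sleep between attempts
    )
    adapter = HTTPAdapter(max_retries=retry)  # a Retry object instead of an int
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def fetch_json(session, url, timeout=60):
    """Download and decode JSON with an explicit timeout (60 s as in the talk)."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Passing the `Retry` object instead of a plain number keeps the connect and read limits under your control rather than leaving them to the conversion from an integer.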
So we just specified one more parameter of this Retry object, and then we try one more time. The funny thing is that it still didn't work. It's absolutely the same as before: 14 requests failed then, and now we specify all these parameters and still get the same. And, same as before, nothing was really tried three times with the maximum timeout. That's why we need to test, right? I was pretty surprised too. I checked the exceptions and found that there are two kinds. One is somehow handled with just a warning: it says, okay, I'm retrying, and that's fine. But the second is just a connection error. It should retry on connection errors, but it doesn't. How can that happen? From here it appears that they have a generator: they essentially stream the data out of urllib3 through generators, and if a failure happens in that part of the code, it's not handled by the urllib3 library. It's outside that library, and that's why it fails no matter what you set here. If you are not hacking libraries, there is actually no way to override it. But retry is easy, right? try/except. So that's what we did: we wrote a try/except. The important part is that we actually use this Retry object from urllib3 to do the retrying. That's really better, because they took care of all this HTTP stuff. You saw the complex formulas and the idempotency stuff; if you use the object, you get all of it for free. What it does inside the library: when we make a request, it checks that everything is okay, and if not, they throw a MaxRetryError exception. Then there are two methods. One is increment, and it gives you a new Retry object; it's actually immutable, so you just use the new one here. Then there is sleep, which is where the backoff happens; that's where it sleeps. And this is pretty much how libraries like requests or urllib3 use this Retry object.
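A hand-written loop built on the same object could be sketched as follows. The wrapper name and its parameters are hypothetical; `Retry.increment()`, `Retry.sleep()` and `MaxRetryError` are the urllib3 pieces mentioned above.

```python
from urllib3.exceptions import MaxRetryError
from urllib3.util.retry import Retry


def call_with_retry(func, retry, method="GET", url=None, retry_on=(Exception,)):
    """Call func() until it succeeds or the Retry object is exhausted.

    retry:    a urllib3 Retry instance that holds the limits and backoff
    retry_on: exception types that should trigger another attempt
    """
    while True:
        try:
            return func()
        except retry_on as error:
            # increment() returns a new Retry object with one attempt used up,
            # or raises MaxRetryError once the limits are exceeded.
            retry = retry.increment(method=method, url=url, error=error)
            retry.sleep()  # this is where the backoff delay happens


if __name__ == "__main__":
    def always_down():
        raise OSError("simulated network failure")

    try:
        call_with_retry(always_down, Retry(total=3, backoff_factor=0))
    except MaxRetryError as exc:
        print("gave up:", exc.reason)
```

In real use `retry_on` would be narrowed to network-related exceptions, for example `requests.exceptions.ConnectionError`, so that programming errors are not retried.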
So we can do the same using this Retry object. If you do it like this, we have one small problem: if a failure happens inside the library, here, then the library retries, and if it happens here, then we retry. So at most we can now retry max_retries multiplied by max_retries times, because some retries are handled by the library and some are not. It's not in requests yet, but I found that newer urllib3 actually returns the retries object along with the response, so you can take it from there and use it in your loop and your try block. Then you won't have this problem. Unfortunately, with requests this is still not possible, but when they upgrade to the newer embedded library, it will be better. Another feature of urllib3 is that you can specify retries right in the request, so you don't need all those HTTP adapters. It's not in requests yet, but I think it will probably be possible to patch it. Okay, so now, if we do all this, we have a pretty good picture, right? Just one request failed. Almost all of the 100 requests are here; a lot of requests finished in maybe 15 seconds, because we weren't waiting for the full timeout. Maybe I'm an idealist, but I wanted all requests to be successful. Probably this is good enough, but if our target is to make retries efficient, then we probably want to check what happened with that one request. So again I went to my log, and this is how it failed. It's a pretty rare case, but I see it on our production systems sometimes. For example, we were getting half of an RPM file, and we were failing only when we tried to install that RPM file, because nothing before that detects that we failed to download it. A lot of people ask here: in TCP we have these retransmissions? But as you see, that doesn't really help, so you might get a broken file delivered. Let's see what happens here.
We have an error: we try to call len() on self.content inside this library, and it's just None, and there is no len() for None. Somehow we just didn't detect that we got no data back over HTTP. I was actually checking this while flying here. In HTTP you have Content-Length, right? You specify how long your content is. But in urllib3 and requests this check is disabled by default. It's just not checked, maybe because there are some wrong implementations that report an incorrect size, so it's probably the safe choice to disable it. But if we disable checking of the length, we may get something like this: the connection was just closed because the network was poor, and we thought that we had got all the data. If you're not checking the length, we fail only when we try to use this data. That's actually what happens there. There are also cases where the length is just not specified, like streams, and we really have such systems. They just give us RPMs but don't specify their length: when it's done, it's done. So even if we check the length, we can't check it in those cases, and our automation will fail because we don't know whether it's a correct RPM. What I came up with is a bit ugly from a design point of view, but since we have the try/except block, we can basically check everything we need inside our code. If needed, we can even verify the RPM itself: was it delivered correctly or not?
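A length check of this kind, done in your own code, could be sketched as below. Everything here is hypothetical (the exception class, the function name, the plain-dict headers), and it assumes you compare against the raw bytes as they arrived, because Content-Length describes the body before any gzip or similar content decoding.

```python
class IncompleteBodyError(Exception):
    """Raised when the number of bytes received differs from Content-Length."""


def check_declared_length(headers, raw_body):
    """Compare the received body with the declared Content-Length.

    Returns True if the lengths match and False if the server declared no
    length (nothing to verify against); raises IncompleteBodyError on a
    mismatch so that a surrounding retry loop can try the download again.
    """
    declared = headers.get("Content-Length")
    if declared is None:
        return False  # e.g. a streamed response: needs some other check
    expected = int(declared)
    if len(raw_body) != expected:
        raise IncompleteBodyError(
            f"expected {expected} bytes, got {len(raw_body)}")
    return True
```

When the function returns False, the payload still needs a content-level check, such as parsing it or verifying the package, before it can be trusted.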
And this is what I did here, because I didn't want to change the libraries. For this particular case, if you enable verification of the length, it will work, provided of course that the server specifies the length, which is not always the case. If it doesn't, we can just try to parse the JSON here, and if that fails, it fails with this TypeError. This is demo only: you probably shouldn't handle TypeError in your code, because it may be a genuine programming mistake. But if you wrap it, so that it throws some invalid-JSON error, and handle that here, it will basically be retried along with the others. And if you use the Retry object, you still have the general implementation, but with your verification plugged in. That's what I did for this case, and after that everything was okay. I actually managed to get all 100 requests passing with all this. And believe me, it was a really poor network; I wouldn't want my network to be like that. You see there are a lot of gaps, it's just fluctuating. But with these magic retries you can be pretty successful even on networks that poor. The good thing is that the tools are built into Linux; all modern Linux distros have them, so you can test it on your laptop and don't need complex infrastructure for that. So, to draw some conclusions: I would say that we can emulate networks well enough for testing network applications. It's possible. It's not simulating real networks, but there are enough parameters that you can achieve your result. You clearly saw that it was very successful and very fast in the beginning, on the localhost network, so obviously testing a network application on the localhost network is not enough. I would also argue that even the local network may not be enough sometimes: if your application works with the internet, that's certainly poorer than your local network.
So even if you use the local network, you may still need to do something to test it better. Another thing: implementing retry is actually not that easy, because, as you saw, there is some knowledge you need to have, and you probably need to read the HTTP RFC. Or you just use the existing solution, right? You saw that this urllib3 implementation is general enough. You can just use this Retry object in your code; it's not really bound to urllib3 at all. You just pass it the request type, and it tells you whether to retry or not. Another thing: you saw that in HTTP they know, for example, that this method can be retried and that method cannot. I know some people here develop network protocols or network systems, so think about it if you do: can you define some of your requests in such a way that your library can retry them without the user even knowing? If you check the Amazon SDK, for example, they have built-in retries inside, because they know how their system works, and they just give that to users for free. That's what you should do for your systems if you develop them. And still provide some ability to customize it. You saw how we used this urllib3 functionality to retry on completely invalid JSON: we can check the JSON and then retry on it. That's flexibility that some of your users may need. And as you see, it's all possible. I did it on this laptop, so you can do it too. If you have any questions, we have time for them now. All of this is available on GitHub, as I said: there is the sample code that I was running, some Octave files to generate the graphs, and the full version of the presentation with more text. You can use it for your setup, and as I said, we use Docker containers, so it's really very easy to try on your laptop. I encourage you to try it, because I wanted it to be reproducible for you, so you can benefit from it. There is my mail and my Twitter, so feel free to ask there too. And so we now have time for questions.
I need to repeat the question: how fast does it change states? Actually, I don't know. As I understand it, it's limited by your computer's clock. It has some clocking inside, and basically on each tick it checks the probabilities and then transitions or not. But I would say it's very fast, even on this laptop: the network goes between working and not working multiple times in something like a microsecond. The next question was whether it works differently on different computers, like ones with faster clocks. Yes, I think that's the case, and that's why there are no magic values that you can enable to say "it's a bad network". You have to tune this model somehow and play with the probabilities. But changing it is just one command, so you can try to adjust them and see what works for you. I spent a lot of time finding good values for my code. Any more questions? The question was whether this is focused on Python and whether I checked other stuff. No, I wasn't testing other libraries. I was testing Python because that's what we use, and the requests library is very popular, so I would expect a lot of people use it. But you can test any other library in a similar way, because the network emulator is in Linux. Also, we'll probably think about submitting some patches, because there is certainly stuff that can be improved. For example, if you could specify the retries right in the request in requests, that would be very useful, for me at least. Okay, any more questions? Okay.