How is the food? OK, today I'm just a speaker. I will talk about circuit breakers and retries. I think it's a very common topic, as many of us are running a lot of servers, especially microservices. So just a quick question: how many of you here use microservices in your system? OK. So in my company, our microservice setup started with only a few components. At the beginning, maybe just one or two, but as the service grows we add more and more, and now it's already very complex. So resilience has become a very important part of making sure all the microservices can talk to each other and maintain stability, especially availability, once an SLA has been agreed for the quality of our service. By definition, resilience is the ability of a service to quickly recover from failures or faults. It covers a lot of topics: for example, how you keep communication between different services idempotent, or how you define rate limits so that clients don't call servers too much. Today I would like to talk about circuit breakers and retries.

So, circuit breakers. Let's take a very simple example. We have two services: service A and service B. Requests from users come in, a lot of requests. Service A is just one instance; it processes a request and calls service B. Service B sits behind a load balancer, with multiple instances waiting to process the requests. What if something goes wrong in service B, for example the database is down, a network blip in the connection between A and B, or any unforeseen circumstance? We can call that a system fault. It's very hard to prevent faults, but in a microservice architecture we can use design to prevent failures. That's where the circuit breaker is a perfect example of design preventing failure. If you are in service A and you call service B many times, you need a mechanism to prevent failures in service B from affecting service A directly, which in turn affects our users. With more traffic, those system failures become a disaster for our users. For example, if your users don't see anything come back from service A, they just keep retrying the request: you open the app to make a booking, nothing appears, and you just keep clicking the booking button. In a distributed system it is even harder, because you have many instances and you don't know how many instances of service B are suffering; sometimes it's just a big spike, like a peak event. Those failures from one service can cascade out to the upstream services. So we want to control the latency and the failure rate, and we want service A and service B to handle those errors as quickly and gracefully as possible. That is how the circuit breaker can stop the failures and give a better user experience.

So let's go into the detail of how a circuit breaker is defined using states. You can imagine that a circuit breaker is like the one in your home that connects the power lines. The default state of the circuit breaker is always closed. When the state is closed, all the requests go through and we call service B. Once there's a problem, we go to the open state.
The open state is designed for service A to fail fast. We define a threshold, and there are a few criteria for it: the number of concurrent requests, the percentage of failed requests, or a timeout, for example if a call takes more than 10 seconds you can mark it as a failure. From the open state you can move to the half-open state after a sleep window elapses; this works like a periodic check by the circuit breaker. In the half-open state, if the request succeeds, we move back to the closed state. That lets service A quickly recover to the original state and call service B again immediately.

As an example, and I believe it's not just my company, other companies also use it: there is a library from Netflix, Hystrix. It's interesting that the original library is in Java. Yeah, Java is still not really a bad language. I feel the concept is inherited from the thread management in Java: you can imagine that in Java you have a lot of threads running and you allocate a thread pool for each dependency. There is also hystrix-go, a Go library that does the same thing. It defines each connection between service A and service B with a circuit breaker name. On top of that, we can define thresholds: the timeout value, the maximum concurrent connections, and the error percentage. For each endpoint we have a separate pool and a cache that keeps the state of each circuit by name.

In the high-level design, Hystrix also offers fallback solutions. For example, if the call to service B fails, what else can we do? You can have a table that caches previous responses, and the fallback lets you serve those, so from the user's point of view they don't see the error. That is the fallback solution; this default result creates a better experience for the customer.

So how do you monitor the circuit breaker? Hystrix offers a few plug-ins; if you go to the GitHub page you can see them. You can connect it with your metrics library, for example Datadog or Graphite. There are a few plug-ins, so you can connect with your favourite metrics service and monitor the connections between the different systems.

Yeah, so let's see a demo. OK, let me zoom in. I have defined a circuit breaker for one service; I just named it "producer". I passed in the timeout, which is 500 milliseconds. Those are the most important parameters; you can experiment and see what values are best for your system. You run this server, and this server will call another service by triggering the "call charge producer API" function. As you can see here, the call runs in a goroutine, and its output is passed to an output channel; if there are errors, we pass them to an error channel. Whichever comes first, the circuit breaker immediately updates the state. The "call charge producer API" function is very simple; I designed it for the demo. I use an OS environment variable to decide whether it should fail or not. OK, so let's try to run it.
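Before running it, here is a rough sketch of what such a handler could look like with hystrix-go. The circuit name "producer" and the 500 ms timeout come from the talk, but the function name callChargeProducerAPI, the FAIL_MODE environment variable, and the route are placeholders, not the actual demo code:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net/http"
	"os"

	"github.com/afex/hystrix-go/hystrix"
)

// callChargeProducerAPI stands in for the demo's downstream call; the
// FAIL_MODE environment variable simulates a broken service B.
func callChargeProducerAPI() (string, error) {
	if os.Getenv("FAIL_MODE") != "" {
		return "", errors.New("simulated downstream failure")
	}
	return "charged", nil
}

func handler(w http.ResponseWriter, r *http.Request) {
	output := make(chan string, 1)

	// hystrix.Go runs the call in a goroutine and returns an error channel;
	// whichever channel delivers first decides the response, and the library
	// records the success or failure against the "producer" circuit.
	// A fallback func could be passed instead of nil to serve a cached response.
	errs := hystrix.Go("producer", func() error {
		out, err := callChargeProducerAPI()
		if err != nil {
			return err
		}
		output <- out
		return nil
	}, nil)

	select {
	case out := <-output:
		fmt.Fprintln(w, out)
	case err := <-errs:
		// Once the circuit trips, this is hystrix.ErrCircuitOpen and the call is skipped.
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
	}
}

func main() {
	// The circuit is named "producer"; 500 ms is the timeout mentioned in the
	// talk, and the other settings fall back to hystrix-go defaults.
	hystrix.ConfigureCommand("producer", hystrix.CommandConfig{Timeout: 500})
	http.HandleFunc("/charge", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With FAIL_MODE set, calling the endpoint repeatedly (for example with curl in a loop) should show a few 503s and then the circuit-open error, which is what the demo walks through next.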
Actually, I will run the server. OK, so the server is running. Now I'm trying to run a curl command. (Audience: I think you need to enlarge the font.) Oh, OK. My server is running in one tab, and in another process I will try to call it 10 times. OK, so if I look at the server: everything is a success. That's the perfect scenario where there are no errors at all. Now, to simulate a problem in the server, I set the environment variable, and the handler respects it and returns errors. Yeah, so OK, let's try to run it. OK, so we can see it; can you see it? Yeah: 1, 2, 3. The first three times you get 503 errors, and after that the circuit opens. It's actually very easy to understand, because of the parameters we defined: the circuit comes back after about 500 milliseconds, and the request volume threshold is 3. That means we accumulate at least three requests before calculating the error percentage, and once the threshold is reached, the circuit opens and stops the calls.

Audience: So the easiest denial-of-service attack against your circuit breaker would be to cause a lot of errors, and then it just shuts down? I think in that case, if you want to protect your service, you have to use some kind of rate limit. Audience: No, I mean, if you use a circuit breaker, how do you make sure somebody out there can't abuse it? They just throw a little stone, and it turns into a big stone for you. Audience: I think what you do is use the circuit breaker for unexpected errors, like 500-level errors. Yes, it's based on the server errors. Audience: So it's something in there to look at unrecoverable errors and panics, as opposed to normal responses, right? Yes. Audience: I think another point is that it's very hard to protect against DDoS anyway; you need rate limiting or some other signal. So if you look at the diagram at the beginning: you are in service A, and you design this circuit breaker for your calls to service B. If you design the thing that makes the calls, you are in total control of your resources. If you are in service B, maybe you think differently, because the circuit breaker is placed at the integration point between the two services. And you can define the errors based on the status codes. Audience: I'm sure there are a lot of interesting government installations out there. Audience: No, that's why they're not connected to the internet anymore. They're not supposed to be, until somebody gets very fed up with the deployment process and runs a wire from one to the other.

OK, so that is the circuit breaker. Oh, and this is how you design your system when you have multiple services chained together: you set the timeout, the max concurrent requests, and the error percentage so that the first service's values are always greater than the last service's. That ensures the first service has enough time to get the full, successful response from the last one. Audience: So how do you define the thresholds and the threshold delays? You should use some config management system, so that if there's a problem, or you want to fine-tune your parameters, you can: store your configs somewhere and have the ability to refresh them.
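Since thresholds are best kept in a config store and tuned over time, here is a minimal sketch of applying such values with hystrix-go. The CircuitConfig struct, the circuit names, and the numbers are illustrative assumptions, with the nesting rule from the talk reflected in the timeouts:

```go
package main

import (
	"github.com/afex/hystrix-go/hystrix"
)

// CircuitConfig is a hypothetical shape for values kept in a config
// management system, so thresholds can be tuned without redeploying.
type CircuitConfig struct {
	Name                  string
	TimeoutMs             int
	MaxConcurrentRequests int
	ErrorPercentThreshold int
}

// applyCircuitConfigs registers each circuit's settings with hystrix-go.
func applyCircuitConfigs(configs []CircuitConfig) {
	for _, c := range configs {
		hystrix.ConfigureCommand(c.Name, hystrix.CommandConfig{
			Timeout:               c.TimeoutMs,
			MaxConcurrentRequests: c.MaxConcurrentRequests,
			ErrorPercentThreshold: c.ErrorPercentThreshold,
		})
	}
}

func main() {
	// Example values only. Note the nesting rule from the talk: the circuit
	// closest to the user gets a larger timeout than the one further
	// downstream, so the first service still has time to receive a full
	// success response from the last one.
	applyCircuitConfigs([]CircuitConfig{
		{Name: "frontend-to-orders", TimeoutMs: 3000, MaxConcurrentRequests: 200, ErrorPercentThreshold: 50},
		{Name: "orders-to-payments", TimeoutMs: 1000, MaxConcurrentRequests: 100, ErrorPercentThreshold: 50},
	})
	// In a real setup these values would be loaded from the config store and
	// re-applied whenever they change.
}
```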
Audience: A question about max concurrent requests. How does that work with auto-scaling? Is that max concurrent requests per node, or...? Max concurrent requests is per circuit, over the window it measures; it's not per node. Audience: OK, so if you scale up, you have to scale up your max concurrent requests as well. Yeah, that's right. Audience: It's not a direct relationship; it's more that this is the pool of concurrent requests that the hystrix library looks at to calculate the error rate against the threshold. Yes, and they have a default window, as in the last slide. The pool is not necessarily a thread pool; it's more like a window that hystrix looks at: for all the requests coming in and being processed concurrently, what is the error rate now, and then it decides whether to break the circuit or not. Audience: OK, all right. But yes, if you have more incoming traffic, you will want the pool to be larger, because otherwise you are saturating the pool, and therefore the window. OK.

OK, so let's move on to the next part. If you look at the circuit breaker design, you might ask yourself: what if service B is doing a scale-down event? They have 10 instances and now they scale down to five. Imagine a few of the requests from your service hit an instance that no longer exists, and you get some 500 errors; because of that, the circuit breaker trips when it shouldn't. So how do you evaluate whether those request errors are real failures? We try to keep the request persistent, to have some tolerance, so that you have a buffer or a back-off time to control your error percentage.

So we can define the retry problem like this: you have N clients and only one resource, and all of these clients try to grab the resource at the same time. Each client will only access the resource one at a time, and there is queueing, so you cannot get the resource instantly. It needs to be retried. So how do you design your system for optimistic concurrency control?

First, let's go back to the code from just now. I think a lot of you have probably written retry code like this: you have an API function, you just call it, maybe you retry three times; if there's an error you retry again, until hopefully, after three or however many times, the request succeeds. So what's the drawback of this code? Anyone know? Audience: Lots of conflicting requests; the conflicting requests will break the idempotency of the other requests. Audience: More like saturating the downstream with retries that all collide in time. Yeah, I think maybe that's one thing, but I think it's the timing. It retries very fast, straight away, and then goes back and does it all over again. Yes, there is no back-off. No back-off means that you keep hitting service B, the downstream service, so service B will be suffering: if you keep retrying, service B doesn't have time to recover. That is the reason we need to add a back-off time, to give a buffer between retries.
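A bare-bones sketch of that naive loop, just to make the missing piece obvious; NaiveRetry and its call argument are illustrative, not the demo code:

```go
package retry

// NaiveRetry hammers the downstream call with no pause between attempts,
// so a struggling service B gets no time to recover.
func NaiveRetry(attempts int, call func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil {
			return nil
		}
		// No back-off here: the next attempt fires immediately.
	}
	return err
}
```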
So this one is a bit better: since it's failing, you stop for a bit, then you try again. But if you think about it a bit more, maybe your DB has a problem, and retrying every one second may not be good enough. So maybe a better solution is that every time you retry, you double the waiting time. That is exponential back-off. Apart from that, you can add jitter. Jitter is how you randomize your waiting time: for example, if you have a range from 0 to 100 milliseconds, you pick a random value between those two. The reason is that your service may have multiple instances, and they all retry at the same time; if they retry at the same time, they may break service B's rate limit. So you make all the instances randomize their waits, and you spread the requests out over the window.

So for the back-off and jitter algorithms, you define your sleep time depending on the algorithm. Here I list five. The first one: you don't do anything. The second: exponential back-off, so every time you double the time you waited before. Full jitter means you randomize between zero and the exponential value. Equal jitter means you keep part of the exponential back-off and add a random component on top. And the last one is decorrelated jitter: you compute your next sleep based on the previous sleep value. This one makes more sense, because it combines the randomization and the back-off time.

So let's see the results. These are results I took from an article on the AWS website about back-off algorithms. In terms of clients and the amount of work: if you retry instantly, meaning you have no back-off at all, you can finish very fast, but your system needs to make a lot of calls, because without a buffer the total number of calls becomes very big. You can see that full jitter and equal jitter have the least number of calls. Audience: The top line is the "none" one, isn't it? Isn't "none" linear? Actually I'd expect "none" to be linear, because it doesn't depend on the time constraint. That's a very strange curve; with 100 clients I'd expect it to be linear. Maybe the curve is the other way around. But at 100 clients you need almost 25,000 calls. Not too good. Yeah, I was confused as well, but yes, it's the number of calls. Audience: OK, got it. So it depends a bit. If your CPU and infrastructure are very strong, maybe you just go with the first one. And you also need to consider your clients: yes, the completion time. On completion time, equal jitter is the longest, the yellow line: you wait for double the time and do a bit of randomization between two values, so you wait very long for all the N clients to finish the work. So how do you define your retry algorithm?
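As a rough Go sketch of the five strategies, following the formulas in the AWS write-up; base and maxSleep are assumed parameters, and prev for decorrelated jitter is expected to start at base:

```go
package backoff

import (
	"math/rand"
	"time"
)

// Each function returns how long to sleep before retry attempt n (0-based).

// None: retry immediately, no back-off at all.
func None(_ int, _, _ time.Duration) time.Duration {
	return 0
}

// Exponential: base * 2^n, capped at maxSleep (doubling stops at the cap).
func Exponential(n int, base, maxSleep time.Duration) time.Duration {
	d := base
	for i := 0; i < n && d < maxSleep; i++ {
		d *= 2
	}
	if d > maxSleep {
		d = maxSleep
	}
	return d
}

// FullJitter: random value between 0 and the exponential back-off.
func FullJitter(n int, base, maxSleep time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(Exponential(n, base, maxSleep)) + 1))
}

// EqualJitter: keep half of the exponential back-off, randomize the other half.
func EqualJitter(n int, base, maxSleep time.Duration) time.Duration {
	half := Exponential(n, base, maxSleep) / 2
	return half + time.Duration(rand.Int63n(int64(half)+1))
}

// DecorrelatedJitter: next sleep is a random value between base and 3x the
// previous sleep, capped at maxSleep. Pass prev = base for the first attempt.
func DecorrelatedJitter(prev, base, maxSleep time.Duration) time.Duration {
	d := base + time.Duration(rand.Int63n(int64(prev*3-base)+1))
	if d > maxSleep {
		d = maxSleep
	}
	return d
}
```

A caller would compute the sleep with one of these, time.Sleep it, and retry; which one to pick depends on the trade-off between total calls and completion time discussed above.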
It's based on your real scenario: on the limit you have for the completion time, and on the resources you can afford in terms of the number of calls. My recommendation is the decorrelated jitter, the last one, because it's based on the previous sleep time and it's very easy to implement. Between decorrelated jitter and equal jitter, sometimes you should try both and see which one fits your scenario best.

So yes, now let's do a recap. Should you use both circuit breaker and retry at the same time in your company or your project? The answer is fairly simple. The issue with retry is that it may incur more waiting time, so it's not suitable for online transactions. For example, you receive a request from a user, you forward it to another service, and you need to return within, say, 10 seconds; retry is not really suitable for such latency-sensitive systems. You should use it for your background workers instead: in that code you spawn a new goroutine with a new context, and there you can tolerate the extra latency, so a worker with retry works well. Defining both the circuit breaker and the retry also protects your service. I think most of the work when you implement these two is verifying that they actually work in production. In my experience, hystrix-go is a very flexible way to connect with the different plug-ins, and through instrumentation and monitoring you can evaluate your threshold values better. You can also test your integration points and design your fallback algorithms. You can test this in staging, so that when there is chaos there are no failures, or at least you reduce the number of failures to as few as possible.

So, the resources here: for monitoring, Netflix also has a library for monitoring Hystrix. hystrix-go is a very good library; it tries to do everything the original Netflix Hystrix does. And there's the article on back-off and jitter algorithms, for designing your retries. Yeah, so that's all. Any questions?

Audience: In practice, how do you define your circuit breaker thresholds? How do you define your retry counts and retry algorithms? Oh, OK. In practice, it depends. Firstly, there's the SLA agreement between the two services. For example, service B says, "I cannot be sent more than 1,000 requests per second," so you have to stick to that. Then maybe you need to do rough measurements of the timings; staging and production are different, so the connection timeouts are different, and you use that for fine-tuning. Audience: I guess my question is, in the world of auto-scaling, I know you guys are on the cloud and should have a pretty big account; why don't you just scale out? I think scaling out is one of the solutions to the problem, but if there's a fault in the system, scaling out cannot help in those cases. Audience: I mean, those are very rare cases. No, no, I agree. Audience: So remember when the Google load balancers crashed? That was the time-zone problem, right? You lost a whole chunk of availability there. So what happens if your services are all behind load balancers that just died at your cloud provider? What do you do with the breaker?
Yeah, so that is where you use a circuit breaker: to stop the failures from cascading from any underlying dependency service. If you are upstream, you should always think about a circuit breaker. If you are downstream, you should always think about rate limiting, so the other service doesn't call you too much. Audience: And what else can you do as part of resilience, apart from these? Actually, the other pattern I've used besides rate limiting is buffering: just dump the work into a big buffer somewhere and drain it back slowly. But in a situation like Grab's, no matter how big your buffer is, it's not going to be enough. At Grab's peak hour you have all these people retrying, trying to get a car. And if your app is doing a lot of retries, that's the problem with retry, and why implementing exponential back-off is so important: all your retries generate requests, and every smartphone generates so many requests that even when you bring the service back up, it is immediately brought down again by the influx of traffic. With the circuit breaker, you can help preserve the critical functions and not get hammered by all of this.

An example where I wish I had had a circuit breaker was when I was working at a different company a long time ago. The contract said, of course, that if you get a 401, don't retry with the same credentials, don't automatically retry with the same password. This is very simple stuff. They didn't do that. They implemented a retry for everything: 404, 401, 403, everything. Something went wrong on a deployment, and in the discussions we decided it was time to invalidate the passwords, assuming the clients would stop retrying. They were living up to expectations: we invalidated the passwords, and then cascading failure. Every single cell phone that had this software installed, a very large number for this particular client, on the hour, every hour, started retrying every five minutes. Of course, they were all synced to the same network time, so we just got these ridiculous spikes coming in, which started to break things. A circuit breaker would have saved so many, so many hours. Yeah.

So I think that's the end of my talk. Thank you.