I'm Mohammed Reza, a third-year PhD student in the Autonomous Distributed Systems Lab at Umeå University in Sweden. These days I mostly work on autonomic microservice management, especially traffic management policies. So, let's begin.

As you know, the performance of the microservices we work with depends on many factors, such as the complexity of each service, the incoming workload, and many other things we are all familiar with. Managing such a dynamic environment can be really complex. The good news is that service meshes are there to help us handle this complexity on the fly. One of their amazing capabilities is traffic management policies, which give us patterns like circuit breaking and retries without the need to implement anything in our development process.

Before we see them in practice, I want to ask: how many of you are using a service mesh in production? Nice, great. And how many of you are using traffic management policies such as circuit breaking and retry mechanisms in production? There's a gap: people are not really using these features, and that is exactly what we are going to discuss.

Okay, let me share a shocking experience I've had during my research. It is possible to ruin the user experience by misconfiguring these two patterns, the retry setting and the circuit breaker. In the worst case they can even cause outages in your services. But they can also help us maintain our SLAs. I would say these patterns are, to some extent, a double-edged sword, so we have to be careful with them. Let's explore them in practice.

I've done some experiments with these patterns, and I'm going to share them right now. For the experiments I used Online Boutique, a simple open-source e-commerce app implemented by Google that you might also be familiar with. I used the cart endpoint, which invokes a chain of three microservices, as you can see here: frontend, recommendation, and product catalog. In total, a single complete response requires six requests across these three services.

I deployed Online Boutique to a Kubernetes cluster with Istio installed, and I manually identified three key parameters to highlight the worst-case scenarios for both the circuit breaker and the retry setting. For the circuit breaker I used http2MaxRequests, the maximum number of requests to a backend service; based on my experience and understanding, it behaves like a queue size, but for simplicity let's call it the circuit breaker configuration in this talk. For the retry setting I used attempts, the number of retries allowed for a given request, and perTryTimeout, the timeout per attempt for a given request, including the initial call and any retries; based on my understanding, it acts like an interval. So we can call them the interval, the attempts, and the circuit breaker configuration.

Okay, now we reach the main experiments of my talk. Consider that everything is set up and the app is deployed. We are going to study where we should enforce the circuit breaker.
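To make that concrete, here is roughly what such a circuit breaker looks like in Istio. This is a minimal sketch with illustrative resource and host names, not the exact manifest from my experiments:

```yaml
# Minimal sketch of an Istio circuit breaker on the frontend service.
# Resource name, host, and namespace are illustrative.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: frontend-circuit-breaker
spec:
  host: frontend.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 20  # the "queue size": max concurrent requests to this backend
```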
In the first scenario, I enforced the circuit breaker on the first tier, the frontend service, with 20 as the queue size, and then generated 1.2 times the maximum load the service can handle with a response time below 100 milliseconds. Let's see what happens to the services. In the first tier, as you can see here, there are about 200 successful requests per second and also about 60 or 70 circuit-broken requests. The second tier shows the same throughput, but the third tier has quadruple the throughput, which follows from the architecture of the app: of the six requests behind each complete response, four go to the third tier. Keeping these values in mind, we will enforce the circuit breaker on the other tiers to see what happens in practice, and we also have to look at the response times.

In this next scenario, we enforce the same circuit breaker configuration on the second tier. You can see there are about 260 successful requests per second, so the throughput got higher, and the second tier shows the same throughput. But there are some circuit-broken requests in the second tier, which look like failed requests from an outside client's perspective, and the same thing happens in the third tier.

In the next scenario, we enforce the circuit breaker on the third tier. We see about 220 successful requests per second and about 50 or 60 failed requests. Again, the circuit-broken requests in the third tier are seen by our clients as failed requests.

Now let's enforce the circuit breaker on all tiers. In the first tier we see around 200 to 210 successful requests per second and some circuit-broken requests, which is quite similar to the situation where we enforced the circuit breaker on the first tier only.

But that was not all; let's look at the response times. Here is the CDF chart for all the experiments I've done. The response times for the successful requests, also known as the carried throughput, are highlighted with a circle marker, and those for all requests, including the failed ones, with a triangle marker. As you can see, when we enforce the circuit breaker on the first tier or on all tiers, the response time is mostly below 100 milliseconds. If we enforce it on the second or third tier, the response times are higher; for the second tier it goes up to 500 milliseconds, which is huge.

So that was one of the experiments. Now let's see what happens when the incoming traffic is changing, which is the actual environment we face. Let's skip the chart walkthrough, because it takes time, and focus on the lesson we can learn from this case. Here, the circuit breaker is enforced on the first tier with different generated traffic loads, as you can see: 80%, 100%, and 120%. What we learn is that higher incoming traffic leads to more successful requests but also more circuit-broken requests, and that having the circuit breaker in the first tier helps us fail fast, so we get a better user experience and the system can maintain an acceptable response time.

Another case is to investigate the impact of the circuit breaker values, meaning the queue size we talked about. For this case, we enforce the circuit breaker with different values on the first tier, again with overloading traffic. What we learn is that there is a tradeoff between request rate and response time: with higher circuit breaker values the request rate increases but so does the response time, and vice versa. Or we could say the circuit breaker protects latency at the expense of availability. The final takeaway from this case is that the best circuit breaker value for this app and this setup is a queue size of 20.

Now it's time to see what happens when we add the retry setting as well.
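As a reminder, the retry setting lives in an Istio VirtualService. Here is a minimal sketch with illustrative names and values; attempts and perTryTimeout are the actual Istio fields, and perTryTimeout is what I have been calling the interval:

```yaml
# Minimal sketch of an Istio retry policy on the frontend route.
# Resource name, host, and values are illustrative.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: frontend-retries
spec:
  hosts:
  - frontend.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: frontend.default.svc.cluster.local
    retries:
      attempts: 1            # number of retries allowed for a given request
      perTryTimeout: 100ms   # timeout per attempt, including the initial call
      retryOn: 5xx           # which failures should trigger a retry
```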
In this case, the circuit breaker with different values and one retry attempt are enforced on the third tier, again with overloading traffic. We did not set any retry interval for this case, to see the impact of the combination. We can make the same observation as in the previous case, that there is a tradeoff between request rate and response time when configuring the circuit breaker. In addition, when we configure a lower circuit breaker value, there are more failed requests in the downstream services, and therefore more requests to be retried. In general, the retry setting can increase the response time whenever requests hit the circuit breaker or there is any failure between our services.

Let's dive deeper into the retry setting and see what happens if we change the number of retry attempts. To highlight the worst-case scenario, we configured a really sensitive circuit breaker, limiting the queue size to one so that we get more circuit-broken requests. What we learn from this case is that higher retry attempts can increase the throughput slightly, but with higher response times.

Now let's look at the intervals: what if we have different intervals between retries? In this case we learn that high retry intervals do not really affect the response time and request rate, but having no interval at all can cause a retry storm between our services. Keep in mind that we are using http2MaxRequests in Istio and there is no replication, just a single instance of each service.

Okay, let's see what happens if there are spikes in the incoming traffic, with different intervals, which is to be expected in a live environment. What we learn from this case is that a high retry interval again does not affect the response time and request rate, but too-short retry intervals can cause a slightly higher number of circuit-broken requests, especially after the spikes. On the other hand, if the load spikes are larger, we see more circuit-broken requests with higher response times; we could call these short retry storms after the load spikes.

And as the final case, let's look at what happens when we just change the enforcement point of these two patterns, the retry setting and the circuit breaker. When we enforce the optimal circuit breaker on the first tier or on all tiers, the enforcement point of the retry setting plays no role, because our client, the traffic generator, didn't support the retry setting, so nothing was retried. But when we enforce the circuit breaker on the second tier, we see a higher number of failed requests and higher response times, and if we also enforce the retry setting on all tiers, the number of failed requests grows further. Finally, if we enforce the circuit breaker on the third tier while enforcing the retry setting on the first tier or on all tiers, we observe the highest number of failed requests and the highest response times.
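For completeness, enforcing the circuit breaker on the inner tiers just means repeating the same DestinationRule per service. A sketch using Online Boutique's service names, with an illustrative namespace and the same queue size:

```yaml
# Minimal sketch: the same circuit breaker repeated on the second and
# third tiers. Namespace and values are illustrative.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: recommendation-circuit-breaker
spec:
  host: recommendationservice.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 20
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: productcatalog-circuit-breaker
spec:
  host: productcatalogservice.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 20
```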
So let's summarize what we learned: how should we configure the retry setting and the circuit breaker, and how should we deal with such dynamic environments?

In general, finding the optimal configuration can be really tricky, because the performance of our services changes dynamically with the incoming traffic and with the CI/CD pipelines we run. We cannot say one configuration is the best for all services; we have to tune it every time, so maybe we need a controller for that.

Based on these findings, we can conclude that if we want lower response times and the highest possible throughput, it seems beneficial to enforce a more sensitive circuit breaker in the earlier tiers, ideally the first tier, since this enables fast failure and a better user experience.

About retries: enforcing a high number of retry attempts is not really recommended, because it can cause a retry storm if we have some sort of misconfigured circuit breaker or we didn't configure any retry interval. As for retry intervals: a high retry interval does not play a huge role in throughput, but a short retry interval, as we've seen, can increase the possibility of retry storms in those situations. One benefit of the retry setting is that if we don't have a well-tuned circuit breaker in our system, the retry setting, meaning both the retry interval and the retry attempts, can somewhat increase the throughput during overload, although with slightly higher response times.

And finally, the most important point about these two settings: they can be really good choices when we face transient failures or transient overloads. For larger, persistent overloads, we have to think about other mechanisms, such as scaling. So that was all, and that was my research. Thank you.

[Host] Does anyone have any questions for Mohammed? We are actually ahead of schedule, so it would be really great if you have any. If you have a question, please walk up to the microphone in the middle.

[Audience member] Hello. Have you done any research into tuning retry jitter, not just the interval?

[Mohammed] No, I didn't, but I'm planning to in the next phases of my work. Thank you.

[Host] Any other questions? Okay, I'm not seeing any. If not, thank you again, Mohammed.