Good morning, everyone. I'm Kush, a platform engineer at DevRev, and we are thrilled to be here today. We're not just going to talk about the challenges we had with Istio; we hope to explore them together. So before we start, by a show of hands, how many of you have faced challenges while scaling or configuring Istio in your organization? Nice. And how many of you have had triumphs with Istio, in the sense that it made your performance better, or made your platform more reliable or more robust? Quite a few. We hope to address a few of the challenges you might have faced, and we hope the number of hands in that second group will grow as you improve the performance and reliability of your platforms. We are going to take you on the journey of DevRev tinkering with Istio: exploring its capabilities, understanding the hurdles, and celebrating the victories. Today I have with me Khushboo, who is an infrastructure and security engineer at DevRev. Over to you.

Hey, everyone. At DevRev, we're helping create the world's most customer-centric companies by bringing the customer's voice into software development. You can head to our website using the QR code here to learn more and try it out for yourself. Like any other modern SaaS company, at DevRev we use microservices and we use Kubernetes to deploy them. And there are just so many ways of doing this that it's hard to call any one way the right way, the wrong way, or the best way. So we're going to take you through how we began, what our network infrastructure looked like, and how it looks today after we adopted Istio.

This is a very high-level view of how things looked pre-Istio. We used external load balancers for handling our external traffic as well as our internal traffic. Within the cluster, we leveraged Kubernetes-native objects (Ingresses, Services, Pods) to manage our traffic, and we had one Ingress per namespace, which was essentially one per microservice. This kept each service owner in total control of their namespace, their Ingress, and everything else. You might notice that communication between different services was also happening externally through the load balancer, rather than services talking to each other directly. There were a couple of reasons for doing so. One was service discovery: service discovery is native to Kubernetes, but after a certain scale it's not as reliable, so we chose to use the ALB. The other reason was the lack of load balancing options you get with a plain Service: you essentially just get round robin, and that wasn't enough, so we used an external ALB. The third reason was to have a bit more insight into how our network was behaving and what kind of traffic was flowing through it, which was absent without an Ingress on an ALB.

Now, this is how it looks today. The very first observation one might have is that instead of having an ingress per namespace, we just have one, which takes care of all external traffic flowing into the cluster. That ingress is the Istio ingress gateway. It sits separately in the istio-system namespace and handles all the traffic. For service-specific traffic, however, we use a combination of virtual services and destination rules to do traffic manipulation and traffic management targeted at those namespaces. We're going to talk more about this as we go.
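To make that concrete, here is a rough sketch of the shared-gateway pattern: one Istio Gateway in istio-system plus a per-namespace VirtualService. The names, hosts, and ports are illustrative, not our actual manifests.

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: shared-ingress-gateway        # hypothetical name
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway             # binds to the default Istio ingress gateway pods
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: wildcard-cert   # hypothetical TLS secret in istio-system
    hosts:
    - "*.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service                    # hypothetical service
  namespace: my-service-ns
spec:
  hosts:
  - my-service.example.com
  gateways:
  - istio-system/shared-ingress-gateway   # attach to the shared gateway across namespaces
  http:
  - route:
    - destination:
        host: my-service.my-service-ns.svc.cluster.local
        port:
          number: 8080

Each service owner keeps control of their own VirtualService and DestinationRule, while the gateway itself is managed centrally.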
So, first things first: why did we even do that? Why did we choose to go this route, and why choose a service mesh? A service mesh gives you a uniform way to connect, secure, and manage your microservices. It enhances communication and gives you more control over the cluster. And you don't have to take my word for it; you can read the NIST guidelines, which recommend a service mesh as the way to run Kubernetes clusters efficiently and securely.

So, service mesh, aka Istio for the context of this talk. We're going to talk about certain capabilities. Observability: Istio generates detailed telemetry by default, gives you more insight into how your services are communicating with each other, and lets you debug issues faster. Traffic management: like I said, what you can do with only native Kubernetes objects is very limited; you have to rely on external load balancers and gateways for that. With Istio, you get a wide range of traffic management options. You can choose between load balancing algorithms, do things like circuit breaking and rate limiting, tune your timeouts, and use deployment techniques like blue-green, canary, and whatnot. That gives you an edge, right? Security, very important: Istio gives you automatic mTLS, and it lets you control access and enforce policies based on service identities, making your intra-cluster communication secure as well. Service discovery, which we mentioned briefly: service discovery is native to Kubernetes, yes, but a service mesh adds to it, does a whole lot more, and helps you run very complex intra-cluster communication more predictably. With all these advanced features, your services naturally become more resilient to failures, and everybody wins.

We'll cover how we achieved all of this at DevRev, how we leveraged Istio to do it, and we'll talk about the challenges we faced and the best solutions we found to fix them. I guess we can all agree that it's not realistic to cover every Istio configuration and scaling challenge in just 25 minutes, so we've organized the presentation around the high-level network abstractions Istio provides through its CRDs and feature capabilities.

Before we dive into those network capabilities, let's look at how we handled Istio installation and upgrades. To start our journey with Istio, we created our own Helm chart, which combined the community charts with custom ingress controller and Istio gateway resources. This gave us flexibility and helped us cater to our use cases. Coming to upgrades: when it comes to upgrading Istio, we adopted the canary upgrade approach. This method not only allowed us to take care of breaking changes, but also let us gradually roll out a few services to a new control plane, making sure there were no incompatible or breaking changes against the configuration we currently use. However, our journey hasn't been without challenges, because the heavy reliance of canary upgrades on istioctl presented a few hurdles in our GitOps workflow. We are heavy users of Argo CD and Argo Workflows, and our complete CI/CD is driven via GitOps. The imperative nature of the istioctl commands was somewhat at odds with this approach.
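For reference, a revision-based (canary) control plane can also be expressed declaratively. This is a minimal sketch, not our exact setup; the revision name and namespace are illustrative.

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-canary
  namespace: istio-system
spec:
  profile: default
  revision: canary            # new control plane runs side by side with the existing one
---
apiVersion: v1
kind: Namespace
metadata:
  name: my-service-ns         # hypothetical namespace being migrated
  labels:
    istio.io/rev: canary      # point this namespace's sidecar injection at the canary revision
    # remove the old istio-injection=enabled label when switching to revision labels

Workloads in the relabeled namespace only pick up the new control plane after a rollout restart, which is exactly why the data plane restart automation described next mattered to us.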
So whenever we had to perform an upgrade, we had to disable the application in Argo, do the manual magic of upgrading, and then re-enable the application. Apart from the GitOps friction, we also had to deal with Envoy proxies referring to stale versions of the control plane; they would not get in sync unless and until a rollout happened for that particular service. To overcome this, we streamlined the approach with an automated script that took care of data plane restarts. The script monitored for control plane upgrades and triggered a data plane rollout whenever it was deemed necessary. This helped us automate the process to a certain extent and made it more efficient.

Now let's take a look at gateways and ingresses, and at the major triumphs we had with Istio's gateway and ingress capabilities. The first was the migration to edge gateways. Our migration to edge gateways, via the Envoy proxy sidecar for each service, was a significant shift in our architecture. It not only eliminated a single point of failure, the old central gateway, but also enabled us to leverage edge features of Envoy like gRPC load balancing, circuit breakers, dynamic service discovery, staged rollouts, and traffic splitting. These features not only scaled and improved the performance of our platform, they also increased its resilience and robustness.

Second, let's look at how we achieved JSON to gRPC transcoding and how we made it magical. An essential part of our architecture is communication between internal services, which is primarily handled via gRPC. To facilitate this, we needed effective JSON to gRPC transcoding. Before introducing Istio, our old central gateway took care of this, and service owners had to maintain API schemas and API mappings in both places to make sure client JSON requests could reach the gRPC services. Envoy can do this transcoding using Envoy filters and protobuf descriptors (stubs). The way we did this, and the way you can do it, is to mount the stubs into your Envoy sidecar, which lets it seamlessly convert a JSON request from a client into gRPC for the server. There were a lot of manual steps involved, so we took the help of CI and our Bazel build system to orchestrate this and make the process magical for developers, so they didn't even have to know how it was happening. Bazel was responsible for generating the protobuf stubs, CI was responsible for placing those stubs into the main pod's proto directory, and we used Istio's sidecar annotations to mount that proto directory into the Envoy sidecar. We also had a generic Envoy filter templated into our Helm chart which would pick up the protobuf stubs from the proto directory.
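As a rough illustration of the mounting piece only (the volume type, paths, and names here are hypothetical, and the transcoder EnvoyFilter itself is omitted), Istio's sidecar annotations can expose a descriptor directory to the Envoy sidecar like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-grpc-service                   # hypothetical service
spec:
  selector:
    matchLabels:
      app: my-grpc-service
  template:
    metadata:
      labels:
        app: my-grpc-service
      annotations:
        # extra volume added to the pod; how it gets populated (init container,
        # image layer, CI artifact) is up to your build pipeline
        sidecar.istio.io/userVolume: '[{"name":"proto-descriptors","emptyDir":{}}]'
        # mount that volume inside the istio-proxy container so the transcoder
        # filter can read the descriptor files
        sidecar.istio.io/userVolumeMount: '[{"name":"proto-descriptors","mountPath":"/etc/envoy/proto","readOnly":true}]'
    spec:
      containers:
      - name: app
        image: my-grpc-service:latest     # hypothetical image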
Now let's look at the major challenges we faced while scaling with gateways and ingresses. The first major challenge was that we were seeing a lot of 5xx errors, and the majority of those were 502s and 503s with upstream connection resets. After debugging and digging into this, we realized it always happened while the Istio ingress gateway was scaling, and we narrowed it down to Envoy's termination drain duration, which defaults to five seconds. To mitigate this, we set our termination drain duration to 300 seconds, which allowed enough time for all in-flight requests to terminate gracefully rather than having queued requests cut off abruptly. Although this is a deprecated approach now, because as of Istio 1.13 there's a newer capability that drains and exits only when active connections reach zero; we are yet to migrate to that, and from what we've seen it can be more reliable.

The second challenge was that we experienced occasional 504 errors from our ALB, and after digging into it we found this only happened for APIs that were taking longer than the load balancer timeout. We had a few APIs that would take five to ten minutes during some exports, imports, and so on. The solution was for the Istio ingress to send keep-alive probes to the NLB so the load balancer knows the connection isn't terminated yet. But after digging into it, we didn't find any native way to do this in Istio, which is why we had to rely on an Envoy filter on the ingress gateway that takes care of sending these keep-alive probes.

Right. Let's talk about how we made traffic management more controllable and granular using Istio. Istio's offerings let you do deployment techniques like canary and blue-green, where you can have more informed upgrades: you can be confident your upgrades are not faulty and that they are successfully able to serve traffic. All that magic is done using Istio virtual services as well as destination rules. Another very interesting and useful feature we leverage at DevRev is Envoy filters. Envoy filters are a very powerful capability that we can use through Istio itself. At DevRev we use a combination of Wasm and Lua Envoy filters, and these filters act on different parameters, including but not limited to gRPC context, HTTP headers and paths, and JWT metadata. These came in very handy for specific use cases where some services wanted to manipulate traffic. One such use case, for example, was end-to-end testing: we wanted some of our end-to-end test traffic to land only at a dummy service for the sake of the test, and not actually call our main service. These Envoy filters work like a charm for such use cases.

As always, there are challenges when you try to do good things, so I'm going to talk about a few challenges we had when we implemented these traffic management features. A very interesting thing that happened when we started using Istio was that our WebSocket connections started closing abnormally. We received a lot of close code 1006, whereas the expected code when a connection closes normally is 1000. Also, interestingly, it was only happening with Chrome clients and not all of the browsers. We turned to the community, dug deeper, and figured out that this happens because of a very specific Envoy setting called delayed close timeout, which is set to one second by default and which is what causes these abnormal connection closes. To fix it, we had to use an Envoy filter to set it to zero, and then connections terminated gracefully. However, with the latest releases there's a new way to deal with this problem: you don't have to use an Envoy filter, you can use an Istio setting itself, a pilot flag, which is a much easier way to configure this.
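For reference, the Envoy filter workaround we're describing looked roughly like the following; treat it as a sketch (the name, namespace, and workload labels are illustrative), and prefer the newer pilot flag if your Istio version supports it.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: websocket-delayed-close          # hypothetical name
  namespace: my-service-ns               # hypothetical namespace
spec:
  workloadSelector:
    labels:
      app: my-websocket-service          # hypothetical workload serving WebSockets
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          # 0s disables the delayed-close behavior that triggered the 1006 closes
          delayed_close_timeout: 0s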
Other issues we had were with blue-green and canary deployments. During a particular test of a blue-green deployment for one service, we were trying to mirror traffic to see how our new deployment was working, and we learned that the mirroring was not actually happening as expected. We started debugging and learned that our internal service-to-service communication was not honoring our virtual services; it was bypassing them. The reason was that the virtual service was only attached to our own gateway, the DevRev gateway we deployed for handling external traffic, and not to the default Istio gateway called mesh. For your service-to-service communication to honor your virtual service configuration, you have to add the default gateway called mesh to your virtual service. Another thing that happened during canary upgrades was that our istiod pods started consuming a lot of memory and CPU on our nodes, which was not normal and not what we had seen in the past. It turned out to be an issue with the config distribution tracking flag, which is set to true by default and which was the culprit for the high resource consumption. Setting it to false handles that problem; you can head to these issues to learn more about what this flag is used for.

Let's talk about how we made our clusters and intra-cluster communication more secure with Istio. Kush mentioned how we moved from one dedicated gateway to decentralized edge gateways per service, and offloaded a lot of operations to those Envoys instead of our gateway service having to do them. One such operation was JWT validation. With Envoy proxies you just need to create a RequestAuthentication resource and hook your issuer into it, and then all JWT validation can be taken care of by the Envoy itself. You don't need a dedicated service to do it for you.

And then there's mTLS. Like I said, Istio offers mTLS by default: whenever you install Istio it's enabled, but it's set to permissive mode, which essentially lets your Envoys accept both mTLS and non-mTLS traffic, so nothing much gets interrupted. But when you move to strict mode, which makes sure your Envoys accept only mTLS traffic and nothing else, you need to be very careful to create service entries for any external endpoints. If your services are expected to talk to anything outside of your cluster that Istio does not recognize or know about, make sure you create those service entries with the hosts and port numbers you want your services to talk to. You can also restrict that to namespaces: if you want only certain namespaces to be able to talk to these external endpoints and not others, you can do that as well with Istio.

That's just authenticated inter-service communication. You can also do authorization: you can allow or disallow certain services from talking to each other. A very strict way of doing it is to have a deny-all traffic authorization policy, which says you can only talk to services within your namespace and nothing else, and on top of that you create explicit allow lists for services which, by business logic, are expected to talk to each other. For all the Envoy filters, templates, manifests, and the CI and Bazel magic we were talking about, we will be sharing a GitHub repo where you can find all the samples and all the manifests.
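To give a feel for the deny-by-default pattern just described, here's a minimal sketch using Istio's security CRDs. It's illustrative only (namespaces and names are hypothetical), not taken from our actual repo.

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system          # placed in the root namespace, so it applies mesh-wide
spec:
  mtls:
    mode: STRICT                   # sidecars accept only mTLS traffic
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: my-service-ns         # hypothetical namespace
spec: {}                           # an empty ALLOW policy matches nothing, so all requests are denied
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-same-namespace
  namespace: my-service-ns
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces: ["my-service-ns"]   # explicitly allow callers from this namespace only

With strict mTLS in place, remember the point above about ServiceEntries for anything your workloads call outside the mesh.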
So now let's take a look at how Istio helped us with application monitoring. So we can divide the possibility part for application into two major milestones. First is the traffic monitoring. Second is monitoring the application itself. So Istio provides comprehensive traffic monitoring capabilities which can help you capture and bound and outbound traffic giving you a complete picture of how your services are interacting and how data is flowing throughout the system. And if any issue arises you can rely on tools like Kiali and Omlis understand the flow of requests and understand complex service interactions. So we did the river heavily using Kafka consumers and asynchronous workers. And it's very heavy and it's very costly to actually implement tracing for Kafka consumers and asynchronous workers. So to actually visualize the network traffic for all of them we relied on Kiali because Kiali or Istio does not need any instrumentation on your application level. And if your service is onboarded to the mesh it can provide you with default metrics like what's your error rate, what's your latency and from where traffic is flowing. So this actually held us to debug a lot of network issues and a lot of flows issue with our Kafka consumers. And we were able to know that where anomalous traffic is coming from and why is this service is degraded or why this latent, this endpoint is being throttled. Next thing is Istio can also provide you powerful application monitoring and debugging features. One of the major one is tracing an APM which can allow you to track individual requests as they flow through various services. So with sidecar context propagation feature you can actually take the full part of requests even in a complex microservice architecture. So how we did this at Debra was that we utilized share piece of code for GRPC and HTTP server initialization which take care of context propagation from the sidecar trace IDs ensuring contest consistent and accurate tracing throughout the system. Now the last we enabled the config flag which is proxy merge which actually helped us reduce the load on Prometheus systems. So if you don't have proxy merge enabled then you have to configure two scraping targets for Prometheus. One for your sidecar container, second for your main application container. With this proxy merge the sidecar will take care of scraping the main application matrix and merging with the sidecar matrix itself. So you would have just a single target to scrape. One of the key aspect of maintaining healthy STO is monitoring the STO control in itself. And the STOD exposes a lot of matrix which you can monitor to make sure that the STO is behaving properly. We are just going to walk you through some of the key matrix which actually helped us debug few other production incidents. So first by keeping the track of sidecar injection failures we can actually check and identify any issue which can impact service to service communication or which may cause an incident in your system. Another important matrix which we tried was success or failures of configuration pushes from STO to Envoy. Which actually helped us to ensure that each and every data plane proxy is running the correct configuration as in sync with the latest control brain version. And at the last we monitored the STOD resources. 
And finally, we monitored istiod's own resources. As Khushboo mentioned, during our canary rollouts we saw a big increase in memory and CPU consumption, and this kind of monitoring can tell you if there is a memory or CPU leak, if there is a configuration change you made that you shouldn't have, or if something else is wrong with istiod. Lastly, Istio comes with a set of default alert rules, which are used by the Istio performance suite. These rules provide a good starting point for monitoring the performance and health of your service mesh. However, every organization's environment is unique and you might need to customize those rules. The good thing is that the way these performance alert rules for Alertmanager are implemented, they are customizable to whatever you need. In the end, it's important to remember that effective monitoring is all about having the right data at the right time. By monitoring the control plane and customizing the alert rules, you can make sure you have a healthy service mesh and everything is working well.

Now let's take a brief look at the challenges we faced while implementing monitoring and observability with Istio. As we scaled our services, we hit a point where an enormous volume of metrics was overwhelming our Prometheus setup. This was due to high-cardinality metrics, which increased the load on our monitoring system, and that's when we realized a single Prometheus setup was not enough. To address this we did several optimizations, one of which was to disable host header fallback, which reduced the number of unique metric series and alleviated the load on Prometheus. But that only worked to a certain extent, so we decided to use a federated Prometheus setup. Federation lets us split the metrics load across multiple Prometheus instances, and we also drop short-lived metrics from the global, federated Prometheus; a rough sketch of such a federation scrape job follows this section. This again reduced the load, but it was still not enough, because we needed long-term retention for our metrics too. So in the end we relied on Thanos for horizontally scaled Prometheus, with S3 for long-term retention, which helped us handle the load and also keep metrics stored for compliance purposes.
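As referenced above, here is a minimal sketch of a federated Prometheus scrape job with a drop rule for a high-volume series; the job name, label matcher, target address, and dropped metric are all hypothetical examples, not our exact configuration.

scrape_configs:
- job_name: federate-istio                          # runs on the global Prometheus
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{job="istio-mesh"}'                          # pull mesh metrics from the local Prometheus
  static_configs:
  - targets:
    - prometheus-istio.monitoring.svc:9090          # hypothetical per-cluster Prometheus
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: istio_request_duration_milliseconds_bucket   # example of a short-lived/high-volume series to drop
    action: drop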
All right, we've tried to cover as much as we could during this short window of 20 to 25 minutes. So, a little bit about what's next for us at DevRev on our journey with Istio. We at DevRev are expanding to different regions in terms of hosting the platform, so that we can cater to our customers' needs around data residency and availability. For that, we evaluated multi-cluster Istio deployments and the different deployment models it has to offer: single versus multiple control planes, single versus multiple networks, whether or not you need a cross-network gateway to manage traffic between the clusters, and internal versus external certificate authorities to manage mTLS in those clusters. As a next step, we're going to share our findings and learnings from that journey as well. You can scan the QR code below to access the GitHub repository, which has manifests for all the implementations we spoke about during this talk, and we'll also be sharing the multi-cluster findings through the same repository.

Also, as Kush shared at the beginning, we faced issues with our upgrades and they're not as smooth as we would want them to be. So we're writing our own custom operator to take care of those problems: to have automatic canary upgrades, make sure the data plane is upgraded to the same version as the control plane, and also be more GitOps friendly. Thank you all for being here. I hope we've prepared you a little better for your journey with Istio and also given you a heads-up about a few bumps you might run into. Please do give us feedback by scanning the QR code here, about the talk, what you liked, and what you didn't, so we can make ourselves better. And I don't know if we still have time, but if we do, we can take a few questions.