Hello everybody, we are from AppDynamics. My name is Sergey Sergeyev, I am a software architect. Hi, I'm Ayush, I'm a principal engineer at Cisco Systems. And we are presenting testing in prod with canary.

AppDynamics is a full-stack observability platform which collects hundreds of billions of MELT data points each day to provide insights into your application, network, infrastructure, and security. It also provides insight into higher-level context like business and users. Our customers rely on the stability, availability, and accuracy of the platform to make infrastructure and business decisions, so we take great care in testing the platform and its services.

At its core, testing is about reducing uncertainty by checking for known failures, past failures, and predictable failures. In other words, running a piece of deterministic code in a particular environment and expecting it to fail or succeed in a repeatable way. In the standard test pyramid we move from more isolated to more highly integrated tests: we start with unit tests, followed by integration tests, then high-level E2E tests, and most of us top it off with some manual testing. But the place where we run these tests is a pre-production environment, which in most cases drifts heavily from its production counterpart. Once you deploy your code, you are not just testing your code; you are testing your code along with all the uncertainties in the environment. Those uncertainties include the users, the code itself, the environment and infrastructure you are running on, and even the point in time at which you deploy can make a huge difference in how the application behaves. I'm still waiting to meet somebody who says they love their staging environment. Staging environments are usually broken, inconsistent environments which lack any real user traffic or load, yet they are very expensive to maintain.

The focus of testing has always been to prevent bugs from even reaching production, but that rarely works. Instead, our goal can be to prevent bugs from being released to production, not from being deployed to production. We can decouple release from deployment, and there are two general ways of doing it: feature flags, and one of the progressive deployment strategies. We reserve feature flags for large features which span multiple teams. For our day-to-day regular deployments, we use canary. Canary is a deployment strategy that releases an application incrementally to a subset of users and moves further only when acceptance tests pass. For us, a normal request-based canary doesn't work because of the nature of our workload, so we use a tenant-based canary: we simply target specific tenants as the canary.

To make tenant-based canary work in our environment, we use Envoy filters and a set of internal libraries; any request that interacts with our sync or async APIs goes through these internal libraries. When a request hits our ingress gateway, the Envoy filter in the ingress gateway parses the claims in the JWT and creates a canary ID. That canary ID is then passed along with all the subsequent requests to our downstream services. We use HTTP headers in our sync APIs and Kafka headers in our async APIs, and it is the role of our internal library to pass this header forward.
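To make that flow concrete, here is a minimal Go sketch of the same idea: derive a canary marker from a tenant claim in the JWT and attach it as a header that downstream calls keep forwarding. This is not our actual Envoy filter or internal library; the header name `x-canary-id`, the claim name `tenant_id`, and the canary tenant set are assumptions for illustration, and a real filter would verify the token signature first.

```go
// Minimal sketch, not the AppDynamics filter or library code.
package main

import (
	"encoding/base64"
	"encoding/json"
	"net/http"
	"strings"
)

// canaryTenants is a hypothetical set of tenant IDs currently targeted as canary.
var canaryTenants = map[string]bool{"tenant-42": true}

// tenantFromJWT decodes the (already authenticated) JWT payload and pulls out
// the tenant claim. A real filter would verify the signature before trusting it.
func tenantFromJWT(authHeader string) string {
	parts := strings.Split(strings.TrimPrefix(authHeader, "Bearer "), ".")
	if len(parts) != 3 {
		return ""
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return ""
	}
	var claims struct {
		TenantID string `json:"tenant_id"` // assumed claim name
	}
	if json.Unmarshal(payload, &claims) != nil {
		return ""
	}
	return claims.TenantID
}

// canaryMiddleware mimics the ingress-gateway behaviour: derive a canary ID
// from the tenant claim and attach it as a header, so downstream services
// (and the internal libraries) can keep propagating it.
func canaryMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if tenant := tenantFromJWT(r.Header.Get("Authorization")); canaryTenants[tenant] {
			r.Header.Set("x-canary-id", tenant) // forwarded on every downstream call
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("canary-id: " + r.Header.Get("x-canary-id")))
	})
	http.ListenAndServe(":8080", canaryMiddleware(mux))
}
```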
Looking at a sync API, how does this work in our system? When we receive a JWT it gets converted into a canary ID, and the Envoy filter at the service level can decide, based on that canary ID and the downstream destination rules, whether to route the request to the primary service or to the canary service. This happens for all of the sync APIs.

Async APIs are a bit more involved. One way of solving this is by running two brokers: a primary Kafka broker and a canary Kafka broker. A similar flow happens at the ingress layer where the canary ID is created. Then there is an Envoy filter at the service that produces the Kafka message. So here, if service A is producing the Kafka message, the Envoy filter at service A replicates that message so both the primary service and the canary service receive it. The primary service simply discards the canary message and never processes it; it is processed exactly once, by the canary service, which talks to the canary Kafka broker. Any other service in the environment consuming the same topic keeps working exactly as it did before; a small sketch of this discard rule follows at the end of this part. Now I would like to hand over to Sergey to continue.

Yeah, and I basically came to poke holes in canary testing and talk about its limitations. So what are the limitations of canary-based testing? What if you want to test an infrastructure component change or upgrade, for example an upgrade of Istio itself or its components, or changes to rules and configuration? How do you test that? Some application and data store upgrades cannot be tested with canary because they cannot be reverted cleanly, so you need something else to test them. Testing disaster recovery for the whole system is also challenging: you can't just restore the canary and say you're done. A canary usually exercises only some components of the system, and it's hard to test bigger transactions with canaries. This is where cell architecture and GitOps can come into play and help.

We designed our system around a cell architecture to address scalability challenges. A cell is a collection of components grouped together from a design, implementation, and deployment point of view; it is independently deployable, manageable, and observable, and cell deployment is repeatable, which helps with a lot of testing use cases. Our platform itself is over a hundred applications contributed by dozens of teams, so the system is quite big and complex. Each team can provision a subset of the components needed for their testing, or even the whole system, using our GitOps framework. Teams developing infrastructure services can use the same framework to provision infrastructure components as well, so we use one framework for test deployments and production deployments alike, including when you are developing service mesh components or other infrastructure. We can provision a subset of components or the whole system, and we have two different flavors: an ephemeral flavor, which is the minimum set of components deployed for integration-testing purposes, or a cloud flavor of the system with all the cloud resources from external providers. We can customize it to be small or large, whatever the team needs. This way we minimize the cost of that testing infrastructure and test what is not testable with canary testing.
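Going back to the async flow for a moment, here is a minimal Go sketch of that discard rule; it is not our internal library. It assumes the canary marker travels as a Kafka record header named `x-canary-id` and uses plain local types in place of a real Kafka client.

```go
// Minimal sketch of the discard rule, under the assumptions stated above.
package main

import "fmt"

// Role distinguishes which broker/consumer pair a service instance belongs to.
type Role int

const (
	Primary Role = iota
	Canary
)

// Message stands in for a Kafka record; real code would use the Kafka client's
// record type and its header accessors.
type Message struct {
	Headers map[string]string
	Value   string
}

// shouldProcess encodes the rule from the talk: canary-tagged records are
// replicated to both brokers, the primary consumer drops them, and the canary
// consumer handles only them, so each record is processed exactly once.
func shouldProcess(role Role, m Message) bool {
	_, isCanary := m.Headers["x-canary-id"]
	if role == Primary {
		return !isCanary
	}
	return isCanary
}

func main() {
	msgs := []Message{
		{Headers: map[string]string{}, Value: "regular tenant event"},
		{Headers: map[string]string{"x-canary-id": "tenant-42"}, Value: "canary tenant event"},
	}
	for _, m := range msgs {
		fmt.Printf("primary processes %q: %v, canary processes %q: %v\n",
			m.Value, shouldProcess(Primary, m), m.Value, shouldProcess(Canary, m))
	}
}
```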
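And on the cell provisioning side, here is a purely hypothetical sketch of the kind of request a team might check into Git for a GitOps framework like the one described. None of the field names, flavor values, or component names come from our actual tooling; they only illustrate "subset of components, ephemeral or cloud flavor".

```go
// Hypothetical provisioning request; field names and values are illustrative only.
package main

import (
	"encoding/json"
	"fmt"
)

// EnvironmentSpec describes the subset of the system a team wants provisioned.
type EnvironmentSpec struct {
	Team       string   `json:"team"`
	Flavor     string   `json:"flavor"`     // "ephemeral" (minimal, for integration tests) or "cloud" (real cloud resources)
	Components []string `json:"components"` // subset of the applications, or empty for the whole system
	TTLHours   int      `json:"ttlHours"`   // ephemeral environments get torn down automatically
}

func main() {
	spec := EnvironmentSpec{
		Team:       "service-mesh",
		Flavor:     "ephemeral",
		Components: []string{"ingress-gateway", "istio-rules", "service-a"},
		TTLHours:   8,
	}
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out)) // this kind of spec would live in Git and drive the deployment
}
```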
At a high level, our testing infrastructure and change promotion looks like this. At the bottom there is a development area where developers can experiment: if you are developing, for example, some Istio components or rules, you can provision your own deployment, play with it, and do some development testing. Then we have a CI pipeline which validates all the components and applications, and only after that do they reach the shared environment with canary support. Teams can override canary deployments, but in general everything goes through this CI pipeline all the way to production.

And if you're interested in GitOps and service mesh, join us: the company is actively hiring, so follow us on LinkedIn if you want updates. That's it from us; we'll be happy to answer your questions, and if you have more during the break, we'll be around. Thank you so much.