Hey everyone, I'm Anirudh Ramanathan. I'm the CTO and co-founder of Signadot, and I'm going to talk about preview environments. Before we get into that, a little bit about my background. Prior to this I spent a lot of time working on Kubernetes and ecosystem projects. I was one of the maintainers of the core workload controllers like StatefulSet, Deployment, and so on. I also spent time on the Apache Spark project, where I'm a maintainer of the Kubernetes cluster backend. In the past few years I've been focused on developer productivity and how we can make microservice development effective, especially at scale, and that's why this is something I wanted to talk about.

So getting into it, first we should talk about why preview environments matter. The crux of it is that preview environments become important once you have many microservices. When you get beyond a particular scale, when you're talking tens or hundreds of microservices, it becomes very difficult to get really good feedback on the specific piece that has changed. For example, as a developer I might be working on some small part of the entire stack, and getting feedback from the overall application is something a preview environment can help with. So it has these effects of aiding the review process, helping scale developer velocity, and giving people access to this high-quality feedback earlier, which means there's less rework down the line. Overall, it just helps people ship code faster.

This isn't new; preview environments have been around for a bit, and there are several ways of doing them. The traditional approaches have been: take the whole application, or maybe a subset of it, in Docker Compose; put that into a VM; or create a Kubernetes namespace and deploy a copy of every single microservice into that namespace. So these are approaches that have existed.
They have their flaws. They work well at smaller scale, but when it comes to tens or hundreds of microservices, they start to become cumbersome. So here I'm going to talk about how you can use a service mesh, something like Istio or Linkerd, to create very resource-efficient and highly scalable preview environments. As an added benefit, this also unlocks a new model of collaboration where it's possible to test things together across microservices. And this is all before merging code; that's the key thing about these preview environments. The feedback is available as you develop, as you create a pull request, and so on. It's at that phase.

To give you a high-level intuition of what we mean by a request-routing-based preview environment: essentially, we take a test microservice, one or more microservices that someone has been working on and opened a PR for, and stitch that together with the rest of the dependencies. The key thing here is that the dependencies are shared across all the other environments as well. We'll talk about how that sharing is safe, how there's isolation, and all of that. This baseline environment, as we call it, is usually updated continuously through a CI/CD process: whatever is being pushed to the main or master branch is continuously deployed, so each of these microservices is live. That's what helps us ensure that the testing is always valid. When you're testing, in this example, test service one against the rest of the stack, you know you're testing against the latest dependencies, and they will continuously evolve. This is also what keeps resource consumption particularly low: we're not deploying everything, just a subset, just what has changed.
Then we use the service mesh to stitch them together, so you get a view over the whole stack and can still test your microservice along with all of its dependencies. Getting into the physical layer of how this works should give you an intuition. Here's a general microservices environment, with a request flow shown where the arrows represent communication between microservices. This is what we'd call the baseline environment, and it's on top of this that we create the preview environments.

This is what a preview environment for Service A would look like. We're assuming Service A is some backend service that exposes an API, and someone is developing it. They create a new version of Service A containing some changes in their pull request; that's what the Service A tagged with the Git SHA represents. You can see the baseline is still running the master version. Essentially, certain requests, the ones in red, go through the path shown by the dotted red arrow, while the requests with the bold black arrow continue to take the path they always did. That's how you achieve isolation in this model: not all requests are treated the same. Certain requests flow through the new service and help you exercise it; certain other requests continue to stay the course they were taking before. You can extend this model to have as many of these test sandboxes as you want, each receiving certain crafted traffic. So if you use a preview environment where you have set a particular tag on the request being made, you get this new behavior where instead of Service B from the baseline, you hit the test version of Service B, or the sandboxed version, as we call it.
And of course, you can additionally have state, some sort of database, that you add into this as well, so you can have isolation of specific data sources if necessary. We'll talk about all of this and how such a system can be built. There are several pieces to it. I'll break it down as: first, what do you need as a prerequisite to set up this kind of system, and then how do you set it up. Concretely, if you're in your current workflow with CI/CD and the service mesh, how does this actually come to life?

First, the prerequisite: how do these requests carry an identity so they can be treated differently at different hops? That's essentially the intuition: the service mesh makes a localized routing decision at a particular layer, which enables it to treat one request differently from another. For this to work, you need some notion of common context that is passed along the entire request path. The most effective way this can typically be done is OpenTelemetry. You might have heard of OpenTelemetry in the context of tracing. One of the really cool properties of the OpenTelemetry libraries is that they also have this notion of generalized baggage propagation. If there's any header you want to pass along from service to service such that the value is retained at every hop, there is the baggage header you can use. So in this case, we're assuming all of these services have been instrumented with OpenTelemetry, which enables them to just pass along baggage as they receive it. If we set this specially crafted baggage that says, hey, my routing key, which is a parameter we have just defined here, is service-a-pr-15, then it will flow through this entire network of microservices and take the path in red.
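To make the baggage mechanism concrete, here's a minimal sketch in plain Python of the propagation behavior. In practice you would get this for free from the OpenTelemetry SDK's W3C Baggage propagator; the `routing-key` entry name and its value here are illustrative assumptions matching the talk's example:

```python
# Minimal sketch of W3C-style baggage propagation between service hops.
# Real services would use the OpenTelemetry SDK; this just shows the idea:
# the "baggage" header is copied from the incoming request to every
# outgoing request, so a routing key set at the edge survives every hop.

def extract_baggage(headers: dict) -> dict:
    """Parse a W3C 'baggage' header ("k1=v1,k2=v2") into a dict."""
    raw = headers.get("baggage", "")
    entries = {}
    for pair in filter(None, (p.strip() for p in raw.split(","))):
        key, _, value = pair.partition("=")
        entries[key] = value
    return entries

def inject_baggage(entries: dict, headers: dict) -> dict:
    """Serialize baggage entries onto an outgoing request's headers."""
    if entries:
        headers["baggage"] = ",".join(f"{k}={v}" for k, v in entries.items())
    return headers

# The edge sets a routing key for the sandbox under test (illustrative value).
incoming = {"baggage": "routing-key=service-a-pr-15"}

# A service in the middle of the call graph propagates it unchanged to the
# next hop, without needing to understand what the key means.
outgoing = inject_baggage(extract_baggage(incoming), {})
print(outgoing["baggage"])  # routing-key=service-a-pr-15
```

The point of the sketch is that intermediate services never interpret the routing key; they only forward it, and the mesh makes the routing decision at each hop.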
Whereas if we don't specify baggage, it's going to take the path through the latest version of Service A. Okay. Once this prerequisite is sorted, let's get into how this works practically. There is a staging environment that we have already readied, where we will be deploying these test workloads. When a pull request is opened, a CI/CD process kicks off: it builds a Docker image and pushes it somewhere, and then we take that and deploy it as a Kubernetes workload into the cluster. This doesn't affect the baseline; it's a new, separate Deployment or Argo Rollout in the cluster that runs our test version. Typically it's easy to deploy this right alongside the baseline version because it can reuse the same ConfigMaps, Secrets, and so on, but essentially it's completely isolated: a test version of the workload. And you do want to tie its lifecycle to the pull request itself. Now that we have taken care of deploying the actual workload, we also create a Service for it. That is essential because you need it for the request routing later, for telling the mesh, hey, treat traffic that is intended here differently.

The next part is actually achieving the request routing, and this is where the service mesh comes in. We instruct the service mesh: if you see this particular header, the one we talked about earlier, with routing key service-a-pr-15, and the destination is Service A, then instead of sending traffic along the previous path, send it to this new microservice instead, the test version. That's essentially what we are doing here. This is really familiar to people who are used to doing canary rollouts and the like, except none of this is in production; it's being used much earlier in the lifecycle.
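As a sketch of that deployment step, the per-PR sandbox might look like the following Kubernetes manifest. All names, labels, and the image reference are assumptions for illustration; the key idea is a separate Deployment for the test version, plus a Service the mesh can later target with a routing rule:

```yaml
# Illustrative sketch: a per-PR sandbox workload deployed alongside the
# baseline in the same cluster. Names, labels, and image are assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a-pr-15
  labels:
    app: service-a
    sandbox: pr-15
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-a
      sandbox: pr-15
  template:
    metadata:
      labels:
        app: service-a
        sandbox: pr-15
    spec:
      containers:
        - name: service-a
          # image built and pushed by CI for this pull request
          image: registry.example.com/service-a:pr-15
---
apiVersion: v1
kind: Service
metadata:
  name: service-a-pr-15
spec:
  selector:
    app: service-a
    sandbox: pr-15
  ports:
    - port: 80
      targetPort: 8080
```

Because the sandbox runs in the same namespace, it can reuse the baseline's ConfigMaps and Secrets, and deleting these two objects when the PR closes tears the environment down cleanly.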
More concretely, taking the case of Istio, this is what a routing rule looks like. This is an Istio VirtualService, and we're specifying matches. The `http` field here is a bit misleading; it works for both HTTP and gRPC. It matches the header called baggage and looks for the routing key. If it finds that the request is intended for service-a-pr-15, it overrides the regular route it would take and instead sends the traffic to the test version we deployed in the previous step. That is essentially how we achieve the routing: first we deployed the test workload itself and the corresponding Service, and now we create this network routing rule in Istio to actually steer the traffic.

That takes care of the stateless stuff, which is fairly straightforward so far. Now we should talk about how we can isolate a database. First off, not every preview flow actually needs an isolated database. There may be cases where you don't need to isolate the data because you can achieve isolation in a different way, like at the application layer; maybe the developers using the preview environments use their own test entities, and so on. In a lot of those cases, it's often beneficial to have high-quality shared data, so it is a trade-off as to when you want to actually isolate the data sources. One case in which it is always necessary is schema changes. The reason I say that is that a schema change to a shared database is going to break the baseline, and since the baseline is somebody else's dependency, that's going to break a whole lot of other people. So for a DDL change, you definitely want to isolate it. In that case, you can deploy an ephemeral database alongside the test workload we deployed, and make sure these two are linked together.
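The routing rule described above might look roughly like this. This is a hedged reconstruction of the kind of VirtualService shown on the slide; the host names, sandbox Service name, and routing-key value are illustrative:

```yaml
# Sketch of an Istio VirtualService that diverts requests carrying the
# sandbox's routing key to the test version, while everything else
# continues to the baseline. Names and the key value are assumed.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-a
spec:
  hosts:
    - service-a
  http:
    - match:
        - headers:
            baggage:
              # match a routing-key entry anywhere in the W3C baggage header
              regex: ".*routing-key=service-a-pr-15.*"
      route:
        - destination:
            host: service-a-pr-15   # Service created for the PR sandbox
    - route:
        - destination:
            host: service-a         # default route: baseline version
```

Note that despite the `http` field name, Istio applies these match rules to gRPC traffic as well, and one such rule is created (and later removed) per sandbox.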
So essentially, if you are testing a schema change, you can deploy a new version of the database, and it need not always be physical isolation. It need not be a whole new instance of a database cluster; it could be a schema within a database, or tables within a database, and so on, whatever the lowest level of isolation is that you need. But you would deploy this alongside the test workload whenever you want that additional degree of isolation.

Then we get into the interesting case of message queues. Message queues are really common, and you can see the producer-consumer relationship here. With message queues, OpenTelemetry and baggage propagation actually give you a very clean solution, where the messages themselves can carry this kind of metadata or header, for example the routing key. So let's say every message contains the routing key of the request that added it to the queue. On the consumer end, each consumer can be aware of the routing key that it should consume. A test consumer would only consume what is relevant to it, the messages annotated with its own routing key, and the baseline consumer is going to ignore those. This is essentially how we make the message queue multi-tenant at the message level itself by using OpenTelemetry.

Bringing all of this together, you get preview environments that give you very high-quality feedback while deploying very few resources, with the service mesh configured to realize all of these different preview flows and provide the desired degree of isolation. So what are some benefits? Why even do any of this? It's super scalable: you saw that we were deploying the minimal amount of resources, just what changed.
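The message-level multi-tenancy described above can be sketched as follows. This is a minimal simulation, not a real broker integration: the queue is a deque of (headers, payload) tuples, and the `routing-key` metadata name is an illustrative assumption matching the earlier request-routing examples:

```python
from collections import deque

# Shared queue simulated as a deque of (headers, payload) tuples.
queue = deque()

def publish(payload, routing_key=None):
    """Producer: attach the caller's routing key (if any) as message metadata."""
    headers = {"routing-key": routing_key} if routing_key else {}
    queue.append((headers, payload))

def consume(own_key=None):
    """Consumer: take only the messages matching this consumer's identity.

    A sandboxed test consumer (own_key set) consumes only messages tagged
    with its key; the baseline consumer (own_key=None) consumes only
    untagged messages and ignores sandbox traffic.
    """
    taken = []
    skipped = deque()
    while queue:
        headers, payload = queue.popleft()
        if headers.get("routing-key") == own_key:
            taken.append(payload)
        else:
            skipped.append((headers, payload))
    queue.extend(skipped)  # leave non-matching messages for other consumers
    return taken

publish("order-1")                                  # baseline traffic
publish("order-2", routing_key="service-a-pr-15")   # sandboxed traffic
print(consume("service-a-pr-15"))  # ['order-2']
print(consume())                   # ['order-1']
```

With a real broker such as Kafka or RabbitMQ, the same idea maps onto message headers carried in the OpenTelemetry baggage, with each consumer filtering on the routing key as it receives messages.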
And we should contrast that with trying to set up full copies of the infrastructure, which is way more expensive and difficult to manage as well. Then there's DevX: the time to set up each of these environments is really, really small. Because you're deploying so little, the preview environments associated with a PR come up quickly, and they can also be torn down quickly. You don't really have to think about the infrastructure cost of all of this, because it tends to be very minimal compared to any of the traditional approaches. One big advantage is the fidelity here: because you're always testing against this baseline that is constantly updated with real dependencies, you have high confidence in the feedback. You don't have the problem of running against dependencies that are outdated or stale. That helps ensure your testing is always valid; the previews being seen reflect the current state of the overall application.

And then there's something I did not go into detail on, which is this notion of collaboration. I want to expand on it a little, because this is all being done using request routing. We were essentially setting a particular header on certain requests so they would follow a different path compared to others. That gives you the ability to combine preview environments. You could have a preview environment for PR N in repository A and PR 15 in repository B; they both have their own routing context associated with them. If you combine them together, you have essentially realized a routing configuration where the test version of service A and the test version of service B come together. This enables developers to test their changes together.
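One way to think about combining sandboxes is as resolving a request's routing context against several sandboxes' route overrides at once. The sketch below is a simplified model of that resolution logic, not any particular mesh's implementation; all sandbox and service names are assumptions:

```python
# Model of combined preview environments: each sandbox contributes routing
# overrides mapping a service to its test version. A request tagged with
# multiple sandboxes is resolved against all of their overrides, so one
# request exercises both test versions; untagged requests see only baseline.

BASELINE = {
    "service-a": "service-a.baseline",
    "service-b": "service-b.baseline",
}

# Per-sandbox overrides, as the mesh routing rules would express them.
SANDBOX_ROUTES = {
    "repo-a-pr-n":  {"service-a": "service-a.test"},
    "repo-b-pr-15": {"service-b": "service-b.test"},
}

def resolve(service, sandboxes):
    """Return the destination for a request tagged with the given sandboxes."""
    for sandbox in sandboxes:
        override = SANDBOX_ROUTES.get(sandbox, {})
        if service in override:
            return override[service]
    return BASELINE[service]

# A request tagged with both sandboxes hits both test versions end to end:
combined = ["repo-a-pr-n", "repo-b-pr-15"]
print(resolve("service-a", combined))  # service-a.test
print(resolve("service-b", combined))  # service-b.test
print(resolve("service-a", []))        # service-a.baseline
```

In a mesh-based setup, this combination is realized by having both sandboxes' routing rules match the same routing context carried in the request's baggage.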
If a feature spans multiple microservices, you have the ability to preview those changes end to end, across all of those microservices, before you merge code. That's something new that this approach to preview environments enables. That's all I had. Thank you so much for listening. If this is interesting, or if you're building something like this, I would love to know more; please do reach out. Here are my contact details. Thank you so much.