Thank you everyone. My name is Matt Grossman and I'm here from Lyft to talk about how we've used Envoy to rethink microservice development. I don't work with JP over on Envoy Mobile, and I don't work with Matt on other parts of Envoy; I'm part of the developer experience org at Lyft. We're not networking, we're not observability. Our goal is to make our product developers as fast and safe as possible when they have a change they've made and want to feel confident pushing it out to production. So I'm going to take us through the various systems we've had at Lyft, the different developer environments, the problems we ran into with them, and how we've pivoted to a system that uses Envoy to hopefully make development faster and better.

If we rewind about two years, this is what our change pipeline used to look like when a developer had a modification they wanted to get out to prod. In development, each user had their own "one box," which was basically just a very large EC2 instance that ran the entirety of the Lyft system. They could change whatever they wanted, and their services were completely sandboxed. When they were done and felt confident in their change, they'd create a pull request, get a plus-one from a teammate, and then take it out to our staging and production Kubernetes clusters. Notably, development versus staging and production ran on very different stacks here: the one box was totally separate, while staging and production were quite similar, apart from a couple of differences in how data flowed, et cetera.

To zoom in a little more on what a one box was: like I said, really large EC2 instances. When users wanted to make a change, they'd make the modifications locally on their own machine, sync from local to the remote, and we'd do some hot reloading. The big thing was that because the boxes were so large, they could run every single service at Lyft. That worked for a really long time, especially as we moved from a monolith to microservices. At first we didn't have that many services, but eventually we ran into problems.

One problem with this kind of development, where each user gets their own EC2, is that it scales as engineers times services. As Lyft grew, we got a lot more engineers, which meant we needed a lot more one boxes to hand out. Also as Lyft grew, the number of services we were running and creating got much higher, so each one box needed to be larger. It was just not great scaling complexity. Another large problem with the one box was, like I said, that it was very different from our Kubernetes staging and production environments. We had issues with environment drift, where something would work great in development, then get deployed out to staging or production and hit strange issues due to the different stack it was built on top of. Sometimes the opposite problem would happen too: someone would put out a change and not realize until later that it was broken in development. And that caused issues with unclear ownership, because our team, the developer experience team, owned the one box.
So when something went wrong in development but worked in other environments, people would point fingers at us, but we're not experts on every single person's service, so we'd have to rely on the service owners. And they, of course, didn't feel much responsibility themselves, because they weren't sure what they'd done wrong; everything seemed to be working for them. The combination of these problems led to an unfortunately normal workflow for a lot of developers. First, they'd provision one of these one boxes. They'd SSH in, run a command, and spin up a whole bunch of containers, which often took over an hour. Some of those containers would start to fail, so they'd end up reprovisioning. And eventually they'd just give up and deploy and test in staging, where they were a lot more confident things worked as expected. This, of course, was a problem; it's not a great developer environment. Because of that, we decided to go back to the drawing board and figure out a better arrangement. That's what led us to a solution where we share a developer environment.

So, in that last step I mentioned, what's so wrong with deploying to staging? If people ended up doing that anyway, well, staging is not so bad. It's a real environment, very similar to production. Staging actually has a bunch of great advantages. Like I said, it's very production-like at Lyft: we have this whole system of simulated rides sending ingress traffic, so when you deploy to staging you get traffic exercising your new code path and gain some confidence in it. And teams really care about maintaining their SLOs and making sure that if their service is broken in staging, they actually go fix it. They'd get paged about it, and they'd annoy a bunch of people, so of course they go back and fix it.

Those are the advantages, but staging also had some problems if you wanted to use it as a developer environment. One: a broken service in staging can mean many transitive dependencies are broken too. You deploy something bad, and services many calls away in the call graph suddenly start paging, and their owners go, oh, it's because someone on a team in a far-flung org deployed. That's not great. Two: because there was one canonical version of a given service running in staging, if teammates wanted to collaborate on the same service but test their own branches, it wasn't straightforward. They'd have to take turns: hey, I'm testing this in staging, can you wait an hour while I wrap this up and make sure it works as expected? That, of course, slows down velocity. And lastly, deploying to staging was more complicated than testing in development. At Lyft, deploying to staging was really just the final step before deploying to production, so you'd have to create a pull request, get a plus-one from a teammate saying the code works as expected, run through all the CI checks, et cetera, and then start the deploy pipeline. All of those problems with staging led us to realize we couldn't just use it as-is.
We needed to make some modifications to our existing shared environment. To do that, we created a system called Staging Overrides, which gives us a lot of the advantages of a shared environment for development but fixes a lot of the isolation problems. That was our overall goal, and here's a very high-level picture of what Staging Overrides looks like.

First, users make requests to the core staging environment, but they have their own instance, what we call an offloaded pod, that contains just their changes. When they're done with their changes, they don't have to fully merge them; they just need to push them up and get a container image built. They can deploy that to the staging mesh, and from there it lives in staging just like anything else. But notably, it does not receive traffic from anyone else testing in staging, which fixes those problems of teammates stepping on each other's toes.

The next part of Staging Overrides is the override metadata. We don't want random strangers in staging, designers or product folks doing their own testing, to hit your potentially buggy code, but we do want you to hit it so you can build confidence. To do that, we embed metadata within the request that informs us how to do this kind of routing and get you sent to your offloaded pod, in this case for the fraud service. And the last part that brings it all together is routing overrides, sometimes called just override-based development: once we know a request carries metadata saying we want the test version, we need to act on that and route dynamically to the pod you specified.

So those are the main components of the system we call Staging Overrides. First are these unregistered offloaded pods. For that, we use our xDS control plane, with some very simple logic that does endpoint discovery service (EDS) exclusion. Next is the propagated override metadata. For that, we use Envoy's built-in distributed tracing support, which we just heard about in the talk from the University of Illinois. Tracing has a concept called baggage, which we'll dig into a little further, that lets us propagate this data. And lastly are the actual overrides that act on that data: a custom filter we wrote in Envoy, together with a cluster type called the original destination cluster, or original DST. In the next couple of minutes, I'll talk through all of these pieces, how they work, and how we combine them.

First, offloaded pods. This is probably the most straightforward part. We created a GitHub bot, so any user with an open pull request can just type /offload, and once their container image has finished building, it starts a new deployment in our staging Kubernetes cluster. The deployment, and the pod inside it, is labeled specifically with offloaded-deploy: true. That's really all we had to do on the deploy side, just wire it up with a bit of bot tooling. But in our control plane, we're processing these Kubernetes events and sending EDS updates out to all the sidecars in the mesh, telling them: hey, here's a routable instance of the pay service. We wanted to make sure we didn't do that for an offloaded deploy. In very simple pseudocode, you can imagine that when this pod popped up as an event in our mesh, we'd just skip over it, as in the sketch below.
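Something in the spirit of this hedged C++ sketch; the types here are stand-ins, not Lyft's actual control-plane code:

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in types: the real control plane's pod events and EDS endpoint
// builders are not shown in the talk.
struct PodEvent {
  std::string address;
  std::map<std::string, std::string> labels;
};
struct LbEndpoint {
  std::string address;
};

// Build the EDS endpoint list for a cluster, skipping offloaded pods so an
// offloaded deploy never receives ordinary staging traffic.
std::vector<LbEndpoint> buildEndpoints(const std::vector<PodEvent>& pods) {
  std::vector<LbEndpoint> endpoints;
  for (const auto& pod : pods) {
    const auto it = pod.labels.find("offloaded-deploy");
    if (it != pod.labels.end() && it->second == "true") {
      continue;  // Excluded from EDS: reachable only via an explicit override.
    }
    endpoints.push_back(LbEndpoint{pod.address});  // Normal pods as usual.
  }
  return endpoints;
}
```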
So this on its own, simply deploying a pod into staging, the same environment others are comfortable testing in, actually provided some benefits. Some people just wanted to SSH into a pod, run a script, and have confidence that what they were doing worked; they didn't need any ingress traffic. But of course the power comes when you can actually choose to route to that pod, and to do that, we needed the override metadata.

Here's an example of what this metadata looks like. In the metadata within the request, we say the upstream cluster, which at Lyft maps roughly to the service we're trying to override, as well as the IP address for it. Here the cluster is pay, and the IP is 1.2.3.4. Let's dig a little more into that override metadata. It doesn't have to be very complicated. Technically we have some protobuf, but it's really just a JSON blob: if you have a set of overrides you want applied for the duration of the request, you just say, hey, this service in particular, rather than going to the blessed version running in staging, go to this IP:port instead.

So this is what we wanted the data to look like, because we knew it was everything we needed to do our routing. The question was where to actually put this data and make sure it was always available. To do that, we leveraged our existing setup at Lyft for distributed tracing. Distributed tracing, as we just learned, makes it really easy to collect a whole bunch of data about how requests flow end to end, and at Lyft we have a pretty good setup with header propagation. It's not perfect, but we use this trace data for normal tracing of requests; our mesh looks something like that whiteboard diagram, and we use tracing for observability's sake, which is traditionally what it's been used for. But the fact that we already had this header propagating everywhere was really useful for us, because within this OpenTracing data (we're using OpenTracing; OpenTelemetry is the new standard, but for what we're doing they work about the same), we were able to embed a special field called baggage. Most of the metadata on spans within a given trace is local to that span and doesn't get propagated to any further upstreams in the call graph. But baggage is special: it's information you embed once, and it gets propagated across process boundaries. For us that's super useful, because all we need to do, really early in the request path, is take this metadata and chuck it into our baggage field so it gets propagated throughout the whole request.
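As a rough reconstruction of that blob (the exact field names from the slide aren't in the transcript, and the port is made up), the baggage value might look something like:

```json
{
  "overrides": [
    {
      "cluster": "pay",
      "address": "1.2.3.4:80"
    }
  ]
}
```

Depending on the tracer, baggage items like this typically ride inside the propagated trace headers (OpenTracing tracers often use headers of the form ot-baggage-<key>), so every hop forwards them automatically.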
So now that we have the metadata being propagated, and we have our offloaded pods, we need to build the networking magic that actually acts on that metadata and sends us to the pod we're interested in. A very high-level overview of Envoy at Lyft: pretty standard stuff. Each service is deployed alongside a sidecar Envoy, a container within the pod for each service, and this container handles all ingress to the service as well as egress, which provides all the benefits you'd expect out of Envoy; applications stay super simple and don't need to think much about networking. But for our specific use case, it's great that we knew every service was deployed alongside these sidecars, because it gave us a hook where we could intercept all of the routing that normally happens.

To do that, we used Envoy's HTTP filter subsystem. We created our own filter and injected it into every single one of the sidecars deployed at Lyft. What we need this HTTP filter to do is mutate how we end up doing the routing, how we choose the upstream endpoint, just before we'd go to whatever was normally running in staging. At a really high level, the filter logic is: extract the information out of the baggage, which is just the service/IP pairs I showed before; see if any of the overrides in that baggage match where we're about to route (by the time this HTTP filter runs, Envoy has already chosen where it intends to go, based on host header, paths, et cetera); and if there's a match between an override and where we plan on going, basically swap out that part of the routing and go to the overridden IP address instead.

Here is a very, very high-level C++ overview of what the filter looks like. I'm not going to walk through it entirely today; if you're interested, you can download the slides later and check out all the code. But I'll walk you through the key components that make it tick. The first part is what we've been talking about: because Envoy has first-class support for tracing, when we define our HTTP filter we get callbacks that give us access to things like the current Envoy span, and from there all we have to do is call getBaggage with the key we want. We stored our override metadata under the overrides baggage key, so we call getBaggage, pull it out, and parse it into whatever data structure we want to represent it in; we have a protobuf that we parse it into.

This next part's a little hairy: based on the way HTTP filters interact with routing in Envoy, we can't just straight-up mutate which upstream cluster we send to. In this case, you can imagine we were originally sending to the pay cluster and want to go to a different cluster instead. What we end up having to do is take the existing route and wrap it in this thing called a DelegatingRoute, which we had to subclass; when we do that, we can provide the cluster we want to go to instead. So the question is which cluster. We don't want the normal pay cluster, because the normal pay cluster has all the endpoints that weren't skipped over, the true, blessed endpoints that are really deployed in staging and work well. What we want is to send it to this other cluster, called an original destination cluster. The original destination cluster, sometimes called original DST, has a special configuration option called use_http_header.
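A minimal sketch of what such a cluster definition looks like in Envoy config; the cluster name and timeout here are assumptions, not Lyft's actual setup:

```yaml
clusters:
- name: original_dst_cluster   # hypothetical name
  connect_timeout: 1s
  type: ORIGINAL_DST           # endpoints are not discovered via EDS
  lb_policy: CLUSTER_PROVIDED  # required for original destination clusters
  original_dst_lb_config:
    use_http_header: true      # read the target from x-envoy-original-dst-host
```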
And when that is enabled, if a request gets sent to this cluster with the header you see at the bottom, x-envoy-original-dst-host, containing an IP address and port, Envoy will take that IP address and port from the header and route there instead of wherever the request was originally going. So coming back to the filter: when we overrode the cluster we were routing to, we sent it to the original destination cluster instead, and we also added a new HTTP header, x-envoy-original-dst-host, containing the IP address we extracted out of the baggage earlier. And that's it. Not too hard. Of course, there's a lot of error handling and other details like that which I'm not covering here.

As a real high-level recap, our steps were: take the overrides out of the baggage, from the tracing header; take the route Envoy has already chosen and wrap it in this DelegatingRoute, so that for the most part every attribute of the route is exactly the same as the one Envoy picked, except the cluster name, which we point at the original destination cluster; set the route to this new route we just created; and finally add the IP address from the metadata we got out of the override baggage as the x-envoy-original-dst-host header.
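Condensing the slide snippets into one place, here's a hedged sketch of what that decode path might look like. OverrideConfig and ClusterOverrideRoute are assumed names (the latter standing in for the DelegatingRoute subclass), and the exact callback APIs, like setRoute, have shifted across Envoy versions:

```cpp
// Hedged sketch of the override filter's decode path; names and exact APIs
// are illustrative, not Lyft's published code. decoder_callbacks_ is the
// usual Http::StreamDecoderFilterCallbacks* member of an HTTP filter.
Http::FilterHeadersStatus OverrideFilter::decodeHeaders(
    Http::RequestHeaderMap& headers, bool) {
  // 1. Pull the override metadata out of the tracing baggage.
  const std::string baggage =
      decoder_callbacks_->activeSpan().getBaggage("overrides");
  if (baggage.empty()) {
    return Http::FilterHeadersStatus::Continue;
  }
  OverrideConfig overrides;  // our protobuf form of the JSON blob
  if (!Protobuf::util::JsonStringToMessage(baggage, &overrides).ok()) {
    return Http::FilterHeadersStatus::Continue;  // malformed: ignore
  }

  // 2. Compare each override against the cluster Envoy already chose.
  const Router::RouteConstSharedPtr route = decoder_callbacks_->route();
  if (route == nullptr || route->routeEntry() == nullptr) {
    return Http::FilterHeadersStatus::Continue;
  }
  const std::string chosen_cluster = route->routeEntry()->clusterName();
  for (const auto& entry : overrides.overrides()) {
    if (entry.cluster() != chosen_cluster) {
      continue;
    }
    // 3. Match: wrap the chosen route in our DelegatingRoute subclass so
    //    everything stays identical except the cluster name, which now
    //    points at the original destination cluster...
    decoder_callbacks_->setRoute(std::make_shared<ClusterOverrideRoute>(
        route, /*new_cluster=*/"original_dst_cluster"));
    // ...and tell that cluster where to actually connect.
    headers.setCopy(Http::LowerCaseString("x-envoy-original-dst-host"),
                    entry.address());
    break;
  }
  return Http::FilterHeadersStatus::Continue;
}
```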
And with that, I'd call it our Staging Overrides v1. It was a system that took a year and a half or so to pivot to from the one-box system, and we did this a while ago; we were really happy with where it got us. Developers can keep sending requests to staging just like they normally would, developers or product managers or data scientists, anyone who's just trying to test something or see how something looks, all the while blissfully unaware that there's all this testing happening around them, because it's not affecting their requests. It only affects the requests of people who explicitly opt into these new routing rules.

So now I'd like to talk about some of the things we've built on top of this base system, and where we want to go next. One thing I hand-waved over at the beginning is how we actually get that baggage onto the request in the first place; we call that baggage attachment tooling. Ideally we don't want our developers thinking about how to manually craft this IP address JSON blob and tuck it onto every single request; we want that abstracted away from them. The way we want our customers, which are other developers at Lyft, to think about it is: I have a pull request, I've deployed this pull request into staging, so give me the baggage that routes me to it and attach it for me. I don't really care about any of the other details.

We've worked on a couple of different approaches to baggage attachment. The first two are examples of adding another filter up front at the edge, living only at the edge rather than in the mesh, that decodes information either from a header or from the host name, which gives us everything we need to determine what baggage to attach. But there's one other option that we actually originally started with and are still using today: we send it through a proxy. This is great because, if we intercept the request the user sends, we have all sorts of control to add whatever data we want before it makes its way upstream.

The proxy we have is our own homegrown proxy; we'd love to open source it someday, and maybe we'll get to that. Originally, we created this scriptable ingress proxy for mobile engineers to mock out backends. Rather than pointing their mobile apps at staging, they'd point them at this proxy, and then they could write all sorts of scripts against it that let them, say, return some dummy JSON data if the backend endpoint hadn't been completed yet. That allowed the product engineers, both mobile and backend, to work concurrently once they'd agreed on a contract for what the response body would look like. So this is what we ended up using: we tell our developers to point their apps at their own proxy, and that's what they do today. The proxy has all the information it needs once it's been configured, and it attaches the baggage before sending the request off to the rest of staging. Because we already had this proxy, which was very fortunate for us, we took what we already had and took the easiest path to getting this working for our users. We defined a helper TypeScript snippet that says: I want all of my requests that flow through the proxy to have these Envoy overrides; here's my project, here's my branch. Our proxy is responsible for figuring out what goes in the metadata, and you can see here it's talking to the Kubernetes API server to get back all the information it needs to fill that metadata in.

Given everything we've built so far, we still think there's a lot more we could do with this general overrides-driven development. We already have this proxy that knows how to interact with baggage; what if we chuck that proxy into the middle of the call graph, overriding to it the way we'd override to an offloaded pod? Now we have a wildly configurable, scriptable solution that lets us play with and fiddle with requests before they actually go to their upstreams. You could test what something would look like if an upstream were erroring out; you can do all sorts of wild scripting like this, and it really gives you mocking at any hop in the mesh, as opposed to just at the ingress edge. Another thing we've been exploring is: what if we redirect the traffic to laptops? Rather than only going to offloaded pods deployed in staging, if we embed something in the metadata that redirects to your own laptop, you get a much faster iteration cycle: no pushing up a pull request, no waiting for a container image build or a deploy; requests hit your own computer directly. And another extension that we've been playing with, and keep adding to, is what else we could include in this baggage. For example, what if we include something like a log-level-debug flag that, very temporarily, only for the duration of your request, increases the verbosity of all the logs?
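Purely as an illustration (the log_level field is hypothetical, not a shipped Lyft feature), the baggage could grow per-request flags alongside the routing overrides:

```json
{
  "overrides": [
    { "cluster": "pay", "address": "1.2.3.4:80" }
  ],
  "log_level": "debug"
}
```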
Something that might not normally show up in the logs appears just because you added a little more metadata, and that's sufficient; we can have a lot of flags like this that are only valid for your request. These possibilities are, I think, the full vision for Staging Overrides: we want to give complete control over request flow to our developers. They should feel like they have all the knobs and levers they need to reproduce whatever arrangement they're testing, and feel confident they've tested everything comprehensively, all the different arrangements, before they end up sending it to real staging. And of course, when we do this, we want it fully isolated, so developers aren't stepping on each other's toes, and our team, developer experience, is responsible for tooling good enough that developers can think about it seamlessly: these are the changes I want to test, and here's all I need to do that.

So, just wrapping up in conclusion here. Overall, these are really good results compared to the last iteration of our developer environment with the one box. Provisioning a one box would take an hour, sometimes more, or just never work; getting a pull request built and pushed up to staging takes ten minutes, if that. There's also a lot more parity now between our staging and production environments than our development environment ever had with the others. That's great for us because the infra is more similar: changes we make for production often end up benefiting staging as well, so we get a lot more for free. And with more functional parity, developers who push something, see it work under staging overrides, and then push it to production are a lot more confident it will work as they expect. And lastly, we think this is a great new framework that has changed the way our team works: we're thinking about what else we can let our developers do to feel confident in their changes before they go to production, and we're just getting started on all the different things we can enable.

As for challenges: context propagation, as we talked about in the last talk, is challenging. That header now contains not only the trace ID but also these routing rules, so if you're trying to override a service real deep in a call graph and somewhere along the way the header gets dropped, the request won't end up being routed to the override, because the metadata is gone. Thankfully, at Lyft we maintain a whole bunch of common libraries for Go, Python, and TypeScript, so we can inject our own code there and make sure propagation happens. There's another general problem of data isolation. We're sharing the same staging environment, so you can imagine that if an override is really broken and starts emitting a lot of error stats, that might actually page the on-call. We've had to do work there to ensure those stats get sent to a different stats namespace, so they don't trigger all the alarms that have already been set up. And this has been challenging because it's a genuinely new paradigm; we had to do a whole year-long effort of reteaching people how this is supposed to work, because it's very different from the one-box,
everyone-gets-their-own-sandbox style of development. And lastly, if we could redo this, we think there are better ways to do it, even within Envoy, than original DST. There's also other really cool tech now; Telepresence is another CNCF tool that does something very similar. We'd probably reassess. Of course, we had the Envoy filter expertise in-house and were glad to build on that, but if you're interested in exploring something similar, be sure to check out some of those options. And that's it, thank you so much.

The question is: do we feel like we're abusing telemetry data to influence how our routing works? First of all, we only do this in our staging environment, though we know other people who have had success doing this in production. Originally, yes, that's what our trace header was used for, telemetry, and in fact when we looked at it we wondered whether we could really trust it, because our distributed tracing had some flakiness to it. But this actually forced us to reinvest in it. The way we see it is: yes, that metadata was used for telemetry, but it can be used for more, and it's worth investing in because it can be used for more. It was great for us that it was already partially implemented and already mostly propagating, and finding another use case where it could actually influence product flows was enough for us to go all in: instrument which services are dropping traces, figure out how to fix them, how to page them, how to make sure they don't introduce new regressions. Adding this additional use case for the trace header made us go all in on propagation, to the point that we trust it more. It's still not perfect; it still has issues. Thank you.

The question is: does anyone want one box, our big hunking EC2 instance, back? Separately, alongside getting rid of the one box, this system was our solution for end-to-end testing, making sure something works across multiple services, integration testing. But the one box also handled the local development use case: just my own service, let me run unit tests, let me iterate on it really quickly. So when we deprecated the one box, we also invested heavily in a really good local development experience. For that we use some other tools; Tilt is one of them, for basically running Docker Compose locally on your own machine. You're limited to really just running your own service; we've cracked down so you can't have a crazy tree of service dependencies on your own laptop. The combination of those two, a really good, fast local setup and a much more comprehensive, trustworthy end-to-end experience in staging, covers the gaps. There are still people who want something like the one box, and there are ways you can sort of get it, but for the most part we've found the org is on board. It definitely took a lot of evangelism internally, though; that was certainly one of the challenges of this new paradigm, which we effectively had to claw away from some people, because they were much more confident in the old way, or maybe their services weren't set up super well in staging like most services are.

The question is: do we have a way to test our infra in staging? By infra, do you mean changing our Envoy filter, or observability, or which parts of infra?
Yeah, so I would say this is first-class support for product teams who are making changes to their services. For changes to infra services, like, for example, that service that responds with the baggage after you give it a pull request, we don't have a great way of testing right now; we do our own bespoke things. But, you know, we're the infra team, so we can figure out how to hack it into our own workflows. It's not great. We've been chatting a lot with the networking team about how we can let them test a new SHA of our Envoy binary on just one pod or something, so we've been looking into little integrations for the infra teams here and there. Those teams are all much less alike, whereas all these product teams are way more similar, mostly stateless services, so it was much easier to uniformly introduce this for them. Our deploys team has a one-off thing that sort of works with this system and requires a bit of kubectl port-forward and a couple of other hacks like that, but should we choose to invest, we have a shot at making this work with some other infra components. Thank you. Thank you.