Thanks everyone for coming. I'm going to let you know that I'm getting over a pretty bad cold, so if my voice does weird things or if I start hacking uncontrollably, that's why. What we're talking about today is how we built and deployed Envoy at Lyft. And this is going to be a talk where I'm not trying to sell you something; we're actually going to talk about the practicalities of building something and deploying something very complicated into a very large existing infrastructure. I think it's a useful case study because we come to conferences like these and we learn about all the really amazing technology that people are building, but very often we don't talk about it in a way that lets people understand that you're coming from some existing technology and you want to adopt this new technology. That's a very hard problem. It's not as easy as saying we're going to run Envoy or run Istio or run Kubernetes. That's just not the way that it works.

So first let's do a brief refresher on Envoy as well as service mesh. There are, I think, 15 talks at this conference on service mesh, so I'm going to do this super briefly; I think everyone here probably has a good idea of what it is. The idea behind Envoy is that the network should be transparent to applications, and when network and application problems do occur, it should be easy to determine what the source of the problem is. And what we find these days when people try to deploy microservice architectures is that the network is not transparent. It's in fact a huge thorn in people's sides, and it is very, very difficult to understand where problems are occurring. Is that problem in your application? Is it in the virtual infrastructure that you're using? Is it in some container runtime? It's almost impossible to know. And what I think a lot of companies find when they try to move from a monolith to a microservice architecture is that they want the agility gains of moving towards that architecture, but when they go to actually accomplish it, it becomes very problematic because people don't trust those services. And in fact at Lyft, when we started to roll out our microservice architecture, we found that developers didn't trust the network. They were seeing long-tail problems and they didn't understand how to debug them. They saw long P99 latency and they would say, we can't do this networking, we can't do this RPC, so we're just going to write those features back into that monolith. So we see that problem where we want to move towards this architecture, but we don't have the tools to allow people to write business logic and not focus on all of the faults that are actually happening.

So from a service mesh perspective, this is a very simple diagram of what the service mesh does. The idea here is that we have a couple of different services, and we have a sidecar proxy that is co-located next to every service. The service sends all of its requests or traffic to its local proxy. That proxy does all of the things that we all know and love around service discovery, load balancing, rate limiting, observability. It's going to pick a target and send that request to the other side, the other side will deliver it to the service, and then it all comes back.
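To make that flow concrete, here is roughly what a call through a local sidecar can look like from the application's point of view. This is a hypothetical minimal sketch, not Lyft's actual client; the egress port and the use of the Host header to name the destination service are assumptions for illustration.

```python
import requests  # third-party HTTP library; assumed to be available

# Hypothetical sketch: the application only ever talks to its local sidecar.
# The destination service is named in the Host header; the sidecar resolves
# it, load balances, applies timeouts and retries, and emits stats and traces.
EGRESS_PORT = 9001  # assumed local egress listener port, not Lyft's actual value

def call_service(service_name, path, timeout_s=0.25):
    """Call another service via the local sidecar proxy on localhost."""
    return requests.get(
        f"http://127.0.0.1:{EGRESS_PORT}{path}",
        headers={"Host": service_name},
        timeout=timeout_s,
    )

# e.g. resp = call_service("locations", "/v1/drivers/nearby")
```

The point is just that the calling code never knows where "locations" actually lives; that's the proxy's job.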
And the idea here is that each individual service only knows about its local proxy. It doesn't have to understand anything about the underlying network topology. That also means your service can be written in any language. It doesn't matter what language it is, Haskell, Go, C++, Java, it doesn't matter. That service can talk to its proxy on localhost, some magic can happen, timeouts, whatever, and then the response comes back. And it really does make that network transparent in a way where people can focus on business logic and they don't have to worry about what the network topology actually looks like.

So from an Envoy perspective, and I'm going to go through this super fast: Envoy is a proxy, a self-contained proxy, it is not a library, so it's an out-of-process architecture. When I showed you that last slide, Envoy sits under the service and it basically receives all of those calls, the calls go out, and then they come back. Envoy is written in C++11, so it's very fast and it's also very productive to work in. Fundamentally, Envoy is an L3/L4 filter architecture. What that means is that bytes come in and bytes go out, so we can support multiple protocols. We can do REST, we can do Redis, we can do MongoDB, we can do MySQL. And because obviously we do lots and lots of H1 and H2 and REST and all of these things, we also do a lot of filtering up at L7, so we can operate on message boundaries: headers, body, as well as trailers. Envoy is an H2 proxy first, so we can proxy from H1 to H2 and H2 back to H1. We spend a lot of time doing service discovery as well as active and passive health checking. Active health checking is the idea that I'm going to periodically send a ping or a health check request to some other service. Passive health checking is outlier detection. That's the idea that I can monitor traffic that's happening within my data plane and, sorry folks, see if there are consecutive 500s or if a particular host has a success rate that's outside some variance. From an advanced load balancing perspective, we can do things like timeouts and retries and circuit breaking and shadowing and all types of fun stuff. And we'll talk about it a lot, but when I give talks about service mesh and Envoy, the most important thing is really observability. It's all about observability. It's about understanding what's going on in your system and, if there are problems in your system, being able to pinpoint where that problem actually happened. And that comes down to stats, logging, as well as tracing.

The other thing that I would mention is that people talk a lot at conferences like these about service mesh, but all internet services typically have some type of edge proxy, right? And what you see today mostly is that people deploy different proxies for their internal traffic and for their edge. So people might use NGINX for their edge and they might use HAProxy or something else for their internal proxies. And from an operational perspective, 99% or more of what these proxies do is the same, right? They all do service discovery, they all do timeouts, they all do rate limiting, they all do circuit breaking.
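Going back to the passive health checking idea for a moment: here is a toy illustration of what consecutive-5xx ejection looks like conceptually. This is not Envoy's code; the real outlier detection lives inside the proxy in C++ and is driven by configuration (consecutive errors, success-rate deviation, ejection time, and so on), but the shape of the idea is this simple.

```python
import time
from collections import defaultdict

# Toy illustration of consecutive-5xx outlier ejection; NOT Envoy's actual code.
CONSECUTIVE_5XX_LIMIT = 5   # eject a host after this many 5xx in a row
EJECTION_TIME_S = 30.0      # how long an ejected host stays out of rotation

consecutive_5xx = defaultdict(int)
ejected_until = {}

def record_response(host, status_code):
    """Update per-host state after every proxied response."""
    if 500 <= status_code < 600:
        consecutive_5xx[host] += 1
        if consecutive_5xx[host] >= CONSECUTIVE_5XX_LIMIT:
            ejected_until[host] = time.time() + EJECTION_TIME_S
    else:
        consecutive_5xx[host] = 0  # any success resets the streak

def is_healthy(host):
    """The load balancer skips hosts while they are ejected."""
    return time.time() >= ejected_until.get(host, 0.0)
```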
And from an ops perspective, running the same software on the edge doing TLS termination and running the same proxy on your internal service-to-service network has huge operational gains. It's kind of a core tenet of systems engineering: the less software you run, the less software you have to understand. It's vastly simpler.

So this is Lyft about five years ago, and this is almost a standard LAMP stack, right? This is the way that every startup starts. We have clients, we go to the internet, we have the AWS ELB. Obviously now AWS has their application load balancer, but most cloud providers still have a fairly basic load balancing system. Lyft was originally built on a PHP/Apache monolith and our single data store was MongoDB. And this is supposedly simple, there are no services. But what I'd like to point out really briefly here is that even in this basic architecture, there still is networking, right? You still have the load balancer, you still have the database, you still have your monolithic service. And even with this basic architecture, trying to understand what's actually going on here, did the client have a problem on the internet? Was the Amazon load balancer crapping out? Did PHP have a bug? Did MongoDB do something bad? It's actually really hard to know. So even in this very simple state at Lyft, we were already having a lot of operational problems trying to understand how this thing was working. And we're going to go into why we built Envoy and how we deployed it. But I actually want to show this because it's instructive to see that you don't need hundreds of services to make something like Envoy useful; it can be useful with only very few services.

So fast forward to today; this is Lyft's architecture today, roughly. We still have clients, we still have the internet, we have a set of TCP load balancers, so very basic edge load balancers. Behind those, we come to our front edge Envoys. So we run a fleet of edge ingress proxies, which do L7 parsing and TLS termination and things like that. We still have our monolith five years later, and probably for many more years. We are decomposing that monolith, and we have lots and lots of services written in Go and written in Python. And we run Envoy with 100% coverage at Lyft, so Envoy runs co-located next to every one of Lyft's hundreds of services. And then on the back end, we have MongoDB still, we have DynamoDB, we're actually starting to use Spanner from Google. And we obviously do a whole lot of stats and tracing and a whole bunch of other stuff. Again, the important takeaway here is that we run Envoy on each and every node, so for every hop that we have from ingress to egress, we have a point-to-point link with stats and tracing and logging that we can look at to actually understand what is going on.

Okay, so that gives you the overview of what happened. Like I was saying in the intro, you don't go from what we had before to what we have now overnight. You can't just automatically roll out Envoy on hundreds of services. It doesn't work that way. So what I'm going to do is actually talk you through how and why we developed Envoy, and we're going to talk about the incremental steps that we took to roll it out and the things that our developers at Lyft found most useful.
Because as you're starting to think about rolling out some of these proxies, a lot of it is actually selling it to your internal teams. And a lot of times they want to know: what are we going to get out of this? Is this actually useful? And the teaser here is that it wasn't easy. It was a lot of work.

Okay, so what we did is we started with the edge proxy, and I talked about this briefly: almost every internet application has some type of edge proxy. You have clients on phones, you have to get back to your back-end services, and from a fundamental perspective, if your app is not healthy at the edge, it doesn't really matter what you do internally. So having that observability at the edge is super important. And we could have a whole long conversation about how it's important to extend that observability all the way out to the client, but that's not this talk. From the Envoy perspective, it's critical that we get good stats out of the edge. And even to this day, if you look at the state-of-the-art cloud solutions, state-of-the-art being Google's load balancers and Amazon's Elastic Load Balancer and their ALB product, they're terrible. I mean, they're still really terrible. They're black boxes where you don't actually understand what's going on. The observability coming out of them is still fairly bad. I think Amazon's ALB product only got percentile latency metrics in the last six or nine months or something like that. It's quite bad. And, of course, you're at the whim of what they're doing from a modern TLS perspective; they just tend to move a lot more slowly. So from the perspective of dashboards and tracing and logging, it was very important for us at Lyft to prove out this new proxy: actually put it on the edge, replace the ELB, and show that we could get very good stats.

And what we found almost immediately... here's an example of one of the dashboards that we actually provide people at Lyft. What you're seeing here, on a per-host basis, are our front edge nodes, and you're seeing stats like the number of connections per second and the number of requests per second. But then on the bottom row, you're seeing latency on a per-backend-service basis, or you're seeing the 500s on a per-backend-service basis. And this is a tiny portion of the metrics that we actually show people, but the idea here is that you go from a situation in which you're trying to go into S3 and pull down logs to understand what's going on, and the dashboards in CloudWatch are super bad, to a situation where we have excellent visibility and we can much more quickly pinpoint which services are problematic or what is causing problems for our customers. So again, it's all about observability, observability.

So now we're on our web scale slide, and if this was a 45-minute talk, we would stop here and we would watch the MongoDB "web scale" video, but it's not 45 minutes. So I would encourage all of you, if you have not seen the video, the little asterisk thing down there, to go watch it right now, because it is the funniest video that you have ever seen. So Lyft at this point is probably one of the biggest MongoDB installations in the world, I'm guessing. We have a lot of MongoDB, and MongoDB has been a thorn in our side for a long time.
And one of the things that you'll find with MongoDB, even in the most recent releases, is that MongoDB has a very poor connection handling model: it uses a thread per connection, which in high-performance networking is a big no-no. And what that means is that MongoDB will DoS itself very easily. If MongoDB does something bad, it'll start refusing connections, then clients will reconnect, and then you'll have a thousand threads, and then the thing will just shut down, and you're basically in this death spiral. So one of the things that we realized we could do pretty quickly with this proxy, since we were already running it on the edge and we have our PHP monolith, is use Envoy on the PHP monolith to collapse connections from PHP to Mongo: basically monitor and rate limit all of the connections that are coming into MongoDB from the client side of things. And hey, while we're here, we can actually parse the BSON and we can pull out amazing stats about latency and tables and query patterns and multi-gets and scatter-gets and those types of things. And we can do this for all services, so at this point we're also doing Mongo from Python. So instead of rewriting all of this rate limiting and all of this great functionality in every Mongo driver, we write it once in Envoy. We point all of the applications at Envoy, Envoy talks to Mongo, and lo and behold, we have no more Mongo outages due to Mongo death spirals. Our current stack still looks exactly like this. We now talk to Mongo from PHP, Python, and Go; all of that traffic runs through Envoy, and via the Envoy filter chain that I talked about before, we can chain different functionality. We can build in stats, we can build in rate limiting, we can build in all of this cool functionality, and we get that for free. The applications don't have to do anything. We just roll it out and it just works.

So once we did that, and this is now going back a couple of years, we have Envoy running at the edge and Envoy running on our monolith, but we're still using internal load balancers from Amazon to actually do service discovery. Obviously, when the edge has to talk to a backend service, we have to find those backend services. So services boot, they come up, they basically register into the load balancer, and then the load balancer picks a backend. And what you find is that, again, these load balancers, these cloud products, they're getting better, but they're still black boxes. And that extra hop, from an observability, kind of an operations perspective: when you have a failure, now you have this giant thing in there where you don't know what the problem is, and you have to figure out, well, it could have been a bad ELB instance, it could have been the EC2 network, it could have been my application. I have no idea. So you're putting this big thing in the traffic path, a single point of failure between any two endpoints, and that just makes understanding what's happening in that network topology so much more complicated. But even with this, even with the ELB in the middle, because we're running Envoy on one side and we're running Envoy on the other side, we still get really amazing stats. We still get all of the egress stats from the edge nodes, and we still get all of the egress stats on the PHP nodes.
We can do things like buffering so that we can deal better with PHP/Apache traffic handling. So we're still getting lots of benefits. But the next stage, because of what I was saying before, is that we would like to get rid of those internal load balancers. Those internal load balancers make debugging harder, they're a single point of failure, and you just do not want them; they don't serve any good purpose. So we would like to do direct connect. And at this point, obviously, if we're going to do direct connect, we need service discovery.

I'll take a little detour here. If you look around the industry, the way that most people at most internet companies have done service discovery historically is with systems like ZooKeeper or Consul or something along those lines. And these are great systems, they're fine. But what you'll find is that these are all fully consistent systems. They all use leader election, they all require care and maintenance, and they all go down a lot. And you'll find that most big companies have teams that are dedicated to keeping ZooKeeper and similar systems running, especially at scale. So when we started with Envoy, we made the explicit decision that service discovery for networking is not a consistent problem; it's an eventually consistent problem. Systems scale up and scale down, things boot and die, right? It doesn't have to be consistent, it just has to converge. So we made the very explicit decision that we would build an eventually consistent system. And that system, which we still use today, is about 200 lines of Python and a cron job. The cron job runs on boot, basically, and then runs every minute. We have a service that writes into DynamoDB, and then we basically have a sweeper on that table, where if a service doesn't check in within X minutes, say 10 minutes, we just delete it. The system is dead simple. We literally have not edited the code in over a year, and we have had zero outages. It's eventually consistent, but because we cache in the service, because we cache in Envoy, and because of this eventually consistent process of the cron jobs writing into the system, we have run tests where we have deleted the entire Dynamo table out from under the service, and everything keeps going. We obviously can't scale up, but everything is fine. We make the Dynamo table again, everything starts checking in again, and we're totally fine. So I would encourage everyone to think very, very, very carefully about whether you really need a fully consistent system when eventual consistency would probably be fine.

So let's talk a little bit about how developers at Lyft actually use Envoy. At this point, we're basically building the mesh. We're not fully rolled out, but we're building this mesh, and we're actually getting a lot of value from it. I actually mentioned this this morning during a panel, but you'll hear a lot of service mesh talks, and people like me will get up here and say that the service mesh is magic. It's totally amazing, and it is really awesome. But there's a little thing that we don't tell you, and that thing is that in order to get full value, especially around tracing, you have to propagate context, and that is required. There's no way around that. You must have application code that actually runs, because you have some header or some piece of data that has to get propagated from your ingress point to your egress point so that you can join those traces up.
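Stepping back to the eventually consistent discovery system described a moment ago: here is a minimal sketch of that check-in-and-sweep pattern, assuming a DynamoDB table keyed on (service, host) with a last-check-in timestamp. The table name and field names are invented for illustration, not Lyft's actual schema.

```python
import time
import boto3  # AWS SDK; assumes credentials and region are configured

# Hypothetical sketch of eventually consistent service discovery:
# every host checks in periodically from a cron job; a sweeper deletes
# anything that has stopped checking in.
TABLE = boto3.resource("dynamodb").Table("service-hosts")  # assumed table name
STALE_AFTER_S = 10 * 60  # drop hosts that have not checked in for ~10 minutes

def register(service, ip, port):
    """Run from a cron job on every host, e.g. once a minute."""
    TABLE.put_item(Item={
        "service": service,
        "host": f"{ip}:{port}",
        "last_checkin": int(time.time()),
    })

def sweep():
    """Run periodically; removes hosts that stopped checking in."""
    cutoff = int(time.time()) - STALE_AFTER_S
    for item in TABLE.scan()["Items"]:  # fine for a sketch; real code would paginate
        if item["last_checkin"] < cutoff:
            TABLE.delete_item(Key={"service": item["service"], "host": item["host"]})
```

Because consumers cache the results, losing the table only pauses convergence; nothing in the data path depends on it being up.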
So what we did at Lyft is we wrote a very thin client in Python, and we also have one now in Go. We call it envoy-client, and it's very simple. What you're seeing here is an example of someone calling a service called Switchboard. The idea here is that they don't need to understand how Envoy works. They want to make a request to a service, so they use a client, they say what the service name is, and they may set some options that you don't see here around timeouts or retries or various other things. This client is about 100 lines of code. It knows what port Envoy is actually running on, it goes through and makes the call, gets the response, and unpacks it. And most importantly, it propagates IDs. So if we know, for example in Python, that we have an x-request-id or an x-ot-span-context header that's come in and it's sitting on the greenlet-local thing or whatever it is in Python, we'll take that state and we'll propagate it to the other side. And that makes it really easy for developers; we do the same thing in Go with the Go context. But the idea here, again, is that from a dev perspective, doing a network call is as easy as: import library, who do I want to call, make a call, get the answer.

Okay. So at this point, we have now rolled out Envoy. We actually took the approach of go big and don't do the tail, and I find that from a rollout perspective that's actually better, because if you can run this thing and show value on your biggest services, there may be a 90% long tail of small services, but if the big-service people are happy, it's a lot easier to get the small-service people happy. And what we found is that people were quite happy. We were suddenly saying, oh, we're not having all these problems because we do more intelligent load balancing, or hey, when a problem does happen, we can actually debug it. Isn't that awesome? So now we entered the slog period. During this period, we actually had a spreadsheet, and the spreadsheet was updated on a daily basis. We had a program that would scrape out all the services and figure out whether they were using Envoy or not. We would build a Google spreadsheet every day, and we would go through and burn that spreadsheet down. And that was mostly carrot: saying, could you please do this, it'll work better, wouldn't these features be great? And we went in and did some of the work ourselves. To be perfectly honest with you, I was expecting this to take much more of a stick approach. I thought we were going to have to get in there and get management to make this happen. To be honest, it was a lot of carrot. People loved what they were getting. They loved the features, they loved the observability, they loved the fact that they just no longer had to worry about how certain portions of the system actually worked. So in hindsight, it was a long slog, but I think what you'll find is that most developers, most people that write business logic that runs on a service mesh, don't want to go back. And in fact it's been really interesting, because Lyft when I joined was about 80 developers. It was quite small. We're over 500 now. And when I joined, there were people that knew the old way and people that knew the new way. But now people come in and they only know the new way.
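Going back to that thin client for a second: here is a stripped-down sketch of what such a wrapper can look like. The header names x-request-id and x-ot-span-context are the ones Envoy actually uses for request IDs and tracing context, but the class name, port, and options below are illustrative rather than Lyft's real envoy-client.

```python
import requests

PROPAGATED_HEADERS = ("x-request-id", "x-ot-span-context")

class EnvoyClient:
    """Illustrative thin client that calls a service via the local Envoy."""

    def __init__(self, service, egress_port=9001, timeout_s=0.25):
        self.service = service
        self.base = f"http://127.0.0.1:{egress_port}"
        self.timeout_s = timeout_s

    def get(self, path, incoming_headers=None):
        headers = {"Host": self.service}
        # Propagate request-ID / tracing context from the inbound request
        # so spans can be joined across hops.
        for name in PROPAGATED_HEADERS:
            if incoming_headers and name in incoming_headers:
                headers[name] = incoming_headers[name]
        return requests.get(self.base + path, headers=headers,
                            timeout=self.timeout_s)

# e.g. switchboard = EnvoyClient("switchboard")
#      resp = switchboard.get("/v2/features", incoming_headers=dict(request.headers))
```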
And I think for them it's kind of interesting, because they just expect it to work, right? They don't have to worry about all the problems that we actually used to have.

So let's briefly talk about how we did config management. And this again, I just want to explain, because we come to these conferences and people talk about all this awesome stuff; in reality, things happen in a much less awesome way. So when we did Envoy, what we did originally is we literally hand-wrote all the configs and we checked them in. We bundled them with the binary and we deployed it out, and that was how we did it. And that allowed us to move very fast, because we could make breaking config changes, but it was very tedious because we were literally handwriting configs for each and every service. So that was obviously not going to scale. So what we did then is what everyone does: they bring in some templating, and we brought in Jinja. And what we developed is a system which we called config gen, certain parts of which we still use to this day. You basically realize that a lot of the Envoy configurations end up being the same. There's a lot of commonality; it's mostly the same with a few variables that end up being different. So we ended up building a whole system that would take a set of templates, a set of services, and a set of variables, use Jinja to iterate over them, and build an entire set of configurations. And it's at this point that we really realized that in any complex Envoy deployment the configuration is going to be machine generated somehow. There is just no way around that. You can do it using templating, you can do it with something like Istio Pilot, but something is going to do it. No one is going to sit around and hand-make configurations for hundreds of different services. It's just not realistic.

So at that point we moved to a hybrid system, and the way the hybrid system works is a combination of build-time template generation, and then we ship the templates, the Jinja files, to each box and run the template-generation code on the host. Every service at Lyft has a manifest, and the manifest says things like: these are the tests that run, and for networking, these are the services we talk to, these are the circuit-breaking settings, and so on. So we take that manifest data on the node, combine it with the Jinja templates, and generate the final Envoy configuration on the node. So now we have a mix of central template generation at deploy time, but the manifest information gets joined on the host during the Salt run, because as you can see from here, we're not yet a Kubernetes-native shop. We're still using VMs and Salt and all types of legacy infrastructure.

So one of the goals of Envoy has always been to be what I call a universal data plane. The idea is that we have a set of APIs, we can build a control plane, the control plane can speak those APIs, and then Envoy will enact those configurations, and it'll do it dynamically. And what that allows you to do, theoretically, is have a single bootstrap configuration and a management server which provides the Envoys with all of their configuration.
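To give a flavor of that kind of generation step, here is a minimal sketch, assuming a Jinja template per config type and a per-service manifest of variables. The file names, template names, and manifest fields are invented for illustration; this is not Lyft's actual config gen layout.

```python
import json
import yaml  # PyYAML, for reading the per-service manifest
from jinja2 import Environment, FileSystemLoader

# Hypothetical "config gen"-style step: render an Envoy config from a Jinja
# template plus per-service manifest variables.
env = Environment(loader=FileSystemLoader("templates"))

def generate(manifest_path, template_name, output_path):
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)  # e.g. upstream services, circuit breaking

    rendered = env.get_template(template_name).render(
        service_name=manifest["name"],
        upstreams=manifest.get("depends_on", []),
        circuit_breakers=manifest.get("circuit_breakers", {}),
    )

    json.loads(rendered)  # sanity check: the rendered config is well-formed JSON
    with open(output_path, "w") as f:
        f.write(rendered)

# e.g. generate("switchboard/manifest.yaml", "envoy.json.j2", "/etc/envoy/envoy.json")
```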
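And to make the management-server idea concrete, here is a toy discovery endpoint roughly in the shape of Envoy's original v1 REST discovery APIs, which Envoy polls for current state. The storage lookup is a stand-in for whatever a real control plane aggregates (check-ins, manifests, and so on), and the port and host data are placeholders.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Toy control-plane endpoint, roughly in the shape of Envoy's original v1
# REST service discovery API (Envoy polls it for the current host list).
# get_current_hosts() is a stand-in for whatever a real management server
# aggregates: DynamoDB check-ins, manifests, legacy cron jobs, and so on.
def get_current_hosts(service_name):
    return [{"ip_address": "10.0.0.12", "port": 8443}]  # placeholder data

@app.route("/v1/registration/<service_name>")
def registration(service_name):
    return jsonify({"hosts": get_current_hosts(service_name)})

if __name__ == "__main__":
    app.run(port=8500)  # port chosen arbitrarily for the example
```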
So you're now seeing this transition period. We've gone from hand-written configurations to machine-generated Jinja, but now we're moving towards very complex management servers, and we see that with Istio Pilot, and there are actually people writing all types of different solutions. In the v1 API we have a couple of different APIs: we have the Service Discovery Service, the Cluster Discovery Service, the Route Discovery Service, and the Listener Discovery Service. The idea here is that ultimately we allow every aspect of Envoy that would reasonably be reconfigured to be configured by a remote management server. You're not required to use these; a goal of Envoy is that you take only the complexity that you need. If you want handwritten configs, use handwritten configs. If you want SDS only, use SDS only. If you want the whole enchilada, use it. At Lyft we currently use SDS, RDS, and CDS, and we'll be using LDS soon. And this is roughly what it looks like today. And what you're going to see in all of the other talks that you'll attend here is that this looks a lot like Istio Pilot, which is not very surprising, because these things have evolved in parallel. The way this works is that at Lyft today we have our legacy 100-line Python service discovery thing, and we have a new service called Envoy Manager, which is like Pilot. We are now taking in all this information from our legacy infrastructure, from our cron jobs, from our manifests, merging it together, and sending that configuration to the Envoys. So over time the config that we ship to every host is getting smaller and smaller, until it becomes a unified bootstrap config that runs on every host, never changes, and everything else comes from those management servers. So that's the progression. But the idea to understand is that you can take any slice of this complexity, from end to end, and build very cool systems on top of it. And we're actually now beginning our Kubernetes migration, so this will get a lot more complicated as we have our legacy infrastructure and then bring in Kubernetes and juggle all of it. So that will be very exciting.

So thank you very much, and thank you for dealing with my voice; it's very shaky. You can talk to me on Twitter. I think I'm almost out of time; I might have time for one or two questions. My plug: Lyft is hiring, we have lots of good jobs, so if you want to talk to me, please talk to me. And we love building the overall Envoy community, so please join us at envoyproxy.io. So thank you, and I'm happy to take questions.

Could you speak up? Yeah. So the question is, do we do functions or Lambda? Currently no. I think that Lambda, or just generally functions and Envoy, is going to be a very interesting thing over the next, probably, two to three years, mainly because functions and Lambda require networking, and networking is actually way more complicated in that environment because things come up and go down very quickly. So the answer is no, not right now, but I think that's going to be a big area of investment.

So the question is, at Lyft, what does the observability stack look like? If you look at the Monitorama talks, there are a couple of talks, two talks from Lyft, that go into that entire thing. But I'll briefly say that we use a company called Wavefront for stats, we use a company called LightStep for tracing, and we run our own logging system, which is Elasticsearch and Kibana. One more question? Yeah.
So the question is about Envoy's system called Runtime, which is like a decider or feature-flag system; it's pretty common that companies have these things. We just have a Git repo, and basically every time it gets checked into, it gets sprayed out to all hosts, and Envoy does a file system watch. One more question? Yeah.

The question is whether we run Envoy at the service layer or next to Mongo itself. So we do run Envoy at the service layer; we don't run it next to Mongo, we run it on the service side. Yeah, so yes. Yep. Exactly. All right, last one. Yeah.

So the question is, will Lyft move more towards Istio? The answer is absolutely yes. We obviously don't want to keep maintaining our own stuff any longer than we need to now that there's Istio. And that's kind of what I was talking about before: it's one thing if you're doing this from scratch, but we have a giant system to migrate, hundreds of services, millions of requests per second. It's not something that we can just move over. So the goal for us is going to be to move closer to Pilot. I don't know at what point we would use Mixer. The components are actually separate, so one can use Pilot but not Mixer, one can use Kubernetes Ingress but not something else. So yes, the goal is to move towards it, absolutely. Sorry, I have to go, but I will be standing outside. So thank you.