highly informative so far. I work at Lyft, and now I know even more about Envoy, scary amounts at this point. This was originally a joint talk with Chris Roche, another Lyft core libraries engineer, so I get to play the part of two Chrises. Listening to all the talks today, I thought this could be a bit of a choose-your-own-adventure: you can direct it with questions if I hit on something you're interested in. But this is more of an inside-out view from someone who doesn't use Istio. We don't even have Kubernetes; we have our own orchestration system. But we've been able to do some really nice things with Envoy, things I think were noticed by the people who have since adopted it into this project.

So who am I? I lead core libraries, and I was a Twitter person once upon a time. So what is core libraries? We're responsible for all of the HTTP and gRPC frameworks. That means if you want to build a gRPC service, there's a thin layer over the standard gRPC library that gives you all the fancy stuff you need for a production app: logging, stats, tracing, all of that. The gRPC library is perfectly fine, but you need a little more around it when you're building production apps. We also do all of the tooling for Go. If we need to make something easier, like Flynn was saying, to make Jane's life easier, that falls on us. We go and observe patterns around the organization and see things that are tripping people up: this doesn't make sense in gRPC, this would be easier, I need a generator. We'll build that, and we do that for all of these things. We were also responsible for rolling out gRPC. It was our call to say that Go was ready for the organization at large, and that gRPC was ready: that we were confident you could build an app with our frameworks, ship it, and it would be production ready and able to absorb all of Lyft's traffic.

So what are we going to talk about? We'll get to Envoy, but this talk is really about how we brought an RPC layer into an ecosystem where we already had Python Flask services and all of these other things. I think it's important to cover that decision when talking about making these kinds of organizational changes. We work with a lot of legacy systems, and I'm not of the opinion of just cutting people off. I think there are graceful ways to do things that cause the least amount of pain. Disrupting workflows is going to happen, but how can we do it nicely, so that people don't hate your team and everything you ship? Then I'm going to show you what we've built and how all of this relates to Envoy.

Whenever I talk about RPC I like to use this quote, and unfortunately it's entirely too long for me to read out to you. But essentially what Jim Waldo is saying is that every ten years a lot of people get together and decide that, with the technology we have now, it's the appropriate time to rewrite the same thing we did ten years ago, only this time with the new shiny. And, trigger warning before I do this, I hate to do this to everyone, but it's funny: I've been through a few of these and some of them still exist. They're not all one-to-one comparisons, but you can see that wire-format RPC has existed for a while here.
Everybody should be familiar with some of these, and it's a funny arc that it follows: it goes from the rigidity of something like CORBA, to loose JSON APIs, and now people are coming back around to RPC. I'm not here to wax philosophical too much about that, but it is an interesting arc to look at. Christopher Meiklejohn has a really, really good blog post on this; I can share the link if you're interested.

So, Lyft, a little history. Like any good story, it starts with a giant PHP monolith. There are active decomposition efforts going on and not a lot of new code goes in, but if you have a monolith you still have to take care of it; you have to be thoughtful about it. No one really wants to keep working on it, but that's the thing that built your company. More often than not, in a lot of companies, you go to war with the system that you have. You try to make intelligent, informed decisions about ways to carve this thing out and move forward, but while you're moving forward you don't want to be gated by this monolith that no one wants to work on. That wasn't challenging enough, so we created hundreds of Python microservices, all Flask HTTP REST. And to keep it interesting, we wrote all of our core services and some compositional services in Go and gRPC.

So what is a core service? They're your organizational primitives. For us that would be users, rides, drivers, things of that nature; put your own widget in there. They have zero service dependencies. That doesn't mean they don't have upstream dependencies like databases and caches; it just means the interface says this is none of your business. This is a primitive. Whatever magic we use to construct it, that is a user or a ride or whatever, and they do not talk to each other. They also obviously have to be the most performant of our services, because they sit in the golden path.

So why would we pick gRPC? Before we talk about gRPC, we have to talk about REST. When you look at the landscape of a lot of REST services, there's nothing inherently wrong with speaking JSON over HTTP. I've got a small JSON payload, you understand that payload, fine, we're all having a great day. Until we're not. Most people talk about RESTful services, but in actuality a better term is that they are more often REST-ish. So here's one that I'm going to leave up here, and I know someone's already kind of tingling angry at this, but this is a perfectly decent, to some degree, REST API. But let's paint this bike shed. It should have been a PUT. Let's make it a resource; what are you calling it? The "users" needs to be there so that it identifies the resource. OK, that one looks fine, we've corrected the HTTP verb. But no, we're not done yet: we need to put the version number in there. We moved the ID out of the JSON payload and now it's in the URL, and Fielding should be really happy. But that's a really, really boring argument to have, especially when you have 20 teams, 100 teams, and they're all coming up with this. Now, from that grab bag of trigger words I put up earlier, there are ways to specify this, but that also comes with code generation and linting, and it comes with a lot of arguments about which one's right. And I actually don't care which one's right; as long as it's uniform and it works, I'm fine. And that's why IDLs are pretty great.
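To make that bikeshed concrete (the actual endpoint from the slide isn't in the transcript, so the routes below are a hypothetical reconstruction of the progression being described), here is the same handler getting re-registered as the argument plays out, sketched with Go 1.22's method-aware ServeMux:

```go
package main

import (
	"log"
	"net/http"
)

// Hypothetical reconstruction of the REST-ish bikeshedding described above;
// the real endpoint from the slide is not in the transcript.
func updateUser(w http.ResponseWriter, r *http.Request) {
	// The handler never changes; only the URL shape does.
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	mux := http.NewServeMux()

	// Round 1: "perfectly decent, to some degree": wrong verb, ID buried in the JSON body.
	mux.HandleFunc("POST /user_update", updateUser)

	// Round 2: fix the verb, make it a resource.
	mux.HandleFunc("PUT /users/{id}", updateUser)

	// Round 3: add the version, move the ID out of the payload and into the URL.
	mux.HandleFunc("PUT /v1/users/{id}", updateUser)

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

The handler logic never changes; the whole argument is about the shape of the URL and where the ID lives, which is exactly the kind of debate an IDL makes moot.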
And I liked that Harvey said a lot of the JSON configuration you see in Envoy is moving to proto. I'm not sure if that's where it all came from, but this is where we are at Lyft: moving everything to proto where we can. And I'm going to show you how all of that weaves together with Envoy as well.

So why are IDLs nice? We get a single source of truth. We define those primitives I just talked about pretty cleanly and easily; it's a flexible enough language to express most things, and I'll show that in a second. Code generation is really nice: you can generate APIs, clients, servers, you can do your docs and observability stuff. Everything is just generated for you. If you want to generate a server with a certain decided-upon REST flavor, or REST-ish flavor, you can do that. You can do it for the entire company and it would just work; services can speak HTTP back and forth, and you don't even have to use gRPC or protobufs, you can just have it be uniform. And the pluggability of the proto3 IDL has been pretty great; we've been able to do some really nice stuff with it.

So, a simple service definition. All of the URL fanciness we had in our REST call is now defined in the package, and we have our v1 right there. It defines the update, we know exactly what we're sending, and there's cleanly defined IO. So that's nice. What about existing services? We have hundreds of them, but now we're trying to bind them up into the IDLs and have clearly defined service boundaries. What you can do, nicely provided by Google, are these annotations where you can cleanly bind the HTTP routes, and that binds up to our legacy services. I think I may have left one thing out for brevity, but you just tell it which object to bind to, and as long as those methods are implemented on that Python object, all of this will just wire itself together. Well, not all on its own; we did things too. But also, no one is sending POST params or JSON in this world anymore. This is the one step we have asked our service engineers to do: move the types onto the wire, even for HTTP/1. It simplifies all the API IO: structs in, structs out. We get safety in places where we didn't have safety before. There is mypy and some typing stuff for Python, but most of our code base is still Python 2, so getting this in for Python and PHP was huge.

The transfer cost depends on the kind of JSON, but in most cases protos are going to be the bigger win, and that improves your latency. Here's a graph. The best one I could find is not that great, but the blue is the good one, and that proves my point. Moving on; I couldn't get a better graph, it happened a long time ago. Types on the wire: the errors they eliminate are my favorite kind to get rid of. I just found this one in some logs, and it's the least interesting thing for me to get paged about. Everyone knows exactly what these are. This is the Python version, but the same thing would happen in PHP, and you can even get them in Go if you're not careful, as a nil pointer. But generally you'll see them in your dynamic flavor of languages. So, back to why gRPC.
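Before getting back to gRPC itself, here is a rough illustration of what that code generation buys you on the Go side. None of this is Lyft's actual code; the package path, service, and field names are invented for the sketch. A proto with an UpdateUser RPC comes out of protoc as a typed server interface plus request and response structs, so every team implements the same clearly defined IO:

```go
package users

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	// Hypothetical package generated from a users.proto; not Lyft's real code.
	userspb "example.com/gen/users/v1"
)

// Server implements the generated UsersServer interface. The interface, the
// request and response structs, and their field types all come straight from the IDL.
type Server struct {
	userspb.UnimplementedUsersServer
}

func (s *Server) UpdateUser(ctx context.Context, req *userspb.UpdateUserRequest) (*userspb.UpdateUserResponse, error) {
	// req.Id and req.Locale are real typed fields, not keys in a dict, so the
	// "'NoneType' object has no attribute" class of error has nowhere to live.
	if req.GetId() == "" {
		return nil, status.Error(codes.InvalidArgument, "id is required")
	}
	// ... persist the change ...
	return &userspb.UpdateUserResponse{Id: req.GetId()}, nil
}
```

The same definition can also carry the google.api.http annotations mentioned above, so legacy HTTP/1 callers can stay bound to the very same methods.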
So if we have this cycle of everything getting reinvented every ten years, and we know gRPC will live out its lifetime and then we'll pick the new shiny later on, one good thing about gRPC is the transport. I'm pretty sure HTTP/2 will be around for a little while, so we can at least trust that. The layer on top of it is all the framing and the special stuff gRPC can do for you, but essentially it's sitting on top of a robust standard that I'm pretty sure we'll be seeing for a while. If you're not familiar with HTTP/2, the nice things are full-duplex streaming, binary transport, server push (which is really fancy and really fun for mobile clients; we've been able to do some tricks there), and header compression, which reduces the size of large requests or responses where you're using tons of headers.

So what's not been so great about this? Because it sounds pretty fun: we've got types in languages where we didn't have types before, and people don't have to argue or bikeshed over APIs. But bringing this stuff into an organization can be highly traumatic. I can't just rip things away from people, rip away the rituals of how they develop. If they're doing a reload cycle in a browser or something like that, you can't just tell them that doesn't exist anymore. This is the first thing you will hear, over and over and over again: what about curl? And it's a totally valid question. I think people here are pretty sophisticated engineers, you know, debuggers, strace, tcpdump, but you can pry a print line from my cold dead fingers. This is a real thing. You've now removed a confidence marker for someone, part of how their day-to-day works, and that's not OK.

So how do we make this better? Incremental adoption. We don't have to yank everything out at one time; we can allow teams to opt into the new shiny things and try to come up with really good carrots. Make things familiar. For that curl problem, why don't we introduce some kind of generated CLI, which is what we ended up doing. People who are used to constructing JSON with a simple CLI (they like typing JSON on the command line for some reason) still get that: you send it right to your service, it speaks right back to you and gives you a JSON representation (I've sketched a version of this below). That little dopamine hit is triggered and people are fine. You just calmly introduce these things. Make the tooling very welcoming, and standardize all your framework patterns: if there's something from Flask that people really like, do it in Go, or whatever your language is. And always work with a roll-forward mindset. You can entice people over with protobufs, you get types, and you don't have to do anything else; just switch to that, and then slowly move people into the whole new pipeline of tools we're delivering.

So how can we make protocol and infrastructure changes that are flexible like this? How can you have upstream services changing independently of the downstream Python? Further complication: gRPC Python will not play well with gevent. Hard stop, incompatible event loops. There's a ticket, and I think people are working on it now, but I haven't heard anything. That's a big problem for us because we have gevent everywhere. So how do we get gRPC when we can't speak the protocol? What do we do?
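Here is the generated-CLI sketch mentioned a moment ago. The real Lyft tool isn't shown in the talk, so this is just a plausible shape in Go with hypothetical names: JSON in on the command line, a typed gRPC call over the wire, JSON back out, so the curl-and-JSON ritual survives.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/encoding/protojson"

	// Hypothetical generated package for a users.proto; not Lyft's real code.
	userspb "example.com/gen/users/v1"
)

// Usage: users-cli '{"id": "user-123", "locale": "en-US"}'
func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: users-cli '<json request>'")
	}

	// JSON in: the person typing still gets to think in JSON...
	req := &userspb.UpdateUserRequest{}
	if err := protojson.Unmarshal([]byte(os.Args[1]), req); err != nil {
		log.Fatalf("bad request JSON: %v", err)
	}

	conn, err := grpc.Dial("localhost:8081",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// ...but what goes over the wire is the typed gRPC call.
	resp, err := userspb.NewUsersClient(conn).UpdateUser(ctx, req)
	if err != nil {
		log.Fatal(err)
	}

	// JSON out: print the response the way curl would have shown it.
	out, _ := protojson.Marshal(resp)
	fmt.Println(string(out))
}
```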
And so that's how I tie all of this together in a minute: Envoy lets us do this. Everybody knows Envoy now, so this should make sense. This is Envoy in a world that doesn't have Kubernetes or anything of that nature. We have our own home-rolled orchestration and it works for us; it's been around forever, and Envoy has kind of grown up in here. So now you get to see the environment in which Envoy was born and what it fixed. A lot of these things weren't even set up so cleanly. You had everything talking directly to the database. This is a growing company where you're replacing things in the golden path, and you have to be very careful when you're doing that, because otherwise you will lose a lot of money.

Everything goes through a front proxy. We have Go services, everybody gets their sidecar, and the service mesh takes care of all of that routing. And, discovery was covered today, but we also front DynamoDB and Mongo, and I think recently Redis as well. That's really interesting, especially for Mongo, where we wanted a lot of insight into what was actually happening and we could not get that out of the database itself. Sitting in front of it gave us that. There's a quote from Matt that I would just butcher, but it's about removing the network from the equation. You never have an engineer say "it's a networking problem" if Envoy is sitting in the middle, because Envoy will tell you for sure, from the stats, whether it's a networking problem or the process is dead or something like that.

So what's a sidecar? I just really wanted to use this picture. That's a sidecar; the dog is your application. I don't know where the metaphor goes from there, but it's a sidecar. So, where I was talking earlier about the incompatible event loops, how do we bring this stuff along? With the gRPC HTTP/1.1 bridge filter, which didn't fit in the little bar I wanted it in. The Python client can just POST HTTP/1 all day and speak through the bridge, which does its magic, buffers everything over into a proper HTTP/2 gRPC request, and then returns it back, and the Python client is none the wiser that any of this happened (there's a rough client-side sketch of this below). The trailers get buffered in, and there's a little bit of the payload that goes into the headers, but these are really simple, super dumb HTTP clients. And because Envoy can do all kinds of really neat things, retry policies and the like, we don't even have to roll retries into the client. I don't want to go write a whole bunch of that logic; Envoy is more than capable of doing all of it for me. One note is that right at this moment Envoy doesn't support retrying on gRPC status codes; they're packed into a header in a different way, not just part of the status in a regular HTTP/1 response, but that is coming. There's awesome work being done on that right now.

So what if we can't even speak protobuf? An interesting addition that landed not too long ago, and I don't know if it's included in the Istio builds yet, but I did see that it's out there, is the gRPC-JSON transcoder filter. So now you don't even have to think about upgrading much of anything at all in the Python world or the PHP world. You can just speak JSON of a certain shape and it'll do the exact same thing as the gRPC bridge. This stuff is really fun. We can now evolve independently of legacy architecture.
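A rough sketch of what the bridge makes possible, written in Go for illustration even though the clients doing this at Lyft are Python: a dumb HTTP/1.1 client frames the serialized protobuf the way gRPC expects (a compression-flag byte plus a big-endian length prefix) and POSTs it to the local sidecar with content-type application/grpc, and Envoy does the HTTP/2 and trailer gymnastics upstream. The sidecar address, proto package, and field names here are assumptions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"net/http"

	"google.golang.org/protobuf/proto"

	// Hypothetical generated package; not Lyft's real code.
	userspb "example.com/gen/users/v1"
)

func main() {
	// Serialize the request message.
	msg, err := proto.Marshal(&userspb.UpdateUserRequest{Id: "user-123", Locale: "en-US"})
	if err != nil {
		log.Fatal(err)
	}

	// gRPC framing: 1-byte compressed flag (0) + 4-byte big-endian length + payload.
	var body bytes.Buffer
	body.WriteByte(0)
	binary.Write(&body, binary.BigEndian, uint32(len(msg)))
	body.Write(msg)

	// Plain HTTP/1.1 POST to the local Envoy sidecar; the bridge filter
	// upgrades this to a proper HTTP/2 gRPC request upstream.
	req, _ := http.NewRequest(http.MethodPost,
		"http://localhost:8080/users.v1.Users/UpdateUser", &body)
	req.Header.Set("Content-Type", "application/grpc")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Envoy buffers the upstream response and surfaces the gRPC status
	// (normally a trailer) where a dumb HTTP/1 client can read it.
	fmt.Println("grpc-status:", resp.Header.Get("grpc-status"))
	framed, _ := io.ReadAll(resp.Body)
	if len(framed) > 5 {
		var out userspb.UpdateUserResponse
		if err := proto.Unmarshal(framed[5:], &out); err == nil {
			fmt.Printf("response: %+v\n", &out)
		}
	}
}
```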
Teams can keep going down the Go and gRPC route, or whatever world we envision. In this case it would still have to be gRPC, but say we want to rewrite everything in Java: all of this gets to be done transparently. You can do really interesting things with all of the Envoy features, like traffic splitting and so on. This is a world where your infrastructure is highly, highly flexible.

This is normally where the other Chris comes in and does his part, so now I'm somebody else. How can we take all of this stuff we're doing with the IDL further? We've defined our APIs, we've defined our protobuf structs, but that's just the default. What else can we do with it? What if the IDL started defining things in the data store? We have a database driver that needs to go to any of these; these are the four we have right now. BoltDB was a hack-week project that somebody wanted to do, but we have it now, we have an adapter for it, and I don't even really know what BoltDB is. On top of this we build decorator-based middleware, just your chained middleware pattern, and an expression engine that marshals to an abstraction you can then take and turn into a DynamoDB expression or Spanner SQL and so on, and we wrap all of that into a driver-agnostic client. Then we have the IDL-based models that define lifecycle hooks, think AfterPut, AfterSave, BeforeSave, things like that, and wrap everything in a generated, type-safe repository, because we're in Go. So now we have definition in the IDL all the way down through here: type-safe database access for any number of drivers that we want to write an adapter for.

A model looks like this. You just add your own annotation, and you can look at it and tell exactly what it's doing: we name the primary key, we tell it this one's going to be in Mongo. The interesting bit is that this is all defined and expressible within the proto3 syntax. We do use proto2 in some cases, for things like ternary states where nil is considered meaningful, so you can have nil, true, or false, and people do that when you have Mongo as a database. Here's the model that comes out, the generated one. There are a lot of other helper things on it, but we can just call ToModel on the PB and ToProto on a model, so if we want to transmit the model across the wire at any point, we just call ToProto and send it across through gRPC.

Here are the repos, just to get your head around it. I don't know how much time I have, but hopefully I'm roughly on time. It's all your common builder pattern, and for each type you define, you get one of these generated for you. We've found maybe only a couple of cases where we've had to special-case any of the builders for an operation that people wanted to do. We've kind of limited and gated people who come to our tooling: there are some things in Mongo we're not going to let you do. Just because you were allowed to do it before doesn't mean we're going to let you keep doing it. We have made exceptions, but there are some things people do where I feel like, if we make nice tooling, it can be a gating function; like, maybe that migration isn't so bad. So that leads us to the platonic ideal of what we were shooting for.
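The generated repository itself isn't public, so the following is only a guess at its shape based on the description above; the type names, builder methods, and hooks are stand-ins rather than Lyft's real API. It shows the idea, though: one driver-agnostic repository per IDL-defined model, a generated query builder, and lifecycle hooks that fire no matter which datastore sits behind it.

```go
package odm

import (
	"context"
	"time"
)

// User stands in for a model generated from the annotated proto ("we name the
// primary key, we tell it this one's going to be in Mongo"); not Lyft's real code.
type User struct {
	ID        string
	Locale    string
	UpdatedAt time.Time
}

// BeforeSave is a lifecycle hook defined off the IDL; it runs for every driver.
func (u *User) BeforeSave(ctx context.Context) error {
	u.UpdatedAt = time.Now()
	return nil
}

// UserRepository is the driver-agnostic client: the same calls get marshaled into
// a Mongo query, a DynamoDB expression, or Spanner SQL depending on the adapter.
type UserRepository interface {
	Get(ctx context.Context, id string) (*User, error)
	Put(ctx context.Context, u *User) error // fires BeforeSave/AfterSave hooks
	Query() *UserQuery                      // the common generated builder
}

// UserQuery accumulates driver-agnostic expressions.
type UserQuery struct{ /* ... */ }

func (q *UserQuery) WhereLocale(locale string) *UserQuery     { return q }
func (q *UserQuery) Limit(n int) *UserQuery                   { return q }
func (q *UserQuery) All(ctx context.Context) ([]*User, error) { return nil, nil }

// RelocateUsers shows what a service owner writes: no Mongo or Dynamo specifics leak in.
func RelocateUsers(ctx context.Context, repo UserRepository) error {
	users, err := repo.Query().WhereLocale("en-US").Limit(10).All(ctx)
	if err != nil {
		return err
	}
	for _, u := range users {
		u.Locale = "en-GB"
		if err := repo.Put(ctx, u); err != nil {
			return err
		}
	}
	return nil
}
```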
That platonic ideal comes with a lot of assumptions you make when you really think about databases a lot and how you would like your data to be sanitized, but the real world is not at all like that. No matter how much you like the code you wrote, this reality will hit you right in the face: true, false, and nil. This is something we had to deal with; this is a real thing people have done. You can ask me later and I'll explain it. That one still gives the other Chris nightmares, and there's a lot more like it.

So one of the themes I came up with for this talk is that you have to find the right injection point for the software you want to write. Envoy is a really good example of that. Infrastructure is inflexible, and we don't understand the network between a service and a database call. If you put a lot of thought into where you insert the software you write, you can find these nice little places where things work out really well, even in bad conditions, like when you're dealing with Doctrine models in the legacy PHP monolith.

So how do you change fields on a moving train? What we did was move the source of truth. I don't really know Doctrine, but this stuff kind of makes sense: basically we ripped the guts out of how it defines its fields, used a generator, generated a schema for it, and inserted the types into the Doctrine PHP model. This pattern might look familiar; we just use it everywhere. Unless you were messing with code inside Doctrine the library, which no one is except for us, your code paths don't change. All of the save and update operations are now expressible as protobuf to go over gRPC, and so we were able to stop the monolith from writing directly to the database just by doing this, without anyone being any the wiser. Now, this is something you obviously roll out carefully: you use a switch, you do a 1% test, and you've got to make sure the data integrity is there. It's much more complicated than the slide can communicate. But these kinds of ideas, being flexible around pretty much inflexible legacy software, are really powerful, and a lot of people end up in these situations. We're not always building our service-oriented-architecture dreams from the beginning; we've got to bring everybody along, and Envoy is a big key part of that.

So we do our best to reduce trauma with transitional tooling. Here are the Python clients: they look exactly like the official gRPC client. One day, when gRPC works inside gevent or we're able to gut it, our Envoy client can be swapped out; Python doesn't care, you just tell it that's what it is and it'll be fine. More importantly, our users won't know the difference. One day they're using a gRPC-shaped client, the next they're using the Envoy client, the next they're using real gRPC. This isn't something that should have to affect someone's day. If we can do huge changes like this, then, I mean, you should tell the team it's happening, but you can tell them it's not going to cost you anything; we just need to monitor it as it goes out. People really like it when you do something really huge for them and it's dialed in and tested. Maybe it doesn't work the first time.
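The Python clients pull this swap off with duck typing. Here is the same drop-in idea sketched in Go with made-up names: both the Envoy bridge client and the real gRPC client satisfy one interface, so flipping the transport is a wiring change, never a call-site change.

```go
package users

import "context"

// UsersClient is the surface callers program against, the same shape as the
// generated gRPC client. Names here are illustrative, not Lyft's real API.
type UsersClient interface {
	UpdateUser(ctx context.Context, req *UpdateUserRequest) (*UpdateUserResponse, error)
}

// envoyBridgeClient speaks plain HTTP/1.1 to the local sidecar (as in the
// earlier bridge sketch); grpcClient wraps the real generated gRPC stub.
// Both satisfy UsersClient, so swapping one for the other never touches a call site.
type envoyBridgeClient struct{ /* http.Client + sidecar address */ }
type grpcClient struct{ /* generated stub over an HTTP/2 connection */ }

func (c *envoyBridgeClient) UpdateUser(ctx context.Context, req *UpdateUserRequest) (*UpdateUserResponse, error) {
	// Frame the protobuf and POST it through the sidecar bridge.
	return &UpdateUserResponse{}, nil
}

func (c *grpcClient) UpdateUser(ctx context.Context, req *UpdateUserRequest) (*UpdateUserResponse, error) {
	// Call the real gRPC stub directly.
	return &UpdateUserResponse{}, nil
}

// Placeholder message types standing in for the generated ones.
type UpdateUserRequest struct{ ID, Locale string }
type UpdateUserResponse struct{ ID string }
```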
But the cost of causing that kind of trauma to a team, uprooting all of their clients, is pretty huge, and we try to think about that as we develop software and release things. Sometimes you do have to just rip the Band-Aid off; other times you want to engender trust by giving people really ergonomic, easy-to-use things. But sometimes you do have to bump people along. All in all you just have to listen to your customer, and that means embedding yourself in those teams and seeing what their day-to-day is: go sit with a service developer and build out a service, or an endpoint, with them. We've done a good bit of that, and these are the things they tell us. They're not wrong. Maybe in Java the generated messages are super easy, and in Go they're not so bad, but here's the official Python message, and this is not the full thing; there are 300 or 400 more lines of this. I don't even know what this does. Neither does IntelliJ or PyCharm; it has no idea, because the class is built up dynamically from some hash map off the file descriptor.

So we did this: we wrapped it in a simple proxy, as a plugin for protoc, and then you get this, and people like it a whole lot more. If you do this extra little bit of work and understand the point of view of a Python developer who sits in PyCharm all day, it pays off. All of this just delegates; we just forward everything off, it costs nothing to do, it just depends on how many properties are in that protobuf. But now Jane's life is easier again (I'm going to keep stealing Flynn's example), and you just keep trying to make her happy. These are the other things we've been able to do to make people's lives easier, since we have all of this code generation at hand. Don't just generate the code; think about what else you can do. We can generate really simple-to-instantiate protobufs, or the clients could be better, or, and we're toying with this idea right now, we can generate fixtures that represent certain scenarios so that people stop writing integration tests.

So that's a lot of codegen, and of course we wrote a framework for it. (How am I doing on time? 10 minutes? OK.) We have this framework, we hope to open source it, and it's helped us a lot. Writing protoc plugins: the simple ones are pretty simple, but the AST is a little wonky for doing really complicated stuff. The framework simplifies all of our codegen, and we can test the codegen, which has always been something I wanted to do. I want to represent the thing going onto disk as a model, assert on it, and then test, like a dry run, that it was actually written to that spot with the correct permissions. And we can do that. Nobody really wants to dive into this, but that's how you do it: it gives you a visitor pattern, you can walk everything, and no matter what you feed into it, all of your protos get fed through and you can generate all kinds of interesting stuff (there's a rough sketch of what one of these plugins looks like below).

So how far can you actually take this? Service generation. We have the serve framework I talked about earlier, which gives us gRPC and HTTP servers with some friendly APIs on top so that people aren't wiring everything up themselves, and we've got a debugging API available for both of them, for doing pprof and things of that nature.
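Here is the plugin sketch referenced above. It uses the upstream protogen package rather than Lyft's framework (protoc-gen-star itself isn't shown in the talk), but the shape is the same: protoc hands the plugin every proto file, you walk the messages and services, and you emit whatever you want next to the generated protobufs.

```go
package main

import (
	"google.golang.org/protobuf/compiler/protogen"
)

// Invoked as: protoc --plugin=protoc-gen-example=./example --example_out=. users.proto
func main() {
	protogen.Options{}.Run(func(gen *protogen.Plugin) error {
		for _, f := range gen.Files {
			if !f.Generate {
				continue
			}
			// One generated file per input proto, emitted alongside the .pb.go.
			out := gen.NewGeneratedFile(f.GeneratedFilenamePrefix+".example.go", f.GoImportPath)
			out.P("// Code generated by protoc-gen-example. DO NOT EDIT.")
			out.P("package ", f.GoPackageName)
			out.P()

			// The visitor-style walk the talk describes: whatever protos you feed
			// in, the same traversal can drive CLIs, fixtures, proxies, or docs.
			for _, m := range f.Messages {
				out.P("// message: ", m.Desc.FullName())
			}
			for _, s := range f.Services {
				for _, method := range s.Methods {
					out.P("// rpc: ", s.Desc.Name(), ".", method.Desc.Name())
				}
			}
		}
		return nil
	})
}
```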
So what we can do at that point, now that I know about your service definition: I can build a service config, I can configure your OD, which is the ODM that we wrote, and I can compose smaller and smaller configs for each service, of HTTP or gRPC flavor. On top of that we just gave people handlers that are very familiar from Flask: we implement the base gRPC interface ourselves and then forward everything to handlers, which people immediately got. And all of your testing is generated for you at 100% coverage, which I like doing, because anything you add after that only takes away from it. We can do linting and static analysis, so we don't have to police all of the IDL changes going in to make sure people use the right conventions. Mocks and test fixtures: I think this is going to be really nice, because we generate all of the clients for people, and now we can have scenarios beside them that service owners define. Their definitions will run against the protos, and we can at least acceptance-test without anyone writing any code whatsoever; just fill out the fixture and go. gRPC on mobile would be really interesting, and one day we'll get to it. If you've done this before, please talk to me; I'm really interested in hearing anyone's stories because I haven't gotten to try it yet.

So the incremental march continues. Having ideals is great, but awareness of the realities of your organization is the key thing you have to pay attention to when designing something like an Envoy, or something as simple as a CLI tool to interact with gRPC. So thank you. And that's Chris Roche, who wrote half of these slides.

You went first. Just to follow on the example of the bikeshed around the REST API: it's kind of a deep joke, but the more you iterate on this URL, the more you end up with a v1 in there. And I tried to think of everything so that I didn't get caught; I gave the talk and somewhere I didn't put the v1.

The code is instrumented, since we generate all of it. For gRPC, the serve framework uses a unary interceptor, which is basically a fancy word for middleware, and that's where all of our logic is. So we have very predictable stat names: if you name a service "widget service", you get predictable stat names for all of the endpoints that come out (there's a rough sketch of an interceptor like this below). That gets sent to statsd, but not through the traditional library; we do some bundling of our own to try to cut down on all the requests, and we have another stats relay thing that's there as a sidecar as well. So no, it's been statsd from the beginning as far as I know, and Envoy uses it too. Pluggability, like was mentioned, would be huge, but right now everything is statsd.

I think we have the exact same solution for proto: we're going to bolt in a validation plugin with annotations. We'll do an extension for it, and I'm assuming your validations are things like "is it within this range?" You can assert that it's a uint64 all day, but is it within a range? Then you have to write your own validator. I think with protoc-gen-star this would be pretty easy for us to do; we just need to come up with a common set of validations. I think validation libraries can go overboard in some cases, but in others I totally agree: there are some that are just, like, a range. That sounds great.
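Here is the interceptor sketch referenced in that answer. The serve framework isn't public, so the stat-sink interface and the naming scheme are stand-ins, but this is the general shape of a unary interceptor that derives predictable stat names from the method and times every call.

```go
package statsmw

import (
	"context"
	"strings"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// StatSink is a stand-in for the statsd client the talk mentions.
type StatSink interface {
	Incr(name string)
	Timing(name string, d time.Duration)
}

// StatsUnaryInterceptor turns "/users.v1.Users/UpdateUser" into predictable
// stat names like "users.v1.users.updateuser.success" and times every call:
// the "fancy word for middleware" layer where this kind of logic lives.
func StatsUnaryInterceptor(sink StatSink) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {

		base := strings.ToLower(strings.ReplaceAll(strings.Trim(info.FullMethod, "/"), "/", "."))
		start := time.Now()

		resp, err := handler(ctx, req)

		sink.Timing(base+".time", time.Since(start))
		if err != nil {
			sink.Incr(base + ".error." + status.Code(err).String())
			return nil, err
		}
		sink.Incr(base + ".success")
		return resp, nil
	}
}

// Wire it in when building the server:
//   grpc.NewServer(grpc.UnaryInterceptor(StatsUnaryInterceptor(mySink)))
```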
Back on validations: along with the Spanner work we're doing, we're going to add them. Will it be a combination? It won't go into OD itself, because OD is its own extension to the proto language. OD will be open source one day, but we don't want to tie the two together, and it seems like this is something that could be useful for a lot of people. I've seen people try protoc plugins, so we'll see if we're good at it.

We did not. We've talked with the proto team a couple of times about this. Chris Roche has been in those meetings; he'd be able to tell you, he has gone deep. The solution we came out with was to just support proto2 for these crazy cases, and it's just for stuff coming out of PHP. Mongo will let you shove anything you want into a field, period. We can tell it that it's a type, but it might not be. The whole goal was to have something that is expressible as a wire format. We could have bowed out and just said, nope, it's just JSON or BSON, and shoved it in as a bytes field, but we felt that if we were going to do this work, we were going to bring types into this world. There were many times, many nights, where we almost gave up and said we're shipping BSON across the wire, but we didn't.

You've had your hand up for a while. Well, Envoy, yeah, of course. Everything that I showed, except for the PHP stuff, we plan on open sourcing. We need to do a pass on OD first, because there's still some stuff that's a little bit tied to Lyft; we've tried to do our best to stay uncoupled from Lyft-specific properties, but, you know, like with any project, "I have to ship this now." So I've got to go clean up some bash scripts and things like that, and scrub the git history. Serve is something we're looking to open source, and we already open sourced ratelimit, which is a component of Envoy, or used with Envoy if you want rate limiting; I think it's just lyft/ratelimit. We're doing an update to it that's probably going to require serve, and that will mean serve comes out with it, built on the same frameworks as what we're using. The generation stuff, any of that can be open sourced. protoc-gen-star, I think, is pretty much ready. The weird thing about that project is that even if we explain it as best as possible, it's not going to make sense to anyone unless you've written a protoc plugin; it's so weird and so specific, but it does make things really easy. And the Python proxy objects, we're definitely open sourcing those. We find it really useful to generate these things right alongside the protobufs that are unfriendly. Yeah. Thanks, everybody.