The observant among you will notice that I do not look much like Maté. Maté could not make it, so I am covering this talk for him. I'm Flynn, I'm a tech evangelist. Maté is a Linkerd maintainer. I am a tech evangelist, so that means I work in marketing, which means that if you have complex questions about this presentation that Maté did a lot of heavy lifting for, I will absolutely be able to make up answers, and they might be correct or they might not. But let me know, you know, sing out if you have questions, and we'll figure out what's up. The short summary of this talk about switching to GAMMA without ruining your reputation is that as of Linkerd 2.12, we started switching away from some of our custom Linkerd-specific CRDs to things from Gateway API, and this turns out to be hard. I will leave the question of whether we succeeded in not ruining our reputation for later; at this point we're mostly gonna be talking about kind of the nuts and bolts of how we went about it. Again, if you're familiar with Linkerd, you will know that we just shipped 2.14. Most of what we're talking about in this talk happened in 2.13, because that's when most of the heavy lifting for the shift happened. Between 2.13 and 2.14 it was much more incremental stuff, as opposed to "let's just rip out the underpinnings and put it all back together while we're still supporting everybody." So, setting the stage: raise your hand if you don't know what Linkerd is. You are lying, Rob. Okay, all right. Linkerd is a service mesh. If you imagine your typical Kubernetes cluster where you've got microservices all talking to each other, then the point of a service mesh is to go in kind of underneath your application and provide you security and reliability and observability uniformly, without having to change your application. Linkerd does this by sticking sidecars next to each of your application pods.
The sidecars then ruthlessly take over all of your network communication, and that allows them to force it to be mTLS, to mediate a bunch of things, to do retries and cool stuff like that, and to measure a bunch of things and provide the information to you. So you have uniform observability across the whole call graph. It's pretty cool. Most of the time I do this, the control plane is an afterthought and I say, oh yeah, it's the thing that manages all the proxies. We're actually gonna be talking about some of the details of the control plane in this talk, because that's where a lot of the magic happens when you're gonna switch and do things the GAMMA way as opposed to the Linkerd-specific way. I should probably also have pointed out that the GAMMA initiative is the initiative that's trying to figure out how to take Gateway API concepts, which were originally designed for north-south traffic, and make them applicable to service meshes' east-west traffic. It's an acronym for something that I don't remember, so I just remember it as GAMMA. I'm looking at Mike and Rob because they might remember this. Gateway API for Mesh Management and Administration: GAMMA. There we go. Props to Mike Morris. So Linkerd actually gives you a lot of really critical functionality with no configuration at all. For example, mTLS, request-level load balancing, observability: you do not have to configure these things. You can configure things on the observability front; we'll talk about that. You can talk about how load balancing should happen; you don't have to. Then there's other stuff, like almost all of the reliability and auth things, which you must configure, because even with ChatGPT we can't guess how that works in your app. We could use ChatGPT and guess wrongly, but that doesn't sound like any fun, so instead we make you do it.
But at the same time we try to limit how many Linkerd-specific CRDs we have to introduce, because every one we introduce we then have to go out and teach people about. And also, every one we introduce limits the transferability of knowledge across the ecosystem as a whole. This is one of the big wins of the Kubernetes ecosystem: once you learn how a Deployment works, you don't have to relearn how a Deployment works on some other Kubernetes platform, I hope. If you have to learn Linkerd-specific stuff, then if you want to go and use Istio (not that you would want to do that, but if you did want to do that) you would have to relearn that stuff. To the extent that we can do it with standardized things, your knowledge is more transferable, and that is a win. So in the beginning there were service profiles. These actually showed up in Linkerd 2.1, all the way back in 2018. They are a very Linkerd-specific thing that lets us talk primarily about metrics. So this service profile up there is saying: anytime you are trying to talk to webapp.books.svc.cluster.local (that's a Kubernetes FQDN; remember that concept), if a request matches /books and it's a POST, then we will record metrics for it under a bucket named "/books", and if it is a GET to /books/<some number>, we will turn any of those into a single bucket called "/books/{id}". That's not some funky templating thing; those are literal curly braces, because when you're looking at that in the Linkerd dashboards you don't really care about /books/5 versus /books/4; knowing that it's a GET for a specific book is probably more useful to you. This is most of what, actually that's all of what, service profiles could do in the beginning.
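Since the slide itself isn't reproduced here, a sketch of the kind of ServiceProfile being described might look like this (service and namespace names taken from the example in the talk; the exact regexes are illustrative):

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # ServiceProfiles are named for the FQDN of the service they describe.
  name: webapp.books.svc.cluster.local
  namespace: books
spec:
  routes:
  # A POST matching /books gets its metrics recorded under the "/books" bucket.
  - name: /books
    condition:
      method: POST
      pathRegex: /books
  # Any GET for a specific book is collapsed into a single "/books/{id}"
  # bucket; the curly braces are literal text, not templating.
  - name: /books/{id}
    condition:
      method: GET
      pathRegex: /books/[0-9]+
```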
Then we extended them a bit to support timeouts and retries, because it makes sense to talk about timeouts and retries on a per-route basis: retrying the POSTs to /books is probably a terrible idea, while retrying the GET is probably not a terrible idea, for example. Once we did that, we further expanded service profiles to talk about more configuration for retries, for example. Again, all things that make sense per route. They could not actually do routing stuff, and this was a big limitation of service profiles, so this will not work, and they also couldn't do things like rate limiting and circuit breaking and all that kind of stuff. Fundamentally, rate limiting and circuit breaking in our heads are routing, and that will also become important later. I do have a bit of a confession to make: at one point we wedged a secret API into service profiles so we could do traffic splitting for SMI. We are not gonna talk about this any more for the rest of this talk, because it was never seriously considered as a thing to build on for GAMMA. It's probably too much to call it a brutal hack, but it was not something that was designed as a thing we could build on for the future, so there was no serious consideration of building on that. Okay, Linkerd 2.12: we bring in HTTPRoute from Gateway API. Originally, in 2.12, the only thing we allowed you to do with an HTTPRoute was talk about auth. So this is a piece of configuration where, if you're trying to go to the books Server (Server being a Linkerd-specific CRD), and you're trying to do a POST to /books or a GET to /books/<number>, then this is a place where you could attach authorization policy to this HTTPRoute. That gave us the ability to do per-route authorization, which we could not previously do in Linkerd, and it was very cool.
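For readers who can't see the slide, the 2.12-era shape being described is roughly this: an HTTPRoute in Linkerd's own API group, parented to a Server, with an AuthorizationPolicy attached to the route. The resource names here are invented for illustration:

```yaml
# A Linkerd Server describing the books workload's port.
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: books-server
  namespace: books
spec:
  podSelector:
    matchLabels:
      app: books
  port: http
---
# An HTTPRoute (note: Linkerd's API group, not gateway.networking.k8s.io)
# matching POST /books and GET /books/<number>. No backendRefs: in 2.12
# this could not route, only serve as an attachment point for policy.
apiVersion: policy.linkerd.io/v1beta1
kind: HTTPRoute
metadata:
  name: books-api
  namespace: books
spec:
  parentRefs:
  - name: books-server
    kind: Server
    group: policy.linkerd.io
  rules:
  - matches:
    - method: POST
      path:
        value: /books
    - method: GET
      path:
        type: RegularExpression
        value: /books/[0-9]+
---
# Per-route authorization: only the webapp ServiceAccount may call these routes.
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: books-api-authz
  namespace: books
spec:
  targetRef:
    group: policy.linkerd.io
    kind: HTTPRoute
    name: books-api
  requiredAuthenticationRefs:
  - name: webapp
    kind: ServiceAccount
```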
Fundamentally, this was a bet that we were placing on the idea that the industry-standard way of doing this (a) was gonna be a good thing and (b) was gonna be Gateway API. The jury is still kind of out on whether both of those things are true, but it seems to be very much trending in the direction of Gateway API being the standard, which is also pretty cool, not least because it means that this bet probably paid off, or probably is paying off. And this is kind of a big deal: Linkerd's implementation is very different from the other service meshes. The way we tend to approach things is pretty different, so we have a fairly strong bit of culture within Linkerd towards just controlling everything ourselves. This was a bit of a departure from that. The astute reader will note that there are no backendRefs in the HTTPRoute I just showed you: in 2.12 you still could not use HTTPRoutes for routing, despite what the name says. We were using "route" as a noun and not a verb, and this is kind of relevant. This is also one of two reasons why we brought in HTTPRoute in a different API group than Gateway API uses: first because we were sharply limiting what it could actually do, and second because conformance was not a thing in this world. Okay, that brings us up to the end of 2.12 and the start of 2.13, where we started going: all right, we have service profiles, which were designed for metrics and had some other things mashed into them. We have HTTPRoutes, which are allegedly designed for routing (I shouldn't even say allegedly; they were designed for routing). And we have people who are asking for routing features in Linkerd that we don't have right now. So great, now what? Anytime you find yourself asking that question, the two questions you need to be asking after that are: okay, what does our user want, and what does our user need? These are not often the same thing. Sometimes they're at least related, which is good.
In our case, when we talked to our users, what they needed and what they wanted thankfully were the same thing: full-fledged traffic management all throughout the call graph. They wanted things like fancy routing and canary deployments and the usual progressive delivery stuff, but instead of wanting this right at the edge of the call graph where an ingress controller can do it, they wanted it everywhere. Great. Linkerd could not actually do most of these things in 2.12, and we also were constrained by the fact that we couldn't toss out the features that service profiles give you. That left us with two really obvious ways to approach this: fundamentally, we could extend service profiles, or we could build on top of HTTPRoute. I will skip to the punch line here, which is that ultimately we decided that there was more green in the chart for HTTPRoute, so we decided to go that way. There are a couple of big things that I want to point out here, because they've changed a bit. When we went to HTTPRoutes, they didn't have retries or timeouts. They still can't do retries, but they can do timeouts, which is kind of cool, and we support that. There's a bit in there about sharing policy, and fundamentally what we're talking about there, from Linkerd's point of view, is that if you look at the mechanics of service profiles, if you have two routes that should have exactly the same policy, you must create two service profiles, which is kind of a pain in the butt. There are a couple of things about HTTPRoutes where we had a couple of different ways that we might be able to do better than that. And I'm conditionalizing that statement a lot, because one of the things we've learned going through this is that some of the stuff around that with HTTPRoute is still kind of up in the air, and we're not really sure what the right direction is.
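To make the routing and timeout side concrete: the GAMMA-style way this ends up being expressed is an HTTPRoute whose parentRef is a Service, with weighted backendRefs doing the actual routing and the Gateway API timeouts field on the rule. This is a sketch, with invented service names, and the policy.linkerd.io/v1beta3 version is my assumption of the current Linkerd CRD version:

```yaml
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: webapp-canary
  namespace: books
spec:
  # GAMMA-style: the parentRef is a Service, so this route applies to mesh
  # traffic addressed to webapp, as seen by the client's outbound proxy.
  parentRefs:
  - name: webapp
    kind: Service
    group: core
    port: 7000
  rules:
  - backendRefs:
    # Weighted backendRefs are what finally let HTTPRoute be used as a
    # verb in Linkerd: 90% of traffic to v1, 10% to the v2 canary.
    - name: webapp-v1
      port: 7000
      weight: 90
    - name: webapp-v2
      port: 7000
      weight: 10
    # Gateway API-style timeouts: fail if the whole exchange takes more
    # than 2s, or any single backend attempt more than 1s.
    timeouts:
      request: 2s
      backendRequest: 1s
```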
The other one I'll point out again is that we considered it a plus that service profiles were completely under our control, and less of a plus that HTTPRoute was controlled by Gateway API, but ultimately we decided that it was worth it. So yeah, we decided to go with HTTPRoute, doubling down on the whole industry-standard thing. If you are Linkerd, this also means investing in the standard, as opposed to merely implementing the standard, which is ultimately the reason why I'm now one of the GAMMA co-leads. Although, as I pointed out at the Gateway API office hours yesterday, I'm still not entirely sure how that happened, since I work in marketing, but yeah. All right, are there any questions so far about any of this? The question was: why go with our own API group for HTTPRoute? Let's come back to that one, because I talk about that a little bit later; please ask it again if you don't think I answered the question. All right, anybody else? No? Okay. We're gonna dive into some of the technical background for the implementation crap I'm gonna talk about here. I should also point out, remember, Maté was supposed to do this talk. He did the lion's share of the heavy lifting on these slides, but let's just take it as read that any mistakes in this with respect to Linkerd's implementation will be on me and not on him. All right, so, nomenclature. We have the control plane for Linkerd, and we have a couple of workloads. Each workload has a proxy. I'm gonna emphasize that, because in the early days of GAMMA this turned out to be a very surprising thing for people who are familiar with some of the other meshes, which struck me as a little surreal, because I'm pretty sure some of the other meshes do it the same way. But yeah, we have two proxies involved, and when workload one wants to talk to workload two, it's actually workload one's proxy calling workload two's proxy. In a fit of originality, we call those the outbound proxy and the inbound proxy.
This is kind of important, because the outbound proxy is where outbound policy happens, which is primarily routing. Routing decisions have to be made before you make the connection; you can't make them after the connection is already made, it just does not work. Inbound policy happens at the inbound proxy, and this is primarily concerned with authorization. Checking auth at the point of the connection going out does not work, because you can then do all sorts of evil things if you're not checking it on the way in. The other thing I'll point out is that there are several different controllers within Linkerd. We're only gonna talk about two of them: the destination controller, which is largely concerned with where a given connection should go, and the policy controller, which is largely concerned with policy. Since we are mostly talking about Gateway API used for routing things, we're primarily gonna be concerned with the outbound proxy talking to these two controllers. This will become important in a bit. You'll also hear me talk about the load balancer. I am not talking about the thing that is a type of Kubernetes Service here; I'm talking about a chunk of Linkerd that is responsible for making outbound connections. Ultimately, this is the piece of Linkerd that implements all the routing decisions, which is slightly different than saying it's making all the routing decisions, but it is the one that has to implement any decisions that get made. Oh yeah, and the load balancer has to keep track of actually a lot of different things, but for our purposes today the endpoints are really the only relevant ones. All right, so, what happens if you're actually gonna be making a connection before Linkerd 2.12? I'm going to go very quickly through this, because if I go slowly through it we will be here all day. The workload initiates a connection.
The outbound proxy snares that connection, then calls the service profile API in the destination controller, which maps the destination IP address into a fully qualified domain name like workload.namespace.svc.cluster.local, grabs all the service-profile-related info associated with that name, and then hands that back to the outbound proxy. The outbound proxy takes all of that information, immediately turns around, and passes the FQDN to the endpoints API in the destination controller, which does not hand back a set of endpoints: it hands back a stream of endpoint updates to the outbound proxy, and the outbound proxy uses that stream of updates to feed its load balancer. Internally, the load balancer pays a lot of attention to that stream to figure out which endpoints are part of the set, which are not part of the set, which it can use, et cetera, et cetera. I am glossing over a ton of detail in that particular bit; I am hopefully only glossing over irrelevant detail. I guess we'll find out. One big example of something I'm glossing over, because it's very important in practice but not so much for GAMMA: if you are talking to an IP that is an endpoint IP rather than a cluster IP, then there is no FQDN that we can associate with it, but rather than just handing back empty stuff, we actually hand back the default service profile for you. Likewise, the load balancer skips an enormous amount of logic if it knows there's only one endpoint in the set, things like that. You will also notice that I didn't say anything about the policy controller, and that's because it didn't exist before 2.12, so nobody was talking to it; it wasn't there. In 2.12 everything is the same, except that we have to have information about HTTPRoutes. But in 2.12, although we added the policy controller here, the only policies it could deal with were authorization, and that is inbound policy.
So everything that I told you about pre-2.12 is the same, because the outbound side of the proxy didn't talk to the policy controller at all. That brings us up to... oh no, sorry, that brings us up to a slide I forgot, sorry about that. An interesting point about the policy controller is that it was the first controller we wrote in Rust (by "we" I mean a bunch of people who were not me). Despite the fact that it's not written in Go, it has to do all of the usual controller-y sorts of things that a Kubernetes controller has to do: it has to go through and keep in touch with the API server, and watch for updates, and do all this indexing stuff, and keep track of all the HTTPRoutes, and all the usual crap. So it is a full-fledged Kubernetes controller; it just happens to be written in Rust, because we think Rust is cool. Yeah, all right, that brings us up to the point where we can start talking about GAMMA again. Good place to ask for questions; any questions? No? All right. The first step for 2.13 could pretty much boil down to "just add backendRefs, what could be easier?" Answer: lots of things. One of the things that we tend to do working with Linkerd is ask questions about the user experience first and use those to drive the technical decisions, rather than the other way around. If you ask questions about the technical stuff first and use it to drive the user experience, you end up with user experiences designed for engineers, which suck, so we don't do that. The UX questions kinda started with things like, one of the big ones was: great, we're gonna try to do routing in HTTPRoutes; what happens if a service profile and an HTTPRoute end up conflicting? Because we could already do HTTPRoutes for auth, and we had service profiles, and now we're gonna add routing, so they're gonna become a bigger deal. Ultimately the answer was: let the service profile win. And I apologize for not finding a GIF of that scene from Star Wars with "let the Wookiee win."
Mostly this is the principle of least surprise. If you're already used to dealing with service profiles and some bozo sticks in an HTTPRoute without you knowing about it, you would like for your world not to come collapsing down around your ears. So this is mostly a least-surprise thing. There were some technical things too, in that it lets us follow a pretty well-trodden path of what happens if there is no service profile, or if there is one that gets deleted, and things like that; we already had a bunch of that default code in there. Let's see. Do we think that decision has worked out? I think so. There were some rough edges in the beginning, mostly dealing with places where the proxy could fail to notice that there was actually a service profile present, or fail to notice that there was actually an HTTPRoute present, and get stuck looking for the wrong one and ignore the one that was actually there. I think those have all been dealt with at this point, so ultimately I think this worked out pretty well. One of the things that I like about it is that it is at least a deterministic thing. You can know how it's gonna behave, and we can explain to you how it's gonna behave, and these things are important. It was much easier to decide that than it was to implement it. Another big question was whether we were gonna be okay with feature gaps, in the sense of: are we okay with putting things into HTTPRoute that we can't put into service profiles? And are we okay with allowing service profiles to win, okay with allowing them to be used at all, even given that there are things that service profiles can do that HTTPRoutes cannot do? The answer to this one was a very, very reluctant yes. Mostly the way we ended up resolving that one was deciding: okay, we will prioritize putting the new stuff into HTTPRoutes.
We will think very, very carefully about backporting anything, because we really, really do not wanna do that, but we will make it a deliberate decision to backfill everything from service profiles into HTTPRoutes. We're going to come back to that, because there are some interesting ramifications of that decision. Do we think that decision has worked out? That is a lovely question; we'll come back to it later. Conformance. Yeah, conformance. Conformance is lovely. It was not possible for Linkerd 2.13 to be conformant with Gateway API, full stop, could not be done. There are two reasons for this. The biggest one is that in the time frame when Linkerd 2.13 was being built, Gateway API did not have a concept of conformance profiles yet, so if you wanted to be conformant with Gateway API, you had to have an ingress controller, which we do not. So: couldn't be conformant, done. The other reason (and this gets to your question) is that there were also things in the 2.13 space where HTTPRoutes in Linkerd still could not do things that they were supposed to do for Gateway API, and so that was the other reason we pulled them into their own API group: there's no point in using the official API group if we can't be conformant. The moment we start using the official API group, we will instantly have people coming up and bugging us about why we are not conformant, and more importantly, we will have users coming back and filing bugs saying, dude, you're using official Gateway API things, but X, Y, and Z don't work. Putting them in our own API group permitted us to sidestep that whole discussion. Do we think these two have worked out? Mostly. Only mostly. We're gonna come back to this too a little bit. So, the summary here: we were gonna let service profile and HTTPRoute coexist, with service profile winning. We were only gonna put new things in HTTPRoute, and we would need to backfill all the stuff from service profiles into HTTPRoute.
I should point out that that effort of backfilling everything is very much still ongoing. So, Linkerd users, if you have strong opinions about which things need to be backfilled first, let us know. Oh yeah, and we didn't worry about conformance. All right, technical stuff. After we had all the answers to the UX stuff, we could ask questions about the technical stuff. The first one was: are we gonna do this under the hood as a shiny new API, or are we gonna try to wedge all this crap into the service profile APIs? And the answer was: let's do a new one. The biggest reason here was that if you look at service profiles and you look at HTTPRoutes, you will find some really fascinating places where trying to serialize them both to the same wire format is really, really hard. In particular, the way service profiles approach route matching and the way HTTPRoute approaches route matching are very different, if you look at the details of the spec, so trying to translate those semantics would suck, so we didn't do it. If we're doing a new API, do we put it in the policy controller or do we put it in the destination controller? We put it in the policy controller, because Rust is cool. No, not entirely because Rust is cool. Mostly it was because, if the policy controller is already talking to the API server to wrangle all this HTTPRoute stuff, there's absolutely no point in duplicating all of that API server traffic by making another controller do it too. We should just take advantage of having the information in the policy controller already, and do it that way. We also decided that we were not gonna bother trying to do any other Gateway API types for 2.13. GRPCRoute is kind of obvious, but it's also not as well-defined yet as HTTPRoute, and even if it were, and we wanted to do it, we didn't have any bandwidth to do it, so, whatever. That decision was mostly made for us. Okay, whoops, that's the tech Q&A summary; screwed up that slide.
New API in the policy controller, only support HTTPRoute. The way this ended up looking under the hood was that the workload initiates a connection, and the outbound proxy snares it, then calls the service profile API and the endpoints API exactly the same way as it did before, so it ultimately gets that stream of endpoint updates. But then it also has to go through and call the policy API, which has to do the same FQDN mapping (that's an annoying place where we end up duplicating work), and which then returns a stream of updates with routing configuration. The outbound proxy then has to feed both of those update streams into the load balancer, and the load balancer has to figure out how to honor both of them. That is the hairiest part in this whole thing, really. There's not really a good way to go through and merge these things; it's much more a question of the load balancer having to be able to figure out, oh look, I'm getting service profile updates, so I should ignore HTTPRoute updates, or, I'm not getting service profile updates, so I should honor HTTPRoute updates. Even after you do that, it's still kind of obnoxious, because you have to make sure that the load balancer is doing the right thing with the very different route matching rules and semantics. So this was a mess. This was the lion's share of the horror of what we went through here. We think we got it right at this point, and I mean, at this point we are conformant with Gateway API's Mesh profile, so clearly we got it right, right? Another thing that was kind of a mess in here is that in 2.12, what we did was basically use a webhook to do validation of service profiles and HTTPRoutes, and if they didn't validate with the webhook, then they would get kicked out, and then basically the policy controller never saw them and the destination controller never saw them. That does not line up very well with what Gateway API likes to do, which is to use status to indicate things like this instead.
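A sketch of what status-based acceptance looks like on the wire: condition types here are the standard Gateway API ones, the controllerName value is my assumption of Linkerd's (check a real cluster for the exact string), and the route name is invented:

```yaml
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: webapp-canary
  namespace: books
spec:
  parentRefs:
  - name: webapp
    kind: Service
    group: core
status:
  parents:
  - parentRef:
      name: webapp
      kind: Service
      group: core
    # Assumed controller name; verify with `kubectl get httproutes.policy.linkerd.io -o yaml`.
    controllerName: linkerd.io/policy-controller
    conditions:
    # Instead of a webhook rejecting the resource outright, the policy
    # controller records whether it accepted the route...
    - type: Accepted
      status: "True"
      reason: Accepted
    # ...and whether its backendRefs resolved. Routes without a good
    # status are simply ignored by the policy API.
    - type: ResolvedRefs
      status: "True"
      reason: ResolvedRefs
```

(Real condition entries also carry `lastTransitionTime`, `observedGeneration`, and `message`; they're trimmed here for readability.)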
Rob is smirking because Rob and I have had a number of discussions about how terrible webhooks are, and, yeah, we didn't really want to extend the webhook anyway, so let's be fair about that. So one of the big changes is that now, when the policy controller sees an HTTPRoute, the route gets validated at the policy controller. If the policy controller doesn't like it, it updates the status, and then the part of the policy controller that's actually serving the policy API checks the status and says, nope, I'm gonna ignore that one, it doesn't have a good status. So: same result, extremely different implementation. It works pretty well, which is kind of nice. It did require us to do some first-ever things; for example, this is the first time in Linkerd that we've had to do leader election in a controller, and we had to do it in Rust. I'm pretty sure that made it into kubert, so people who are trying to do their own controllers in Rust get to benefit from our blood, sweat, and tears, or probably more accurately from Eliza Weisman's blood, sweat, and tears. So, current status: did we ruin our reputation? We don't think so. I'm not completely sure yet, but we don't think so. We have people who are using HTTPRoutes, and they seem to like it; that's good. This kind of change tends to be tricky to manage, and there are 100% things that we did that we look back on now and go, well, that was a bad idea, or, that went a lot rougher than we thought it would. One of the big ones there is that business with the proxy going, oh, I'm getting HTTPRoute updates, or, I'm getting service profile updates. There's not a lot of feedback from the proxy about what it's watching right now, and I can tell you from personal experience that this has resulted in some much trickier debugging sessions than I would have liked. So there are definitely things we need to keep working on. Things that went well: starting with UX considerations. Absolutely critical. You really, really need to do this.
I've said this multiple times at this conference. The end user, which usually for me is an application developer, but for Linkerd is not always an application developer... considering things from the point of view of the end user is absolutely crucial. If you don't do that, you will get crap that they hate, and then they will yell at you and you won't enjoy it. If you do it from the start, you have a fighting chance of coming up with something they're actually gonna like. Also, having our own API group gave us an interesting vehicle for being able to try things out kind of in our own little sandbox, as opposed to having to worry too much about conformance with the rest of the world, and that permitted us to do some things. That was a positive and a negative, actually. I guess we'll start kind of at the bottom of this one. The negative side of having both API groups is, first off, that it's confusing for people, and that turns out to be a pretty big deal. The other one is that it's obviously more effort on the development team, and if we had started at a place where mesh conformance was possible, I think we probably would not have gone that route. I think we probably would have just started with the core API type and rolled with it, rather than doing our own first, if we had not had to sidestep that conformance issue. Not having a good way to compose service profile and HTTPRoute is definitely causing pain for some of our users, and that's a huge, huge priority for us to sort out. And yeah, there are still functional gaps between service profiles and HTTPRoutes. Lastly, what worries us about all of this? Every project using Gateway API in anger is finding that they're developing in advance of the specification. We are no exception to that, and we're still trying to figure out how to do that. A good example there is retries, where we do HTTPRoute-based retries differently than every other Gateway API implementation that we know of.
We use budgeted retries rather than counted retries. So instead of saying "you may retry up to three times," for us, you can retry as long as the total fraction of retries is not more than 20% of the traffic going to this backend. Envoy recently added that, so maybe somebody will pick it up, but until then we are the only implementation doing that, and that is probably gonna make it a little challenging to get Gateway API to accept it. And then we get into fun questions like: oh, well, should we use policy attachment for retries? Should we use an extension filter? Should we do something else, like cram it into the policy.linkerd.io API group directly and then use that as a vehicle for going back to Gateway API and saying, hey, this works awesome, you should do it this way? I don't know. I really don't; it's an interesting question. And also, as I said earlier, Linkerd has a strong tradition of directly controlling our own fate, so there's also some of that going into all this as well. There are lots of ongoing discussions going on here, which the three Gateway API people in the room can confirm, and I'm probably being really obnoxious to them. But overall, we do believe it's possible to do this without completely ruining your reputation. We think that's a good thing. The user-centric design process is key. Thinking very carefully about how things are gonna look down the road, and very carefully about how this will affect your users, is critical; we see people miss that a lot. And lastly, the Gateway API folks actually are very welcoming to talk to. They have a lot of useful opinions. You should talk to them if you are thinking of doing something like this, because you will probably be able to avoid a lot of terrible, terrible things going wrong. And with that, thank you very much.
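To make the budgeted-retry model above concrete, here is how it looks in ServiceProfile terms (field names from the ServiceProfile reference; the service name is from the earlier example and the values are chosen to match the 20% figure in the talk):

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: webapp.books.svc.cluster.local
  namespace: books
spec:
  # The budget applies to the service as a whole, not per request:
  retryBudget:
    # Retries may add at most 20% on top of the original request load...
    retryRatio: 0.2
    # ...but always allow at least this many retries per second, so that
    # low-traffic services can still retry at all.
    minRetriesPerSecond: 10
    # Window over which the budget is calculated.
    ttl: 10s
  routes:
  # Only routes explicitly marked retryable participate in the budget.
  - name: GET /books
    condition:
      method: GET
      pathRegex: /books
    isRetryable: true
```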
I think we have one minute for questions, but I will of course be here, or back at the Linkerd booth in the project pavilion, or you can also find me as Flynn on basically all of the CNCF Slacks. So, any questions? No? Am I gonna escape without somebody asking me something? Go for it. [Question inaudible.] Because Rust is awesome. Ooh, yeah. Slightly less facetiously: the data plane is already written in Rust, and to the extent that we get to move away from having all of our maintainers have to know both Go and Rust at an expert level, that's good for us. To the extent that Rust can become a viable alternative to Go in the Kubernetes ecosystem, that's good for everybody. So, yeah. The fact that almost the whole ecosystem right now is written in one language with one runtime scares the bejesus out of me, right? All you need is one really nasty Go runtime bug and the whole thing comes crashing down around our ears for a little while. I would much rather have a more diverse ecosystem than that. I should point out I have no knowledge of any bugs lurking in Go to make the whole thing come crashing down around our ears, all right? So, FBI, when you watch this talk, it's cool. Anything else? All right, thanks much. I appreciate it.