Why don't each of you tell us a little bit about what you do with the edge and service mesh, just so we can get a little bit of background for those people who aren't familiar with the three famous faces we've got on the screen. And Alyssa, why don't you start, because your name starts with A.

Okay, so hi, my name is Alyssa. I've been at Google for about 13 years working on the GFE, which is Google's frontline proxy. It takes in all of Google's web traffic, everything from Gmail to ads to web search, so it's a pretty critical bit of Google's infrastructure. I worked on that for a decade. I launched HTTP/2 at Google. I launched HTTP/3 at Google. And then Google got interested in this proxy called Envoy, and I was kind of like, we spent a decade working on the GFE, why Envoy? And I started playing around with it and got really interested. It's just a beautiful architecture. It's super pluggable. And so I've spent the last three years working on Envoy, both trying to get it up to the Google level of cloud reliability standards and also adding features, becoming a senior maintainer along the way.

Fabulous, thanks. So Lynn, alphabetically we'll roll around to you, and we'll let Matt blush with pride about Envoy as the last speaker. So Lynn, why don't you tell us a little bit about yourself and what you do?

Okay, great. Yeah, so I've worked on Istio for, gosh, a little bit over three years now. So that's my primary interaction with the edge. You all know Istio has the Ingress Gateway and the Egress Gateway for the edge part. My primary role on Istio is contributing upstream, so I'm a maintainer on the project. I contribute primarily to user experience and also environments. I also serve on the technical oversight committee of Istio. So it's been an interesting ride.

Fabulous.
And so Matt, alphabetically last in a set of three, but why don't you introduce yourself? Of course, probably everyone knows you, but you should still say something for those few people who are watching who don't actually know.

Of course, thanks for having me. My name is Matt, I'm a software engineer at Lyft, where I've been for five and a half years. These days I spend about half of my time leading our infrastructure team at Lyft, so things like networking and Kubernetes and deploys and all of those types of things. And then I spend about 50 to 60% of my time working on Envoy in the open source domain. Prior to Lyft, I was at Twitter, where I built Twitter's edge proxy. That was actually my first main experience with the edge, and having that experience at Twitter was what led me to build Envoy. And of course Envoy is used in a whole number of different domains, from service mesh systems to API gateways, so my experience spans all those domains.

So do you three have some real-world stories of putting these into practice, to talk about how some of these things went well, or some of the places where maybe there were things you had to think some more about?

Yeah, I mean, I would say that any migration comes with migration pain. My new-hire project at Google was launching GFE2, which was taking the proxy handling all of our traffic and hot-swapping it to a new proxy, which of course meant a thousand small differences adding up to a bunch of interesting debugging examples. And Envoy is no exception. One prime example that we had early on was that because Envoy was written for HTTP/2, it lowercases all header keys, which is the HTTP/2 standard. But legacy traffic often expects HTTP/1.1-style capitalization: you have a Content-Length header, it starts with a capital C and it has a capital L. And theoretically, if you're following the spec, you should be able to handle either casing, but no one does.
So when you have, say, 20,000 cloud customers, it's guaranteed that not all of them do it right. There were a lot of internal debates as to how Google should handle that in Envoy, which were preempted by someone else running into it first and writing the camel-case HTTP/1.1 option. So you can work around that particular issue. But for any migration, there are going to be dozens of little impenetrable mismatches. Oh, you reordered the headers differently, or I expected a space here between commas, which we hit internally. And it just takes time to debug. We've done a lot of work in the last year on Envoy, making it easier and easier to debug what is actually flowing through the system, what your data actually looks like, and when things are rejected, why they are, to try to at least ease the pain of migration.

Yeah, that's the really key point: all of these systems theoretically sit on standards, but as Alyssa said, it's pretty fuzzy, right? And I think everyone runs into these issues. Stepping back to Envoy itself, one of its original value propositions, at least compared to its open source competitors, is that Envoy has really focused on the observability side of things: trying to better understand what's going on through metrics, logging, tracing, things like that. And given that you're going to hit not only all of these migration problems, but even apart from migration, just trying to roll out a system, a service mesh or any other component where you're dealing with a polyglot code base with 10 different languages and frameworks, it gets super, super messy. So I think that is really the underlying crux of it: understanding that things are going to break. And, as Alyssa was saying, focusing on the tooling and the observability and the diagnostics that allow you to understand it is probably the most important part.
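For reference, the workaround Alyssa mentions is exposed on Envoy's HTTP connection manager as an HTTP/1.1 header key format option. A minimal sketch, not a complete config, showing just the relevant fields of the v3 API:

```yaml
# Fragment of an HttpConnectionManager config
# (envoy.extensions.filters.network.http_connection_manager.v3):
# emit HTTP/1.1 header keys with proper casing
# (Content-Length rather than content-length) for legacy peers.
http_protocol_options:
  header_key_format:
    proper_case_words: {}
```

Note this only papers over peers that violate the case-insensitivity requirement of the spec; fixing the peer remains the better long-term answer.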
So my guess is that the audience listening to this panel is all of a sudden excited and saying: give us some specific examples from your work so far of where this debugging sort of thing has worked, or maybe not worked as well.

I'll start by being honest: it's very hard for me to remember the one exciting thing. By that I mean I've dealt with so many different problems and different bugs that they just kind of blur together. That's just the way it is. So I'll answer a slightly different question, and this is probably going to be an unpopular opinion: these systems, as we're talking about right now, are so complicated. Any time you have a service-oriented or microservice architecture, it ends up becoming very tricky, very quickly, because of different languages, because of bugs, just because of faults, right? So many things can go wrong. So my advice to people typically is: don't do any of this unless you actually have to, because it's very, very painful. To the extent that folks can look at the problems they're trying to solve, if they can limit the number of application languages, do that. If you can limit the number of services, do that. If you can use a cloud provider for ingress, do that. Just don't implement any of this. I think at a certain organizational size that becomes unrealistic, and you end up with services, you end up with an API gateway, and you end up with all of these things. But the problems, in my experience, and I'd be curious to hear from the others, are just endless. They just never end. There's a never-ending array of bugs, and it'd be hard for me to come up with any one in particular. I would say... go for it, Lynn. Go, go, go ahead.

The two classes of bugs are gonna be, you know, control plane or data plane, right?
And we've done a lot of work to make both easier to diagnose, right? A lot of the time for control plane bugs, especially since, as Lynn said, the configuration language is not the easiest, Envoy at least explains what you're doing wrong: hey, you have field A, that means you need field B; or this is out of bounds range, you're not setting a reasonable value, maybe look at the docs. For the data plane, again, I've personally done a bunch of instrumentation over the last year. This morning I got a bug report saying, hey, CONNECT isn't working when I try to do domain matching: if I say match all wildcard domains, it passes through, and if I say match localhost, it bounces. And I was like, oh no, I didn't test this, let me go debug what's going on. So I wrote an integration test and ran it. The debug detail said route not found, so that's great, I know the route match failed. And then the log said that host:80 didn't match. And I'm like, oh yeah, I copied what they did and matched on localhost, but the matcher should have included the :80, right? It took me five minutes to debug, basically, between logs and response code details. And so, as to what the common gotchas are and what we've done about them: the actual core Envoy maintainer community is quite small, and there's a group of devs who aren't maintainers, who are awesome and really good at answering questions. When we see questions like this happening over and over, why is Envoy sending this response, why is my thing not working, we'll put up an example people can copy-paste from, or we write a FAQ entry: why is Envoy sending this response, here's how you turn on these logs, here's how you get the details, here's how you turn up trace logging.
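To make those debug hooks concrete, here is a hedged sketch, a fragment rather than a full bootstrap, of an access log that prints the response code details Alyssa describes next to the status code:

```yaml
# Access log fragment (HTTP connection manager scope): include
# %RESPONSE_CODE_DETAILS% so a rejection shows *why* (e.g. route_not_found),
# not just a bare status code.
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      text_format_source:
        inline_string: >-
          [%START_TIME%] "%REQ(:METHOD)% %REQ(:PATH)%" %RESPONSE_CODE%
          %RESPONSE_CODE_DETAILS% %RESPONSE_FLAGS%
```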
Those tend to at least give people the common debug tools and hopefully reduce confusion and/or outage time.

Yeah, I want to echo some of what's just been said. It's definitely hard. I remember earlier this year, in the Istio project itself, we actually went back from microservices to a monolith, because some of these challenges are hard. We had four, five, six control plane components, and we consolidated into one single component, because we found we weren't enjoying the benefits of microservices. We don't have multiple languages, and we had a lot of communication issues among our own microservices, which could be solved if we just communicated through localhost, and we deliver and release all these components together anyway. So we actually went back and said, you know, we don't need microservices, so we don't need to consume our own service mesh, and we're actually a lot happier now just having one single control plane component. Like Matt said, if you don't actually need it, the best thing might be to just avoid it. In fact, from our project's perspective, we were amazed at how many issues we have, and how many issues we discover with Envoy too. A project like Istio is almost four years old, almost five if you count the pre-incubation time, and Envoy has also been around for a long time. But like Matt said, people have different scenarios, and someday somebody is going to discover something new, and there are also regressions and all that. So it's super complicated. One thing I would highlight: Karl Stoney from Auto Trader wrote a really good blog. When you have service A talk to service B directly, without a sidecar, you're worried about one connection pool, right? The moment you inject the sidecar, you're worried about like three connection pools.
So you actually increase your chance that things might go wrong. And the most common problem in a service mesh is the 503 error. With any of the timeout configurations, if they conflict, or with any of these issues in your connection pools, you may just get a generic 503, and it can take a long time to troubleshoot and realize what might be wrong.

Yeah, that's a great point. Actually, if you were to push me, one of the most common issues that comes up over and over and over again is that in these architectures you wind up with this chain of proxies, just like Lynn was saying. You can have an edge proxy, a second-layer edge proxy, a service, a sidecar, some intermediary. And often it's the timeout config that ends up getting really messy and confusing people a lot, particularly with protocols like HTTP/1.1, which doesn't deal with those timeout races very well. That was solved in HTTP/2, but we have a constant stream of issues that come into Envoy and other projects about why is it disconnecting, and you have to explain that, well, you have to actually configure the timeouts correctly across all the different layers. It's just very complicated. Avoid it.

Instead of avoid, because we're speaking to a knowledgeable audience out there, I wanna poke at this a little bit more and maybe learn a few lessons from the three of you about how to find those timeout problems.

The details. I mean, I added this into Envoy, but I added it because it's one of our most powerful debug tools for the GFE. Every single time Envoy sends a local response, it tags a specific detail that is unique across the code base. So you can literally say detail A maps to this line of code in this file, right? So you can see exactly which timeout caused it.
And if you chain those in headers, so in your response headers you always say the details of this response as forwarded by the backend, and the details of this one was whatever, then either by looking at access logs or at response headers, if they make it to you, you can see exactly which proxy in the chain failed and exactly what line of code it failed at. Then you still have to debug, wait, why did this one time out and these two not? But at least you've root-caused where it happened, if not why it happened.

And I want to add: thanks to Envoy and the service mesh projects, these problems actually surface a lot more easily than they used to, right? Because now you can observe what's going on with your microservices a lot more easily. So you can actually see, maybe you have this problem at just 0.5%, where before, if it's 0.5%, you might not even know, and you might just live with it. Because of the service mesh and a platform like Envoy, all these issues surface to the user right away, so they actually have the chance to debug and troubleshoot, like Alyssa said, having the debugging tools to figure out what might be wrong, so that you can tweak that timeout, maybe tweak a retry, and further configure your system to be more tolerant.

What I would add too, which is super interesting, is that when we surface a bunch of these issues to the user, what I've found personally, and there's a fair amount of irony here, is that issues a user may never have noticed previously, very low-rate issues, they now notice, and they think it's a severe problem, right? And then they'll spend a bunch of time debugging and asking questions, and that's fine. There's nothing wrong with having all of this stuff raised and trying to figure out what's going on. But the thing I would add is: with great observability comes great power and great responsibility.
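Circling back to the timeout chains discussed above, one mechanical sanity check, shown here as a toy sketch with entirely hypothetical names and numbers, is that each hop's timeout should strictly exceed the next inner hop's, so the innermost layer fails first and can return a descriptive error instead of a mystery disconnect:

```python
# Toy sketch: check that timeouts shrink as requests move inward through a
# proxy chain (edge -> sidecar -> app), so inner layers time out first and
# can return a descriptive 504 rather than the edge silently cutting the
# connection. Names and values are illustrative only.

def timeout_chain_problems(hops):
    """hops: list of (name, timeout_ms) ordered from outermost to innermost."""
    problems = []
    for (outer_name, outer_ms), (inner_name, inner_ms) in zip(hops, hops[1:]):
        if outer_ms <= inner_ms:
            problems.append(
                f"{outer_name} ({outer_ms}ms) should be > {inner_name} ({inner_ms}ms)"
            )
    return problems

chain = [("edge", 15000), ("sidecar", 5000), ("app-client", 10000)]
# The sidecar's 5s timeout fires before the app client's 10s one, so the app
# would see an opaque failure; the check flags that hop.
print(timeout_chain_problems(chain))
```

In real deployments the equivalent check means auditing the route timeout at the edge, the sidecar's per-route and connection timeouts, and the application client's own deadline together, rather than each in isolation.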
And what I've seen sometimes is that people can get hyper-focused on very small issues and not look at the big picture in terms of overall system reliability. So as operators of the system, this is where, like I was saying before, I don't mean to keep saying don't use this technology, I'm just saying that it is a powerful system, and as you add on all of these layers it becomes more and more complicated. And I just think it's important to realize that there's a lot of intuition here. There's a reason Alyssa has been working on this for like 15 years, right? There's a lot of experience you gain in the patterns of how things break, and in being able to look at the big picture and try to understand: is that really a big issue or not? So I would just add that it gets really complicated, not even just the system, but just trying to take in all of the inputs and understand whether you should even act on them. What should you alarm on? Should you care about this? And in my experience there's no substitute for operational experience here: getting a team together who can operate these things, whether it be something like Istio or some other service mesh or API gateway, and just getting a feel for how things fail and whether it matters. That just takes time.

Yeah, I actually wanna add onto that, because that is a huge problem, and it's not one that any proxy can solve. As Matt says, you often have these really low rates of errors, right? And a low rate of errors is not a big problem, unless it's a consistent 100% error for some user on some platform in some country, right? It gets really hard to figure out: is this a low rate of error overall, but a high rate for some class of users or some class of networks? And that is an art. Our SREs are amazing at it; the reliability engineers and operations people have just developed that experience over time.
I've picked up some of it debugging some number of Google outages over some number of decades, but yeah, there's a science to the basics, and the rest of it is all art and intuition that you develop over time.

What other key measures or thought processes have you seen for success, or lack of success, in going to, say, a service mesh sort of situation? What other things might people watching this panel use as criteria to decide: this is a good way for us to go, or this is maybe not a good way for us to go, we should have fewer services, or a monolith, or something?

Yeah, I would say size is certainly an important characteristic. The second thing I would look at is what programming languages they're using. Do they really have multiple programming languages, or are they primarily using one single programming language, like we have in Istio with our control plane? We have six, seven components, but all using the same language, so it didn't make sense for us. The third thing I would look at is: are the services actually operated by different people? Are they actually releasing the services on different schedules? Because that helps make the decision. The purpose of microservices is to go faster, right? Release on different schedules. But if you're not really doing that, then you're not really enjoying the benefit of microservices, so that's definitely a key criterion to look at. And the fourth thing I think is important is: are they partnering with somebody, or are they consuming open source directly? How much experience does the organization have consuming open source technologies? If they're just consuming Envoy and Istio and other open technologies out there, it does require a lot of experience in adopting open source. So hopefully they have some past experience that would help ease the transition to adopting these technologies.
I guess one thing I would add is not just considering team size, but also, for larger companies, considering where you can join forces. Lynn mentioned the control plane. Well, if you're one team of five people, it doesn't make sense to run your own microservices and run your own control plane; that would take all of your people. But if you have a bunch of different small teams across a company where you could join forces, and have one operations team that handles that, has an escalation path, and builds those muscles, with a bunch of small teams leveraging it, that can be a really powerful model. We tend to have very strong infrastructure teams at Google, and I've been surprised repeatedly, working on the Envoy side, by how many companies are more siloed, right? They have different teams all rolling their own. And again, it's harder to work together, right? To coordinate what rollout is going to make sense, and what security release processes make sense for each team. But if you can do it, you can avoid a lot of repeated work, and you can really build up those strong teams with deep expertise. It can be really helpful. Obviously that's not relevant to the small startups that don't have the teams to actually do it, but for medium and large companies it's for sure something to think about.

Yeah, those are both super answers. The only other thing I would add is that when it comes to the API gateway side of things, north-south, or the service mesh, east-west, I would also ask people to really look at what problems they're actually trying to solve. By that I mean: is your system actually real time? Do you have to have low latency? Do you have to have synchronous API calls? Sometimes it's worth stepping back and trying to understand: could you develop your system on top of pub/sub? Could it be lossy, right?
There can be other architectural patterns that are just much simpler than trying to get a giant microservice call graph to actually work. It's a very complicated technical thing that is not easy to do. So my advice to people, even apart from the stay-on-your-monolith, don't-adopt-microservices-if-you-can advice, is to look at the actual application use case and ask: do I need real-time communication? Does it have to be synchronous? Are there simpler ways I can do this that tend to be easier to debug and a bit more reliable? That would be my other main high-level piece of advice.

Fabulous. Well, I think the advice the three of you have just offered, about size and expertise and looking at the architecture issues, is great for the audience to go away with. I'll give you all a final chance to say anything you wanted to say, other than thanks and, you know, so long and thanks for all the fish, and then we'll call it good and answer questions from the audience during the actual panel broadcast. So is there anything you'd like to sum up with, Alyssa?

I guess I would say that if you have done the analysis Matt suggested and decided that microservices are the way to go, or Envoy for your ingress traffic, or you decide to adopt Istio, my recommendation is to really try to get involved in the community. Open source is an amazing thing, because you can be like, hey, I don't have to write my own proxy, I'm just gonna take this one, and you can just take it and use it, and that's great, and many, many people go with that model and it works fine for them. But I've seen a lot of people come in late in the game and be like, wait, I need help figuring this out, or how do I add this thing, or I wanna do this flourish.
And if you get involved in the communities early, both Istio and Envoy have really strong developer communities: we've got a great maintainer team, there are active Slack channels. The more you get involved in the community, the faster you get your questions answered, the easier it is to get your features landed, the easier it is to basically get everything working the way you want it to work. So it's this kind of bootstrapping thing, where you put in that initial effort and you get way more out. It's an option not everyone has the time for, not everyone has the cycles, especially if you're early on in a startup. But once you get to the point where you have some breathing room, it really is valuable to be proactive and just reach out and be like, hey, let me help out with a little bit of tech debt, and then I get my brownie points, and now people answer my questions faster. Because it's all volunteer work, right? And the more you volunteer to help, the more people are gonna volunteer to help you. And we always love help too.

How about you, Lynn?

I would say definitely spend time to understand whether you need a service mesh. I think it's really worth spending that time upfront to do thorough evaluations. And then once you decide which project you're gonna land on, I definitely agree with Alyssa: spend that time to work with the community, and don't be frustrated, because we are all volunteers, so it might take a little bit of patience, a little bit of pain. You may need to attend the community meeting or a work group meeting, so don't be shy; we definitely welcome contributions.

Yeah, I don't have a ton to add other than those two statements, which are great.
The only thing I would add, and I've talked about it during this panel, is that I think as an industry we're not that great at evaluating what I call total cost of ownership: looking at how much it would cost if I'm gonna pay a vendor, or how much it's gonna cost if I use this open source and have to actually maintain it myself, or if I'm gonna build my own control plane, all of these things. So I really encourage people, when you're trying to solve your problems, to look at the entire menu of options, from using open source directly, to paying a vendor, to hiring people internally, and to try to be realistic about what the costs actually are. I think things will wind up in a much better place for most organizations if they spend more time doing that, because then they'll have fewer surprises later.

Fabulous. Well, I'm gonna summarize your three final points. Matt said look for simpler options, which I like as a general life philosophy and definitely when you're writing software. Lynn said carefully consider your requirements, which again I think is a fabulous thing when you're doing software. And Alyssa said make friends early, which I think we all learned when we were in kindergarten. And with that, I thank all of you for being part of this panel. I think it's gonna be great when we finally broadcast it at KubeCon, so thanks so much for being part of it.