Good morning, ServiceMeshCon, and welcome to "Service Mesh: The New Single Point of Failure." In this talk we're going to discuss some of the trade-offs service meshes have made in their implementations and how those impact you as users. I'm here representing Linkerd, Mitch will be representing Istio, and Sabeen will be chatting with us about Consul. Sabeen, why don't you tell us a little bit about yourself?

Hi, I'm Sabeen Syed, joining you here today from my closet. I work at HashiCorp as an engineering manager on the Consul service mesh. I've been at HashiCorp since the beginning of this year, and before that I worked in the infrastructure industry for about five or six years. I'm super happy to join you all today and have this conversation with Thomas and Mitch.

Thanks, that's fantastic. Mitch, can you tell us a little bit about yourself?

Hi, my name is Mitch Connors, and I'm a software engineer with Google in Seattle, Washington. For the last two years I've been working on the Istio project, in particular on user experience, where I make it my goal to deeply understand the way our users interact with Istio, to understand its strengths as well as its weaknesses, and then to turn those learnings into meaningful features that users like you can take advantage of. KubeCon is one of my favorite ways to meet and interact with users: to hear about the creative ways you're using our software, ways we might never have anticipated, and to understand how we can better serve you in the future. So I look forward to chatting with all of you, to hearing your questions at the end of the session, and to interacting with you in the lobbies. Thanks.

And hello everyone, my name is Thomas Rampelberg. I'm a software engineer here at Buoyant, the creators of Linkerd. Linkerd is a super fast, super lightweight service mesh that's really focused on user experience. Here's a bit of an agenda for what we're covering today; I like to call it "turtles all the way down." First, we're going to talk a little bit about complexity, or the power of saying no: what it takes to build out your service mesh in a way that walks the fine line between features and something that isn't particularly usable for folks. Then we'll chat about how to operate your service mesh and what that takes. And finally, perhaps my most passionate subject: what to do when the service mesh breaks, or how to debug and manage your service mesh in production.

All right. Complexity? I've never heard of you. Complexity is a big part of service meshes, and obviously it's something we want to fight against. Since we're talking about a new single point of failure, a complex single point of failure is ten times worse than a normal one, though obviously we'd prefer not to have a single point of failure to begin with. To get us started, Mitch, I know this is a subject you're particularly passionate about. Tell us a little bit about how Istio has been managing complexity. Is there anything you've done recently that you feel has been a really big win?

Yeah, you know, as I think about service mesh in general, what it does really well across the board is take complexity out of your application layer, which is great.
Your developers don't need to think about all sorts of complex network topologies and problems. The downside, though, is that we tend to consolidate all of that complexity into one layer, and that layer can look very difficult to manage, and fragile at times. One thing we learned in Istio is that even though we're developing a product for microservices, and we've all got a pretty good background in running microservices, it was just too much complexity to have your microservices management platform run as microservices itself. Asking our users to run microservices on our behalf, when we're not actually the ones operating them, and we're not even necessarily talking to the people who are operating them, was just too much complexity. So a year ago we moved to a monolithic model for our service mesh, and we've seen substantially reduced friction in upgrades and maintenance for our users as a result. It's been a counter-intuitive, maybe eyebrow-raising move for some, but I think it's been great for our users.

The other area where we're always fighting complexity is project sprawl. We're always hearing from various users: "we'd love to see a full-fledged canary feature," or "we'd love a full platform for running software and services where we can see all of the knobs in one place in a user interface." We made the very intentional decision not to be a platform. It's an interesting decision, because it means Istio is not all you need to run your services; we are not a one-stop shop. The intention there is that we want to be part of an ecosystem. We know we're not going to be the best at everything, so we strive to be the best at a very core set of service mesh features and then allow various other technologies to come in and fill the gaps, such as canarying. There are a number of technologies out there today doing a great job with that, built on top of the service mesh.
So that's kind of how I think about complexity and how we manage it in Istio.

I really love the microservices point. I've long argued that microservices don't actually solve a technology problem; they solve a people problem. And if you've got microservices solving a people problem, where teams get to own their own destiny, a service mesh owned by a single vendor, aka the Istio team, doesn't make very much sense as a set of microservices. It's fascinating to see how that all fits together. I love your point about not being a platform, too. It's too bad that Kubernetes doesn't come with a service mesh and canary deployments out of the box, but especially once we start talking cloud native, there are so many folks with so many different, interesting, unique use cases that the only way to have a great solution is to make it so the community can build badass stuff. On the Linkerd side in particular, we don't do ingress, and we don't do ingress because there are a bunch of amazing ingress controllers out there. My KubeCon talk is going to be with the Ambassador folks, which is a fantastic solution for us, because they're going to do ingress better than we ever will.

You know how that works. Sabeen, tell me a little bit about Consul and how you've been managing complexity there.

Yeah, piggybacking off what both you, Thomas, and Mitch said: one of the pieces we decided to say no to, that we didn't want to build, was an APM solution. We did not want to store metrics in Consul, but we still wanted our users to be able to visually see their workflows and metrics in the Consul UI. So what we have built is a scalable JavaScript plugin that allows users to query data from the APMs they already use. APM, for those who may not know, stands for application performance management tool; Prometheus and Datadog are examples. Right now we have a plugin for Prometheus, so our users can query data from there, and in the Consul UI we have a topology view that shows the data model and supplements it with metrics coming from that APM: things like requests per second, latency, and error rates.

So the dashboard for Consul can actually pull metrics out of Datadog?

Correct.

That's super cool.

Yeah, right now our first plugin is for Prometheus, and we can add plugins for other APMs.
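In practice, that wiring is a Helm values change. Here's a minimal sketch assuming the consul-helm chart's `ui.metrics` settings; the Prometheus address is illustrative:

```sh
# Minimal sketch: point the Consul UI's topology view at an existing
# Prometheus (the "APM plugin" discussed above). Assumes the consul-helm
# chart's ui.metrics values; the Prometheus URL is illustrative.
helm upgrade --install consul hashicorp/consul \
  --set connectInject.enabled=true \
  --set ui.enabled=true \
  --set ui.metrics.enabled=true \
  --set ui.metrics.provider=prometheus \
  --set ui.metrics.baseURL=http://prometheus-server.monitoring.svc
```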
Yeah, in Linkerd we looked at Prometheus and said: everyone in the world is integrating with Prometheus, so we're just going to stick stuff in Prometheus and hope for the best. Mitch, this is actually an interesting conversation given Istio's history. Tell us a little bit about Istio's relationship with Prometheus and where that's gone.

I'm curious where you'd like me to go with that one. Are you thinking about Mixer and what we've done with that over time? Are you thinking about our default installs with Prometheus and how we've removed that from the project?

I didn't know that you'd removed Prometheus from the project. That's a new one for me.

Yeah, we have, and it's because we love Prometheus, so it's a little bit of a surprising twist. We found that users already have Prometheus installs, and they don't want to fiddle with federation levers trying to get our Prometheus install to work well with theirs, and they really don't want our Prometheus install to overwrite theirs. So we found it was best just not to install it, and to give them clean instructions on how to plug their service mesh into their pre-existing Prometheus install.

We're actually about to release a stable version of Linkerd that comes with a bring-your-own-Prometheus solution, for a lot of those same reasons. But no, I was actually thinking more about Mixer, or the de-Mixer-ification that has happened recently, especially since we were talking about complexity; I thought that was an interesting piece. Generally speaking, especially as cloud native projects, I think we've all realized that metrics are better served by the community, to go back to the theme we keep returning to.

Yeah, Mixer was an interesting architectural change, because originally Mixer was one of the key motivations for running Istio as microservices. Mixer was the only layer that could logically block the data plane, so it had to run extremely light and extremely fast, and that was because of policy decisions. We also put telemetry there because of the close coupling with the data plane, but that was more incidental; policy decisions were really the hard part Mixer was handling. Fortunately, over the last year the developments in WebAssembly have made it so those decisions can be written in arbitrary code that executes inside the data plane, with zero latency in terms of network cost. That's what enabled us to get rid of Mixer and ultimately move toward a monolithic... not monolithic microservices, a monolithic service mesh. There we go.

It's super interesting, because it's a monolithic service mesh, but with the WASM plugins in Envoy it's almost like you've got the ultimate distributed system.

Yeah, I don't think anyone has fully realized the capabilities, including the Istio project, of having arbitrary code that just runs ubiquitously in your proxy. I'm really excited to see the developments that come out of that over the next two years, especially now that WebAssembly support has been merged back into upstream Envoy.

I have done horrible, embarrassing things with eBPF that are basically the same. Once you give someone the ability to do anything, the world is their oyster; the hacks I have done, it's unfortunate. It will be a wonderful, terrible world, I think.
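To make that concrete: in newer Istio releases, custom data plane code is attached through a WasmPlugin resource. A rough sketch, with the plugin name and module URL hypothetical:

```sh
# Rough sketch of attaching custom WASM code to the data plane.
# Newer Istio releases expose this as a WasmPlugin resource; the OCI
# image URL and plugin name here are hypothetical.
kubectl apply -f - <<EOF
apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: example-policy                              # hypothetical
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  url: oci://registry.example.com/policy-filter:v1  # hypothetical module
  phase: AUTHN                                      # run before Istio's own auth filters
EOF
```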
Sabeen, I think you mentioned you had something else you wanted to bring up around complexity, too.

Yeah, there's one other thing we decided not to go forward with, and that was supporting multiple Consul installations on one Kubernetes cluster.

Ah, yes.

Some users might want to have two environments on their Kubernetes cluster, say a dev environment and a production environment, possibly on the same cluster. That can result in a number of issues, like security issues and performance issues. On the security side, if you have two environments on your cluster and two Consul installations as well, and there's a Consul client you want to register with, for example, your dev environment, but you register it with production by accident, that's a huge, very obvious security issue. Along the same lines, there are performance issues: if your dev or staging environment is running a ton of tests and all of a sudden is using a whole bunch of resources, that can have an effect on your production environment. We basically wanted to make sure users don't run into these issues, so we decided to limit it to one Consul datacenter, one Consul installation, per Kubernetes cluster.

Yeah, we started out supporting multiple Linkerd installs on a cluster, less for the dev/prod use case, which is a really good one, and more for multi-tenancy, which I'm going to ask Mitch about in a second, because I'm super excited to hear about it. But we ran directly into the buzz saw of CRDs, because you can't version CRDs in any reasonable fashion. Having two installs, especially of Linkerd, when you've got a couple of CRDs starts to become totally untenable: how do you do the upgrades? How do you keep from stepping on other people's feet? So at least on our side we've really been pushing multi-cluster as the way to do isolation and multi-tenancy instead. And I will point out that I personally would never operate dev and prod in one Kubernetes cluster; that's crazy.
So, Mitch, I'm super excited: tell us a little bit about Istio and multiple installs on a cluster.

Multiple installs on a cluster has been something we've been working toward for the last fifteen months or so. It's been in early access since the 1.5, 1.6 timeframe, but I think it's safe to say that in 1.8 the standard way of upgrading Istio actually involves running multiple installs concurrently.

Interesting.

Because we've consolidated that single point of failure into one resource, the Kubernetes Deployment model, updating a Deployment and kind of crossing your fingers that health checks fail if something goes wrong just wasn't enough security and safety for our users. In particular, when you upgrade a control plane in a service mesh, the health failure often doesn't happen within the control plane itself; it happens within the proxies. And Kubernetes doesn't provide the levers and granularity we needed to really detect failure early when moving to a new control plane. So with revisions, what we recommend users do when moving from, say, 1.7 to 1.8, is install a separate 1.8 control plane that initially does nothing but serve ingress traffic. It uses the same root of trust, the same mTLS, and everything else as the other control plane; it just has almost no proxies connected to it. Then, as you get comfortable, you cut your proxies over one namespace at a time, until you're finally able to shut down the old revision of Istio. I don't know that we envision a lot of use cases where people permanently run multiple revisions in one cluster, although it's possible; we're just not certain it's useful. But as far as upgrades go, it's been a very beneficial tool, and I think it has increased the predictability of the upgrade process a good deal.

Yeah, upgrades in general are hard. In the Kubernetes ecosystem I'm really excited about all the folks working on workflow-related projects, because, to your point, upgrading a service mesh really needs gates. It's not just upgrading the control plane; it's also making sure the proxies are at the right version, and then, okay, I can upgrade the control plane this far, but now I need to make sure the data plane gets rolled before I do the next steps. And that's a tough, hard problem. Distributed systems, I hear, are challenging for some reason.

Yeah, you almost wish you had a service mesh to run your service mesh on. It's turtles all the way down.
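In command form, the revision flow Mitch describes looks roughly like this, assuming recent istioctl revision support; version tags and the namespace are illustrative:

```sh
# Condensed sketch of a revision-based upgrade. Flags per recent
# istioctl; versions and the namespace are illustrative.
istioctl install --set revision=1-8 -y        # second control plane, 1.8

# Cut one namespace over: swap the injection label, then roll the pods
kubectl label namespace my-app istio-injection- istio.io/rev=1-8
kubectl rollout restart deployment -n my-app

# Once everything is migrated and healthy, retire the old revision
istioctl x uninstall --revision=1-7
```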
All right, it's super awesome to hear how we've all been managing complexity in our own service meshes, but let's talk a little bit about a subject that's near and dear to my heart: actually operating the mesh. Did you say someone needs to keep the mesh up? How do you go and tackle actually managing this thing in a cloud native world where there are all kinds of operators and users? Sabeen, tell us a little bit about the Consul side of things.

Yeah, on the Consul side, specifically around getting started, getting the mesh up and running, there are a few things we do to make that easier. The first is automating the ACL setup. ACLs are access control lists; since Consul exists outside of Kubernetes, it has its own authorization solution. Basically, ACLs are to Consul what RBAC is to Kubernetes. We had a choice here: we could have let our users create their own ACLs and do that manually, but we wanted to make that piece easier, so we basically set up a bridge from Kubernetes RBAC to our Consul ACLs. This is all done automatically, so our users don't have to manually set up anything for the ACL piece.

That's really cool; this is something I didn't know anything about. So you actually have an RBAC bridge: Kubernetes-native RBAC calls get translated into Consul ACLs?

Correct, yep.

Oh, that's cool. I went deep on the GKE RBAC plugin once and was, I'm not going to say horrified, I'm going to say impressed by the effort the engineers went to. I can only imagine how much that took. Since we're talking a lot about complexity, it's so important to have a single store, especially for something as important as RBAC. It's really cool that you went to that effort.

All right, I'm going to do a little bit about Linkerd here. Probably one of my favorite things about Linkerd is our check command, and the reason for that is that even though we're talking cloud native, and cattle instead of pets, Kubernetes clusters are snowflakes, especially once you start talking across organizations. For example, we just got done talking about the Consul RBAC-to-ACL mapping; who knows how that maps on a cluster where somebody has installs or not? So if I see two users reporting a difficulty setting up Linkerd, we go create a check for it. We will actually do validation before you install on your cluster, to say you've got the right RBAC to be able to create everything, and then we'll do it again after you install. None of the checks at the moment are what I'd call integration tests; they just make sure that readiness passes, liveness passes, things like that. Though we did have a Summer of Code student work on some conformance tests, which I'm very, very excited about, to actually test real workloads on Linkerd once you've got the install done. I did this purely out of laziness, because I wanted to be able to give folks docs links for how to fix their problems once they run into them, and having it all automated into the install flow has been just amazing from our perspective.
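That flow, roughly, for anyone who hasn't seen it; commands per current Linkerd releases:

```sh
# The validation flow described above: pre-flight checks (RBAC, cluster
# version, and so on) before install, full health checks after.
linkerd check --pre                  # can this cluster run Linkerd?
linkerd install | kubectl apply -f -
linkerd check                        # is the installed mesh healthy?
```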
Mitch, do you want to talk about the Istio install flow and some of the cool stuff you've done there?

Yeah. Something interesting about install is that we're in a Kubernetes world, so everything needs to be declarative. But the learning is that upgrades are not a declarative operation. I don't ever want to declare the entire state of my mesh in the next minor version; instead, I want to do a mutation. I want to say: I'm on Istio 1.x, I want to be on Istio 1.x+1, and that's the only change I want to make; don't touch all the other knobs and bells and whistles. So we've had to take our time and learn some hard lessons about when we ought to follow declarative semantics and when it's better to use more mutational semantics.

One thing we've done over the last year to assist with that, though it doesn't directly address the problem, is introduce analyzers, which sound a little like your checks. It's a suite of checks, not exclusively focused on upgrade or install time; it can actually run during runtime. We bake them into the control plane as well as into the CLI. You can run analyze and it'll list probable problems, or maybe definite problems, related to your configuration. My favorite one: I always fat-finger the gateway name when I'm mapping a VirtualService to a Gateway. I mean, just every time. And the analyzer catches it really well. And I'm excited to say there's now even a way to have the control plane write those analysis messages out into the status field of the objects. So if you run a kubectl get on an object that has a problem with analysis, you should see it right there in your YAML.

That's cool. Not enough folks use Kubernetes events. One of my favorite projects, since we talked about canarying earlier, is Flagger; Stefan puts events in for all of the canary progress, and you see it on the actual resources. It's so useful. Analyzers are super cool. We've actually had a bunch of users just take our check command and stick it into their alerting workflow, which drives me crazy, because we built check as a user interface. But man, it's cool that folks are doing it that way. That's really cool.
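For reference, the analyzers Mitch describes are a single CLI call; the finding below is paraphrased and the names are illustrative:

```sh
# Run Istio's configuration analyzers against a namespace.
istioctl analyze -n my-app
# Example of the class of finding mentioned above (paraphrased):
#   Error [IST0101] Referenced gateway not found: "my-gatewy"
```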
Here we go. So, Sabeen, we had been chatting a little bit beforehand, and one of the things you mentioned was the approach Consul has around federation. It's an interesting feature set that, at least, we don't have in the Linkerd world. I'm super interested to hear more about that.

Yeah, for sure. One of the things Consul does, or what we have done, is try to ease the process of federating two datacenters, or however many datacenters you have. For the folks who don't know, federation is just datacenters being able to communicate with each other, and setting that process up, setting up federated datacenters, is hard; there's a lot of config data required. So we have simplified this process by giving our operators the ability to get a single secret from their primary datacenter. They can take that secret, do a kubectl apply in their secondary datacenter, then a helm install, and they're good to go: those datacenters are federated, up and running, and communicating with each other.
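A sketch of that flow, assuming the consul-helm federation secret; the secret and context names are illustrative, and the Helm values files are elided:

```sh
# Copy the federation secret from the primary datacenter's cluster to
# the secondary's, then install. Secret/context names are illustrative.
kubectl --context dc1 get secret consul-federation -o yaml | \
  kubectl --context dc2 apply -f -
helm install consul hashicorp/consul --kube-context dc2 -f dc2-values.yaml
```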
We ship with uh, yeager install out of the box Yeager installs are Insanely complex to productionize not because yeager's a complex product inherently, but you've got storage as part of it and so our uh helm chart is uh Forcibly minimalized to the point where it feels awkward if you try and put it into production because you're you should go and use The Yeager operator if you want to get that into production for sure um Mitch is there anything you else you want to kind of say around uh Uh the kind of cluster operation side of things No, I think we've covered everything Before we get into this next section. I want to uh tell a little bit of an anecdote which may uh stick my age as being an old person But i'm gonna say it anyways, uh back in the day Whenever something goes wrong Really, there's only one reason it went wrong. It's dns. It's always dns is fault 100 of the time it's always dns is fault and uh One of the really interesting things that we've seen at least on the linker d side is that uh now that you've introduced a data plane with a service mesh It's not actually dns's fault. Okay, it is 50 of the time but the service mesh gets blamed 100 of the time and so At least For us We've really needed to design for failure and think about what that means not just for the control plane and operating the service mesh But what it means for the developers the user of the service mesh as well Mitch i'm sure that you've got some interesting insight from the istio side to share along those lines Tell us a little bit Yeah, that that makes me think of kind of two stories within the project You know historically i've only been on the istio project for two years now and before that I owned the api on a large enterprise grade network appliance Uh, and we became the catch-all for everything that went wrong with that network appliance It's like well it went wrong after I told the api to do it and therefore it's definitely the api's fault Uh, and it took a while to really embrace that role But I think what we learned is that that can be done really really well Uh, what it does is it forces you to show to demonstrate conclusively That your layer is doing or behaving exactly as expected And it raises the bar of quality to such a high degree that eventually Your users stop making that assumption and so I think it's just a matter of maturity in service mesh It's something that we're pursuing in istio. I don't think that we're there yet One of the ways that we're doing that Um is historically, you know service meshes are distributed systems But we sort of pretend they aren't like oh, I changed the config Therefore my entire service mesh now has changed config. Well, no You changed the config in istio's case that means it's in kubernetes And then it has to propagate into the control plane then the control plane has to propagate it out to every data plane Which might be in this cluster might be in another cluster a different region Um, so we're raising the visibility of propagation of changes throughout the service mesh So that our users can very easily and very visibly tell Yes, this component is behaving as expected. 
We've definitely run into that in Linkerd. We ship with a Jaeger install out of the box, and Jaeger installs are insanely complex to productionize, not because Jaeger is an inherently complex product, but because you've got storage as part of it. So our Helm chart is forcibly minimalized, to the point where it feels awkward if you try to put it into production, because you should go use the Jaeger operator if you want to get that into production.

Mitch, is there anything else you want to say around the cluster operation side of things?

No, I think we've covered everything.

Before we get into this next section, I want to tell a little bit of an anecdote, which may peg my age as an old person, but I'm going to say it anyway. Back in the day, whenever something went wrong, there was really only one reason it went wrong: it's DNS. It's always DNS's fault, 100% of the time. One of the really interesting things we've seen, at least on the Linkerd side, is that now that you've introduced a data plane with a service mesh, it's not actually DNS's fault. Okay, it is 50% of the time, but the service mesh gets blamed 100% of the time. So, at least for us, we've really needed to design for failure and think about what that means, not just for the control plane and operating the service mesh, but for the developers, the users of the service mesh, as well. Mitch, I'm sure you've got some interesting insight from the Istio side to share along those lines. Tell us a little bit.

Yeah, that makes me think of two stories within the project. Historically, I've only been on the Istio project for two years now; before that I owned the API on a large enterprise-grade network appliance, and we became the catch-all for everything that went wrong with that appliance. It's: well, it went wrong after I told the API to do it, and therefore it's definitely the API's fault. It took a while to really embrace that role, but I think what we learned is that it can be done really, really well. What it does is force you to demonstrate conclusively that your layer is behaving exactly as expected, and it raises the bar of quality to such a high degree that eventually your users stop making that assumption. So I think it's just a matter of maturity in service mesh; it's something we're pursuing in Istio, and I don't think we're there yet.

One of the ways we're doing that: historically, service meshes are distributed systems, but we sort of pretend they aren't. "Oh, I changed the config, therefore my entire service mesh now has the changed config." Well, no: you changed the config, which in Istio's case means it's in Kubernetes; then it has to propagate into the control plane, and then the control plane has to propagate it out to every data plane, which might be in this cluster or might be in another cluster, in a different region. So we're raising the visibility of the propagation of changes throughout the service mesh, so our users can very easily and very visibly tell: yes, this component is behaving as expected; I can move on and start troubleshooting other sections of the application. We don't want them spending a lot of brain cycles on that.
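One concrete form of that visibility is Istio's per-proxy sync report from the CLI; the output below is abridged and illustrative:

```sh
# Show, for every proxy in the mesh, whether it has acknowledged the
# latest config pushed by the control plane (output abridged/illustrative).
istioctl proxy-status
# NAME                  CDS      LDS      EDS      RDS      ISTIOD
# web-7d4f...my-app     SYNCED   SYNCED   SYNCED   STALE    istiod-1-8...
```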
Yeah, we have a little command that'll do exactly that: it reaches out to the Linkerd proxies and dumps their service discovery state, for exactly that reason. And you're giving me flashbacks to my Istio multi-cluster experiment, where I had to go figure out how to dump service discovery information from Envoy. It's super interesting that you went that direction, because that defensive "this is healthy" posture is a big reason why we put check in as a command. It's that defensive stance: these are all the working things; if you have something not working, it might actually be a bug in us, because, let's just be honest, we have bugs, or it's something you need to go look at in your own setup.

Sabeen, tell us a little bit about how Consul goes about the debugging side of things.

Yeah, from that aspect, what we try to do when something is failing is have it be seen in our Consul UI. We automatically added health checks to each of our components, and when something fails you'll see it as red in the UI. I feel that really helps our users just automatically know: okay, this piece is failing, I see it, there's something I need to do.

Have you hooked up proactive alerts from that? This is a thing I've wanted as a feature for Linkerd forever, and we just haven't had the bandwidth and the cycles to put it in yet. Is it not just health checks, but you can actually get an email as well?

Right now there's not an email feature, but I believe it is something that will probably come in one of our next versions.

I'm sure. Like I said, it's definitely something I've wanted to do for a long time now. Debugging the mesh is perhaps one of the subjects I'm most passionate about. In Linkerd we've kind of had to throw the kitchen sink at it, for lack of a better term, and attack it with a multi-level solution. We've got one command called tap that actually lets you do, for lack of a better term, a Wireshark on your entire cluster: you can tap and see the live requests, and that's pretty helpful. But it's only really helpful to show when someone's app is misbehaving; if the mesh itself is misbehaving, you run into quite a few issues. So the next level down is our debug sidecar, and the ephemeral containers feature in Kubernetes, which I think landed in 1.16 and might be beta in 1.19, but no one quote me on that. The debug sidecar lets you add a sidecar that comes with all the tools to debug the mesh, so we have tshark and all of that user-land tooling you can use to figure out the details. And under the hood they're just plain old containers.
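For the curious, here is roughly what those two paths look like; the pod, namespace, image, and container names are illustrative, and the annotation is Linkerd's documented opt-in:

```sh
# 1. Linkerd's debug sidecar: opt in with this annotation on the pod
#    template, and the injector adds a container with tshark, tcpdump,
#    curl, and friends:
#      config.linkerd.io/enable-debug-sidecar: "true"

# 2. Kubernetes ephemeral containers, via kubectl debug in newer
#    releases. Pod, namespace, debug image, and target container
#    names are illustrative.
kubectl debug -it web-abc123 -n my-app \
  --image=nicolaka/netshoot --target=app
```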
I think, yeah, it's the onion, right? As soon as you start to get into the debugging side of things, you start out at that high level, we know something's wrong, there's an alert, and then you kind of have to pull the layers of the onion back until you get down to the core problem. I remember running into an issue on GKE where they had shipped a kernel that deadlocked when you used the SO_ORIGINAL_DST socket option. And I'm picking on GKE, but it was a problem anybody could run into. Man, talk about peeling the layers back, when your node deadlocks because you went and installed Linkerd. It was a unique experience, I tell you what. Mitch, do you have anything around peeling the layers back on the Istio side of things and digging in deeper on debugging?

Well, I'll give a guilty confession, and that is that I am a huge fan of tap. That was a really cool tool to see; I love the way you've pulled that off. I think one of the things we like to see in Istio, and we talked a little bit about what we intentionally don't develop, is that there's been great development in the ecosystem of a couple of different tools for debugging your application using a service mesh. Now that the network is completely managed, that actually gives you some really cool superpowers. If you can assume the network is working properly, if you can go through the distribution status checks and analyzer checks and everything else to say "my service mesh is working," well, now you can get into your application in really interesting ways. The one I've played with the most has been Squash, from the team at Solo.io. Unlike ephemeral containers, which push your debug tools onto the Kubernetes node, this allows you to run a microservice locally on your desktop and have it connected, using Istio, into your service mesh. So it's participating in the mesh as though it were running in Kubernetes, but you can actually still have it plugged into your IDE, stepping through line by line with local debugging. It's a superpower I've wanted for years.

Oh, it's the ugly little secret no one talks about when you start doing this cloud native thing: if you're on a cluster, how do you do remote development? Okay, the Java folks have had it forever; those of us back in the stone ages of Go and languages like that are quickly coming up to speed. But it's definitely... I have a superpower, it's called print, and that's how I do all my debugging. I know I should come into the world of IDEs and real debuggers, but I've never been able to get there.

Well, you should check out Squash. I think it can actually run on top of Linkerd as well; I'd have to look through the docs.

I'll definitely have to go check it out; that's a new one for me. When you mentioned it, the tool that came to my mind first was ksniff, which is a kubectl plugin that will go and set up tcpdumps for you, which is really cool as well. Sabeen, do you have anything else in Consul land to share with us about keeping the mesh up and debugging what's going wrong?

Yeah, so this is more about when there's a bigger catastrophe, a failure where a lot has happened.

Yeah, of course, that never happens.

So on that side: Consul is an interconnected system, and something can always break, which can cause different components to not be able to communicate with one another. At that juncture we had a choice: if that goes down, do we want everything to stop working? We definitely did not want that to happen, so we fail static. Basically, that means that at the time of failure, communication will still occur; the configuration that was set at the time of failure will continue to hold, and all traffic will continue to be routed. This basically allows the operator to figure out what's going wrong while everything is still up and running, in the sense that traffic is still flowing; it hasn't all failed, and it gives them that time.

That's such an important part of designing for failure, especially with a data plane: you're in the path of all of the traffic, and if your control plane starts to make bad decisions, you've got to go and do something. It gets especially interesting, in my opinion, once you start figuring policy into it. With Linkerd we basically fail open, to make sure the requests go through no matter what. But once you've got that policy in there, how do you know which connections are valid and which are not, and how do you start putting all of those pieces together? I guess we've mostly talked about debugging so far, so thank you, Sabeen, for getting us onto designing for failure. Mitch, do you have any cool stories to tell us about the trade-offs or interesting pieces Istio has put in to protect itself from more global failures?

Well, I will say that we have had to rethink what an outage is and what a failure is, because we share the same failure model Sabeen just talked about, which is fairly common: a split between the data plane and control plane. Your data plane should survive when your control plane is dead. So initially that was considered to be a non-outage: your control plane is down, but your application traffic is still flowing, so everything should be healthy. But when you're really talking about a microservices world, where new endpoints are being added to services all the time, where endpoints are being marked as unhealthy relatively frequently, and sometimes that's done by the control plane while other times it's happening in the data plane, it's sort of like: the good news is you've been driving a car at 200 miles an hour and you didn't hit a brick wall. It's still going, but there's no steering wheel. So it's still a very important problem for us to look at, and we've had to re-evaluate how we qualify an outage and how we interact with users and customers and say: okay, your traffic is still flowing, but this is still a very serious incident.

Oh man, that's a really good way to put it. It's that "yes, Mr. Customer, I understand you're kind of down, but not really down." Is this a serious outage or a semi-serious one? It's one of those places where yes-or-no is a much easier question, especially if you've got the checklist together.
It's super cool to think about.

For the Consul service mesh, we would love to continue the discussion. If you all have any questions, feel free to ask us on our Discuss forum, or check out our docs at consul.io/docs. And we would love it if you were interested in contributing to our repos: we have the Consul repo, our Consul on Kubernetes repo, and our Consul Helm repo. Thank you very much.

One of the things we hear very frequently from users who are interested in getting involved in the Istio community is that they don't feel they have the technical chops necessary for all of the details and complexity of developing an Istio feature. Well, in the usability group, one of the key things we're looking for is insight into user habits and user experience: understanding what the process of upgrading Istio, or troubleshooting Istio, is like for a user. So the good news is that a lack of technical chops for the deepest, darkest corners of Istio is actually a prerequisite to contributing in this area. If you're interested in getting involved in the community, I highly recommend starting with the user experience working group, where we would love to hear what you're using Istio for and what your day-to-day life interacting with Istio is like. You can find the links on the slide to join us; we also have a community meeting where we're interested in hearing all sorts of different use cases for Istio. Hope to see you there soon.

Oh, that's really great. This has been a fantastic conversation; thank you, Mitch and Sabeen, I've really enjoyed it. Finally, to call out Linkerd and our community: we'd really love everyone who's interested to come join our community, get started, and check things out. I've got GitHub, Slack, and Twitter links up there, and we'd love for you to join us however you can. This is a great community. Thank you, ServiceMeshCon, and have a great rest of your day.