All right, let's go ahead and get started. I'd like to thank everyone who's joining us today. Welcome to today's CNCF webinar: best practices for deploying a service mesh in production, from technology to teams. I'm a business development manager for cloud-native technologies at NetApp and a CNCF ambassador, and I'll be moderating today's webinar, which will be a conversation between William Morgan, co-founder and CEO at Buoyant; Anna, a systems engineer at Paybase; William King, CTO and founder at Subspace; and Matt Young, VP of cloud engineering at EverQuote.

A few housekeeping items before we get started. During the webinar you are not able to talk as an attendee, but there's a Q&A box at the bottom of your screen. Please feel free to drop your questions in there, and we'll get to as many of those as we can at the end. Please remember this is an official webinar of the CNCF, and as such it is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code; basically, be respectful of all your fellow participants and presenters. Please note that the recording and the slides will be available later today on the CNCF webinar page at cncf.io/webinars. And with that, I'll hand it over to William.

Thanks, Ariel. All right, welcome everyone. Thanks for joining us. I promise we're going to try and make this exciting and fun, and we're not going to talk about the pandemic for 60 minutes; you'll have a safe zone. All right, so the title of the webinar is "Service Mesh: From Technology to Teams." This is me: I'm William Morgan, CEO of a company called Buoyant, which does lots of service mesh things, including sponsoring and maintaining Linkerd. We build a product called Dive, a delivery platform for service meshes. I have delivered many service mesh talks and webinars, and basically my entire life began with the service mesh and will end with the service mesh fading into obscurity. Hopefully not. So that's me.

Now, the actually interesting people here today. What I want to do is have each of these folks introduce themselves, and the majority of this presentation is going to be a conversation with them. So, Matt, can you tell us a little bit about who you are, what EverQuote does, and what your role there is?

Certainly. Hi, everybody. My name is Matt. I run our cloud engineering team at EverQuote. EverQuote operates a leading online insurance marketplace in the United States that connects consumers seeking services of various types with insurance providers, to help them protect life's most important assets: their family, property, and future. In short, we connect a whole lot of people that want to shop for something with a whole bunch of people that are providing services, using machine learning and smart analytics combined with a fairly sizable web-facing set of services to make that happen. My team partners with our engineering teams (they're my customers), and we build a platform of services and curated patterns that lets our teams manage their own services in production.

Thanks. Anna?

Hi, everyone. My name is Anna. I'm a systems engineer, or infrastructure engineer, at Paybase. Paybase is a payments services provider, specifically for marketplaces, gig economies, blockchain businesses, or any type of fintech.
We are a fintech ourselves and we operate in a very regulated space, which means that for us specifically it's very important to be highly reliable, available, and scalable to support our customers.

And William?

William King, CTO and co-founder of Subspace, a two-year-old startup that just came out of stealth in the last couple of weeks. We are solving lag for multiplayer gamers globally, everything from layer one with lasers all the way up to the highest layers with distributed systems.

Well, thank you, all three of you, for joining us today. We're going to post the slides on the CNCF website, but I'll just point out, skipping ahead to the very end, that I have a couple of links in here. Our esteemed panelists didn't mention some really exciting stuff, so I have a link to Subspace's big launch, or emergence-from-stealth, announcement; Matt has an upcoming talk at ServiceMeshCon; and Anna actually delivered a talk at the last ServiceMeshCon.

And with that, let's take a look at the agenda for today. I want to try and keep this pretty simple. There is a lot of technical content out there around the service mesh, and I don't want to cover that too much. What I want to focus on, and the reason why I asked Anna and William and Matt to join us, is the organizational aspect: once you actually have a service mesh deployed to some environment, how do engineers interact with it? What has to change, or doesn't have to change, about the way the teams are structured? Basically, how do you actually operate this thing from the team and human perspective, as opposed to from the perspective of the computers and the bits and bytes? So that's the focus of the webinar. I'm going to start with a very brief look at what a service mesh is, just so that we're all on the same page. Then the majority of the time will be a fun and exciting panel with our three guests, and we'll have some time at the end for Q&A from the audience. As Ariel said, feel free to type in your questions as you hear interesting things, and we'll do our best to address them in the panel section of the webinar.

Okay, so with that: what is a service mesh? I promise I'm going to keep this brief, and I'm actually going to take a slightly different focus this time. Here's how I'd like to frame it: a service mesh is a tool. It's a tool for giving the platform owners, as opposed to the developers or the business logic implementers, the observability, reliability, and security primitives that are critical for cloud native architectures. The magic is that we do it with no developer involvement. Ideally, that is; there are some asterisks in there. What the service mesh delivers, and the reason it's so useful, is not actually the features themselves; it's the fact that it delivers those features to the platform team in a way that decouples them from the developer teams. So rather than asking the developer teams to all implement TLS in the exact same way, and fighting with the product managers who are trying to deliver business logic features, we can do that at the platform level.
Rather than having instrumentation and telemetry be fragmented across every service, we can give you a consistent layer of telemetry at the platform level, and so on. So that is what a service mesh is.

In practice, they all follow a similar pattern, and I'm going to mostly talk about Linkerd here because that's the one I'm most familiar with. But the reality is, almost every service mesh follows a very similar pattern: you have a control plane, and you have a data plane. The control plane has the machinery around how the service mesh actually works and keeps things together. The data plane is where we do the weird and funny thing, which is that we install a little proxy next to (in Kubernetes terms, inside) every pod, and we wire the traffic through that proxy. Those little proxies, which you now have hundreds or thousands or tens of thousands of, are what we call the data plane, and they are responsible for managing, manipulating, and measuring all the traffic that goes between your applications. There's lots more information about this on linkerd.io. It's an open source, open governance service mesh and a CNCF project (very happy about that) that has been in production at companies like Paybase and EverQuote and Subspace, with all sorts of GitHub stars, which is very important, and a more or less stable release cadence.

Okay, very last section here, just to make this really concrete: what does Linkerd actually do? There's a set of features around observability, a set of features around reliability, and a set of features around security, and as we have our conversation with our panelists, a lot of these features are going to be brought to the surface. On the observability side we have things like service-level golden metrics (success rates, latencies, throughput) and service topologies. On the reliability side we have things like retries, timeouts, load balancing, and multi-cluster support. On the security side we have things like transparent mutual TLS and certificate management. The angle that Linkerd takes in this space is to be as light as possible. It's easy to make things complicated; it's a lot more difficult to make them simple, so that's where we spend a lot of our time and energy, and I guess we'll find out whether we did a good job at that or not. Hopefully that all made sense. If it didn't, the resources slide at the very end of the slide deck has a couple of links to docs and blog posts and things you can read to help inform you about the service mesh as a category.
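To make the data plane idea concrete: in Linkerd's case, opting a workload into the mesh is typically just an annotation on the pod template, and the proxy injector adds the sidecar automatically. Here is a minimal sketch, assuming Linkerd is already installed; the app name, image, and port are hypothetical:

```yaml
# Minimal sketch: opting a workload into the mesh.
# The linkerd.io/inject annotation asks Linkerd's proxy injector to add
# the sidecar proxy to each pod; "web" and its image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
      annotations:
        linkerd.io/inject: enabled   # sidecar is added at admission time
    spec:
      containers:
      - name: web
        image: example/web:1.2.3     # application container, unchanged
        ports:
        - containerPort: 8080
```

Note that the application container itself doesn't change, which is the "no developer involvement" point above: the proxy pair on each side of a call is what gives the platform team metrics, mTLS, and retries without touching application code.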
Okay, so now on to the fun part. This is the question that I really want to address: how does my engineering organization successfully adopt a service mesh? What I'm going to do is ask our three victims, I mean participants, a series of questions, panel style. And hopefully we will all learn something new, because all three of these people have actually deployed a service mesh to production and have to live with the consequences of that decision every day.

Okay, so this is the big list of questions, but we're actually going to go through it one by one. Everybody feeling ready? Yes. All right. Very first question, which of course I missed: how big is your engineering organization, and how is it structured? Matt, why don't we start with you.

Our engineering organization at EverQuote is roughly around 100 people all in, across disciplines. My immediate team is seven or eight; I'm overhead, so I'll say seven, but we're growing. The way we're structured is something that we've pivoted on over the last year. In the past, the team was largely operationally based, where we were sort of just doing what was needed. But over the last year we've really changed to more of a forward-looking team, tasked with building out a platform that allows us to solve problems for our engineering teams so that they don't have to solve them individually. In a way, we're an embedded startup inside a recently public company. My customers are all of the engineering teams, and my product is all the cloud things: their service hosting environment.

Great. Anna, how do things look at Paybase?

For us, our engineering team is unusually small. We have a total of five people: that includes two systems engineers (infrastructure engineers, sorry) and three software engineers. The way the team is split, the way the work looks, is that although the systems engineers maintain the infrastructure, the monitoring systems, and the service mesh, our software engineers are able to deploy new versions of an application themselves without having to make major changes to infrastructure. And everyone gets involved in everything, so the systems engineers can also troubleshoot the application side, and the other way around. We have quite a flat structure as a team.

Great, great. And William, how does Subspace look?

Subspace has about 30 engineers, from infrastructure engineers to connectivity and network engineers all the way through to performance and software engineering. We're about 10 people on the software engineering and SRE side, and the service mesh is owned by that software engineering group.

Okay, great. So we've got a nice range of sizes here: 5, 30, and 100 engineers. All right, the next question. William, I think you got a head start on this already, so why don't you keep going with it: at Subspace, who owns the service mesh, and how does the rest of the organization interact with it?

So, the SREs and the software engineers take the lead on it, developing what our pipelines look like for the different service meshes that we've got deployed. A lot of the other software engineers interact with it by taking the service templates and toolkits that we build from our skeleton best practices. We take the approach that the service mesh and the tooling are the paved highway. If the software engineers need to go off-roading, they can do everything custom, but most look at it and decide that isn't worth it, given the tooling the service mesh provides. So they take the templates and get their service deployed, oftentimes in under an hour.

Right. So if by "own" we mean who configures, makes changes to, and upgrades the service mesh, that would be myself and the other systems engineer on our team. However, I want to add that our setup is such that, after we've deployed it to production, every time we add a new service into our system,
everything is automatically configured to join the service mesh. So the actual management of it is very small. It's very minimal.

Matt, is that your experience as well?

From an "if it breaks, who fixes it" perspective, that would be our team. If it's ownership in terms of who's been a proponent for it and who's rolled it out, that's also my team. However, I think, at least at EverQuote, our applications increasingly are viewing the infrastructure they need as inclusive to their definition of what their service is, whether that's core infrastructure components like storage buckets and the like, where we now have Terraform descriptions alongside the service. The same is true for some of the configuration of the mesh. We have roughly a quarter of our services, the most critical ones, in the mesh now, with adoption happening over the coming, I'd say, quarter and a half. So I would say it's more of a shared ownership model, because the way we prioritized and structured this was done in close collaboration with the teams that needed it. We really leveraged them to make sure that we weren't off in space, so to speak, from a requirements standpoint. But in a classical definition, we own it, I suppose.

Yeah, I left that word ambiguous to see what people would say. What about things like the retry policy for a particular service, or timeout configuration? There are these intersection points between the platform side and the developer side.

There are. I could talk more to it in the "why did we adopt the mesh and how do we roll it out" question, but EverQuote is about five or six years old, seven depending on how you count. So there are, I won't say strata, but a number of different epochs, time periods, with different service architectures, and the most recent few years are primarily Kubernetes-hosted for new services. Before we had a service mesh we needed to do timeouts and retries, so we actually have some services and libraries in use that do some of that. So some of the features that a mesh provides that you mentioned: for many services, it's a way for them to prune things out, but we haven't done that yet. It probably speaks more to a following question, so to speak.

Or Anna, do you have developers who have things like retry policies or latencies or timeouts that they care about, where there's some kind of interaction between dev and platform?

Sorry, I didn't quite catch the question. Can you repeat please?

Yeah. Depending on the organization, there are things that developers may care about that fall into the service mesh realm of functionality, like: I care about how retries are going to work for my service, or I care about the timeouts that callers are setting when they call my service.

Yeah, they do indeed care about latency and retries, but after we implemented Linkerd we haven't seen a big change, or a latency increase, that affected the performance of our system.
In fact, we were able to make other changes at the same time that enabled us to offer the same kind of performance for the system.

Great.

I'd say on our side we're very latency-aware, and we measure everything in milliseconds or smaller. We actually use Linkerd so that a service is able to insist that its clients and consumers are not setting timeouts longer than a certain amount, or other retry behaviors. A consumer is able to be more aggressive and have a lower threshold, but a service is able to say what its expectations are. Basically, from an SRE, SLO-type perspective, we use the service mesh to help standardize that.

So those particular thresholds for an individual service: are they in the hands of the platform team, or are they in the hands of the developer team?

We don't really have much of that distinction here. It's kind of a co-partnership, but it's at the service level, like the namespace architecture level, that we'll go through and agree that this particular service should have these characteristics, and then both sides will implement to that.

Okay, great. So I guess we covered a lot of this already, but is there a notion of a formal platform team? And if so, what are its stated goals, and how does it know whether it's being successful? I think, Anna, in your case, let's start with you, because I think you probably have the easiest answer to this. Well, I won't speak for you.

Well, as I said, our team has a very flat structure. But in terms of making sure we're doing well, I guess it comes down to measuring the performance of the system, and not being paged constantly when we're on call. And we haven't seen an impact ever since we implemented Linkerd, the right version for us; after we solved all of the initial bugs we encountered, we haven't seen performance change either way.

Yeah. Okay, so that's the five-engineer perspective. Let's jump up to the 30-engineer perspective. William, back to you.

Yeah, so for us, we're just getting to the point where we've got SREs who are driving and doing things like Linkerd upgrades or Kubernetes node scale-outs. And it's been great to be able to change the type of node that our entire cluster is using while the cluster is still in a zero-downtime state. The goal for the platform team comes down to being able to know: is our overall platform, and the service we're providing to customers and the gamers, still operating in a nominal state? If it is, then okay, all of these things that require large coordination can keep continuing. If not, they're the ones who are able to at least shine a broad flashlight on where the problem might be. That's one of the things that we've really valued out of the observability: within a minute or so, it's easy to tell roughly where in the service tree the problem is originating from.

Okay. And then Matt, from the 100-person perspective, is cloud engineering the platform team?

I guess there are a couple of different ways to answer that. Inside EverQuote, we actually just finished planning for the quarter, and what's a service versus what's a platform has really been a topic. So, to use the definitions that we've adopted internally, we would say a service is something that delivers value to you.
Like: here's a thing you can call, here's a service I'm running, it provides this value. Whereas a platform is something that you can use to generate value for yourself. I don't know if that distinction is clear. So right now, within this larger consolidated engineering team, we have a number of platform teams. For example, we have a data engineering portion of our consolidated engineering team, and they run a data and analytics platform that people can put data into. The cloud platform that we're running is comprised of some short Terraform modules, Kubernetes clusters, and the service mesh. So in that respect, yes, we're a platform team: we're building something that our teams can just come to and use. I think we're still midway through the full rollout, though, so I'll caveat that by saying we still have some work to do before I would call it a done platform, which to me means I can back away slowly from it and all of the core use cases are covered and documented with examples. We're still more in the stage of "well, here are the dozen or so services on it, and if we're going to add a new one, we'll do what they're doing"; it's not completely self-serve yet. So in that respect, it's not a done platform. I don't know if that's a useful distinction, or I could just say yes.

And then William and Anna, if anything that Matt says sounds crazy, feel free to jump in and just yell at him.

Okay, I'm really not fragile.

All right. So, Matt, actually, why don't we stick with you. I think you've touched on this a little bit, but what was the original motivation for adopting a service mesh, and how has that panned out? Or has it shifted?

Sure. So at EverQuote we had the happy misfortune of having way more load than we expected, a little bit sooner than we expected. Over the last couple of years, we've seen traffic to our consumer-facing services double, triple, and up. So we had a number of monoliths that were being decomposed in the process. In some cases, we actually have great, very discrete, classically defined microservices. In other cases, we have what's really more of a distributed monolith, or somewhere in between. And I don't mean that in a bad way; I just mean we needed to scale some portions more than others, but we still do have temporal coupling or, in some cases, other forms of coupling present, which, again, is not necessarily broken. So our initial motivation for bringing in a service mesh, at the time, was to load balance gRPC. We had grown as an organization to the point where simple REST interfaces, while expedient, became a little more difficult to manage without very strict Swagger definitions or OpenAPI specs, which didn't always happen. So Protobuf and gRPC were chosen as a typed RPC language for many of the new services. But the cloud providers didn't, at the time, have L7 load balancing; many still don't. So we had lots of load and no way to load balance it. That was our initial motivation. There are, however, two other real big reasons why we needed, and still need, a service mesh.
One, some of our teams are in the health space, and we end up dealing with not just personally identifiable information but EPHI and other data that our customers give us that's of a medical nature or the like, where there are compliance issues and we need to ensure that we have mTLS and encryption in transit, as well as at rest, for everything, full stop. A service mesh is one of those things that lets us provide that to all teams without all teams having to deal with authentication and encryption and mTLS themselves. That was the second big one. And then third, observability. When we were a 20-person company with a big shared code base, everyone just kind of knew what was going on. But now that we have dozens of services and rising, and teams that are growing not just in number but also across geographies (we're now a multi-region team, if you will), having a consistent observability platform is critical for mean time to issue identification, diagnosis, resolution, and the like.

Okay. So the initial impetus was purely gRPC load balancing, but the things that have been sticky, I guess, are the mutual TLS...

Yeah, and there's actually a fourth one. I don't want to hog too much time here, but we're rolling out continuous deployment for our services. We're using Flux CD and Flagger, for Kubernetes-hosted services at least, and the observability and metrics that come out of the mesh can help us form the predicates that we use for canaries. That's active work in flight for us; we've got pilots up now, and we like what we see so far. So this quarter we're doing things like taking all of the proto that we build in CI and generating service profiles. We've moved over to Linkerd, so our observability is not just at the service level; moving forward it will be at the route level, at the method invocation level. And that's a huge win, because when something goes wrong or we have an issue, we can very quickly see where the issue is.

Yeah, I love that stuff. It's really, really cool. Anna, you're also in a regulated space. What was your motivation? Was it mutual TLS, or was it something else?

Yeah, so the main motivation was gRPC load balancing as well. Our application is a distributed monolith that is deployed on top of Kubernetes as microservices. It's quite complex: last time we counted, it had over 50 microservices, and right now, realistically, it's maybe toward 80 if not 100. And we are in a regulated space, so mTLS and encryption and security were really important to us, but we would have been able to find other ways around that. The main issue was scalability: for our services that communicate through gRPC and protobuf, being able to load balance gRPC was a pain point. If we hadn't used a service mesh, we would have had to change the way the services communicate with each other, or even build our own service mesh, and that would have been too much work.

Yeah, that's not fun. Thanks. And then, William, unlike Anna and Matt, you get to live this carefree life of no regulations, no rules; you can do whatever you want, I assume.
What was the motivation for you folks, especially since you are operating in such a latency-sensitive space? Why would you add a service mesh that just adds proxies and adds latency everywhere?

So my co-founder and I actually came out of the regulated telecom space, and we brought forward a lot of those best practices. If we were going to build in something at the infrastructure level, we might as well start with best practices; it's a lot easier to greenfield them in and establish them as tooling than it is to try to backport them later. For us it was actually more about bringing in determinism, and about more services being able to self-configure. Some examples: because we're running clusters across multiple clouds, everything from on-prem bare metal to cloud-hosted versions, we were seeing strange connectivity issues between them. By having the service mesh run mTLS, run basically the connection proxies between the services in the cluster (and we found ways to do it between clusters), we actually brought in a lot of determinism, and services were able to self-configure how they wanted the service mesh to react. And since we use Skaffold and Helm for a lot of our CI/CD deployment process, we were able to specify that in the actual deployment, so we could make a service mesh change as a discrete unit. For instance, we had one scenario, for a couple of days, of a TCP connection leak. Oops. But we were able to use the service mesh to very specifically tune things, from multiple perspectives, to find out where the source of it came in, roll it back, and then roll forward once we had resolved it. So it's worth the latency hit. We see bigger latency hits not from the mesh itself but from many other sources, and having a service mesh has bought the mesh more of a latency budget, by solving latency elsewhere, than it costs.

That's been our experience as well. We haven't really noticed any issues with latency with Linkerd, and we bought some error budget elsewhere, in particular through the more nuanced, more adaptive way that load balancing happens in Linkerd. We run some fairly large clusters where we opportunistically run some workloads on faster nodes when model training things aren't busy, so it's not a uniform distribution, and round-robin load balancing for some of our gRPC services is not super optimal.

This is really good to hear. One of the challenges we faced early on in talking about the concept of a service mesh with people was that it seems like a bad idea, right? You're adding thousands of proxies everywhere, and you're going to incur a hit there. So we had to talk about how, yes, every abstraction has a cost, but you're going to get a benefit at the end. But that feels like a very abstract kind of conversation, so it's good to know that it actually is worth it in practice.

William, with one of the very first versions of Linkerd, when we installed it we saw major, major latency added to our services. But it turned out there was a bug between the application and Linkerd; we worked on it and solved it, and after that was fixed, we haven't seen much latency added.

Great. Okay. Wow.
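Since timeouts, retries, and route-level behavior have come up a few times now: in Linkerd, that per-route configuration is expressed in a ServiceProfile resource, which is also what enables the route-level metrics Matt mentioned. Here is a minimal sketch; the service, namespace, route, and values are hypothetical and purely illustrative:

```yaml
# Rough sketch: a Linkerd ServiceProfile encoding per-route expectations.
# The service, namespace, route, and values here are hypothetical.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # The name must be the fully qualified DNS name of the service.
  name: quotes.prod.svc.cluster.local
  namespace: prod
spec:
  routes:
  - name: GET /api/quotes
    condition:
      method: GET
      pathRegex: /api/quotes
    timeout: 300ms       # the service's stated latency budget for callers
    isRetryable: true    # safe to retry, within the retry budget below
  retryBudget:
    retryRatio: 0.2            # retries may add at most 20% extra load
    minRetriesPerSecond: 10
    ttl: 10s
```

This is the shape of the "co-partnership" the panelists describe: the service declares its expectations once, and both its callers and the platform inherit them.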
All right, let's move on to the next one. Actually, I think there are two questions here, and maybe we'll try to address them at the same time, because I want to make sure we have space at the end for audience questions. What's been the biggest organizational challenge in rolling out a service mesh? By organizational I mean people; I understand that deploying anything in Kubernetes is a challenge just from the nature of the beast. And what's been the most surprising benefit? William, why don't we start with you, because you have the best name. Biggest organizational challenge and most surprising benefit, if any.

Yeah, I would say for us the biggest organizational challenge was kind of two parts, and we solved each of them in an interesting way. One was being able to find a shared set of configurations that works for all services, when we know that's impossible. So we worked on finding sane defaults: how do we migrate off the defaults for specific scenarios, for as long as they have to be off the defaults, and then, where possible, try to bring them back in? Managing that was an organizational challenge. The other part related to working with some amazing engineers on our team who were learning how to run a service mesh, folks who had never actually been in an SRE or operations type of role. We even have a nickname internally, "the SRE intern," for these sets of projects, where you're basically getting the Matrix-level download of how a service mesh works, how all the components work, and how you change and configure individual components to override the defaults. That was an organizational challenge, but it's paid dividends for us, because we now have more people who understand how the internals of the components work. On the first part, with the configurations and the defaults: we had scenarios, and I won't get into too much detail, where we intended a configuration on a service mesh deployment to look one way, and once deployed in Kubernetes, it looked different. When you're trying to make gentle adjustments in one direction and you have a deployment system that interprets things unexpectedly, there are at least unfortunate downsides.

Great. And Matt, what about you? You're steering this ginormous organizational boat. What was the biggest organizational challenge and the most surprising, shocking benefit?

So I think the challenge wasn't to initially adopt the mesh; we had very concrete problems, I'll say manual load balancing, happening before we had a solution to load balance gRPC. At a high level, the challenge hasn't been with teams that have an acute, concrete need that a mesh solves. What's a little bit harder in a growing company (and we have an enormous opportunity; we manage the business such that when we do things that increase revenue or are working, we do more of them) is getting the conversation going with teams that don't have an acute problem, but where there's an organization-wide benefit to having all of our applications in a mesh, where we can have a consistent view across services. Having them take time to actually learn the things they need to learn, or change configurations and things like that, when it doesn't have an immediate value to their team: that can be, just from a people or project management perspective, a little bit of a challenge.
However, I think it's solvable, and when you show them some of the stuff, they get it. Hey, you can come to the mesh, or you can implement mTLS yourself. Or: we're standardizing on an observability stack that is going to heavily leverage consistent metrics coming from these services, so if you hop on the mesh, here's all this alerting and monitoring and anomaly detection and other things that you'll get out of the box that you would otherwise have to manage yourself. So that's one challenge. Another challenge we've had: we shifted to Kubernetes a couple of years ago, and there have been some difficulties there. As an aside, the first time my partner saw the peanut butter I was eating a couple of years ago, this raw peanut butter stuff, she said, "Oh, this is okay, but it doesn't taste like it's done." To me, Kubernetes doesn't feel like it's done yet. It's useful, it's a step in the right direction, it's doing a lot of positive things, but it feels like it hasn't arrived, and there is a barrier to entry. In particular, we have both Kubernetes and non-Kubernetes workloads, so one of the challenges has been that teams, especially when they have services both inside and outside Kubernetes, have been forced to address some technical debt and learning around how we handle east-west versus north-south traffic, and what the finer points of that are. A positive aspect, though, is that we've now had a number of discussions about the choices we're making, like using NGINX now instead of cloud-vendor-specific ingresses for these kinds of use cases, and an outcome has been a higher level of knowledge about the bowels of the networking that was not there before.

Okay, great. Thank you. And Anna, you're the real engineer here; the rest of us have devolved into management roles and sit in our ivory towers, shuffling org charts around. So keep us pure. What's been the biggest organizational challenge at Paybase in rolling out Linkerd?

Honestly, it has been dealing with bugs. And I'm not talking necessarily from a service mesh perspective or an application perspective; it had to do mainly with the complexity of our application and the very specific functionality we use, which it seems that, at the time, none of the other Linkerd users (or even Istio users, because we tried that too, before Linkerd) were seeing. I remember quite a few weeks where I would deploy it onto our testing environments, test it as much as I could, and then message my team saying I'm ready to deploy to production, and then I'd have everyone in the team saying stop, stop, stop, roll back, it's not working. The talk that I did with Risha at ServiceMeshCon covers those challenges. That has been the main thing, but we were able to solve them, and we were able to do that through collaboration between the different teams. And, to go a bit into the next question, if there's something I wish someone had told me: myself and Risha came up with a matrix of how to troubleshoot something as complex as a service mesh when your own application is very complex, and I just wish that I'd had access to that when I was deploying it.
Yeah, that has been the biggest challenge, I would say. And in terms of a surprising benefit: being able to see, in the UI, the dependency tree between services. Although we're a very, very small team, we move very fast, and sometimes it's hard for the systems engineers to keep up with how the dependencies between the services have changed, even on a weekly basis. That helps with onboarding new engineers as well.

Okay, great. And that decision matrix that you and Risha came up with is in your talk, which is linked at the end.

Okay, we're going to do one last question here from me, and we're going to have to stay really focused, because I want to leave a bunch of time for the audience Q&A; we've got a whole bunch there. So, very last question, 30 seconds or less. Maybe you've already answered this, Anna, but we can start with you. What's your best advice for other organizations who want to adopt a service mesh?

Watch our service mesh talk! I would just say, don't be afraid to reach out to the team who's managing your service mesh. Sorry: contributing... maintaining, that's the word. Maintaining the service mesh. For us, we were able to contact you guys over Slack, and that was the fastest way we were able to fix everything we were seeing on our side. And don't be intimidated just because it looks very complex (a service mesh is very complex); take everything incrementally and add things as you go.

That's great. Okay, William, what about you? What's your best advice?

I'd say, for adopting a service mesh, the incremental approach is the best way to look at it. I would take it a step further and say: while you're adopting incrementally, get something working. And then, when you get something working at a very small level, break it. See how it breaks, understand how to triage it, roll back the break, then go add the next piece or feature, and try to take things in functional units. So, say, north-south through an API gateway: take that as one unit. Then east-west between services and namespaces as another unit, and between multiple clusters as its own separate unit. You'll learn a lot about the subtleties and the insides of the abstractions by seeing how it breaks and then putting it back together.

Great. All right, and Matt, I saw you nodding your head.

Yeah, I think it's safe to say that if you're dabbling in these waters, with service meshes everything is shiny. So be really, really, really clear about what problems you're actually trying to solve, and ruthlessly prioritize. There are many features of Linkerd, for example, that we haven't explored yet, because we've needed to focus on the ones we focused on, take an incremental approach, and iterate. As an example, we have at this point some namespaces where everything in the namespace is meshed. In the new environments we're building out for our next-generation stuff, it'll be the default to have the service mesh enabled, and the exception will be when you're not on it. But it's very easy to roll out something very broadly and then discover what you don't know. And also, a plus one to reaching out to the upstream communities; it's one of the advantages of working on an open source based CNCF stack.

Great. Okay, well, I'm glad to hear that the community aspect is coming out here.
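One concrete way to do the mesh-on-by-default rollout Matt just described is at the namespace level: Linkerd's injection annotation can sit on the namespace itself, and individual workloads can opt out while they migrate. A rough sketch, with a hypothetical namespace name:

```yaml
# Sketch: mesh on by default for a whole namespace.
# "payments" is a hypothetical namespace. Pods created here inherit
# injection from the namespace annotation; a single workload can still
# opt out by setting linkerd.io/inject: disabled on its own pod template.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  annotations:
    linkerd.io/inject: enabled   # every new pod in this namespace gets the sidecar
```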
Okay, so we've got a couple of minutes left. While we've all been talking, Ariel has been slaving away behind the scenes, curating all the questions that have come in. I have no idea what these questions are; I have not looked at them yet, so we're going to find out together. Let me take a quick pass through... there are some good questions here. How about this: we've got, let's see, four questions total, so let's make it a free-for-all. Rather than me directing everything, pick a question to answer. I think somebody asked why you moved to Linkerd from Istio; there are two or three questions on Istio versus Linkerd.

For us, we rolled out Istio first, and we still have one workload on it, because it uses header-based dynamic path routing, which Linkerd doesn't do. We found that Istio was very broad: it seems to have a ton of features, but it's also very difficult and has a lot of moving parts. And for us, most importantly, it's very opinionated on the ingress gateway, and we wanted the flexibility to choose our own ingresses, as we still haven't consolidated on one single API gateway type like Ambassador or Gloo or something else. For us, Linkerd was a little more narrowly focused and leaned toward less configuration and less barrier to entry, as well as being a little less overhead in terms of performance. So when we moved to Linkerd it was primarily, like I said, for those three things (observability, mTLS, and load balancing), and it left open choices that we didn't have to make up front.

I would just add on to what Matt said: we had the same experience with Istio and Linkerd, where we found that Istio had a lot more features, but the barrier to entry was much higher, and Linkerd was much simpler. For us specifically it came down to being able to deploy it, and at the time we did it, we didn't have the right support from Istio to be able to troubleshoot it properly. Of course, that doesn't mean things haven't changed for them now; that's just what our experience was.

I'd say ours kind of echoes that. We deployed Istio initially, because it was basically 30 minutes and it's up and ready and services are pulling into it. But we ran into trouble upgrading between versions and trying to resolve features. Linkerd was always looked at as: here are the components, the Lego bricks; put them together how you need and how you want. Istio is more: push one button and fingers crossed it works exactly as you need it. It was very opinionated on the API gateway, and we actually ran into issues. The start of the migration from Istio to Linkerd was when Helm options and istioctl options were not being respected. You dive into the code and you realize, okay, there's a significant difference between the two, and there's no way to configure that particular construct. Whereas Linkerd, two hours later, was up and running with the full cluster in a beta environment, and we didn't really look back.

That's great. I see there's a question on latency and overhead and metrics. There are some metrics: if you search for Kinvolk and Linkerd, maybe "Kinvolk Linkerd Istio," you'll see a performance comparison that was done in May of last year. So it's almost a year old, and both projects have released several versions since, but that was the most comprehensive benchmark that I'm aware of, and all that stuff is downloadable,
so you can even reproduce those graphs yourself, or try them with a new version and see what happens. The question about the underlying proxy: does that have the greatest impact on performance and latency, or is it the policy-driven parts of the mesh that cost the greatest resource contention and latency? That's an interesting question. Either of those could. Certainly the proxy has a huge impact on performance, because every single call you're making between services now has to go through not just one but two proxies: A talks to B, and there's a proxy on both the client side and the server side. So if that proxy is not as fast as humanly, as computationally, possible, then you're losing performance. On the policy side, I guess it depends on how policy is done. It's easy for Linkerd, because Linkerd doesn't have policy right now, so there's no performance hit. It's on the roadmap for Linkerd, and when we do it, it'll be done in a way that doesn't have resource contention.

Let's get to some of the rest of these; it looks like there are three minutes left. I was just reading through, and I can give some quick answers. One of them was canaries and blue-greens with Linkerd. That's something we're piloting now using something called Flagger, which is out of Weaveworks; it integrates pretty nicely, and I hope to have some better results to talk about soon, but our pilots are looking great so far (a rough sketch of what that can look like follows below). On the ingress side, we're using NGINX today. I can't remember whose question that was, but the reason why is that we're operating clusters in multiple clouds, and we want our application definitions to be as portable as possible, so not having to tokenize things like ingress declarations for this cloud vendor's versus that cloud vendor's ingress is something that drove us. Also, NGINX is like 20 years old and established, so for us it was a safe choice.

We're also using NGINX for ingress. We've always done so, so it wasn't anything related to Linkerd. It was good to see that they work together, though.

There's a question on concurrency as well...

Sorry, I was going to say: we're on the Envoy side, and that was more for performance and rapid reconfiguration. We also got an advantage on the gRPC-to-REST transcoding: because we were greenfielding with protobuf and gRPC in the beginning, we didn't have to Swagger-define and Swagger-spec all of the REST side. So a couple of services use gRPC-Gateway for that: gRPC under the covers, and then expose REST.

There was a question about multi-tenancy and service meshes. I can only speak to Linkerd right now, but you can run multiple versions of the mesh in different tenants if you wanted to. However, because of CRDs, at least for the near future until versioned CRDs are more real, you start with one version of the mesh across a particular cluster. But you can run multiple control planes in parallel without too much drama.
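For the canary pilot mentioned above, here is a rough sketch of what a Flagger canary gated on the mesh's golden metrics can look like. The resource names and values are hypothetical, and the fields follow Flagger's published Canary schema, so treat this as illustrative rather than a tested configuration:

```yaml
# Rough sketch: Flagger progressive delivery driven by Linkerd metrics.
# Names and values are hypothetical; fields per Flagger's Canary resource.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web
  namespace: prod
spec:
  provider: linkerd        # use Linkerd for traffic shifting and metrics
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  service:
    port: 8080
  analysis:
    interval: 1m           # evaluate the checks every minute
    threshold: 5           # roll back after 5 failed checks
    maxWeight: 50          # shift at most 50% of traffic to the canary
    stepWeight: 10         # in 10% increments
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99            # require at least a 99% success rate
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500           # keep request duration under 500ms
      interval: 1m
```

The success-rate and duration checks here are the same golden metrics the proxies already emit, which is why the mesh and the canary tooling compose naturally.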
There were a couple of questions about upgrades. How do you folks handle upgrading the mesh?

When we installed Linkerd, the Helm chart for Linkerd wasn't that advanced, so we decided to create an in-house script. The plan was always to get to the Helm chart for Linkerd, but we just haven't had time to do that, so we do it with a script. And there isn't really downtime; if there is, it's maybe a couple of minutes max. So we haven't seen much downtime. And in terms of, I think the question was specifically about platform and infrastructure: we also deploy infrastructure that way, and we haven't seen any issues with that.

And those Helm charts have been updated a fair amount in 2.7.

Yeah, we're in the same place. We initially installed just manually, by hand, because the Helm charts didn't exist at all. We're using Terraform for infrastructure, so the cluster itself is Terraform, but the workloads themselves we don't have in Terraform. The approach that we've found works pretty well so far, for having GitOps methodologies but with Helm as well, is to use Flux CD. However, we're using a mixture, so we're looking at the Helm Operator; but Flux actually has some capacity to template things out, so the approach we're probably going to take for upgrading, moving forward, is to template out the chart and then apply that directly.

I don't think we have time to cover how upgrades with Linkerd work anyway; it's probably out of scope. But we've shown that we can do it without taking real downtime.

We approach it at the cluster level: we pull the cluster out of rotation, more from a Kubernetes federation perspective, and update from there.

We started using Linkerd after the Helm chart was solid. That was one of the hurdles we wanted to see overcome before we started trying it out.

And with that, this very lively conversation comes to an end. Thank you so much to the Williams, Anna, and Matt for today's great presentation. As I said earlier, the slides and recording of this will be available later today. If you have additional questions, and we didn't get to all of them, slack.linkerd.io (which was one of the early questions) is a place where you can go and engage with the community. Thank you all so much, and we hope to see you at a future CNCF webinar.

Thank you. Thanks. Thanks, Matt. Thanks, William. Thanks, Anna.