Hello, everyone. I'm Don Jaikari. In this presentation, I'm going to talk about the dos and don'ts of service mesh architectures. As you may already be aware, service mesh is one of the hot topics in container networking and Kubernetes in general right now. I'm going to walk through different aspects of the service mesh, starting with why a service mesh: how we got here, and why we decided to put a service mesh on top of an already complete networking infrastructure. Then we'll look at what a service mesh is — the technical and architectural aspects — and briefly touch on traffic flows and the components involved. We'll end by looking at the dos and don'ts: the considerations around service mesh infrastructure that you can take into account when you are implementing, as well as operationalizing, a service mesh.

Let's talk about how we got here for a minute. We had monolithic architectures, and we decided to break those monoliths apart and introduce microservices-based infrastructure. Now, the issue with microservices is that you have many of them, and to make any sense out of them, you need effective communication between all these services. To make matters worse, organizations were building these microservices not in a cookie-cutter way but in very different ways: they were developing them in different languages and deploying them with different build patterns and build pipelines. All of this made matters worse, adding unnecessary complexity and, frankly, hindering innovation.
And that is not what we wanted to get out of a microservices infrastructure — it's why we broke apart the monolith in the first place. So let's look at the use cases of service mesh infrastructure. On the next slide, I'm highlighting three main use cases that always come up as the main drivers for service mesh adoption.

The first one is reliability: reliable communication between services, making sure services are connected properly and can reliably discover each other. Along the same lines, there's traffic handling behavior — for example, rerouting traffic. Intelligently routing traffic is one of the primary needs in an application environment. Say you have a service and you're bringing up a new version of it. Your use case is that you want to intelligently route, say, 95% of the traffic to the current, old version and 5% of the traffic to the new version. With this kind of canary deployment, people want to make sure they can correctly route traffic between service versions. So reliable connectivity is one of the main drivers for service mesh adoption.

The second one is security. You have many more microservices in your infrastructure compared to the old monolithic architecture, so in a certain sense your attack surface has grown: many more endpoints, many more services talking to each other. You want that communication to happen in a much more secure way — you want to encrypt the traffic between these services, and ideally right next to the services themselves, not after a couple of hops.
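The 95/5 canary split described above boils down to weighted backend selection. Here is a minimal sketch of that idea in plain Python — the service names are invented, and a real mesh does this inside the proxy, not in application code:

```python
import random

# Sketch of weighted canary routing, as a mesh data plane might do it:
# 95% of requests go to the stable version, 5% to the canary.
# Backend names here are made up for illustration.
WEIGHTED_BACKENDS = [
    ("reviews-v1", 95),   # current/stable version
    ("reviews-v2", 5),    # canary version
]

def pick_backend(rng: random.Random) -> str:
    """Choose a backend for one request according to the configured weights."""
    backends = [name for name, _ in WEIGHTED_BACKENDS]
    weights = [w for _, w in WEIGHTED_BACKENDS]
    return rng.choices(backends, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded so the run is reproducible
sample = [pick_backend(rng) for _ in range(10_000)]
share_v2 = sample.count("reviews-v2") / len(sample)
print(f"canary share: {share_v2:.1%}")  # close to 5%
```

In a real mesh you would declare the weights in the mesh's routing resource and the proxies would enforce them; the point is only that the traffic split is configuration, not application code.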
So that's the encryption capability. Furthermore, you want to be able to manage security policies in a much more centralized fashion. These kinds of security use cases were another primary driver for service mesh adoption.

And last but not least, observability. It's true that we're losing a certain amount of visibility — or not necessarily losing it; things are just getting much more complicated. So increased visibility is always a plus in this complex infrastructure. The use case here is that you want increased visibility, and at the same time you want much more contextual awareness of all these traffic patterns — you want to really understand how the traffic is flowing. So if there's a component that can provide all this tracing and metrics data, that would be great. That was another of the primary drivers for service mesh infrastructure.

Now, let's pause for a minute, because if you really look at it, none of these use cases is new. We've always needed better traffic handling, we've always wanted security, and we've always needed better visibility and observability mechanisms. None of these use cases is inherently new to container infrastructure; we had the same needs in previous architectures. So the evolution of these service mesh use cases is pretty interesting. If you look at pre-2015 — the pre-Kubernetes era — we were deploying services on top of different platforms, like VMs, bare metal, OpenStack, what have you, with the services running on top. And most of these use cases —
traffic use cases like circuit breaking, routing, and service discovery, security use cases such as authentication, and observability use cases like tracing — were more or less embedded into the services themselves, in a so-called middleware layer. Now, fast forward to 2020, and what's happening — what has already started to happen — is that this middleware layer is disappearing. The tight coupling of the middleware layer to the service is being abandoned, for a number of good reasons. Services are becoming much more loosely coupled with each other, and the service mesh layer is taking over what the middleware layer used to do.

Operationally speaking, this is pretty good news, because developers can go about writing their service applications and adding value to the business, while operations organizations take care of all the other use cases — making sure services are deployed properly, and giving you much more granular control over the service infrastructure from the traffic handling, security, and observability points of view. That's the primary driver for the service mesh: the middleware layer we used to see back in the day is getting absorbed into the service mesh infrastructure, alongside a much more loosely coupled service infrastructure. As a result, there's a clear boundary between the development and operations organizations of enterprises.

So what is the current state of the service mesh? It started with Linkerd, and Istio was not far behind. Fast forward to mid-2020, and we have many solutions to pick from, which is frankly pretty good news.
If you look at who is actually shipping service meshes, you can quickly sort them into buckets. The big cloud providers: AWS App Mesh, Google Anthos, and Microsoft's Open Service Mesh. Platform providers like VMware with Tanzu. All of them are already solving some of their unique use cases with their own service meshes. Middleware companies like MuleSoft have their Anypoint Service Mesh. And third-party providers like HashiCorp — HashiCorp has Consul, which is pretty popular in this arena as well. Everyone is solving their own unique use cases, which means there are plenty of use cases and problems to pick from. That's really all this graph is saying: it's good news in the sense that there are plenty of use cases and plenty of problems, and at the same time you might have trouble evaluating all these offerings — but they are pretty distinct from one another, to be honest. Some folks are not just focusing on the container environment, because at the end of the day, traffic has to flow everywhere; you are not going to leave your legacy infrastructure behind. So one of the prominent use cases coming up is how to make sure you're not leaving the VM environment behind with service mesh architectures. Some of these meshes, like Kuma, for example, are focusing on bringing the traditional VM environment into the picture as well. So everyone has their own unique spin on the service mesh, its architecture, and where it's going.
But from the data plane point of view, I definitely see most people converging on the Envoy proxy layer — I'll cover that in a bit more depth in the "what is a service mesh" section. At a high level, most of these vendors offer a unique spin on the control plane — some extend Istio's control plane, others have their own — but for the most part, the data plane is coalescing around the Envoy proxy.

Okay, let's move on to the "what is a service mesh" portion of my talk. Let's start by looking at the architecture. At a very high level, a service mesh is nothing but an abstracted architecture: there is a control plane, and there is a data plane. The control plane exposes a set of APIs; as users, we interact with those APIs and instruct the control plane to apply all the use-case policies we were talking about earlier. From a traffic handling point of view, you can declare that traffic between this service and that service should be directed in a particular manner; likewise for security policies, observability rules, and so on. All of that is configured through the control plane. Then there's the data plane: the proxies live right next to your applications or services. We'll talk about proxy placement in a second, but generally speaking, a proxy sits right next to your service or app, and connections are terminated on that proxy itself. What this means from the data plane point of view is that all your communication terminates at the proxy.
The outside world only knows about that particular proxy, and the proxy is proxying for the service sitting behind it. So the control plane instructs the proxy to enforce the policies you define through the control plane API, and the data plane proxy handles all the data plane functionality. For all the hype around the service mesh, it's a pretty straightforward architectural pattern: it's nothing more than a bunch of user-space proxies stuck next to each service or app — that's what's referred to as the service mesh data plane — plus a control plane. The data plane intercepts requests and acts on them, the control plane coordinates the behavior of the proxies through an API, and as a user you interact with the service mesh control plane using that API.

So where should this proxy actually live? This is something we briefly touched on. I was showing the proxy right alongside the application, but that's not the only choice. On the next slide you'll see three different options for where to put this proxy. In the first option, we embed the proxy as a library into the application. The downside is that you now depend on your developers to pull in the proxy and embed it into their applications — which is not why we started this whole microservices and service mesh journey in the first place; you're coupling yourself even more tightly to other teams. So option number one is not great. In option number two, you have one proxy per node — per Kubernetes node.
This is nice in the sense that you don't have to deal with multiple proxies: you have one proxy sitting on each node, all your Kubernetes services get proxied through it, and it's a much simpler model. The drawback is that it's a single point of failure. I'll touch on this subject again in the dos and don'ts section, but it's pretty obvious that it's a single point of failure. Does that mean you should deploy two or three proxies per node? Whether that's a good idea is still up for debate.

And then the third option is the sidecar. With the sidecar option, the proxy gets embedded right next to the application. If you recall from Kubernetes 101, a Kubernetes pod is nothing but a bunch of containers sitting in the same network namespace. So what happens is that we embed the proxy as another container in the same Kubernetes pod. As a result, the proxy sits in the same network namespace as your application's service container and inherits all the networking parameters of the application. That's what's called a sidecar container: the proxy is embedded into the same pod — the same network namespace — as your application, right alongside it. Now, if you look at the most common deployment patterns today, the sidecar is definitely winning by a wide margin.
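To make the sidecar idea concrete, here is roughly what an injected pod looks like, sketched as a Python dict standing in for the YAML manifest. The names and images are hypothetical, and in practice the mesh's injection webhook adds the proxy container for you rather than you writing it by hand:

```python
# Minimal sketch of a sidecar-injected pod spec, expressed as a Python dict
# mirroring the Kubernetes YAML. Container names and images are hypothetical;
# a real mesh injects the proxy container via an admission webhook.
app_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "checkout", "labels": {"app": "checkout"}},
    "spec": {
        "containers": [
            {   # the actual application container
                "name": "checkout",
                "image": "example.com/checkout:1.0",
                "ports": [{"containerPort": 8080}],
            },
            {   # the sidecar proxy; it shares the pod's network namespace,
                # so it can intercept the app's traffic over localhost
                "name": "envoy-sidecar",
                "image": "envoyproxy/envoy:v1.18-latest",
                "ports": [{"containerPort": 15001}],
            },
        ]
    },
}

# Both containers live in one pod and therefore one network namespace,
# which is what lets the proxy terminate TLS right next to the app.
names = [c["name"] for c in app_pod["spec"]["containers"]]
print(names)
```

The key point is structural: the proxy is just a second container in the same pod, so nothing in the application image changes.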
People like this aspect of sharing the network namespace with the actual Kubernetes service: the proxy sits right alongside the application, which is also good from a security point of view, because the proxy is right there and traffic doesn't cross any boundaries. If you implement security policies — encryption such as TLS — your connection terminates almost right at the network namespace of the application. So the sidecar proxy is getting a lot of attention these days, and most implementations use the sidecar model.

Speaking of proxies, you can't really talk about proxies without talking about Envoy. Envoy has pretty much become the universal data plane for microservices and Kubernetes architectures today; it is the most prominent proxy in use, by a wide margin if I had to guess. One reason is that when it came out of Lyft, it was really focused on the performance aspect — shuffling data back and forth — with a high-performance C++ implementation, and that was one of the key differentiators. It also provided a lot of observability and advanced load-balancing functionality. But the most important reason Envoy got popular is its declarative-style API, very similar in spirit to Kubernetes. NGINX and HAProxy were already there, but this declarative API is what made Envoy really popular among developers: you suddenly get APIs you can program against, via the xDS protocol — the configuration language of Envoy.
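To make the declarative, control-plane-pushes-to-proxies idea concrete, here is a toy Python-only sketch: users hand the control plane a desired state, and every connected proxy is streamed that state and converges on it. All class and field names are invented for illustration — the real xDS protocol is a set of gRPC/REST discovery services (LDS, RDS, CDS, EDS), not this API:

```python
# Toy model of the declarative pattern behind xDS-style control planes.
# Users declare *what* they want; pushing it to every proxy is the
# control plane's job. This is a teaching sketch, not the real protocol.

class Proxy:
    def __init__(self, name: str):
        self.name = name
        self.routes: dict = {}

    def apply(self, desired: dict) -> None:
        # Converge on the desired state streamed from the control plane.
        self.routes = dict(desired)

class ControlPlane:
    def __init__(self):
        self.desired: dict = {}   # route name -> route config
        self.subscribers: list[Proxy] = []

    def connect(self, proxy: Proxy) -> None:
        self.subscribers.append(proxy)
        proxy.apply(self.desired)  # initial snapshot on connect

    def set_route(self, name: str, config: dict) -> None:
        self.desired[name] = config
        for proxy in self.subscribers:   # stream the update to every proxy
            proxy.apply(self.desired)

cp = ControlPlane()
sidecar = Proxy("checkout-sidecar")
cp.connect(sidecar)
cp.set_route("checkout", {"cluster": "checkout-v1", "timeout_ms": 250})
print(sidecar.routes["checkout"]["cluster"])  # checkout-v1
```

The design point this illustrates is the decoupling the talk describes next: the data plane only ever receives configuration; it never needs to know which control plane generated it.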
Using this xDS protocol, you can really decouple the control plane from the data plane, and this whole idea of a decoupled control plane became really popular with Envoy itself. So while we have a fairly universal choice for the data plane with Envoy, for the control plane we don't have one choice — and that's exactly the point. That's why there are many different service mesh architectures: the control planes are different. Everyone implements the service mesh in different ways and attacks different use cases. One popular example is Istio, where the Pilot component is the control plane piece responsible for accepting user configuration, generating the proxy configuration, and streaming it down to the Envoy proxies. You can see equivalent architectures in other service mesh control planes as well. I'm not going to go into much detail on any particular control plane, because everyone is different from that angle.

So with that, let's take a look at the dos and don'ts of service mesh infrastructure. Whether you are still planning, already in a POC and trying things out here and there, or all the way ahead of the curve with a service mesh deployed, there are certain considerations you'll want to look at. Let's go through them. To start off: do you really need a service mesh? That's a very valid question you should ask, especially if you're trying to figure out whether to put a service mesh on top of your infrastructure.
Buoyant, the maker of Linkerd, actually has a very interesting questionnaire on their website about exactly this; if you haven't taken a look at it, I urge you to. There are interesting questions like: how many people are in your engineering org? How many microservices are in your application? Because at the end of the day, as I was mentioning, a service mesh is just another abstraction layer on top of your stack, and like any abstraction, it has its tradeoffs. If you don't have that many microservices to take care of, it might not be worthwhile to add another abstraction layer; your general CNI networking layer that does the packet forwarding might be enough. Or you may want to look at a specific service mesh for a specific need. Another aspect the questions probe is whether all your services are written in one language, or whether you have more of a polyglot environment where services are written in many different languages. If you keep inheriting different languages and different teams into your infrastructure, a service mesh might be a very valuable proposition, because it gives you traffic handling, security, and observability without a per-language library. You need to be aware of these kinds of questions before you go and evaluate service meshes, because it might not be for everyone — it may be a fantastic solution for you, but you need to be aware of these conditions.

Moving forward, another consideration is that you have to really pay attention to your application architecture.
We keep assuming, by default, that you've split your monolith into microservices — but are those services really loosely coupled? That's one of the primary things you have to worry about. If your services are not really loosely coupled with each other, putting in a service mesh might not be a good idea — and not just from the technical point of view, but from the organizational point of view as well. If you have clear boundaries between your services that map to your organizational structure — your departments, for example — a service mesh can be a big plus, because everything is decoupled, everything is loosely coupled, even from the deployment angle. But one of the key things, beyond the application architecture itself, is the deployment point of view. What I mean by deployment is this: say you have 100 microservices, and you discover a dependency from service X to service Y — not in terms of service boundaries, but in terms of deployment. Service Y expects service X to already be there, earlier in the deployment pipeline. These kinds of deployment gotchas can cause unnecessary problems when you try to put a service mesh on top. So the bottom line is that you should be — I wouldn't say free of dependencies, but almost free of these kinds of gotcha dependencies.
Sort those out if you're seriously considering a service mesh infrastructure; if you don't, you may run into big problems down the line.

Another consideration you should pay really close attention to is resource utilization. You're not going to get a service mesh at zero cost; there are pretty apparent resource overheads you should be aware of, and you can divide them into control plane and data plane concerns. On the control plane side, things like your rate of deployment and configuration changes matter: if you're planning on making a lot of configuration changes, that is going to tax the whole system a great deal. The scale of the system matters too: the number of Envoy proxies connected to the control plane plays significantly into the resource consumption picture. On the data plane side, it's the usual suspects: your protocols, your CPU cores, the number of worker threads, the number of client connections the proxies are making — all of these come into play. To throw out some numbers: Istio has a very comprehensive set of scalability numbers, and since Istio uses Envoy, I would expect the numbers to be in the same ballpark, if not similar, for most other service meshes — but obviously you need to go and verify with the service mesh of your choice.
In Istio's case, the numbers are: with 1,000 services and 2,000 sidecars handling 70,000 mesh-wide requests per second, the Envoy proxy might add around 2 to 3 milliseconds of latency per hop at the 90th percentile. That's the number they came up with on that particular cluster, and you can look at all the granular numbers in the Istio performance documents. There are a couple of good Twitter threads as well — I'm putting up a link to Karl Stoney's thread (I believe he works for Auto Trader), where he describes an environment with around 650 virtual services and 480 rules, with mTLS enabled and Telemetry V2 on the new Istio version, and the performance numbers he's getting. It's worth a look, but at the end of the day, just be aware that you are going to pay for your service mesh: it will cost you resources, from the latency point of view as well as from the CPU and memory point of view.

Another popular operational consideration to take care of is upgrades. Again, you're putting an abstraction layer on top, so better to plan ahead. You have to read the upgrade notes and release notes for your particular releases, but a general rule of thumb is to avoid in-place upgrades. Some service meshes have settings that reduce disruption to active services, such as canary-style control plane upgrades: you can spin up the new version of the control plane somewhere else and gradually move services over to it while keeping the existing control plane intact.
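To make that latency overhead concrete, here is a quick back-of-the-envelope sketch using the ballpark per-hop figure above; the call-chain depth is a made-up example:

```python
# Back-of-the-envelope: sidecar latency overhead accumulates per hop.
# The ~2-3 ms figure quoted above is per service-to-service hop at p90,
# so a request that traverses a chain of services pays it repeatedly.

def mesh_overhead_ms(hops: int, per_hop_ms: float = 2.5) -> float:
    """Total added latency for a request crossing `hops` service-to-service calls.

    2.5 ms is the midpoint of the 2-3 ms per-hop range quoted in the talk;
    it is an illustrative assumption, not a measured number.
    """
    return hops * per_hop_ms

# Example: a user request that fans through a 4-service call chain (3 hops).
print(mesh_overhead_ms(3))        # 7.5
print(mesh_overhead_ms(3, 3.0))   # 9.0  (pessimistic end of the range)
```

The takeaway is that deep call chains multiply the per-hop cost, which is why flattening chatty service graphs matters more once a mesh is in the path.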
So be aware of the available upgrade mechanisms when you're doing service mesh upgrades, and plan ahead.

Another popular consideration is DR — disaster recovery. This is very much in its infancy, to say the least, and you need to monitor the space closely. There are a couple of approaches: a single cluster with a single mesh, multiple clusters in one mesh, or multi-cluster, multi-mesh — the whole utopia mode, if you will. I haven't seen that many significant developments there yet, but there are a lot of active conversations, so be aware of the DR considerations as well.

Another point you'll want to consider is SMI, the Service Mesh Interface that's coming up. The idea behind SMI is to have a standardized interface for service meshes on Kubernetes: meshes like Istio, Linkerd, and Consul would each be a provider behind this API, and the Service Mesh Interface abstracts away dependencies at the provider level. Consider it an abstraction on top of an abstraction. Why would we need that? If you're building tooling on top of a service mesh, SMI might be a pretty good option, because you want your tooling to be transparent to the underlying provider mesh — you don't want to rewrite your tooling for each one. That's the idea of SMI. It's still pretty new, but it's an evolving subject.
Now, on ongoing developments: a lot is happening around the sidecar, and a lot of people are working on it. Proxy placement is almost a done deal — the sidecar has pretty much won — but there are certain pros and cons in that model. Sidecars are a little harder to handle when it comes to upgrades, for example, so people are rethinking the rollout process, and all the concerns I mentioned earlier can come up again. The Cilium folks are working on accelerating sidecars and have multiple presentations around that. Google has an interesting concept called Traffic Director, where the control plane of the service mesh lives completely outside the cluster. These are some of the things that are happening, and the bottom line is that you have to keep an eye on these ongoing developments — it's a very fluid field right now. I hope this presentation gave you some idea of what's happening and the basics of the dos and don'ts of service mesh infrastructure. Thank you for watching.

All right. Thanks, Don. Let's see if we have any questions coming in during this time.