Well, hello everyone, thank you for being here, and sorry I'm the last person standing between you and a cold beer, so I'll try to keep this quick and entertaining. My name is Abdel, I'm a developer advocate at Google. I've spent most of the last five years doing infrastructure work in data centers. I was actually in Professional Services, which is the consulting part of Google Cloud, and I did quite a lot of implementations around Google Cloud, specifically Kubernetes, containers, and service mesh.

The talk has a very provocative title; the idea is to give you my five or six lessons learned from doing service mesh implementations, and hopefully some of them will resonate with you. I have to be clear that most of these things are related to the service mesh tool I use, which is Istio, one of the biggest and most popular and kind of the first one to come up with this idea of a service mesh. But hopefully these learnings will be applicable to any other tool, because although most service mesh tools claim to be different, they pretty much implement things the same way and the architectures are very similar. I'm also going to touch at the end on the future. I don't know if you have heard of it, but we came up with something called ambient mesh; it's a pretty new concept, so I'll mention it briefly at the end. So basically: a quick introduction, so those who don't know what a service mesh is will hopefully get a basic understanding; then the lessons learned; and then what the future holds.

There's a bunch of stuff I'm going to skip through. We used to build monoliths; people used to think they were cool, and then at some point they decided they weren't cool anymore, so they started wrapping them with APIs. Then this concept of API gateways showed up, where basically you take a monolithic app, you put an API gateway in front of it, and you call it a modern application, which is not really that modern. And then, as every developer ever said, "let's rewrite everything from scratch," and that's how microservices started. The whole concept of microservices is to take a monolithic application and split it into small chunks; each chunk does one specific thing and does it well. It could be a business capability, it could be a feature, it could be anything. They're smaller, they're easy to release, they're loosely coupled, etc., etc. Everybody knows all of this, so I'm going to skip through these basic concepts; and this clicker is not doing well, so I'm just going to use the keyboard.

Then containers showed up at some point with Docker. Docker really just democratized the concept of containers; containers had existed in Linux for a very long time. What containers did is make microservices easier. They're immutable, and they avoid the "it works on my machine" problem, because a container is inherently an application, its configuration, and its dependencies all bundled into the same thing. They're also very easy to bin-pack: you can cram as many containers as you want onto a single node, etc., etc.

Now, when you run containers at scale you run into the day-two problems. Running two or three or ten containers on your computer with Docker Compose is easy; running thousands of containers in production at scale comes with a bunch of issues. How do you schedule them? How do you distribute them? How do you restart them and monitor them? How do you persist data in case you need to write data somewhere?
How do you do service discovery, which is how you make one service find another without having to hardcode an IP address or FQDN? Etc., etc. That's what Kubernetes was really created to solve. Kubernetes was released in 2014. It's basically an abstraction layer on top of a bunch of machines, nodes, servers, laptops, Raspberry Pis, it doesn't matter. It's meant to let you, as a developer or as a user, focus on the application and not on the infrastructure. So you have a bunch of servers, you deploy Kubernetes as an abstraction layer on top of them, you talk to the Kubernetes cluster, and the cluster does things for you. It's an open source project. I think most people know all of these things.

But I actually want to focus on this specific point, because a lot of the time people think of Kubernetes as only an abstraction layer. Kubernetes is effectively two different things. It is an abstraction layer, with a control plane and a data plane and nodes and so on, and that's the part people know. But one of the most powerful concepts of Kubernetes, in my opinion, is the concept of declarative controllers. The idea is that as a developer or a user you don't tell the cluster what to do; you express intent to the cluster in the form of a YAML file, and then somewhere in the cluster there is a controller, a piece of code, that will take that intent, "I want you to do this," and translate it into reality.

A very simple example: you want to deploy a container; in the Kubernetes world we call that a pod. You send your manifest, which is a YAML file, to the control plane. The control plane stores it for you, and then somewhere a controller is responsible for making that actual deployment happen. It runs what we call the observe part: it monitors what the cluster says should happen and what is actually happening right now, intent versus reality, and it compares them. The cluster is telling me I should have a deployment, or a pod, or a container; there is no container right now; so it acts on that, and "act" here means deploying the container. This control-loop concept applies pretty much everywhere in Kubernetes. If you have used Kubernetes before, you know that basically you write a YAML file, you submit it to the cluster, and as long as your YAML file is valid and doesn't have any errors, things magically happen behind the scenes, right?

So, a very high-level architecture of Kubernetes; it's important to understand this for the remainder of the presentation. You as a developer or a user use an API, or a command line, or a GUI, or a combination of the three, and you talk to what we call the control plane. The control plane is one node, or three, or five, however you want to run it. Inside there are a bunch of components, and the main ones are the four you see on the screen. The API server is the component you send your commands to, using the kubectl command line or the GUI or whatever. etcd is a distributed key-value store; that's where the things you want the cluster to do are stored, so effectively that's where your YAML files end up.
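Since those YAML files keep coming up, here is a minimal sketch of what one such piece of intent looks like; the name and image are purely illustrative:

```yaml
# Intent, not instructions: "I want two replicas of this container running."
# A controller in the cluster watches this object and keeps reality in sync.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: nginx:1.25   # any container image would do here
```

You apply it with something like `kubectl apply -f hello.yaml`, and from then on the control loop keeps comparing intent with reality: delete one of the pods by hand and the controller just recreates it.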
Then the controllers are what I talked about: they basically watch etcd through the API server and act when something needs to happen. The scheduler does scheduling; it just finds a node to run your workload on. On the workload side, or the worker node side, or the data plane side, however you want to call it, you have a bunch of components, and the major ones are the kubelet, kube-proxy, and the container runtime. Very briefly: kube-proxy does network programming; it makes sure the node is programmed properly so that containers can talk to each other. It's not an actual proxy, by the way; it doesn't sit on the data path. When containers talk to each other, they don't go through kube-proxy, they just go through the node. The kubelet's job is to (a) receive orders from the API server and (b) report the status of the node back to the API server. Then the container runtime, whether that's Docker, containerd, rkt, or whatever runtime you have there, will actually run the container for you.

Now, all of this is really kind of an indirect way of doing a docker run. When you do a docker run on your computer, you are essentially talking to the Docker daemon, which runs the container for you. In the Kubernetes world, you tell the API server, which tells etcd, which tells the controller, which tells the scheduler, which tells the kubelet to do the docker run for you. It's a very indirect way of running a container; at the end of the day, what happens is that some container runtime somewhere fires up a container for you. And then there's a bunch of extra magic: if a node goes down, the controller will detect that and just redistribute the workloads across the other available nodes, etc., etc. I'm going to skip this one.

But believe it or not, Kubernetes doesn't solve all your problems, and I'm going to take one simple example. Let's say you have two services, two applications, two containers, it doesn't matter, and they talk to each other over HTTP or gRPC or whatever. At some point a security person comes to you and says, "we want to add mTLS." For those who don't know what mTLS is, it stands for mutual TLS. In normal TLS, the client verifies the identity of the server; with mTLS, the server also verifies the identity of the client. So in normal TLS, which is what happens between you and Google.com for example, you as the client, the browser, connect to a server, the server sends you a certificate, and your client verifies that certificate to make sure the server is who it says it is. In an mTLS world, you as the client, in this case service A, when you make that TLS call to the server, you also send your own certificate, which allows the server to verify your identity. That's essentially what mTLS is.

So let's say you want to add mTLS. If you just have two containers, you can do it very easily: you mount a bunch of certificates into service A, mount a bunch of certificates into service B, use a library that loads those certificates, and use them to make the call.
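A rough sketch of that "do it by hand" approach, with an illustrative secret name and mount path:

```yaml
# Each workload gets its certificates mounted from a Secret, and the
# application code itself is responsible for loading them and doing mTLS.
apiVersion: v1
kind: Pod
metadata:
  name: service-a
spec:
  containers:
  - name: service-a
    image: example/service-a:1.0      # hypothetical image
    volumeMounts:
    - name: mtls-certs
      mountPath: /etc/certs
      readOnly: true
  volumes:
  - name: mtls-certs
    secret:
      secretName: service-a-tls       # would hold tls.crt, tls.key, ca.crt
```

Service B needs the exact same plumbing, and both applications need code changes to present and verify certificates.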
Doing this at scale could be complicated. Doing it for thousands of containers could be complicated, because if you have certificates, you have to issue them, sign them, verify them, rotate them, etc., etc.

So at some point, people came up with this idea: why don't we have a control plane that does certificate management for us? Why don't we have that certificate management functionality handled by a centralized place, a centralized component? And we can also make that centralized component do a bunch of other things, like enforce policies: that control plane, or controller, or whatever you want to call it, can tell the service on the other side to only accept calls from service A if they meet certain criteria; I'll talk about a specific example later. Or why don't we make that control plane also collect a bunch of telemetry, or do some routing, etc., etc.? Any sort of policy you want to enforce between microservices can actually be done with a central control plane. Timeouts, for example: you can make the control plane tell a service to time out, or to do a retry, or to implement circuit breaking, or anything like that.

For some time these functionalities were implemented through proxy libraries: dependencies you import into your code that do all of this policy enforcement, certificate management, and so on. The problem with proxy libraries and code dependencies is that if you work in a big organization and you use five or six programming languages, you are effectively going to be using five or six proxy libraries that do exactly the same thing, but written in different languages. So you have to maintain them, patch them, etc., etc.

And this is core to what a service mesh is. The whole point of a service mesh is to say: why don't we decouple those network functionalities from the application and have them implemented by a proxy, an actual binary that acts as a proxy? The proxy sits in between all the applications; services send their traffic to the proxy; the proxy implements all the traffic shaping, traffic management, and policy enforcement features; and the proxies talk to each other over a secure medium like mTLS. That's essentially what a service mesh is, or at least the first iteration of it.

Istio came up with this idea. They introduced a control plane, which is just a pod, just a container that runs inside Kubernetes, and then they used Envoy, an open source C++ based proxy that was developed by Lyft. Envoy is actually very cool because it has two main capabilities: (a) it can be configured through an API, so you don't need configuration files, and (b) it can hot-reload that configuration. So when you introduce a new policy or a new change, you don't need to restart the proxy; it can reload its configuration while it's running. If you are rotating a certificate, for example, you don't necessarily want to restart your workloads, right?
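To make the "proxy next to the app" idea concrete before going further, this is roughly what a pod looks like once the proxy sits alongside the application; heavily simplified and with illustrative names, since real injected pods carry a lot more configuration:

```yaml
# One application container plus one Envoy-based proxy container in the same
# pod; the app's traffic is routed through the proxy, which handles mTLS and
# policy. Simplified sketch; real injection adds init containers, volumes,
# and many more settings.
apiVersion: v1
kind: Pod
metadata:
  name: service-a
spec:
  containers:
  - name: service-a
    image: example/service-a:1.0    # hypothetical application image
  - name: istio-proxy
    image: istio/proxyv2:1.16.0     # version is illustrative
```

In practice you don't write this by hand; the mesh injects the proxy container into your pods for you.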
So the people who wrote Istio implemented against the existing API surface of Envoy, built a control plane that understands that API, and then made it accessible through the Kubernetes API, in such a way that as a user you keep using the same concepts: you still write a YAML file and send it to your Kubernetes cluster, then your Kubernetes cluster hands it off to the Istio control plane, which translates whatever configuration you wrote into something the Envoy proxies understand. I'm going to show a specific example in a moment.

This, as I said, was the first iteration of the Istio service mesh, and the current iteration is still the same architecture. Other tools use different implementations, but it's always a control plane and a data plane made of proxies. Linkerd uses a different proxy and has its own control plane. Cilium, one of the newer entrants in the service mesh space, does this with eBPF: instead of using a proxy per workload, it does the work on the node. But pretty much all service meshes have the concepts of a control plane and a data plane.

Again, specific to Istio, there is also something called an ingress gateway, which is a dedicated proxy responsible for getting traffic from outside the service mesh into the mesh and applying policies on the way in. So you can do JWT token validation, you can terminate TLS before routing the traffic inside, etc. There is also an egress gateway, the same component as the ingress gateway but working on the way out: if a service in the mesh is trying to call an external endpoint, you can force that traffic through the egress gateway and have it enforce certain policies, collect telemetry, do some routing, and so on.

Now let's take a very specific example. This is Istio specific again, but most service meshes have something like it. It's called traffic splitting. Say I have service B version 1, I'm introducing service B version 2, and I want to do a canary deployment: I want to send 95% of my traffic to version 1 and 5% to version 2. In the Istio world there are two objects you use: the destination rule, which declares the subsets, version 1 and version 2, and the virtual service, which says 95% of the traffic goes to version 1 and 5% goes to version 2. This is just one example of the traffic routing Istio can do; it can do much more than this.

And by the way, to be very precise, the reason these things are possible with a service mesh, and the reason we call it a mesh, is that every single container, every single instance of your application, has an Envoy running alongside it. That's what we call a sidecar in Kubernetes. There are as many Envoy containers as application containers, with a one-to-one mapping between them. So when you do something like a canary deployment, because all the traffic goes through the sidecars, you get a real 95/5 split: the sidecar on the way out counts 95 requests going to version 1 and 5 requests going to version 2. And if you update that object later and say, "now I'm confident version 2 is good, I want to send it 10% of the traffic," you just update the values, send them to the cluster, and on the fly the traffic will be split according to the weights you have defined.
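As a hedged sketch, the two objects for that 95/5 split look roughly like this; the host name, subset names, and version labels are illustrative:

```yaml
# The DestinationRule declares the two subsets, keyed on a "version" label.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: service-b
spec:
  host: service-b
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
# The VirtualService splits traffic between those subsets by weight.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
  - service-b
  http:
  - route:
    - destination:
        host: service-b
        subset: v1
      weight: 95
    - destination:
        host: service-b
        subset: v2
      weight: 5
```

Moving the canary to 10% is just editing the two weight values and re-applying the file.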
You can also do fault injection. This is just another example: you can say, "I am introducing a fault, a five-second delay, on one percent of my traffic," so roughly one request in a hundred to this service gets a five-second delay, and I can see how the client reacts when there is a delay on the new version of the service. You can do timeouts; you can do a bunch of interesting things like that.

Istio was created specifically to run on Kubernetes, and most service meshes are really built around Kubernetes. They sometimes claim they can do VMs, and they do, but they are mostly designed for Kubernetes because they leverage a lot of the Kubernetes-native ways of doing things: the control plane is just a pod, the proxy is just a container running inside the pod where your application is, and your configuration is a YAML file, basically similar to how you deploy things and define services in Kubernetes.

So in a nutshell, a service mesh gives you four main benefits: the ability to connect things together, so connecting services to each other and doing service discovery out of the box; the ability to secure traffic by introducing mTLS; the ability to control traffic by introducing traffic routing, timeouts, and so on; and then observability, so you can collect a bunch of telemetry, store it somewhere, and observe how the mesh is behaving.

Now, what are the drawbacks? What are the four main challenges that, in my opinion, service meshes still have today? One of them is capacity and resources: a service mesh runs proxies alongside your application containers, and those proxies have a footprint; they consume CPU and memory, and you have to keep that in mind. There is network latency: wherever there is a proxy, there is network latency, and there is no way around it. There are certain challenges with how you design and architect a service mesh so that it's scalable from day one. And finally, once you have a service mesh deployed, it will give you a bunch of data, telemetry, and logs, and in order to take advantage of that information you will need to deploy extra things, which is what I call auxiliary infrastructure: extra pieces of software to collect telemetry, store it, visualize it, and so on. I'm going to walk through each of those separately. I'm realizing I'm going fast, so I'll try to slow down a little bit.

So, capacity and resources. This is just a snapshot from the Istio benchmark on the latest version; almost every service mesh tool that exists publishes some sort of benchmark, you can look them up, Linkerd has one, Cilium has one, and the tool used to run the benchmark is itself an open source service mesh benchmarking tool that can be run against pretty much any service mesh. In the case of Istio specifically: if you have a mesh doing 70,000 requests per second across 1000 services, so basically 1000 application containers, you will have 2000 containers in total, 1000 application containers plus 1000 Envoy sidecars. At a thousand requests per second per service, each proxy consumes about 0.35 vCPU and 40 megabytes of memory. At single-service scale that doesn't sound like much, but at thousand-service scale that adds up to roughly 350 vCPU and about 40 gigabytes of memory just for the proxies.

So the resource footprint that the sidecars consume is quite significant, and you have to keep it in mind, both when you are designing, because when you run in the cloud the cloud is not unlimited, you have to do capacity planning and make sure you have enough capacity to run your workloads, and also in terms of how much you are going to spend on running a service mesh.
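A small side note on that footprint: Istio does expose per-pod annotations for tuning each sidecar's requests and limits, which at least lets you right-size the overhead per workload. A hedged sketch, with annotation names as documented at the time of writing and illustrative values; check them against the Istio version you run:

```yaml
# Only the relevant fragment of a workload's pod template is shown: these
# per-pod annotations override the default sidecar resource requests/limits.
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/proxyCPU: "100m"
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyCPULimit: "200m"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
```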
So there is basically a cost-to-benefit analysis you have to do: how much am I actually going to spend on running this versus the benefits I'm getting from it? And just a side note, about 80%, and this is a very rough number, I don't have exact data, but about 80% of the projects I worked on where a service mesh was introduced, it was introduced to do mTLS. Customers go, "oh, this thing will do mTLS for me, let's use it." So they install a service mesh that offers a lot of features, they use it only for mTLS, and they end up spending quite a significant amount of resources on a single, very simple feature. That's something you have to keep in mind. Then istiod, which is the control plane, consumes 1 vCPU and 1.5 gigabytes of memory per instance. It does scale out, because it's stateless, so you can run multiple replicas if you have a big mesh; of course, the bigger the service mesh, the bigger your control plane is going to be.

Latency. Again, these are just numbers from Istio, but I bet you can find numbers for other service mesh tools: between 1.7 and 2.7 milliseconds added at the 90th and 99th percentile. Again, 1.7 milliseconds is probably not that significant, but in some applications it has been. In my experience, one of the most common cases where latency was a problem was when you deploy a tool that fires up containers very quickly. A simple example: Argo, which is actually a community with multiple pieces of software; they have a CI/CD tool, but they also have a data transformation or data orchestration tool. It essentially lets you deploy a data orchestrator that triggers on events, does some transformation on the data, stores the results, and dies. So it's a control plane that just fires up containers to do specific things. When we used that with the Istio sidecar, those 1.7 to 2.7 milliseconds became a huge problem, because suddenly an operation that takes 20 milliseconds takes 22 milliseconds, or a container that only takes 60 seconds to run takes two minutes to run.

The third one is design and architecture. This is a simple example: one advantage of a service mesh is that you can build a global service mesh, so you can have multiple clusters in multiple zones and one logical mesh spanning across all of them. The reason you might want this: imagine you are the FIFA World Cup and you have your website where you sell tickets. You may want to deploy all the services in all the regions so that user latency is low, but at some point you want to take down one microservice, let's say this blue microservice in Japan.
You want to take it down to replace it. You can still instruct the service mesh: if the local microservice is not available, route to the nearest one. The problem with this kind of design is that it's an all-or-nothing choice. You have to decide on day one whether you want it, because if you don't, and you end up designing for a single region, expanding into multi-region later can be a problem, and vice versa. If you start on day one saying "I want to do multi-region" and then suddenly you say, "actually I don't care, I just want my data to stay in Europe because, you know, we don't like the Americans," that could be a problem too. So the point here is that the design and architecture choices can be difficult to make in the beginning, and hard to change going forward. Different service meshes are trying to solve this problem in different ways, but they all have drawbacks, so again it's a balance you have to strike.

Last but not least is the auxiliary infrastructure I talked about. Istio runs on Kubernetes, that's the main story, but now you need to visualize your service mesh. There is a tool called Kiali, which is a GUI that lets you visualize the mesh and see which services talk to each other; that's a workload you have to run, you have to spend CPU and memory on it, you have to maintain it and patch it, etc. You have Jaeger for tracing; again, software you have to install and maintain. You have Prometheus and Grafana for monitoring; install them, maintain them, etc. So all these extra pieces of infrastructure that you deploy in order to take advantage of what the service mesh gives you out of the box are things you have to pay for, maintain, have people take care of, patch, and so on. That's something you have to keep in mind.

So, in a nutshell, these are the why-nots in my opinion. Don't take a simplistic approach; don't just say "we need security." This comes from a very typical conversation I had with customers when I was a consultant: "Can this thing do security?" Yes. "Can it do mTLS?" Yes. "Is mTLS a pain to manage without a service mesh?" Yes. "So let's use it." A very simplistic approach, without taking into consideration all the stuff I talked about. Sometimes your service mesh is not compatible with your application; the Argo case I talked about earlier is one simple example. And by the way, if you are on Kubernetes, you know this common problem which hasn't been solved yet: if you have a pod with multiple containers, you cannot decide in which order they start; you cannot orchestrate the startup order yet. So one of the problems with a sidecar is that the sidecar has to be ready and operational before your application can actually make network calls. If you have an application that tries to resolve DNS as part of its bootstrap, because it's connecting to a remote configuration server, and the Envoy isn't ready yet, your application will not be able to talk to the network, and it might just crash and restart. We have seen this happen before.
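One mitigation worth knowing about for that specific startup-ordering problem, hedged because it is an Istio setting you should verify against the version you run: you can ask Istio to hold the application container until the proxy is ready, mesh-wide or per pod, roughly like this:

```yaml
# Only the relevant fragment of a pod template is shown: this annotation asks
# Istio to delay starting the application container until the sidecar proxy
# reports ready. Field name as documented; verify against your Istio version.
metadata:
  annotations:
    proxy.istio.io/config: |
      holdApplicationUntilProxyStarts: true
```

It covers the "application boots before the proxy is ready" crash loop described above, though not the opposite case of a sidecar outliving a short-lived job.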
You also have to have both theoretical and practical knowledge of how this thing works, because when it breaks it is very difficult to debug. It will increase your technical debt, that's for sure, and it will increase your operational complexity. So those are the why-nots.

Now, what does the future hold for us? We, or rather Google, have announced this thing called ambient mesh. What ambient mesh is essentially trying to do is split service mesh functionality into two layers and say: if you only want mTLS, we can give you mTLS via a very simple overlay, which is just a very simple proxy. Then, if you optionally want to enable L7 processing features, like traffic management, extra security, and so on, that's an additional feature you turn on. It's trying to solve the problem of Istio being this thing where you either use all of it or nothing.

The way it will work, and this is coming from the blog post that announced ambient mesh, by the way, is that they are splitting the functionality into two different proxies. There is a node-level proxy called the ztunnel; it's a lightweight layer 4 proxy, and what it does is only mTLS. It intercepts traffic coming out of the containers, encrypts it into an mTLS tunnel, and sends it to the ztunnel on the other end. Then, if you want to enable L7, say you deploy a virtual service like I showed earlier, there will be a namespace-scoped proxy, not a sidecar proxy anymore, so there will be a significant decrease in footprint compared to all the sidecars. That proxy will do the L7 functionality for the namespace: all the containers in the namespace share one proxy instead of each having its own sidecar.

And with that, that concludes it. I have a blog post that goes a little more into the details with very specific examples, so feel free to check it out if you care. I think that's it for me. Thank you very much. I hope you have questions; I saw some sleepy faces. Yes, just wait for the microphone.

"Great talk. I'm wondering, what would you recommend for mTLS if not service meshes?" So Cilium solves that with eBPF today. It's quite new; I tried it, it works very well, so that could be one option. You don't want to do mTLS manually, right? The TL;DR is you don't want to do that by hand. And I have seen that cert-manager is trying to solve the same problem, because cert-manager effectively distributes certificates; I don't know if it's still in preview or not, but they were working on the same thing, which is distributing certificates to workloads.

"Yeah, I went to the Cilium talk today. Is it mature enough?" I would assume no; it was released back at KubeCon Valencia in May, so it's probably still not that mature. Even the ambient mesh announced by Istio is still in preview, so I would probably give it a year or a year and a half before it becomes mature and stable enough to be used in production. Makes sense?

"Yeah, I guess it makes sense, so we're stuck with service meshes still." Yes, yes, exactly. Good point.

Any other questions? Do you have any questions online? No? All right, well, thank you very much. I hope it was not too long.