Hello everyone, and welcome to the talk titled "Life Without Sidecars: Is eBPF's Promise Too Good to Be True?" Today I'm going to present my opinion on why the sidecar model is here to stay and why it is, in fact, the future of the service mesh.

So first of all, who am I? My name is Zahari Dichev and I work for Buoyant, the creators of Linkerd. If you want to reach out, ask questions, or connect, you can use any of these channels.

The agenda of this talk is quite packed, and I hope we have enough time to cover everything. First, we're going to focus on what a service mesh is and why you might need one, and we'll look at the three main pieces of functionality a service mesh provides. Then we'll take a look at what Linkerd is. We'll also give a bit of an overview of eBPF, in particular how it relates to cloud native networking, and then we'll ask ourselves whether eBPF can actually replace the sidecar in a service mesh. We're also going to evaluate a couple of other models that are being used in production.

I tend to think of service meshes in terms of three main pieces of functionality that you get when you adopt one. First, you have observability, which is something you want in a distributed system because you really want to know what's happening in your services. A service mesh gives you golden metrics across all of your services: things like HTTP- and TCP-level metrics, request sampling, and so on. On the reliability front you get retries, circuit breaking, automatic canary deployments based on the success rate of your new rollout, and all of these nice things that keep your systems running reliably. On the security front you get automatic mTLS, in most cases across meshed workloads, and network traffic policies that can express very rich rules at layer 7 of the protocol stack.

Linkerd is a service mesh that's very simple to use and comes with sensible defaults, in order to get you up and running as quickly as possible and get you to production in a relatively short amount of time. It uses a purpose-built micro-proxy that has been specifically designed to run as a sidecar. It's also a CNCF graduated project that benefits from a thriving open source community; you can convince yourself of that by joining our Slack or checking out the open issues on the GitHub page. So, really, let's look at the high-level architecture, which for the most part is the same across all service meshes.
Linkerd in particular consists of two main components. One is the control plane, a set of services that runs in your cluster and is responsible for serving TLS certificates to the proxies, for service discovery, for facilitating identity-aware policy, and generally for providing all of these rules to the proxies so they can work with your services. Then you have the proxy, or the data plane, which in the sidecar model is injected into each one of your meshed workloads. In most cases this proxy is transparent. It provides things like metrics export, so you can scrape the proxies and feed the metrics into your observability pipeline; latency-aware load balancing happens at the proxy level; and the proxies are where automatic mTLS and TLS termination are implemented. In Linkerd, for example, you also get other nice things such as an on-demand diagnostics API that allows you to tap and observe the requests flowing through a particular proxy.

If we want to express all of that in a diagram, the life of a request through meshed workloads looks a bit like this. Each pod is injected with a proxy and an init container. The init container is responsible for setting up the iptables rules so that all incoming and outgoing traffic goes through the proxy. When a request originates in your application container, it's intercepted by the proxy. The proxy will, most of the time, parse the protocol and provide protocol-aware application metrics, and it will use its TLS certificate to do a handshake with the proxy on the other side and establish a TLS session. That ensures both peers know who they are talking to, and that they are talking to who they are supposed to, and the data flowing through the connection is encrypted. When the request arrives at the other side, the proxy there terminates TLS and forwards the data to the application container. That's how the sidecar model in a service mesh works at the moment.
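To make that data path a bit more concrete, here is a minimal Go sketch of the forwarding step an outbound sidecar performs: accept a plaintext connection from the local application, originate a TLS session to the remote proxy, and copy bytes in both directions. This is purely illustrative; Linkerd's actual proxy is written in Rust and does far more (protocol detection, metrics, load balancing), and the addresses and certificate paths below are made-up placeholders.

```go
// Toy outbound sidecar: accept plaintext from the local app,
// originate TLS to the remote proxy, and shuttle bytes both ways.
// Illustrative only; addresses and file paths are placeholders.
package main

import (
	"crypto/tls"
	"io"
	"log"
	"net"
)

func main() {
	// The workload certificate this proxy presents to its peers.
	cert, err := tls.LoadX509KeyPair("proxy.crt", "proxy.key")
	if err != nil {
		log.Fatal(err)
	}
	// The init container's iptables rules would redirect the app's
	// outbound traffic to a local listener like this one.
	ln, err := net.Listen("tcp", "127.0.0.1:4140")
	if err != nil {
		log.Fatal(err)
	}
	for {
		app, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(app net.Conn) {
			defer app.Close()
			// Originate TLS to the proxy on the other side, presenting
			// our certificate so the peer can authenticate us.
			upstream, err := tls.Dial("tcp", "peer-proxy.example:4143", &tls.Config{
				Certificates: []tls.Certificate{cert},
				ServerName:   "peer-proxy.example",
			})
			if err != nil {
				log.Println("dial upstream:", err)
				return
			}
			defer upstream.Close()
			// Copy application bytes in both directions over the
			// encrypted connection.
			go io.Copy(upstream, app)
			io.Copy(app, upstream)
		}(app)
	}
}
```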
So now let's turn our sights towards eBPF, look at what this promising technology gives us, and see whether it can allow us to improve on this sidecar model and generally benefit from a technology that has been quite hot in the cloud native networking space. eBPF has been compared to JavaScript for the kernel: it's essentially an interface for making the kernel programmable. It's an event-driven programming model that allows you to run custom code in kernel space, which can be incredibly efficient in certain situations. It allows developers to avoid costly data transfers between user space and kernel space, to minimize context switches and syscall latency and cost, and overall to eliminate a lot of the overhead you pay in certain scenarios. It's also safe, because the programs that can be written and run in the kernel are fairly limited in nature: each program needs to pass a quite stringent verifier in order to prove that it's safe to execute in the kernel. That's a good thing to have, albeit a bit limiting.

So where could all of that be useful in a networking context? Take, for example, a traditional reverse proxy. A request rarely goes directly from the client to the server; most of the time there's a load balancer or a firewall sitting in between, and the request needs to go through it. In a typical proxy or load balancer, the request reaches your socket at the kernel level, and then the data is copied to user space, where some logic decides whether the request should be rejected or where it should be routed. Once that decision is made, the data is copied back to kernel space and travels on to its destination. That's fairly inefficient, because you end up paying the cost of executing system calls and of transferring data between kernel and user space. How could eBPF help with that? Well, there are implementations of proxies out there that effectively use eBPF to do all of this decision-making in kernel space, avoiding the costly transfers and the intermediary step of going to user space. This has been incredibly successful in certain scenarios, and there are applications benefiting from this use case quite heavily, so it's actually great.

So now we have this programming model that can run in the kernel, and we need to ask ourselves whether we could use it to make the sidecar go away. I mean, after all, that's why we're here, right? To get rid of the sidecar; nobody wants it. But before that, let's think about some of the limitations that are inherent to the eBPF programming model, for good reason. First of all, eBPF programs aren't allowed to block, so you can't just wait on an arbitrary condition to become true or false for an undefined amount of time and proceed after that. There are no unbounded loops. Programs are limited in size, and the verifier needs to be able to evaluate all possible execution paths of a program. On top of that, there's very limited state management, so you can't have arbitrary blocks of data living in the kernel. With all of that said, some of the things a modern service mesh is supposed to do become quite hard to implement in eBPF. Things like mTLS handshakes, retries, timeouts, and circuit breaking are all hard, because they require a lot of complicated state management, and at this point in time that's very hard to do in kernel space. Any layer 7 protocol parsing, even where it's possible, raises the question of whether the kernel is the right place to do it. And even if you can, there's the challenge that comes with debugging and troubleshooting eBPF programs: you don't benefit from the rich toolchain you get with higher-level programming languages.

So now let's take a look at three particular features that a modern service mesh provides and think about what it would take to implement them in eBPF. Is it possible, and even if it were, does it actually make sense?
First of all, let's take a look at latency-aware load balancing. The way that's achieved in Linkerd is that each proxy talks to a destination service, and this destination service knows about the endpoints of all of your network targets on the cluster. The proxy then maintains a set of these endpoints internally. It's only natural that some endpoints will exhibit very low latency while others exhibit higher latency; the load balancer internally takes note of that and runs an algorithm that results in more requests being sent to faster endpoints and fewer to slower ones. If you really think about it, doing that in eBPF would be quite complicated. There's quite a lot of state you need to account for in order to do it, and implementing such an algorithm purely in kernel space today is fairly hard; it just doesn't fit the programming model that eBPF uses.
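To give a feel for the kind of state and logic a proxy keeps in user space for this, here is a small Go sketch of latency-aware endpoint selection using a per-endpoint exponentially weighted moving average and a power-of-two-choices pick. It's a simplified illustration of the general idea, not Linkerd's actual implementation, and the endpoint addresses and smoothing factor are invented for the example.

```go
// Simplified latency-aware balancer: keep an exponentially weighted
// moving average (EWMA) of observed latency per endpoint and prefer
// endpoints with lower estimates. Illustrative only.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type endpoint struct {
	addr string
	ewma float64 // smoothed latency estimate, in milliseconds
}

// observe folds a new latency sample into the endpoint's EWMA.
func (e *endpoint) observe(latency time.Duration, alpha float64) {
	sample := float64(latency.Milliseconds())
	e.ewma = alpha*sample + (1-alpha)*e.ewma
}

// pick uses "power of two choices": sample two endpoints at random
// and take the one with the lower latency estimate. This keeps the
// decision cheap while still steering load towards faster endpoints.
func pick(eps []*endpoint) *endpoint {
	a := eps[rand.Intn(len(eps))]
	b := eps[rand.Intn(len(eps))]
	if a.ewma <= b.ewma {
		return a
	}
	return b
}

func main() {
	eps := []*endpoint{
		{addr: "10.0.0.1:8080", ewma: 5},
		{addr: "10.0.0.2:8080", ewma: 20},
		{addr: "10.0.0.3:8080", ewma: 8},
	}
	// Simulate a few requests: pick an endpoint, pretend we measured
	// its latency, and feed the sample back into its estimate.
	for i := 0; i < 5; i++ {
		e := pick(eps)
		sampleLatency := time.Duration(5+rand.Intn(20)) * time.Millisecond
		e.observe(sampleLatency, 0.3)
		fmt.Println("sending request to", e.addr)
	}
}
```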
Then let's turn our sights to something that's actually very welcome among users of service meshes, and that's identity-based policy. The way that works, most of the time, is that when a proxy starts up it generates a private key and sends a certificate signing request to an identity service, which in turn issues a certificate back to the proxy. The proxy uses this certificate to establish TLS sessions with other proxies. That way you get authentication and encryption on your connections: both peers know who they're talking to, and the certificates of both peers are tied to the identity of the workload. On top of that, the proxies also communicate with a policy service that allows them to enforce very rich layer 7 policies around which requests go where and what is allowed, and that's how you get authorization. A typical identity-based authorization policy looks something like this: you can define very rich, protocol-aware policies at that level. For example, you can say that workloads from this service account can talk only to workloads from this other set of service accounts, on this port, on this particular HTTP route. So we're really talking about layer 7 policy that needs to know about the HTTP protocol; this logic needs to parse the protocol. Even doing the TLS handshake itself in eBPF is quite complicated, because there's so much happening under the hood, version negotiation and all of that. It's just a very hard thing to do at the moment, given the limitations that eBPF has.

And on top of that, there's other functionality that's quite hard, and my favorite is retries. Retries are something people often underestimate; it's genuinely hard to implement them. Really think about what happens when a proxy, or any system, needs to be able to retry requests. A request comes from your application container and goes through the proxy, and the proxy fires the request at the server. It could very well be the case that this request fails at the server. So now you want to retry it; well, think about what needs to happen under the hood. You actually have to have buffered that request until you know whether it has succeeded or not, and only then can you potentially retry it. So you have some condition, the request succeeding or not, and you don't know how long it will take to get an answer; you also have some arbitrary chunk of data that you need to buffer somewhere, and you possibly don't know the size of that data either, because there are streaming requests, chunked encoding, and whatnot. You need to think about all of these things and determine whether this could actually be implemented in eBPF, where you have so many requirements in place to keep the kernel running. This is simply a programming problem that isn't very well suited to the model eBPF follows.
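As a rough illustration of the buffering and state that retries imply, and that a user-space proxy handles routinely, here is a Go sketch that buffers a request body up to a cap, replays it on failure, and refuses to retry once the body is too large to hold. The size limit and retry policy are arbitrary choices for the example.

```go
// Sketch of why retries require buffering: the body must be held in
// memory (up to some cap) so it can be replayed if the first attempt
// fails. The limits and retry policy here are arbitrary.
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
)

const maxBufferedBody = 64 << 10 // refuse to retry bodies over 64 KiB

// doWithRetry buffers the request body and retries the call up to
// `attempts` times on transport errors or 5xx responses.
func doWithRetry(client *http.Client, method, url string, body io.Reader, attempts int) (*http.Response, error) {
	buf, err := io.ReadAll(io.LimitReader(body, maxBufferedBody+1))
	if err != nil {
		return nil, err
	}
	if len(buf) > maxBufferedBody {
		return nil, errors.New("body too large to buffer for retries")
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		req, err := http.NewRequest(method, url, bytes.NewReader(buf))
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a failure we shouldn't retry
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
		}
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	resp, err := doWithRetry(http.DefaultClient, http.MethodPost,
		"http://example.com/orders", bytes.NewReader([]byte(`{"id":1}`)), 3)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```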
But don't get me wrong: we at Buoyant, and I personally, really like eBPF. We think it's a very promising technology that has a lot of benefits and has its place. For example, there are a lot of nice things people are doing with it out there very successfully: dynamic IP routing, layer 3 and 4 packet filtering, very fast firewalls. You can use eBPF for traffic monitoring, application profiling, and debugging, and there are very interesting tools out there that do this sort of thing. But anything that's layer 7 and requires complicated state management is just not a good fit for this particular programming model. There are solutions and service meshes out there marketing the fact that they use eBPF under the hood, but if you look closely you'll see that there is in fact a proxy somewhere. It might not be a sidecar, but there is a proxy, and everything higher up the network stack ends up happening in that proxy. So I'd say that, for now, a service mesh really needs a proxy; it's just a matter of where that proxy is going to live.

Thinking about that prompts the question: can we do something better? Can this proxy live somewhere else, where it takes fewer resources and just makes our lives easier? Well, let's see. First of all, there's the model of a shared proxy per node, which is used by some solutions. Effectively it means you have a single proxy per node, all of the pods on that node share this proxy, and you only talk proxy-to-proxy when you cross node boundaries. This has some advantages, but you also need to be aware of the limitations of the model. It's arguably more efficient if you use a resource-hungry proxy: if you're using a general-purpose proxy such as Envoy, maybe this model is better, and maybe sticking such a proxy into every one of your pods isn't the best way forward. But it comes with a number of disadvantages.

First, you have problems around resource starvation and fairness. If you adopt this model, you need to be aware that you might end up in a situation where, on a single node, there's one pod that uses a lot of the proxy's resources and effectively starves the other pods. You end up with noisy-neighbor problems that you can't really account for deterministically, because you're at the mercy of the Kubernetes scheduler. Then there's the problem of lacking any good means of resource optimization, because with this model you effectively lose the ability to optimize resource consumption at the pod level, which is how Kubernetes has been designed to work. What ends up happening is that the proxy on a node needs resources according to the pods that happen to be scheduled on that particular node, so you can't reason deterministically about the constraints you put on that proxy's resources: the set of pods running on the node is ever-changing. The proxy might need a certain amount of resources now, but tomorrow the Kubernetes scheduler might decide to put other pods on the node, and the resource consumption characteristics of the proxy all of a sudden change. You can't reason very well about any of that. On top of that, there's no real isolation of secret material: you end up holding the private keys and certificates for all of the pods on the node in one proxy, so if there's a breach or a bug and private key material leaks, effectively every pod on that node is compromised. And what scares me the most is the increased blast radius: if this proxy goes down, a whole bunch of pods on the node are affected, and again you don't really have the means to reason about which pods those are at any given time. That makes operability quite hard, because it feels quite scary to upgrade such a proxy.

Then there's another model I've seen being used, the shared proxy per service account. It's sort of similar: you're effectively sharing a proxy per service account. That makes more sense to me on the security front, so you solve the security problem a little better, and there's arguably somewhat improved resource utilization, because you could argue that pods belonging to the same service account are likely to exhibit similar resource consumption characteristics. But again, it presents increased operational risk and doesn't bring many more advantages: it makes deployment hard, you're again dealing with multi-tenancy, and you're dealing with unpredictability about which pods are actually using a particular proxy at any given time.

As opposed to that, I think the sidecar model has a number of advantages you can benefit from. First of all, resource consumption scales with the application.
So really, the more resources your application consumes, the more resources its proxy will consume, and you have a lot more facilities to optimize the resource consumption of the proxy and reason about what limits and constraints you're going to put on that particular proxy. Failure is limited to a particular instance: when something fails and the proxy goes down, all of the native Kubernetes concepts kick in, and facilities like rescheduling, eviction, and OOM handling all work in your favor to alleviate the situation, and you know what actually failed. That also makes maintenance and upgrades a lot easier, because you effectively just roll out or restart your applications and benefit from the Kubernetes deployment model to maintain the proxy as well. On top of that, the security boundary is very clear: everything is contained within the pod, which is the smallest logical unit in Kubernetes.

With all of that said, I keep hearing a number of statements about this model that I would label popular folklore. First, I hear a lot that sidecars waste resources. I don't think that's the case; not all sidecars are created equal. Linkerd's proxy has been optimized specifically to work as a sidecar, and it has an incredibly small footprint. I also hear that sidecars introduce extra latency. The fact is that the latency introduced by sidecars is negligible compared to the latency that most applications exhibit, and if that stops you from using a sidecar, maybe the microservices model isn't entirely right for you. I also hear that the service mesh will soon live in the kernel, because everything is moving into that layer. As we saw, the kernel programming model has quite a lot of limitations that render implementing a lot of the things you would use a service mesh for very hard, if not impossible. And then I hear that multi-tenant proxies are the way forward; but as we saw, they're hard to operate, they raise a number of concerns around security and stability, and, in my opinion, they go against the grain of the Kubernetes and container computing model.

So, in conclusion, I'll say that I think eBPF brings a number of advantages that can be very useful in the cloud native networking space, and we should push and explore this avenue a lot more; I think it has a lot to offer. I'll also say that, unlike some companies that are starting out with the multi-tenant proxy as a model, we've actually already been there: Linkerd 1 ran a multi-tenant proxy, there were a lot of users in production, and we saw firsthand all the operational problems they were experiencing. That's why we decided to move away from that model, and we firmly believe that sidecars are here to stay and are in fact the right model for the service mesh. Thank you for listening, and now I guess we have some time for Q&A.

Right, so the question was whether I've thought about things like confidential containers, because the sidecar model doesn't really work very well in untrusted environments. The truth is I haven't actually looked into that, and it would be an interesting thing to explore.
But no, I don't really have a lot of experience or much to say about that.

Audience: Hello, a question over here on the right. I wonder how many people here use Istio or Cilium, but it seems that what you're arguing against is that the Envoy proxy is very heavyweight, so those solutions have to work around not using sidecars. Couldn't other projects use the Linkerd proxy instead, since it's really lightweight, rather than working around sidecars as if the sidecars themselves were the problem?

I see. Well, I'm not here to advertise Linkerd or its proxy; this was just an objective evaluation of the state of things. But I would say that Linkerd's proxy is lightweight, and it's also open source: anyone can go and see how it's been implemented and the performance optimizations we've done, and experiment with running it in their own service mesh. Sure, that would be great; more adoption is always great. So it's probably a good idea to try that if performance and resource consumption is what bothers you about the sidecar model.

Audience: Okay, and thanks for the presentation. My question is this: with Linkerd, if we use the CNI mode or the init container, it basically creates a chain of iptables rules. Have you observed any issues related to that? iptables, as it is, can introduce delays on updates in some modes of running Kubernetes clusters. Have you seen that on some large deployments, given that you produce basically six iptables rules per sidecar? What's your opinion on it?

Yes, absolutely. We've seen problems with using iptables, especially on larger clusters, and we're actively looking at other options in order to alleviate those problems. At the moment we aren't really focused on that, but I'm aware that the problems exist.

Well, if that's all the questions, thanks everyone for attending, and enjoy Amsterdam.