Okay, good, let's start, I already have just 25 minutes left, so let's get going. So, hello everybody, really happy to be here. My name is Or Weinstein, I'm VP of Product at Lumigo. Lumigo helps developers troubleshoot their applications in production. We'll talk a little bit more about how we do that in this talk. And today I'll be talking to you about how we built our Kubernetes operator. A little bit about myself, I started my career as a developer in the Israeli military, spent some time in product management at AWS and Google Cloud, and today I'm VP of Product at Lumigo. So, what will we be talking about today? We'll spend the majority of our time talking about how we built our Kubernetes operator and what it does. But, to get you all there, we have to start with some context. So, what is distributed tracing? Why is it so hard to get right? And why did we go to the lengths of building everything that I will show you today? So, where do we start? We start at the beginning. A few years back, most applications were built as monoliths, right? They have many disadvantages, which is why we switched at some point to microservices, but they have one glaring advantage. That advantage is they are much easier to troubleshoot. They run on machines or VMs, you use logs and metrics, they all come from the same place, relatively easy; nothing is easy in production, but relatively easier. Now, we rightfully switched to building microservices, which have many, many advantages, I won't go through all of them right now, but one glaring disadvantage. That disadvantage is they're much harder to troubleshoot. Now, they are much harder to troubleshoot because virtually any flow in most modern applications consists of multiple services talking to each other.
Now, if you look at the graph on the screen, there is one node colored red at the bottom, and that's where the error is. Now, the root cause for that error isn't, in many cases, in that same service. It could be 5, 10, 15 services upstream from that in that request flow, right? So, we have an error downstream. You have the root cause somewhere upstream, and good luck trying to find that root cause using logs and metrics the conventional way. It's much, much, much more difficult. And that is why distributed tracing came into the picture. And for those of you who are not familiar, distributed tracing basically provides visibility into that entire request flow, end-to-end. So, a visual representation of those services, which service is calling which service, the latency, metadata, payloads, a lot of things that you can attach to these services to be able to visualize that request flow and debug it in a much easier fashion. Now, how do you do this? At a very high level, you use something called a trace context, which gets created once upstream, ideally, and then gets propagated downstream to all these services, to be able to correlate and create that visual picture somewhere in some back end. It can be Lumigo, it could be another system. Now, for that to be possible, each service has to have a tracer, right? Some library, some means of collecting the data, processing it, and then transferring it to some back end system, right? Now, distributed tracing disproportionately benefits from network effects. In other words, the more you trace, the better the insights, the better the visibility, the more holistic a picture you get for your application, for your Kubernetes cluster, in this case. Now, that sounds easy, but imagine you have a cluster, multiple namespaces, multiple objects, different cron jobs, jobs, deployments, daemon sets, et cetera. Hundreds of these objects running in your cluster.
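The talk doesn't name a specific format for that trace context, but the most common standard today is W3C Trace Context, carried between services as an HTTP header. The value below is the example from the spec itself, shown here just to make the idea concrete:

```text
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                 |
             |  |                                |                 +-- flags (01 = sampled)
             |  |                                +-- parent span id (16 hex chars)
             |  +-- trace id (32 hex chars), identical across the whole request flow
             +-- version
```

Every service in the flow passes this header along (creating a new span id for its own work), which is exactly what lets a back end stitch the hops into one end-to-end picture.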
You have to go one by one, you have to instrument these objects with the right tracers, right? If it's Java, you have to have the Java tracer. If it's Node, you have to have the Node tracer, et cetera. And not only do you have to do this for all your existing objects, pods, I'll just say pods for simplicity, in your cluster. You also have to build a process to make sure any new pod, any new application that spins up in your cluster gets instrumented as well, right? So if some developer somewhere in your team spun up an application, didn't go through the proper channels, didn't trace it properly, good luck when that application now has a problem in production. And that's what we experienced with some of our customers that came to us. So we saw that in many cases, you think you have everything covered, but when you have a blowout in production, you actually are missing that critical piece, that critical instrumentation, that critical trace from that application where things are blowing up. And we saw this enough times that we made it our mission to basically say, we want to make it stupid simple to trace everything. We don't want it to be complex, we don't want to have our customers add lines of code. We don't want it to be any harder than a few simple operations that get you up and running in a few minutes. So that was our mission. We basically wanted our customers to be able to say, Lumigo, trace me this thing. In Kubernetes, we elected for the namespace construct. So trace me this Kubernetes namespace, and be done with it. So we did some research on how best to go about this mission. And we stumbled upon the Kubernetes operator. How many here in this room, raise your hand if you've heard of what an operator is, you've dealt with it, you've played with it? Okay, most of you, great. So very briefly, an operator is just a means of extending Kubernetes functionality.
And another benefit is it's very easy to reuse and share with other developers, with other teams, companies, etc. Now, operators are used out there for many, many different reasons. They have many benefits. For our intents and purposes, we liked the fact that they basically help automate complex manual tasks and operations. That's what we wanted to do for our customers. So we built the Lumigo operator, and in the remaining time I have with you here, the next 18 minutes according to this clock, what we'll do is walk through what the operator does for our customers, and then how we built it. What's the magic underneath? So at a very high level, and we'll dive deeper in a second, the operator does three things. The first thing is it instruments existing pods. So all the pods, the objects you have in the namespace that you chose to trace, it automatically injects them with the tracers. Now, it also makes sure that any new pod that gets spun up gets injected with the right tracers as well. And lastly, it's very, very seamless and easy to clean up, meaning if you decided you don't want this namespace to be traced anymore, it's very easy to untrace everything, and Lumigo takes everything away and leaves no trace that it was ever there. So what does it actually look like? What does a user have to do to get a namespace traced? Two things. The first is install the operator using Helm. So helm repo add, helm install, very seamless, very easy, just like any other Helm chart you would go about installing. So first is a helm install. That basically takes the operator and puts it in your cluster, right? Now, the next thing the user has to do, and you probably can't see anything here, but I'll just talk about it, is to add a custom object, a custom resource that we created, to the namespace you want to trace. All right? Around 12 lines added to your YAML, or through kubectl apply, whichever method you prefer, but add the custom resource to the namespace.
You also have a Kubernetes secret here, basically just to keep the Lumigo credentials, the token we use to authenticate to the Lumigo back end. But that's it. You're done. So you helm installed, you added a custom resource to your namespace, and now your namespace is traced. Now, what does that mean? Think of this slide as your cluster. The Lumigo operator is installed in your cluster. You then have your namespace where you put the Lumigo object in, and all the Kubernetes objects, all the pods, get automatically instrumented with the right tracer. So if the pod is running a container running Java, you'll get a Java tracer. If it's Node, you'll get a Node tracer, etc. Now, now it gets interesting. So how does this work? We'll spend the remaining time answering these four questions. The first one is, how do we get the tracers into the containers? That's the basic first thing we have to do, get the tracers into the containers. But that's not enough. Once we get the tracers into the container, you have to load them into the process. So how do we activate them? How do we do the equivalent of Node.js's require and Python's import? How do we load the package into the process so it can be used? Then, how do we make sure every new pod that spins up gets injected with those tracers? And lastly, how do we clean up? So let's start. Here is a kubectl describe of a one-liner Python container. Very, very simple. Nothing special here. I cleaned it up a bit so it fits on this slide, but very basic. Now, once this application gets injected by the Lumigo operator with our tracer, this is how the kubectl describe is going to look. Now let's walk through the interesting components here. The first interesting component is we add an ephemeral volume. An emptyDir volume. Now, just to remind everybody, an ephemeral volume is a volume whose lifespan is the lifespan of the pod. Unlike a persistent volume, this volume dies with the pod.
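The two-step setup described above might look roughly like the following. This is a sketch for illustration only: the chart name, resource kind, and field names here are assumptions, not necessarily the operator's real schema.

```yaml
# Step 1 (shell, not shown as code here): install the operator chart, e.g.
#   helm repo add lumigo <chart repo URL>
#   helm install lumigo lumigo/lumigo-operator
#
# Step 2: apply a Lumigo custom resource (plus a secret holding the token)
# to the namespace you want traced. Kind and fields are illustrative.
apiVersion: operator.lumigo.io/v1alpha1
kind: Lumigo
metadata:
  name: lumigo
  namespace: my-app          # the namespace to trace
spec:
  lumigoToken:
    secretRef:
      name: lumigo-credentials
      key: token
---
apiVersion: v1
kind: Secret
metadata:
  name: lumigo-credentials
  namespace: my-app
stringData:
  token: "<your Lumigo token>"
```

The design point the talk makes is that this is the entire user-facing surface: one chart install per cluster, one small object per namespace.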
So we add this volume. Then we add an init container. And an init container is a container that starts and completes before the app containers even start. And this init container has an image, and that image contains our tracers, the Lumigo tracers. So the init container spins up with the tracers, it copies the tracers into the ephemeral volume that's mounted to it, the ephemeral volume we saw a second ago, and then it completes. Now, the same ephemeral volume is also mounted to the app containers. So the app containers start with the tracers in them. Now, there's an interesting question here, which is, how do we know which tracers to copy into the containers, or into the ephemeral volume? How do we know if it's Java, Node, Python, etc.? The answer is, we do not. So we copy all the tracers. Now, we'll get to a point where we actually do care which runtime is running in the container, but this is not it. Here we don't care, it doesn't really matter, we just copy all the tracers into the ephemeral volume, no real downside here. And lastly, we add three environment variables. The first one is the LD_PRELOAD env var, with a path to the Lumigo injector file, which we also copy into the ephemeral volume, and we'll get to that later in this talk. We have the Lumigo tracer token, which is just the means of authentication to the Lumigo back end, and we have the Lumigo endpoint, which is the endpoint of a service in the operator, an OTLP collector that receives, processes, and exports the trace data right to the Lumigo back end. So, how do we get the tracers into the containers? Covered. Now we're getting gradually more interesting as we progress here. How do we activate the tracers? How do we load the tracers into the process?
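Putting the pieces together, the mutated pod spec might look something like this. A hedged sketch, assuming names: the image name, mount paths, service name, and env var names are invented for the example, not the operator's actual output.

```yaml
# Illustrative pod after injection: an emptyDir volume, an init container
# that copies all the tracers into it, the same volume mounted into the
# app container, and three environment variables.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  volumes:
    - name: lumigo-injector
      emptyDir: {}                      # ephemeral: lives and dies with the pod
  initContainers:
    - name: lumigo-injector
      image: lumigo/injector:latest     # ships every tracer plus the injector .so
      volumeMounts:
        - name: lumigo-injector
          mountPath: /target            # copies everything here, then completes
  containers:
    - name: my-app
      image: my-app:latest
      volumeMounts:
        - name: lumigo-injector
          mountPath: /opt/lumigo        # app container starts with the tracers in it
      env:
        - name: LD_PRELOAD
          value: /opt/lumigo/lumigo_injector.so
        - name: LUMIGO_TRACER_TOKEN
          valueFrom:
            secretKeyRef:
              name: lumigo-credentials
              key: token
        - name: LUMIGO_ENDPOINT
          value: http://lumigo-otlp-collector:4318   # OTLP collector in the operator
```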
So, as we said before, we don't want our users or customers to touch their code as they do this. So luckily for us, there is a way, every runtime is different, but in every runtime there is a way to basically load packages through environment variables. So that's what we do. And as an example, let's look at Node. Node, version 8 and up, has the NODE_OPTIONS environment variable, and we can basically append to it -r lumigo/opentelemetry, where -r is short for the require statement. And so adding this statement to the NODE_OPTIONS environment variable is effectively like writing the line below, just like writing lumigo = require(...) with the path to the package. So this is an example for Node. We do this for Python, we do this for Java as well. Now, how do we set the right env var on the right process? And this time it's not rhetorical, we actually do care this time, because unlike before, this is visible to the user. We don't want our users to start debugging a pod that is running a Python process, and for them to find on it a Node environment variable, the NODE_OPTIONS env var, and start thinking to themselves, why is this here? What's wrong? Does this have Node? Does this have Python? This is confusing. So it sounds trivial, but we had to find a way to solve this. We can't add all the env vars; we have to add the env var for the specific runtime that is running inside that pod, that container. So to understand how we do this, we have to dive in a little bit into how runtimes like the Java virtual machine, CPython, and Node actually work. A quick detour, but we'll come back very shortly, I promise. So there are two major ways to build applications: there are statically linked applications and dynamically linked applications. Statically linked applications result in an application, a binary mostly, that is self-contained. It includes all the packages, all the code, all the libraries that it needs in order to run. It does not rely on anything external. Dynamically
linked applications are different. These are how most applications are built, and they actually rely on external shared libraries, on DLLs. Now, why do we care? Why am I telling you all this? Because, and we're getting there, there is a library called libc, the C standard library. This library was developed initially as a helper library for C applications, but very quickly it spread into virtually all operating systems, and virtually every runtime relies on it. Now, what does libc do? libc has basic operations, so mathematical operations, string operations, memory management, I/O, and it also has a very interesting function called getenv. getenv is a function within libc that gets as a parameter the key of an environment variable, and returns its value. So effectively, any application uses getenv to request the value of an env var. Now, you can start to see where I'm going with this. If we had a way to hook into libc, and basically control the value that it gets and returns, we would be able to set the right env var for the right process. Now, the way we do this is, if you remember, in that pod spec that I showed you, we had the LD_PRELOAD environment variable pointing to the Lumigo injector. That is basically telling the process to load the Lumigo injector .so file ahead of libc, so our version of getenv takes precedence. And so what happens here is, the Node.js process spins up, it looks up the NODE_OPTIONS env var, in this case assuming this is a container running a Node application. It makes a call to the Lumigo injector, as opposed to libc, when it calls getenv with the NODE_OPTIONS key. The injector looks up the actual value, it appends to it -r lumigo/opentelemetry, and it returns that complete value, the original NODE_OPTIONS value appended with -r lumigo/opentelemetry. And the end result is that it's effectively like the user had written require lumigo/opentelemetry in the code. Now, keep in mind, if the Lumigo injector gets a
call to getenv with any other env var, one that we don't care about, it just gets the actual value and passes it along, so it doesn't do anything to it. But if and when it gets NODE_OPTIONS, PYTHONPATH, etc., something we care about and need to alter in order to load our tracers, then we append the value and return it to the process. Okay, so this is how we activate the tracers. Now, how do new pods get automatically injected? How does that process look? So we use something called an admission controller mutating webhook, and this is basically a construct in Kubernetes that lets you modify incoming requests to the Kubernetes API before they are acted upon. So what happens is this: a request to create a new pod or deployment comes into the Kubernetes API server, which says, wait a second, I want to see if the admission controller mutating webhook wants to do something about it. It refers to us, we instrument the template to insert all the stuff we talked about, init container, ephemeral volume, all that stuff. Once we finish, we signal to the Kubernetes API server that we're done, and the Kubernetes API server proceeds with creating the new object. Now, keep in mind, this is a blocking request, and we want to make sure that we don't break anything in the cluster, so we do two things. First, we have a continue-on-fail setting, so if we aren't able to do our operation successfully, it doesn't affect the creation. And also, we have a timeout of five seconds, to ensure the same thing. Okay, so this is how we get new pods automatically injected. Now, how do we clean up after ourselves? We use a kube API watch. Here we elected for a non-blocking operation. Our operator sets up a watch on the namespace where the Lumigo object is set up, and basically what we do here is, once the Lumigo object gets deleted, we get a notification, we watch for that. The operator goes and uninstruments all of the pods, it goes and removes, pod by pod, all the Lumigo tracers, and eventually it also removes the watch.
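The "continue on fail" setting and the five-second timeout mentioned above map onto standard fields of a mutating webhook configuration. A sketch with assumed names; the two fields that carry the talk's point are failurePolicy and timeoutSeconds:

```yaml
# Illustrative MutatingWebhookConfiguration; names, paths, and rules are
# assumptions, not the operator's actual manifest.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: lumigo-injector-webhook
webhooks:
  - name: injector.lumigo.io
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore     # "continue on fail": never block object creation
    timeoutSeconds: 5         # give up after 5s, for the same reason
    clientConfig:
      service:
        name: lumigo-operator-webhook
        namespace: lumigo-system
        path: /mutate
    rules:
      - apiGroups: ["", "apps", "batch"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods", "deployments", "daemonsets", "jobs", "cronjobs"]
```

Since an admission webhook sits in the blocking path of every matching API request, defaulting to Ignore plus a short timeout is the conventional way to make instrumentation a best-effort add-on rather than a single point of failure for the cluster.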
So step by step, basically, we're removing every trace of Lumigo from that namespace where we deleted the object. Okay, so we're done, made better time than I expected. Quick summary: getting the tracers into the container, a combination of init containers and an ephemeral volume. How do we activate the tracers? We use a trick existing in every runtime, where we alter the env var, using LD_PRELOAD to basically inject the Lumigo injector into the getenv operation. How do new pods get automatically injected? We use the admission controller mutating webhook to do that. And we use kube API watch notifications, very simple, to clean up after ourselves. Now, I hope this added a little something for each of you on how you can automate, and make seamless, operations which would otherwise be relatively complex, using a bunch of Kubernetes constructs. And we're out here on the floor, so if you have any other questions, happy to field them, and enjoy the rest of the day. Thank you.