I hope you've been having a great conference so far. I'm going to be presenting on Jaeger. I'll do a quick project intro and a recap of what Jaeger is and what tracing is, and then my co-maintainer Joe is going to do a deeper dive into how one might deploy Jaeger using the operator.

So what is Jaeger? Jaeger is an open source distributed tracing platform, and it's among the most popular ones. Jaeger graduated from the CNCF. It was started at Uber around 2015 or so and was open sourced a couple of years later.

Who am I? I'm Prithviraj. I'm a Jaeger maintainer. I've been working with Jaeger since 2015 or so, and I continue to work with Jaeger at Uber. Sadly, most of the work I do these days is specific to Uber, so I'm unable to contribute much to open source.

I'm going to start by talking about observability in distributed systems and microservices. Let's transition to this graph. The graph we see here represents a subset of Uber's architecture. The dots, or nodes, are services, and the edges represent a call from one service to another. Uber's microservice architecture is pretty large: it has about 4,000 microservices, and requests routinely touch several hundred of them. So when someone requests a ride or looks up a receipt, it might involve a large number of services being touched or having data requested of them. And this happens very frequently, billions of times a day. In a complicated system like this, tracing really shows its value.

We're transitioning to a timeline view. This is a request from HotROD, which is a Jaeger demo that you may download from jaegertracing.io, and we're going to use it to describe the parts of the UI. On the left we have the service name, then the operation name in gray, and on the right we have this rectangle, which represents the time taken by the operation. This is called a span. In tracing terms, it's the smallest unit of operation you can represent, a small logical unit. On the top here we have a Gantt chart overview, which allows users to focus and zoom into particular regions of the trace. At the top we also get a nice trace summary that tells us how many services were involved, what the depth is, and how many spans there are in this particular trace. If you'd like to try this yourself, a rough recipe follows.
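A minimal sketch of running HotROD locally with Docker, assuming the standard images from the Jaeger project; tags, ports, and environment variables drift between versions (older HotROD builds report to the agent via `JAEGER_AGENT_HOST`, newer ones use OTLP), so treat this as a starting point rather than a guaranteed invocation:

```sh
# Start an all-in-one Jaeger backend (UI on 16686, agent on 6831/udp).
docker run -d --name jaeger \
  -p 16686:16686 -p 6831:6831/udp \
  jaegertracing/all-in-one:latest

# Run the HotROD demo (UI on 8080), pointed at the Jaeger container.
docker run --rm -it --link jaeger \
  -p 8080:8080 \
  -e JAEGER_AGENT_HOST=jaeger \
  jaegertracing/example-hotrod:latest all

# Open http://localhost:8080 to request some rides, then
# http://localhost:16686 to explore the resulting traces.
```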
Another way of visualizing this trace, or a different trace rather, is the single-trace service dependencies view. In this view we see that a particular service calls a bunch of other services, which call a bunch of other services, and so on. This view allows one to orient oneself within a trace, and it's really useful when the architecture is both wide and deep. In this particular case we see that the depth is 16, so the trace is 16 services deep, with 32 services involved in total. In this sort of situation it's useful to find out what's going on and how these services interact with each other.

One of the interesting things with a microservice architecture is the notion of things going wrong, and of detecting how things go wrong. Anyone who's worked with microservices knows that finding where something is going wrong is really, really difficult. Things might be broken in a particular service only for some requests, and you might not even be able to detect it using traditional tools like logs and metrics. Some categories of problems that are undetectable by logs and metrics are things like a service waiting on a dependency to respond, or a service making a larger number of calls to its dependencies than is required, and so on. Once something is known to be broken, we may dig deeper using tools like logs and profiles to determine how best to fix it, and a lot of the time, once we detect which service or service instance is having an issue, it's not that difficult to then fix it.

We're going to transition to a high-level view of how tracing works. Tracing depends on context propagation. The way context propagation works is that when a request comes in through an edge service, which might be an API gateway or some such, it's stamped with a unique ID, a trace ID. The trace ID may be carried along with additional fields, and together this is called a context. This context is then propagated to every service in the call chain using headers or some similar mechanism. The main thing this context allows tracing systems to do is assign causality: in this particular case, the tracing system, Jaeger, would be able to deduce that B was called by A and by nothing else. On the right side we have a Gantt chart view. It represents the same data as on the left of the screen, but here there's a time dimension, so a latency and a start time are added to these spans. Spans might also contain additional information: they might contain logs, they might contain key-value pairs known as tags, and you're limited only by your imagination as to what you can put in there.
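To make context propagation concrete, here is roughly what Jaeger's native propagation header looks like on an outgoing HTTP call (the IDs here are made up; the W3C `traceparent` header is another common format):

```
GET /customer?id=123 HTTP/1.1
Host: customer-service
uber-trace-id: 4bf92f3577b34da6a3ce929d0e0e4736:d75597dee50b0cac:7c3f0f2a8b4e11d0:1
```

The value is `trace-id:span-id:parent-span-id:flags`, and the low bit of the flags field carries the head-based sampling decision, which is how a "not sampled" decision propagates down the rest of the call chain.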
Next we move on to sampling. Jaeger data is very rich: as we saw on this slide, spans are generated for every RPC, and that's a large volume of data. So to control the volume of data, Jaeger performs sampling by default. What is sampling? Sampling is the notion that data will be collected and stored for only some percentage of the trace population. There are several different ways of doing sampling, but the most common and easiest is head-based sampling, where a sampling decision is made for every request and then propagated. Once something is decided to be not sampled, no additional data is collected for it.

We're going to move on to the Jaeger architecture. The way it works is that there are a bunch of client libraries, SDKs available in multiple languages; Jaeger conforms to the OpenTracing standard and supports instrumentation in Go, Java, Python, and a bunch of other languages. Essentially, these client libraries generate spans, the spans go to the trace collection backend, and the visualization frontend retrieves them from the trace collection backend and renders them as traces. That includes the Gantt chart view, the dependency view, and so on. Optionally, Jaeger may be deployed with a data mining platform, which can generate dependency graphs of services. What's interesting about this architecture is that it can be deployed in multiple different ways, with varying trade-offs, and that's something Joe is going to talk about shortly.

One thing I'd like to highlight is the left portion of this architecture, specifically remote sampling. Remote sampling allows the Jaeger collector to control the sampling rate, and this rate is read by the Jaeger clients; it really allows the collector to control the amount of traffic the clients, and therefore your applications, send to it. For instance, with a sampling rate of 100%, the Jaeger client in every service is going to send 100% of spans to the collector; with something like zero, or one in a thousand, the traffic being sent to the collector is greatly reduced. So remote sampling is a really important configuration to know about, because it lets you control the amount of traffic handled by the remainder of your system. Now I'm going to hand over to Joe, who's going to talk a lot more about the Kubernetes operator, how to deploy Jaeger, and what the trade-offs are. Thank you.

Hey everyone, my name is Joe, and today we're going to talk about the Jaeger operator. Quick information about me: I'm a Jaeger maintainer who works at Grafana Labs, primarily on our distributed tracing infrastructure, and I enjoy riding my bike quite a bit; that's a picture from my local park, I ride past that fountain multiple times a week. My Twitter handle is at the bottom; you're welcome to give me a follow, I mainly just tweet about distributed tracing stuff.

So, Jaeger itself, a quick primer; we all know what it is, I hope. It's a distributed tracing backend that allows us to do this: visualize a request as it passes through our infrastructure. As a request passes through all of the many services used to answer it, it can often be difficult to understand where it was spending time. Distributed tracing gives us those answers: it shows us all of the different pieces and where the time was spent, and of course we can attach additional metadata. The operator builds the cluster that we then use to record the data, store it, and visualize it in the Jaeger UI. It operates Jaeger on Kubernetes, which is an extremely useful sentence. It's primarily maintained by Juraci Paixão Kröhling, a Red Hat engineer and long-term Jaeger maintainer, and there are two important links here if you want to get involved with the operator: the GitHub repo itself, and the docs, which are fantastic and which I spent quite a bit of time in while preparing this presentation.

So let's talk about what we're doing today. We're going to talk about how to use the operator to make a cluster, how to configure storage, and some other configuration options. We'll look at how to deploy the agent and the strategies available, remote sampling, which is an awesome feature of Jaeger, autoscaling, and quite a bit more. In fact, this presentation can't cover all of the features of the operator, but at the end we'll at least touch on what it can do for you that we didn't cover. What we're not going to cover is an in-depth look at what Jaeger is, what tracing is, or what operators or Kubernetes are; the expectation is that you understand those pieces, you know what Jaeger is, and maybe you're considering the operator as an option for deployment.
So to get started, first I'd recommend checking out the docs; the link was up there before, and here's a second link. They're going to tell you to run a bunch of `kubectl create -f` commands, so there's a whole lot of canned YAML that creates deployments, services, service accounts, roles, role bindings, all kinds of pieces that you're going to need to deploy the operator itself. A second piece of early advice I'd give is to check the logs. The logs were excellent: if I was running into an issue, or didn't understand why the operator was doing something, the logs were often very clear about exactly what it was choosing to do and why. So if you're having issues, dig into the logs; I thought they were quite good.

A very early configuration point we need to talk about before moving forward is the WATCH_NAMESPACE environment variable. In the bottom left you can see what comes in the canned YAML: by default it uses the Kubernetes downward API to pick up the namespace you deploy the operator to, and the operator will only watch that namespace to build new clusters or inject agent sidecars. Another option is empty: if you set WATCH_NAMESPACE to an empty value, it will watch all of the namespaces in your cluster. For just getting started and playing with the operator, this might be the best one to choose, because it removes this configuration concern entirely; it watches all namespaces, so you can move quickly and immediately get to understanding the operator and what it does, and maybe come back and narrow this down later. The final option is to specify one particular namespace to watch; that works as well.

To make a cluster, we're going to use a CRD. This is a really clever Kubernetes feature, frankly, where you define a custom resource and then treat it as a first-class resource just like all of the other pieces of Kubernetes you're used to: you can use kubectl to create or edit these objects, you can view them in all the same ways, you can query them through the API in all the same ways; it's really fantastic, honestly. In this case we're going to make an object of kind Jaeger; the operator picks that up and goes and makes the actual resources necessary to create the cluster. So we make a very simple object, just like you would make a deployment, and the operator creates the actual pods and all of the infrastructure necessary to realize the object we specified.

Here we have the strategy option, which is kind of the first choice you make when creating a cluster. By default it's allInOne, which is a great place to start; all-in-one is a single-binary deployment. We can also choose production or streaming. If you're getting started with the operator, use all-in-one and get a feel for it: deploy a test application, see the traces in your backend, and then move up the chain and try some of the other deployment options. At its simplest, the whole resource looks like the sketch below.
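A minimal sketch following the operator docs; the resource name is arbitrary, and `allInOne` is already the default, shown here only to make the choice explicit:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
spec:
  # The first big choice: allInOne (default), production, or streaming.
  strategy: allInOne
```

Apply it with `kubectl apply -f` in a namespace the operator is watching, and the operator creates the deployment, services, and UI access for you.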
The allInOne strategy deploys a single binary, and while I have a backend database drawn over here, by default it actually stores all of your traces in memory, so if you roll the pod, your traces disappear. You can, however, deploy an optional backend database. Your application, through the client, talks to the agent; the agent pushes the spans to the single binary, which then stores them in memory.

For the production and streaming strategies, the operator deploys a set of services that you can scale independently, which gives you more control over the Jaeger cluster. Here we have the client talking to the agent, the agent talks to the collector, and if we're using the streaming strategy we also have Kafka and a Jaeger ingester, and then finally the backend database. If you choose production, this green box is not deployed; if you choose streaming, the green box is part of the deployment. Kafka and the backend database are our external infrastructure pieces, and we'll talk about those in a second. Oh, and the Jaeger query service is also deployed, which talks directly to the backend database.

Storage configuration is done through the exact same object we looked at: the CRD where we specified the strategy, production or streaming or allInOne. Here we specify the storage. This is where we'd say we want to use Elasticsearch; Cassandra is also an option, but Elasticsearch is the currently recommended option for a Jaeger backend. So we specify Elasticsearch, and then we have to specify where our Elasticsearch cluster is, basically. Then, if we're using the streaming strategy, we have to configure Kafka on both sides. If you recall the diagram, it's collector goes to Kafka, Kafka goes to ingester, ingester goes to your backend. So for the ingester we have to say: you are a Kafka consumer, and here's where Kafka exists; and for the collector we have to say: you're a Kafka producer, and here's where Kafka exists. For the configuration, we basically have to tell both the ingester and the collector where to find our Kafka queue.

And this is kind of a neat feature: if I ever want to know how to configure Jaeger, I often do a docker run against the component in question, in this case the collector, pass --help, and immediately get a dump of every single option I can provide to it, which helps me understand what the configuration options and my tunables are. All of these are available through the operator by using the pattern you see below. This is really useful if you're moving from an existing deployment and already have a lot of these configured, or if you just know you want some of them; for instance, the queue size is common to change if you want a larger queue, or you want more workers working on your queue. You can change those options very cleanly through the spec. So this again is our Jaeger CRD, our Jaeger object: spec, collector, options, and everything below options is mapped directly to a CLI parameter. So the collector queue size maps to `collector.queue-size`, and this configuration sets the queue size to 100. All of the configuration options you can find in the docs, or by running --help, are available through the operator using this options pattern.
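Putting those pieces together, a hedged sketch of a streaming cluster with storage and collector options configured; the broker and Elasticsearch URLs are placeholders for your own infrastructure, and the field names follow the operator docs at the time of this talk:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: streaming-example
spec:
  strategy: streaming
  collector:
    options:
      # Nested keys become CLI flags: this is --collector.queue-size=100.
      collector:
        queue-size: 100
      # The collector side is the Kafka producer.
      kafka:
        producer:
          topic: jaeger-spans
          brokers: my-cluster-kafka-brokers.kafka:9092
  ingester:
    options:
      # The ingester side is the Kafka consumer.
      kafka:
        consumer:
          topic: jaeger-spans
          brokers: my-cluster-kafka-brokers.kafka:9092
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
```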
In fact, if you pass in garbage options, say a made-up `blerg.blarg: 10`, it just passes that exact parameter through, and you can see the collector saying unknown flag `blerg.blarg`. So it's this future-proof pattern, which I believe Juraci came up with, that lets you specify any parameter you want; all of those options are open to you on your Jaeger cluster.

The agent strategy is another important choice when you're deploying your Jaeger cluster. By default, the operator sets up what's called a sidecar: when a pod is deployed, if it's in a watched namespace and has this metadata annotation, the Jaeger operator injects an agent sidecar next to your application. A pod is a collection of containers; our pod will contain our application, and it might contain a couple of sidecars, which maybe do things like debugging or some kind of metrics offloading. In this case it's a sidecar running the actual Jaeger agent, which consumes the spans and pushes them on to our collector. This is the default strategy, it's a very powerful strategy, and it works fine up to quite large deployment sizes. The one thing you need to be aware of is that your deployment needs this annotation: `sidecar.jaegertracing.io/inject: "true"` tells the operator that I do want the agent sidecar on the pods from this deployment. You can also put it on the namespace, and then all pods deployed to that namespace will have the sidecar injected; both are an option.

A second, more complicated option that I'd recommend experimenting with only if you need to is the DaemonSet option. With the DaemonSet, we have one agent per node, and all of the applications on that node can be configured to send spans to it. It's kind of complicated because there's no direct way for a pod to reference the DaemonSet pod that's on its node, so you actually have to have the agent open a node port, and then have the applications send their spans to that port on the host they're on, at the Jaeger agent port. So it's complicated; I'd only recommend it if you need it, and I'm not really sure what the threshold is, maybe hundreds or thousands of pods per node is where it starts to make sense, I can't say for sure. But in this mode we deploy one agent per node, all of the applications on that node report to it, and it sends the spans on to the collector. And you can see here, in the Jaeger CRD, in that object we keep talking about (it's the same object this whole time, we're just configuring different elements), in this case we say the agent strategy is DaemonSet; like I said, by default it's the sidecar strategy.
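Both choices in YAML form, as a sketch; `my-app` and its image are stand-ins for your own deployment:

```yaml
# Opt one deployment in to sidecar injection
# (the same annotation on a namespace opts in every pod in it).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    "sidecar.jaegertracing.io/inject": "true"
spec:
  selector:
    matchLabels: {app: my-app}
  template:
    metadata:
      labels: {app: my-app}
    spec:
      containers:
        - name: my-app
          image: my-app:latest   # placeholder image
---
# Or switch the agent to one-per-node instead of sidecars.
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: daemonset-example
spec:
  agent:
    strategy: DaemonSet
```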
Remote sampling is another cool feature that I've always loved in Jaeger. Remote sampling lets you remotely control the sampling rates, per operation, of all of your applications. They have to be configured to go fetch the remote sampling configuration and apply it to themselves, but this gives you, as an operator, central control over the rates at which you sample all of your different applications and their different endpoints. If an endpoint were to change and suddenly create, say, 10 times the spans it normally creates and overwhelm your backend, this would be your way to slow it down, basically. Normally you have to create this JSON strategies file and get it to the collector; the operator simplifies that quite a bit. Again in this Jaeger CRD, this Jaeger object, we have a sampling options section where we can immediately specify our remote sampling strategy, per service, even per operation, and that gives us a lot of central control over the tracing load we bring into our backend.

Autoscaling is also important. We have these collector and ingester layers, right: agents, either sidecar or DaemonSet, report to a set of collectors, which then maybe go to Kafka if you use streaming, or maybe not; and then a set of ingesters, which are also a stateless piece of the Jaeger pipeline. These collectors and ingesters can scale. If your load goes up considerably, you might need more collectors to effectively move spans through the collector's queue and into Kafka; same with the ingesters, if your load goes up significantly and your spans are backing up in Kafka, you can use autoscaling to bring up more ingesters. By default the operator creates a Horizontal Pod Autoscaler with a max of 100, which is quite a large installation, and you might find you want to adjust that. So you can say, again in this Jaeger object, spec, collector, maxReplicas, and control the max (again, the default max is 100). Or, if you don't want to bother with autoscaling, you can set autoscale to false and say I just want 10 replicas always, some set number. Both options are up to you as an operator; choose the one that makes sense and meets the needs of your Jaeger cluster.
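A sketch of both knobs in one resource; the sampling fields mirror the collector's JSON strategies file, and `my-noisy-service` is a hypothetical service name:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: sampling-and-scaling
spec:
  sampling:
    options:
      # Same schema as the collector's sampling strategies JSON.
      default_strategy:
        type: probabilistic
        param: 0.1              # sample 10% of traces by default
      service_strategies:
        - service: my-noisy-service
          type: probabilistic
          param: 0.001          # throttle a chatty service to 0.1%
  collector:
    maxReplicas: 10             # cap the HPA below its default max of 100
    # Or skip autoscaling entirely and pin a fixed size:
    # autoscale: false
    # replicas: 10
```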
One piece of the Jaeger operator that I thought was quite cool, and didn't know existed when I first started out, is the number of other operator integrations. For storage, we were saying you need to go make a Kafka queue or an Elasticsearch backend and then configure your cluster to point at them. But if you have the Kafka or Elasticsearch operator installed, the Jaeger operator will recognize that, by noticing that their CRDs are defined, and it will make a Kafka or Elasticsearch cluster for you. So if you have the Kafka and Elasticsearch operators alongside the Jaeger operator, you don't have to provision this backend; the Jaeger operator will do it for you through the other operators. And finally, if you have the Prometheus operator, it will create the appropriate ServiceMonitor objects to monitor your Jaeger cluster. These operator integrations were quite interesting: they simplify quite a bit of the configuration of building a backend, and they let you very quickly spin up your Jaeger cluster, your backend, and your service monitors.

There's quite a bit more. In fact, before I started this project I had never used an operator to run something in a production environment; I'd run a number of operators locally to learn about them and play with them, and this Jaeger operator is one of the first where I really enjoyed it and felt it was giving me value back for the time spent with it. Even if you don't want to deploy with an operator, I'd recommend using at least the Jaeger operator for a while, because it will teach you how to operate Jaeger; it'll teach you all the pieces, and when you look at all the configuration options, you'll walk away with a much better understanding of Jaeger. Other things it can do: Cassandra schema creation; it can work with Elasticsearch to handle your indices; it will run the Spark dependency job so you can get a nice dependency graph in the Jaeger UI; it handles version upgrades of Jaeger; it supports OpenShift; and it has fine-grained support for all kinds of Kubernetes objects.

Finally, some online resources. jaegertracing.io, the docs: the Jaeger docs are always great, please go check them out if you're getting into Jaeger. On the CNCF Slack, check out the Jaeger channel; a lot of us inhabit it and do our best to help people who are having issues. And on medium.com/jaegertracing you can find a lot of blog posts that will give you in-depth, detailed information about Jaeger. Please reach out and connect, please get involved, and hopefully we'll see you in one of these channels and can chat with you there. Take care everyone, I hope you have a great conference, and I will see you when I see you.