Hey, everyone. My name is Joe Elliott, and today we're going to talk about getting started with Jaeger. This presentation is mainly aimed at people who are just getting started, maybe building their first production cluster. Maybe you've already set up a proof of concept and you're wondering what it takes to get it into production. Or maybe you're just interested in tracing in general, want to see if it's worth doing for you, and you want to know what Jaeger looks like and how it's built.

So who am I? Let's start with that real quick. I'm a Jaeger maintainer. I work at Grafana Labs, primarily on our tracing infrastructure and internal projects. And when I'm not wrangling my children or sitting at the keyboard, I like to get around my city a little bit and see it from my bike.

Jaeger is the most popular open source distributed tracing backend. It's a CNCF graduated project, originally created at Uber and open sourced by Uber; I think it began in about 2015.

Before we get into the actual architecture, I want to talk a little bit about what a trace is, because what a trace is and how traces are generated informs how the backend works. Understanding what a trace is helps us understand why the Jaeger ingestion pipeline is built the way it is.

This is a trace from Loki, and this particular trace represents a single request: one request that took 35 milliseconds to be answered by Loki. It passed through a number of different applications and services. You can see these green bars, the Loki gateway at the top, then the query frontend in orange, and then the Loki queriers, these blue ones all over the place. Each span represents a single functional or logical unit inside the application. Each process, and there are lots of them of course, is going to be generating spans with all this data: when they started, when they ended, and some other metadata. They create these spans and offload them into the Jaeger ingestion pipeline. So all of these different applications are creating spans and pushing them into Jaeger. Jaeger then has to absorb all of these, bring them in, put them somewhere, and prepare them for querying. We'll talk a little bit about how that looks and why it's built the way it is.

This is the architecture. With that in mind, we have all these applications, and all of them are going to push their spans from the client into the agent. The client itself, to start there, is part of your application: a library or framework that you bring into your application to create these spans and offload them to Jaeger. There are lots of languages and frameworks supported, popular ones like Go, Java, Node, and Ruby. So all of these popular enterprise languages have support for Jaeger, or Jaeger has support for them, perhaps. You create spans in your application using the client, which offloads them to the agent. Where the agent is positioned is important, and we'll talk about that in a second; there are a lot of options about where you put the agent in relation to your application. The agent then moves the spans over to the collector.
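To make that concrete, here's a minimal sketch of what the client side can look like in Go with the jaeger-client-go library. The service name and operation name are placeholders, and with no reporter configured the client falls back to its defaults, which send spans over UDP to an agent on localhost.

```go
package main

import (
	"log"

	opentracing "github.com/opentracing/opentracing-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
	// Build a tracer for this service. "my-service" is a placeholder name.
	cfg := jaegercfg.Configuration{
		ServiceName: "my-service",
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "const", // sample everything while experimenting
			Param: 1,
		},
		// No Reporter configured: the client uses its defaults and sends
		// spans over UDP to a local agent.
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatalf("could not initialize tracer: %v", err)
	}
	defer closer.Close()
	opentracing.SetGlobalTracer(tracer)

	// One span: a single functional or logical unit of work, like the bars
	// in the Loki trace above. Finishing it hands it to the client's queue,
	// which offloads it to the agent in the background.
	span := tracer.StartSpan("load-user") // placeholder operation name
	span.SetTag("user.id", 42)
	span.Finish()
}
```

Everything after span.Finish(), the agent, the collector, and the storage behind them, is the pipeline we're walking through in the rest of this talk.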
The collector then has to make a choice about, or sorry, doesn't make a choice; you make a choice about how to set up your pipeline. At this point you can either point it directly at a backend database, or you can use the optional Kafka and ingester pieces. We'll talk about why you would choose to use them. So: the client in your application, to the agent layer, to the collector, optionally Kafka and then an ingester, and then your backend database. And then there's the Jaeger query piece, which we won't really talk about much; it just speaks directly to your backend database.

So how am I going to deploy this? I'd say there are three popular options here. Helm: a lot of people like Helm, and it's a great way to deploy things to Kubernetes. The operator, mainly maintained by JP I believe, is an excellent choice for deploying Jaeger. And then there's the custom option, however you all want to do it; maybe you just have custom YAML sitting in Git somewhere. We use Jsonnet, actually, at Grafana to deploy Jaeger.

I'll say that Jaeger itself is not very difficult to operate. Your operational complexity is going to come from your backend, the database you choose to put your spans in. Jaeger itself is mainly stateless; it has queues and it keeps things in memory, but it's easy to operate, easy to monitor, and easy to manage. It's the backend that's going to give you trouble, or maybe not give you trouble, but that is the source of complexity in operating Jaeger. The operator is a cool choice because it supports different agent modes, like sidecar injection or a DaemonSet sitting on your hosts, and we'll talk about why you would choose one or the other in a second. And of course Helm is a great option too; we used Helm quite a bit at my previous position, and it's a great way to deploy applications as well.

One of the very first things you're going to pick when you start building your first production Jaeger cluster is the backend database to use: Elasticsearch or Cassandra. These are the production-supported Jaeger backends. There are others, but these are the two I would recommend. Elasticsearch is the choice recommended by the Jaeger team; it has somewhat better search flexibility and capabilities than Cassandra does, and the load, or the size of the cluster, will be roughly the same between Elasticsearch and Cassandra. Cassandra is not bad. In fact, we use Scylla, which is a Cassandra alternative, at Grafana, and it works fine. There's maybe slightly less flexibility in searching, but we use it and we have no problems. So if you have a ton of operational experience around Cassandra and you don't want to take on the risk of building out or understanding a new backend, then Cassandra is 100% a good option; we use it at Grafana, and I'd recommend it if you have far more experience operating it than Elasticsearch.

The other choice you have to make is Kafka or not. We talked about this optional Kafka piece: you have the client, the agent, the collector, and then this optional Kafka and ingester part. The Kafka piece is useful in some ways. You get a durable queue, and for a lot of people Kafka is actually very easy to run, especially compared to Elasticsearch or Cassandra; at least for us it is.
And it provides some more flexibility in your pipeline. If your backend stutters for a little bit, Kafka very nicely queues things up, and once your backend is back online it can continue pushing spans in. There are also some neat integrations with Kafka, with Kafka streaming, to do analytics and other kinds of data processing on the spans as they come through. Kafka also gives your org, or your group, or your team the flexibility of writing your own applications that watch the Kafka stream, maybe generate some metrics, or provide some kind of information about your infrastructure that's difficult to find otherwise. And if you do make something cool like that, please share it with the community. Jaeger is open source; we love contributions and involvement, and if you build something neat, we'd love to hear about it and see it.

A final choice here is the positioning of the agent, which we've alluded to a few times in this discussion. The common wisdom is to put the agent as close as you can to the application. The operator does this nicely for you: it can do sidecar injection, where you have your application and, as a sidecar, you run the agent. It sits there in the exact same network namespace, and every application has its own agent right by its side. This is the absolute fastest way to offload spans from your application. The game that's being played here is that your application is generating spans extremely quickly, filling up a configurable queue in the client, and there are workers draining that queue as fast as possible by sending spans over UDP to the agent. The closer your agent is to your client, the fewer dropped spans you're going to get. So for the most high-volume applications, a sidecar is as good as you can get: a single agent as close as possible to your application, to get those spans off as quickly as possible and drop as few as possible.

Other popular options are to run it as a DaemonSet, or if you're not in Kubernetes, just on every host, and then have all the applications on that host speak to their local agent. At Grafana, we just run the agents as a Deployment with a Service in front of it, like any normal Kubernetes application. We do have dropped spans at Grafana, and I sometimes wonder if we should move to the sidecar or the DaemonSet approach; we haven't yet, and sometimes I think it might be a good choice for us. So I will say: if you want to run it as a service, this simple option where you just have a collection of agents that all the applications in your cluster point at, it works, but you're going to have a few more dropped spans than normal. For your highest-volume applications, use a sidecar, or maybe one of these other approaches, to get your agent very close.
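As a rough sketch of the knobs in that queue game, here's what the client-side reporter settings look like in jaeger-client-go. The agent addresses, queue size, and flush interval are illustrative placeholders, not recommendations.

```go
package tracing

import (
	"time"

	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// reporterConfig sketches the client-side settings behind the queue game
// described above. All values here are illustrative placeholders.
func reporterConfig(sidecar bool) *jaegercfg.ReporterConfig {
	// With a sidecar the agent shares the pod, so the UDP hop is to localhost.
	// With agents behind a normal Service (how we run it at Grafana), spans
	// cross the network instead, which is where some extra drops come from.
	agent := "jaeger-agent.tracing.svc:6831" // hypothetical Service address
	if sidecar {
		agent = "localhost:6831"
	}
	return &jaegercfg.ReporterConfig{
		LocalAgentHostPort: agent,
		// Finished spans wait in this in-memory queue while workers drain it
		// to the agent over UDP; if it fills up, new spans are dropped.
		QueueSize: 1000,
		// Flush a partially filled batch after this long.
		BufferFlushInterval: time.Second,
	}
}
```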
Sampling. Understanding sampling is a very important part of this. Sampling allows you to record only a certain amount of your total traces into your backend. Most of us cannot afford to record 100% of every request in our infrastructure; it can quickly get out of control. Development teams are going to start adding spans, hopefully your volume is going up, somebody throws in a new service, which adds more spans, of course. All of these changes keep making your span count grow, and you're going to find that it's very difficult, or at least costly, to do 100% sampling. Your backend, Cassandra or Elasticsearch, is going to be very large once you're hitting a large number of spans per second, and 100,000 spans per second is an enormous number.

So our options for downsampling from 100% are probabilistic or rate limiting; those are the common options. Probabilistic says: of every trace initiated by this service, I want to actually save 10%. Rate limiting says: I want to save 5 traces per second initiated by this service. So you're not sampling 100%, you're sampling this lower amount, and it allows your backend to survive the storm of spans that are coming in.

The next part I cannot emphasize enough. There are these different sampling options, like probabilistic and rate limiting, which are fun to work with, but for a production instance shared by multiple teams, I think there is no substitute for remote sampling. I'm just going to tell you to set up remote sampling. Remote sampling allows you to centrally control all of the different services and their sampling rates. Say you're the Jaeger operator sitting there one day, and all of a sudden your spans per second go from 5,000 to 7,000, and you don't know why, your backend is struggling, you can't figure out what's going on, and you don't have control over the applications that are creating spans. Remote sampling gives you that control. It allows you to maintain this document, and you can see it over here, that says for each service what percentage of traces we want to actually save from that service. You can even do it per operation. Maybe you have endpoints you don't care about and never want to sample; you can set those to 0%. Endpoints you want every single trace saved for, you can set to 100%, and of course anything in between. Remote sampling gives you this extreme flexibility of sampling that you're going to need as an operator. Some development team somewhere is going to start slamming your backend with a ton of traces, or a ton of spans, or both, and unless you can control that as the operator, you're just going to get overwhelmed, and you're going to have to scramble around and talk to these development teams, and you're kind of at their mercy. Remote sampling gives you the control you need. I absolutely, 100% recommend remote sampling.
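As a sketch of what this looks like from the client's point of view, here's a jaeger-client-go sampler configured for remote sampling. The service name, URL, and fallback rate below are placeholders; the centrally managed strategies themselves typically live in a JSON strategies document that the collector serves out.

```go
package main

import (
	"log"

	jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
	cfg := jaegercfg.Configuration{
		ServiceName: "checkout", // hypothetical service name
		Sampler: &jaegercfg.SamplerConfig{
			// "remote" means the client periodically asks the agent which
			// strategy to use, instead of hard-coding a rate in the app.
			Type: "remote",
			// Fallback probability used until the first strategy is fetched.
			Param: 0.001,
			// The agent's HTTP sampling endpoint (port 5778 by default).
			SamplingServerURL: "http://localhost:5778/sampling",
		},
	}
	_, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatalf("could not initialize tracer: %v", err)
	}
	defer closer.Close()
}
```

The nice part is that nothing in the application changes when you, as the operator, adjust a service's percentage in the central strategies document; the client just picks up the new rate on its next poll.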
Monitoring. Monitoring Jaeger is actually pretty straightforward. It has really good logging, standard-out logging like everything else, and JSON logging if that's your thing. It also has Prometheus metrics, really the open metrics format, so there are lots of different applications, Prometheus itself of course, that will scrape those, store them, and let you query them. And the metrics are very descriptive and very easy to read. In fact, a personal thing I do when I start up any application is curl the metrics endpoint and just start reading it, because the metrics endpoint on a well-designed application will give you good information about the internals of that application, and Jaeger absolutely falls into that category. So just reading the metrics is going to tell you a lot about what's going on inside.

And finally, everything is a queue. The client in your application has a queue, the agent has a queue, the collector has a queue, and Kafka is a queue if you run it; the ingester is the only piece that doesn't really have one. These queues can fill up and start dropping spans, and that is the number one question your developers are going to come to you with: where are my spans? Why is my trace broken? Where did all my spans go? It's going to happen over and over again. So understanding how to diagnose this ingestion pipeline, all of these different queues, is critical for having a healthy Jaeger instance that your developers rely on and trust to store their traces, and trust to go find their traces when they need them. There's a Medium article I recommend called "Where Did All My Spans Go?" that walks you through every single piece and all the metrics to watch, and tells you exactly how to track down where your dropped spans are. There are also a couple of other links there; the monitoring and troubleshooting pages in the docs are also great for figuring this out.

Finally, a couple of resource links. There are the docs. The docs are really well done; please go read them if you're interested in Jaeger, and please submit PRs to improve them if you hit problems or there's something you want to clarify. The Gitter channel is haunted by myself as well as some of the other maintainers, and it's a great place to get answers to getting-started questions, a great place to move past some initial hurdles and understand what's going on or how to get the basics working, or just to chat generally about Jaeger, of course. And then the Jaeger Tracing blog on Medium has a lot of great articles from all of the maintainers about all kinds of different ways to use Jaeger, new and clever ways to work with it, and how to diagnose issues; it's absolutely a great place to go if you find yourself working with Jaeger a lot.

Thank you all. I hope everyone has a great KubeCon, and I will see you when I see you.