Thank you everyone for coming in person. It's cool to see everyone together; a lot of the talks in the Observability track have been virtual, unfortunately. I hope that changes, but I'm glad to get everyone together, showing up and meeting and talking. Today, the title is actually a little misleading: this talk is going to be about Jaeger. We're going to dig into the Jaeger Operator, which is a really cool piece of technology that's part of the project, and then we're going to talk about some of the metrics work that's been going on in Jaeger. For the first part of the talk, about the operator, Joe Elliott actually put the slides together, but he was unable to make it at the last minute. He's one of the maintainers as well; he's over at Grafana Labs, and he's the person responsible for Tempo, for those of you that are maybe using that. He's a great engineer; sorry he's not with us today. My name is Jonah Kowall. I'm the CTO at Logz.io. I do a lot of work on Jaeger, OpenTelemetry, and OpenSearch, which is the new open source fork of Elasticsearch, and I'm happy to be here. When I'm not working at the computer, or maybe playing a little video games, I spend a lot of time underwater. That's been nice during COVID because you're very isolated that way, so I do a lot of diving. That's been my passion for the last several years, and I explore a lot of cool places where I live in South Florida and all over the world. So if you want to talk diving, that's definitely another thing besides observability that I like to talk about. My company is a SaaS observability company; there are many out there, and I can see many of you representing other ones. We focus on an open-source-based platform, and Jaeger is part of our platform, so it's the same UI, the same usability, with a bunch of other stuff around it. And Jaeger is a great CNCF project.
It's been graduated for quite some time. For those of you that are not familiar with Jaeger, and are maybe using OpenTelemetry or looking at it: the collector component of Jaeger was actually forked to create the OpenTelemetry Collector, and we're starting to bring some of that back into Jaeger. We're essentially going to just be consuming the upstream collector; that's the plan. Jaeger is basically a storage format, a UI, and a bunch of other pieces around it in order to scale it out. In general we build things in a very different way: the pieces are very componentized, and it's designed to run on Kubernetes. This is something that Uber created many years ago, open sourced, and donated to the CNCF. Yuri, the gentleman who created Jaeger at Uber, has a great book on distributed tracing. Many of you have probably read it; it's very helpful, and even though the examples use OpenTracing, they're still very similar to OpenTelemetry. For those that are not familiar, I'll just give you a really quick rundown of what you're looking at here. This is a trace, and in the trace there's a bunch of different transactions occurring. We can look at the different spans, which are each of these little segments, and how much time they're taking. Then we can look at the end-to-end trace: how much time is it taking, are there errors, are there other things going on? And you can drill down into these; there are tags added that show you different information. In the example here, for the MySQL call, you can see the database query, how long it took, and whether there were issues. So it's a really useful tool, and it can take data in a lot of different ways and allow you to analyze the trace data. It's very focused on debugging, so when you compare it to APM tools, this is very much a tool that people use to debug problems or understand what's happening, not really monitoring as much. I'll talk about that in the second half of the presentation, when we talk about metrics, because that's really what's going to make Jaeger more of a monitoring tool and less of just a debugging tool like it is today. So the operator is a really cool piece of technology that has been evolving a lot over the last year. JP, or Juraci, depending on how well you know him, is the gentleman responsible for this. He was previously an engineer at Red Hat, and he recently joined Grafana Labs. He did a lot of work on the operator; it does some really cool things, and I'm going to show you how you can use it on your laptop to set up Jaeger and start using it, and even how you can use the operator to scale out and operate in a large distributed Kubernetes environment. The operator is designed to do all of those things. Here's a link to the docs, which are also really good and comprehensive and explain everything in depth; a lot of work went into this. And thanks to the folks at Red Hat, who are big into the operator model and basically want everything to be deployed this way. I think it's the right type of model to use with Kubernetes. You've heard about it a lot this week, and I think it's really important for the ecosystem to adopt this way of deploying and running things. I'm definitely a fan. So we're going to go through a bunch of different pieces.
How do I basically do deployments, configuration, and scale-out? We'll even talk about agents, sampling, and autoscaling. Keep in mind that for some of the things we're going over here, specifically the instrumentation libraries, OpenTelemetry is going to be the standard. Jaeger had its own libraries before, but OpenTelemetry is basically going to replace all of that; that's something currently being discussed. A lot of people do use the Jaeger libraries in their applications, and they're still supported; we're just using OTel going forward. It's similar for people that maybe used OpenTracing before: OpenTracing is being deprecated, there's nothing new going into it, and all the work is in OTel. But that doesn't mean the OTel collector will not accept OpenTracing data; it does, and it will continue to do so. This is the challenge with instrumentation: as new things come along, whether we decide to replace them or not, we have to figure out what to do with the data in those old formats. It's tricky. So anyway, to the operator. Basically, to get up and running: `kubectl create`. This is essentially how you get a Jaeger system up and running. A couple of things to keep in mind: the logs are very useful when you're getting this up and running, and also the operator will watch specific namespaces to deploy new instances. If you leave it blank, it will deploy on every namespace, or you can specify certain namespaces as well. So it's up to you, depending on how you're deploying this, whether you're testing it in your test environment or on your laptop.
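As a rough sketch of what that namespace watching looks like: it's an environment variable on the operator's own Deployment. The image tag and field values here are illustrative, not necessarily current:

```yaml
# Fragment of the jaeger-operator Deployment. WATCH_NAMESPACE controls
# which namespaces the operator watches for Jaeger custom resources.
spec:
  template:
    spec:
      containers:
        - name: jaeger-operator
          image: jaegertracing/jaeger-operator:1.29.1  # example tag
          env:
            - name: WATCH_NAMESPACE
              value: ""   # empty: watch all namespaces; or "ns1,ns2" to restrict
```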
Just leave it blank and it'll deploy everywhere, essentially. So in the CRD you essentially define what you want to do with Jaeger, and there are three strategies defined in the operator: all-in-one, production, and streaming. I'm going to explain the difference between them. You specify the strategy in your resource definition, and that tells the operator what to deploy, configure, and set up, and how to scale it. So the really basic way of deploying Jaeger on your desktop or laptop is the single binary. The single binary is the centerpiece, the Jaeger all-in-one binary, and the other pieces are around it: it needs a data store, a place to store information, and then data comes from your application, instrumented with the client, and goes to a Jaeger agent. The Jaeger agent is sometimes optional, depending on how you're setting things up. The single Jaeger binary includes everything you need for the user interface, for the data ingestion, for the storage, and then a database; I'll explain what the database options are. This is the really simple deployment, the all-in-one version, and you'll use this on your laptop if you want to test it out and start playing around. The second one is production, which includes everything except for the green box which says streaming, and I'll explain streaming after. What this means is that now we're starting to deploy a collector, and the collector allows us to scale out the way the data ingestion comes in. You can think of this as a way to essentially support more Jaeger agents, more infrastructure that is sending trace data. We're going to talk more about the back-end database and about Kafka and how those work in production mode, because there are options for those. Jaeger query is also a service that helps scale out the UI.
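The all-in-one strategy really is just a tiny custom resource. A minimal sketch, where the name `simplest` is just an example:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest        # arbitrary example name
spec:
  strategy: allInOne    # one binary: UI, ingestion, and in-memory storage
```

Apply it with `kubectl apply` and the operator creates the deployment, service, and routes for you.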
So when you're running in production, it's a microservice that's also deployed and managed. The operator does the work for all of this stuff, basically: to deploy it, to scale it, and to manage it. It's very powerful, and I'll show you some examples of how that works. Today, when you decide what you're going to run Jaeger on, there are two options in the operator: one is Elasticsearch, and the other is Cassandra. The operator will handle deploying those and being able to scale them out, so it's actually quite comprehensive. You can also point it at another cluster: if, let's say, your organization has a big Elasticsearch cluster, you can just point it at a URL instead of deploying one, but the operator can deploy it as well. We do support OpenSearch, which is the Apache 2 version of Elasticsearch. It's not in the operator yet; it would be trivial to add. We'll see; right now Red Hat hasn't decided whether they want to switch to OpenSearch, just because they do keep things Apache 2 in their code base, so we'll see what happens with that. The other piece is deploying Kafka.
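A production-strategy resource pointing at an existing Elasticsearch cluster might look roughly like this; the URL is a placeholder:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
spec:
  strategy: production      # separate collector and query deployments
  storage:
    type: elasticsearch
    options:
      es:
        # Point at an existing cluster instead of having the operator deploy one
        server-urls: http://elasticsearch.example.svc:9200
```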
Kafka gives you the ability to deal with back pressure on the database. Anyone that's running big Elasticsearch clusters, you're already running Kafka, because there's essentially no other way to scale it well. So in the streaming strategy, or configuration, the operator will deploy Kafka. You can use various topics as well if you want to separate and manage the data effectively, and this allows you to deal with bursts of data coming in that maybe your back end can't handle; Kafka will queue them. The other thing Kafka can be used for is stream analysis, and there have been some nice little experimental projects, in one of our repos in the Jaeger org, from people that have built Kafka Streams and Spark Streaming types of analytics on the trace data. None of it's really well baked, but some of it came from Uber and other companies that do that type of analysis. We built something similar in Kafka Streams that does some of our streaming analysis. We haven't open sourced all of it, but we're going to do that soon; it's definitely in the plan. So Kafka is something you're going to want to deal with if you're running a bigger cluster, or if you have other use cases there, and the operator will handle it. To give you an idea of how this works, the nice thing about Jaeger is that each of the components, because it's all in Go, is configured with command-line arguments. The operator is basically passing in the right command-line arguments for what you're trying to deploy. Here are some examples; we can specify a lot of this on the command line. This is basically an example of us deploying: the operator essentially does all the configuration and passes the command-line arguments in for you, so that you don't have to deal with all of this yourself, which is what happens when you run it on your own. So it definitely gives you a lot of options and makes it easily scalable in a Kubernetes environment. The other thing to note is that
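A streaming-strategy sketch, with Kafka between the collector and an ingester; the topic name and broker address are placeholders:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-streaming
spec:
  strategy: streaming           # collector writes to Kafka; ingester reads from it
  collector:
    options:
      kafka:
        producer:
          topic: jaeger-spans                              # example topic
          brokers: my-cluster-kafka-bootstrap.kafka:9092   # placeholder broker
  ingester:
    options:
      kafka:
        consumer:
          topic: jaeger-spans
          brokers: my-cluster-kafka-bootstrap.kafka:9092
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch.example.svc:9200
```

The point is that bursts land in the topic and the ingester drains them into storage at whatever rate the back end can sustain.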
if you decide to pass in other command-line parameters, the operator passes them through too. In this case the example is that blurg and blarg are basically being specified in the CRD, and you'll notice down at the bottom, when the command line is run: "unknown flag blurg.blarg". It actually just passes all of those parameters in. So if you feel like passing a different parameter to one of the Jaeger components, the operator will essentially just run it on the command line. It gives you a lot of flexibility to do custom things on top of what the operator already supports. The CRD just makes it really easy to manage the config and roll it out consistently; it's definitely pretty powerful in terms of what you can do with it. The other thing to think about is the agents. In Jaeger, the agent is essentially something that the instrumentation talks to. In the OpenTelemetry world, it's essentially a collector: in Jaeger we actually have two different things, an agent and a collector, while in OpenTelemetry the collector can be an agent or it can be a collector. That was part of the design of OpenTelemetry: to have that same binary be flexible for both use cases. So when you decide to deploy your agent as a sidecar in the operator, what this means is that for every pod that's running an application, an agent is going to go there, and there's going to be a centralized collector that the agents all talk to. This is one option that people use, and I'll show you the other options for agent deployment that the operator supports. This way is kind of better because it's more segmented, with each of the applications that you're deploying in Kubernetes. The second way to do it is with a DaemonSet.
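Both the flag pass-through and the sidecar agent are expressed in the same CRD. A rough sketch, where `blurg.blarg` is the made-up example flag from the slide:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-custom
spec:
  agent:
    strategy: Sidecar          # inject a jaeger-agent container into app pods
  collector:
    options:
      blurg.blarg: some-value  # unknown flags are passed straight through
      log-level: debug         # a real flag the Jaeger components understand
```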
This is another option, where you have a single agent for multiple applications. Now, the challenge here is that you can overload the agent, and you may have different teams deploying applications that all talk to a single agent, so the config, and just managing it, may be difficult depending on your configuration and how your teams work. And then those get sent on to a collector. So these are basically the options for agent deployment strategy that are in the operator. The other interesting thing that Jaeger has, which is almost in OpenTelemetry (I actually saw some commits go in this morning), is this idea of remote sampling. Sampling is a big area of discussion in tracing and in OpenTelemetry, and remote sampling is actually pretty unique; it's something they built at Uber for their specific use cases around the volume of data they were dealing with. Today, most of the sampling happens at the collector: the traces come in, and the collector decides whether it's going to send them on to your back end. The challenge is that all of that data is still going over your network, and if your collector isn't really close to your infrastructure, that can cause contention, or cost if you're running in the cloud. With remote sampling, the instrumentation library is actually able to read a file, or get a signal, to change its sampling strategy in real time. And because that's happening in the instrumentation, you can control whether the instrumentation even sends the trace data at all. There are a lot of really great use cases for this; I'll give you a good hypothetical. Let's say that, in general, I want to get one percent of my trace data, so I'm sampling. But when there's a problem, I want to get 50% of my trace data so I can debug better. With remote sampling, I can send a signal that says "change your sampling strategy," and then I can get more data temporarily. So this feature is really powerful.
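The DaemonSet variant is again just a small change in the custom resource; a sketch:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-daemonset
spec:
  agent:
    strategy: DaemonSet   # one jaeger-agent per node, shared by all pods on it
```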
It's almost in OpenTelemetry; the challenge is that the libraries have to implement it too. So we have to go into the Go instrumentation, for example, and support remote sampling, not just the collector, so there's still work to do here. The other thing about OpenTelemetry is that it does tail-based sampling, while the Jaeger agent only does head-based sampling, so OTel is more sophisticated in terms of sampling strategies. Here are some examples of sampling strategies, basically using probabilistic sampling, which is the basic type of sampler. You can specify different types of sampling for each service, depending on what you want to capture and what you want to discard, and you can put this all in the CRD; it gets passed in to the rest of the Jaeger deployment. So it's really powerful in terms of configuring everything together as a CRD, which has a lot of benefits. The other thing the operator supports is autoscaling, scaling the collectors and ingesters, which is really useful as well: how many collectors do I want to potentially scale to? You can also extend this on your own and maybe use a metric for autoscaling; I'll talk about the metrics Jaeger supports, because it exposes a lot of data that Prometheus can scrape, and you can use that as part of your autoscaling as well. So there are a lot of options here in the operator to do the autoscaling for you, defined once again in the same CRD, which gives you one nice place to keep all of your config as code. Really handy. I mentioned the storage options and the Kafka support, and monitoring Jaeger itself with Prometheus is also easy.
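Sampling strategies and autoscaling bounds sit in that same resource. A sketch of both together; the service name and numbers are invented for illustration:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-sampled
spec:
  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.01              # keep ~1% of traces by default
      service_strategies:
        - service: checkout      # example service name
          type: probabilistic
          param: 0.5             # keep 50% for this one service
  collector:
    autoscale: true              # operator manages a HorizontalPodAutoscaler
    maxReplicas: 5               # upper bound for collector replicas
```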
We expose a lot of metrics that you can easily scrape, so a lot of folks are doing that, and it helps operationalize Jaeger for sure. There are also a bunch of other things the operator can do that I'm not going to dig into, because they're even more advanced use cases, but here are a few examples: for Cassandra, it can create schemas, and there's a lot of index management for Elasticsearch, where Jaeger will actually do the whole lifecycle management of indices for you. Then there are all kinds of other things it can do, and of course OpenShift support; that's the main reason Red Hat contributed all the work to the operator, because they're big into that. There's a lot you can do with this operator, so it's worth checking out. It's worth using even if you're just starting to play with Jaeger on your machine; it's definitely a good way to do that, and it helps you see the power of operators, because I think that's the future, and if you listened to the keynotes, that's clearly where the foundation is taking things. The other pieces: the logs are very useful, and as I said, Jaeger exposes metrics you can scrape with Prometheus or anything else that's OpenMetrics compatible, and then you can also get traces off of Jaeger itself, of course. So, I did want to spend a couple of minutes talking about metrics, because this is really how we move Jaeger from being a debugging tool to a monitoring tool: introducing and integrating metrics. Most of you have used an APM tool, or other products that call themselves that, and it's because they use traces and metrics together. That allows you to do things like monitoring, alerting, capacity planning, and other types of planning: when I deploy something new, how is it going to trend? What's going to happen in my environment?
This is really the difference between distributed tracing and monitoring tools like APM tools. So, over the last several months, we built a new capability we call ATM, or Aggregated Trace Metrics, and there are a couple of pieces to it. One is in OpenTelemetry: there's a span metrics processor. In this processor, when traces come in (in the little diagram over to the right, traces come in), we're able to look at those traces and derive metrics from them, which you then send to any type of metrics back end. Whether you're using Prometheus or a commercial tool, anything in OpenTelemetry can take the metrics from that processor, and the traces continue on to your tracing back end. The most common use case, if I want to use all open source, is to send my metrics to Prometheus and my traces to Jaeger, as an example. This allows us to create all kinds of metrics off of the trace data, and I'll give you some examples and show you what that does. So, in the configuration for OpenTelemetry, this is an example config: we basically specify the histogram buckets we want to create inside of Prometheus, and we can define dimensions; there are a bunch of examples in the code base. The challenge is that it can create a lot of metrics depending on the cardinality you define. Most commonly we do status-code based, so OKs versus errors versus redirects and other things like that, and then you have all the histogram buckets in Prometheus to work with. Down at the bottom is an example of how you would use that together, using Jaeger and Prometheus: basically a simple config of the pipeline, how that's all set up in OpenTelemetry, and the code on GitHub is down at the bottom as well.
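A rough sketch of that kind of OpenTelemetry Collector config, from memory of the span metrics processor's options; bucket values, endpoints, and the dimension name are illustrative, and the exact pipeline wiring has varied between collector versions:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  spanmetrics:
    metrics_exporter: prometheus            # where the derived metrics go
    latency_histogram_buckets: [2ms, 10ms, 50ms, 250ms, 1s, 5s]
    dimensions:                             # extra labels; mind the cardinality
      - name: http.status_code

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"                # scraped by Prometheus
  jaeger:
    endpoint: jaeger-collector.example:14250  # traces continue on to Jaeger
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]     # a metrics pipeline must exist for the exporter
      exporters: [prometheus]
```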
It's part of the collector. So once I have this data, I can do all kinds of cool things. Here's a screenshot from Grafana showing the histogram buckets and the performance of all my traces, and we're basically calculating the requests per second, the request time or latency, and the errors. It allows you to do a lot of different monitoring use cases and gives you real visibility. We also added a view into Jaeger to query these metrics, and I'll show you that in a second; it supports anything that's PromQL compatible. So whatever supports PromQL, you can point Jaeger at it, and it will query those metrics and bring them into the UI; I'll show you how that looks. The community can add other things, but we use M3DB in my organization, and Prometheus of course, and it'll work fine with Cortex or Cortex-based systems, and of course Thanos, TimescaleDB, whatever else is PromQL compatible. If you listened to the talk yesterday about compatibility, it does matter whether things support PromQL the right way; naturally, it does have to support the right type of query language for this to work. But if someone wants to contribute the ability for this to work with another metrics system, that's fine; we'll definitely take any PRs in the project. And so we have a new monitor tab in the UI (this is actually a pending PR right now in Jaeger UI), and it allows you to visualize this data: latency, response time, error rates, the trending of those, and then some views. This is kind of the first step,
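To give a feel for what queries like that look like, here are PromQL sketches against span-derived metrics. The metric and label names (`calls_total`, `latency_bucket`, `service_name`, `status_code`) are what I recall the span metrics processor emitting, so treat them as assumptions:

```promql
# Requests per second, per service
sum(rate(calls_total[5m])) by (service_name)

# Error rate: share of calls whose span status was an error
sum(rate(calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (service_name)
  / sum(rate(calls_total[5m])) by (service_name)

# 95th percentile latency from the histogram buckets
histogram_quantile(0.95,
  sum(rate(latency_bucket[5m])) by (le, service_name))
```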
I think, for Jaeger, it would obviously be nice if we built Alertmanager visibility into Jaeger and really created more monitoring capabilities, to move Jaeger from just a distributed tracing system into more of an operational monitoring tool as well as a debugging tool. So that's kind of what we put together. Definitely check out the docs, come visit us in Slack, and we're always posting blogs; if you have blogs or cool ideas you want to post on our Medium, just let us know, because we're always looking for new content and ideas. And I think that's about it. I'm open for questions here; I'll repeat them for those of you listening online, and I'll also bring up the Slack and the online questions so that we can make sure to include everyone. So if there are any questions, please raise your hand. Sure. So the question was: I talked about the scalability of Jaeger; what about the scalability of the ATM component that's inside OpenTelemetry? One of the challenges with anything doing sampling or calculation is that it's tricky, because you're going to be somewhat limited based on where the traces are going. Let's say you have 10 different collectors that are calculating metrics; you're going to have 10 different sets of metrics in Prometheus. But you can roll those up: let's say I put the collector name in the metric labels; I can then query without the collector name and get the aggregate of that data. That's what's nice about Prometheus: you can aggregate data up a level from where you are. And if you decide that maybe you don't want a microservice-level view, and you want to move up a level and look at, say, a pod-level view or some other Kubernetes construct, you can use all different kinds of things in the metric labels.
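That roll-up idea can be sketched in PromQL. If each collector stamps its own name as a label (here `collector`, a hypothetical label name), you simply aggregate it away:

```promql
# Per-collector view
sum(rate(calls_total[5m])) by (collector, service_name)

# Aggregate across all collectors by dropping the label
sum without (collector) (rate(calls_total[5m]))
```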
So it gives you a lot of flexibility. In terms of scalability, we haven't really published numbers on it, but we put a lot of metrics through it, to the point where it got prohibitively expensive and we actually tuned it down. So it can definitely create a lot of metrics; I'm not really worried about that. Hopefully that answers it. I'm going to take one online, which is: can Jaeger work in a multi-cluster environment? So yeah, definitely. Jaeger is a microservices architecture; it's scalable, and there are no real issues around that. If you have different storage back ends, though, Jaeger doesn't work across separate clusters of Elasticsearch or Cassandra, for example, so you should scale those out, which presents its own set of challenges, because they're not the easiest things to scale out. It does work well, but it only supports basically one back end, because of the way the project really works today. Over there? Cool. So I'm going to try to rephrase the question and hopefully still get it: can you combine remote sampling with other types of sampling? Is that kind of what you were getting at? An attribute? Sorry. Yeah. So you can build sampling, and, sorry for the people online,
I'm trying to to rephrase it, but uh You can you can combine different things in the sampling strategy so you can use tags you can use any type of part of the Transaction itself to create a sampling strategy and do sampling and filtering the challenge is with remote sampling is that We haven't implemented all the way in hotel But it will go that way because there are customers or sorry They're not customers there are users using the jager libraries that support that and they want that feature Including companies like uber that created it they they need it And they want to go to hotel so the only path they have forward is is for that to happen So it will be supported you will be able to combine them together But it might give you weird results depending on what your goals are Combining remote sampling and centralized sampling There was a a working group an hotel on sampling. It's Not that active anymore, but it's still an interesting topic for sure Let me take one for online Uh So, uh a talk from the speaker In the next session bartek asks, uh, I wonder does permissi does prometheus aggregator for trace data In hotel also populate exemplars. That's a really good question. The answer is no Uh, I don't believe that there's good exemplar support and open telemetry yet or maybe I just haven't seen it But if uh, if that's there, uh, that would definitely be an option for enhancement And exemplars for those of you that are not familiar a really powerful prometheus feature or open metrics feature. Sorry That allows you to specify an example trace or log with a metric And that allows you to move from a metric to an example of something that matches the metric So it gives you a lot of context when you're troubleshooting to use exemplars A really powerful thing for sure Um, I'll take another one for the live audience. I'm trying to go back and forth. 
Sure. So the question is: does the operator work with service meshes like Istio, and in that case, do you not want to use the agent? I've seen users use the built-in instrumentation from Istio. The challenge is that unless you instrument the code, the traces are not much better than logs, to put it that way. The issue is that the data being generated from service meshes is not going to create an end-to-end transaction; you're just going to see what the service mesh is doing. So you do need to instrument your code, and there is auto-instrumentation for many languages if you don't want to make code changes, which is a good option. But you do have to instrument your code, and therefore you need a collector and an agent in the case of Jaeger. So there's no way around it if you want meaningful traces like the example I showed you. I've had end users at my company that have done that, and they say, "why is the trace data so boring? It's the same as a log message," and I said, well, you've got to instrument the code, or we're just going to see what the service mesh is doing. There's no shortcut, unfortunately, with service meshes and proxies and that kind of thing. Yeah, sure, I'll take one online; I'm just kind of picking the top-voted one. Oh, that was the top-voted one, so the next one down: what's the expected memory or CPU footprint of running agents as sidecars? The CPU is going to depend on the throughput of what you're sending through the agent, similar to the collector, but the footprint on the memory side should be very small. It's a Go binary; it doesn't do much.
It's actually not a super complex piece of technology at all. Similarly, if you decide to use OpenTelemetry in that way, you can run a stripped-down collector that doesn't have all the exporters and other things in it. There's a custom builder for OpenTelemetry that lets you build your own distribution, essentially; it's really simple to use, and then you can create a really small collector. It shouldn't use much in terms of resources beyond what you're sending through it, which is obviously the processing power that's really required there. Another question live, anyone? I know I'm kind of running out of time, but all right, you can go again. Yeah, sure. So the question was about running across multiple regions with multiple Elasticsearch clusters, the thing I said it doesn't support: we don't really query across them. You would have a single Jaeger UI instance for each region, and you can't really search across them. So unfortunately the answer today is that we don't support it. It could definitely be built, but it would be pretty hard to federate those queries and make the UI work the right way. There is work going on in OpenSearch to do multi-cluster query; it was actually just released recently in OpenSearch, so you'll have some capabilities there, but it's not something that's supported in the Jaeger UI. Someone could contribute it, but I would guess that's a pretty big feature to build, and I don't really have a good recommendation around it. At my company we do something different, where we allow you to create different sub-accounts that point to different clusters and indexes, and we can search across those, but we built a whole bunch of abstraction to make that happen.
It's part of our platform, but it would be really hard to contribute, because it changes the way all of the back end works, basically. But if you want to contribute something like that, then by all means, I think it would be a cool feature. There may already be an open issue on it, but feel free to suggest it in the channel if it's something you're looking for. So, I'll keep going till someone cuts me off; I'm over, though. Well, I think Bartek's talk is starting, actually. There's a little bit of time; I'll answer one more before I call it. Here's a good one, because I like Apdex. Okay, this is the last one: can Jaeger plus ATM generate Apdex scores? Apdex is an interesting way of trying to summarize user frustration, or how well your application is performing. It's not something that's supported, but it could easily be added to the metrics processor that's there, or you could actually make a different one to do it. So it's a good idea, I think, but not something that's supported today. And with that I'm going to wrap it up. Thanks for attending; it's good to see everyone again. Come and join us in the Jaeger channel, and I appreciate everyone showing up. Thanks again.
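For reference, the Apdex formula itself is simple: requests under the target threshold count fully, "tolerating" requests (up to four times the threshold) count half, and anything slower counts zero. A tiny sketch of the calculation:

```python
def apdex(latencies_ms, threshold_ms):
    """Apdex = (satisfied + tolerating / 2) / total.

    satisfied:  latency <= T
    tolerating: T < latency <= 4 * T
    frustrated: latency > 4 * T
    """
    total = len(latencies_ms)
    satisfied = sum(1 for l in latencies_ms if l <= threshold_ms)
    tolerating = sum(1 for l in latencies_ms
                     if threshold_ms < l <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / total

# Example with threshold T = 500 ms:
# 2 satisfied (100, 300), 2 tolerating (600, 1200), 1 frustrated (5000)
print(apdex([100, 300, 600, 1200, 5000], 500))  # 0.6
```

Something like this could sit downstream of the derived latency metrics, which is why it would be a natural extension of the processor.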