All right, everyone, thanks for joining us today for today's CNCF live webinar, "Uncovering Hidden OTel Traces in a Standardized Manner." I'm Libby Schultz, and I'll be moderating today's webinar. I'm going to read our code of conduct and then hand things over to Steve Waterworth from Asserts and Taylor Dolezal, Head of Ecosystem at CNCF.

A few housekeeping items before we get started. During the webinar you are not able to speak as an attendee, but there is a chat box on the right sidebar where you can say hello, tell us where you're watching from, and leave all of your questions; we'll get to as many as we can at the end. This is an official webinar of CNCF, and as such is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct, and please be respectful of all of your fellow participants and our presenters.

Please also note that the recording and slides will be posted later today to the CNCF online programs page at community.cncf.io, under Online Programs. They're also available via the registration link you used to join today, and will be available on our online programs YouTube playlist. With that, I'll hand things over to Steve and Taylor to kick off today's presentation. Take it away.

Awesome. Well, howdy howdy, and welcome everyone. Like Libby said, if you have any questions at all during the session, please feel free to throw those into chat, and we'd love to surface those and get to as many as we possibly can. I'm really excited to be joined by Steve from Asserts, and we're going to be covering a lot of things as it pertains to open source, and really would just love to dive in and get started. One project that I've heard quite a bit about from many folks has been OpenTelemetry, also referred to as OTel for short. So Steve, OTel me more. I'd love to hear about it.

The dad jokes continue! Yeah, so OpenTelemetry is a vendor-agnostic, truly open observability toolset. Trivia fact for you:
it is the second most popular project in the CNCF as far as source code activity goes, with commits and the like, second only to Kubernetes. So there's a lot happening with it. Where we're predominantly interested in it is from the tracing side of things: as we know, with cloud native applications, microservices and the like, being able to do distributed tracing can be a very good diagnostic tool when things aren't going quite as well as you'd hoped they might.

I find it funny that the second most popular project in the CNCF portfolio right now is one focused on telemetry and observability. It's great to see that it's doing so well in that space and that folks just want to know what's going on. One other project that I've seen paired with it quite often, kind of like chocolate and peanut butter, has been Prometheus. Is that one that you can talk a little bit more about?

I'm not going to talk about chocolate and peanut butter; I'm not sure that's a good pairing, but there we go, each to their own. Yes, Prometheus is really the flip side: it's all about time series metrics, and it's probably one of the more mature projects in the CNCF. Let me just flick over to my crib sheet to be absolutely certain, but I think it graduated in 2018, something like that. Yes, it did, it graduated in 2018. So it's quite a mature project, and by no means stale; I think it ranks as number seven among the most active projects in the CNCF.
So, you know, it's still very much up there, still very active. Prometheus is all about collecting time series metrics, and metrics really are at the heart of any observability platform or solution that you're building. Everything, I think, is really driven by metrics: that's where you start, and then you hop off into logs and tracing.

It's been interesting to see what folks are using Prometheus to measure, and it seems kind of like a dark art for some when it comes to figuring out the right way to craft their perfect dashboard or their single pane of glass. I know that can be a little bit difficult for some folks, but I feel like once you have that set up, you have a really good view into the things you actually want to see. I know we use it at the CNCF for things like DevStats, looking at those various projects, folks, companies, and other types of observability as it pertains to our projects: who's contributing to them, what the project health might look like, and some metrics there. So I know that we're, yes, biased, but very happy to have the project, because it helps provide a lot of insights on that.

Yeah, Prometheus is great because it's very easy to get it set up and running, particularly if you're running it as a container, and it's not a difficult install if you're running it natively on the operating system. It's very easy to get Prometheus up and running in a Kubernetes environment: using the Prometheus Operator is an absolute no-brainer. That's the easiest way.
It's a simple Helm install and you're there. So yeah, getting those metrics in: there's a whole bunch of exporters for various other software components, and a lot of other CNCF projects expose a Prometheus metrics endpoint by default anyway. So it's very quick and easy to get Prometheus running and fill it up with a few million metrics on absolutely everything. And of course, as you've already alluded to, it doesn't just have to be bits and bytes and computery things; it can be business-type metrics as well. You can measure, you know, the average cart size if you're an e-commerce platform, or the number of items in a cart, anything like that.

I really like as well that within each of those components, when you take a look at Prometheus, you can actually inspect the data in many cases, and even download it in CSV or other formats and continue to work with it, if you might not like your view, or you might be locked to a specific view within your organization. I'm not sure if you have any Easter eggs or tips or tricks when it comes to that?

Yeah, this is starting to get to one of the challenges with implementing these open source solutions, particularly with Prometheus. It's very easy, as I said, to get a bucket of two million metrics. The problem is that now you need to get information out of it. Data is not information.
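To make the business-metric idea concrete: whatever your application measures ends up as Prometheus's plain-text exposition format on a `/metrics` endpoint. Here is a minimal, hedged sketch of that format rendered by hand (the metric and label names are invented for illustration; in practice you would use an official client library such as `prometheus_client`):

```python
# Sketch of the Prometheus text exposition format for a business metric.
# This is what a /metrics endpoint serves and what Prometheus scrapes.
def render_metric(name, help_text, metric_type, samples):
    """samples: list of (label_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)

# A hypothetical e-commerce gauge: items currently in a customer's cart.
exposition = render_metric(
    "shop_cart_items",
    "Number of items currently in a customer's cart",
    "gauge",
    [({"customer_tier": "gold"}, 7)],
)
print(exposition)
```

Each label (`customer_tier` here) becomes a separate time series dimension, which is exactly how label cardinality can quietly multiply into millions of series.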
It needs to be processed to turn it into information, and this is where the hard work really begins. You suddenly find: yes, I thought I was done when I had everything installed and all my scrape configs set up. Right, but now I've got to start building health rules, because I want to get alerted on things, and I want to be able to see the data, so now I'm going to start building a whole bunch of dashboards, and I want to be able to link from one dashboard to another, so I've got to build all those links in to match my topology. And you realize that actually getting it installed and setting up a scrape config was just the first step on a very, very long and never-ending journey.

It's so true. I like that it at least gives you a framework to iterate on, and, you know, what looks good or is relevant to you right now might not always be the case, so giving you that ability to change things and make it modular is quite helpful. And like you said, I've seen people use it for all kinds of wild things, from tracking all of the times that you fed the dog or went on a walk, to measuring personal fitness and anything else in that category. It's really fascinating to see all the different ways people use it.

Yeah, actually, I use it personally as well, in a little home automation project. My wife's a keen gardener, so we monitor the temperature in the greenhouse and rainfall and all of this, and it's all just metric data that goes into Prometheus.

There was one group in, or just outside, the Bay Area, in Napa Valley, and I had a friend send me some images. This winery actually uses it to measure soil wetness and humidity and all of these other things too, and they have all of these Prometheus graphs projected onto their little IMAX-esque viewing decks. A really cool and wild way to see people using that.

Yeah, and
then the challenges don't stop there, because then we're talking about observability, and there's more to observability than just metrics. Metrics are certainly your starting point, but as we said, with OTel we've got the traces, and logging is as old as the hills; most organizations will already have some logging solution, be it open source or proprietary. There's a pretty good chance that logging will already be there.

So then the challenge becomes: how do you tie it all together? You want to be alerted on something when maybe a metric isn't at the value we hope it should be, like increased latency or excessive resource consumption, and you want to dive into that, and then you want to be able to go and look at the traces for that transaction, or the logs from the container. How do you pull all that together? And that's where it gets really difficult, doing that manual correlation, particularly with distributed systems, when a single request could hit, you know, a dozen different microservices, possibly in different Kubernetes clusters, maybe even in different data centers. So how do you pull all that together?
And that's sort of the work that Asserts has been doing: adding that layer of intelligence and automation on top of these great open source tools to help you pull it all together. Now, sure, you can do it manually, but having something help you do it makes life a lot easier and saves you a not inconsiderable amount of time, effort, blood, sweat and tears.

We've talked a little bit about how folks are using OpenTelemetry and Prometheus, and you've covered this a little bit already, but I'd like to dive deeper into why people are using those solutions for their problems.

Yeah, I think the key area there is that you're avoiding that vendor lock-in. The open source tooling is so good now, there's no requirement to pay for proprietary agents. Being able to collect observability data and store it, that is commoditized; you can do that very easily and at minimal cost. Obviously it doesn't run on thin air, so there's a bit of compute cost somewhere, but you certainly don't have to be paying licenses to an organization in order to collect observability data. And in fact, in many cases the free tools are better than the proprietary tools, in that there are no limits on the custom metrics you can have, and also the maturity and range of collectors that are out there often surpasses what is available commercially.

And I think that while it might in some cases be a little bit more difficult to set up the things that you care about initially, you're going to have that much longer-term satisfaction, especially around cost. You know, it's just a little bit of going through the wall the first time; after that it's fairly smooth sailing, keeping up with the nautical terminology. When people are using OpenTelemetry and Prometheus, what kinds of challenges have you seen folks running into?
Yeah, so as I said, once you've broken free from that license cost and you've embraced open source, you've got all this fantastic data. We've already touched on the fact that turning that data into information is a challenge; there's a lot of work there. And then there's the correlation aspect of it: being able to pull data from different places and have it all related to, maybe, the issue you're working on.

And then one of the other challenges is data volume. Because it is so easy to collect all this data, you end up paying a penalty on storage costs, particularly around tracing. Metrics are tiny; I think Prometheus is about one and a half bytes per sample, something like that, if you have a look at their documentation. But tracing is much more the worst offender there, with a span being about 2 KB by the time you've got all the baggage in it, and a particular transaction may be a dozen spans, so you can soon be into a few kilobytes per trace. And then, if you're a busy site, you've got millions of traces; that's very quickly a lot of data, and a lot of storage cost, and processing power as well. So there's got to be a better way of doing it.

What most people conclude is: oh, this is a lot of data, we can't possibly trace everything, what we need to do is some sampling. And OTel offers various sampling strategies, but they're all a bit of a blunt instrument. You can say, well, I'll just take 10%, which is great, but then Murphy's law comes to the fore and says: oh, when there's a problem, the traces I need were the ones that weren't sampled.
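The "just take 10%" head sampling Steve describes can be sketched in a few lines. This is a simplified stand-in for the idea behind OTel's trace-ID-ratio sampler, not the real SDK: keep a trace if and only if its trace ID falls below a fixed threshold, so every service makes the same keep/drop decision for the same trace without any coordination.

```python
# Sketch of head-based, trace-ID-ratio sampling: the decision is a pure
# function of the (random) trace ID, so it is consistent across services.
import random

def should_sample(trace_id: int, ratio: float) -> bool:
    # Keep the trace iff its 128-bit ID lands in the bottom `ratio` slice.
    return trace_id < int(ratio * 2**128)

random.seed(42)
kept = sum(
    should_sample(random.getrandbits(128), 0.10) for _ in range(100_000)
)
print(kept)  # roughly 10% of 100,000
```

The blunt-instrument problem is visible right in the code: the decision is made before anything is known about the request, so a slow or failing trace is dropped with exactly the same probability as a healthy one.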
So I'm still blind.

So we've heard a lot about OpenTelemetry as well, and I remember early days of folks focusing on tracing and being told: you can set up tracing, but it's something that you have to instrument your application for. Have you seen that change when it comes to OpenTelemetry, and the amount of effort needed to get started looking at function calls and delving a little bit deeper when adopting OpenTelemetry?

Yeah, OpenTelemetry has done a lot of work on the tracing; you know, it is the second most active project, and there's a lot of automation in there now. It tends to be very much language specific. Some languages make themselves easier to instrument automatically, and some, more of the compiled languages like Go, are a little more difficult to do. But certainly something like Java, which has had a standard for the Java agent since about Java 1.5, if we can remember that far back, has a standard API for it, so it's very easy to automatically instrument your Java application. For some of the others the automation is maybe not quite as advanced, but certainly for things like Go there's a whole bunch of middlewares: if you're using Gorilla or Gin to do your request routing, there are wrappers for that. So yeah, it's manual effort, but it's like changing two lines of code; it's not a huge effort. It's not like you've got to go and hit every single request endpoint and put in a dozen lines for each one. It's just one wrapper.
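The "just one wrapper" point can be illustrated with a toy decorator: instrument every handler at a single choke point instead of editing each one. The span plumbing below is simplified stand-in code for illustration, not the real OTel SDK or any specific middleware:

```python
# Sketch: one wrapper that times every request handler and records a
# span-like record, instead of hand-editing each endpoint.
import functools
import time

SPANS = []  # a real SDK would export these to a collector instead

def traced(handler):
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            SPANS.append({
                "name": handler.__name__,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
    return wrapper

@traced  # the "two lines of code": import the wrapper, apply it
def get_user(user_id):
    return {"id": user_id}

get_user(7)
print(SPANS[0]["name"])  # get_user
```

Framework middlewares do the same thing one level up, wrapping the router itself, which is why a single registration call covers every endpoint.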
I think that's nice. I know we've also gotten a little bit further along with things like transparent proxies for service meshes and those kinds of concerns as well. So it's great to know that it doesn't take as much time or effort to get those things instrumented, and we're starting to see more capability right out of the box.

Yeah, that's another approach, using a service mesh. With Istio and Linkerd, both of those you can configure and they'll spit out OTel spans as they're routing the traffic across the mesh. So that's another way of doing it. It's an added layer of complexity, so like all things in engineering, there are swings and roundabouts: you gain in that you haven't got to go and reconfigure each service manually, but then you're adding another layer of complexity with a service mesh. But then a service mesh can do lots of funky things for you as well.

I saw a question come in asking whether OpenTracing is now within OpenTelemetry, and yes. I think that was a really interesting part of the history of the project too, if you want to go into that.

Yeah, well, now you're going back into the dim mists of time, in computing terms anyway; probably like a year ago in real time. So OpenTracing was probably the first open standard for doing distributed tracing, and actually a lot of the commercial products are built on top of those OpenTracing standards. And then OpenTelemetry came along, and it has a broader remit than just distributed tracing: it also includes metrics and logs, although the support for metrics and logs isn't as mature as it is for tracing. If you actually go and look at the various project statuses, most of them are pretty much there with mainline, generally-available releases on the tracing side, but you look at metrics and logs and there are still a lot of alphas and betas and "don't deploy this in production" type caveats on
it. So, yeah, OpenTelemetry absorbed the OpenTracing standards. Amazing.

It's wild to look back and see which projects within the CNCF have gotten merged or archived and things of that nature. I remember reading about, was it OpenCensus and OpenTracing, and being really excited, back also in the dim mists of time when I was working at Walt Disney Studios. There were so many projects to kind of put together, so seeing those culminate together as one, as OpenTelemetry, I think was really helpful. Same thing, you know, continuing to modify what's needed and really focusing on adaptability and usability within the project; it's really great to see them move in that direction.

Yeah, I also like the concept they have with the OpenTelemetry Collector. This sort of acts like a patch board, that's the best way to describe it, I suppose. Your various services, or your service mesh, send the data to the collector. You can set up various receivers there in the collector, so it'll receive the metrics and trace spans. Then, optionally, you can configure processors, so it can actually massage the data before passing it on; we'll get on to what Asserts is doing with that in a little while. And then it can dispatch that data to one or more back ends. So if you've got Zipkin and Jaeger, you don't have to choose; you can have it go to both, or off to one of the cloud providers: you could use Google Cloud Tracing or AWS X-Ray as your trace store, or of course Jaeger, probably the most popular one in the open source world.

When I was working in a previous role, one of my colleagues was talking a little bit about annotations, and they were implementing some service service mesh workloads.
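The "patch board" pipeline Steve describes, receivers in, processors in the middle, exporters fanning out to more than one back end, looks roughly like this in a Collector configuration. A hedged sketch: the endpoints are placeholders, and the exact exporter set available depends on your Collector distribution and version.

```yaml
# Sketch of an OpenTelemetry Collector pipeline: receive OTLP spans,
# batch them, and fan out to two trace back ends at once.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:            # buffer and batch spans before export

exporters:
  zipkin:
    endpoint: "http://zipkin:9411/api/v2/spans"   # placeholder host
  otlp/jaeger:      # Jaeger accepts OTLP natively
    endpoint: "jaeger:4317"                       # placeholder host
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [zipkin, otlp/jaeger]
```

Swapping or adding a back end is a change to the `exporters` list only; the services emitting spans don't need to know or care.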
Well, try saying that five times fast. And they were talking about annotating that, and losing that annotation about halfway through, so they didn't get to see that full traceability until they had the aha moment and said: oh no, this actually needs to be annotated each step of the way. So as you're passing along this thread, or this call, that's something that you have to be mindful of.

With that, I'd like to transition into where it is that you work, talking a little bit about Asserts and what you're doing with OTel and Prometheus. But really, I'd love to hear first: what are you doing at Asserts? What's the company focused on? What's your mission and vision? What are you working on?

Yeah, so as I said, we're working on providing a layer of intelligence and automation on top of these great open source tools. Both the founder and I previously spent time at AppDynamics, so we've both got this background in APM, monitoring, observability, call it what you will. We sort of had this epiphany that there's all this great open source tooling out here now to collect your observability data. So that's done, tick; don't reinvent the wheel, why would you do that? Just use the great open source stuff that's there. And we realized, you know, the problem is then turning all that data into useful information and doing correlations, and we thought: well, how can we help people do that? So we built this layer of intelligence and automation on top of these great open source tools that provides that correlation information and also helps manage the data, so you're not drowning in data. We distill the data down. So,
You we we distill the data down and say, yeah sort of talk about Let's talk about the sort of the metric side of things first So you really only got two use cases for the metrics You're in the short term you want as much as possible for troubleshooting So in case anything goes wrong, you won't find grain metrics on absolutely everything But because it's expensive to keep that long term because the other use case you because you have is for long-term analysis and reporting And we've got CICD pipeline. We're throwing out release after release after release It'd be useful to know if you're making things better or worse You know, are these services getting faster and less error prone or are they getting slower or more error prone? So you you want to you want to have that long-term Data for that analysis So what you don't want to do of course is keep everything forever because well Prometheus doesn't really like it and it becomes very cumbersome to try and run a big a big Prometheus with storing everything forever So we've essentially automated that we take your existing Prometheus, which typically 15 days of retention If you're still if you're still troubleshooting after 15 days, you've got other problems So we take that so we essentially do queries on that data Run it through a set of rules and then Store load cardinality data long term So it would probably Knock the data volume down to about two five back to about 10% of what it was So then it's really easy or relatively easy to store that long term So then you can still do your trend analysis. Hey, you know, have we made this service better? Are there more errors less errors? Is it going faster? Is it going slower and also for customer metrics? You know, are people buying more are they buying less? Is customer engagement getting better as performance improves? I like that and and I like what you said around just storing the right data and actually being actionable on it, right? 
I think that when it comes to it, you know, if I fill my garage with all of these things or packages, if I just keep pushing things in there because they're important, okay, that's great, but then I've filled up my garage and I can't park my car there. Or use the same analogy with a closet, or any kind of room. You know, if I just pulled in all of the mail, that would include my junk mail too. So I like that you're taking the time to focus on making this data actionable.

Yeah, the other thing we do is help people get started. Like you said earlier, it's really easy to stand up Prometheus with a bunch of collectors and exporters and a scrape config, and if you're using the Prometheus Operator in Kubernetes, it's even easier. So you've got this data, but you've got no real way of understanding and visualizing it. So the Asserts product ships with a curated library of pre-built dashboards and health rules for all common technologies. From day one you can be effective; you can actually be productive and start using the data you're collecting without having to spend weeks or months building dashboards and writing health rules. Of course, it's not going to be one-size-fits-all; there's always going to be some uniqueness to each environment, so you can still write your own. And the dashboarding, we're building again on the open source: we embed Grafana in the product. So if you have some favorite Grafana dashboards, you're not saying goodbye to those; it is Grafana, you can just import them, and if you tag them correctly, they will also appear in the right place contextually as well.
So you don't have to go hunting for them.

That's really helpful, and I think that folks would be overjoyed to hear that: hey, we can save you a couple of weeks, or months, of time. Even for folks that have implemented OpenTelemetry and Prometheus and Grafana and these other tools, do you help leverage making their stacks better? I definitely, I won't name who, but I have heard folks say: hey, we set this up four years ago and we really haven't touched these rules since. Is that another kind of problem case that you help solve for?

Yeah. So, it's a curated library, so with each new release there may be updates to the rules as things change, and you get newer releases of the software components you're running, and they may behave slightly differently. Those rules are constantly tweaked and massaged to be the most effective. Like I say, you always have the ability to override and tweak and tune, or disable one of our health rules if it's nagging you, if you think: actually, I don't need this, this is fine in my environment, I don't care about that. You can squelch it down and turn it off. And say you've got unique things in your environment: there's a particularly important message queue, and if the queue depth is greater than five, oh dear, we're in trouble. That's a very unique rule to you, and you can just add that in there and you'll get notified about it.

I think that's helpful too, being able to focus there. You make a great point about alert fatigue, right? It's like being notified every time you get a sale; you're like, no, that's a good thing, I can look at that in a different way, I don't need to get bugged about that at three in the morning.

Yeah, well, the way we handle that is, as you know, in any large system
there's always something running a little hot or going a little slow, so you get this constant chatter of alert notifications, and the vast majority of them probably aren't actually that important. There may be something that could be tuned later, but you certainly don't want to be woken up at three o'clock in the morning to be told: hey, the CPU consumption on this container was a little hot for a minute. Who cares?

So the way we manage that is to really operationalize SLOs. Hopefully everybody's read the SRE handbook, or at least flicked through it, so you know what the acronym stands for: service level objectives. The idea is you set up SLOs on the things that are important, like: users must be able to log in in less than 500 milliseconds, or payments have got to go through in less than 300 milliseconds, or the error ratio on the integration with a shipping service; anything like that, you can set up SLOs against it. And of course, behind that service there'll be a bunch of other software components that make it happen; there could be a dozen microservices underneath, and some data stores and some caches. Now, they can all be having little issues, little moments where they run a little hot or a little slow, but if it doesn't impact that overall SLO, then we're not going to alert you. We still record that those things happened, but you're not going to get that emergency page or Slack message at four o'clock in the morning telling you to panic; only if the SLO is in danger of breaching, or has actually breached. So we monitor the SLO burn-down, and if we see a rapid acceleration in burn rate, rather than wait for it to smash through and head off to the hills, as it starts accelerating we'll issue an alert and say: hey, this SLO is looking shaky.
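The burn-rate arithmetic behind that kind of early warning is simple. A hedged sketch (the numbers and the 30-day framing are illustrative, and this is the textbook SRE-handbook definition, not Asserts' actual algorithm): burn rate is the observed error ratio divided by the error ratio your SLO budgets for, so a burn rate of 1.0 spends the whole error budget in exactly one SLO window, and anything well above 1.0 is worth an early alert.

```python
# Sketch of error-budget burn rate: observed error ratio divided by the
# budgeted error ratio implied by the SLO.
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    error_budget = 1.0 - slo          # e.g. 0.01 for a 99% SLO
    observed = bad_events / total_events
    return observed / error_budget

# Hypothetical: 140 failed logins out of 10,000 against a 99% SLO.
rate = burn_rate(140, 10_000, slo=0.99)
print(rate)  # ~1.4: burning budget ~40% faster than sustainable
```

Multi-window variants of this (a fast window to catch sudden spikes, a slow window to confirm it isn't a blip) are what keep the alert from firing on a one-minute wobble.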
You might want to take a look at it.

I've seen folks implement some alerting, implement SLOs, and have some key metrics or key uptime, deliverability and reliability factors that they're trying to aim for, though they will set monitors and alerts on objectively the wrong thing. Like you said, this container was running hot for two minutes, but Kubernetes is going to reap that and bring it back anyway, so it's not that much of an issue. Or the autoscaler just hasn't kicked in yet to adjust for this influx of traffic. There are hours of those stories, many fun moments in retrospect, not at all fun in the moment. But for folks implementing monitoring and alerting, making sure it's the right kind is also really helpful. And it sounds like you have another layer on top of that?

So, really, one of the clever things, I don't understand how they do it, it's some very clever engineers that wrote it all, but one of the very clever things we do is analyze all the metric labels. A Prometheus metric obviously has its value, but it also has a whole bunch of labels describing what the metric is about. So we analyze those metric labels, and similarly, traces have tags, which are the metadata about the trace; it's not just the timing. So we analyze the trace tags, and from that we can build up a graph database of how everything's interconnected. It's not just service to service, which is what tracing gives you; it's also the stack that it's running on. So I like to think of it as a four-dimensional graph of your application topology: service to service, which is your x to y; the stack, which is your z, the depth; and then we record it all over time.
So at any single point in time, we know what was talking to what and where it was running. So when there's an incident, your SLO goes bad, and oh no, users are taking 1.2 seconds to log in; we didn't want that, we definitely wanted it at 500 milliseconds or less. So what went wrong? Without that graph, you're relying on maybe your own knowledge to know that, hey, this user service uses this database and this cache, and piecing it together that way, or having to ask a colleague. But Asserts has done this for you: when that incident is generated, it automatically traverses that graph database and collects everything together onto one dashboard, so all the information you need to troubleshoot that incident is just right there. You're not rummaging around fishing for stuff and asking colleagues.

I mean, everybody loves a good scavenger hunt, especially with their metrics, trying to figure things out. I think that's a great point. When SLOs go bad, or even if you don't break an SLO but you had a really impactful event and your team was still scrambling to meet that SLO, do you have any tools or features available to help out with that root cause analysis, or anything like that?

Yeah, like I said, when that incident happens, say with that login service, it's now taking a lot longer than the target of our SLO, that generates an incident, and you'll get notified. We just use standard Prometheus Alertmanager, so whatever hooks into that, all the usual candidates there, PagerDuty and the like, you'll get notified. Then you can go into the dashboard, and as I say, it's on that one dashboard. The SLO was against an endpoint on the user service, but that user service has dependencies.
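The topology-graph idea just described can be reduced to a toy sketch: build undirected edges from span and metric-label metadata ("what talks to what" plus "what runs on what"), then, when a service's SLO breaches, pull in everything one hop out as candidates for the troubleshooting dashboard. Entity names below are invented for illustration; Asserts' actual graph store is of course more sophisticated than a dictionary.

```python
# Toy sketch of an application-topology graph and a one-hop traversal.
from collections import defaultdict

edges = defaultdict(set)

def connect(a: str, b: str) -> None:
    edges[a].add(b)
    edges[b].add(a)

# Service-to-service edges (discoverable from trace tags) ...
connect("user-service", "mongo")
connect("user-service", "redis-cache")
connect("checkout-service", "payments-api")
# ... and stack edges (discoverable from metric labels).
connect("user-service", "pod/user-7f9c")
connect("pod/user-7f9c", "node-3")

def blast_radius(entity: str) -> set:
    """Everything immediately connected to the failing entity."""
    return edges[entity]

print(sorted(blast_radius("user-service")))
```

When the user-service SLO breaches, the one-hop neighborhood (its database, its cache, its pod) is exactly the set of things worth putting on the single incident dashboard.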
Yeah, it obviously runs somewhere. If we take Kubernetes, there's the service, so there's a pod, and that pod will be running on a node within a cluster. But that service may use a cache, it may use a database, it may call a whole load of other things. There could easily be a dozen services, a dozen microservices, involved in that. So which one's causing the problem, or which more than one is causing the problem? You want to be able to go and investigate and check everything out. And as I said, the graph database we've built understands all those relationships, all that service to service and the stack. So what it does is traverse that database, pull in everything that's immediately connected around it, and put all that onto one dynamic dashboard for you. So you're not having to fish around trying to find out what's going on where, and also any of those dependent services, if they've had any issues, are highlighted as well. So you can see that maybe it's not actually our user service itself; it's reliant on this Mongo database, and the Mongo database was running a little slow. Then you can go and investigate: hey, why was Mongo a little slow? Because it's running out of resources, or whatever.

I love it when it's just a capped-resource kind of problem. It gets less fun when it's like, oh, this null type is undefined, what is going on?

Yeah, yeah, that's an easy example. There's a lot more that can go wrong than that, as I'm sure we all know, and it's great to have that telemetry to dive deeper. Amazing. For folks that have any questions, I'd love to urge you to throw those into chat and we can get to those. I've got a couple more to go, but we'd love to hear from all of you.

Yeah, so on that theme, then, of troubleshooting.
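The dashboard assembly described above, traversing the graph from the failing service and pulling in everything immediately connected, is in essence a bounded breadth-first walk. A minimal sketch, with a toy topology standing in for the real graph database:

```python
def incident_scope(graph, failing_service, depth=1):
    """Collect every entity within `depth` hops of the failing service;
    these are the candidates a troubleshooting dashboard would surface."""
    seen = {failing_service}
    frontier = {failing_service}
    for _ in range(depth):
        nxt = set()
        for node in frontier:
            nxt.update(graph.get(node, ()))
        frontier = nxt - seen
        seen |= frontier
    return seen

# Toy topology: the user service depends on Mongo and Redis and runs in a pod;
# Mongo's node is one hop further out, so it only appears at depth=2.
graph = {
    "user": {"mongo", "redis", "user-pod"},
    "mongo": {"node-1"},
}
scope = incident_scope(graph, "user")
```

Keeping `depth` small mirrors what the talk describes: the dashboard pulls in what is immediately connected, rather than dumping the whole topology on you.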
So you've got all your metrics and all the dashboards immediately accessible from that dynamically created dashboard, but equally, from there you can jump out into logs. So you've got your existing logging solution, maybe an ELK stack, and we'll jump out to ELK and you'll arrive there with a deep link: the time range and the search query are already filled in for you, so straight away you're looking at the appropriate container logs or subsystem logs, whatever it happens to be that's linked across. And then the same thing with the tracing, and we do quite a clever thing with the tracing. As I said, we have our own OTel Collector module, and we're using that for two purposes, really. First of all, we're analyzing all the trace tags, so that OTel Collector we've got creates a bunch of Prometheus metrics from all the spans it sees, and they get scraped in. That helps us build our graph, but we're also looking at the timings. We're building baselines, multi-period baselines, for each endpoint, and therefore we know whether a particular call to an endpoint is normal or not. Is it slower than normal, or was it normal? If you think about it, ideally most of your requests will be handled in a prompt and error-free manner; they won't be interesting. They'll be perfectly normal requests that came through, not a problem at all, and you're generally not that interested in those. It's only the slow and error ones you want to delve into and go, oh, well, why did this one go wrong? And if you've got an SLO of 99 percent, that means in the worst-case scenario you're only expecting 1 percent of those traces to be interesting, to be slow or erroneous. So hell, you know, why am I trying to collect all of them, or 10 percent of them?
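The talk doesn't detail how those baselines are computed, but the idea, learning what "normal" latency looks like per endpoint from the spans already turned into metrics, can be sketched with the standard library. Using a single p95 as the baseline is my assumption for the sketch; a real system would, as Steve says, keep multiple baselines over different periods:

```python
import statistics

def latency_baselines(observations):
    """Compute a per-endpoint latency baseline (95th percentile)
    from (endpoint, duration_ms) observations."""
    by_endpoint = {}
    for endpoint, duration_ms in observations:
        by_endpoint.setdefault(endpoint, []).append(duration_ms)
    baselines = {}
    for endpoint, durations in by_endpoint.items():
        if len(durations) >= 2:
            # quantiles(n=20) returns 19 cut points; index 18 is the p95
            baselines[endpoint] = statistics.quantiles(durations, n=20)[18]
        else:
            baselines[endpoint] = float(durations[0])
    return baselines

# 100 login calls between 100 ms and 199 ms: the baseline lands near the top
obs = [("/login", 100 + i) for i in range(100)]
baselines = latency_baselines(obs)
```

With a baseline per endpoint in hand, classifying an incoming span as "normal" or "slower than normal" is a single comparison.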
So what our OTel Collector does, having sent all its metrics up, is then call back and pull down the baseline information. So for each endpoint it sees, when a span comes in, it knows whether that's slower than normal or not, and therefore whether it's an interesting one. If it's slow or erroneous, it passes it through to the back end, and if it's just a regular one, we drop it, because, hey, we don't need to fill our storage up with all these perfect traces. Taking that 99 percent SLO type argument, that's going to reduce your stored traces down to 1 percent of the trace volume, which is a good thing. And if any of you are using cloud storage, it will quite often sneak you into the free tier, which is even better.

Always helpful when you can actually make use of the free tier. It's like, oh, okay, good, I came in under the limits on that front. I like that you talked a little bit about the ELK stack as well, and I'd like to focus on that data collection strategy. So does that mean you don't necessarily have to change it? With OpenTelemetry and with Prometheus, can you keep the same measures and measurements you used to have?

Yeah, absolutely. The idea is, if you've already implemented open source, great, well done, we love you, and we just sit on top of what you've already done. We're just providing that layer of intelligence and automation to make your life easier on top of the hard work you've already done, and freeing you from having to manage and maintain hundreds of dashboards. You'll probably knock that down to just a handful of very specific ones for your business; the rest, the commodity dashboards, we've done for you. The same thing with those health rules. And we also solve that real problem of how you correlate across a distributed system.
How do you go from one service to another, and from traces to logs and back to metrics? How do you manage all of that? Well, hey, we're doing that for you.

I think that's one of the most painful things I've dealt with in other roles and responsibilities: being told, okay, you have to rip out everything you have installed and install this new operator, and no, you have to use our agents, and everything like that. It's so much nicer to be able to leverage what already exists, and it makes for an easier adoption path, at least in my experience.

Yeah, that's what open source tooling is all about: you can implement these great open source tools and you're not tied in anywhere. You can use that data wherever you want. You can choose to send it off to some licensed cloud software company, or you can use one of the big cloud providers as a service, you know, Prometheus as a service, the big cloud providers do that, or you can run it yourself. The freedom is yours; you can choose to do with it what you want. And that's definitely our philosophy. We're not saying, all that hard work you've done with open source, rip it out and start again and install our agent. No, we just want to sit on top of the data you've already got and allow you to do a lot more with it.

I understand it's a lot more job security to keep rewriting it, but... Amazing, amazing. I think one thing that helps, too: I took a look at the Asserts site, and one thing I thought was interesting was the intelligent sampling you have, because a core concern a lot of people focus on, especially right now with workforce reductions and everything else, is cost. So by utilizing things like that with Prometheus and OpenTelemetry, can you talk to some of those cost-cutting concerns that come into play?
Yeah, like I said, by doing the intelligent sampling of those traces you're going to significantly reduce the amount of storage you need. Traces are big and there's a lot of them. You can turn the dial up and try to collect a lot, but it's a horrible balancing act, because if you collect all the traces, there's a lot of them and it's really expensive. You've got egress charges if you're sending them somewhere, and you'll certainly break through the free tier of that cloud provider, so you've got big bills for the storage. If you're running it yourself, you've got to scale up a big Cassandra cluster to store all that data, and again you're going to be burning through storage space. But then if you go the other way and turn it right down, well, Murphy's law clearly states that the traces you want are the traces you didn't sample. So what's the point of doing tracing in the first place? The thing you really, really wanted it for, you didn't get. So doing that intelligent sampling is the best of both worlds. It's going to really compress your data down, so you don't have the cost associated with trying to process and store all those traces, but having compressed it down, you've still got the really interesting ones when you're trying to do that problem-solving. Why didn't that work? Where did that go? Why did that throw that error? You can go and open the trace up; you've got all those interesting ones, you haven't thrown them away.
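The keep-or-drop decision Steve describes can be sketched as a simple predicate over each finished span, plus the back-of-envelope volume argument from the talk. The "2x slower than baseline" threshold here is an illustrative assumption, not a product setting:

```python
def keep_trace(span, baselines, slowdown=2.0):
    """Tail-sampling decision: keep spans that errored or ran much slower
    than their endpoint's baseline; drop the routine, healthy ones."""
    if span["error"]:
        return True
    baseline = baselines.get(span["endpoint"])
    return baseline is not None and span["duration_ms"] > slowdown * baseline

baselines = {"/login": 200.0}
spans = [
    {"endpoint": "/login", "duration_ms": 180, "error": False},  # normal: drop
    {"endpoint": "/login", "duration_ms": 900, "error": False},  # slow: keep
    {"endpoint": "/login", "duration_ms": 150, "error": True},   # error: keep
]
kept = [s for s in spans if keep_trace(s, baselines)]

# The volume argument: if a 99% SLO is being met, at worst ~1% of traces
# are slow or erroneous, so roughly 1% of the volume gets stored.
stored_per_day = 10_000_000 * (1 - 0.99)   # ~100,000 of 10 million traces
```

The two interesting spans survive while the routine one is dropped, which is exactly the "compress the data but keep the evidence" trade-off described above.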
So yeah, Murphy gets frustrated at that point. He'll have to write a new law. My last question, then. We'd love to tie it up and talk a little bit more about any calls to action or anything you'd like to point out with Asserts. When it comes to understanding the relationship between your data and automating that correlation, are there any tips, tricks, or things that Asserts offers to help out with that, when it comes to OpenTelemetry and Prometheus?

Yes, as I said, there's that graph database we build by analyzing the metric tags and the tracing tags to build up that relationship model, so we know what services are talking to what and where they're running. That makes your life a lot easier; you're not relying on tribal knowledge. You might be an engineer trying to troubleshoot a payment gateway, but you're dependent on a bunch of other services and they're giving you some trouble. Right, but you don't know how they're deployed, so you've got to call somebody else in, and then they're, oh yeah, but that uses this databasey thing, and I know nothing about that, so I've got to call somebody else in, and suddenly your war room's got half the company in it. And if you've got all your engineers in a war room scratching their heads trying to figure out a problem, they're not doing what the organization is primarily paying them for, which is writing new features and fixes, so there's a big cost to the organization there. Also, most programmers I know prefer to write code rather than debug code, so they get a bit frustrated. Having the system automate a lot of that donkey work for you is, again, a big boon. It's productivity, and the intangibles: happy engineers.

Amazing, amazing. Yeah.
I can't agree more with wanting to jump into the code and focus less on the rest. I like the idea of code, but writing useful code is always more helpful. Some interesting things I've seen within the community are like the OpenTelemetry demo, which I'll link in the chat for folks: many different programming languages and options to start understanding what's possible with OpenTelemetry. And then you can tie that together with something like Asserts, or craft a dashboard that's really helpful for you. I like that the community is focused on providing an actual use case of how to put these things together, so it's not just wishing you well and leaving you out in the wind to figure it out on your own.

Yeah, a lot of it's pretty easy. I've done a little bit myself in various languages with OpenTelemetry, and it's not actually that onerous to do, because there's quite a lot of automation around the OpenTelemetry libraries. Some things, you don't have to change a line of code; it's just a startup parameter. But even if you do change code, you're talking like six lines. It's not a lot of work.

I think having that ability to link together all of those different types of telemetry, like your logs and traces across your application stack, those kinds of things, is really helpful. Yeah, not the Stack Overflow copy-paste service.

Yeah, that's where most bugs come from. I think someone did some analysis that there was an incorrect snippet of code on Stack Overflow and it was found in like 500 projects across the internet, or something.

Yeah, it's "I forgot to remove example.com", actually. For folks looking to get started with Asserts and just read more about you, do you have any information on that front?
Yeah, absolutely. Go to the Asserts website, which is asserts.ai, and there's lots of useful information there, and great blogs around how to actually use Prometheus, how to set it up and get the collectors all going, and the same with OTel. The other thing to have a great little play with: we have a sandbox environment on the website. So go in there; it's read-only, so you don't break it for everybody else, but you're free to click around and have a look and see how Asserts builds on top of these great open source tools and really glues it all together to make your life a lot easier.

I love that. I always love when you can test something out before you actually go to purchase it and just get a better idea of how to work with it.

Yeah, and if you want to go even further, you can actually run Asserts for free, forever; we have a free version. So you can go and install it yourself. If you just want a quick play around, probably the quickest and easiest way to do it is the Docker Compose for it. You can run it up on a reasonably meaty developer's laptop, or spin a VM up in a cloud somewhere, and just point it at your existing Prometheus. It'll query the data, give it a minute or two, and everything will light up. Obviously, if you're going to do a more serious production install, running it on Docker Compose isn't perhaps the best way of doing it. There's a Helm chart, so you can deploy it into Kubernetes, and then you have all the benefits that Kubernetes gives you, of scalability and self-healing and the like.

Amazing. Again, I just love that accessibility, and the fact of being able to give folks the option to give things a test spin and dive a little more deeply into what's possible for them.
Yeah, absolutely. Awesome. Well, I don't see any more questions rolling in, so with that, we'd love to give a final call for them. Otherwise, Steve, do you have any parting thoughts, wisdoms, mantras, or anything else you'd like to share before we spin down today?

I think we've covered everything I can think of. So yes, it's been great having a chat. Hopefully our audience learned something useful, and they'll go to Asserts and read some blog posts and have a play in the sandbox. My parting word of wisdom is: make sure that your computer's turned on.

That usually fixes it for me. No, you've got to turn it off, then turn it on. There's a good podcast I'll link to later where somebody went on for like an hour about the deep mechanics of why that actually works, and state machines and everything else. But that's another fun conversation for another time.

Awesome. Well, thank you so much, Steve, and thank you so much, everybody, for joining us today on this livestream. We hope to catch you again, and until we see you again, keep your head in the clouds. We'll catch you around. Okay, thanks. Thanks, everybody. Thank you, Steve and Taylor. Thank you. Thank you.