Hey, welcome to the continuous profiling session. We were given microphones, so now it's official. Richard, you know a lot about this stuff, right? Sorry, I'm just- you know a lot about this stuff, right? Just continue.

That's how we define "a lot"? I don't know. So what did you all want to talk about?

So my interest in this is from the perspective of, again, with my SIG Observability or TAG Observability hat on: continuous profiling is the next thing after metrics, logs, and traces, or logs, metrics, and traces, depending on where you're starting. The usual user story I'm seeing is that people with less evolved software systems tend to start with logs, then go to metrics, then to traces. Sometimes they go logs, traces, then metrics, but the common path for new starters is logs, metrics, traces. Whereas with more mature systems, either because people already have larger installations which they have been observing for some time, or things like network and similar infrastructure where people already went through all the birthing pains we are currently seeing in the observability space, they usually start with metrics, because they already had the realization that it's a lot cheaper to reduce information into numbers. Either way, as we are dealing with more and more software, and obviously I'm preaching to the choir here, after tracing comes continuous profiling. That's something we see again and again and again with all my different hats on. So that's why I'm here. I'm interested in how to make this work for the generic case.

Can we start with the TL;DR?

With what?

The TL;DR.

Yeah, sorry. You mean of what continuous profiling is? Any volunteers, or should I? Google it, I don't know. So, okay. A profile is basically: you look at the structure of your running program. The thing which is now often called a profile used to be called a trace in Unix land, which kind of makes everything confusing, but as it's defined today: I take a snapshot of the current state of my system. How much time have I spent in function X? How much memory is this entire thing using? These kinds of things. And then you go from just one snapshot to doing this again and again and again, and seeing how this changes over time, how it evolves. Maybe I'm in an error state, I'm seeing high latencies, I'm seeing this or that error: what is happening with, I don't know, my memory usage at the same time, or my CPU, or which functions? What is my hot path? What are the paths my code is usually running through? All those kinds of things you can answer by looking at the profile of a program again and again and again. With Go you have it relatively easy, because you have your pprof endpoint and you can just look at the thing. In other languages it's not as easy. So initially, from the cloud native slash Go world, continuous profiling used to be: just look at the pprof endpoint a lot, and determine what changes over time and what desirable and undesirable behavior this correlates with.

Very well. Our services at AWS, we use pprof to profile them and we run it on a regular cadence, but we don't currently store it in any backend. It's all used locally, so we can access the pprof information. It's the fourth signal, but what you do with it is not a- I come from an APM background, so it's not a-
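A minimal sketch of the Go side of what was just described, assuming a plain net/http service: the blank net/http/pprof import exposes the endpoint, and a small loop pulls a CPU profile on a fixed cadence. The port, interval, and file naming here are placeholders invented for the example, not anything from the discussion.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"os"
	"time"
)

func main() {
	// The service itself: the blank import above is enough to expose profiles.
	go http.ListenAndServe("localhost:6060", nil)

	// "Continuous profiling, the early way": pull a CPU profile on a fixed
	// cadence and keep the snapshots around to compare over time.
	for {
		resp, err := http.Get("http://localhost:6060/debug/pprof/profile?seconds=10")
		if err != nil {
			fmt.Fprintln(os.Stderr, "scrape failed:", err)
			time.Sleep(time.Minute)
			continue
		}
		out, err := os.Create(fmt.Sprintf("cpu-%d.pb.gz", time.Now().Unix()))
		if err == nil {
			io.Copy(out, resp.Body)
			out.Close()
		}
		resp.Body.Close()
		time.Sleep(time.Minute)
	}
}
```

In a real setup a collector would scrape many targets and ship the snapshots to a backend instead of local disk, which is exactly the "we run it on a cadence but don't store it anywhere yet" gap described above.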
For the recording, I think you need to put the- you need to go directly into it?

Yeah, because otherwise the people on the recording won't- you won't hear it.

So it's the frequency of collection and what you do with the data afterwards, that profile information. Some of the challenges we faced in the Ruby SDK: we had a vendor-specific implementation with a continuous profiler, and it ran in its own thread, so it would get its own context object and its own thread-local variables, whereas the main process running the operations or the request had all of its context information in a separate thread. Trying to share information between the sampling profiler running in its own pthread and the OpenTelemetry SDK running in a separate thread, we essentially had to patch the context to give it access to the other running thread. So some of the challenges in implementation are at least on the Ruby side; is that a challenge other language implementations have faced or might face in the future? Like, you do JS work, right?

Yeah.

Oh, of course. So everything's easy for you.

It's easy for that, you can- JMX.

I mean, we touched on this a little bit in the session before; we touched on this in different contexts, but it's still a little bit the same story. You have languages where this is already built in, like for example Go. You have languages where this is super hard, like for example Python. And you have languages somewhere in the middle, where you at least have something you can hook into, like Java. The thing which all of those have in common is that they run on Linux. So one of the things which I do think might make sense is, again, looking at eBPF. Because there, all of a sudden, you see what function calls happen and how often, and it doesn't matter if it's Java or Ruby or Go or Fortran. You just see what you're hitting, how often you're hitting it, and what time you're spending in between hitting those. You don't get precisely the same depth of information; you wouldn't get what Go's pprof gives you out of the box. But I also strongly believe that over time eBPF will be enabled to carry more context from user space into kernel space, maybe attach a label to a syscall or something, to transport information with the explicit intention of signaling to eBPF. And at that point you don't need to do this in Ruby or what have you.

The other thing is, in particular in scripted languages and such, implementing those kinds of things as part of the language tends to become super expensive. There's a paper from IBM, I think from the 70s, about how much effort and time and money you should be spending on, I think they didn't even use the term monitoring, but whatever, the same concept, with the same underlying desires we are still dealing with today. They came out at 10 to 20%. If you look at actual research, that magic 20% tends to be the cost you're allowed to invest, and also what you should invest, to get a good outcome. It's a lot easier to stay within that in the Linux kernel than, for example, in Ruby, and not go to 50% or something of your total cost.

There was, as an anecdote, a nameless company working on a nameless- it's not Grafana, but I heard about this through them. It wasn't us, but they were running something serverless and cool and blah, blah, blah.
And they realized that over 50% of what they were actually paying for in CPU time was just the service mesh, and they realized this with continuous profiling. This is not an amount of money you should be spending on your service mesh. Same goes for your observability stack.

Was that a configuration issue, though, or was it a particular type of environment or applications or workflows running on the service mesh that was causing this?

I think it was how they called it, or what amount of data they sent in and out. I don't have the details; I would be making stuff up at this point. It was user error or programmer error in how they interacted with the endpoint or with the APIs, but I don't know the details. But this is the type of thing, to come back to the TL;DR, which is relatively easy to find with continuous profiling. Because all of a sudden you're getting those insights. You used to be able to do this with classic systems, just attach GDB and go to town, back when you had different containments for your complexity.

I guess I come back to continuous profiling, but you matched it back up to eBPF: how do you correlate the two, between user space and system calls, and be able to say that this is the aggregate of all these calls, or what's actually contributing to your resource usage or your memory? I would think that would be the correlation.

No. You can't.

You don't have it at that level, though.

Yeah, yeah. You don't. You can see where it happened.

Yep. Just to reiterate what I thought I heard you say, Richard: you can see this coming, or you're predicting that this might happen in eBPF, that they'll add the ability to add custom labels, which would give us the ability to pass the trace ID and span ID context.

Is it aggregated information?

Aggregated on what?

Aggregated in such a way that an individual trace ID or span ID is irrelevant.

In the common case, it will be irrelevant. And also, in all of the things we're talking about, you'll usually have aggregations. Logs and traces don't precisely follow this, but anything else you're seeing is basically already an aggregation, like metrics and continuous profiling and others. Because it's just super expensive to not have aggregates, and you need to go to aggregates as soon as possible. You see this in humanity's history again and again and again: you want to have this level of detail, then you realize you actually don't want to pay for it, and then you start reducing such things to numbers. I think that, A, yes, in my mind, at some point there will be more possibility to actually talk to eBPF, so to speak. But the other thing, and now we're back at the correlation of signals and having unified alerting, single pane of glass: I don't need to tell you that I drank this Kool-Aid and believe in this stuff, otherwise I wouldn't be working where I'm working. If I see my latency spike, I know when that happened, and I can look at my continuous profiles to see what they looked like before, during, and after this thing. Or I see this or that error suddenly spiking. I don't have to extract all of this information from my continuous profiles. I absolutely do not. I would actually argue somewhat strongly that both continuous profiles and traces shouldn't be carrying all this depth of information. Because I already have my aggregates, which are the metrics, and I already have my specific incidents which I care about, which are my logs, and those already carry the meaning.
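On the custom-labels idea raised above: in user space, Go's runtime/pprof already supports profiler labels, which is roughly the mechanism being speculated about for eBPF. A minimal sketch follows; the traceparent header handling and the /work handler are invented for the example, and, as argued above, the common case may not need per-trace IDs at all - the same mechanism works for coarser keys like endpoint or tenant.

```go
package main

import (
	"context"
	"net/http"
	_ "net/http/pprof" // exposes /debug/pprof/* so the labeled samples can be scraped
	"runtime/pprof"
)

// handleRequest tags every CPU profile sample taken while this request runs
// with a trace_id label, so profiles can later be filtered or aggregated by it.
// Reading the raw traceparent header is a placeholder; real code would parse it.
func handleRequest(w http.ResponseWriter, r *http.Request) {
	traceID := r.Header.Get("traceparent")
	pprof.Do(r.Context(), pprof.Labels("trace_id", traceID), func(ctx context.Context) {
		doWork(ctx, w)
	})
}

func doWork(ctx context.Context, w http.ResponseWriter) {
	_ = ctx // samples taken inside this call carry the trace_id label
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/work", handleRequest)
	http.ListenAndServe("localhost:6060", nil)
}
```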
For example, for OpenMetrics, when we talked to Google, I forgot when, we were talking about how to do this for other types of signals, like applying the things which Prometheus does to others, blah, blah, blah. And they said something which really stuck with me: it is not efficient to search for traces. And when Google tells you that searching for something doesn't scale, you'd better listen. What they did is exemplars, and that's what you now find in OpenMetrics, what you find in Prometheus, what you find in Loki, what you find in others. They carry an ID to a trace, and you only need this ID. Because the thing is, if I'm ingesting all my traces with all the raw metadata, I need to actually distill meaning from this again. It's super nice to be able to slice and dice this however I want at runtime, but it's super expensive. I already have the information that, hey, this or that latency is spiking, or this or that error is spiking. And just by attaching those exemplars, I know I can jump into this trace; I have the mental context, I can jump into the thing, I can look at it. And the same is true for continuous profiling, not with specific traces, but at least with time, and which machine, which fleet, which service, and what have you. Because all of a sudden I can say: okay, I have this or that thing, it's in an undesirable state, how do my profiles look?

So I was jumping into looking at what the proposal actually was, what's in the OTEP that describes a bit about how they want to do any sort of correlation whatsoever with tracing, and I was wondering if there was something in here that I could quickly look up. But this is talking about doing correlation with pprof. And then I was looking at, well, I could have sworn I saw that Pyroscope had support for tracing integration. That's also pprof, not eBPF. So these are all pprof, so Go-specific. And I would strongly agree that Go is the default in cloud native, but if you look at, for example, where metrics is going first within OpenTelemetry, it's Java and C#, because that's where the most users are. And it's nice to have pprof, I like pprof, but it's completely and utterly useless to someone using Java.

I don't know; I've looked at the pprof definition for less than half an hour total. Is it standard enough that it's language- and technology-agnostic? Is it only useful for the continuous profiling situation where you're trying to compare multiple profiles over time? I guess I don't understand enough about it. I want to learn more about it and whether it's a good candidate to use as the data transfer type.

pprof itself is really good. And it's, in my opinion, in my strong opinion, standardized enough to just use for different languages. If you were to magically have your JVM or whatever emit pprof-compatible data, you would make a lot of people very happy, and you already have an ecosystem around this which just integrates super well. It also would mean that anyone who- yeah, long story short: if other languages were to speak pprof, I do think this would be a tremendous benefit for the whole ecosystem.
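On the "is pprof standard enough" question above: the wire format is a small gzipped protobuf of samples referencing locations, functions, and a string table, and nothing in it is Go-specific. A hedged sketch of reading one with the github.com/google/pprof/profile package; the file name is just an example.

```go
package main

import (
	"fmt"
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	// Any pprof-encoded profile can be read here, no matter which runtime
	// produced it - the format itself is not tied to Go.
	f, err := os.Open("cpu.pb.gz") // example file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	p, err := profile.Parse(f) // handles the gzipped protobuf encoding
	if err != nil {
		panic(err)
	}

	// What the samples measure, e.g. "cpu nanoseconds" or "alloc_space bytes".
	for _, st := range p.SampleType {
		fmt.Println("sample type:", st.Type, st.Unit)
	}

	// Each sample is a call stack (locations resolving to functions) plus values.
	for i, s := range p.Sample {
		if i == 3 {
			break
		}
		if len(s.Location) > 0 && len(s.Location[0].Line) > 0 {
			fmt.Println("leaf:", s.Location[0].Line[0].Function.Name, "values:", s.Value)
		}
	}
}
```

A JVM or Ruby profiler that emitted this same encoding would plug straight into the existing pprof tooling, which is the "tremendous benefit" argument above.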
I generally strongly, very extremely strongly, believe that open standards are the most important thing about anything we're doing. Because technologies come and go, implementations come and go. But look at the ISO OSI layers: without those, the internet would not exist. You can still read out old Modbus installations from 30 years ago, SNMP installations from 20 or 30 years ago, with the same protocols, because you have those open standards. Even if your implementation is literally 30 years newer, it still interoperates. So yes, having Java emit pprof-type profiles would be awesome.

So why isn't that happening, then? Was it because of the overhead it had?

I mean, I'm not a Java person. Personally, I strongly believe that having a standard here would be great, and pprof is best suited to be that standard, which is arguably outside the scope of OpenTelemetry. But in the end, we care about having the functionality, and this is probably the quickest way to get there.

At least looking at, for example: one of the things using pprof is the Datadog APM, which is what was donated to OTel Ruby, and that's where we're starting implementations from. The profiler that's built into that uses pprof to export data. So that might be an opportunity, or that might be how it evolves for Ruby. I'm looking at this other comment here, added by Peter Pig, about Pyroscope, and it talks about utilizing rbspy. Not that this should be an exploration session just for Ruby, for all of you, sorry to hold you hostage here, but it's just curious, because I'm looking at all these different comments in there. I figure that might be an interesting point, and it seems to align closely with what might be feasible for us, at least. So did they donate it and then hands off, or are they actually actively working on this?

We do not currently have any Ruby engineers from Datadog assigned to or maintaining the project. We did have one, but they moved on to work somewhere else, and then it stopped being maintained and such.

Okay, pity, because what you said initially sounded like this could be- okay. It is what it is. I wonder what conversations you had.

I mean, we're just a fraction of the folks that are interested in profiling. I think the initial call had like 60 people on it. We're doing some hackathons internally and some internal projects on pprof, and evaluating Pyroscope and other ones as the profiler, so we're trying to gather the best way to do continuous profiling. So hopefully we'll share some of the results with the rest of the community as soon as we get that data.

And the thing is, you always have this initial influx of people who are interested, of course, to hear about it. Most people don't stick around to do the actual work. You need to actually have commitment for this, long-term and continuous.

You need to find the value, right? If there's value, then you'll build it, but if there's not really a value that you've proven- and that's the thing, that's where at AWS we're always working backwards from customers. If the customer asks for it, we'll build it. But this is still, I would say, at a very infant stage, so the demand is just not there and nobody's proven the value. So it's a chicken-and-egg thing, right?

I would say the value is there, but it's not easy to prove. And also, you need to anticipate; the customer can only tell you what they already know they want.
That's not the visionary approach. That's like the more stopgap-y approach to some extent.