All right. So before I get started, I want to give a couple of shout-outs. If you checked the schedule before yesterday, you may have noticed that I wasn't originally on it. First of all, this talk was originally going to be given by my colleague Sumera, but for health reasons she couldn't make it. She's OK, but she couldn't travel. And secondly, thank you to Ganesh and Ruslan for agreeing to swap their slot last minute as well. Do check out their talk; they gave it today at 10:50 in the morning. So thanks to everyone for all of your flexibility so that we could end up giving this talk.

All right. As Bartek already said, we've heard about profiling a couple of times today. We've talked about extending OpenTelemetry to have a standard for profiling. And hopefully today I can give you at least a little bit of insight into what formats are out there, the different trade-offs those formats have chosen to make, and what I think could be an interesting format for the kind of profiling we're seeing in this new cloud-native era, because I think it differs a little from what we've seen traditionally.

Before we get started on formats, maybe a show of hands: who here knows what profiling is and has used profilers before? All right, that's a good 60-70% of the room. How many of you know how profilers work? OK, that's considerably less. And last but not least, who here has an understanding of how profilers actually persist the data they obtain? All right, cool. Hopefully after this talk, pretty much all hands will go up for all of those questions.

So without further ado: profiling. What is profiling? Profiling is really as old as software engineering itself, because profiling allows us to understand where the resources of our software are being spent. When our program is running, what pieces of code are using CPU? What pieces of code are using memory? What is allocating memory? What is holding memory? We need all of this in order to be able to improve our software. If we don't have data about what code is actually using a lot of CPU time, causing a lot of allocations, or holding a lot of heap memory, for example, we're in no place to fix it. Best case, we'd be poking around in the dark, and because we know our code base really well, maybe we'd still make good changes. But with data, we're so much more efficient.

So that's the why: we want to improve the performance of code. And ultimately that can have other effects as well. Because we're now doing the same task with less CPU, for example, we may be spending less money on our infrastructure. Today we actually heard several times: how do we reduce the cost of our observability infrastructure? Two of my colleagues recently did a livestream where they showed how to identify metrics that were being produced for hardware components their cluster didn't even have. We were spending a ton of CPU cycles on something that fundamentally didn't make any sense, and we were able to see that using profiling data. So using profiling we can improve CPU usage, we can improve memory usage; we can improve along just about any dimension.

And fundamentally, what profiling data is, is a function call stack with a number assigned to it.
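As a conceptual sketch, a profile is little more than a mapping from call stacks to numbers. The function names and counts below are entirely made up for illustration:

```go
package main

import "fmt"

func main() {
	// A profile, conceptually: each distinct call stack maps to a value,
	// e.g. how many times a sampler observed that stack on-CPU.
	// These stacks and counts are hypothetical.
	profile := map[string]int64{
		"main;handleRequest;parseJSON":  87,
		"main;handleRequest;writeReply": 12,
	}
	for stack, samples := range profile {
		fmt.Println(stack, samples)
	}
}
```

Every format in this talk is, at heart, some encoding of a map like this.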
In essence, that is all that profiling data is. Now, there are two main categories of profilers, but I'm primarily going to speak about one. So let me get the other one out of the way first: tracing profilers. With tracing profiling, we're truly recording, at a very granular level, when function A starts, when function A ends, when this other function starts, and so on. We're actually tracing the program execution. This is useful, but generally speaking it's not done much in production, because it has a very, very high cost.

Typically, in production, we use sampling profilers instead. I'm going to talk mostly about CPU profiling, because that's where we tend to get the biggest gains. The way a sampling CPU profiler works is that, let's say 100 times per second, it looks at the current function call stack and records it. If we see the same function call stack multiple times, we just count up by one. We can then use this data to build statistics about where CPU time is being spent, because if we see the same function call stack many times, statistically speaking, that must be where we're spending our time. Most sampling profilers have something in the range of 5% to 10% overhead when profiling at, say, 10,000 hertz, meaning collecting 10,000 samples per second. With the right techniques, and I'm going to talk about that later, you can get the overhead as low as 0.2%.

All right, so I have a very small piece of example code here. First, we have a function called iterate long, which has a for loop that iterates 10 billion times and doesn't do anything; we're just trying to burn some CPU time. Then we have a second function, iterate short, that does the same thing with 1 billion iterations. I made up these numbers, but the point is that iterate short takes one tenth of the time of the first one, so we observed 20 samples for the long one and two for the short one. What we're seeing on the right-hand side here is our first format, called folded stacks. I'd consider it probably the simplest format out there. It's very human-readable, but as we'll see later, it also has a lot of shortcomings.

So let's talk about formats. Now that we have a basic understanding of what profilers are and what formats, at least in spirit, represent, what do concrete implementations of these formats look like? One that I particularly like, and that is very widespread, especially in the cloud-native ecosystem, is pprof, because the Go runtime natively implements profilers that produce profiling data in the pprof format. So that's the first one I want to talk about today. pprof descends from what is generally referred to as the Google Performance Tools suite, and went through a couple of iterations before Google eventually published the work. And pprof is Protobuf, like so many things at Google.
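Before digging into pprof's internals, here is roughly what the example above might look like in Go, profiled with the Go runtime's built-in CPU profiler (which samples at about 100 Hz by default and writes pprof data). The function names are my invention to mirror the slides:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// iterateLong burns CPU with a 10-billion-iteration empty loop.
func iterateLong() {
	for i := int64(0); i < 10_000_000_000; i++ {
	}
}

// iterateShort does one tenth of the work.
func iterateShort() {
	for i := int64(0); i < 1_000_000_000; i++ {
	}
}

func iterate() {
	iterateLong()
	iterateShort()
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The Go runtime's sampling CPU profiler: it captures the current
	// call stacks roughly 100 times per second and emits pprof.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	iterate()
}
```

And the folded stacks rendering of the two observed stacks from the slides would look roughly like this: each line is the semicolon-joined call stack, a space, and the number of samples in which that exact stack was observed.

```
main;iterate;iterateLong 20
main;iterate;iterateShort 2
```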
So in essence, pprof is a list of samples. A sample, and we'll look at this in more detail in a second, is just a stack trace and a value — a recurring theme that we'll see throughout this talk, because, as I said, that's truly the essence of profiling data: we have a stack trace with a value attached to it. What the meaning of that value is depends on the profiling data. It can be CPU time, it can be allocations, it can be heap memory held; it can be a lot of things. Whenever we can make an association between a function call stack and a value, it can be put into profiling data, and profiling tools can be used to analyze it.

So these are, I think, the four main components of pprof. There's a bunch more metadata, but this is the core: we have samples, which are stack traces and values; mappings, which we'll go into first; locations, which are an abstraction for function call frames, and we'll look at those in a second as well; and functions, which hold function names, line numbers, file names, and so on.

So, mappings. The reason I wanted to specifically call out mappings is that when I was very new to all of this, this was the part that confused me the most. What does this even mean? Address ranges, file offsets, file name, build ID... what's even a build ID? So I want to give you a very quick demo of what this actually is. I'm on a Mac, so I had to ask a co-worker on Linux to provide this for me. On a Linux system, the way a program is set up to execute code is that the operating system memory-maps the executable code into the process's address space. And the mappings file, which you can find at /proc/PID/maps on any Linux system, tells you which object code is mapped into which address range. That's essentially what the mappings in pprof mean. What we have here are the memory mappings of systemd; we just happened to choose it because it's PID 1 on that machine. So we can see libc memory-mapped here, and a bunch of other libraries that happen to be used by systemd.

Why do we need this, and where did the build ID come from? The build ID, and I have another quick demo here, is something pretty interesting: a kind of identifier for binaries. What pprof is saying is: whenever there's an address in this address range, look at the binary with this build ID to figure out what the address means. So I have the code we were looking at earlier, and I compiled it. What we can then do, using standard tooling, is use a tool called addr2line: we pass it the executable and an address, and it tells us which function that address belongs to and which file. In this case, we're actually seeing multiple functions, which means the compiler made a specific optimization called inlining: it decided that setting up an additional function call was too expensive and just did all of the work in one function. Long story short, that's what mappings are about: translating an address into something that we humans understand.
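Roughly what that demo looks like on a Linux box; the addresses, paths, and output below are illustrative, not taken from the actual machine in the talk:

```
$ head -3 /proc/1/maps                    # memory mappings of PID 1 (systemd)
55c8e9a00000-55c8e9a25000 r--p 00000000 fd:01 393228  /usr/lib/systemd/systemd
55c8e9a25000-55c8e9b34000 r-xp 00025000 fd:01 393228  /usr/lib/systemd/systemd
7f21a3c00000-7f21a3c28000 r--p 00000000 fd:01 787342  /usr/lib/x86_64-linux-gnu/libc.so.6

$ addr2line -e ./iterate -f -i 0x48a1b4   # -f: print function names, -i: show inlined frames
main.iterateShort
/home/user/iterate/main.go:12
main.iterate
/home/user/iterate/main.go:17
```

Two functions for one address: addr2line is telling us that iterateShort was inlined into iterate.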
And in folded stacks, that's not something we can communicate. The only thing we could put into folded stacks was strings. So here we already see a difference between pprof and folded stacks: pprof is deliberately designed to support what we call asynchronous symbolization.

So, the next component of pprof: the sample. Like I said, a sample is truly just a list of locations and a value. Remember, locations are an abstraction over function call frames, and what we're seeing now is that a location can be just an address: it only has an address and a mapping ID. That is one possibility for a location. So we don't have to have the symbols available in the pprof-formatted profile; symbolization can actually happen at analysis time. And this can save a ton of space, in storage or in data that needs to be transferred, and so on. So again, all a sample is, is a function call stack, as an abstraction, mapped to a value.

What I wanted to show here is this: I took the folded stack traces we had earlier and converted them to a pprof profile using a tool created by someone in the community — shout-out to Felix for creating it. I wanted to show exactly the same data we had as folded stacks, now as pprof. And what we can already see is that this data is much more complex, right? But it comes with a trade-off: the folded stacks were very easy for us humans to understand, while pprof is able to represent much more complex situations, and to do so much more efficiently. The first thing we see on the right-hand side is that pprof makes a great attempt at deduplicating as much information as possible. It has a string table, and it tries to deduplicate locations as much as possible, to save as much space as it can. So we see the string table, and whenever we see a number, for example in the function names, that's a reference into this string table: in this case 0, 1, 2, 3, 4, 5, 6, so the last function we have here is the iterate short function. All I'm trying to say is that this is a binary format, and it's able to represent a lot of very complex situations that may be interesting to handle.
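To make that structure concrete, here is an abridged sketch of pprof's Protobuf schema, paraphrased from memory of the public profile.proto; many fields are omitted:

```proto
// Abridged from pprof's profile.proto; many fields omitted.
message Profile {
  repeated Sample sample = 2;
  repeated Mapping mapping = 3;
  repeated Location location = 4;
  repeated Function function = 5;
  repeated string string_table = 6;  // all strings deduplicated here
}

message Sample {
  repeated uint64 location_id = 1;   // the stack, leaf first
  repeated int64 value = 2;          // e.g. sample count, CPU nanoseconds
}

message Mapping {
  uint64 id = 1;
  uint64 memory_start = 2;           // address range the object is mapped at
  uint64 memory_limit = 3;
  uint64 file_offset = 4;
  int64 filename = 5;                // index into string_table
  int64 build_id = 6;                // index into string_table
}

message Location {
  uint64 id = 1;
  uint64 mapping_id = 2;
  uint64 address = 3;                // may be all a location has (unsymbolized)
  repeated Line line = 4;            // filled once symbolized; >1 entry = inlining
}

message Line {
  uint64 function_id = 1;
  int64 line = 2;
}

message Function {
  uint64 id = 1;
  int64 name = 2;                    // index into string_table
  int64 filename = 4;                // index into string_table
}
```

Note how a Location needs nothing but an address and a mapping ID; the Line and Function details can be attached later, which is exactly what makes asynchronous symbolization possible.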
So that was the pprof demo. The next format I want to talk about is Speedscope. I specifically wanted to cover Speedscope because it doesn't just map a single stack trace to a single value; it can also express the relationship over time. So it can tell us not just that over these 10 seconds, two seconds were spent in function X; it can also tell us that the function was first called at time t1 and returned at t2, and then was called again at t25. The exact numbers don't matter; the point is that it doesn't only have an aggregate view of the data, it can also tell us about the timeline. And by the way, I specifically chose pprof and Speedscope because they actually have specifications. There are a lot more profiling formats out there, but for many of them, the implementation is the specification. So I wanted to take two that have an actual specification.

In Speedscope, in the shared section of the file, you define frames. It's the same idea we saw in pprof, but not quite as optimized: there's no string table or anything like that, so it sits somewhere in between. It says, OK, frame one is, to take our example, the main function; frame two is the iterate function; frame three, iterate long; and frame four, iterate short. And then you can define profiles as evented or sampled, and this is exactly the difference I wanted to show.

The sampled profile in Speedscope is essentially very similar to pprof: it only shows us the aggregate view across the entire timeline. So it's really the same thing: we have a sampled stack, but it's just indices into the frames array. It's a different representation, but in essence the same thing as pprof. However, it does not preserve the address space; it does not record addresses and so on, so it does not support asynchronous symbolization. We would therefore always have to have these frames already symbolized at collection time, which can be very expensive, or maybe even impossible, because symbols may not always be available on the host where you're doing the profiling.

But the really interesting thing I wanted to look at in Speedscope is the evented profile. It distinguishes two types of events: opening a frame and closing a frame. So it describes this hierarchy of frames and when each one is opened and closed. We see the at attribute, for example, which is a timestamp, as well as the kind and the frame number. So it's quite similar to what we saw before, except that in addition to the frame, it's also telling us that the frame was opened at this particular point in time, or closed at this particular point in time with this value.

So, a quick Speedscope demo, using this cool example they have on their website. I'm going to start with the left-heavy view. This is the aggregate view: all we see is the aggregation across the entire time frame. This is what sampled profiles typically look like. But the really cool thing about Speedscope is that it can show us the timeline and how things behave over time, and we can select just a portion of the profiling data. That's super useful for seeing how things evolved: where was my CPU time spent over time? This is interesting when we want to figure out: was this function called many times, or was it called once for a very long period of time? That can make a big difference in how we go about actually improving the code.

All right, that's pretty much all I wanted to show for Speedscope. Oh, one more thing: the actual format looks exactly like I just described. It defines the profile's start value at 0 and its end value at 22, and then we have an opening frame event, another opening frame event, always with the reference into the frames array at the very top here, and eventually the matching close events, and so on. So this format is a little different from what we saw before: it's not just a list of locations or frames mapped to values; it's already telling us something about how we're going to visualize the information. That's what is unique about Speedscope in that sense.
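A sketch of what an evented Speedscope file for our example could look like, hand-written against my reading of the file-format spec, so treat the details as illustrative. The values 0 through 22 treat each observed sample as one unit of time: 20 units in iterateLong, then 2 in iterateShort.

```json
{
  "$schema": "https://www.speedscope.app/file-format-schema.json",
  "shared": {
    "frames": [
      { "name": "main" },
      { "name": "iterate" },
      { "name": "iterateLong" },
      { "name": "iterateShort" }
    ]
  },
  "profiles": [
    {
      "type": "evented",
      "name": "iterate example",
      "unit": "none",
      "startValue": 0,
      "endValue": 22,
      "events": [
        { "type": "O", "frame": 0, "at": 0 },
        { "type": "O", "frame": 1, "at": 0 },
        { "type": "O", "frame": 2, "at": 0 },
        { "type": "C", "frame": 2, "at": 20 },
        { "type": "O", "frame": 3, "at": 20 },
        { "type": "C", "frame": 3, "at": 22 },
        { "type": "C", "frame": 1, "at": 22 },
        { "type": "C", "frame": 0, "at": 22 }
      ]
    }
  ]
}
```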
So how am I doing on time? I need to speed up a little bit. Lastly, I want to talk about continuous profiling. So far I've only talked about point-in-time profiling: looking at a process for a 10-second period of time with very high-frequency sampling, say 10,000 samples per second. And that, as I said in the beginning, has some overhead, in the 5% to 10% range. Continuous profiling takes the extreme opposite approach: we're always going to profile absolutely everything in your infrastructure, but at a very, very low sampling frequency. The profiler I happen to work on profiles at only 19 hertz, so only 19 samples per CPU core per second.

So we've been thinking: continuous profiling makes some drastically different trade-offs. Do formats that were designed around very different collection approaches actually suit this kind of profiling? And while the obligatory XKCD about standards applies, we did think that because the collection of this data is so fundamentally different, there is actually a huge amount of room for improvement.

At Polar Signals, we run a continuous profiling service where our customers send us profiling data, and so we were thinking about how much of this data could be optimized away. With our product, one core produces 675 megabytes per month, and that's roughly the lowest I'm aware of. If we multiply that by a relatively small infrastructure, 10 nodes with 128 cores each, we get almost a terabyte of data per month that needs to be transferred out. Say our infrastructure is in GCP and our customer's infrastructure is in AWS: they're paying egress costs to use our product, on top of paying us for the product itself. So we want to make that cost as small as possible while still communicating the same information. If we use pprof, and that's what we're doing today, we produce this amount of data: we marshal all the stack traces every single time, every 10 seconds, and send them off to the service, producing roughly $80 in egress cost just from sending this data.

So we did some analysis and found that the stack traces we're sending — just the function names, file names, and all of these things — make up about 80% of all the data being sent. But with continuous profiling, we're looking at the same processes across time, and the reality is that long-running processes tend to roughly keep doing the same thing. So today we just keep sending the same stack traces over and over and over again, at 100% granularity and detail. If we could optimize that 80% away, we could actually save our customers, or anyone running continuous profiling infrastructure — everything we do is open source, by the way, under the Parca open source project — a lot. We could make everyone's life better.

So what we've been thinking about is something I call pprof with a twist. Essentially, instead of sending the same stack traces over and over and over again, we only send hashes of the stacks, and only if a hash is not known to the backend do we retry and send everything at 100% detail. The reason I call this pprof with a twist is that it's actually only a single field of difference from the current pprof format, so I believe it could be an interesting change to propose to the pprof format in order to capture this efficiency gain.

However, it's not just the format, and that's the important thing, and why I also find it a bit awkward to put this into pprof. pprof is a file format: I should be able to read this information from disk, and it should be self-contained in the file. This change turns it into a sort of stateful protocol. The client says: hey, I want to send you data for stack XYZ, which I observed 123 times. The server says: ah, actually, I don't know what that stack is; can you tell me? So the client retries and sends the full stack trace at full detail, at which point the backend says: OK, cool, I'll accept that, write it to storage, and remember this hash for next time.
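A minimal sketch of that exchange in Go; everything here — the type names, the ProfileClient interface, the choice of hash width — is hypothetical and not Parca's actual API or protocol:

```go
package profilesketch

// All names in this sketch are hypothetical; it shows the shape of the
// hash-based deduplication idea, not Parca's actual protocol.

// StackID is some stable hash over the frames of a stack trace.
type StackID [16]byte

type Frame struct {
	Function string
	File     string
	Line     int
}

// SampleByID is what the client sends first: just hash and value.
type SampleByID struct {
	Stack StackID
	Value int64 // e.g. number of times this stack was observed
}

// ProfileClient abstracts the two calls of the stateful protocol.
type ProfileClient interface {
	// PushByID sends (hash, value) pairs; the server records values for
	// hashes it already knows and returns the ones it has never seen.
	PushByID(samples []SampleByID) (unknown []StackID, err error)
	// PushFull re-sends the unknown stacks at 100% detail.
	PushFull(stacks map[StackID][]Frame, samples []SampleByID) error
}

// Push performs the optimistic send, then retries at full detail for
// whatever the backend did not recognize.
func Push(c ProfileClient, samples []SampleByID, full map[StackID][]Frame) error {
	unknown, err := c.PushByID(samples)
	if err != nil {
		return err
	}
	if len(unknown) == 0 {
		return nil // backend knew every stack: the ~80% of bytes never leave the host
	}
	retry := make(map[StackID][]Frame, len(unknown))
	for _, id := range unknown {
		retry[id] = full[id]
	}
	return c.PushFull(retry, samples)
}
```

In the steady state of a long-running process, PushByID is all that ever happens, which is where the bandwidth savings come from.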
So my point here was to show — and by the way, this is my opinion; I know there's a working group within the OpenTelemetry project, which we also participate in — that I personally think this type of thing is not yet explored well enough. My personal belief is that we're a little too early to standardize these things in the profiling space, because most of the things I'm proposing here haven't really been tried and tested. And if we were to set these protocols in stone today, I feel like we'd be blocking ourselves from innovating in this space. So while I'd love to have a standard, at the same time I feel there's still so much to explore here that we just haven't done yet.

So that's my overview of profiling and profiling formats, and why I think we should emphasize innovation in the profiling space, because I feel there's still so much left on the table to explore. Thank you.

We are out of time, but one quick question. Let's go.

"It seems to me that there is no clear winner on the transport side of things. Is there a clear winner on storage, on the representation on disk?"

If I understand the question correctly: my opinion is that, just as the protocol hasn't been set in stone, the storage isn't set in stone either. I don't really think so. We happen to invest very heavily in a columnar database to store this data, but the symbol storage is actually much, much more complicated. So no, there's definitely still very much innovation left to be done in this space. I'd like to think that we're getting better at it, but we're nowhere near the maturity of metric storage or log storage with profiling, and I think there's still a lot of efficiency to be gained.

OK, I'm sure Fredrik can answer more questions around here, but it's time for another talk. So thank you.