Yeah, all right, cool. What's going on, everyone? My name is Ryan. I'm going to be talking today about "a tale of two flame graphs," or at least that's what the talk is titled. Really, what I hope for this to be is a more practical guide to how you can get value from flame graphs and from continuous profiling. Those two are somewhat synonymous, and I'll get into the nuances of what that all means throughout this talk. By the end of it, I hope you'll have an idea of how you can use flame graphs and profiling to better understand your applications, and to understand things about them that you wouldn't be able to understand with the other signals that are out there.

Before I start: as I said, I'm Ryan. I'm from Indianapolis originally and moved out to Oakland to work on Pyroscope with my co-founder Dmitry over here. We were later acquired by Grafana, which is where I now work. As a side note, one of the reasons for that was that we really felt profiling had a lot of use in the context of logs, metrics, and traces, and that seemed easiest to do there, but that's a conversation for another time. I focus mostly on the product and UX side of profiling. Obviously there's a lot that goes into collecting profiling data efficiently, storing all these flame graphs, and being able to query them on the technical side, but all of that doesn't mean anything, or have any value, if you aren't then able to analyze that data in a way that's meaningful and valuable. So I spend a lot of time focusing on that and on how we can improve it within the project. As far as fun stuff, I'm an avid Super Smash Bros. player, and I like disc golf and trashy reality TV.

With that, let's go ahead and get into it. What I'm going to talk about today: what flame graphs are, the different types of profilers, how profiling fits into the overall observability space, how to efficiently manage the storage of profiles over time, and the main use cases for profiling. Finally, I'll give some examples, both real-world examples of internal things we've done with profiling and examples of how it can be used to help tell a story.

So, starting with what flame graphs are. This is an image by Julia Evans,
who's one of the creators of rbspy, a Ruby profiler. This is an abstract example, but basically, what a flame graph shows is which parts of your code are consuming the most resources. That can be CPU, it can be memory, it can be a number of other metrics that are formatted similarly. The way you read these flame graphs (sometimes they're flipped vertically) is that, horizontally, left to right is 100% of the time. Then as you go up, in this case, it's showing that main calls the functions alligator and panda 60% and 40% of the time respectively, and those functions in turn call other functions. So you read it vertically as a stack trace, and the width represents, almost like a pie chart, the amount of time your application spent on that particular function. I'll come back to this piece here, but a common way this has traditionally been done is with a bunch of command-line tools piped together to produce the actual flame graph. As profiling has evolved, that approach is slowly getting phased out in favor of more robust query languages and the ability to query profiling data directly.

Here's a slightly less abstract version, one with actual code, showing how code you would actually write gets transformed into a flame graph. This is obviously a very simple Python program, just to show conceptually how this works. If this were happening in server.py, then server.py is of course running 100% of the time, and it calls either fast_function or slow_function, which each call work for a different amount of time. You see that reflected in the flame graph: slow_function is consuming 80%, fast_function is consuming 20%, and the work beneath each is reflected as well.
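To make that concrete, here's a minimal sketch of what a program like the one on the slide might look like. The function names mirror the example; the busy loop is just a stand-in for real work:

```python
# server.py - a toy program whose CPU flame graph shows an 80/20 split.
# "work" burns CPU in a busy loop; the two callers differ only in how
# much work they ask for.

def work(n):
    i = 0
    while i < n:
        i += 1

def fast_function():
    work(20_000_000)   # roughly 20% of the total time

def slow_function():
    work(80_000_000)   # roughly 80% of the total time

def main():
    while True:
        fast_function()
        slow_function()

if __name__ == "__main__":
    main()
```

Point a sampling profiler at this (py-spy, for instance) and the resulting flame graph shows main calling slow_function at roughly 80% of the width and fast_function at roughly 20%, with work beneath each.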
So there are two main types of profilers. Profiling is not a new concept; it's been around for a long time. Traditionally, with standard profilers, the way you got this information about how long a function takes or how many resources it consumes was to insert something almost like a breakpoint at the beginning and end of a function, record some metric in between, send that somewhere, and catalog it over time. While that's slightly more accurate, or very accurate in most cases, the overhead is so high that it isn't something you could practically do in production, for example; you're adding a whole bunch of extra overhead to your application for relatively little benefit. As time has gone on, a lot of people have shifted to sampling profilers, which, rather than recording every single thing, sample the stack trace at some frequency, often around a hundred times per second. By that method you're able to get profiling data from what is basically an outside process, without slowing down your application or adding a whole bunch of extra overhead. Then you collect that profiling data and send it off somewhere where it can be stored and queried efficiently. Depending on how you're doing the profiling, it may not even require runtime changes. Some profilers come in the form of a Ruby gem or a pip package or something like that, but other approaches, like eBPF, allow you to get profiling data without having to change code in your application at all. And although sampling is technically slightly less accurate, you find that directionally it's close enough that, relative to the amount of overhead you save by collecting the data this way, you're able to get a lot of value from it.
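To show mechanically what "sample the stack a hundred times per second" means, here's a minimal in-process sketch in Python. It's an illustration only: real samplers like py-spy read stacks from outside the process, and sys._current_frames is a CPython implementation detail; the placeholder workload is hypothetical.

```python
import collections
import sys
import threading
import time
import traceback

def sample_stacks(counts, interval=0.01, duration=3.0):
    """Every `interval` seconds, record the main thread's current stack.

    `counts` maps a stack (tuple of "file:function" frames, root first)
    to how many times we observed it -- exactly what a flame graph needs.
    """
    main_id = threading.main_thread().ident
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(main_id)
        if frame is not None:
            stack = tuple(
                f"{fs.filename}:{fs.name}"
                for fs in traceback.extract_stack(frame)
            )
            counts[stack] += 1   # one observation of this stack
        time.sleep(interval)

counts = collections.Counter()
threading.Thread(target=sample_stacks, args=(counts,), daemon=True).start()

# ... run the workload you want to profile here, on the main thread ...
time.sleep(3.5)  # placeholder so the sampler has something to observe

# Each (stack, count) pair is one flame-graph row; width = count / total.
for stack, n in counts.most_common(3):
    print(n, " ; ".join(stack))
```

The key property is that the cost is fixed by the sampling rate, not by how much the application does, which is why the overhead stays low.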
So how does profiling fit in with the other signals? I'm sure a lot of you have heard about metrics, logs, and traces; a lot of people refer to profiling as the next pillar of observability. Profiling is really useful, and I'd say it's at a somewhat earlier stage on that spectrum than some of the other signals, but especially given OpenTelemetry and people's familiarity with the value they get from those other signals, it's definitely moving into people's workflows more quickly. Typically with other signals, take logs for example, it starts with console-logging a bunch of things locally, putting debug statements somewhere to get the information you need to debug something. As time goes on, you formalize that a little more: on your production machines, or maybe in your staging environment, you have logs you can tail, or SSH in and look at, when something's wrong. Then, as you mature even more, depending on where your company lands on buy versus build, maybe you build some storage to collect this data and query it back when you need it, or you use a database for it, maybe Prometheus or Loki, or you use a vendor to store it all so you can focus on actually optimizing your application. That's the example with logs. With profiles, it's a little further along that curve already. If you're using Go, profiling comes standard with the runtime, so a lot of people in the Go ecosystem are familiar with it. Ruby has some pretty strong profilers, and so does Java with JFR. A lot of people are familiar with profiling in the sense that they'll run a benchmark or profile something and have it stored somewhere, maybe just as a file on their desktop, or maybe something a little more formal where they save it into an S3 bucket. What I'd say, though, is that a lot of the profiling backends, whether open source or vendors, are able to make profiling data much more compact and efficient to both store and query, just by the nature of how profiling data is structured relative to metrics, logs, and traces, and I'm going to show some examples of that in a second. That lends itself really nicely to people moving directly to the centralized, optimized phase rather than spending too long in the earlier stages of the maturity curve.

Here are some examples of what the UI for a continuous profiler looks like. I'm going to talk a lot about Pyroscope today, just because it's the one I'm most familiar with, but there's Pixie, which is also an open-source profiler; there's Parca; Elastic has one that isn't open source but does whole-system profiling; and there's Datadog. A lot of these companies are doing profiling in a way that gets the data really inexpensively and then adds a lot of value by storing it efficiently and rendering it back to you.

So I'm going to talk a little about how we were able to achieve that, conceptually, inside Pyroscope. These concepts might take different shapes and forms depending on who's implementing them, but at the end of the day a lot of it ends up being very similar. I already talked about how profiling data gets turned into a flame graph, so now I'll talk about the storage piece and the querying piece. The problem with just taking this profiling data and storing it in an S3 bucket or on a file system somewhere is that it's going to take up a lot of space, particularly if you're doing continuous profiling, where you're getting a profile every second or every ten seconds, per host. It adds up really quickly, to the point where you have so much data that you're paying more to store it than the value you get from collecting it in the first place. The way we address this: you can think of a profile as just a giant tree. Everything in a profile starts at the main function, which calls some functions, which call some functions, and as you can imagine, especially if you think about what a stack trace looks like, there's a lot of repetition in it; in a lot of cases the only real difference is the leaf nodes at the bottom. So by turning these stack traces into a tree, we're able to de-duplicate a lot of the information we would otherwise store. On top of that, not only the stack traces but the symbols themselves within the stack traces have a lot of repetition. In this case, the symbols being things like net/http request, net io read, and net io write, here's an example of how you'd take those three symbols and turn them into a trie, in order to compress the symbol names themselves in addition to compressing the stack traces.
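Conceptually, the de-duplication looks something like this. This is a simplified sketch of the idea, not Pyroscope's actual storage format:

```python
# Fold raw stack traces into a prefix tree so frames shared across
# stacks ("main", the net/http handler) are stored once. Each unique
# symbol string then appears a single time in a symbol table, which can
# itself be a trie over the characters to compress shared prefixes.

class Node:
    def __init__(self):
        self.value = 0      # samples that ended exactly at this frame
        self.children = {}  # symbol -> child Node

    def insert(self, stack, count):
        node = self
        for frame in stack:             # root-to-leaf order
            if frame not in node.children:
                node.children[frame] = Node()
            node = node.children[frame]
        node.value += count

root = Node()
root.insert(["main", "net/http request", "net io read"], 120)
root.insert(["main", "net/http request", "net io write"], 80)
# "main" and "net/http request" now exist once in the tree instead of
# once per stack trace, which is where most of the savings come from.
```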
So when you combine this idea of compressing the stack traces with compressing the symbol names, you end up with a very efficient representation of profiling data. Now, when you're putting that into whatever the storage system is, you can do so without it being too expensive, which helps the economics of the return on investment.

After you've stored all this data efficiently, the next problem becomes how to query it efficiently. Now you have all this highly compact data stored in some database or storage somewhere, and you want to recall it. Maybe you want to see how much resource you were spending on this part of your architecture, something along those lines, so you're going to need to query that data back. If you're storing this data with timestamps every ten seconds, and let's say you want to see a full day's worth of data, or a week, or a month, whatever it might be, then if you aren't doing anything special on the query side, it's going to be a very expensive query, because you're going to need to take a whole bunch of profiles and merge them at query time. The way we addressed this was to use segment trees to store the data at different granularities, so you can decrease the amount of actual work needed to build a flame graph for whatever time range you're looking at. By default it stores these profiles every 10 seconds, but then it also stores a pre-merged 20-second segment, a 40-second segment, and that continues on to 80 seconds and so forth. What that allows is that if, say, you want to query 50 seconds' worth of data, then instead of doing four merge operations over five of those 10-second flame graphs individually, you can do just one: take the 40-second flame graph and one 10-second flame graph and merge the two of those.
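Here's a toy sketch of that query-side trick, assuming profiles are represented as flat {stack: sample count} maps that merge by summing counts:

```python
from collections import Counter

# Merging two flame graphs is just summing the sample counts per stack.
def merge(*profiles):
    out = Counter()
    for p in profiles:
        out.update(p)
    return out

# Raw 10-second profiles keyed by start time (toy data).
raw = {0: Counter({"main;a": 1}), 10: Counter({"main;a": 2}),
       20: Counter({"main;b": 1}), 30: Counter({"main;b": 3}),
       40: Counter({"main;a": 1})}

# Pre-merged at write time: a 40-second segment covering [0s, 40s).
seg_0_40 = merge(raw[0], raw[10], raw[20], raw[30])

# Query [0s, 50s): one merge of the 40s segment plus one 10s profile,
# instead of four merge operations over five raw profiles.
result = merge(seg_0_40, raw[40])
print(result)
```

The write path pays the cost of building the coarser segments once, so every subsequent query over a wide time range gets cheap.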
Cool. Okay, so that's a high-level, conceptual idea of how you can get all this data and store it, retrieve it, and query it efficiently. All of that is great from a technical perspective, but after you've done it, it comes down to: what is the actual value you get from profiling? What is the business case? What I'd say there, and we learned this from customers, open-source users, people we've talked to at community calls, the issues people write, is that there are three main use cases we see.

One is cost-cutting. Obviously, if you understand where resources are being allocated, then, say you're spending a hundred thousand dollars on compute resources and you have a flame graph that represents that compute, that CPU, you can use it to say: if we want to knock 10% off of that, we should check out this flame graph and figure out exactly where to go, find the hotspots, the low-hanging fruit, the biggest cost centers.

You also have revenue generation. This one's a little harder to explain, but I'll show an example in a second. For many applications, latency correlates with some loss or gain in revenue, depending on which way the latency is moving. If you're using profiling to understand that latency, you can often optimize it and have an impact on your end users that results in more revenue, or in not losing revenue.

And finally, the one that's probably easiest to understand: debugging and incident resolution. I often say that when it comes to understanding your performance data, flame graphs are the most fundamental unit. They give you a breakdown by line, often by line number, by function, all of that. Whether you're using logs, metrics, or traces, one of them, all of them, or none of them, having that granularity is always going to get you a little closer to the root cause of whatever issue you're looking at than you'd get without it. I'll show an example of that in a second as well.

So this is a visualization
that shows the cost-cutting idea in more detail. I say 3% here for the overhead of profiling; it's often much less, and occasionally it can be a little more, but conceptually, as long as it's reasonably low, the exact percentage doesn't really matter much, whether it's one, three, or five percent, because at the end of the day you likely aren't running all of your applications at 97% capacity, where three percent would be the difference in anything major. And having this data is so valuable, letting you understand so many other parts of your infrastructure and optimize them, that the profiling pays for itself very quickly once you start looking into the flame graphs and understanding your architecture better.

I'd say there are a lot of people who traditionally didn't focus as much on cost, and in today's market environment there's a lot more focus on efficiency: making sure people aren't just scaling things up, and instead understanding whether the return on investment for the various parts of the infrastructure actually matches what you'd expect, or want, or have budgeted. For that 3%, you now also understand your usage of logs, metrics, and traces. Something we often see is that people worry, "well, 3%, that sounds like a lot of overhead," and then they start profiling and realize that someone left a log line in production that's costing them maybe 10% just logging things mindlessly, instead of being strategic about what they log. Or the trace sample rate is much higher than actually needed and is causing more overhead than they expected. Messaging and queuing systems are also very common offenders here: it's work happening asynchronously, so people tend to write less careful code there, and it stacks up over time to the point where maybe the queue is overflowing and you want to understand why. If you don't have profiling, that's a really tricky thing to debug; if you do, you just go back in time, see where the resources were going, and break it down.

Here's an example on the revenue side. Some of these statistics are a little old, this one's from a long time ago, but conceptually, for every increment of added latency, it's more likely someone abandons their shopping cart or leaves the app. Google said they see 20% less traffic for every additional 500 milliseconds of latency. Having these profiles is really good for debugging that. Uber had a really good blog post about their metrics ingestion, where they didn't even realize they were spending a ton of time, almost half of it, on an operation that was almost completely unnecessary. By cutting that out,
they were able to cut their ingestion latency in half and be much more efficient about how they were doing things. This slide just shows other industries where the same concept occurs: fintech, banking, e-commerce, streaming (everyone has dealt with buffering lag), online gaming, advertising; latency is extremely important in all of these. And then there's a visualization, again just conceptual, of that curve of how latency affects bounce rates and revenue.

Then on the incident side, here's an example, and I don't know how easy it is to see up there, but this was something from us internally where we had an outage. This was right after we joined the Grafana team and after we had started profiling everything in production. So we have this outage, and people are used to looking at these Prometheus and Mimir charts. All of the things happening here aren't incredibly important to the point I'm making, but look at this blue line, which I believe is errors. You see one spike of the blue line here, another spike here where it goes very high, another issue here, and then we thought we'd solved the problem. This is how you might go through an incident if you didn't have profiling but did have some metrics at least letting you know that something's wrong. With profiling, you can see the same thing reflected, it's 21:20 to 21:35, then 21:40 to a little bit after, except that now these spikes are attached to the code that's attributed to them. You're only seeing one flame graph here, but there are actually two flame graphs in this result: one from when things were healthy and one from when things were unhealthy. You can overlay those on top of each other, and now you go from just knowing there are these spikes and these issues to knowing exactly what caused them, which lines of code are the ones you should look at to debug those issues.

And I just realized I forgot to run something, so we'll see if that works. Cool. So now I'm going to go through an example showing some other cases where you can use two flame graphs, either of the same application with different labels, or from different time periods, whatever it might be, to understand some resource, similar to the example I just showed, so you can more clearly see what's going wrong. This is a visualization of the setup: it's the Jaeger Hot R.O.D. app, basically a sample rideshare application, but you can imagine applying this to really any company that has multiple servers. We have one server in the east region, one in the north region, and one in the south region, and we've labeled those with a region tag; we also have three different routes that we've labeled with a vehicle tag. Every ten seconds, those servers collect profiling data about what's happening on them and send it to the Pyroscope server, where all that fancy compression, the trees and the tries, happens, and that's where we'll query this data from when we want to see what's going on.
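For a sense of how those labels get attached, here's roughly what configuring the agent looks like with the pyroscope-io Python client. Treat the details as an assumption: the exact keyword names may differ between client versions, and the application name and server address are hypothetical values for this sketch.

```python
import pyroscope  # the pyroscope-io pip package

# Rough shape of the demo's agent configuration: every stack sample
# this process reports will carry these tags, so the backend can later
# slice flame graphs by region or vehicle.
pyroscope.configure(
    application_name = "rideshare.app",             # hypothetical name
    server_address   = "http://pyroscope-server:4040",
    tags             = {"region": "north", "vehicle": "car"},
)
```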
So let me show what it actually looks like in practice. Let me make some more space here. Here's what it looks like once you get into a UI; again, each tool does this slightly differently but effectively does the same thing. In this case we're looking at the CPU for the application from that image I was just showing you, and we're looking specifically at the region tag, to figure out whether something is wrong in a particular region. You can see there's a pie chart that represents the CPU utilization for this application, and for some reason the north region is consuming 67% of CPU while the east and south regions are consuming significantly less. Now, if this is your application, maybe you just have more users in this north region and that makes sense, or maybe it seems a little suspicious to you. Either way, being able to go in here, click on these different regions, and see the flame graph associated with each of them is a good start toward debugging why there are these differences between them.

So if I want to compare, I'll pick one of the regions where there's not a lot of CPU being spent against this north region: let's say the east one in blue and the north one in green, and hit this "compare tags" button. Let me select the same time period here. Now we're able to compare these two side by side: we've selected the east region for this flame graph and the north region for this one. If you're looking at this and it's your application, you might already start to see some differences between them; this node here is taking much less of the total width of the application compared to this one. Being able to see these two flame graphs side by side starts to tell you, okay, we're seeing something different between the east and the north regions. The next step is to overlay these two flame graphs on top of each other and calculate the diff between them. Now that we've overlaid them, we can see that the difference between this region and that region is that one is consuming 31% of CPU on this specific function and the other is consuming 80%. You can imagine how helpful that is: it tells you that maybe there's something you should look into with this checkDriverAvailability function, and then you can follow it down to figure out what it calls; apparently in this case it may be a mutex lock that's causing it. Like I said, this is just a conceptual example.
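The diff itself is conceptually simple: line the two profiles up and subtract. A minimal sketch with flat, normalized {stack: percent} maps; the stack names echo the demo and are hypothetical:

```python
# Two normalized profiles (stack -> percent of CPU); the diff view just
# reports how each stack's share changed between baseline and comparison.

def diff(baseline, comparison):
    stacks = set(baseline) | set(comparison)
    return {s: comparison.get(s, 0.0) - baseline.get(s, 0.0) for s in stacks}

east  = {"main;checkDriverAvailability": 31.0, "main;other": 69.0}
north = {"main;checkDriverAvailability": 80.0, "main;other": 20.0}

for stack, delta in sorted(diff(east, north).items(), key=lambda kv: -kv[1]):
    print(f"{delta:+6.1f}%  {stack}")
# checkDriverAvailability is +49 points in north: the frame to inspect.
```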
From there, you'd be able to go into the code, fix that, and potentially save a lot of CPU, save a lot of money, maybe fix an incident.

The next example is the one I just started running. We've talked a lot about CPU, so this one is a memory example. The same way CPU profiles are useful, you can do the same type of analysis with memory profiling as well. You can see here that we're looking at in-use space for the same application, again broken down by region, and in this case the green and yellow series are fairly consistent, but this blue one is starting to grow. If I had started this at the beginning of my talk like I meant to, it would show a clear memory leak; you can already see it starting to happen. Again, you're able to jump in here and compare these different flame graphs to each other. Looking at them side by side it might not be as clear, but once you click on the region where things are starting to spike, you can go in and see specifically what the issue is: there's a cacheVehicleLocations function that's consuming a bunch of the resources.

The last one I'll talk about is actually part of the reason we started Pyroscope. It was me and Dmitry, the other co-founder. We were working at a previous company and started using profiling, rbspy, not in a continuous sense; I guess we built a hacky version of what now exists as Pyroscope. What we realized once we started doing that was that we were doing a whole bunch of compression on some data we were storing for our company, and we didn't realize, until we started profiling, how much of our CPU resources were actually being spent on that compression. As it turned out, the library we were using defaults to something like the maximum level of compression, consuming a ton of CPU to do it, and it was just not something we needed; it wasn't a conscious choice to compress that aggressively. As soon as we changed it from the default to a much lower compression level, we were able to save, I think it was around 20%, on our compute bill, and our bosses were very happy. That was where we started to realize that there are probably so many of these issues, not just for us but for others, lurking somewhere in the code: things that were never a conscious decision but have a major effect. And we would never have known that's where all the CPU was getting spent without seeing this breakdown by function and by line.
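That compression-level knob is easy to see for yourself. A quick sketch of the tradeoff, assuming a zlib-style library; the exact numbers will vary by machine and payload, the point is just that the maximum level can cost far more CPU for a modest size win:

```python
import time
import zlib

# Compare CPU cost versus output size across compression levels,
# the kind of knob that turned out to dominate our compute bill.
data = b"some fairly repetitive payload " * 100_000

for level in (1, 6, 9):  # fastest, zlib's usual default, maximum
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"level={level}  {elapsed:7.1f} ms  {len(out):>9,} bytes")
```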
Yeah, so that's about all I have. I guess the last thing I'll mention: logs, metrics, and traces obviously already have a bit of a home inside OpenTelemetry, and we are working hard, and have been for the past year and a half or so, on getting profiling into OTel as well. We'll be at the OTel booth talking about profiling, so we'd love to hear from anyone who has used this in any capacity or is interested in learning more; feel free to come by and chat with us about it. We've just proposed the new profiling OTEP, the model for what profiling would look like in an OTel context, and we're looking for people's feedback on it. So that's all I have. I talked longer than I thought I would, so we only have 55 seconds for questions, or just feel free to come up after, because they're going to shut off the thing in a second.