Okay, let's get started. Hello everyone, my name is Lily and I will be giving a presentation on tackling performance issues with effective visualization of profiling data. A little bit about myself: I'm currently a senior software engineer at Datadog, where I've been working since April of 2022. Before that, I worked for three years at Slack on the front-end infrastructure team, and before that I worked at Apollo.io, which is now a Series D startup. For my education, I did my undergraduate degree at MIT, and following graduation I did a couple of professional development courses at Stanford and Berkeley. So let's get started. Why should you care about profiling? Well, when we think about observability, we often associate it with three pillars: logs, metrics, and traces. However, from my personal experience working in the performance optimization space, and from talking with other performance experts, I'm convinced that profiling is extremely important to understanding observability, and should arguably be its own pillar of observability. But there's a problem with profiling, and that is its high barrier to entry. Profiling has a reputation for being a tool for experts and not for the everyday developer. That might be part of the reason why the importance of profiling is not as widely recognized as logs, metrics, and traces: after all, when people don't understand something, they cannot appreciate its importance. And there is a solution to this problem: better visualization. Visualizations are important in conveying information. A good visualization is worth a thousand words, and a bad visualization can often mislead. Before we go into the solution space, I want to first illustrate the importance of profiling in observability, and for that I want to tell a story. It was back in 2020, and I was working at Slack at the time. Slack is a messaging application.
There are many channels, and you can switch from one channel to another to chat with your colleagues. One of the performance metrics Slack was tracking was the time it takes to go from one channel to the next and for the second channel to show visible content. This metric, which we called the channel switch timing, was slowly increasing over time, to the point that it was making some execs very nervous. I was pulled into investigating why channel switch was getting worse over time and what could be done on the front-end side, and from this effort came the front-end observability team. We collected logs, metrics, and traces during a channel switch to figure out what was happening during a channel switch that could lead to the performance issues. For tracing in particular, we used Honeycomb for visualization. Slack is a React application and, at least when I was working there, used Redux for state management. So we traced things like Redux actions, Redux thunks, API calls, and component rendering to figure out what was taking the longest. What we quickly realized is that this was not very effective: tracing alone only brought us so far, and we ran into a couple of big limitations. One, it's difficult to aggregate the data, so it's hard to compare channel switches that were fast with channel switches that were slow, and hunting for relevant traces became like hunting for a needle in a haystack. That was limitation number one. Limitation number two was that even when we did find relevant traces, we were stuck. Where do we go from there? The trace said this action took a while, or this thunk took a while. How do we improve the performance after that? What we quickly realized is that tracing needed to be more and more granular for it to help us solve our problem at the code level.
By granular tracing, I mean: imagine you have some function, and you end up creating spans to trace your code, and you have to pass these spans down to the children, and the children create more spans and pass them to their children. This quickly blows up the code, so the code becomes less readable and less maintainable over time. We were trying to solve a performance problem with tracing when a tool like profiling would have been a lot more helpful, because after all, granular tracing is basically a wall-time profiler. A little bit of background on profiling: you have some code, the code gets executed on the runtime, and your profiler samples the call stack at set intervals; this profiling data is then visualized in some manner for the developer. My experience at Slack showed that tracing only brought us so far, and to really understand the issue behind what we were observing, we needed profiling. My thoughts are not unique. In fact, since late 2020 there's been a lot of effort from the open source community to make OpenTelemetry support profiling as a fourth signal type, and earlier this year, in September, the PR to introduce the profiling data model was opened. So profiling is important to observability. But profiling has existed for decades; why is its importance not as widely recognized as logs, metrics, and traces? One reason is its high barrier to entry. As mentioned, it's often viewed as a tool for experts. Here are some of the quotes I have gathered from my friends and colleagues in the industry who have years of development experience: profiling is hard. And it's hard for two reasons. One, a lot of the tools out there are not very beginner friendly. And two, I think we're not leveraging visualizations effectively.
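To make that profiling loop concrete (the code runs, the profiler samples the call stack at set intervals), here is a minimal sketch of collecting a CPU profile in Go with the standard library's runtime/pprof package. The busyWork function and the cpu.pprof file name are just illustrative; the sampling itself is done by the Go runtime.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// busyWork burns CPU so the sampling profiler has something to record.
// Hypothetical workload, purely for illustration.
func busyWork(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += i * i
	}
	return total
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// While profiling is active, the Go runtime samples every
	// goroutine's call stack roughly 100 times per second.
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	fmt.Println(busyWork(10_000_000))
}
```

The resulting cpu.pprof file is in the pprof format, which several of the visualizations discussed later can consume.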
A good visualization should present data in a way that's understandable to the reader, and a bad visualization can mislead and confuse. So perhaps a solution to making profiling data more understandable, and therefore more accessible to the everyday developer, is more effective visualization. Just as an example of the relevance of visualization, let's say you have a distribution of ages. One way to visualize this data is with a pie chart; another way is with a histogram. A histogram and a pie chart each have their strengths and weaknesses: with a pie chart you can quickly see roughly the percentage of each segment, but with a histogram you can more easily see the shape of the distribution. Similarly, there are many different ways of visualizing and presenting profiling data, and they each have strengths and weaknesses and can shine depending on the scenario or use case you have. So let's get started. As a front-end engineer, my first experience with profiling was of course the Chrome DevTools. This is a flame chart. I was working at Slack at the time, and when we realized that tracing alone could only get us so far, we turned to profiling. As someone who was new to profiling, landing on this tool was not a great experience. One, there's a lot going on, which made it very difficult to understand, but that's not the only problem. This visualization is a flame chart, and though a flame chart is useful in some cases, looking back, it was not the best tool for figuring out why Slack was having slow channel switches for some customers. As someone new to profiling, I thought I was supposed to just look at this flame chart and know exactly what was wrong with Slack's performance. Turns out that's not true, because profiling is so much more than a flame chart.
As the front-end observability team was looking into profiling solutions, my family and I decided to leave the Bay Area and start a new life in Paris. With that, I left my job at Slack and joined the profiling team at Datadog to continue my journey in profiling. Similar to what Datadog has been building, there are a lot of open source tools out there for profiling. Some of these tools are for data collection; some are for data visualization. The fact that there's a plethora of tools illustrates the importance of profiling in software development. We'll focus on data visualization for the rest of this talk. I already mentioned the flame chart, and next to it here is the flame graph. Chances are, if you're a software developer, one of these visualizations has been or will be your first encounter with profiling. The flame chart is in the Chrome DevTools, and the Firefox DevTools also has a flame chart. The flame graph looks similar, but there's a key difference. The flame graph was initially developed by an engineer called Brendan Gregg, and since then it has become the de facto visualization for many platforms. The difference between the two is best illustrated with an example. Let's say you have this planVacation function, because who doesn't like vacations? There is a loop, and this loop runs three times, and inside the body of the loop you call three different functions. If you sample the call stack over time, the flame chart will look like this: you have planVacation at the top, and you will see each of the child functions three separate times, because the flame chart has a time component on the horizontal axis. A flame graph, on the other hand, does not have a time component: a flame graph merges identical frames together and sorts them alphabetically.
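Here is a minimal Go sketch of that planVacation example. The three callee names are hypothetical, but the shape matches the talk: a loop whose body calls three functions, which a flame chart renders three separate times and a flame graph merges into one frame per function.

```go
package main

import "fmt"

// Illustrative stand-ins for the three functions called in the loop;
// the names are hypothetical, not from any real codebase.
func bookFlight(trip int) string   { return fmt.Sprintf("flight-%d", trip) }
func reserveHotel(trip int) string { return fmt.Sprintf("hotel-%d", trip) }
func packBags(trip int) string     { return fmt.Sprintf("bags-%d", trip) }

// planVacation runs its body three times. A flame chart, which keeps
// time on the horizontal axis, shows each callee three separate times;
// a flame graph merges the identical frames into one per function and
// sorts siblings alphabetically, discarding the time dimension.
func planVacation() []string {
	var log []string
	for trip := 0; trip < 3; trip++ {
		log = append(log, bookFlight(trip), reserveHotel(trip), packBags(trip))
	}
	return log
}

func main() {
	fmt.Println(planVacation())
}
```

Under a sampling profiler, each iteration would put the same three frames on the stack again, which is exactly the repetition the flame chart preserves and the flame graph collapses.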
And this key difference makes them useful in different scenarios. A flame chart has a time component, so you can look at code execution over time and do things like detect loops, but because it has a time component it also has severe limitations. One, it only works for a single thread. Two, you cannot aggregate multiple profiles, which means that if you want to aggregate many different profiles collected during a channel switch, you cannot do that with a flame chart. And three, it's hard to do side-by-side comparisons. A flame graph, on the other hand, because it does not have a time component, supports frame merging: you can aggregate multiple profiles, and it makes side-by-side comparison easier. At Slack, for instance, instead of looking through flame charts one by one, if we had had the ability to compare the flame graphs of customers with slow channel switches versus customers with fast channel switches, maybe the performance problem would have surfaced more obviously. Flame charts and flame graphs also share a limitation: it's hard to find the functions with the most self time, and with that, it's hard to see common leaf nodes. What I mean is this: imagine you have a function called checkBalance, and it appears as a leaf throughout your frames. The flame graph and the flame chart will not easily reveal that this is the function with the most self time, yet if you want to make your application run faster, that's likely the biggest low-hanging fruit. What can help you find low-hanging fruit more easily is a call graph. A call graph in this case will have many different code paths pointing to that function, and the size of the node reflects the self time of the function. One example of a call graph tool out there is pprof, which integrates with Graphviz to generate the call graph. Flame graphs and call graphs share common limitations as well.
Because they don't have a time component, because they aggregate frames together, they cannot do any of these three things: they cannot isolate small bursts of activity, they don't reveal patterns in activity, and they cannot help with issues of concurrency. Let's look at the last one, issues of concurrency. That is something the thread timeline can help solve, and one example of a thread timeline is the Perfetto tool. To see how a thread timeline can help with issues of concurrency, let's look at a simple example. Now we're moving away from JavaScript into Go, which can be used for multi-threading. I made the example slightly more complex: it's the same idea, you have a loop and the loop runs its body three times, but now you have an overall budget that you do not want to go over when you plan your three vacations. If you start a new goroutine with the go keyword, you will create three threads and they will run in parallel, which means your code will run approximately three times as fast, but you may go over budget. The reason is a race condition: you have many threads trying to access the same resource. One thread is subtracting from the budget, but a second thread runs its check before the first thread has finished subtracting, and therefore you end up going over budget. The way you avoid issues like this is with locks. If you put a mutex lock before the critical section and unlock it after, you won't go over budget anymore, but your code will not run three times as fast anymore either. And if you look at the flame graph to figure out why, you won't be able to, because the flame graph just shows all the frames; it does not tell you why that is happening.
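A self-contained sketch of that Go example, with hypothetical numbers (a budget of 1000 and a cost of 400 per trip): with useLock=false the three goroutines race on the shared budget and can overspend, and with useLock=true the mutex serializes the check-then-subtract step, which is exactly the serialization a thread timeline makes visible.

```go
package main

import (
	"fmt"
	"sync"
)

const tripCost = 400 // hypothetical cost per vacation

// planTrips launches one goroutine per trip. With useLock=false the
// goroutines race on the shared budget: each may pass the balance
// check before the others have subtracted, so total spend can exceed
// the budget. With useLock=true the mutex makes check-then-subtract
// atomic, at the cost of running the critical sections sequentially.
func planTrips(budget int, useLock bool) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for trip := 0; trip < 3; trip++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if useLock {
				mu.Lock()
				defer mu.Unlock()
			}
			if budget >= tripCost { // check the balance...
				budget -= tripCost // ...then subtract: racy without the lock
			}
		}()
	}
	wg.Wait()
	return budget
}

func main() {
	fmt.Println(planTrips(1000, true)) // prints 200: two trips fit, the third is skipped
}
```

With the lock, 1000 becomes 600, then 200, and the third goroutine's check fails, so the budget is never overspent; without it, all three checks can pass first and the budget can go to -200.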
A thread timeline, on the other hand, will easily show you why that's happening, because you will see that the CPU activity for your threads, even though there are three of them, runs sequentially rather than in parallel. And if you go ahead and fix your code by moving the locks around, that will be reflected in the thread timeline, whereas the flame graphs for the code with the bug and the code without the bug look exactly the same. So the thread timeline can help with issues of concurrency, but it has limitations as well. The example I gave is very simplistic; in real life you can have hundreds or thousands of threads or goroutines, so there could be a lot going on. And with that, we're back to the first problem with profiling, which is that it can become overwhelming, and by then it's not beginner friendly anymore. A tool that I believe is more beginner friendly than the thread timeline, but still captures activity over time, is FlameScope. I see FlameScope as kind of a summary of a thread timeline, because instead of breaking your CPU activity down by thread, it summarizes it all onto a 2D heat map. FlameScope was also developed by Brendan Gregg, the maker of the flame graph, and it works like this: it's a 2D heat map with time on both axes. The x axis is the passage of each second, the y axis is the fraction of a second, and the intensity of each block reflects how many events there are, such as how many CPU samples. FlameScope can reveal patterns like these: at the top, every few hundred milliseconds or so all your CPUs are maxed out, and at the bottom, every few hundred milliseconds or so all your CPUs are idle.
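To make the heat map concrete, here is a simplified Go sketch of the kind of binning FlameScope performs. This is not FlameScope's actual code, just the idea: each sample timestamp lands in a cell indexed by its whole second (x axis) and its subsecond offset (y axis), and the count in each cell sets the cell's intensity.

```go
package main

import "fmt"

// binSamples buckets sample timestamps (in seconds) into a heat-map
// grid: column = which whole second the sample fell in (x axis),
// row = which of `rows` equal slices of that second it fell in
// (y axis). Cell values count samples, i.e. the block's intensity.
// A simplified sketch of FlameScope's subsecond-offset heat map.
func binSamples(timestamps []float64, seconds, rows int) [][]int {
	grid := make([][]int, rows)
	for r := range grid {
		grid[r] = make([]int, seconds)
	}
	for _, t := range timestamps {
		col := int(t)
		row := int((t - float64(col)) * float64(rows))
		if col >= 0 && col < seconds && row >= 0 && row < rows {
			grid[row][col]++
		}
	}
	return grid
}

func main() {
	// Two samples early in second 0, one late in second 1.
	samples := []float64{0.10, 0.25, 1.90}
	fmt.Println(binSamples(samples, 2, 2)) // prints [[2 0] [0 1]]
}
```

A hot burst shows up as a column of bright cells, and a repeating pattern (like the every-few-hundred-milliseconds spikes above) shows up as horizontal bands across the seconds.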
And the idea behind FlameScope is that you can select a small segment of the profile, such as the part where all your CPUs are maxed out, view the flame graph for that tiny segment, and then compare it with a segment where you don't have a lot of CPU activity, to figure out what's causing the spikes. With that said, the thread timeline and FlameScope both have a time component, and therefore they're great at showing activity over time and patterns of activity over time. But because of this time component, they also have the same limitations as the flame chart, which I talked about earlier: you cannot aggregate profiles, it does not work well for large time frames, and it's hard to do differential analysis with two or more profiles. Ooh, that was five different visualizations we went through, and each of these visualizations is useful depending on the problem you're trying to solve and the context. If you want to compare two sets of profiles, for example fast channel switches with slow channel switches, a good place to start is the flame graph. If you want to find the most likely low-hanging fruit for performance work, a good place to start is the call graph; these two support aggregation. If you want to solve issues of concurrency, use the thread timeline. If you want to isolate short bursts of activity, FlameScope. And if you need to look at code execution over time for a single thread, that's the flame chart. I already mentioned some of the software out there during this presentation, and here's a recap. One thing to keep in mind, a big caveat, is that not all of these tools support all runtimes and all profile formats. When a profiler collects profiling data, it can generate many different formats. For instance, the Java JFR profiler generates the JFR format, and a lot of profilers generate the pprof format.
And FlameScope, for instance, created by Brendan Gregg, supports formats generated by the Linux perf tool, but it does not support the pprof format, the JFR format, or the trace event format. And pprof, which is a very common profiling format, does not have timestamps attached to it, so if your profiler collects data and outputs it in the pprof format, you cannot build a FlameScope view, a thread timeline, or a flame chart from those profiles. I believe that making these visualizations work across runtimes and across different formats is a big opportunity space for the open source community, because in the real world, and in the ideal world, you would be able to combine these visualizations to solve performance issues. The power of these visualizations also comes from combining them together. Imagine you have an observability pipeline where you use tracing to trace your endpoint latency, and from the trace data you build some metrics that get displayed on a dashboard. The dashboard shows there's a spike in latency; what's happening? You isolate it, you find the relevant traces, and from the traces you find the service that's bottlenecked. Then, if your service has profiling on, you find the profiles that were collected at the time of the spike, and you can either look at FlameScope, or go to the thread timeline, select the relevant thread, and look at the flame graph or the call graph for that service. This makes for a very powerful experience, but it cannot be done if the current tools do not support all these different runtimes and profiling formats. So with that, we come to the conclusion. In this talk, I talked about why you should care about profiling and why profiling should be its own pillar of observability. The problem with profiling is its high barrier to entry, and it is this problem that's perhaps keeping the importance of profiling from being as widely recognized as it should be.
And there's a solution, which is to leverage visualizations more effectively to help users and everyday developers understand profiling and profiling data. Thank you. Yes? Thank you so much for your presentation, really awesome. I have a question about approaches that go beyond a single binary or application when it comes to profiling. I do a lot of work with continuous integration and delivery, and what developers will often do is write a bash script that calls multiple different things, and it can be a challenge to figure out which of the programs called in those scripts are the bottlenecks. So can some of these techniques go beyond profiling individual applications, to, say, a script that's calling multiple binaries and applications? That's a very good question. I think it could work if there's a way to tie the profiling data together. With tracing, for instance, when you collect a trace, I don't believe this is currently supported, but if the profiling data you collect can somehow pick up metadata from these other tools, such as a trace ID if you're doing tracing, or an identifier for the bash script you're running, and that can be tied to the profiling data, then yes, that could work. Thanks for the really excellent presentation. This is not so much a question as a comment. You mentioned at the end that you've got all these different formats and all these different visualization systems. A colleague of mine put together a website called Profilerpedia. That's P-R-O-F-I-L-E-R, pedia. It tries to map profile collection tools to visualization tools, and it also includes tools that do conversion. I agree, though, we definitely need better solutions for doing that sort of thing. Thank you. Thank you for the great presentation. Maybe this is a little bit off topic, but do you have any recommendations performance-wise?
I mean, you showed lots of different applications for profiling. Is there a specific application, or do any of them have better performance? I mean, one that doesn't really impact the main application, doesn't slow it down or things like that. So my question is, which one is more performant, which one has better performance, or do you have that sort of list? I'm not sure I completely understood the question. Which one has the best performance? Which one do you mean? I mean specifically speed-wise; for example, I don't remember all the components, but the pprof profiler or, for example, another profiler. Oh yeah, yeah, the different formats? For example, is pprof faster than the others, or do you have that kind of comparison? I do think that different profilers have different overheads when they are collecting the profiles, but once you have the profiling data and you're trying to visualize it with a visualization tool, performance is not as much of a concern. But yes, there are different overheads depending on the tool that you use to collect the data. Okay, understood, thank you. Thank you so much everyone.