and welcome to our next session. I would like to also welcome our speaker, Jiří Pavela, from the Faculty of Information Technology at Brno University of Technology, and his topic, called Perun: Keep Your Project Performance Under Control. So without any further delay, the floor is yours.

Okay, thank you very much for the nice introduction. Maybe first of all, can you hear me in the back rows? Okay, thank you very much. Today I would like to introduce you to my research and a tool that my colleagues at Brno University of Technology and I are currently developing. I'm a PhD student at Brno University of Technology, and my area of focus is efficient performance analysis and testing, both static and dynamic. Perun is the tool that basically kickstarted this interest, and it has been in development for several years already. First of all, I would really like to thank Red Hat for sponsoring this research financially.

Let me start with some brief motivation for why you should care about your software's performance, or why we should perhaps all care about it. Software performance bugs are an omnipresent problem; they are with us in pretty much every piece of software. Just a few horror-story examples. Your favorite cluster computing engine might freeze right after an update, which actually happened to Apache Spark after one of its major updates. Or maybe some of you have heard about the famous outage of Stack Overflow, which was actually caused by a performance bug. It cost more than half an hour of downtime, which is, as you can imagine, a lot of money. And another example: parsers, when implemented inefficiently, can cause severe slowdowns.

So you might think: those are horror stories, this cannot happen in my team, in my project, right? I'm not working in such a big company. So here are some examples that might be more relatable to your day-to-day work as a developer or student. If you choose an inefficient algorithm, your favorite compiler might run out of memory, which happened to the C# compiler because of an inefficient, quadratic implementation of constant folding of strings. Another example is pattern matching: I guess you all know regular expressions. If you write a regex and use an inefficient regex matching engine, you might actually spend the majority of your CPU time just matching those patterns. Or if you like Vim and you perform searches for tags, an inefficient implementation or an inefficient data structure might cause you to wait a bit longer than you would like. Or if you like generating documentation from your source code, inefficient parsing might cause you to spend more and more time generating the documentation, which I presume is not something you enjoy.

So I hope I've convinced you that you should probably care about performance: performance issues happen, and they probably will happen to you too. Now that we hopefully all agree that performance is important, what should we do about it? The easiest thing would be to find the performance bugs as fast as you can, preferably right after you release a new version of your software.
There has actually been a study conducted on dormant bugs, which are bugs that might sit in your code base for months or years before they're discovered, sometimes with catastrophic consequences. The study found that recently introduced bugs take less time to fix, require less experienced developers to fix them, and the fix is generally smaller. So if you find bugs fast, preferably right after you push your commit or your PR, the cost for your company will be much lower than if the bug is found, let's say, two or three years down the road.

So the simple solution is to find bugs as fast as possible, and that's something we've already been doing: it's handled by continuous integration and automated testing, which has become an industry standard and is used pretty frequently for functionality testing. For your functional bugs, this is what you usually do to discover them quickly. But what about performance bugs? You might develop some tools, frameworks, utility scripts, or tests that target performance, but those solutions are mostly ad hoc or proprietary: to the best of our knowledge, not complex enough, and not open source.

So meet Perun, a performance version system that tries to tackle this issue. We like to call Perun a complex solution for performance analysis and testing. That's a lot to unpack, so what do we mean by a complex solution? It means that Perun is able to collect performance data, which is basically what you do when you profile your software. It also creates performance models, some representation of the performance you've measured. It integrates with version control systems, such as Git, to access the full project history. It detects performance changes across the different versions of your software: say you release a new version and there's a severe degradation, Perun is able to find it. And it also visualizes the performance; it's much easier for us as humans to understand results if we see them in a picture. Just a small note: collecting performance data is usually the only thing that traditional profilers do. If you think of Callgrind and similar tools, they usually just give you the data, maybe some summaries, but they don't offer a complex solution for your performance problems.

So how does Perun actually work? I like to present it as a slightly simplified overview of its workflow, which consists of four major parts: repository, profiles, models, and detection. Here you can see the whole picture; I will try to break it down and describe each part separately. You have your project, which is basically some working directory: the directory where your code lives and where your configuration files and so on can be found. You very likely use a version control system, such as Git, to manage your versions and contributions. On top of that, you might also consider initializing Perun in this very working directory, very similarly to how Git is initialized in a typical repository.

Now that you have the version control system and Perun in your working directory, you might want to measure your project's performance and obtain something called profiles. A profile is basically a file that contains performance metrics about your software, based on the inputs you tested it on.
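To make the idea of a profile concrete, here is a minimal sketch of what such a file might contain. This is a hypothetical, simplified structure for illustration, not necessarily Perun's actual on-disk format: per-function timing records linked to the commit hash the project was at when the data was collected.

```python
import json
import time

# Hypothetical, simplified profile structure; not necessarily Perun's
# actual format. The key idea: performance records are linked to the
# VCS version (commit hash) the project was at during collection.
profile = {
    "origin": "f3a9c21",             # commit hash this profile belongs to
    "collector": "time-sketch",      # how the data was gathered
    "workload": "input-10k.txt",     # what the program was run on
    "resources": [
        # one record per measured function
        {"uid": "parse_input", "amount": 0.042, "unit": "s"},
        {"uid": "build_index", "amount": 1.337, "unit": "s"},
    ],
    "collected_at": time.time(),
}

with open("prof-f3a9c21.json", "w") as f:
    json.dump(profile, f, indent=2)
```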
A nice feature of Perun is that the profiles are stored within its, let's say, database: not exactly a database, but a structure that keeps those profiles linked to the corresponding version control system versions. So if you have a commit with some hash, Perun associates with this hash any profiles that were collected while your project was at that version, when it was the HEAD version.

Once you have a profile, you may want to derive some performance models from it. A profile is just structured data, perhaps a compressed binary, containing a lot of performance information, which isn't really something you would like to parse manually as a human. So you might want to construct models that help you understand the data better. Models are also stored within Perun, alongside the profiles.

Models may sound a bit abstract and hard to imagine, so maybe an example will help. Models are basically mathematical functions of the input size of your function or your program, or perhaps statistical summaries of the main performance features of your profile. For example, take regression analysis, which is one of our post-processors that generates models. Here you can see some red dots; those are the performance data you've collected in your profile. They describe the behavior of a function with respect to time and the size of the structure the function operates on. The curves you see there are the functions that regression analysis tries to fit your data with. It's a bit messy, so what you usually want is the single model that fits your data best. Using regression analysis, you can find out that the best model for your data is, say, linear. That doesn't mean the function has linear complexity; it just means that it behaves linearly on the inputs you gave it. And you decide that using statistical coefficients that tell you which model fits your data best. As you can imagine, making sense of a large cluster of data points is not easy; but if you know that the function behaves linearly, that's much easier to grasp than, I don't know, 2,000 points plotted in a graph.
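To illustrate the model-fitting idea, here is a small sketch of the general technique (my own illustration, not Perun's actual post-processor): fit a few candidate model shapes by least squares and keep the one with the best coefficient of determination.

```python
import numpy as np

# Candidate model shapes: each maps input size n to a feature f(n),
# so fitting y ~ a*f(n) + b is ordinary least squares in one variable.
MODELS = {
    "constant":  lambda n: np.ones_like(n, dtype=float),
    "linear":    lambda n: n.astype(float),
    "quadratic": lambda n: n.astype(float) ** 2,
    "log":       lambda n: np.log2(n.astype(float)),
}

def best_model(sizes, times):
    """Fit each candidate and return the name with the highest R^2."""
    best = (None, -np.inf)
    for name, feature in MODELS.items():
        X = np.column_stack([feature(sizes), np.ones_like(sizes, dtype=float)])
        coeffs, *_ = np.linalg.lstsq(X, times, rcond=None)
        predicted = X @ coeffs
        ss_res = np.sum((times - predicted) ** 2)
        ss_tot = np.sum((times - times.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        if r2 > best[1]:
            best = (name, r2)
    return best

sizes = np.array([100, 200, 400, 800, 1600])
times = np.array([0.011, 0.019, 0.042, 0.079, 0.161])  # roughly linear data
print(best_model(sizes, times))  # -> ('linear', ~0.99)
```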
The next step of the workflow is the detection phase, where you take the models you created, or the profiles directly, and compare them. We call the current version of your project the target version, and the version you compare against the baseline. So you take the baseline models or profiles of the parent version and, using some oracle, some magical algorithm, you compare them with your current version's models, and you obtain a result: there are some performance degradations, perhaps; or there was no change, which is usually what you like to see; or you even managed to optimize something a bit further. So you take the profiles or models of two versions, you compare them, and you get a result like: my new release is 10% slower; that's something I should care about and try to fix.

I've called the detection algorithms oracles, which is, I guess, the easiest way to think of them: as black boxes that give you a result. But as a small example, one of the detection algorithms we've implemented is something we call exclusive time outliers. Let me start with what exclusive time is. If you've profiled before, you might have encountered this term, or maybe the term "self", which in this context means the time spent exclusively in the function itself, without any callees. So if you have a function f which calls a function g, the exclusive time covers only the function f, not the called function g. We take this exclusive time for all functions and compute the deltas against the previous version of each function. So you have two versions, with many functions in both of them; you check how long each function took in the previous version and in the current version, you compute the deltas, and then, using statistical approaches for identifying outliers, you build a hierarchy of issues of different severity. For example, in our algorithm, an outlier identified by the modified z-score is usually the most severe degradation, or perhaps optimization.
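As a sketch of that outlier step (illustrative code, not Perun's implementation): the modified z-score is based on the median and the median absolute deviation, which makes it robust against a handful of extreme deltas. The function names and the 3.5 threshold below are my own example values.

```python
import statistics

def modified_z_scores(deltas):
    """Robust outlier score: 0.6745 * (x - median) / MAD."""
    med = statistics.median(deltas)
    mad = statistics.median(abs(d - med) for d in deltas)
    if mad == 0:
        return [0.0 for _ in deltas]
    return [0.6745 * (d - med) / mad for d in deltas]

# Deltas of per-function exclusive times between two versions (seconds);
# positive means slower in the new version.
deltas = {"f": 0.001, "g": -0.002, "h": 0.003, "slow_fn": 0.450}
scores = dict(zip(deltas, modified_z_scores(list(deltas.values()))))
severe = [fn for fn, s in scores.items() if abs(s) > 3.5]  # common threshold
print(severe)  # -> ['slow_fn']
```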
Okay, so enough with the theory; you might want to see how Perun can actually help you in your day-to-day work. Let me show you an example where we try to find performance changes across different versions. Many of you probably know the CPython project, which is the reference C implementation of the Python interpreter. Recently, a performance issue was reported by one of the users on GitHub, stating that there's an 8% higher function call overhead compared to the previous stable release. So you have some alpha release, and the function call overhead is 8% higher, which doesn't sound like a lot; but if you use those calls a lot, your program might get slower by around 10%, which might be important enough. The problem was in one of its modules, namely the ctypes module. The issue could be replicated using the pyperformance ctypes benchmark, so it could be reproduced easily, and it was fixed quite soon after the report.

So, problem solved; why would we need another tool to help us? Well, the hard part is usually discovering the issue and finding its root cause, even noticing that there is a problem at all, which might sit in a module that your tests aren't really covering that much. It often requires significant manual effort by the developers, which costs money and time. Perun tries to help developers with exactly this task, and it does so by utilizing the recency principle and past profiles: the repository of profiles across your project versions.

So how could we identify this issue, and find its root cause, using Perun? First of all, you have the CPython repository and you initialize Perun in it. We assume that you already have a profile for the previous release: you're continuously using Perun across your development cycle, so you already have a profile for the 3.10 release. We will call this profile the baseline. Just a small note: as I've already said, profiles are linked to the individual project versions internally. Now a new version rolls out, the alpha version, and you want to profile it to see if there are any issues.

So you run the ctypes benchmark on the new version. There's a simplified command which can be used to profile it; in reality, it's a bit more complicated, but not by much. We will call the new profile the target. Now we want to compare the baseline and the target to see if there are any issues. As I said, we support multiple comparison algorithms, and for this particular issue we use the exclusive time outliers that I briefly introduced before.

These are the results of the comparison. The first thing you might notice is the 9% slowdown, which roughly corresponds to the slowdown reported in the GitHub issue. And those are the two functions that contribute the most to the slowdown. So we've discovered that there is an issue, a roughly 9% degradation, and we also know which functions are responsible for it. All that remains is to check those functions and, lo and behold, we find that there's a bug in the implementation: there's an if statement checking a flag that asks whether we've already completed the initialization, and if not, we perform the initialization. The problem is that the flag is never set to true, so we initialize over and over again.

So now we know the issue. We create a new hotfix branch with the fix; here you can see the simple fix. You profile the new version with the hotfix again, obtaining a profile called hotfix, and you compare the baseline and the hotfix to see whether your fix actually solved the performance issue. Okay, this table might be a bit more confusing, so let's break it down. There are the old deltas, which refer to the previous comparison of our target and baseline: here you can see the 9% degradation, and here you can see it in absolute terms, how much more time it took compared to the previous stable version. And in those black columns, you can see that the new degradation after the hotfix is only around 2%. So there's still a small degradation, but the important part is that the two functions that contributed the most now seem to be fixed: their relative change against their previous runtime is below 0.1%, which is what you want to see. Especially since the problem was caused by a bug introduced while refactoring the code, you'd expect the performance to stay roughly the same.
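The bug class itself is easy to picture. Here is the pattern in a hedged Python sketch (the real bug and fix live in CPython's C code; all names here are made up): a one-time initialization guard whose flag is never flipped, so the expensive setup silently re-runs on every call.

```python
_initialized = False   # guard flag for one-time setup
_table = {}

def _expensive_setup():
    # Stand-in for costly one-time initialization work.
    for i in range(100_000):
        _table[i] = i * i

def lookup(key):
    global _initialized
    if not _initialized:
        _expensive_setup()
        _initialized = True   # <-- the buggy version forgot this line,
                              #     so setup silently re-ran on every call
    return _table.get(key)
```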
Okay, so what have we learned? We've learned that Perun leverages version control systems and the recency principle to successfully discover performance issues, help the developers find them, and perhaps even guide them to the possible root cause of the problem.

So that was the first demonstration of Perun. I have a second one, which relates to generating, let's say, interesting workloads for your program. A core problem of performance testing is finding the right inputs to test your program with. So recall the Stack Overflow issue I was talking about before. Maybe just a quick survey: had some of you heard about the issue before this talk? Did you notice it back when it happened? Okay, so some more details. The offending regular expression that caused the outage was this one; let me break it down a bit.

It's a regular expression that matches whitespace characters, or their Unicode equivalents, at the start of the line, or, in the second alternative, at the end of the line. The regular expression was used for trimming whitespace: you have some text with, say, trailing whitespace, and you want to get rid of it. The problem is that when you use a simpler, backtracking regular expression matching engine, you can run into extensive backtracking. The issue was triggered by one user's post that contained about 20,000 of those whitespace characters, which were neither at the end nor at the start of the line. So the engine ended up backtracking around 200 million times, which is a lot.
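You can reproduce this class of behavior yourself. Here is a small sketch using Python's backtracking re engine (not the engine involved in the original outage): searching for trailing whitespace in a string whose long whitespace run is not at the end takes roughly quadratic time.

```python
import re
import time

# A trailing-whitespace pattern in the spirit of the offending regex.
pattern = re.compile(r"\s+$")

for n in (5_000, 10_000, 20_000):
    # The long whitespace run is NOT at the end of the string, so every
    # match attempt inside the run consumes the rest of it, fails at '$',
    # and backtracks, giving roughly quadratic work overall.
    text = "x" + " " * n + "x"
    start = time.perf_counter()
    pattern.search(text)
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.4f}s")  # expect roughly 4x time when n doubles
```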
So can Perun somehow help with detecting this kind of issue before it becomes a catastrophe and kills your servers for half an hour? Well, we have a module called Perun's performance fuzzing. It's based on the originally proposed principle of performance fuzzing, but we've added some more things on top. Maybe just a quick introduction, since you might be hearing the term for the first time. Fuzzing is a form of fault-injection stress testing. That might not have helped a lot, so let me break it down: you generate inputs, preferably malformed or invalid ones, and feed them to your program, expecting to find at least some inputs that cause unexpected crashes or errors. If you're writing parsers, you might want to find inputs that reach corner cases you haven't tested properly and cause your program to crash or end up in some inconsistent state. So your goal is to find inputs that make your program behave incorrectly.

Perun's fuzzer builds on this principle, but it is profiling-guided. What does that mean? It means that we do not care about crashes or errors; we care about slowdowns. You want to find inputs that cause your program to slow down significantly, and to find that out, you need to check those inputs by profiling. So you generate, I don't know, one, two, ten million inputs, and some of them might cause your program to run 10 or 20 times slower. Those are the inputs you're interested in, because they definitely triggered some edge cases or corner cases of your program. The goal is to find the inputs that cause such severe slowdowns. Just as a side note, in some of our experiments we were able to find inputs that caused slowdowns in the range of several hours: if the original program ran for, like, half a second, we were able to generate inputs, not that much bigger than the original, that caused it to hang basically forever from a normal user's perspective.

So can we actually find the Stack Overflow issue using our fuzzer? We used the following settings for the experiment. We have a small C++ program that calls the regex search function that was the culprit of the problem. We have a seed; a seed is an input that the fuzzer starts from to generate the malformed inputs. Here, we used an implementation of parallel grep, about 150 lines of code, as the seed. And there are some unique mutation rules implemented in Perun; one of them, particularly interesting in this case, inserts whitespace at a random position in a string, and that's what will trigger the issue. To keep it fair, we chose to keep the inputs relatively small, so that they correspond to the size of input you would realistically expect your regex to be matching on.

So here are the results. You can see that the seed, the small C implementation of a parallel grep, is around 3,500 bytes, around 150 lines, and it took about 0.1 seconds. When we ran our fuzzer, after several hours it identified two inputs that caused severe slowdowns. The first one is around a 16x slowdown. That doesn't seem like much, but notice that the size of the input is almost the same as the original seed: a nearly identically sized input that runs 16 times slower. Or the second one: almost 30 times slower, for just 10,000 bytes. 10,000 characters, and it's already running for two and a half seconds; that's a lot, I'd say. If you get an input with millions of lines, you might never reach the end.

So the Perun fuzzer can force potential performance issues to manifest. You have code which works fine until you scale it, or until you reuse it in some other module. This is something that performance fuzzing can help you with: it finds issues that are not issues yet, only potential ones. Just as a side note, we use different mutation strategies based on the type of input you supply: for text files there is one set of mutation rules, for binary files a different one. There's also support for domain-specific rules: if your input follows some specific format, you can develop your own mutation rules that are likely to cause trouble.

And that's basically it for the demonstrations. To summarize: we are able to find existing performance issues that manifest in a new version of your software, and we are also able to identify potential issues that might cause trouble later if you don't fix them.
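To make the loop concrete, here is a heavily simplified sketch of what a profiling-guided fuzzer might look like. It is my own illustration under several assumptions (a simple timing harness instead of a real profiler, a single whitespace-insertion rule, a hypothetical seed file pgrep.c), not Perun's actual fuzzer.

```python
import random
import re
import time

def insert_whitespace(data: str) -> str:
    """Mutation rule: insert a run of whitespace at a random position."""
    pos = random.randrange(len(data) + 1)
    return data[:pos] + " " * random.randint(1, 64) + data[pos:]

def run_workload(data: str) -> float:
    """Stand-in for profiling the target; here: time a trimming regex."""
    start = time.perf_counter()
    re.search(r"\s+$", data)
    return time.perf_counter() - start

seed = open("pgrep.c").read()           # hypothetical seed file
baseline = run_workload(seed)
interesting = []                        # inputs causing severe slowdown

corpus = [seed]
for _ in range(10_000):
    mutant = insert_whitespace(random.choice(corpus))
    elapsed = run_workload(mutant)
    if elapsed > 10 * baseline:         # severe slowdown -> report it
        interesting.append((elapsed / baseline, mutant))
    elif elapsed > 2 * baseline:        # promising -> mutate it further
        corpus.append(mutant)

for ratio, mutant in sorted(interesting, reverse=True)[:5]:
    print(f"{ratio:5.1f}x slowdown, {len(mutant)} bytes")
```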
Now, something about our ongoing and future work on Perun. One of our main focuses is increasing the granularity of profiling. Instead of measuring functions and their durations, we would also like to measure the duration of each basic block within a function. Why is that important? It helps you pinpoint the exact, or almost exact, root of your issue: if you know that the problem happened in a specific basic block, you know which lines to check. Here is a visualization of some preliminary results, where you can see that one function takes much more time than the other functions in the program, and you can also see which basic block is responsible. There's only one here, so that's pretty easy; but in more elaborate functions, you might find that some basic blocks take much more time than you would expect.

The second area of our interest is increasing profiling precision. Instead of just monitoring the runtime of your functions, we would also like to relate it to the parameter values you supplied. So, for example, here you can see how your function behaves with certain inputs.

And the third field we're trying to tackle is increasing profiling efficiency. If you've ever profiled with an event-based profiler, you might have noticed that your program runs, I don't know, 100 times slower than it usually does. So we're trying to speed up the profiling process by profiling only the functions that are important. What are the important functions? That's hard to pinpoint precisely, but we're developing heuristics, or approaches, that might help identify them. The core challenge is achieving sufficient precision: if you profile only one-tenth of the functions in your program, the precision might be quite low. So this is the main challenge: profiling the important functions fully, so that you get the information you need, while not spending your whole afternoon profiling your program.
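As an illustration of what such a heuristic might look like (my sketch, not Perun's actual approach): use a previous profile to pick the smallest set of functions that covers most of the runtime, and instrument only those on the next run.

```python
def select_functions(exclusive_times, coverage=0.9):
    """Pick the smallest set of functions covering `coverage` of total
    exclusive runtime; only these would be instrumented on the next run."""
    total = sum(exclusive_times.values())
    picked, acc = [], 0.0
    for fn, t in sorted(exclusive_times.items(), key=lambda kv: -kv[1]):
        picked.append(fn)
        acc += t
        if acc >= coverage * total:
            break
    return picked

# Exclusive times (seconds) from a previous profile of the same project.
previous = {"parse": 4.1, "match": 2.2, "log": 0.2, "cleanup": 0.1}
print(select_functions(previous))  # -> ['parse', 'match']
```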
As for further future work: we're currently focused mainly on C/C++ programs, but we already have students working on extensions for C# and Java. We would also like to measure more performance metrics than just duration: we have an experimental memory consumption profiler, and one of our students is currently working on tracking the energy consumption of software, which is quite a hard task. We would also like to support more well-established, state-of-the-art tools: currently we have plugins for Facebook Infer and Loopus, and we'd also like to support the Valgrind tool suite, gprof, and so on. And lastly, we would like to perform more elaborate performance analysis of dynamic data structures, because choosing an inefficient dynamic data structure for your task is quite often the root of a performance issue.

So, to conclude: Perun is a complex performance analysis and testing solution. What does that mean? It integrates with VCS; it collects performance data; it derives performance models from those profiles; it detects performance changes across the versions of your project; and it visualizes the performance. Throughout the slides, you could see some of the visualizations used by the different modules and analyses. So it's not mere profiling; that's the key takeaway. It's not just a profiler, it's a complex tool suite. I've shown two examples of how you can use Perun: one was how to detect a new performance degradation in your new release, and the other was how to find out that your program has potential issues that might manifest later. And in our ongoing and future work, we would like to increase the granularity, precision, and efficiency of profiling, and also support more languages and more metrics, and incorporate existing tools that users are used to in their day-to-day work. That's all from me. Thank you very much for your attention, and I'm more than happy to take your questions now.

Okay, thank you. So if you don't mind, I'm going to start with a question. Can Perun go back in time, through the VCS? What I mean by that is: can I create a baseline somewhere back in time and compare it with my current version?

Okay, sorry, so what was the question again?

The question is: let's say I learned about Perun right now, and I'm going to use it on my project to see if I'm going in the right direction with some sort of function. So I just say, okay, start with some version back in time somewhere.

Yeah, I understand. When you use Perun, you can actually choose to retrospectively measure the performance of your program. It basically traverses the history of your project and performs the profiling and the derivation of models in all the versions you select. Say you're two years into your development and you want to establish a baseline: you say, let's measure those 10 previous versions. You create a configuration for that, you run it, and it collects all the data for all those versions, so you can retrospectively check whether there was some creeping degradation.

Great, thank you. Any more questions from the audience?

Hello. Is Perun just a tool, or do you also provide a service that I can hook up to my GitHub repo and just tell: hey, do it, I don't understand what you are doing, but do it?

Yeah, so the question is how you can actually use it. Currently, one of my colleagues is working on a tool that allows you to run analyzers and tools externally on one of your servers. We have an interface for that tool, so you can set up Perun on some remote server and then use this interface to check your project regularly, let's say for each new commit, each new release, or each pull request. As of right now, it's not directly integrated into GitHub Actions or the other most established CI platforms, but that's future work we really plan to address. For now, we have a temporary solution with one of the existing tools, but not with the mainstream ones, let's say.

Any more questions?

Thank you, that's a really interesting tool. One question regarding the data collection for establishing the baselines. Sometimes you need to restart the execution of a test, or of the program whose performance you want to measure, to reduce or remove differences caused by the test environment: the operating system tends to be busy, there are other tasks in the background, et cetera. How does Perun cope with that?

Yeah, so if I understood the question correctly, you mean that you want to perhaps get rid of some other profiles?

No, no. For example, taking the example of CPython: you run on the known-good version and you take the baseline from that, which gives you some metrics on its performance. But when you run this kind of performance testing, you sometimes get deviations or problems due to the environment the tool is running on, and you want to remove those differences and get true data from your input, regardless of the operating system environment.

Yeah, yeah. That's something you can, not always, but usually solve by repeating the profiling process. Perun actually supports running the inputs multiple times. So you run the same workload 10 times and you, let's say, throw away the first two runs, because they might be biased: your cache is not warmed up properly, and so on. So you can repeat the tasks you give to Perun. Say you have 10 inputs; you run those 10 inputs 10 times each, you get rid of the first two runs, and you look at how the remaining eight performed. This is the approach we've taken to this problem.

So you can keep multiple runs instead of just one?

Yeah, yeah, exactly.

Nice, thank you very much.
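A sketch combining the two ideas from these answers, retrospectively measuring the last few versions and repeating each run while discarding warm-ups. This uses plain git plus a hypothetical benchmark script bench.sh, not Perun's actual configuration:

```python
import statistics
import subprocess
import time

REPEATS, WARMUP = 10, 2
BENCH = ["./bench.sh"]   # hypothetical benchmark script for the project

def run_once() -> float:
    start = time.perf_counter()
    subprocess.run(BENCH, check=True, capture_output=True)
    return time.perf_counter() - start

def git(*args) -> str:
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

original = git("rev-parse", "--abbrev-ref", "HEAD")   # where we started
revisions = git("rev-list", "-n", "10", "HEAD").split()

try:
    for rev in reversed(revisions):                   # oldest first
        git("checkout", "--quiet", rev)
        runs = [run_once() for _ in range(REPEATS)]
        stable = runs[WARMUP:]                        # drop warm-up runs
        print(f"{rev[:8]}: median {statistics.median(stable):.3f}s")
finally:
    git("checkout", "--quiet", original)              # restore working tree
```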
I think we have time for one, maybe two short questions. Anyone?

Hi, how do you create the profiles? Do you have to manually set up some functions and inputs?

Yeah, so you're asking how we actually do the profiling itself, the process of collecting the performance data, right?

Yes, yes.

Okay. Currently we have a profiler based on SystemTap and eBPF, which basically hooks into your binary executable and monitors the functions and how long they take. It's sometimes not as efficient as we would like, so we also have profilers that instrument your program during compilation: when you compile your program, instrumentation code is added, it takes note of how long your functions took, and then it assembles the final profile, the resulting file that contains all those records.

So you have to run your project yourself, or run some tests?

Yeah, exactly. Profiling is basically a dynamic analysis, so you have to supply the inputs. But, as I mentioned right at the end, we also support some external tools that work statically: Facebook Infer and Loopus perform static analysis of your code and can report the derived, or inferred, complexity of your functions. So with those you don't have to run the program, but you are limited by what static analysis can do for you.

Okay, thank you.

Yeah, thank you for that question; that was the question I wanted to ask. But maybe one extra question for you. I'm not sure if you mentioned it, but just to give me an idea: how much time was needed for the profiling and creating the models in the case of the CPython bug? Some timeframe for that.

Yeah, so for measuring the benchmark, I believe we were doing five to ten runs, which takes around 10 minutes, if I'm not mistaken. That is quite a lot if you compare it to the original run, but CPython is also quite a big project, with thousands of functions, so the overhead of profiling definitely shows up. As for the models, it's usually in the tens of seconds for a project of CPython's size.

Okay, unfortunately we're out of time. So if you have more questions, I'm sure Jirka will be happy to answer them in the hall. Thank you. Please, one more round of applause for Jirka. Thank you.