My name is Noel Grandin. I do some work for Collabora and also for a company called Peralex down in sunny Cape Town, South Africa. I'm married with two lovely small children, and I'm going to be doing a talk on performance tuning. That's where I stole the name from: it's a book that I read to my children very often.

Okay, so there are two major schools of thought when it comes to looking for performance issues: the Valgrind people and the perf people. Both of these have their pros and cons. I personally like perf, but Valgrind is perfectly adequate. Valgrind gives you much more accurate answers when you're profiling a program; perf, for me, gives faster turnaround times. However, with perf you're doing sampling-based profiling, which typically means you need a minimum run time for your program: if your test case doesn't run for at least about 10 to 20 seconds, it's hard to get decent data out of it. Valgrind, on the other hand, gives you accurate data no matter what, but it just takes quite a long time to get an answer out of it.

So my typical workflow looks something like this. You install perf and a handful of support libraries, and then the command boils down to perf record. The -F parameter is the sampling frequency; typically I tweak the frequency so that I get roughly a hundred megabytes' worth of sample data by the time I'm done. You can see I've told it to grab call graphs, which is another way of telling perf that I want a stack for every sample that comes out of the data. We need DWARF unwinding, because by default perf tries to use LBR recording for its stacks and we don't generate good LBR data, and we need to tell it to use the maximum stack buffer size, which is 64k, because LibreOffice generates some very deep stacks; the default buffer size for capturing stacks is, I think, around 16k or something, and that's nowhere near enough to capture our stacks. Then after that we run Brendan Gregg's brilliant FlameGraph scripts, and that generates a lovely SVG which looks something like that.
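As a rough sketch of what that workflow looks like (the frequency, binary name, and file names here are placeholders, and the collapse/render scripts come from Brendan Gregg's FlameGraph repository at github.com/brendangregg/FlameGraph):

    # Sample with DWARF-based stack capture and the maximum 64k stack buffer
    perf record -F 1000 --call-graph dwarf,65528 ./soffice test-document.ods

    # Collapse the captured stacks and render the SVG
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > profile.svg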
Now, when interpreting such a flame graph, it's as well to be aware of some funnies. The horizontal axis is how much time is being spent in something: at the bottom level it's always 100%, and as you go up, the width of a block indicates how much time is spent in that part of the program. The vertical axis represents stack frames. So we can see that in this case the bulk of our time is being spent over in various GTK-related stuff, and on the far right-hand side there's a bunch of Writer layout stuff happening. However, be aware that we run multiple threads, and perf samples all of the threads in our program every time its sampling timer ticks, which means we're effectively seeing a collapsed view of multiple threads here. So this __GI___clone sitting over here does not indicate that we're spending 25% of our time inside clone. It just means that 25% of the time, perf sampled a stack that happened to have clone on it, and that stack was probably doing nothing; it was probably just hanging about, or in the process of being created or destroyed. So it's just as well to be aware of things like that.

This is what I typically use on Intel on Windows: Intel's VTune Amplifier, a reasonably recent discovery. It's quite a nice tool. It has a free community version which works pretty well. I haven't had enough time with it to discover its funnies, but it has a reasonably decent user interface, it gives reasonable data, and it didn't take much effort to set up at all. If you really want an open-source one, there's one called Very Sleepy; I don't know where the name comes from. That works reasonably well. It was working brilliantly for me up until a few months ago when, for whatever reason, it started having trouble with our debug data, so I switched to VTune and I've been using that since then. But it's quite probable Sleepy will work for you, and the maintainer of Sleepy is actually quite responsive, which is also nice.

Okay... sorry, let me just fix my slides. I'm not seeing my notes for some reason, which I was expecting to see. Anybody know how to make the notes show up? Otherwise I'm going to have to guess what I meant. Do you know where that is, Tor? Ah, there we go. No, that's not right. Ah, there we go, that's much better. Okay.

Generally, when I'm optimizing, I start at the top of the program and work down, to get an idea of where the bulk of the time is hiding. In the case of a flame graph, that means I'm working from the bottom of the flame graph. But once I have an idea of where the hotspots are, I switch my view and start looking at the top of the stack, in other words the top of the flame graph, and work down from that point to try and find localized hotspots that I can optimize.

It's important to submit any performance changes you make in small patches. I learned this the hard way in the beginning. Performance patches have quite a high probability of generating regressions: there's almost invariably some subtle thing somewhere that you missed, because you're so focused on making this little piece of code run faster that you miss some weird edge case the previous code was handling. So it's very helpful to the QA people to submit your patches in small doses, so that when they bisect back they find just the thing you did wrong, and if you're in a rush and don't have time to fix it, you can revert just the small piece you messed up and not the other ten changes you made.

It's important to re-profile after every change. A number of times I have run a flame graph and got an answer: I see two or three quite obvious hotspots, I pick the biggest one and optimize it, and I think, okay, I know what the next step is. Then I run the profile again and it's completely changed. The profile is completely different, the previous hotspots are utterly gone, and a completely different hotspot has shown up. So you've got to just keep profiling, which is why I like perf: I get to cycle often.

It's also worth being aware that when you're profiling, you don't always get the luxury of sticking to working within one layer of LibreOffice. Often you'll find that a performance problem is split across multiple layers, because something that looked like a great idea in Calc is calling something else that looked like a great idea in SFX2, and the interaction of those two things is causing the problem.

It's all about staying in cache. Most of the performance work I do boils down to trying to keep our operations, or some tight loop, running inside cache. Typically these days CPUs have at least 256k of cache available, typically at least that much per core. On modern CPUs you're looking at multiple megabytes of cache, which is awesome, but even a low-end CPU has somewhere between 64k and 256k of cache available. That's a fair amount of memory, and if we can get our tight loops to stay inside that block of memory, we can get between two times and 20 times faster. I'm not kidding here; it really can get that much faster. If you stay inside the cache, your CPU is not dropping out of cache and accessing main RAM, and all of a sudden it's running enormously fast. Which is why flat data structures win: pointer chasing is really slow. So something like std::vector wins hands down over std::list. Even though std::vector may waste some memory, it stays inside cache, whereas std::list almost always needs to chase pointers through memory, which means you're not hitting your cache nearly enough.
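Here's a minimal toy benchmark, not from the talk, that shows the effect: it sums the same values through a contiguous std::vector and a pointer-chasing std::list. On typical hardware the vector traversal is several times faster, because each cache line it loads contains many useful elements, while every list node is a separate allocation reached through a pointer.

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    template <typename Container>
    static long long timedSum(const Container& c)
    {
        auto start = std::chrono::steady_clock::now();
        long long sum = std::accumulate(c.begin(), c.end(), 0LL);
        auto end = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        std::printf("sum=%lld took %lld us\n", sum, (long long)us.count());
        return sum;
    }

    int main()
    {
        std::vector<int> v(1'000'000, 1);
        std::list<int> l(v.begin(), v.end());
        timedSum(v); // contiguous: streams through cache
        timedSum(l); // node-based: chases pointers, many cache misses
    }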
Now, this is an entertaining example of when UNO gets expensive. This looks like a perfectly normal spreadsheet. However, what you don't realize is that each of those cells is actually a picture. I'm not kidding: somebody went to the effort of creating pictures that represent numbers and words. I don't know what the point of this was. Quite possibly it's a made-up example, but it came from a genuine user. So we have a Calc spreadsheet which has several thousand tiny little SdrObject shapes hiding inside it. What we end up with, at the bottom of this, is that we were iterating and broadcasting events. Now, that's a perfectly normal event loop for us, and most of the time such event loops are not a problem at all. However, in this specific case we had thousands of images, each image had thousands of child components, and each child shape was listening to something. The hotspot, fairly blatantly, was the part where we do a UNO query of the listener: we trigger UNO's dynamic "figure out whether this interface supports this" logic, and when you run that in a loop like this, it's pretty expensive.

So what we did was relatively straightforward... no, that's not what I wanted. So here's my next one. No, sorry. Okay, I seem to have lost a slide there. The relatively straightforward change in this case was to switch that event listener array. We had in fact lost some information: we already knew that the list contains only XEventListener objects, so we just needed a bit more type safety. The array used to contain XInterface; we switched it to explicitly say that, yes, we know it contains XEventListener. Then we can skip the dynamic query, and it was about four or five times faster.
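As a hedged analogy for the shape of that fix (the real code uses css::uno::Reference and queryInterface; this sketch uses plain C++ with dynamic_cast standing in for the dynamic query):

    #include <memory>
    #include <vector>

    struct Interface { virtual ~Interface() = default; };
    struct EventListener : Interface { virtual void notify() = 0; };

    // Before: the container erases the type, so every broadcast pays for
    // a dynamic lookup per listener.
    void broadcastSlow(const std::vector<std::shared_ptr<Interface>>& listeners)
    {
        for (const auto& l : listeners)
            if (auto* ev = dynamic_cast<EventListener*>(l.get())) // per-element query
                ev->notify();
    }

    // After: keep the container typed; any query happens once, at insertion.
    void broadcastFast(const std::vector<std::shared_ptr<EventListener>>& listeners)
    {
        for (const auto& l : listeners)
            l->notify(); // direct virtual call, no dynamic query
    }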
This next one is a spreadsheet with a whole lot of style names inside it. What's interesting here is that, A, there are quite a lot of styles, I think on the order of several hundred, and B, the styles are not English. The not-English part is important. Because we like to make things easy for our users, this particular list is being clever: it's using a natural sorting algorithm, and natural in this case means it breaks up the words into a word part and a number part and attempts to sort things so that "word 1" sorts before "word 10". Which is very clever, it's great, and it works perfectly until you have this situation, where you have a lot of entries in your list and non-Latin-1 encodings going on.

Now, this particular chunk of code originally had a bubble sort. Julien Nabet found that and fixed it: he switched it to std::sort, and that made it at least twice as fast. Then I took a look at it and realized that the expensive part of the operation was that we were constructing some temporary OUStrings here and there. So I switched those out, and that made it a whole bunch faster. The second thing I realized, one night while trying to put my children to sleep, was that we can sort twice. First we sort using just the default string ordering, which for Latin text, and even for non-Latin text, will normally produce an ordering that is pretty close to the final ordering, but not entirely accurate. Then we fire up our expensive sort operation and run our smart natural-ordering comparator over it. Because the resulting list is now mostly sorted, we need to do far fewer expensive comparisons. It's worth remembering that your normal sorting algorithms are O(n log n) in the average case, and that log n, when you have several thousand entries, can add up. So in this case we're going from something like 1000 x log(1000) expensive comparisons to probably only two thousand or so, because most of the time the elements will already be in the right order and we only need to reorder small parts of the list.

That was also the place where I switched out a small temporary OUString. We were creating a copy of a string and then doing some work on that copy. I just changed the code a little so that we pass down a position and a length, and the code doesn't have to create a temporary at all. That, surprisingly, made things considerably faster, because this was right inside the inner loop, and that's where cache makes the most difference. Just by avoiding the creation of a temporary there, I was able to avoid some memory allocation, and consequently that became two or three times faster.

This next one is a fairly natural-looking loop. What it doesn't tell you is that we've got a vector there, we're iterating through the vector, and we're deleting from the front of the vector. Now, I suspect this code originally used a std::list, in which case the whole loop is roughly O(n). But at some point it got switched to a vector, and it's not obvious, but when you delete from the front of a vector, what you end up with is pretty close to O(n squared), because every time you delete the front element, the vector moves all the remaining elements down, which is roughly an O(n) operation, and then you delete the next front element and it does that again. So we ended up with O(n squared). This was happening when the spreadsheet, after you'd just deleted a column, was terminating a bunch of listeners and then recreating its internal listeners, and it was deleting from the front. All I did was switch it to walk backwards through the data, so it deletes from the end, and that made it considerably faster, because now the vector didn't need to do any reallocations or any moving around; it just needed to remove from the end.
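A minimal illustration of that complexity difference (the real code also did per-element work on each entry, which this sketch leaves out):

    #include <vector>

    // Erasing element 0 of a std::vector shifts every remaining element
    // down, so draining a vector from the front is O(n^2) overall.
    void drainFrontSlow(std::vector<int>& v)
    {
        while (!v.empty())
            v.erase(v.begin()); // each erase moves all remaining elements
    }

    // Draining from the back never moves anything: each pop_back is O(1),
    // so the whole loop is O(n).
    void drainBackFast(std::vector<int>& v)
    {
        while (!v.empty())
            v.pop_back();
    }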
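And going back to the sort-twice trick a moment ago, here is a hedged sketch of the idea. The naturalLess() below is a tiny stand-in, not LibreOffice's locale-aware collator, which is far more expensive; the point is simply that the cheap first pass leaves the expensive second pass with a nearly sorted list and therefore much less work to do.

    #include <algorithm>
    #include <cctype>
    #include <string>
    #include <vector>

    // Stand-in natural comparator: digit runs compare numerically, so
    // "Heading 2" sorts before "Heading 10".
    static bool naturalLess(const std::string& a, const std::string& b)
    {
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size())
        {
            if (std::isdigit((unsigned char)a[i]) && std::isdigit((unsigned char)b[j]))
            {
                size_t i2 = i, j2 = j;
                while (i2 < a.size() && std::isdigit((unsigned char)a[i2])) ++i2;
                while (j2 < b.size() && std::isdigit((unsigned char)b[j2])) ++j2;
                std::string na = a.substr(i, i2 - i), nb = b.substr(j, j2 - j);
                na.erase(0, na.find_first_not_of('0')); // strip leading zeros so
                nb.erase(0, nb.find_first_not_of('0')); // shorter run == smaller number
                if (na.size() != nb.size()) return na.size() < nb.size();
                if (na != nb) return na < nb;
                i = i2; j = j2;
            }
            else
            {
                if (a[i] != b[j]) return a[i] < b[j];
                ++i; ++j;
            }
        }
        return a.size() < b.size();
    }

    void sortStyleNames(std::vector<std::string>& names)
    {
        // Pass 1: cheap default ordering gets the list almost right.
        std::sort(names.begin(), names.end());
        // Pass 2: the expensive comparator now runs over a nearly
        // sorted list and does much less reordering work.
        std::sort(names.begin(), names.end(), naturalLess);
    }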
This one, I think it's a Russian document, a legal document, and it took me quite a while to figure out. It's Writer, and it looks pretty normal, but it was just incredibly slow to page up and down in, and I spent ages looking at it. It was doing a whole lot of work down in VCL complex text layout, and all that sort of stuff has been fairly heavily worked on; people like Khaled and Mark Hung have optimized it, and it should be relatively good. So I kept looking, and there's even a nice little cache: the cache holds all of our layout information, Miklos has worked on that, and in another cache below that we cache font information. It's all good. However, as it turns out, the cache was eight elements big. Now, eight is a perfectly fine number of fonts to have in a regular document. This particular document happens to have around 16 fonts on a page, which meant that as we iterated through the layout of the page, we were evicting elements from the cache and creating new elements, which meant that our cache was completely useless: we were getting basically a zero percent hit rate. Because the cache was caching relatively small items, I just bumped up the size of the cache, and the document magically became enormously faster. The hard part here was figuring out why the cache wasn't working, and that took me a while.

Okay, next we have loading an autocorrect file, which was quite slow to load and quite slow to switch from one autocorrect file to another. What we're doing here is: we've loaded the file, m_xReplaceTLB is the list box, and we're iterating through our data, not inserting but updating pieces of the list box with the new data. That looks like a natural loop that should be fine. However, just because something's called get and takes an index does not necessarily mean it's O(1). I stared at this for a while before I worked out that, in the specific case of GTK's get_text operation, doing a get_text there is O(n), because every time you call get_text, it iterates through its own internal model. So this loop is actually O(n squared), and consequently we ended up with that hotspot. Okay, no, that's not the slide I wanted. The fix for this was actually really straightforward. I hadn't realized it, but Caolán had already created a nice iterator pattern for list boxes, so we just switched this to the iterator pattern, and it got a whole bunch faster.
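Here's a toy model of that problem, not the real GTK or VCL API: get_text(i) walks the widget's internal model from the start on every call, so calling it once per row turns the update loop quadratic, while a single traversal (which is what the iterator pattern buys you) stays linear.

    #include <cstddef>
    #include <forward_list>
    #include <string>
    #include <vector>

    struct ToyListBox
    {
        std::forward_list<std::string> rows; // node-based internal model

        std::string get_text(std::size_t index) const
        {
            auto it = rows.begin();
            for (std::size_t i = 0; i < index; ++i)
                ++it; // linear walk on every call
            return *it;
        }
    };

    // O(n^2): an O(n) lookup inside an O(n) loop.
    std::vector<std::string> readAllSlow(const ToyListBox& box, std::size_t count)
    {
        std::vector<std::string> out;
        for (std::size_t i = 0; i < count; ++i)
            out.push_back(box.get_text(i));
        return out;
    }

    // O(n): one traversal over the underlying model.
    std::vector<std::string> readAllFast(const ToyListBox& box)
    {
        return std::vector<std::string>(box.rows.begin(), box.rows.end());
    }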
This little chunk of code is hiding inside SC; it's one of the dynamic containers inside Calc. Normally, when it runs out of space, it bumps up its internal array limit by a fairly small amount, and this is normally great, because normally, when we're editing a Calc document, we add another column or something and we only need a little bit more space. However, in this specific case we were loading a document with a very large number of items. Consequently, every time we ran into the limit of available space, we would bump the limit by a small amount, add some more items, and bump the limit by a small amount again. It's not hard to see that we're doing an enormous amount of memory allocation and deallocation here, and the costs added up fast. So we switched to bumping the limit by about 50%, which seems to be the default thing that std::vector and similar containers do, and it magically got a whole bunch faster (there's a sketch of the idea below). The ideal fix here would have been to pass down some kind of reserve operation and reserve the space we needed. However, the point at which this was happening was quite far away from the place where we knew how many items we were going to be dealing with, so the amount of work required to plumb that information down through several layers didn't seem worth it.

Okay, this is a nice example. We were loading a file: somebody had a case where they were connecting to LibreOffice from another program and trying to load a file into it, and we were running through our type detection algorithm. When I fired up the flame graph for the type detection, it showed nothing obvious; there was just stuff everywhere. So this is one of those cases where performance is all about chasing the small stuff, and I chased down individual pieces of the flame graph. One of the filters was taking a string buffer and loading about four meg of data into it, but it was doing so in 64k chunks, so the string buffer was constantly being resized; so I explicitly sized the string buffer. Another part, deep down in SFX2, was creating a temporary file that it didn't actually need: it was never using it, it was just creating the temporary file on disk and then throwing it away. So I passed down some information there so it got a little bit smarter. And in one of the UNO stream classes, we had a buffer: we were reading into that buffer and then copying from it into our final stream, so I skipped that buffer. All of those together added up to a lot less memory traffic.
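Going back to the container growth for a second, here is a minimal sketch of that policy change, with invented names (this is not Calc's actual container). Growing by a constant chunk means a load of n items performs O(n) reallocations and O(n squared) copying overall; growing by about 50% keeps it to O(log n) reallocations and O(n) amortized copying, which is what std::vector-style containers do.

    #include <cstddef>
    #include <cstring>
    #include <memory>
    #include <utility>

    class DynamicArray
    {
        std::unique_ptr<int[]> data;
        std::size_t size = 0, capacity = 0;

    public:
        void push(int value)
        {
            if (size == capacity)
            {
                // Before: capacity += SMALL_CONSTANT;  // reallocates on almost every push
                // After: grow geometrically by ~50%.
                std::size_t newCap = capacity < 8 ? 8 : capacity + capacity / 2;
                std::unique_ptr<int[]> bigger(new int[newCap]);
                if (size)
                    std::memcpy(bigger.get(), data.get(), size * sizeof(int));
                data = std::move(bigger);
                capacity = newCap;
            }
            data[size++] = value;
        }
    };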
Then, once I ran the flame graph again, I could see fairly obviously that one of the filters was doing something slightly peculiar: it was hammering really, really hard on osl_getFilePos. Now, most of the time osl_getFilePos is a really cheap operation. However, in this case we were hitting it almost every single time we read even a tiny block out of the file, and there wasn't an obvious way to change that, because in that particular situation we were integrating with an external library which, as part of its pattern of operation, just seems to want to do that. So this is one of those cases where even std::mutex, which is relatively fast, turned out to be a problem, because we were hitting it inside a tight loop all the time. So down in the sal subsystem I switched the locking, just around the file position, from using std::mutex to using std::atomic. That meant that most of the time the CPU was handling the locking for us; std::atomic hides that nicely, and consequently it got a lot faster.

In this next case we were loading a nice big document, and we were computing style information for the document, firing up some EditEngine infrastructure in order to do that. Now, the EditEngine is a pretty decent piece of infrastructure most of the time, but it's quite heavyweight. After looking at this for a little while, it became fairly obvious that, because this was a spreadsheet, almost all of the cells in a given column share the same style information, so we didn't need to keep recomputing it. This was a case of effectively sticking a one-element cache in there: as we run through the loop, we just check whether the current cell shares the same style information as the previous cell, and if it does, we can avoid recomputing the style information. And yay, we went from five minutes to about two seconds. So that was a nice one.

This is a case where we had a large document with a lot of copies of the same image inside it. We have a cache, and the cache should be catching all those images and, when we save, avoiding doing the re-sizing and re-compression. It all looks great. However, the cache, which was a std::map, relies on comparing images, so I went digging into what operator== was doing. For some images operator== works great, so the cache was working fine. However, for the specific kind of image here, operator== was comparing pointers, which means that even though the two image objects pointed to the same underlying bitmap buffer, which is reference counted, we were getting false from operator==. On top of that, that operator== was calling some other operator==, and that was returning effectively the wrong answer for this situation. So I tweaked operator== to return what I consider the right answer. This is one of those cases where it's good to make small changes, because, as some of you might realize, the answer operator== should return is a little bit context-dependent sometimes. I changed it, and it worked great: for my situation the code was a lot faster.
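As a hedged sketch of that image-cache problem, with invented names: the cached image object holds a reference-counted pixel buffer, and comparing by identity reports two loads of the same picture as different, so the cache never hits.

    #include <memory>
    #include <vector>

    struct Bitmap { std::vector<unsigned char> pixels; };

    struct Image
    {
        std::shared_ptr<Bitmap> bitmap; // reference-counted pixel buffer
    };

    // Before (in spirit): identity comparison, false for two separately
    // constructed Image objects even when they share the same buffer.
    bool equalByIdentity(const Image& a, const Image& b)
    {
        return &a == &b;
    }

    // After: compare the shared underlying buffer, falling back to the
    // pixel data itself if the buffers are distinct.
    bool operator==(const Image& a, const Image& b)
    {
        if (a.bitmap == b.bitmap)
            return true; // same reference-counted buffer
        return a.bitmap && b.bitmap && a.bitmap->pixels == b.bitmap->pixels;
    }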
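And going back to the osl_getFilePos case, a hedged sketch of the mutex-to-atomic switch (this is not sal's actual code): when the only shared state is a single integer, an atomic load can replace a mutex-protected read in a hot loop.

    #include <atomic>
    #include <cstdint>
    #include <mutex>

    class FilePositionSlow
    {
        std::mutex m;
        std::uint64_t pos = 0;
    public:
        std::uint64_t tell()
        {
            std::lock_guard<std::mutex> g(m); // lock/unlock on every tiny read
            return pos;
        }
    };

    class FilePositionFast
    {
        std::atomic<std::uint64_t> pos{0};
    public:
        std::uint64_t tell() const
        {
            // A plain atomic load; the CPU handles the synchronization.
            return pos.load(std::memory_order_relaxed);
        }
    };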
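Similarly, here is a hedged sketch of the one-element style cache; the types are invented and computeStyleInfo() stands in for the heavyweight EditEngine work. Because consecutive cells in a spreadsheet column almost always share a style, remembering the previous cell's style skips nearly every recompute.

    #include <vector>

    struct Style { /* ... */ };
    struct Cell { const Style* style; }; // assumed non-null here
    struct Computed { /* expensive derived info */ };

    static Computed computeStyleInfo(const Style&) { return {}; } // stand-in

    void processColumn(const std::vector<Cell>& cells)
    {
        const Style* lastStyle = nullptr;
        Computed cached;
        for (const Cell& cell : cells)
        {
            if (cell.style != lastStyle)
            {
                cached = computeStyleInfo(*cell.style); // only on style changes
                lastStyle = cell.style;
            }
            // ... use 'cached' for this cell ...
        }
    }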
However, unknown to me, something else was relying on the pointer equality, and consequently some other part of LibreOffice was quite a little bit unhappy, and I got a regression a couple of months later. I was able to fix it, but it helped a lot that I had a small patch to work with there.

So, thank you very much. I think optimizing is fun; I've enjoyed working on this stuff. Anybody have any questions?

Yes, sir... Hotspot is great. You're talking about KDAB's Hotspot? It's a great tool. It failed to load up the perf data at that particular point in time; I had an issue on my Fedora box, something to do with the combination of the compiler I was using, the debug data, and Hotspot, and they weren't playing nicely together. But yeah, it's a great tool, I recommend it. Okay. I've also found the KDAB maintainers to be very friendly and very willing to look into bugs.

No more questions? Great. Thank you.