Okay, let's see if I can get going. So what we typically aim for is 300 milliseconds or bust. 300 milliseconds is typically the point at which software starts feeling smooth and slick. It's better to be well under that, but 300 milliseconds is the target. I typically aim for about 100 milliseconds in my own software, just to allow for the cases when things are not as good.

So why do I do this? The short answer is that it scratches an itch, and I really enjoy optimizing stuff. In my general day job I don't get to do it very often, maybe once or twice a year if I'm lucky. Also, I really like it when LibreOffice is snappy. I really hate waiting for my computer. I hate it when it interrupts my train of thought because it needs to take a while to think about stuff. I was doing some work on this slide show earlier and I happened to notice, for example, that in Impress, when you right-click in the slide panel on the left-hand side to create a new slide, it takes about a second or so to open and fully display the context menu, which is not great.

But I do have to say that there is a lot of disappointment when it comes to trying to optimize software as complicated as LibreOffice. It is a huge undertaking. There's just this mammoth load of software, and you often have to punch your way through several layers to get between the piece that needs the information and the piece that has the information. So often I end up throwing away my attempts. Roughly 80% of my optimization attempts get thrown away before I get to something useful, so only about one in five pans out. You need to be prepared to do this a lot. But take heart, because it does work in the end. If you keep trying and you keep asking people, eventually things will work out. One thing that I have found is that it's wise to stash your attempts.
So I either git stash them, or I use git format-patch to export them to a patch file and save them in another folder, because as you're exploring and working your way around, you might often find that an earlier attempt is a better fit as you figure things out later on.

Now, a recommendation when you do this: do the easy ones first. You'll often start optimizing, and I'll often do a whole class of optimizations across the whole code base, and I'll always start with the easy ones, because the easy ones get you into the rhythm of things. They let you get some early successes going, which maintains motivation, which is very important when working on a code base as large as LibreOffice, and it lets you slowly adapt whatever optimization you're doing to the code as you grow the optimization and get better and better. And surprisingly, fixing two or three easy ones often has just as much impact as fixing a big one. The easy ones often add up to a decent-sized improvement, and that's partly due to cache effects, which are pretty much what dominates CPU performance these days.

The other thing that I can definitely recommend is having good hardware, and I know this is problematic for a lot of people because good hardware is expensive, but doing optimization without decent hardware is just an exercise in frustration. LibreOffice is not that fast to start with, and once you throw a profiler or something on top of it, you will often find that your machine lags dreadfully, and you won't get yourself into a good flow unless you have at least a decent CPU, between 8 and 16 gigs of RAM, and preferably (actually pretty much a requirement these days) an SSD. Now that's not ideal, because your machine is then no longer representative of the majority of people using LibreOffice, but it's just one of those things.
The other thing that I can recommend is using two source trees when you're working on LibreOffice. You will often find yourself needing to switch backwards and forwards between optimizing the code and then testing out those optimizations, and it's very, very hard to debug an optimization in an optimized build. It can be done, but it is just frustrating, because the source code doesn't always line up nicely with the executed code, and the debugger has a more difficult time assigning useful values to the things you want to inspect. So typically I'll bounce backwards and forwards: I'll try out an optimization, then I'll copy it using git diff over into my debug build tree, compile it there, test it there, and debug whatever issue I'm trying to debug there. Once I've got it working, I'll git diff it again, git apply it to my optimized build tree, and see how it works out there.

One of the things that I worked on over the last year was temporary files. I had thought that the temporary file situation in LibreOffice was fine. I had noticed issues a couple of times with temporary files, but it hadn't really stuck in my head until I ran across a particular use case: loading a Microsoft PowerPoint presentation that created hundreds of temporary files in the process. It became really obvious that our temporary file situation on Windows was not ideal, and the more I dug into this, the more I realized that LibreOffice's idea of a temporary file is closely aligned with the Unix idea of a temporary file, which makes sense because that's where it came from, but it doesn't line up with how Windows treats temporary files.
Now, in the Unix world, temporary files typically live in the /tmp file system, and /tmp is typically a special, magical file system which is very, very fast and is sloppy, in the sense that it will lose data if the machine dies, and that's fine, because they're temporary files. But Windows doesn't have this concept of a /tmp file system. It has temporary folders, but inside those temporary folders the files are normal files. The way to make a file magical and fast from a Windows perspective is to pass in FILE_ATTRIBUTE_TEMPORARY, which actually works really, really well. When you pass in FILE_ATTRIBUTE_TEMPORARY, Windows says: okay, great, this file can live in memory and doesn't have to get flushed to the hard drive, and if it dies, well, that's just too bad.

So I made that small change, and it didn't make as much difference as I thought, so I had to dig through the code further, and I discovered that we were closing these temporary files and reopening them. On Unix this works fine, because the file lives on a special file system, so opening and closing it is largely irrelevant. But on Windows, the moment you close it, it becomes a real file, it gets written off to the hard drive, and the slowness kicks in. So we had to unwind that close-and-reopen behaviour, which was probably there from the early days when file handles were a really, really scarce resource, but that's no longer an issue on today's machines. So that got us further. However, it was still not as fast on Windows as it was on Linux. I did some digging, and as it turns out, we use our normal file handling infrastructure when we deal with temporary files, and our normal file handling infrastructure will flush a file when the file object dies, which is great for normal files. It is terrible for temporary files.
The moment you flush, Windows says: oh, you really must want this data, and it writes it to the hard drive. So we had to pass a special flag down to our file handling infrastructure to say: this is a temporary file, and you don't need to flush it on close. And then, finally, we reached the promised land: temporary files on Windows are pretty much as fast as they are on Linux, which is great, because it speeds up a bunch of stuff that we do.

Okay, so we really need to be nice to the cache. When you're optimizing stuff, the cache is incredibly important. One of the things I've been doing, if you've been watching my commits lately, is switching out some uses of std::unique_ptr for std::optional, and what I'm trying to do is co-locate objects. For example, if you have an expression like int x = p1->p2->p3;, you are bouncing through memory, and every time you bounce through memory you are probably triggering a fetch from DRAM, which is a long way from the CPU. It's a long walk, and the CPU is effectively stalled until it gets that data back from RAM, and then stalled again when you fetch the next one. So you're bouncing through RAM and really slowing things down. Ideally we want to co-locate stuff to reduce the time we spend fetching from DRAM. For example, std::vector is almost always a better idea than std::list, unless you have particularly large objects, because the data is co-located, so when you fetch it from DRAM you're typically fetching multiple elements at the same time.

Okay, now as it turns out, when dealing with memory, malloc is actually quite expensive. It may not seem like it, because our CPUs used to be slow, but our CPUs are so incredibly fast these days that malloc is becoming a significant bottleneck. And the reason is that malloc pretty much always has to take a lock.
Now, one possible option here would be switching to a fancier allocator like jemalloc or mimalloc or one of the other allocators floating around. But when you're dealing with an application as large as LibreOffice, that is not an ideal answer, because there's just so much magic going on down in the language runtime that swapping out malloc implementations is not ideal. So where possible, we should try to minimize malloc calls, and even better, allocate things on the stack. I've been making a bunch of changes lately where, instead of allocating something with malloc, I just allocate it on the stack, because the stack is wonderfully cheap. Allocating on the stack is, from the CPU's perspective, literally a case of bumping the SP register. The RAM that you're dealing with is almost always in cache; the CPU generally goes to quite considerable lengths to keep the stack data in cache, and stack data is accessed linearly, forwards and backwards, so it's very nice from the CPU's perspective. So the stack is generally very, very cheap. And we have a fair amount of it these days: I believe we have about eight megabytes of thread stack in a typical application, and if we ever needed to, we could easily increase that, because LibreOffice allocates very, very few threads.

Now, some people say that when we need to make things go faster, one thing to do is to use lockless algorithms. That is generally not great, because lockless algorithms require slightly weird data structures that are very awkward to use, and you have to be very, very careful with them: when you're using a lockless data structure, you have to be careful that you're not accessing two different lockless data structures at the same time, or a lockless one and another one, because you could end up with two pieces of data that are not consistent with each other. So I have avoided this thoroughly, except for one case.
We have found one case where a lockless data structure is great. The funny thing is that this is in the SVL shared string pool, where we intern strings. In Calc, we share strings, because Calc often deals with very large spreadsheets where an awful lot of strings are exactly the same, and then it becomes a significant benefit to share the string objects in question, as well as the uppercase variants of those strings. Now, I actually tried this lockless data structure about two years ago, and it didn't make any significant performance difference, so I threw that patch onto my stash pile and forgot about it. I went back to the string pool about a year ago, the string pool was still slow, I tried the lockless data structure again, and it still didn't help. Then Luboš and I worked on the string pool together, and we improved it, and it was great for a while. Then this year the shared string pool came up again, in a profile of a problem document, and I stuffed in the lockless data structure, and this time it made a difference. So the answer is: lockless is generally not the answer, except when it is, because you've tried everything else. This particular time it did make a difference, and we used a quite nice little library called libcuckoo, which is a concurrent hash map, and that improved things nicely.

The other option is multithreading. We have multithreaded a few things: we've multithreaded the Calc load, and we multithread the loading and unzipping of zip files in the background. This is great for speeding things up, but, but... multithreading keeps bringing in regressions. It's just really easy for the initial work to be fine, and then later on it turns out that in certain scenarios you're accidentally touching some code which is not using a mutex, or code which is not using atomic reference counting, and things fall apart.
So we use multithreading where we can, but we acknowledge that it is a problem in a code base the size of LibreOffice, where it is very hard to cordon things off and be sure that you've got all the edge cases. We've had to debug several multithreading-related issues; not me personally, but other people have had to debug lots of them. Consequently, we treat this with great care, because it doesn't always deliver the promised land. Also, with multithreading you often find that you've multithreaded something and it's just as slow as it was before. That's because you find yourself hitting shared locks, or you find yourself having to put mutexes around pieces of code to protect them, because they are now called from all the threads simultaneously. And as soon as you've done that, you discover that all of your threads are ganging up on that one mutex, and then you've lost the benefit. So multithreading is nice, but then you have to do a whole bunch of follow-on work. I did some of that follow-on work recently, where I had to take mutexes out of pieces of code and convert that code to immutability: the data structure was initialized once, before the threads were created, and it didn't change after that, so the threads could all hit it simultaneously without needing to take locks. And that worked out very well.

And that is actually the end of my talk.