Excellent. Well, thank you for coming. Am I ready? Am I in time? I am — I'm late. Good. So, Calc threading. This is what I'm going to say — it's not interesting, it's an index. But here's the disclaimer: pretty much all of this was done by Tor Lillqvist and Dennis Francis, neither of whom can be here today, which is a great shame. But I was at least partly responsible for them and helped too.

So here we go. It is a pretty interesting code base, obviously, like everything else. It's 30-plus years old. The data structures have improved a huge amount recently, in the last three or four years, but there's still significant scope for improvement there — we'll look at how they are in a bit. The calculation engine, though, has been left pretty much as-is. We tried to tack another one on the side to do OpenCL calculation, compiling formulae to OpenCL, but it's been badly in need of love. So we'll look a little at how it works and how we improved it to thread it.

Since LibreOffice 4.3, the data structures have looked pretty much like this. You have a document, which is, I guess, your spreadsheet. Inside it you have a whole series of sheets, which are called tables — the several tabs, I guess, along the bottom. And then we have columns, which are stored something like this: there's a whole fixed-size array, rather a large one, of columns. And down each one we have these wonderful multi-dimensional data structures, which are sort of spans of contiguous types in chunks going down the column. So we have things like blocks of strings, or chunks of doubles, or various other things. But we'll really be looking at the formula cell stuff today.

So inside those, you have a whole run of these formula cells, bang, bang, bang, like this. But we try to group information about them together. So there's a token array, and the token array basically represents your formula. Say you have =SUM(1,2,3). There are two representations of that. The first is a token array like this. And then there are the same tokens in a different order — so this one would be SUM, 1, 2, 3, and the reverse Polish equivalent would be 1, 2, 3, SUM. Of course, this is quite a simple example; there are a lot more twisted ones. But the nice thing about the reverse Polish form is that you don't have to do any complicated stuff: you execute the tokens one by one, pushing and popping on a simple stack as you calculate.

So there's a whole load of different token types. The key ones, I guess, are things like single references — get a cell from A1 or whatever; double references — get a range of cells, and this can of course be a three-dimensional range through multiple sheets. There are special cases for external references to other documents. And of course there are simple numbers, strings, and then operations like "do a division" or "execute this macro with these parameters".

And here's how it works. When we want to calculate a formula — well, there are several ways of triggering this, but one way is just to get a value out of a cell. So you ask a cell: give me your value. If it's just a simple double or something in this array, then we just pass the double back. If it's a formula, we need to check whether we actually need to calculate the result. So this MaybeInterpret stuff goes: well, maybe we should actually recalculate before we return the double.
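To make that reverse Polish execution concrete, here's a minimal, self-contained sketch of the push-and-pop evaluation of =SUM(1,2,3). The Token and Op types are invented for illustration — Calc's real tokens (formula::FormulaToken and friends) carry far more — but the stack discipline is the same idea:

```cpp
#include <cassert>
#include <vector>

// Invented token types; Calc's real ones are much richer.
enum class Op { Push, Sum };

struct Token
{
    Op eOp;
    double fValue = 0.0; // used by Push
    int nArgCount = 0;   // used by Sum
};

double evaluateRPN(const std::vector<Token>& rRpn)
{
    std::vector<double> aStack; // the simple value stack described above
    for (const Token& rTok : rRpn)
    {
        switch (rTok.eOp)
        {
            case Op::Push:
                aStack.push_back(rTok.fValue);
                break;
            case Op::Sum:
            {
                // Pop the operands, push their sum back.
                assert(static_cast<int>(aStack.size()) >= rTok.nArgCount);
                double fSum = 0.0;
                for (int i = 0; i < rTok.nArgCount; ++i)
                {
                    fSum += aStack.back();
                    aStack.pop_back();
                }
                aStack.push_back(fSum);
                break;
            }
        }
    }
    assert(aStack.size() == 1); // a well-formed formula leaves one result
    return aStack.back();
}

int main()
{
    // =SUM(1,2,3) in reverse Polish order: 1 2 3 SUM
    std::vector<Token> aRpn = {
        { Op::Push, 1.0, 0 }, { Op::Push, 2.0, 0 }, { Op::Push, 3.0, 0 },
        { Op::Sum, 0.0, 3 }
    };
    return evaluateRPN(aRpn) == 6.0 ? 0 : 1;
}
```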
And that eventually ends up in Interpret. There's an amazing recursion-flattening thing here, which I'll talk about later, and it eventually ends up in a thing called InterpretTail. That creates an interpreter object on the heap, passes it the code — which is the token array — where it is in the document, and all that other good stuff, and calls Interpret on it. Interpret then starts building this stack of reverse Polish tokens and executes them one by one. As part of that process, you recall that some of these tokens say "go and get data out of the sheet somewhere else". So sometimes we recurse back up to here, because we find something that needs another cell.

Imagine a case with an entire column: someone types 42 into A1, then types =A1 into A2, and fills that all the way down, so you have a million formulae, each of which refers to the one above. Then you call GetValue on the very bottom cell — as you're trying to draw the screen, or whatever. That can potentially recurse a million deep down your stack, and it's not a very shallow recursion frame either. So there is this, quote, amazing recursion flattening, which goes: ah, we've recursed quite a lot at this point, we're starting to panic about how much stack we've got, maybe we should do something creative — rearrange what we're doing in some way so as to defer work, come back and do more later, and hopefully complete. There's some fun stuff there that probably doesn't bear too much thought, but it's just a bit irritating. And of course that's just a single column; you can imagine much worse situations with very deep dependency chains.

Now, you recall that all of these tokens are arranged into a formula cell group, and we know how big this thing is — we know it spans a whole column. So perhaps we could do better. There's this thing called InterpretFormulaGroup that is called in various cases — and should be called more frequently, but there are future plans for that. It can do something different: the existing OpenCL and software cases try to interpret a great chunk of the group at once. To do that, we call GetValue — this thing that can recurse, as you recall — on all of our inputs. So we can look at the formula group and go: well, this formula only operates on one cell, but as we go down the column, that cell reference turns into all of these other cells. So we should fetch all of that data at once and pack it away into a matrix. This works nicely for simple string and double values, and we pack it all into a nice, flat, uniform chunk of memory. Instead of looking at formula cells and doing operations for each of them, we have just an array of doubles.

So first of all, we check that it's safe to do this — for some value of "safe": that this is a formula we can optimize, and that this set of tokens is safe to do this stuff with. We get those values, and then we can choose. We can send them to OpenCL — push them across to your GPU: we compile the tokens to some clever OpenCL kernel, shovel the data in, and get the results back. In some cases that works really well and is really fast. In other cases, compiling the kernel is slower than just executing, so you don't win. It depends on the shape of your sheet.
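The gathering step described here — pull every input value up front and pack it into a flat matrix — might look something like this minimal sketch. The Cell type, gatherColumn and runKernel are invented for illustration; the real code sits around Calc's formula group interpreter and its column storage:

```cpp
#include <cmath>
#include <vector>

// Hypothetical cell storage: a column whose cells are either numeric or
// something else. Calc's real storage is the chunked column described
// earlier; this is just enough to show the gathering idea.
struct Cell { bool bIsNumeric; double fValue; };

// Gather one input column for a formula group into a flat, uniform chunk
// of memory, so the group "kernel" can run over plain doubles instead of
// walking heterogeneous cell blocks once per formula.
std::vector<double> gatherColumn(const std::vector<Cell>& rCol,
                                 size_t nFirstRow, size_t nRowCount)
{
    std::vector<double> aFlat;
    aFlat.reserve(nRowCount);
    for (size_t i = 0; i < nRowCount; ++i)
    {
        const Cell& rCell = rCol[nFirstRow + i];
        // A real implementation must convert, or bail out to the
        // cell-by-cell interpreter, on non-numeric input; NaN stands in.
        aFlat.push_back(rCell.bIsNumeric ? rCell.fValue : std::nan(""));
    }
    return aFlat;
}

// With flat inputs, the kernel for e.g. =A1*2 filled down a column is a
// tight loop: trivial to vectorize, or to hand off to an OpenCL kernel.
void runKernel(const std::vector<double>& rIn, std::vector<double>& rOut)
{
    rOut.resize(rIn.size());
    for (size_t i = 0; i < rIn.size(); ++i)
        rOut[i] = rIn[i] * 2.0;
}

int main()
{
    std::vector<Cell> aCol(1000, Cell{true, 1.5});
    std::vector<double> aIn = gatherColumn(aCol, 0, aCol.size());
    std::vector<double> aOut;
    runKernel(aIn, aOut);
    return aOut[0] == 3.0 ? 0 : 1;
}
```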
And so we now have some stuff that tries to judge the weight of a formula: how much work is it really doing? Is it a simple copy — a memory copy — in which case copying it to the GPU and back again is not going to help? Or is it a more complicated function? And then we've got a software version too, which does some kind of SSE-accelerated summing across these things in some nice way. As we calculate these things in the software stack, down here we manipulate this matrix so that it looks different, but we don't do any copying. So we have a sort of abstract matrix that Kendy created, very late at night before a deadline, to make this work very beautifully. And it turns out to be very efficient, as we'll see later.

So why thread? Well, we need to thread because sometimes CPUs actually get slower: the megahertz goes down, the IPC goes down. But hey, I've got another three cores that aren't doing anything — which is good for thermal management, perhaps; you can always move to a cool core. Anyway, processor clocks are pretty much stymied at around four gigahertz; you're not going to get much faster than that. So they're all going much wider. And instructions per clock are not improving hugely either, even with all this clever speculative execution that we're so fond of these days. The good news is that AMD is really stirring up this market and providing new high-IPC, widely-threaded parts. Laptops, I think, arguably have four threads minimum; the mid-range stuff is eight threads; workstations, sixteen threads. It's cheap — I meet people that buy these things. Your new PC will have more threads than you know what to do with. And so, of course, AMD has been trying to help make sure those are used effectively.

So Markus, my friend at the back here, is a hero who created this crash-reporting thing, and we were looking at the statistics the other night to see: well, how many cores do people have? Frustratingly, CPUs are very good at reporting their core count but not their thread count. Some of these are hyper-threaded cores and some aren't, which is really irritating — I wanted to show you how many threads people actually have spare. The bad news is that some people still have one core, although I'd like to think it's hyper-threaded so they have at least two threads. As you can see, there's a declining number of people with two, or potentially four, threads. And then there are really quite a lot of people here with four threads — or maybe four weak cores, I don't know — but either way, you see the picture: this segment is growing and will grow more. If we enlarge the very small bit at the top, the trend is even more encouraging: there are 48-CPU machines, and we've even got some 80-core guys that seem to be crashing. I don't know whether you can extrapolate from the crash data that more threads means you crash more often — it's quite possible, and maybe we get less reliable as we go — but either way, everything is getting more threaded, so we should use that stuff.

So: threading InterpretFormulaGroup. What we really wanted to do was reuse the existing formula core rather than creating more special cases off at the side. We wanted to take that, avoid too much sub-setting, and ideally remove the software interpreter as well, so that everything could collapse back in. So that's how we started out.
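As an aside on counting threads — and anticipating the hyper-threading result later in the talk — here's a tiny sketch of how a process can ask for the hardware thread count. Note that C++'s std::thread::hardware_concurrency() has exactly the core-versus-thread ambiguity complained about above:

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>

// std::thread::hardware_concurrency() reports hardware *threads*, not
// cores -- the same ambiguity the crash statistics ran into. As the talk
// describes later, halving it is a crude way to avoid scheduling two
// heavy workers onto one hyper-threaded core.
int main()
{
    unsigned nHwThreads = std::thread::hardware_concurrency(); // 0 if unknown
    unsigned nWorkers = std::max(1u, nHwThreads / 2);
    std::printf("hardware threads: %u, calc workers: %u\n", nHwThreads, nWorkers);
    return 0;
}
```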
So the idea, basically, is that we pre-calculate our dependent cells much as now, but instead of stuffing them all into a matrix in a strange way, we just leave them where they are. We're confident that when we go to get them, MaybeInterpret will return false, so we don't need to worry — we can just use the existing code. Of course, if MaybeInterpret actually calls Interpret and we recurse, there's a big old assertion there that goes bang, and the whole thing falls in a heap. We've been catching a few of these assertions in the crash testing, which is exciting. Then of course there are some functions that are horrible, and we have a blacklist for those. But essentially we could parallelize by reusing the existing code, which is pretty nice.

So the scheme goes something like this: you call all your GetValues at the beginning, so that subsequently all your MaybeInterprets will return false — they won't do anything, they'll just give you the raw value. Again with the amazing recursion flattening, which I think we actually implemented this time. And then in InterpretTail you start to parallelize: as you interpret this whole group, you can call this on multiple threads, once you've set everything up nicely at the beginning. So that was basically the plan. And there's a nice big assert that says: don't do this if a threaded group calculation is already in progress.

So that sounds good. The only problem is, it turns out when you look into it that the nice pictures here are not quite as wonderful. ScInterpreter, for example, mutates the actual formula as it calculates it: due to a fit of cunning, the iteration variable is actually in the token array itself. And of course there's a whole load of complicated stuff going on. There are macros being called that in theory can do anything, right? They can mutate the document, the table, the cells — the thing you're in right now. Some of the functions mutate the dependency graph, which is again tied to the document — a disaster. So we were really rather keen to have simple locking that didn't require lots of highly granular locks everywhere, particularly since that would also hit the common single-threaded case that's still used. We were eager to keep it relatively simple.

So we cleaned up this magic of having the current index inside the instance of the token array you're iterating over — where you say "this is the iterator start", that's actually the first token, and then you call GetNext, GetNext, and it's mutating the array itself. We now have a nice external iterator. We have mutation guards everywhere, essentially designed to crash and halt hard if they ever see a mutation occur while a threaded calculation is going on; we sprinkle those liberally in scary-looking places. See the sketch after this paragraph. And we turn various things off: MATCH and so on — and actually VLOOKUP and HLOOKUP generate new dependencies as they calculate, so we turn those off too. Macros we disable for now. I mean, if you look at what Excel does, they look at the macro code and go: oh, this is a pure function, it doesn't mutate stuff. So Excel doesn't allow macros that do stupid stuff to be called in formulae, but we're not quite as advanced as that.
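A hedged sketch of the two fixes just described — the external iterator and the mutation guard. All names here are invented stand-ins; Calc's real token array is formula::FormulaTokenArray, and its guards are rather more elaborate:

```cpp
#include <atomic>
#include <cassert>
#include <vector>

struct Token { int nOpCode; };

// Hypothetical stand-in for the token array.
class TokenArray
{
    std::vector<Token> maTokens;
public:
    void append(Token t) { maTokens.push_back(t); }

    // Before: a "current index" lived inside the TokenArray and GetNext()
    // mutated it, so two threads could not walk one array. After: the
    // iteration state lives in an external iterator over a read-only array.
    class Iterator
    {
        const TokenArray& mrArray;
        size_t mnIndex = 0;
    public:
        explicit Iterator(const TokenArray& rArray) : mrArray(rArray) {}
        const Token* next()
        {
            return mnIndex < mrArray.maTokens.size()
                       ? &mrArray.maTokens[mnIndex++] : nullptr;
        }
    };
    Iterator begin() const { return Iterator(*this); }
};

// A mutation guard in the spirit the talk describes: a flag set while a
// threaded group calculation runs, asserted in every scary-looking place
// that might mutate the document.
std::atomic<bool> g_bThreadedCalcInProgress{false};

void mutateDocumentSomehow()
{
    // Crash hard in debug builds rather than race silently.
    assert(!g_bThreadedCalcInProgress.load());
    // ... the actual mutation ...
}

int main()
{
    TokenArray aCode;
    aCode.append({1});
    aCode.append({2});
    int nCount = 0;
    TokenArray::Iterator it = aCode.begin();
    while (it.next())        // walks the array without mutating it
        ++nCount;
    mutateDocumentSomehow(); // fine here: no threaded calculation running
    return nCount == 2 ? 0 : 1;
}
```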
What would be even nicer would be to parallelize the Basic interpreter — there are quite a few people out there in the industry who would love to have parallel macro execution, because their quant uses these weird functions for pricing Greeks or whatever, and they want that quicker. Yeah, there's a whole lot of stuff. At the moment we allow external extensions to be called because, well, they're just as bad as macros, but they probably don't exist, so that's all right. We should probably turn that off too.

There are even more nasties: global variables left and right. And as we started to look at these, there was nowhere obvious to hang them that would actually be sensible. So we have a whole load of thread-local variables for the calculation stack, the current document being calculated, matrix positions, and a few more. We had to upgrade the Mac toolchain to make thread-local variables work, which was slightly unfortunate. Eventually we introduced an ScInterpreterContext, which accreted more and more things to optimize and improve performance. So now we pass an interpreter context through many of the functions and try to make that add up. There's a sketch of both steps below.

So how did it go? Well, initially it did quite well. This is the single-threaded calculation; this is the same performance with just one thread — so hopefully this would be reasonably flat. There are two machines here: my Linux laptop, and some Ryzen 16-core monster. There are several things you can see here, probably better on a log plot. If you draw the log plot, you see this going down very nicely, linearly, until you hyper-thread, at which point it doesn't really speed up a whole lot; you can see it flattening off massively at the end. We're really hammering this thing quite hard, such that the hyper-threading doesn't work so well here. That's because this test is just a large SUM doing a lot of SSE double work; hyper-threading probably helps other use cases more. At the moment we just turn hyper-threading off, and that actually speeds things up.

So at this point we have four kinds of calculation. We could do the plain old single-threaded calculation. We have the software group interpreter, single-threaded — aggregate, stuff into a matrix, calculate. We have the OpenCL thing. And now we have the new threaded calculation. Look at these nice acronyms I've added — horror of horrors. On benchmarking it, we discovered that sometimes the new threaded calculation — which is all shiny and pretty and doing no locking really at all and absolutely wonderful — was slower than the single-threaded calculation with the software group interpreter. That's pretty depressing after some months of work. It turns out that the process of collecting all that data from the sheets — checking its types, fooling around, looking at format types and so on, for each formula cell — is really expensive. And often it's done again and again, so you get an n-squared: you're doing a big operation on a column, and then you're doing it multiple times as you go down. The software group interpreter, of course, collects it once, and then it's hyper-optimized SSE goodness really whipping through that. So we then threaded that version as well.
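A sketch of the two steps just described, with invented members: first hoisting formerly global state into thread_local variables, then gathering it into a context object passed explicitly through the interpreter — the direction that became ScInterpreterContext:

```cpp
#include <vector>

// Step 1: formerly global interpreter state becomes thread_local, so each
// calculation thread gets its own copy.
thread_local int g_nRecursionDepth = 0;

// Step 2, the nicer direction: gather that state into a context object
// passed explicitly through the interpreter. The members here are invented;
// in Calc this role grew into ScInterpreterContext.
struct InterpreterContext
{
    const class ScDocument* mpDoc = nullptr; // document being calculated
    std::vector<double> maStackCache;        // scratch space reused across cells
    // ... number-format caches, matrix positions, and so on ...
};

double interpretCell(InterpreterContext& rCtx)
{
    ++g_nRecursionDepth;        // safe: per-thread counter
    rCtx.maStackCache.clear();  // clear() keeps capacity, so no realloc churn
    double fResult = 0.0;       // placeholder for the real interpretation
    --g_nRecursionDepth;
    return fResult;
}

int main()
{
    InterpreterContext aCtx;    // one per calculation thread
    return interpretCell(aCtx) == 0.0 ? 0 : 1;
}
```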
So instead of having a single-threaded software group interpreter, we threaded that as well, and then suddenly life was good again. I'll show a picture of that in a minute, and how the stats looked. But then we sit there going: well, it's all very well getting a 6x, but we've got eight threads — why aren't we getting an 8x, or a 9x? Ideally, you hear about these super-linear speed-ups due to extra cache use; they read well in the textbooks, don't they?

So we start looking. On Windows there are absolutely terrible profiling tools, but on Linux we use perf. Looking for threading issues is not entirely obvious, though. If you've got lock contention, there's a lot of time spent sleeping, but it's not real time — the kernel is off counting sheep instead — and that doesn't show up easily in many of the profiling tools. There are a whole lot of different events you can look at — you can look at futex memory locations bouncing between threads, I guess, and various things like that. We spent a lot of time on this, and eventually perf turned out to be probably the best tool to help. We also looked for things like false sharing, where you've tried to separate your memory, but because your allocator is smart it shoves all the same-sized things next to each other in the same place, they all end up in the same cache line, and then that line bounces between all your cores.

So we tried looking at a lot of these things, without a vast amount of success, but here were some of the horrors we found. Most of the threading problems we were looking for didn't turn out to be terribly findable; the other stuff was pretty silly. As we operate on this reverse Polish stack, we were regularly allocating and freeing things all the time as we went. Using the system allocator instead of our own custom one really sped things up, particularly for parallel use. So we dropped the custom allocator, and after that we also reused these tokens where possible — why bother allocating and freeing hundreds of double tokens when you just did exactly that a moment before? So we keep a little stash of them; there's no need to take a lock, you can just reuse the thing.

Another particular folly people like is to use std::stack, because it sounds like if you're making a stack, that's what you want, right? But let me tell you, if your use case is to extend and grow a stack like this, that's exactly what you don't want. Underneath it uses a std::deque, which allocates node blocks as it grows, so pushing keeps going back to the allocator. And when you're multiply threaded, what you really don't want is to be constantly hammering your allocator left and right to allocate and free all these tiny nodes and chain them together. Just use a vector — a very nice win comes straight out of that. And then, of course, the interpreter context starts to cache the things that were being freed and reallocated; by saving some of those we got a lot better.

Then there are some other particularly awful things. We see SfxItemSet — a favorite of Björn's and many others — appearing right in the middle of the interpreter: people are doing GetNumberFormat on cells as they do this arithmetic. It shouldn't need to be used at all, yet this number-format lookup happens inside the GetValue path.
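Two of those fixes are easy to show in miniature. Here's a sketch — with invented names — of why std::stack's default std::deque backing hurts under threading, and of a lock-free, thread-local token pool in the spirit of the double-token reuse described above:

```cpp
#include <memory>
#include <stack>
#include <vector>

struct Token { double fValue; };

// std::stack defaults to std::deque, which allocates node blocks as it
// grows -- painful when many threads hammer the allocator at once.
std::stack<Token> aDequeBacked;                      // deque underneath
std::stack<Token, std::vector<Token>> aVectorBacked; // contiguous storage
// Or just use std::vector directly with push_back()/pop_back().

// A tiny per-thread free list: recycle tokens instead of freeing and
// reallocating them. thread_local means no lock is ever taken.
class TokenPool
{
    std::vector<std::unique_ptr<Token>> maFree;
public:
    std::unique_ptr<Token> acquire(double fValue)
    {
        if (!maFree.empty())
        {
            auto pToken = std::move(maFree.back());
            maFree.pop_back();
            pToken->fValue = fValue;
            return pToken;          // reused, no allocation
        }
        return std::make_unique<Token>(Token{fValue});
    }
    void release(std::unique_ptr<Token> pToken)
    {
        maFree.push_back(std::move(pToken));
    }
};
thread_local TokenPool g_aTokenPool;

int main()
{
    auto pToken = g_aTokenPool.acquire(42.0);
    g_aTokenPool.release(std::move(pToken)); // the next acquire() reuses it
    return 0;
}
```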
And it just does some really crazy stuff. It's really unclear why it needs to do this, and when you start looking into it, it's rather frightening.

So anyway, here's the performance story. As we thread, you want your time to get lower — lower is faster. We did all this nice threading work, and this is the first step, against a recent master. Some sheets got massively faster, which is good — going from way down here to maybe twice as fast, which is great. But at the same time a whole load of other sheets got slower, which seems strange. The reason they got slower is that we had turned off the software group interpreter, which was doing this nice pre-gather, then the sum, then pushing it back. By threading the software interpreter we got some of that back. Then: stopping thrashing the token array; halving the number of threads — getting rid of hyper-threading took us to here, flat-ish for some loads and big wins for others (we just use half the number of threads and the OS knows what to do on its own); dropping the custom allocator — big wins for some, maybe a slight loss for others; caching formula double tokens; and various others — lots of commits that don't do anything. Isn't that nice? It's good to know you're making a difference as you do these refactors. And what else — C++ threads, and various other things at the end. It's still overall pretty good: coming down from, I don't know, 600 milliseconds to 150 for this one, and from 1.6 seconds down to around half a second, and so on.

So that's pretty much the thing. What should we do next? Well, the crash testing loads these 100,000 documents, and it asserts liberally left and right. Almost certainly implicit intersection is killing us. Implicit intersection is a clever way of writing formulae wrongly, with Calc noticing and correcting them as it calculates — I simplify; people can deliberately use it too, I suppose, though I don't know why they would. The problem is that a smaller range than the one written actually ends up being used, so when we look at our dependencies we don't pre-calculate the right cells, and then multiple threads get caught in this assert of "oh dear, we're fetching this data and it's not calculated". I'd also like to kill those global variables — that's probably relatively easy for a newbie to do; maybe it should be an easy hack now, because we've got this nice context to put them on. Then there's big stuff, like killing the formula cell and making everything a formula cell group — one of these groups that just happens to be one cell long — which would be kind of nice, to maybe simplify some of these pieces and make the plain old calculation just a single-threaded group calculation. And finally, getting rid of this format-type stuff that should never be happening at all during calculation, I think.

So that's about it. Those are the conclusions; they're kind of obvious. One point here is that it's actually just an economic problem. The technology is fun, but it's about being able to invest in optimizing this thing: as soon as you open the profiler you start thinking, why is it doing that? That's really silly. So thanks to AMD for supporting it. That's my talk. Any questions? Sir, you have a finger up — no? No? You're trying not to look like a questioner. Okay, anyone else?
The first person has to be brave, but after that it's easy. No? How many threads do people have? I'm going to do a poll, for the statistics. Does anyone have a single-threaded laptop or CPU they're using at all? Okay — and it's your main work PC? Oh, you do. Okay, it's a Raspberry Pi 1. Yeah? Okay, I thought it would be. Excellent. Yeah, your phone is ancient. So how about two? Okay — so I'm talking threads, so let's do two threads. Anyone with two threads? There's this guy. And that's what you actively use in your day-to-day work? Okay, fair enough. So it's not "I once had a BBC Micro — 6502, 8-bit"; I'm talking about what you actually use. Four? Who's got four? Four, yeah, there's some more. Eight? That's me. Sixteen? Okay, it starts to top out at this point. Anyone who can do better than sixteen on their workstation? Beyond that, tell me. Yeah, it's 64, isn't it? Yeah.

Yes, so that is true — there's thermal management in these things. However, arguably it is better to be more efficient, get it done quicker, and then idle the thing, rather than have it going for a long old time. Arguably. Race to idle, so they say. Who knows? Race to idle — the hurry-up-and-wait approach to power saving. Good. That was not a question, but it was a good statement. Anything else? We've got another three minutes. Sir? Okay. Wow.

Could we integrate LLVM to compile the formulae? Yeah, I think so. In fact, there is an LLVM-compiling-the-formulae solution already built in — it's typically known as software OpenCL. If you look under the hood of these OpenCL implementations, what you discover is that it's pretty much that. The problem is, of course, it's not a perfect match for what we do. Our formula engine is heavily built around a lot of the concepts I've shown you — the token stack, and how these things are passed. The sine function is not a C function like "double sin(double)"; it's more "here's a bag of hammers of things you could get" — what if you're passed a boolean, what if you're passed a string — all shoved into that formula. So in terms of code reuse and simplification, it's not ideal. But Markus is our Calc hero and maintainer — wave a hand, Markus, so people can harass you afterwards — and he probably has a more detailed view. But yeah, it's a good idea, perhaps; we like to simplify. Markus? Yeah — see, there's a whole load of refactoring we can do to carry on this improvement, to the point where we could have something much, much sweeter here. But I think the wins LLVM would give are small compared to refactoring the core first. Good. Well, if there's nothing else, thanks very much.