Excellent. Well, thank you for coming. So: Calc threading. This is what I'm going to say; it's an index, so not very interesting in itself. But here's the disclaimer: pretty much all of this was done by Tor Lillqvist and Dennis Francis, neither of whom can be here today, which is a great shame. I was partly responsible for them and helped too.

So, here we go. It's a pretty interesting code base, obviously, like everything else: it's 30-plus years old. The data structures have improved a huge amount recently, in the last three or four years, but there's still some significant scope for improvement there; we'll look at how they are in a bit. The calculation engine, though, has been left pretty much as is. We tried to tack another one on the side to do OpenCL calculation, compiling formulae to OpenCL, but the engine has been badly in need of love. So we'll look a little at how it works and how we improved it to thread it.

Since LibreOffice 4.3, the data structures have looked pretty much like this. You have a document, which is your spreadsheet. Inside it, you have a whole series of sheets, which are called tables in the code: the tabs along the bottom. And then we have columns, which are stored something like this: a whole array, actually a fixed-size and rather large array, of columns. Down each column we have these wonderful multi-dimensional data structures, which are spans of contiguous types in chunks going down the column. So we have things like blocks of strings, or chunks of doubles, or various other things. But we'll really be looking at the formula cells today. Inside the columns you have whole runs of these formula cells, bang, bang, bang, one after another, but we try to group information about them together. So there's a token array.
The token array basically represents your formula; say you have =SUM(1,2,3). There are two representations of that. The first is a token array like this, and then there are the same tokens in a different order: so this one would be SUM, 1, 2, 3, and the reverse Polish equivalent would be 1, 2, 3, SUM. Of course, this is quite a simple example; there are a lot more twisted ones. But the nice thing about the reverse Polish form is that you don't have to do anything complicated: you execute the tokens, pushing and popping on a simple stack as you calculate.

There's a whole load of different token types like this, but the key ones, I guess, are things like single references (get a cell, from A1 or wherever) and double references (get a range of cells, which can of course be a three-dimensional range through multiple sheets). There are special cases for external references to other documents, and of course simple numbers, strings, and then operations: do a division, or execute this macro with these parameters.

And here's how it works. When we want to calculate a formula, well, there are several ways of triggering it, but one way is to just get a value out of a cell. You ask a cell: give me your value. If it's just a simple double or something in this array, we pass the double straight back. But if it's a formula, we need to check whether we actually need to calculate the result first. So this MaybeInterpret step says: well, maybe we should actually recalculate before we return the double. That eventually ends up in Interpret, then there's an amazing recursion-flattening thing, which I'll talk about later, and eventually it ends up in a thing called InterpretTail. That creates an interpreter object on the heap, passes in the code, which is the token array, where it is in the document, and all that other good stuff, and calls Interpret on it.
Interpret then, of course, starts executing this stack of reverse Polish tokens one by one, and as part of that process, you'll recall, some of those tokens mean "go and get data out of the sheet somewhere else". As part of that, sometimes we recurse back up to the top and find something else that needs another cell calculated. Imagine a case such as an entire column: someone types 42 in A1, then =A1 in A2, and fills that all the way down, so you have a million formulae, each referring to the previous one. Then you call GetValue on the very bottom cell as you're trying to draw the screen, or whatever. That can potentially recurse a million deep down your stack, and it's not a very shallow recursion per level either.

So there is this, quote, "amazing recursion flattening", which goes: ah, you know, we've recursed quite a lot at this point, there's a lot of panic about how much stack we've got, maybe we should do something creative and rearrange what we're doing in some way so as to defer work, come back and do some more later, and hopefully complete. There's some fun stuff there that probably doesn't bear over-much thought, but it's a bit irritating. And that's just a single column; you can imagine much worse situations, with very deep traces.

Now, you'll recall that all of these formula cells are collected into a formula cell group, and we know how big that group is; we know it spans a whole column. So perhaps we could do better. There's a thing called InterpretFormulaGroup that is called in various cases, and should be called more frequently, but there are future plans for that. It can do something different: the existing OpenCL and software cases try to interpret a great chunk of the group at once. To do that, we call this machinery, which can itself recurse, as you'll recall, on all of our input.
So we can look at the formula group and say: well, this formula only operates on one cell, but as we go down the column, that one cell reference will actually turn into all of these other cells. So as we go down, we should fetch all of that data at once and pack it away into a matrix. This works nicely for simple string and double values, and we pack it all into a nice flat, uniform chunk of memory. So instead of looking at formula cells and doing operations for each of them, we have just an array of doubles.

First of all, we check that it's safe to do this, for some value of "safe": that we think this is a formula we can optimize, and that this set of tokens is safe to handle this way. We get those values, and then we can choose. We can send them to OpenCL, push them across to your GPU, compile these tokens to some clever OpenCL kernel, and get the results back. In some cases that works really well and is really fast. In other cases, compiling the kernel is slower than actually executing the formula, so you don't win; it depends on the shape of your sheet. So we now have some code that tries to judge the weight of a formula: how much work is it really doing? Is it a simple memory copy, in which case copying it to the GPU and back again is not going to help, or is it a more complicated function? And then we have a software version too, which does some kind of SSE-accelerated summing across these arrays in some nice way.

And as we calculate these things on the software stack, we manipulate this matrix so that it looks different, but we don't do any copying. So there's a sort of abstract matrix that Kendy created very late at night, before a deadline, to make this work very beautifully. It turns out to be very efficient, as we'll see later.

So why thread? Well, we need to thread because, well, sometimes CPUs actually get slower.
You know, the megahertz goes down, the IPC goes down, but hey, I've got another three cores that aren't doing anything. Which is good for thermal management, perhaps: you can always move to a cool core. But anyway, processor clocks are pretty much stymied at around 4 GHz; you're not going to get much faster than that. So CPUs are all going very much wider, and instructions per clock are not improving hugely, even with all this clever speculative execution that we're so fond of these days.

The good news is that AMD is really stirring up this market and providing new high-IPC, widely threaded parts. Laptops, I think, arguably have a four-thread minimum now; the mid-range stuff is eight threads; workstations, sixteen. I mean, it's cheap; I meet people who buy these things. Your new PC will have more threads than you know what to do with. And of course AMD has been trying to help make sure those threads are used effectively.

So Markus, my friend at the back here, is a hero who created this crash-reporting thing, and we were looking at the statistics the other night to see how many cores people have. Frustratingly, CPUs are very good at reporting their core count, but some of those are hyper-threaded cores and some aren't, which is really irritating; I wanted to show you how many threads people actually have spare. The bad news is that some people still have one core, although I'd like to think it's hyper-threaded, so they have at least two threads. As you can see, there's a sort of declining number of people with two, or potentially four, threads. And then there are really quite a lot of people here with four threads, or maybe four cores that are weak, I don't know; either way, you see the picture. This segment is growing. And if we enlarge the very small bit at the top, the trend is even more encouraging.
So, you know, there are 48-CPU machines; we've even got some 80-core guys that seem to be crashing. I don't know whether you can extrapolate from the crash data that more threads means you crash more often; it's quite possible, and maybe we get less reliable as we go. Either way, the point is that everything is getting more threaded, so we should use that.

So: threading InterpretFormulaGroup. What we really wanted to do was reuse the existing formula core. Rather than creating more special cases off at the side, we wanted to take the core, avoid too much sub-setting, and ideally remove the software interpreter as well, so that everything could collapse back in. That's how we started out.

The idea, basically, is that we precalculate our dependent cells, much as now, but instead of stuffing them all into a matrix in a strange way, we just leave them where they are. We're confident that when we go to get them, MaybeInterpret will return false, so we don't need to worry; we can just use the existing code. Of course, if MaybeInterpret actually calls Interpret and we recurse, there's a big old assertion there that goes bang, and the whole thing falls in a heap. We've been catching a few of those assertions in the crash testing, which is exciting. And then there are some functions that we could parallelize while essentially reusing the existing code, which is pretty nice.

So the scheme goes something like this: you call all your GetValues at the beginning, so that subsequently all your MaybeInterprets return false, don't do anything, and just give you the real value. (And the amazing recursion flattening, I think, we actually implemented properly this time.) Then, in InterpretTail, you start to parallelize: as you interpret this whole group, you can run it on multiple threads, after you've set everything up nicely at the beginning. So that was basically the plan.
And there's a nice big assert here that says: don't do this if a threaded group calculation is in progress. So that sounds good. The only problem is, it turns out when you look into it that the nice pictures here are not quite as wonderful. The ScInterpreter, for example, mutates the actual formula as it calculates it: in a fit of cunning, the iteration variable is stored in the token array itself. And of course there's a whole load of complicated stuff going on. There are macros being called that in theory can do anything: they can mutate the document, the table, the cells, the very thing you're in right now. Some of the functions mutate the dependency graph, which is again tied to the document; a disaster.

So we were really rather keen to have simple locking that didn't require lots of highly granular locks everywhere, particularly as that would also cost in the common single-threaded case that's still used. We were eager to keep it relatively simple. So we cleaned up this magic of having the current index inside the instance of the token array you're iterating over. It used to be: call First, give me the first one, then call GetNext, GetNext, and it's mutating the array itself. We now have a nice external iterator. We have mutation guards everywhere, essentially designed to crash hard if they ever see a mutation occur while a threaded calculation is going on, and we sprinkled those liberally in scary-looking places.

Then we turned various things off: INDIRECT, OFFSET, MATCH and so on, and VLOOKUP and HLOOKUP, which generate new dependencies as they calculate, so those are off too. Macros we disable for now. I mean, if you look at what Excel does, it looks at the macro code and goes: ah, this is a pure function, it doesn't mutate stuff.
So Excel doesn't actually allow macros that do stupid stuff to be called in parallel, but we're not quite as advanced as that. What would be even nicer would be to parallelize the Basic interpreter, but that's quite an exciting problem. There are people out there in industry who would love to have parallel macro execution, because their quant uses these weird functions for pricing Greeks or whatever, and they want that sped up. At the moment we allow external extensions to be called, because, well, they're just as bad as macros, but they probably don't exist in most sheets; we should probably turn that off too.

There are even more nasties: global variables left, right and centre. As we started to look at these, there was nowhere obvious and sensible to hang them. So we have a whole lot of thread-local variables: the calculation stack, the current document being calculated, matrix positions, and a few more. We had to upgrade the Mac toolchain to make thread-local variables work, which was slightly unfortunate. Eventually we introduced an ScInterpreterContext, which accreted more and more of the things we wanted to hang somewhere to optimize and improve performance. So now we pass an interpreter context through many of the functions and try to make that add up.

So how did it go? Well, initially it did quite well. This is the single-threaded calculation, and this is the same performance with just one thread, so hopefully this would be reasonably flat. There are two machines here: my Linux laptop, and some Ryzen 16-core monster. There are several things you can see, probably better on a log plot; if you draw the log plot, you see the time going down very nicely, linearly, until you hit hyper-threading, at which point it doesn't really speed up a whole lot.
You can see it flatten off massively at the end, because we're really hammering this thing quite hard, such that the hyper-threading doesn't work so well here. Of course, that's because it's doing big SSE work: this test is just a large SUM, doing a lot of double arithmetic. Hyper-threading probably helps you more in other use cases. At the moment we just turn it off, and that actually speeds things up.

So at this point we had four kinds of calculation. We could do a plain old calculation, single-threaded. We had the software group interpreter, single-threaded again: aggregate, stuff into a matrix, calculate. We had the OpenCL thing. And now we had the new threaded calculation. Look at these nice acronyms I've added.

And then, horror of horrors, on benchmarking it we discovered that sometimes the new threaded calculation, which was all shiny and pretty, doing really no locking at all and absolutely wonderful, was slower than the single-threaded calculation with the software group interpreter. That's pretty depressing after some months of work. It turns out that the process of collecting all that data from the sheets, checking its types, fooling around looking at format types and so on, for each formula cell, is really expensive, and often it's done again and again for these cases where you have an N-squared shape: you're doing a big operation on a column, and then you're doing it multiple times as you go down. The software group interpreter, of course, collects the data once, and then it's hyper-optimized C goodness, really whipping through it. So we threaded that software group interpreter as well, and then suddenly life was good again. I've got a picture of some of that in a minute, and how the stats looked.

So then we sat there and said: well, it's all very well getting a 6x, but we've got eight threads.
Why aren't we getting an 8x, or a 9x ideally? You hear about these sort of super-linear speedups due to extra cache use; they read well in the textbooks, don't they? So we started looking. On Windows there are absolutely terrible profiling tools, but on Linux we use perf, and even there, looking for threading issues is not entirely obvious. If you've got lock contention, there's a lot of time spent sleeping, but it's not real time: the kernel is off counting sheep instead, and that doesn't show up easily in many of the profiling tools. There's a whole load of different events you can look at; you can look at kernel futex memory locations bouncing between threads, I guess, and various things like that. We spent a lot of time, and eventually perf turned out to be probably the best tool to help with this, looking for things like false sharing: where you've tried to separate your memory, but because your allocator is smart, it shoves all the same-sized allocations in the same place, next to each other, so they all end up in the same cache line, and then that line bounces between all your cores.
So we tried looking at a lot of these things, with not a vast amount of success; most of the threading problems we were looking for didn't turn out to be terribly findable. But the other stuff we found was pretty silly. As you operate on this reverse Polish stack, we were regularly allocating and freeing things all the time as we went. Using the system allocator instead of our own custom one really sped things up, particularly for parallel use, so we dropped the custom allocator. After that, we also reused tokens where possible: why bother allocating and freeing hundreds of double tokens when you just freed one a moment ago? So we keep a little stash of them; there's no need to take a lock, you can just reuse the thing.

Another particular folly people like is to use std::stack, because if you're making a stack, that sounds like what you want, right? But let me tell you: if your use case is to extend and grow a stack like this, you exactly don't want std::stack, because underneath it uses a deque, so as you push it keeps allocating new chunks, and when you're multiply threaded, what you really don't want is to be constantly hammering your allocator, left and right, to allocate and free all these tiny nodes and chain them together. Just use a vector; a very nice win comes straight out of that. And then of course the interpreter context started to cache the things that were being freed and reallocated, and just by saving some of those we got a lot better.

There are some other particularly awful things. We see SfxItemSet, a favourite of Björn's and many others, appearing right in the middle of the interpreter: people are doing GetNumberFormat on cells as they do this arithmetic, which shouldn't be needed at all. This GetCellValueOrZero is inside the GetValue function, and it does some really crazy stuff; it's really unclear why it needs to do this, and when you start looking into it, it's rather frightening.

So anyway, here's a sort of performance story. As we threaded, your time wants to get lower if you want to get faster. We did all this nice threading work (this is the first step, on a recent master, I guess) and some sheets got massively faster: we're going from way up here to way down here, maybe twice as fast, which is great. But at the same time, a whole lot of other sheets got slower, which seems strange. The reason they got slower was that we had turned off the software group interpreter, which was doing that nice pre-gathering. By threading the software interpreter, we managed to get some of that back. Stopping thrashing the token array, and halving the number of threads to get rid of hyper-threading, took us to here: flat for some loads, big wins for others. (Getting rid of hyper-threading just means using half the number of threads; the OS knows what to do on its own.) The custom allocator: again, big wins for some, maybe a slight loss for others. Caching formula double tokens. And lots of commits that don't do anything; isn't that nice? It's good to know you're making a difference. Overall it's still pretty good, coming down from 600 milliseconds to 150 for this one, and from 1.6 seconds to around half a second, and so on. So that's pretty much the thing.

What should we do next? Well, the crash testing loads these 100,000 documents, and it asserts liberally left and right. Suddenly, implicit intersection is killing us. Implicit intersection is a clever way of writing formulae wrong, with Calc noticing and correcting them as it calculates (I simplify); people can deliberately use it, I suppose, but I don't know why they
would. The reason it hurts is that the formula is written with a smaller range than actually ends up being used, so when we look at our dependencies we don't precalculate that data, and then multiple threads get caught by this assert of "oh dear, we're fetching this data and it's not calculated".

I'd also like to kill those global variables. That's actually relatively easy for a newbie to do; maybe it should be an Easy Hack now, because we've got this nice context to put them on. Then there's big stuff, like killing the formula cell and making everything a formula cell group, one of these groups that just happens to be one cell long, which would be kind of nice and might let us simplify some of these pieces; making the plain old calculation just a single-threaded threaded calculation; and finally getting rid of this format-type stuff, which should never be happening at all during calculation, I think.

So that's about it. The other conclusions are kind of obvious. One point here is that it's actually just an economic problem. Technology is fun and all that, but it's really about being able to invest in optimizing this thing; as soon as you open the profiler you start thinking, why is it doing that? That's really silly. So thanks to AMD for supporting this. That's my talk. Any questions?

Sir, you have a finger up. No? You're trying not to look like a questioner. Okay, anyone else? The first person has to be brave, but after that it's easy.

How many threads do people have? I'm going to do a poll, since you're all static. Does anyone have a single-threaded laptop or CPU they're using at all? Okay, and it's your main work PC? Oh, you do; it's a Raspberry Pi 1. And your phone is ancient, yeah. So how about two? I'm talking threads, so let's do two threads: anyone with two threads that you actively use in your day-to-day work? Fair enough. And it's not "I once had a BBC Micro", a 6502, 8-bit; I'm talking, you know, the sort of thing you use day to day. So, four: who's got four? Yeah, there are some more. Eight? Sixteen? Okay, it starts to top out at this point. Can anyone do better than sixteen on their workstation? Sixty-four, yeah.

Yes, that is true; there's thermal management in these things. However, arguably it is better to be more efficient, get the work done quicker, and then idle the machine, rather than have it going for a long old time: "race to idle", as they say. Who knows. It's the hurry-up-and-wait, military approach to power saving. That was not a question, but it was a good statement. Anything else? We've got another three minutes. Sir?

"Could we integrate LLVM to compile the formulae?" Yeah, I think so. There is in fact an LLVM-based formula-compiling solution already built in: it's typically known as software OpenCL, and if you look under the hood of those software OpenCL implementations, that's pretty much what you rapidly discover. The problem, of course, is that it's not a perfect match for what we do. Our formula engine is heavily built around a lot of the concepts I've shown you, the stack and how these things are passed. The SIN function is not a C function like "double sin(double)"; it's more "here's a bag of hammers of things you could get": what if you pass a boolean to it, or a string? It's all shoved into that formula. So in terms of code reuse and simplification it's not ideal. But Markus is our Calc hero maintainer; wave a hand, Markus, so people can harass you afterwards; he probably has a more detailed view. It's a good idea, perhaps, but we like to simplify. There's a whole load of refactoring we can do to carry on this improvement and get to the point where we could have something much sweeter here; but I think the wins LLVM would give you are small compared to refactoring the core.

Good. Well, if there's
nothing else, thanks so much. Very good.