Let's do three. Ah, ah, ah. We good? We are good. I don't know if you've noticed, but we built a thing. Are we going to talk about Squoosh again? A little bit. OK. But another aspect of Squoosh. All right. So that's kind of interesting. OK. So this might be a long one, so bear with me. I'm going to start where we started, and then we kind of fell down into this rabbit hole, and I want the audience to fall into the rabbit hole with us. Yes. And I'm really looking forward to this one, because sometimes when we do these, one of us is maybe slightly pretending to know less about the subject than we do. Whereas in this one, there's a lot that I really don't understand, and I'm really worried that I might not actually be able to explain everything as much as you would like me to. OK. So let's see where we end up. Yes. I'll let you know honestly. What are images on the web if we manipulate them with JavaScript? Right. So let's talk about ImageData, which is a data structure that we use in Squoosh. Once we get an image in, we decode it and turn it into an ImageData object, which is a data structure that exists on the platform. It basically has three properties: a width, a height, and data. Yep. And data is a Uint8ClampedArray, and in there you have just four bytes for each pixel. Yes. It's the first row, then the second row, and so on. And each pixel is red, green, blue, alpha, right? Exactly. So what you see here is a red pixel, then a green pixel, a blue pixel, and a white pixel. And because the image is two by two, that is what the image would look like. Oh, nice. So it's basically just a series of numbers with no concept of rows or columns. But with the width and the height, we can rearrange them and interpret them as a proper two-dimensional image. Brilliant. That's kind of how it works. All right. So now in Squoosh, we had the goal to rotate an image by 90 degrees. Sounds like a simple thing.
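The layout described above can be sketched in plain JavaScript. This is our own illustration, not Squoosh code: `makeImageData` is a hypothetical helper standing in for the platform's `ImageData` (which isn't available outside the browser), holding the 2×2 red/green/blue/white image as four RGBA bytes per pixel, row by row.

```javascript
// A stand-in for the platform's ImageData: width, height, and a
// Uint8ClampedArray of 4 bytes (R, G, B, A) per pixel, row by row.
function makeImageData(width, height, bytes) {
  return { width, height, data: new Uint8ClampedArray(bytes) };
}

const image = makeImageData(2, 2, [
  255, 0,   0,   255, // (0,0) red
  0,   255, 0,   255, // (1,0) green
  0,   0,   255, 255, // (0,1) blue
  255, 255, 255, 255, // (1,1) white
]);

// With no concept of rows in the array itself, the byte offset of
// pixel (x, y) is recovered from the width: (y * width + x) * 4.
const blueStart = (0 * 4 + image.width * 4); // pixel (0,1)
console.log(image.data[blueStart + 2]); // the blue channel of the blue pixel
```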
Probably only take 10 minutes. I mean, you wrote it in the first version, right? Yeah. So let's talk about how you wrote it. It's a function that rotates an image by 90 degrees. It gets an input image, which is this ImageData object. Yes, it is. And what we do is figure out, for 90 degrees, what the new width and the new height are, which is pretty much just the height and width swapped. You're doing fancy Surma code already. A little bit, because otherwise it wouldn't fit, so I'm compressing things down. Right. So here you're essentially assigning the height to the width and the width to the height, because it's 90 degrees. Right? OK, I'm following. And I'm creating a new output image which has this new width and new height. Yes. So now the goal is to go through the pixels and put them in the right spot in the output image. So what do you do? It's a for loop over all the pixels in the input image, and we figure out where they would have to land in the output image. So basically the new x coordinate is this kind of formula, and the new y coordinate is that one. And then we figure out which input pixel it is and which output pixel, and just copy it over. More fancy Surma code here. I know, right? It wouldn't get through review. I know, you don't like it. OK, that's fine. And then because we have four bytes per pixel, we just loop four times and do the thing, right? We copy the R value, the G value, the B value, and the A value. Yeah. And off we go. And this works. And this was actually decently fast. We shipped it this way. We should say the reason we did this rather than Canvas is because we wanted to run it in a worker. That's an entirely different story. But yes, we did a lot of tests with what seemed like fancier, faster technology, and it didn't seem to work. So we ended up writing our own piece of JavaScript just for this problem.
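The first version described above can be sketched like this. This is our reconstruction, not the exact Squoosh source; we assume a clockwise rotation, so pixel (x, y) lands at (newWidth − 1 − y, x).

```javascript
// Byte-by-byte 90° clockwise rotation over a {width, height, data}
// object with 4 bytes per pixel, as in the first shipped version.
function rotate90(input) {
  // Width and height swap for a 90° rotation.
  const newWidth = input.height;
  const newHeight = input.width;
  const output = {
    width: newWidth,
    height: newHeight,
    data: new Uint8ClampedArray(newWidth * newHeight * 4),
  };
  for (let y = 0; y < input.height; y++) {
    for (let x = 0; x < input.width; x++) {
      // Clockwise: input pixel (x, y) lands at (newWidth - 1 - y, x).
      const newX = newWidth - 1 - y;
      const newY = x;
      const inStart = (y * input.width + x) * 4;
      const outStart = (newY * newWidth + newX) * 4;
      // Inner loop: copy R, G, B and A one byte at a time.
      for (let i = 0; i < 4; i++) {
        output.data[outStart + i] = input.data[inStart + i];
      }
    }
  }
  return output;
}
```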
Yes, because OffscreenCanvas is only in a couple of browsers, whereas this is just basic JavaScript, so it works everywhere. And it can run in a worker because it's just ImageData. So we shipped this. This worked. And then I looked at it at some point and was like, hmm, there's actually a kind of obvious optimization that you missed. So I basically added a little patch. This all stays the same, same as before. But now I'm creating a Uint32Array. Yes, yes. So what this is: we have the same underlying chunk of memory, but instead of seeing it as a series of bytes, we see it as a series of 32-bit numbers. Because every pixel is one 32-bit number covering R, G, B, and A. And this way we can remove the inner loop. So this bit that was here, doing something four times every iteration, we're now just doing once. It's now one copy operation, which actually maps to a machine instruction most of the time, so V8 will be super smart and go like, whoa, fast. So this ought to actually be quite a bit faster. Yes. So cool. And then we shipped this, still fine. And then it turns out that for some reason, in one browser, this was super slow. Right. And we've been advised by legal... By our legal department, to not name the browser. Apparently it's a Chrome policy not to. Yeah, I've never heard that before, but... No, we're not allowed to talk about other browsers, so we can't mention which browser it is. But it's one that doesn't run on our machines. It didn't run on your machine, did it? You had to use a VM to run this different browser. OK. Either way, most browsers were fine, good enough at least. And then for some reason this one browser just ended up being extremely slow, unreasonably slow even. So we must have hit some weird corner case. Yes. Because this browser isn't slow usually. It's a very good browser. Yes. And different JavaScript engines optimize for different things.
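The Uint32Array patch can be sketched like this (again our reconstruction, not the shipped code). The key point is that both views share the same underlying buffer, so one element copy moves a whole RGBA pixel.

```javascript
// The same clockwise rotation, but viewing each pixel as one 32-bit
// number: the inner 4-byte loop collapses into a single copy.
function rotate90u32(input) {
  const newWidth = input.height;
  const newHeight = input.width;
  const outBytes = new Uint8ClampedArray(newWidth * newHeight * 4);
  // Same chunk of memory as input.data, seen as 32-bit numbers.
  const inU32 = new Uint32Array(
    input.data.buffer, input.data.byteOffset, input.width * input.height
  );
  const outU32 = new Uint32Array(outBytes.buffer);
  for (let y = 0; y < input.height; y++) {
    for (let x = 0; x < input.width; x++) {
      const newX = newWidth - 1 - y;
      const newY = x;
      // One copy per pixel instead of four byte copies.
      outU32[newY * newWidth + newX] = inU32[y * input.width + x];
    }
  }
  return { width: newWidth, height: newHeight, data: outBytes };
}
```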
So the fact that one browser was slower here isn't saying that that browser is terrible. It's just saying that V8 is very good with this kind of tight loop code. Yeah. Other engines have optimized for, like, more DOM binding stuff. Exactly. So it wasn't that surprising that one browser was completely different in terms of performance with this piece of code. So we thought, what do we do? Maybe we throw WebAssembly at the problem, right? Hey. So we looked into that. And the first problem we had is that when you write WebAssembly and you load it, it turns into a module that has functions — the functions that you wrote in whatever language you were using. Yes. Right? This is different from an ECMAScript module. It's a Wasm module. It's a different thing. It's a different thing. And these functions can only take in and return numbers. Yes. So there's no easy way to just pass in an image. So what do you do? What we ended up doing — I'm going to reuse the video I made for my article. Oh, brilliant. Basically, the JavaScript is going to load the image, put it into the WebAssembly memory, and then we're going to use WebAssembly to do the reordering within that WebAssembly memory buffer, and use JavaScript to read it back afterwards. Right. That means the WebAssembly really is completely isolated from the outer world, so to speak. It just has its chunk of memory to work on. It will read in the image, do the reordering that we've shown before, and then JavaScript comes back, takes over, and reads back the resulting image. So the thing JavaScript and WebAssembly share is memory. Pretty much. But to WebAssembly, it's its memory. So this WebAssembly.Memory is WebAssembly-specific memory, but it is also exposed as an ArrayBuffer that we can wrap in a Uint32Array or whatever view we need in that very instance, right?
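The shared-memory idea above can be shown in a few lines: `WebAssembly.Memory` is WebAssembly's own memory, but JavaScript sees it as a plain `ArrayBuffer` and can wrap it in any typed-array view.

```javascript
// WebAssembly memory is allocated in 64 KiB pages. JavaScript can
// view the same bytes through memory.buffer.
const memory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB
const view = new Uint32Array(memory.buffer);

// JavaScript writes a pixel-sized 32-bit value into the buffer...
view[0] = 0xdeadbeef;
// ...and a Wasm module instantiated with this memory would see the
// exact same bytes at address 0.
console.log(view[0]);
```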
So the amount of memory we need for WebAssembly is essentially double the size of the image, because it's going to have the main image in memory and then the output right after it. OK. Yeah. So how do we create Wasm? We've done it before with Emscripten and C, and there's also Rust, but we actually found a very interesting project we stumbled over called AssemblyScript. Yes. They call themselves a TypeScript-to-WebAssembly compiler, which is true, but might be a little misleading, because you can't just take any TypeScript and compile it to WebAssembly. It uses the TypeScript syntax and the TypeScript standard library types, but with their own version of the library that is specifically tailored to WebAssembly. So what you can see here is the signature. Now we have types, as you know from TypeScript, but there's the i32 type, which is a type WebAssembly has but JavaScript doesn't. And that's a 32-bit integer, right? Yes, the signed 32-bit integer. Signed 32-bit integer. There's also u32, which is the unsigned one. Why are we using signed? For reasons. OK, let's gloss over it. But this is good, because I can recognize this. It looks a lot like JavaScript. It looks a lot like TypeScript. And so will the rest, except for two lines. So this looks the same: we swap height and width. Now this is a bit interesting, because we have this chunk of memory, and we kind of need to know where our input image starts and where our output image starts. Right. And that's what these two variables are. So our input image starts at zero, at address zero in this memory. Which it always does. Index zero, you could say. And the output image is right after the input image ends. And the input image consists of width times height times four bytes. Four bits per pixel. Bytes. Bytes per pixel. Thank you.
See, the thing about this — and I'm sorry to interrupt the flow — is that I should say that I came to the web as a CSS person, CSS front-end, and I learned JavaScript. Whereas you came to the web from being a programmer. Well, I did embedded systems. I was literally writing kernel code and low-level memory management. And I had no idea about CSS or how to do UI or anything like that. Right. It's just two completely different angles. But I would say that if anyone is watching this thinking, what is going on, I am feeling exactly the same. So don't worry too much about it. Right, come on. But for now, these are basically just indices into the array: where does the input image start, where does the output image start? And then this looks familiar — looping over all the pixels, figuring out what the new coordinates are. We did all this before. And now there are these two AssemblyScript-specific functions. The first one is load, which allows me to load a u32 from the memory at a given address. Right. So in this case, what I'm doing is using the place where the input image starts, plus the pixel I want to read. So this is very similar to what we were doing before with the Uint32Array, but this is a special instruction to get it straight from memory rather than... Yeah, because it's the WebAssembly memory, and that's kind of implicit. It's not something you get handed as a reference. It's just there, almost like a global. But it's the same thing. We're passing the same indices into it. Exactly. So we're loading our pixel, and then all we have to do is write it back to the output image. And it's the same thing: we're storing the value v, which we just read, back as a u32 into the output image's space. OK. And now we have written AssemblyScript. And this compiles to WebAssembly. What really struck me with this is that if I wanted to write WebAssembly, this is the tool I would use. Yeah. Because this looks really familiar to me.
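The flat-memory version described above can be simulated in JavaScript. This is a sketch, not the AssemblyScript source: a single `Uint32Array` stands in for WebAssembly memory, and plain indexed reads and writes stand in for AssemblyScript's `load<u32>()`/`store<u32>()` builtins (which address memory in bytes; here the "addresses" are u32 element indices for simplicity).

```javascript
// Rotate in one flat chunk of "memory": the input image lives at
// address 0, and the output image starts right after it.
function rotate90Flat(memory, width, height) {
  const newWidth = height;
  const inStart = 0;               // input image always begins at address 0
  const outStart = width * height; // output begins right after the input
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const newX = newWidth - 1 - y;
      const newY = x;
      // load(inStart + offset) in AssemblyScript...
      const v = memory[inStart + y * width + x];
      // ...and store(outStart + offset, v) become plain indexing here.
      memory[outStart + newY * newWidth + newX] = v;
    }
  }
}
```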
You don't have to learn a new language, right? Yeah. Because I think you've learned a bit of C because of Squoosh. Yes. But that's pretty much it, as far as I know. You've not written Rust, I think. I can read it. I know PHP. Exactly. If there were a PHP-to-WebAssembly compiler, I would love it. That was the first language I learned. So we have this function now, and we want to compile it to WebAssembly. And luckily, AssemblyScript makes it very easy. We just install the AssemblyScript package, and then we have an asc command, which we give our TypeScript file, and it gives us back a WebAssembly file with no additional glue JavaScript. Which I think is quite interesting, because most other toolchains for WebAssembly give you glue code, an additional JavaScript file on the side, which can be really difficult to deal with and work with. But this is just the Wasm, right? So we did this, and we got a rotate.wasm file. And now the interesting bit might be how to load it, because usually the glue code loads it for you. But now you don't have glue code. How does this work? It's actually not that difficult. What you do is take the instantiateStreaming function from the WebAssembly object and put your fetch in there. Because the WebAssembly compiler, at least the non-optimizing one, can compile while the Wasm file is still downloading. So this instantiateStreaming takes a promise? A promise for a Response? That's a weird API. Why does it take a promise? Because they want to make this simple: you don't have to await the fetch, right? OK. I don't agree with it, but that's fine. You could just put an await in there. Either way, I find it really interesting. It starts compiling while it's still downloading. So it's not download, then compile — it's almost in parallel, which matters for WebAssembly, because modules can be quite big. I think the Unreal Engine one is like 40 megabytes. That will make quite a difference. Yes, absolutely.
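The Module → Instance → exports chain can be shown without a network request. This sketch compiles the smallest valid Wasm binary — just the magic number and version, so it exports nothing — purely to illustrate the API shape; in the real app the bytes come from `fetch()` via `instantiateStreaming`.

```javascript
// The smallest valid WebAssembly module: "\0asm" + version 1.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // magic number "\0asm"
  0x01, 0x00, 0x00, 0x00, // binary format version 1
]);
const module = new WebAssembly.Module(bytes);
const instance = new WebAssembly.Instance(module);
// A real module would expose its functions (and memory) here.
console.log(Object.keys(instance.exports));
```

With a server involved, the streaming form is a one-liner, roughly: `const { instance } = await WebAssembly.instantiateStreaming(fetch('rotate.wasm'));` — the promise for the `Response` goes straight in, and compilation overlaps the download.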
Not so much here. No, absolutely not. The Wasm module, by the way, is like 500 bytes or something. It's really small — smaller than the minified and gzipped JavaScript code that we had. So that's actually quite cool. So now we get an instance back from this, and on that instance we have exports. The exports are all the functions, but also the memory that we are going to work on. Right. So we grow our memory, because we don't know what size it starts with, but we have to grow it to a size that fits our image two times, right? We'd have to do a little calculation there, which I'm skipping here. So that would just be the size of the clamped array data, times 2. Yeah, exactly. And then I somehow load this image into the buffer. The memory has a .buffer property, which is a normal ArrayBuffer, so we can use all the methods you know to put data in there. Right. Put it in. And then you call rotate90 and read the image back, and you're done. Ah, so exports has all of the functions. Yeah, so this is the method. This is the magic where you call into WebAssembly. And it's also synchronous. So WebAssembly is something that will actually take the control away from JavaScript, do its thing, and then return control back to JavaScript. It's just like an actual function call. OK, OK. Which I think is super nice. And so this was fast. We were super happy about this. Yes, this was much faster than... Well, it wasn't faster in Chrome, in the sense that it didn't outperform the JavaScript — it was as fast or almost as fast — but it was consistently fast across all browsers. Yes, it took the browser that doesn't run on Mac from seven seconds down to like... 500 milliseconds. Something that was very, very acceptable. Yeah, it was really nice to see that similar value across all browsers. So we were super happy about this.
So we, you know, opened the PR on Squoosh, you reviewed it, and we wrote an article. And then... Hacker News happened. Hacker News happened. And that's something I would never usually say, because usually the comments on our articles are quite annoying. Hacker News can sometimes be quite pedantic, I've found, but in this instance, while there was some pedantry, the pedantry was really interesting. It was really interesting. There were some fascinating results, and a lot of it I didn't understand. And I hope you're going to explain it to me now. Yeah, so someone said, why aren't they using tiling? Tiling would make this so much faster. Let me quickly try it. Yeah, I totally did this in like 20 milliseconds. I was like, what? Yeah. So I sat down... So they were taking it from, what was it, around 400 milliseconds, down to what? I think 40. 40, which is such a huge improvement. And that is even faster than we were seeing from a canvas element. Yeah. And I had to, obviously, sit down and actually understand what's happening. So let's talk about what tiling actually is. Yes, please do, because I have no idea. So I'm going to explain tiling, but there was also another suggestion for a performance optimization, and I'm going to talk about both of these. But I'm going to talk about the other one first, to get it out of the way. Basically, some people were saying: if you look at this y times width, it's completely independent of the inner loop. So if you move it out, between the outer and the inner loop, it would be faster, because that calculation only happens once per outer iteration instead of every time in the inner loop. Yes, and I thought this was going to be the kind of thing that the optimizer thing would take care of for me. And it is. So this is exactly the kind of advice where you don't have to worry about these kinds of things.
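The hoisting suggestion can be sketched as a before-and-after. Both functions compute the same indices; the point is precisely that an optimizing compiler performs this transformation itself, so the source-level change buys nothing.

```javascript
// Before: y * width recomputed on every inner-loop iteration.
function indicesNaive(width, height) {
  const out = [];
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      out.push(y * width + x);
    }
  }
  return out;
}

// After: the loop-invariant y * width hoisted out of the inner loop.
// Compilers do this (loop-invariant code motion) automatically.
function indicesHoisted(width, height) {
  const out = [];
  for (let y = 0; y < height; y++) {
    const rowStart = y * width; // computed once per row
    for (let x = 0; x < width; x++) {
      out.push(rowStart + x);
    }
  }
  return out;
}
```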
Like, moving constants out of a loop is something that most compilers can do. So the AssemblyScript compiler can do this, or the Rust compiler. But even the compilers in V8 that go from JavaScript to machine code, or from WebAssembly bytecode to machine code, will do this. So this is an operation we don't have to do ourselves, and where we can say: let's keep it readable and obvious, and not introduce another variable where people reading the code would have to keep even more state in their head to understand what's going on. Yes, OK. But the other thing is tiling. And tiling is something that I hadn't heard of. I actually had heard of it, but I was also under the impression that compilers would do it for us. And in this case, they do not. What is it? So what is tiling? Here's an image. It's actually the album cover for a podcast. I don't know, did you know that we do a podcast? We do a podcast as well. We should link to it in the description, Jake. Yes, we should. So we have been reading this image like this: we've been going row by row, and just, you know, which pixel is this, where does it belong, OK, copy. And then look at the next pixel in the same row. That's kind of what we did, and we thought, fine. Tiling is a different approach, where you tile the image into tiles. That's good. Yeah, those are tiles. Excellent. And then you do whatever you're trying to do within a tile first. So instead of going row by row over the whole image, you go tile by tile, and within each tile, you go row by row. This is legitimately... It's the same thing. This is legitimately a different way of doing the same thing. I know. Now, the interesting thing is that this turned out to be so much faster. Yeah, like a tenth of the time. I still don't understand it yet. So let's implement this real quick. Can I just say — in one of our recent episodes, we talked about the dangers of over-optimization. Yeah. So why are we doing this?
Because it ends up being so much faster. OK, OK. With this optimization, we end up going well below 100 milliseconds, which, within the RAIL guidelines, makes it feel like an instantaneous response to pressing the button. It is an optimization. And before that, we were at like 300 to 500, which was fine. But, you know, if we can go under 100, we should go under 100. Especially for bigger images. OK. So basically, I just add an additional two outer loops — which usually sounds wrong, but in this case is very, very right — where we iterate over all the tiles that we have. And then in there, we basically have the same old loop, where we loop over the pixels of each individual tile. I'm starting to hyperventilate. Why... OK. So this is tiling implemented. So I get it. And let's talk about why this might make things faster. OK. That is the bit I don't understand. So whenever I Google tiling and research it, it's mostly about matrix multiplication, which is a different use case, because there the input values are used multiple times. If you multiply two matrices, you have to read the cell at (1, 1) multiple times — once for each column that you're calculating in the output matrix, I think. OK. So if you do tiling, you have a better chance of having that value still in the cache. We're talking processor level 1 cache here, by the way. So hang on. OK. We'll need to explain what that is at some point as well. But my feeling is, by reading memory sequentially, you're more likely to hit caches, because you're dealing with a bit of memory that's very close to the last bit of memory that you dealt with. Yeah. So if I have these two really big matrices, and I'm going along a row of the input matrix, by the time I end up at the end, the values from the start might have been kicked out of the cache, because the level 1 cache in the processor is really small.
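The two extra outer loops can be sketched like this (our reconstruction; the tile size of 16 is taken from the sweet spot in the benchmarks, and the edge tiles are clamped so image sizes that aren't multiples of 16 still work).

```javascript
const TILE_SIZE = 16; // 16x16 = 256 pixels per tile

// Tiled 90° clockwise rotation over 32-bit pixels: two outer loops
// walk the tiles, the familiar inner loops walk pixels within a tile.
function rotate90Tiled(input) {
  const newWidth = input.height;
  const inU32 = new Uint32Array(
    input.data.buffer, input.data.byteOffset, input.width * input.height
  );
  const outBytes = new Uint8ClampedArray(input.data.length);
  const outU32 = new Uint32Array(outBytes.buffer);
  for (let tileY = 0; tileY < input.height; tileY += TILE_SIZE) {
    for (let tileX = 0; tileX < input.width; tileX += TILE_SIZE) {
      // Clamp so partial tiles at the edges stay in bounds.
      const yEnd = Math.min(tileY + TILE_SIZE, input.height);
      const xEnd = Math.min(tileX + TILE_SIZE, input.width);
      for (let y = tileY; y < yEnd; y++) {
        for (let x = tileX; x < xEnd; x++) {
          const newX = newWidth - 1 - y;
          const newY = x;
          outU32[newY * newWidth + newX] = inU32[y * input.width + x];
        }
      }
    }
  }
  return { width: newWidth, height: input.width, data: outBytes };
}
```

It really is the same copy as before — only the visiting order changes, which is why it is surprising that it runs so much faster.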
We're talking, like, 200 kilobytes of cache, maybe, or less. Right. So the processor has an L1 cache, which is super fast, and then there's a set of caches that get bigger and slower until you get to main memory. So main memory is actually really slow, in relative terms. What tiling does is, by shortening the amount of time you spend going away from your initial value, you have a better chance of having that initial value still in your level 1 cache. For matrix multiplication, that is. So with this one, it didn't make sense why it would be faster. Because the second row is a massive jump from the first... For the rotation, we read every value once and we write it once. So why would caching make things better? That is roughly the question I had in my head. So there are two theories, and I don't know which one of them is actually true. Why are you telling me this when you don't even know? Well, I even talked to Benedikt, our V8 engineer, and he's like, I have two theories, but it's really hard to test. OK. So one theory is that lots of processors nowadays are really smart at predicting what memory you are going to grab next. So by seeing the tiled access pattern, the processor can make better predictions about which cells to grab and put into the cache for you, before it has even executed that code. And the other theory is that because the cache is so small, there's a fixed pattern for which memory cell can be cached in which cache cell. This gets a little bit confusing, but think about it like this: if you have three cache cells, just three individual cells... What can go in a cell? Like, one value. One value. OK, OK. You say: memory address 0 can only go in cache cell 0. Address 1 can go in cache cell 1, address 2 can go in cache cell 2. Memory address 3 can only go in cell 0 again. You wrap around, right? So you assign those.
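The wrap-around scheme above is a direct-mapped cache, and the eviction effect can be shown with a toy simulation (a deliberate simplification: real caches work on cache lines and sets, not single values).

```javascript
// Toy direct-mapped cache: address N can only live in cell N % numCells,
// so a new address evicts whatever was sharing its cell.
function countMisses(addresses, numCells) {
  const cells = new Array(numCells).fill(undefined);
  let misses = 0;
  for (const addr of addresses) {
    const cell = addr % numCells;
    if (cells[cell] !== addr) {
      misses++;          // not cached: fetch from slow memory...
      cells[cell] = addr; // ...and evict the previous occupant
    }
  }
  return misses;
}

// Addresses 0 and 3 share cell 0 with 3 cells, so alternating between
// them misses every single time...
console.log(countMisses([0, 3, 0, 3], 3));
// ...while 0 and 1 live in different cells, so only the first touches miss.
console.log(countMisses([0, 1, 0, 1], 3));
```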
And then again, by keeping the working area smaller, you have a better chance of not overwriting the value you put into your level 1 cache earlier. So basically, all this is about making the region of memory you access smaller, so that you don't evict from the cache the things you still need. So this is basically working because our inner loops are smaller. Yeah. Right. It lets the processor make better predictions and also stops it evicting the cache, because the area we work on is smaller. So then there's the tile size. Yeah, what's the right tile size? That's what I wondered, right? OK. So I did some benchmarks on a MacBook, on an iMac, and on a Pixel 3, because the bigger the machine or the processor, the bigger the level 1 cache usually is. Right. So the iMac that I have is an 18-core massive processor thing that has a massive L1, while the Pixel 3 obviously has a very, very tiny level 1 cache. All this code is single core anyway, right? Yeah. So the zero line is the relative time it took with no tiling. That's the original piece of code, our baseline time. The Wasm one. Yes. So what you can see here is how the time shifted relative to that base time, depending on the tile size. Interesting. So if I have a tile size of 2 — a 2 by 2 pixel tile — it makes the code slower, which is not very surprising, because you have so much more looping going on and more jumps. OK. Then it gets faster really, really quickly. At some point over here, you hit the level 1 cache boundaries, where it then gets slower again. Right. I see. OK. To be honest, there's one weird thing where the Pixel 3 is slow even with a massive tile size, and I'm not quite sure why that is. I would actually have expected the Pixel to go up somewhere around here. You would assume its level 1 cache is smaller than any MacBook's. It probably is, and there's probably some other effect that I don't quite understand. OK.
But what I found really interesting is that it's also a different architecture in that processor. Anyway, there seems to be a sweet spot between 16 and, I don't know, 64, depending on what you want. I think 16 looks really promising in this graph, which means you have a 16 by 16 tile — 256 pixels — that you work with. I thought I was going to go away understanding why the tiling works. No, it just does. That's as much as I got, and I spent the last week on this, right? Right. You've been kind of sitting across the room hearing me talk to people and trying to figure this out. This is as close as I've gotten to understanding it: there is this interaction between the processor predicting what values will be needed in the cache, and not forcing the processor to evict that cache because you read too far ahead. But this is a massive case for tools, not rules, right? Don't go away and rewrite all your code. With tiling. With tiling. No. This is something you would have to very carefully profile on a wide range of machines with different processor architectures, to see if it's actually working across them. And it's also interesting because we started at "let's rotate an image" — a very high-level use case — and we fell down the rabbit hole and ended up talking about processor architecture and level 1 caches. Yes. So thanks to Hacker News for being so inspirational, even though I still don't fully understand it. But I feel like I'm OK with that. Yeah. Like I said, there's that confusion element, but I feel like I've got an appreciation for the smarts that go into all of this, right? Yeah. It's incredible. So let's take a breather, and we'll see our poor audience next time. But this is going in the description. Yeah. Yes. Something for the edit, something for the edit. OK. Let's go from here.