So let me tell you what SHA-256 is. SHA-256, the hash function, is something that generically takes any message, an arbitrary-length message, as in a sequence of bits of any length, and spits out 32 bytes. The digest of this thing is 256 bits. The procedure consists of three parts, so I'll just go over the three parts. The first part is breaking your message into a multiple of 64 bytes; those are the chunks, 512 bits each. The second part consists of scheduling words, computing a bunch of double words, which are 4 bytes each. And the third part consists of performing a bunch of rounds, which is the function that actually does the hashing itself, the function that is not invertible. Here's just a brief description on this slide, which I stole from somewhere on the internet, but I want to go into some detail on each one of the three parts.

So let's start with the easiest one, which is scheduling words. As you see here, you're given a message, and let's suppose that we've already broken the message into pieces, each one of them 64 bytes. The first thing you do is compute the scheduled words: 64 double words in total, 4 bytes each. The first 16 are your message, the 64 bytes that you were given; those are 16 words. With those 16 words you can compute the next words, and with the words you just computed you can compute the next ones, and the next ones, until you've computed the 64 that you need. And the important thing that you need to remember from this slide is that to compute those scheduled words, the only thing that you need is the previous words, nothing else. I think the consensus people are starting to come in. Anyways, I've already started, guys. So when you're scheduling words, the important thing is that you don't need to know the previous state of the hashing. You only need to know what the previous scheduled words were. In particular, the schedule only depends on the chunk that you're hashing now; it doesn't depend on the other chunks that you're going to hash. So if your message consists of 10,000 chunks, you can compute the scheduled words for the 10,000 chunks without caring about how you would hash each one of them, and without caring about the rounds part. Scheduling words only requires the previously scheduled words.

The diagram there is taken from an Intel paper describing two new instructions that do this scheduling of four words at a time with only two instructions; this can be done now on modern CPUs. It's just a sketch of how, diagrammatically, you compute four words at a time. But the method that you use to compute those scheduled words is irrelevant. What I want you to remember is that to compute them, you need to know the previous words and nothing else.
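A minimal scalar sketch in Go of this word scheduling, following the FIPS 180-4 definition (an illustration, not the talk's library); note that the loop reads only previously scheduled words, never the hash state:

```go
package sha256sketch

import (
	"encoding/binary"
	"math/bits"
)

// messageSchedule expands one 64-byte chunk into the 64 scheduled
// words of SHA-256. It depends only on the chunk itself and on words
// already scheduled; the running hash state never appears.
func messageSchedule(chunk [64]byte) [64]uint32 {
	var w [64]uint32
	// The first 16 words are the chunk itself, read big-endian.
	for i := 0; i < 16; i++ {
		w[i] = binary.BigEndian.Uint32(chunk[4*i:])
	}
	// Every further word depends on just four earlier words.
	for t := 16; t < 64; t++ {
		s0 := bits.RotateLeft32(w[t-15], -7) ^ bits.RotateLeft32(w[t-15], -18) ^ (w[t-15] >> 3)
		s1 := bits.RotateLeft32(w[t-2], -17) ^ bits.RotateLeft32(w[t-2], -19) ^ (w[t-2] >> 10)
		w[t] = w[t-16] + s0 + w[t-7] + s1
	}
	return w
}
```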
Rounds: we don't care about what the rounds are. We only care that it's a function that is not invertible, that is computationally hard to invert. But the important thing that we need to remember is this: we were given the message, we broke it into pieces of 64 bytes (I haven't told you how yet), and we computed those scheduled words that we need. Then what we do is pass in an incoming digest. If it's the very first chunk that you're hashing, we pass a constant digest that the method specifies. If it's the third chunk, you pass the hash that the second chunk produced. The point is that you have a state, which is your current hash, and you pass it through this function that takes this state, the hash, the 32 bytes. It takes one of the scheduled words that you computed, and it takes another constant word that the protocol specifies. So it takes this data and produces for you a new hash. What you need to remember from this is that you need to have computed at least that scheduled word before passing through the rounds, and you cannot do this in parallel. To pass through a round, you need to have already computed the hash before it. So this is something that you cannot do in parallel.

Okay, so the padding block. This is the first part of the three-part process, which is breaking the message into a multiple of 64 bytes. How do you do this? Well, you just break your message into multiples of 64 bytes. You add one bit at the very end of the message just to signal that the message has ended. Here the message is less than 64 bytes; it's 24 bits, the three bytes there, A, B, and C. This one here shows that we added that extra bit to mark that the message has finished. And then you pad with zero bits up to a multiple of 64 bytes minus eight, because you're going to use the last eight bytes, or 64 bits, to encode the length of the whole message; there in binary is the number 24, which is the actual length of this message. So this is the procedure to pad your message, which was arbitrary length, into a multiple of 64 bytes.
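A sketch of that padding procedure in Go, again following FIPS 180-4 rather than any particular library; for the 3-byte message "abc" it produces exactly one 64-byte block ending in the bit length 24:

```go
package sha256sketch

import "encoding/binary"

// pad extends a message to a multiple of 64 bytes: a single 1 bit
// (the 0x80 byte), then zero bytes, then the original length in bits
// as a big-endian 64-bit integer occupying the last eight bytes.
func pad(msg []byte) []byte {
	bitLen := uint64(len(msg)) * 8
	out := append([]byte{}, msg...)
	out = append(out, 0x80) // the "message has ended" bit
	for len(out)%64 != 56 { // leave room for the 8-byte length
		out = append(out, 0x00)
	}
	var length [8]byte
	binary.BigEndian.PutUint64(length[:], bitLen)
	return append(out, length[:]...)
}
```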
Vectorization. The main topic of this talk is vectorization. So the first thing to notice is that you're not going to beat any of the hashers out there. Every implementation that I reviewed before I got into this, and I'm not an expert at all on this, I'm from a completely different subject, but every implementation I've seen is equivalent to Intel's original white paper on this, and it's equivalent to what OpenSSL, for example, does. All of them implement the following. You can compute your scheduled words, as I told you, without knowing what the current state of the hashing is. So all of them use vector instructions to compute several words at a time. If your CPU supports 128-bit registers like these ones, and almost every CPU now does, then you're going to compute four words at a time, four double words at a time. You start with your 16 double words, which are your message, and you can put those 16 double words in only four registers. If your computer supports AVX2, that is, registers that are 256 bits, then you're going to use only two registers, and you're going to compute eight words at a time. If your computer supports AVX512, you're going to compute all 16 of them at a time. And well, AVX1024 does not exist yet, but it's already in the books. There are two options for computing this. Modern implementations either do this vectorization, or, if your CPU implements cryptographic extensions, like what was there in the picture, they will use 128-bit registers regardless of whether your computer has larger registers, because the cryptographic extensions can compute four words at a time much faster than vectorized computations. But the point I want to make is that this vectorization is there in every current implementation. This has been the case since Intel's white paper.

Also, computing the words like this is useful because you can compute the words and at the same time pass through some rounds. Let's say that you've computed the fifth word; then you can pass up to the fifth round. So you can mix scalar operations with vector operations, and the CPU will perform them in parallel: the CPU has several ports and can take different types of operations at the same time. So you can be computing the fifth round, a computation that has to be done on the scalar part of the CPU because it cannot be parallelized, while at the same time you're computing the sixth word in parallel on vector registers.

Okay, so I've covered what a typical implementation of hashing looks like, and this is the thing to remember from this whole part of the talk: the hasher signature is this. In any language, it's something like this. It's something that takes an arbitrary-length byte slice and gives you back 32 bytes, which is the hash. So it takes an arbitrary-length message and gives you a digest. And this is something that you're not going to beat. You're not going to implement this better than OpenSSL. No one is. Perhaps you can do it for one particular CPU, but you're not going to write an implementation faster than what is already there.

However, we use hashing in a very restricted scenario. We use hashing to hash Merkle trees. And Merkle trees are not arbitrary length. In the case of the consensus layer, for example, the nodes are 32 bytes. They're something like this: each one of these nodes represents 32 bytes, each parent node has only two children, and these two children are hashed together. So the two children, concatenated, are 64 bytes; you hash them and you get the hash of the parent, what goes in the parent. So this is what we hash, typically. Well, actually, the execution layer has something completely different, but the same technique is still going to apply. And we immediately observe two things. First of all, we're not hashing arbitrary lengths; we're hashing exactly 64 bytes every single time. And second of all, we can hash this in parallel. You can hash the blue one, which only needs the left subtree, and hash the yellow one completely in parallel on the other side. Of course you can do this in parallel, and if you have two CPUs, you can just use two threads to hash these, and this is exploited in the consensus layer. I don't know if anyone besides Lighthouse uses this, but I want to talk about a different kind of parallelization. You can parallelize this across different threads, but the point I want to make is that you can do this by using vector instructions.

This is the typical one; this is the one that is in the consensus layer specs. I think this is Vitalik's implementation. It's a flat-array approach to holding a Merkle tree. I highlighted the line where you compute the hash of the parent by hashing the two children, concatenated. The top one is a different memory layout; this is Protolambda's implementation in remerkleable. The point is that it's the same kind of hashing: you take the two children, you hash them, and you hash two blocks at a time. This is fairly, fairly inefficient. This one is a Go implementation, Jim's implementation, a production implementation. And then again, the same thing. This one is slightly more complicated because it takes other things into account, but the highlighted line is the point where you're hashing, and you hash one node as the hash of its two children. And you do this in a loop: for each pair of children you hash once, you call the hasher, and you get one hash.
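Here is a sketch in Go of that one-hash-per-parent pattern (a hypothetical helper, assuming a power-of-two number of leaves); note that every call to the hasher covers exactly 64 bytes, and that the calls within a layer are independent of one another:

```go
package sha256sketch

import "crypto/sha256"

// merkleRoot hashes a layer of 32-byte nodes pairwise, layer by layer,
// with one hasher call per parent: the pattern highlighted in the spec
// implementations. Assumes len(leaves) is a power of two.
func merkleRoot(leaves [][32]byte) [32]byte {
	layer := leaves
	for len(layer) > 1 {
		next := make([][32]byte, len(layer)/2)
		for i := range next {
			var buf [64]byte
			copy(buf[:32], layer[2*i][:])   // left child
			copy(buf[32:], layer[2*i+1][:]) // right child
			next[i] = sha256.Sum256(buf[:]) // one full hash per pair
		}
		layer = next
	}
	return layer[0]
}
```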
Okay, so I want to tell you what the right way of hashing a Merkle tree is, and it's fairly, fairly simple. There are two things that we want to exploit. The first is the fact that our message is exactly 64 bytes; that's the thing that we want to hash. So remember how we padded our message into a multiple of 64 bytes? Well, our message is already 64 bytes. But unfortunately, we still need to add one bit to say the message has ended. That's the first bit there. The problem is that since we added that bit, we now need an entire extra block. And that entire block is all zeros, except for this little one bit here, and this value here saying the message was 64 bytes; that is, the last eight bytes encode the 512 bits that the message was. But the point is that this block is known. It's known before we even start hashing. The whole block is known. And in order to compute the scheduled words, we only need to know the message; we don't need to know the hash of the previous block. So whenever we hash 64 bytes, we can compute the scheduled words for this entire padding block ahead of time. We have them, so we can hard-code them in the hasher itself. This is something that the Bitcoin community figured out, and they have it in their standard node, I think it's called Bitcoin Core, the client. And no Ethereum client used this, so I was surprised when I found this out. So we can steal this from the Bitcoin community. And indeed, Prysm now uses this library, and Lodestar has also implemented this. Just by implementing this, you get at least a 20% gain in your hashing speed. Just this, no changes to your code, just include these hard-coded words.

The second thing is vectorization itself. And this is also something that is implemented already: if you have several messages that you need to hash at the same time, there are libraries to do this. Intel has a very good library, and all I needed to do was change the signature of that library. The point is that you can hash several buffers at the same time. This is a way of laying out the words again: these are the 16 words corresponding to this node here, node zero; then you have 16 words corresponding to node one, 16 words for node two, and 16 words for node three. And what you do is collect the first double word from each of the four messages into one register. That's if you have registers that are 128 bits. If you have AVX2, you collect eight of them at the same time. And again, if you have AVX512, you can do 16 at a time. The point is that in one pass you can hash up to 16 blocks on AVX512. And you can do the same thing for the digest words themselves. The digest consists of eight double words. If you have registers that are 128 bits, you can put four of them in one register, use eight such registers for the eight words of the digest, and hash four blocks at a time. So again, this is a library that exists.
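To make the first of those two tricks concrete, here is a sketch of generating the hard-coded schedule for the padding block, reusing the messageSchedule sketch from earlier; the returned table is what would be baked into the hasher as constants (the trick borrowed from Bitcoin Core, as described above):

```go
package sha256sketch

// paddingBlockSchedule computes the 64 scheduled words for the padding
// block of any 64-byte message. The block is fully known in advance:
// the 0x80 end-of-message byte, zeros, and the bit length 512 encoded
// big-endian in the last eight bytes, so this can run once at build
// time and the result shipped as constants.
func paddingBlockSchedule() [64]uint32 {
	var block [64]byte
	block[0] = 0x80  // the "message has ended" bit
	block[62] = 0x02 // 512 bits = 0x0200, big-endian in bytes 56..63
	return messageSchedule(block)
}
```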
So all I'm selling you is something that already exists. It's not that we're rolling our own crypto. But I need to brag about something; I was told as a student that if you're giving a talk, you need to brag about something you did yourself, and I guess this is something I did. Of course, the assembly for Intel and AMD, Intel's libraries, I'm not going to beat. Unfortunately, and reasonably, Intel does not produce assembly for ARM. And there's something interesting here about ARM assembly. I told you that every library out there uses vectorization to compute the scheduled words. That's not true on ARM. On ARM, OpenSSL had an implementation using NEON instructions, vector instructions, to compute the hashing, and it turned out to be slower than scalar. So most libraries for ARM do not use vector instructions. However, if you're going to hash a Merkle tree, you can hash several blocks at the same time, and that is much, much faster than hashing them with scalar instructions. So the pipelining for ARM, I think, is purely mine.

Let's say that you want to use this library and implement it. There are changes to your code, and the changes are the following. Let me go back a few slides, perhaps. Ah, it does have a pointer; I'm just bad with technology. Okay, anyways, let's see. As I told you, with this kind of register I can hash four blocks at a time. That means that I can compute this entire layer in one pass of the hasher. So instead of going in a loop and hashing these two, hashing these two, hashing these two, and so forth, what I'll do is this, and it is a requirement: you cannot use Protolambda's implementation, which has pointers everywhere so the data can be anywhere. If you want to use this library, then you need to have, at the very least, this entire layer consecutive in memory, and this entire layer consecutive in memory, and so forth. Vitalik's flat array would work fantastically for this. If I were to implement this, I would just put everything in one array. The point is that what you're going to pass to the hasher is a pointer to this, or this whole slice, whatever the equivalent signature is, and the hasher is going to give you back all of these hashes at the same time. So this is something that you need to change. The hasher signature for this has this form. In Go, it takes an arbitrary byte slice, which is the layer that you want to hash, and it gives you back a slice of hashes, of these 32-byte hashes, all of them at the same time. I think I might have gotten this correct in Rust after several iterations. This would be in Python. We have two libraries: one in Go assembly, and one in pure assembly with C bindings. If you are going to use the latter, let me know, because crypto extensions on ARM are still not implemented there; it's going to take like 10 minutes to add them. But this is the signature that the library is going to use if you're using the C bindings. All right, so that's all I wanted to tell you, and that's it.
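For reference, a portable sketch of that batched signature; HashLayer and Root are hypothetical names, and this fallback loops one block at a time where the real library would transpose 4, 8, or 16 blocks into vector registers per pass:

```go
package sha256sketch

import (
	"crypto/sha256"
	"errors"
)

// HashLayer takes one whole tree layer, contiguous in memory, and
// returns every parent digest at once. This is the signature change
// the talk describes; the loop body here is a scalar stand-in.
func HashLayer(layer []byte) ([][32]byte, error) {
	if len(layer) == 0 || len(layer)%64 != 0 {
		return nil, errors.New("layer must be a non-empty multiple of 64 bytes")
	}
	out := make([][32]byte, len(layer)/64)
	for i := range out {
		out[i] = sha256.Sum256(layer[64*i : 64*(i+1)])
	}
	return out, nil
}

// Root reduces a contiguous bottom layer (a power-of-two number of
// 32-byte nodes) to the Merkle root by feeding each produced layer
// straight back into the hasher.
func Root(leaves []byte) ([32]byte, error) {
	layer := leaves
	for len(layer) > 32 {
		hashes, err := HashLayer(layer)
		if err != nil {
			return [32]byte{}, err
		}
		next := make([]byte, 0, len(hashes)*32)
		for _, h := range hashes {
			next = append(next, h[:]...) // flatten into the next layer
		}
		layer = next
	}
	var root [32]byte
	copy(root[:], layer)
	return root, nil
}
```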
Okay, sure, I'll just repeat that for the sake of the recording. The question was whether the ability to process multiple different hashes in parallel was already implemented in libraries. So there are two things that you have here, right? One is to pre-compute the padding block, and then the second is the ability to process multiple different hash values in parallel and produce multiple different hash values, right? Correct. And so there are already implementations that do multiple different hash values at once, and you just modified them to deal with the padding block, is that right? Yes, so there are two modifications here. One is that you're going to use the padding block, with its constants hard-coded, and then there's the other modification, which is the fact that you're expected to get a list of 64-byte chunks, and then you pipeline this. So what this library does is it grabs all of the blocks, consecutive in memory. It gets a matrix in the vector registers. It transposes this matrix, and now you have the different messages across all of your registers. Then you can use Intel's machinery to hash those messages in parallel. You output these hashes, and then you just loop back. Okay, cool. So the result, it's in Go assembly, what you have, right? Sorry? Your implementation is in Go assembly? Yeah, so the original implementation was just pure assembly; that's there. It has C bindings, but the cgo overhead is horrible, so it ended up being slower than using the Go implementation, the standard library in Go, so we needed to write a Go assembly library to use ourselves. Yeah, okay, cool, well done. Very good, very impressive. Thanks.

So I was curious: you've shown that there are always optimizations you have to do to get around the general-purpose nature of these hash functions. Do you think it would be possible to design a hash function that was a special hash function only for... Oh, that's a good question. I'm not sure if you can do better than the current implementations. You could take the sponge-type constructions, SHA-3 and company, and try to adapt them for this. I don't know. I think, yeah, your question is completely open. It's a very good question. I think we should think about this. So if I understood, the question asked: can there be a method that is designed for Merkle trees instead of a generic one?

In your implementation, say there are some really big Merkle trees in the beacon state, right? Is it the job of the implementation to split that Merkle tree up into smaller subtrees that fit the size of your CPU? No, no. You're asking, if you have this large tree, whether it's the job of the implementation to split it into smaller trees. No, this is completely orthogonal to that. What you guys are doing in Lighthouse is splitting this into smaller trees and sending them to different threads to compute them in parallel. This is completely different. You just pass the entire slice. Big trees are where the big gain is; you're going to be hashing at least four times faster than the standard library. You just pass the entire slice, all of the bottom layer, and what this thing is going to do is not split it into subtrees. What this thing is going to do is grab as many blocks as it can fit in your registers, and then just go to the next chunk and the next chunk and the next chunk, like this. So it doesn't split into subtrees. Okay, and so you feed it all of the leaves and it produces the intermediate nodes, it produces the next layer, and then you just feed it the next layer entirely and it produces the next one.
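As a conceptual sketch of that transpose step, here a [4]uint32 stands in for one 128-bit register; a real implementation does this with SIMD shuffle instructions rather than Go loops:

```go
package sha256sketch

import "encoding/binary"

// transpose4 takes four 64-byte blocks sitting consecutively in memory
// and regroups them so that "register" w holds word w of every block.
// After this, the four messages can be scheduled and hashed in
// lockstep, one round across all four at once.
func transpose4(blocks [256]byte) [16][4]uint32 {
	var regs [16][4]uint32
	for w := 0; w < 16; w++ { // word index within a block
		for b := 0; b < 4; b++ { // which of the four blocks
			regs[w][b] = binary.BigEndian.Uint32(blocks[64*b+4*w:])
		}
	}
	return regs
}
```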
Oh yeah, I got you. So for the state, you're thinking of the beacon state: for the state, this is incredibly fast, but this is not how we hash the state, because we typically have it cached and you just have a few nodes that are changing. So when you hash the dirty leaves, well, there are two things that can happen. Sometimes you have several of them that are consecutive, and then you can be smart and pass that consecutive run to this hasher, or you can just use whatever you're using now and hash two blocks at a time. You're not going to get the vectorization impact, but you're going to get at least the 20% from the hard-coded padding block. Yeah, okay, cool. So I guess there's maybe an argument that it's quite useful for small trees, where you don't get a lot from caching. Yeah, it's not really useful for small trees; there you only get the 20% from the padding block. It's very, very useful on large trees, and here large means more than eight blocks. So anything that has depth more than two is already four times faster.