Hello all, welcome to our talk on the design of the new AES instruction set extensions for RISC-V. The aim of this talk is to give you an overview of what RISC-V is and what our role within it is, to talk a little about instruction set extensions for cryptography generally, and then about the specific ones we've built for RISC-V. So we'll talk you through our design process, explain how these new instructions work, because I hope a lot of you will be using them, and then explain our next steps as a group of people developing instructions for RISC-V. We're going to keep the talk as technical as possible within the time, so most of the details will still be in the paper, and if you're watching this on YouTube I will not blame you if you watch it back at half speed, because it is quite dense. So, moving on to exactly what RISC-V is: in their own words, it's a free and open ISA that lets anyone pick up the specification and build a CPU, or software to execute on one, as opposed to something like Arm or x86, where it either costs you a lot of money to buy the chip or to license the design. This means anyone can use RISC-V for free, so it's very popular in research and in industry, particularly in industry for niche applications like security. The main principle of RISC-V is that there's a very small base instruction set architecture, and then domain-specific extensions for particular things. This incomplete map shows that: at the bottom you've got the base instruction set, and each of the little bricks represents an extension on top. The blue ones are ratified, meaning they're frozen and won't change; the bright yellow ones are being developed currently; and the slightly pale yellow ones are on the roadmap. Some of these extensions are being ratified this year, including the scalar cryptography instruction set that we have been working to extend with these AES instructions.
So that's the focus of this talk: not so much the scalar cryptography instruction set extension itself, but the design process we went through to build the AES accelerator instructions for RISC-V. Just to introduce instruction set extensions for cryptography more generally: these screenshots are from the Arm manual on the left and the Intel blog on the right, introducing their AES acceleration instructions. We're all on some level familiar with these. The idea is that if you add an instruction to your architecture that performs the very specific operation AES does, or indeed any other piece of cryptography, or whatever specific operation your CPU needs to do, you can get an enormous power and performance improvement. So cryptography is a really good example of a domain-specific instruction set extension, because you get such an enormous performance increase, and you can very often get a security benefit as well by removing timing attack vectors. For AES this means you're no longer computing your S-box and MixColumns operations through T-tables. You also get something of a portability benefit: for smaller embedded applications, where otherwise you'd be using some slightly different dedicated AES accelerator engine that you have to interact with via a hardware abstraction layer, which is a bit of a faff and quite difficult to do in portable software, whereas if the instructions are built into the instruction set, software can always rely on them being there.
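To make that T-table point concrete, here is a minimal Python sketch of the classic T-table technique that these instruction set extensions replace. All the names here are my own for illustration, not from any specification, and a real software implementation would precompute all four rotated tables rather than rotating at runtime. The key thing to notice is the secret-dependent table index, which is exactly the memory-access pattern that leaks timing information and that a fused instruction removes.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(a):
    """Inverse in GF(2^8); a^254 == a^-1, and 0 maps to 0 by convention."""
    if a == 0:
        return 0
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def sbox_byte(x):
    """AES S-box: field inversion followed by the affine transform from FIPS 197."""
    inv = gf_inv(x)
    r = 0
    for i in range(8):
        bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8)) ^
               (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        r |= bit << i
    return r

SBOX = [sbox_byte(x) for x in range(256)]

# Te0 fuses SubBytes with one column of MixColumns: each entry packs
# (2*S[x], S[x], S[x], 3*S[x]) into a single 32-bit word.
TE0 = [(gf_mul(2, s) << 24) | (s << 16) | (s << 8) | gf_mul(3, s)
       for s in SBOX]

def rotr32(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def round_column(a0, a1, a2, a3, rk):
    """One 32-bit output column of an AES middle round, T-table style.

    Every index a0..a3 is a secret state byte, so the memory addresses
    touched depend on the key -- the cache-timing leak that a fused
    AES instruction eliminates by computing the table entry in logic.
    """
    return (TE0[a0] ^ rotr32(TE0[a1], 8) ^ rotr32(TE0[a2], 16) ^
            rotr32(TE0[a3], 24) ^ rk) & 0xFFFFFFFF
```

A full software round does this four times, once per output column, which is why each table lookup maps so naturally onto one instruction when you move the operation into hardware.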
The flip side of this, particularly for AES, is that most of these instruction set extensions (and if I flip back to the previous slide you can see this) build on top of the existing SIMD or vector registers in the CPU. That's great, because AES, and cryptography generally, tends to have quite large inputs to its operations, so if you've already got these big registers in your CPU it makes perfect sense to reuse them; you can see on the bottom left here that it's 128 bits going in and 128 bits coming out. That's fine if you've already got these big registers, but for a large class of CPUs it's just not feasible: they're designed to be area-optimised and very, very small. Lots of you have written software for embedded microcontroller platforms like the Arm Cortex-M series, and you know that writing cryptography for those can be a right pain. Those platforms can really benefit from some level of accelerated cryptography, both in terms of performance and, for smaller devices in particular, power efficiency. So that's the good and the bad of instruction set extensions for cryptography. The ugly, until recently, was that RISC-V didn't have any cryptography acceleration, and there have been papers pointing out that RISC-V actually struggles in a lot of regards when it comes to certain cryptographic primitives. So that was our job: the authors of this paper, and a whole bunch of other people, are members of the RISC-V cryptography extension task group, and it's our job to make RISC-V the best architecture out there for doing cryptography. That's hopefully what we've somewhat achieved, and this presentation will take you through the AES instructions. So the first question we started with when building these extensions was: do we go down the traditional route of accelerating AES? Do we reuse the forthcoming RISC-V vector registers?
Or do we take an approach that's a bit different, one that hasn't actually been done before: do we look at the scalar approach, which is to use only the existing general-purpose registers? In true RISC-V fashion we actually decided to do both, but we're focusing on the scalar instructions first because they're ready, whereas the RISC-V vector extension is still being defined. You probably guessed from the title of this presentation that we're looking at the scalar work now. What makes scalar different is that your input and output registers are maybe only 32 or 64 bits wide, as opposed to 128 bits. This means your implementation is inevitably going to be a little bit slower, but it's going to be available to a much, much larger set of CPU types, and that was the really interesting thing from a RISC-V perspective, because RISC-V is currently most popular in the embedded compute space. In terms of what we actually did to define these instructions, our approach was basically to go looking through the literature. We had a really positive experience going through existing work, from CHES and various other venues, going back years, on how people thought about adding instruction set extensions to 32-bit processors in order to speed up AES. We found three distinct pre-existing designs, which you can see snapshotted here, and we also invented a couple of our own, one of which was for the 64-bit architecture, because we didn't find any 64-bit-specific work in the literature on accelerating AES. Then we went through a fairly standard process of benchmarking their software performance, looking at static and dynamic instruction counts, how much they cost to implement in hardware, and the more fiddly complexity questions: what does it mean if you want to actually integrate this into a CPU, does it make the rest of the design disproportionately complex, how would we verify the functionality, and all the other things that aren't immediately
apparent if you just look at the academic designs. Once we had benchmarked all of these, luckily there were two very clear winners, so we picked one for the 32-bit base RISC-V architecture and one for the 64-bit base, and the rest of this presentation will give you a whistle-stop tour of exactly how those two sets of instructions work. For the 32-bit design, the way we like to explain it is that it's T-tables in hardware. Imagine you've implemented AES in software in the T-table style: you've got essentially a one-byte input producing a 32-bit output, and normally you would do this with a lookup into memory, but in this case we roll that entire operation up into a single instruction. Each instruction does one byte's worth of the S-box, some of ShiftRows, and one byte's worth of MixColumns, and then you XOR the result back into the state, just like you would with a normal T-table operation. The advantage of this, particularly for a 32-bit design, is that you only need to instance one S-box in hardware, and the area of these instructions is always dominated by the S-box: despite years of research on efficient S-boxes, they're still the bottleneck, basically. On the right there you can see the actual ISA specification for these instructions. I won't dwell on it too much, and I'll give you a link to the spec later on, but using these instructions you get a pretty big speed-up for a relatively small cost, only around a thousand gates for both encrypt and decrypt, which is amazing, and you end up with about 20 instructions per encrypt or decrypt round. In terms of what a round actually looks like, at this point please do pause on YouTube to take in this extremely busy slide. So, the round loop itself: you start with four load-word instructions, which fetch your next round key essentially, and then you've got 16 of these
aes32esmi instructions: AES, 32-bit, Encrypt, SubBytes, Mix, with an Immediate selecting byte i. Each group of four instructions processes one row, and each instruction within the group selects the i-th byte of one of the input registers, applies the S-box (that's what's going on over here), then, if it's a middle round, applies MixColumns (and if it's not a middle round, it doesn't), then rotates the result and XORs it back into rs1. I don't expect this to be completely understood on a first viewing, but this is essentially how the whole operation breaks down for one round of AES on 32-bit RISC-V, so if you want to pause and take all that in, I will not blame you. For the 64-bit design, our key realisation was that if you take two 64-bit registers, you can fit an entire 128-bit state. This was an absolute revelation to us, because it means we can feed an entire AES state into one instruction, so over two instructions, each of which can only produce a 64-bit result, we can compute the entire AES next-round state. Two instructions, an entire round. Because these instructions are for 64-bit processors, which are naturally a little bit bigger, we opted to make them somewhat higher performance, which means you can do an entire AES encrypt or decrypt round in six instructions, as opposed to 20 in the 32-bit design, so you can imagine that's quite a big difference. In terms of how that stacks up against Arm's and Intel's versions: well, they can do an entire round in one instruction, so although we're not as fast as them, in terms of cost per unit of performance gained we're actually extremely efficient. This particular instruction also has some implementation trade-offs: you can implement it with one S-box over eight cycles, giving a multi-cycle operation that's a bit smaller and a bit slower, or you can do eight S-boxes in effectively one cycle, or pipeline it as necessary. Again, you've got a snapshot from the spec there. Let's
give you a concrete example, a little bit less busy this time. This is the recommended way we think you should implement an AES round on 64-bit RISC-V. First of all, load in two sets of round keys; we do double rounds, and we do this because we want to pipeline all the loads in one go. We use two normal XOR instructions to do the AddRoundKey step, and then the aes64esm instruction (AES, 64-bit, Encrypt, SubBytes and MixColumns) does the rest: ShiftRows, SubBytes and MixColumns. The actual steps involved: you take 128 bits of input from rs1 and rs2, you do ShiftRows, you take only the low 64 bits of the ShiftRows result, you apply SubBytes and MixColumns to those, and you output those 64 bits. That gives you half of a round, and to compute the other half you just swap the order of your input registers; because of the way ShiftRows works, that means you naturally get the other half of the output. If that isn't immediately apparent, just trust me, but it's a nice little trick that meant we only needed half as many instructions as we thought we did. So that's the 64-bit design. In terms of our evaluation, again I apologise for a bit of a busy slide, but these are all the numbers you really want to see from the paper. The headlines are that we're between five and ten times faster than software T-tables, we get rid of all of the data memory requirements because there are no more tables, the hardware cost is very modest, and we're a fair bit faster. If you want to pause again and take all of this in, please do. We did have an anticipated question, because these are obviously instructions intended to go into smaller machines, smaller CPUs, that possibly have side-channel concerns: how did we consider this in our design process? For timing it's pretty easy: you're no longer doing memory lookups, because you shouldn't have been doing memory lookups anyway, but if you were, there's no need to any more, because
the actual S-box operation is wrapped up in the instruction, so timing we actually found fairly simple. For EM and power side channels, however, that's a very different question, and I would love to talk to you all about this in extreme depth, so please come and find me, because there's a lot to discuss. Basically, we excluded these from scope pretty early on in the design of these instructions, because at the time there wasn't much research on how to add side-channel guarantees to instruction set architectures, or on whether the instruction set definition is actually the right place to do that. The moment you add it at that point in the abstraction stack, you have to be able to verify it, and verifying the absence of power side channels is quite difficult at the moment. You can make all kinds of claims about formal proofs of power side-channel security, but in reality those rely on a certain set of assumptions that we couldn't guarantee would hold for everyone who was going to implement these instructions. So we excluded them from scope; however, we're hopefully going to put out an ePrint soon on how you might add side-channel protections to these instructions. Generally, we need more research on this, so please do come and talk to us, because it's something the RISC-V security and cryptography community is really interested in. So that was me anticipating your question; please do keep asking about it, but I wanted to tell you that we have done some thinking about it. In terms of what's actually happening now because of our work: the RISC-V scalar cryptography ISE is out for public review, and it includes these instructions. The actual instruction set extension includes not just instructions for AES but also SM3, SM4, SHA-2, and a bunch of other, more generic instructions for accelerating cryptography, and you can go and find it at these various links. We hope the CHES community will find this very interesting
because RISC-V is the first widely adopted instruction set that's going to have a dedicated extension for small CPUs to accelerate cryptography. This hasn't really happened before, and we really want to get it out into the community, so tell us what you think, basically. In terms of what's next on our horizon: the cryptography task group within RISC-V is going to start looking at vector instructions for AES, the really high-performance versions, much more similar to what x86 and Arm have done in the past. We also want to look at post-quantum cryptography: how are we going to support the primitives necessary for that on a RISC-V based system? And we want to look more at side-channel security, specifically for these AES instructions, because they're a very common target, but also more generally: how can we add constraints to the RISC-V design process that make it easier to support power or EM side-channel security? We've got some ideas about this, but we want to talk to you about it, because you're all the experts. Overall, our goal is to make RISC-V the best ISA for cryptography in the world, and we've already used a lot of work that has appeared at CHES in hopefully getting closer to that goal, but we know there's a way to go. So thank you for listening to my presentation, which I realise was a little bit all over the place, but I hope you've learned a little about not only the AES instructions we've designed but also where they're going to go and how they're going to be used. So yeah, thank you very much for listening.