Hello and welcome. My name is Richard and in this video I will present our paper on instruction set architecture extensions for finite-field arithmetic. We investigate the impact of smaller accelerators, and in this work we examine an instruction set extension for finite-field arithmetic. To benchmark such an extension, we applied it to lattice-based cryptography. Our contributions are as follows. We use an open RISC-V implementation to introduce an instruction set architecture extension for finite-field arithmetic. We present an optimized RISC-V implementation of the Kyber and NewHope ciphers, both for the standard RISC-V instruction set and for our extended one. Finally, we benchmark the results, examining cycle counts, clock frequency, wall-clock time, as well as the area and time-area product of the design. Let's first get to RISC-V. What is RISC-V? It's a free, open and, as the name suggests, reduced instruction set architecture which is extendable by design. It's meant for processor designs, as opposed to, for example, simulation or binary translation. It does not dictate a specific microarchitectural style. Free implementations have existed for some time now, and I will quickly show those we considered for this work. The first RISC-V implementation we examined was the Rocket Chip design. It utilizes the Chisel hardware design language to actually generate a processor. It is quite configurable and offers an extension interface called the Rocket Custom Coprocessor Interface, as shown on the right. If a coprocessor is included in the design, it simply responds to a chosen opcode. The disadvantage here is that the coprocessor resides outside of the pipeline, and this may lead to pipeline stalls when, for example, the coprocessor requires multiple cycles. The PicoRV32 implementation by Claire Wolf is a very size-optimized implementation, and it offers a quite similar interface.
If an instruction can't be decoded by the processor, it is simply offered to a bus that connects all the coprocessors. The coprocessors then decode and execute the instruction, returning an optional result register. The RISC-V multiplication instruction is actually implemented via this interface in the PicoRV32 processor. But for this work we opted for the VexRiscv implementation by Charles Papon. We chose VexRiscv for its flexibility, which stems from a software-oriented, plugin-based approach to design. The design doesn't have any coprocessors; instead, it consists of a set of plugins which extend the pipeline stages with logic. Each plugin has access to any stage register, which enables highly adaptable designs with very few redundancies. The flexibility of VexRiscv stems from its language, SpinalHDL. SpinalHDL is very similar to the aforementioned Chisel language. Technically, SpinalHDL is a software framework for the Scala programming language that serves as a hardware description language. It's quite important to note that SpinalHDL does not follow a high-level synthesis approach. The framework simply introduces powerful tools to describe hardware, and at its core it is not that different from common design languages. Scala lends itself well to designing a domain-specific language, for example due to the optional infix notation or the flexible function naming scheme, which seamlessly enables the introduction of new operators, as shown below. The biggest difference to Verilog or VHDL are the powerful elaboration tools. Instead of simple loops or conditionals, we can parametrize and elaborate with any Scala mechanism. The example on the right shows a carry-less multiplier which utilizes a simple map-and-reduce expression that automatically results in a balanced tree for the accumulation of the partial products. The workflow for SpinalHDL is interoperable with existing hardware design languages.
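The SpinalHDL snippet from the slide isn't reproduced in this transcript, but the map-and-reduce idea behind the carry-less multiplier can be sketched in plain Python. This is a hypothetical software model for illustration, not the code from the talk: "map" emits one shifted partial product per bit of the second operand, and "reduce" XORs them together, which in hardware elaborates to a balanced XOR tree.

```python
from functools import reduce

def clmul(a: int, b: int, width: int = 8) -> int:
    """Carry-less multiply: XOR-accumulate shifted partial products.

    Mirrors the map/reduce structure mentioned in the talk: 'map'
    produces a shifted copy of a for every set bit of b, 'reduce'
    combines them with XOR (a balanced tree in hardware).
    """
    partials = map(lambda i: (a << i) if (b >> i) & 1 else 0, range(width))
    return reduce(lambda x, y: x ^ y, partials, 0)

# In GF(2)[x]: (x + 1) * (x + 1) = x^2 + 1, i.e. 0b11 * 0b11 = 0b101
assert clmul(0b11, 0b11) == 0b101
```

Note that unlike a Python loop, the reduce here runs at elaboration time in SpinalHDL, so it describes wiring rather than sequential computation.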
SpinalHDL simply produces Verilog or VHDL, and the result can then be synthesized as usual. Using the sources produced by SpinalHDL alongside sources in other languages is very straightforward, as SpinalHDL preserves all signal names used in the description. Integrating existing legacy sources is just as easy, as external hardware modules can be instantiated from SpinalHDL. Now let's get to our ISA extension. To explain the extension, I'll first talk about how the VexRiscv plugin system works. It helps not to think of VexRiscv as a CPU with discrete components such as a decoder, memory bus or ALU. This is simply what VexRiscv pretends to be. VexRiscv is really more of a flexible and extendable pipeline, and the CPU is simply an emergent property of all the interworking plugins. The basic elements of such a pipeline are the stages, the Stageables and the plugins. A plugin will simply generate additional logic inside of a stage. The logic then uses and produces Stageables, which can be any piece of data you may want to pass around in a pipeline. The pipeline generator will then automatically route the produced data to all the consuming plugins, and if the data is required in the next stage, it will automatically be routed through registers. Let's go through a quick example. First we define the Stageables, which can be of any desired type. Then we plug a piece of logic into stage A. Each piece of logic we plug in usually produces some values which can be used by other plugins. When these outputs are used within the same stage by another plugin, the signals are simply routed to the appropriate input. When, however, the outputs are used in another stage, the signals will automatically be routed through pipeline registers. Shown here is an overview of the plugins that make up the CPU. The instruction bus plugin will fetch and inject an instruction. The decoder will then set initial values of some Stageables depending on the instruction.
The register file will then load and inject the register values referenced by the instruction. The source plugin and the integer ALU plugin produce the arithmetic results. And in the final stage, the register file will store the result. Note that the multiplication plugin will split the multiplication into four 16-bit multiplications, which are then assembled in the later stages. This is a great example of a plugin which uses multiple stages to produce a result. Now let's take a look at the extension we defined for finite-field arithmetic. Shown in the table above is the instruction format of the four arithmetic instructions: an addition, a subtraction, a multiplication and a separate reduction instruction. Note that all instructions include a reduction. The separate reduction instruction simply serves as a plain reduction without any arithmetic, for example to reduce sampled polynomials. In the simplest variant, the extension simply reduces the instruction result by a fixed modulus. A second variant can reduce the result with one of a set of fixed moduli chosen by an index. The last variant is completely flexible, with the modulus set by an internal register that contains the modulus and a precomputed value. In future versions, such internal registers could also be accessed via control and status registers instead of specialized instructions. Shown here is the hardware design with its three stages. For ease of use we opted for the Barrett reduction. This avoids a conversion that, for example, the Montgomery reduction would require. And since we have access to any stage register, we can reuse the ALU and the multiplication plugins. In the second and third stages of the extension, the result is reduced. Note that the reduction uses fixed constants, which means that we can exploit their small Hamming weights, substantially reducing the cost of the multipliers.
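The Barrett reduction with a precomputed value, as used in the flexible variant of the extension, can be sketched in Python. The bit width k = 32 and the final conditional subtraction are assumptions for illustration; the paper's exact operand widths may differ.

```python
def barrett_setup(q: int, k: int = 32):
    """Precompute m = floor(2^k / q), the value stored alongside the
    modulus in the extension's internal register (bit width assumed)."""
    return q, k, (1 << k) // q

def barrett_reduce(x: int, params) -> int:
    """Barrett reduction: estimate the quotient with a multiply and a
    shift, then subtract. For x < 2^k the intermediate result lies in
    [0, 2q), so one conditional subtraction normalizes it."""
    q, k, m = params
    t = (x * m) >> k          # approximate quotient floor(x / q)
    r = x - t * q             # remainder candidate in [0, 2q)
    return r - q if r >= q else r

params = barrett_setup(3329)  # Kyber's modulus
assert barrett_reduce(12345, params) == 12345 % 3329
```

Unlike Montgomery reduction, no conversion into and out of a special representation is needed, which is why the talk calls it the easier choice here.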
The flexible variant in turn simply feeds in values from registers instead of constants. To benchmark the design we need a complete system, as VexRiscv is just a core. So we loosely based our design on the Murax example of the VexRiscv project, which is a complete system on chip. Our design goal was to mimic an ARM Cortex-M3, so we extended the Murax example with a full multi-master bus, so that the core can simultaneously fetch code and data. To mimic a separate flash memory for the code, we also separated our memory into two distinct blocks and linked all the executables accordingly to split the code and data sections. Now, to assess the impact of the instruction set extension, we evaluated various design variants on the Xilinx Artix-7 and Lattice iCE40 FPGA platforms. Included in the table here are a reference platform, an extension with a single fixed modulus, an extension with four fixed moduli, a flexible extension, as well as variants without a general-purpose multiplier. The extensions introduce a fairly small overhead, with the flexible variants requiring the most. The variants without a general-purpose multiplier are especially interesting, as they significantly reduce the size of the design. Shown here are the results as presented in the paper, but we do have an update. Later analysis showed that the memory bus was the longest path of the design. Introducing a buffer for the bus adds one cycle of latency for memory operations, but significantly raises the maximum frequency. It also shows that the first reduction stage of our extension is then the longest path of the processor. Here the simple custom variants profit significantly from the simplified multiplications with constants, which don't require any DSPs. Now let's have a look at lattice-based cryptography on RISC-V. First we developed a RISC-V implementation of the finite-field arithmetic without any extension.
Most reference implementations of the ciphers use a signed Montgomery reduction, which does not lend itself well to RISC-V. RISC-V does not include a sign extension instruction, so we would need two shifts to sign extend the result. While one of the shifts can be avoided by shifting the multiplication constants, it's still quite cumbersome. So we opted for a Barrett reduction, which is a much better fit for the platform. As RISC-V has separate instructions for the high and low result of a multiplication, we can avoid any shifts otherwise necessary, reducing the reduction to just three instructions. Next is the number theoretic transform (NTT), which many lattice-based cryptography schemes use in some form or another to speed up large polynomial multiplications. The reference implementations usually process the NTT layer by layer, as shown here for the first layer and then the second layer. A common approach for optimized implementations is merging the computations of multiple layers, as shown above. This avoids loads and stores of the coefficients. RISC-V has enough registers to merge three layers, while, for example, the ARM platform only has enough for merging two. While each merge doubles the number of loaded values, we still have enough registers to employ other optimization techniques, such as interleaving all four butterfly operations of the NTT. Each butterfly of the NTT requires a so-called twiddle factor, which in software implementations is usually precomputed and stored in a lookup table. A lookup table is used because simple loads from a table are usually more efficient than the finite-field operations required to compute the factors. Hardware implementations, however, use optimized circuits for finite-field operations, so they compute the twiddle factors on the fly. And since our instruction set extension introduces such an optimized circuit, we can use the same method.
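The three-instruction Barrett sequence on RISC-V can be sketched by modeling each instruction as one Python step. The modulus and the 32-bit Barrett constant are illustrative assumptions here (the paper's exact constants and shift amounts may differ); the point is that mulhu provides the high half of the product for free, so no shifts are needed, and the result is a lazy representative in [0, 2q).

```python
Q = 3329                      # Kyber's modulus, used here for illustration
M = (1 << 32) // Q            # precomputed Barrett constant (assumed 32-bit)

def mulhu(a: int, b: int) -> int:
    """RV32 mulhu: upper 32 bits of the 64-bit unsigned product."""
    return ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF)) >> 32

def barrett3(x: int) -> int:
    """Three-instruction reduction sketch:
       mulhu t, x, M   -> approximate quotient
       mul   u, t, Q   -> low 32 bits of t * q
       sub   r, x, u   -> remainder in [0, 2q)
    """
    t = mulhu(x, M)
    u = (t * Q) & 0xFFFFFFFF
    return (x - u) & 0xFFFFFFFF
```

The result is only reduced to [0, 2q); as is common in lazy-reduction implementations, a final conditional subtraction normalizes it when a canonical value is required.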
So to implement the NTT, we use the iterative NTT algorithm and pick an appropriate butterfly operation to ensure that the twiddle factors are used in ascending powers of the root of unity. This way we can almost entirely avoid lookup tables. The only exception is Kyber, which requires some twiddle factors for its polynomial multiplication due to the early termination of the NTT. However, we were able to reduce the lookup table to only 32 elements. Let's evaluate these results. We implemented the polynomial arithmetic, including a suitable NTT and inverse NTT, for the Kyber and NewHope ciphers. NewHope uses polynomials with 512 or 1024 coefficients, while Kyber, as a module-lattice scheme, only needs 256 coefficients. On the left are the cycle counts for the arithmetic on 1024-coefficient polynomials. Compared to the standard RISC-V implementation, which is in the middle, the custom instructions reduce the number of cycles by 26%. A much bigger improvement is the code size, shown on the right, which includes the precomputed tables. As we employ some amount of unrolling, the code of the optimized implementation is larger than the reference code. The precomputed tables, however, still make up a large portion of the code and are in fact larger than even the unrolled code for the larger polynomial sizes. Thus, the instruction set extension can offer a large advantage in terms of code size, as it doesn't need lookup tables. For the updated variant of the design with the additional bus latency, the speedup of the instruction set extension becomes more pronounced and rises to 30%. As the instruction set extension completely avoids the additional latency introduced to the load instructions, which are used for loading twiddle factors, its advantage becomes more pronounced. When we examine the whole scheme, the speedup is not as pronounced. The culprit here is the hash function used for the random sampling of the polynomials.
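The ascending-powers ordering that makes on-the-fly twiddle generation possible can be illustrated with a toy example. The modulus q = 17 and the primitive 8th root of unity 2 are illustrative values only, not parameters of Kyber or NewHope: when twiddles appear as consecutive powers, each one follows from the previous with a single modular multiplication, so a table load can be traded for one finite-field operation.

```python
Q = 17          # toy modulus; 2 is a primitive 8th root of unity mod 17
N = 8
W = 2

# Software style: precomputed lookup table of twiddle factors
table = [pow(W, i, Q) for i in range(N // 2)]

# Hardware style (and with our extension): generate each twiddle
# on the fly, one modular multiplication per step
on_the_fly = []
w = 1
for _ in range(N // 2):
    on_the_fly.append(w)
    w = (w * W) % Q           # one modular multiply replaces one table load

assert on_the_fly == table == [1, 2, 4, 8]
```

With an ISA extension that reduces the multiplication in hardware, the on-the-fly multiply costs no more than the load it replaces, which is what eliminates the tables and their code-size footprint.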
The effect is, of course, also present in other implementations, for example for the ARM platform, and benchmarks usually remove the influence of the hash function. When we remove the cycles spent hashing, the speedup translates reasonably well to 13%. Once again, our new memory architecture introduces further speedups due to the missing lookup tables. The question now is whether such an instruction set extension is worth the additional circuitry. To that end, we examine the time-area product of the design. Sadly, this is somewhat complicated, as we don't have an adequate area translation for DSPs and RAM blocks. To estimate the time-area product, we synthesize the design without any DSPs, which of course significantly lengthens the longest path in the pipeline, with a large impact on the multipliers. Note that the custom instructions could significantly shrink the required memory, which could shift the time-area product even further in their favor. Sadly, we can't model this effect. The comparison shows, however, that the designs without a general-purpose multiplier are particularly interesting. Once again, introducing the additional memory latency raises the maximum frequency, but greatly exacerbates the influence of the reduction stage. The variants without a general-purpose multiplier, however, remain the most attractive variants, and the fixed variants remain quite fast, as they use simplified multiplications with constants. To conclude, we presented an instruction set extension for finite-field arithmetic and demonstrated its impact on lattice-based cryptography. We showed that such an extension can have a significant performance benefit, both in terms of cycle counts and wall-clock speed, but especially code size. The results require some further analysis in a context beyond FPGAs, for example in an ASIC. But the results are already quite promising, particularly for small architectures.
If you want to experiment with the RISC-V platform presented here, I polished it a bit in my free time and published it under the shown URL. The platform targets several FPGAs, and if you don't have an FPGA at hand, it even includes a cycle-accurate simulator that enables you to debug your code with GDB. And with that, I thank you for your time.