I'm Isis, and we're going to talk to you about implementing pure-Rust elliptic curves. For my day job I work at the Tor Project, where I write C; this work is done in my spare time, so it isn't actually work that I'm doing for my normal day job. This talk is for people who are lightly familiar with Rust. I'm not expecting that any of you are advanced Rust programmers, and the Rust here is actually not that advanced: most of the time we're not even doing lifetimes or memory allocations or anything like that. This talk also isn't aimed at cryptographers. I'm not expecting that you have any advanced knowledge of cryptography, or actually any knowledge of cryptography at all. We're also not expecting you to really know any math; just a basic level of high school algebra and you're going to be fine. Even if you're not fine, we're happy to answer any questions that you might have after the talk. You can also email us or talk to us on Twitter or whatever. OK. So, what is curve25519-dalek? Then Henry's going to talk about implementing the low-level field arithmetic in Rust. After that we're going to talk about some things about Rust that we found really nice and some things that we think could be better, and then we're going to go over some of the other crypto that we've implemented using our library. So what is curve25519-dalek? In order to talk about our library, it's necessary to situate it within the stack that it's sitting in. You have your application at the top, and your application is using some sort of cryptographic protocol: that could be, for example, a signature or a key exchange or a zero-knowledge proof. Underneath that, you have an abstraction layer called a group. Normally in cryptography you want a prime-order group, which we'll touch upon later, but it's just a set of elements which has a prime number of things in it, basically.
You can think of a group essentially like a Rust trait, in that it's concretely implemented by an elliptic curve. So the things in the group, in this case, are the set of points satisfying a certain curve equation defined over a finite field. Usually, and in this case, we're talking about the field of integers modulo a prime p. Our implementation was originally based off of Adam Langley's ed25519 Go implementation, which itself was based off of the ref10 reference implementation, just to give credit where credit is due. In order to talk about what curve25519-dalek is and why we made it, it's important to talk about a little bit of the history of other elliptic curve libraries, their designs, and some common problems. Other elliptic curve libraries tend to have really no separation between the protocol that they're implementing and their implementation of the field, the curve, and the group. And so this ends up with a lot of problems. You end up with idiosyncrasies, sometimes in the lower-level pieces of the code, that carry over to the higher-level protocol implementations. So things like accidentally flipping a sign, and then the protocol comes out being implemented correctly because the protocol is accidentally also flipping the sign in the reverse direction, and you end up with the right output, but for the wrong underlying reasons. There are also problems with assumptions about how these lower-level pieces are supposed to behave, and those aren't necessarily correct if you try to use the field or the group implementations to implement a different protocol. This also results in super excessive copy-pasta. Cryptographers have this thing where they tend to literally copy-paste each other's code around. This is exacerbated by a lot of cryptographers somehow thinking it's appropriate to distribute a tarball of their code, unsigned, inside another tarball of a benchmarking library, and that's how you're actually supposed to get it as an end user.
It's just, like, mind-boggling to me. So anyway, this leads to large monolithic code bases which are idiosyncratic. They're incompatible with one another in really hard-to-debug ways. And they're often highly specialized to perform only the single protocol that they're implementing, which is usually signatures or key exchange, with no consideration that there's this whole rest of the field of cryptography and you might want to do something other than these two protocols. And it gets worse. Some of the bugs I've personally seen in major, widely used cryptographic libraries, which I'm not going to name: using C pointer arithmetic to index an array. In C, just as a recap, array indexing works both ways, so taking the sixth element of an array A, written A[5], is the same thing as writing 5[A]. In this case, they were doing A[P + 5], where P is a pointer. This is equal to (A + P)[5] and to 5[A + P], and there are just so many ways that that could go wrong. I've seen overflowing signed integers in C and expecting the behavior to be the same or similar across different platforms; this is canonical undefined behavior. You just don't do this. And I've seen libraries using basically untyped integer arrays, so in Rust it would be like a [u8; 32], and declaring, without using the type system at all, that this is the canonical representation of multiple things in the library, things which are mathematically, fundamentally incompatible. When you have an elliptic curve point, you can usually compress it to 32 bytes by taking either the x or the y coordinate, so that could be an array of 32 bytes; a scalar, just a number, could also be 32 bytes. And these are things that are not mathematically compatible. They shouldn't be switched, and your type system should be protecting you against making errors like this.
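As a sketch of how Rust's type system prevents exactly this class of bug, here are two hypothetical newtypes (the names are illustrative, not curve25519-dalek's actual API) wrapping the same underlying [u8; 32]:

```rust
// A compressed point and a scalar are both 32 bytes on the wire, but
// they are mathematically incompatible. Distinct newtypes let the
// compiler reject mix-ups that a bare [u8; 32] would silently allow.

#[derive(Clone, Copy, Debug, PartialEq)]
struct CompressedPoint([u8; 32]);

#[derive(Clone, Copy, Debug, PartialEq)]
struct Scalar([u8; 32]);

// Only accepts a Scalar; passing a CompressedPoint is a compile error.
fn scalar_low_byte(s: &Scalar) -> u8 {
    s.0[0]
}

fn main() {
    let s = Scalar([42u8; 32]);
    let _p = CompressedPoint([42u8; 32]);
    assert_eq!(scalar_low_byte(&s), 42);
    // scalar_low_byte(&_p); // would not compile: expected &Scalar
}
```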
I've also seen pointer arithmetic used to determine both the size and the location of a write buffer. And there are still more bugs; I could keep going with a lot of horror stories of things that I've seen. So we didn't want to do this in C, obviously, and we started working in Rust. The design goals of our library were that it should be usable for other cryptographers to implement their protocols. It should be fast to write, essentially the same as writing a Sage script. It should be versatile: you shouldn't only be able to write a signature scheme or a key exchange, you should be able to write almost any type of cryptographic protocol. It should be safe, and by that we mean multiple kinds of safe: memory safe, type safe, and Rust has extra nice things, like underflow and overflow protections if you build in debug mode. It should be readable, which is a huge thing, because if you're copy-pasting around all these assembly files, and each cryptographer is making all these tiny tweaks and changes, and there are tarballs, and there's no git history, and there's no way to know why someone changed something, you just have this blob of unreadable, inexplicit code that takes forever to understand. Readability also implies that it should be auditable, which is a huge thing for security-critical code. All of these are things that we would get from a higher-level, memory-safe, strongly typed, polymorphic programming language, a.k.a. Rust. So with that, I'm going to turn it over to Henry, who will start to explain some of the low-level field arithmetic in Rust. Hi. OK, so as we saw in one of the previous slides, there's this table of the different pieces, and since that's kind of large, as an example we're just going to go through one thing that we do. As part of this, we have to implement field arithmetic for the integers mod p, where p is 2^255 - 19.
And just as a worked example, let's see how that works. We're trying to do this using only the operations that we have available on our CPU. In order to figure out how we're going to do this, you need to answer two questions: first, what are our actual primitive operations? And second, what does multiplication look like? When you do a multiplication, you're using a fixed-size primitive type, but when you multiply integers, they get bigger. So how does that get handled? Basically, there are four possibilities. One possibility is to error if there's an overflow. In this case, if you did, I'm using u8 so the numbers are small, 8 times 40, it overflows a u8 and gives you a panic. That's what Rust does in a debug build. There's also wrapping arithmetic, where you reduce mod however much you can fit into that type; that's what Rust does in release mode. There's also, for some things, saturating arithmetic, where if you get too big, it just clamps to the highest value. And the fourth thing you could do is widening arithmetic, where the result of the multiplication is the next biggest type. In Rust, you have intrinsics for 1, 2, and 3, so if you explicitly want one of those, you can pick it. I'm not aware of an intrinsic for the fourth one, but you can just write it by casting up first. So now you might want to know: what does this actually turn into? Let's suppose that we're on x86-64. There's a really cool tool you can use, Godbolt's Compiler Explorer. In the top window, I've put an example of a thing that does a widening multiplication of two u64s to produce a u128 output, and the bottom window shows the actual assembly this turns into. There are two windows because you can see that LLVM will give you a nicer instruction on newer processors.
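The four behaviors can be demonstrated with the standard checked/wrapping/saturating methods, plus an up-cast for widening, using the u8 example from the talk (a small sketch):

```rust
fn main() {
    let (a, b): (u8, u8) = (8, 40); // 8 * 40 = 320 does not fit in a u8

    // 1. error on overflow (a plain `a * b` panics like this in debug builds)
    assert_eq!(a.checked_mul(b), None);

    // 2. wrapping: reduce mod 2^8 (release-mode behavior of `a * b`)
    assert_eq!(a.wrapping_mul(b), 64); // 320 mod 256

    // 3. saturating: clamp to the type's maximum
    assert_eq!(a.saturating_mul(b), 255);

    // 4. widening: cast up first, keep the full product
    let wide = (a as u16) * (b as u16);
    assert_eq!(wide, 320);
}
```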
On the older processors, there's this mul instruction where the inputs and outputs go into fixed, predetermined registers, so you'd have to do a bunch of moving things around; on newer ones, you can pick where they go. But the point is that you can just sanity-check that this really does turn into something reasonable. So suppose that we have this ability to multiply two 64-bit numbers into a 128-bit product. How are we going to implement field multiplication? If you look at the original curve25519 paper, it suggests using a radix 2^51 representation. What does that mean? It means that you're going to write numbers in base 2^51, so you get five coefficients. You might wonder: where does the 51 come from? Where does the five come from? These numbers are going to be basically 256 bits wide, so you could break them up into four 64-bit pieces. But if you think about the discussion in the previous talk about instructions per clock and out-of-order execution, it's actually much better if not all of the operations depend on each other. If you have limbs that are full width, then every operation will have a dependency on the previous one, and it'll be slower. That's why you would pick five limbs instead of four. We can write this in Rust as a tuple struct, and use this widening multiplication to implement it. How would we actually do that? Well, you can just write out the coefficients of your naive schoolbook multiplication. If we start by writing out the coefficients of the product from the low term, we get z0 = x0*y0, and then z1 = x0*y1 + x1*y0, and we continue in this way. I'm writing the coefficients of the output in the left-hand column and the actual digits on the right, and you get this nice triangular structure. OK, so now you'll notice that our numbers got bigger when we multiplied them.
But we're supposed to be working mod p, and so we would like to reduce this back to the original size of the inputs. How do we do that? Notice that this prime has a special form: since it's 2^255 - 19, you know that 2^255 is 19 mod p. The reason is that mod p, p is 0, so 0 = 2^255 - 19, and you bring the 19 over. Why is this useful? If you write out the product that we've just computed, you can see that, for instance, this z5 term is multiplied by 2^255, and you can replace that 2^255 by 19. Similarly, for this 2^306 term, you can write it as 2^255 times 2^51, and that simplifies into 19 times 2^51. So you get a basically pretty fast inline reduction, and when you combine that with the formulas on the previous slide, you get this, where the triangle below gets folded up into the upper part. This technique for doing really fast reduction mod p actually has a lineage you can trace all the way back to the 15th century; if you're curious, it coincides with the development of early capitalism in Venice. Unfortunately, we've now moved on to late capitalism, and things are not looking up. But why don't we just write this in Rust? So I put some Rust code on the slide. We're implementing Mul. There's some weird lifetime stuff; that's one of the things that we'll get to later, so just disregard it for now. I only put it in because it's the real code. I'm going to define a little helper function that's like my own little intrinsic for doing widening multiplication. It's #[inline(always)], so it will just disappear. And we start off: remember, on the previous slide we had 19 times some stuff, but that stuff is going to be u128, and instead of trying to do a 128-bit multiplication, it's better to just do the multiplication by 19 beforehand, while the values are still 64 bits. And then you just write down that formula. Now we have this problem, which is that the c_i values we're getting are 128 bits wide.
They're u128s and not u64s, and remember that our original goal is to get back to u64s. So we have to reduce these c_i values, and we can do that. I've written out the formula, but basically the idea is that you take this 128-bit value, keep the low 51 bits, take the high part, and add it to the next biggest coefficient. You're just carrying the value up into the next limb. You can write that in Rust in the following way: you construct a mask and do this carrying. A nice thing with Rust is that you can rebind the same names, so there's a rebinding happening here, where we're notifying the compiler that, now that we've done this reduction, all of these c_i values fit into 64 bits: do whatever you want with that information. Once we've done this first pass, we've fixed all of the c_i values to lie in u64s, but they're maybe not as small as we'd like, so we can just do another carry pass, and that's what that FieldElement64 reduce function does. It does essentially the same thing, but it's less complicated because you don't have to change types, and it also gets inlined. So actually, that's our implementation. It's not the simplest thing, but it's not that complicated. In our actual code, there's a bunch of debug assertions to make sure that all the things are the right size, that there's no possibility that we're violating some preconditions, that none of the intermediate values can overflow, whatever. But that's the actual code that we have, and we kind of just throw LLVM at it and see what happens. So you might be wondering: how does that compare? Like, I heard you had to do a lot of work to get things to be fast. Well, it turns out it's actually really, really fast. ed25519-donna is an optimized assembly implementation; it's what Tor currently uses by default.
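Putting the pieces together, here is a condensed, self-contained sketch of the radix-2^51 multiply with the 19-fold inline reduction and the carry pass described above. It follows the shape of the code on the slides, but it is simplified (it assumes input limbs below 2^52 and is not the library's exact code):

```rust
// Sketch of field multiplication mod p = 2^255 - 19 in radix 2^51.
// Assumes both inputs have limbs < 2^52.

const LOW_51_BIT_MASK: u64 = (1u64 << 51) - 1;

/// Widening multiply: the "little intrinsic" from the talk.
#[inline(always)]
fn m(x: u64, y: u64) -> u128 {
    (x as u128) * (y as u128)
}

fn mul(a: &[u64; 5], b: &[u64; 5]) -> [u64; 5] {
    // Premultiply by 19 while the values are still 64 bits wide:
    // every product coefficient z_k with k >= 5 folds down using
    // 2^255 = 19 (mod p).
    let b1_19 = b[1] * 19;
    let b2_19 = b[2] * 19;
    let b3_19 = b[3] * 19;
    let b4_19 = b[4] * 19;

    // Schoolbook product coefficients with the high triangle folded in.
    let c0 = m(a[0], b[0]) + m(a[4], b1_19) + m(a[3], b2_19) + m(a[2], b3_19) + m(a[1], b4_19);
    let mut c1 = m(a[1], b[0]) + m(a[0], b[1]) + m(a[4], b2_19) + m(a[3], b3_19) + m(a[2], b4_19);
    let mut c2 = m(a[2], b[0]) + m(a[1], b[1]) + m(a[0], b[2]) + m(a[4], b3_19) + m(a[3], b4_19);
    let mut c3 = m(a[3], b[0]) + m(a[2], b[1]) + m(a[1], b[2]) + m(a[0], b[3]) + m(a[4], b4_19);
    let mut c4 = m(a[4], b[0]) + m(a[3], b[1]) + m(a[2], b[2]) + m(a[1], b[3]) + m(a[0], b[4]);

    // Carry pass: keep the low 51 bits of each 128-bit sum and push
    // the high part up into the next coefficient.
    let mut out = [0u64; 5];
    c1 += c0 >> 51;
    out[0] = (c0 as u64) & LOW_51_BIT_MASK;
    c2 += c1 >> 51;
    out[1] = (c1 as u64) & LOW_51_BIT_MASK;
    c3 += c2 >> 51;
    out[2] = (c2 as u64) & LOW_51_BIT_MASK;
    c4 += c3 >> 51;
    out[3] = (c3 as u64) & LOW_51_BIT_MASK;
    let carry = (c4 >> 51) as u64;
    out[4] = (c4 as u64) & LOW_51_BIT_MASK;

    // The carry out of the top limb wraps around times 19, then one
    // small extra carry keeps every limb below 2^52.
    out[0] += carry * 19;
    out[1] += out[0] >> 51;
    out[0] &= LOW_51_BIT_MASK;
    out
}

fn main() {
    // 2 * 3 = 6
    assert_eq!(mul(&[2, 0, 0, 0, 0], &[3, 0, 0, 0, 0]), [6, 0, 0, 0, 0]);
    // 2 * 2^254 = 2^255, which reduces to 19 mod p
    assert_eq!(mul(&[2, 0, 0, 0, 0], &[0, 0, 0, 0, 1 << 50]), [19, 0, 0, 0, 0]);
}
```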
And as you can see here in red, our performance is comparable to donna, slightly better for things like verification. And then, just to throw in a general baseline so that you understand what the numbers are supposed to be: ring is also a Rust library. It's a higher-level library than ours, implementing protocols, and ring does that by wrapping BoringSSL's implementations, which are assembly implementations, in Rust, so it's pretty, pretty fast. So now, some things about Rust that we really like, and other things that we think could be a little bit better. Obviously, Rust's code generation is done by LLVM, and as I just showed, it's really good at generating code. It's not just good at generating fast code; it's good at generating safe code. Historically, there has been a worry that an optimizer could in theory break the constant-time properties of an implementation. What does this mean? People have essentially said in the past that you can't use compilers to write cryptography; you have to write handwritten assembly, because that's the only way to control what a chip is going to do. Not only is that not entirely true (there are chips that do weird things, which I'll get into later), it's just all kinds of insane. So what does it mean to say that code is constant time? There are these things called side channels, and a side channel is essentially a mechanism by which an adversary can learn some sort of internal program state. For cryptography, this is especially insidious, because learning a few bits of a secret can often lead to full key recovery attacks. A concrete example of a side channel attack: I make a static string, and I just hold the A key for five minutes to make this really huge string, and I load it into your CPU's cache. And some of the caches are shared with other programs that are running.
So the other program, the one you want to do a side-channel analysis of, let's say it has an if-then statement where it's branching on a bit of your key, on whether that bit is a 0 or a 1. If the if condition is a really small piece of code and the then branch is this huge chunk of code, you can basically load your giant static string into the cache, wait a little while and just chill, and then try to access your giant string again, timing how long it takes. As you saw in the previous talk, hitting different layers of the cache, or hitting memory instead, results in different timings. So that's a timing side channel. There are other side channels: you can do differential power analysis, where you build up profiles of a device or a chip and how much power it draws as it's doing certain operations. But basically, this is all bad. You want to write code that does exactly the same thing, with respect to secrets, all the time. To prevent this, we have put a lot of effort into, as you saw, propagating the carries in Henry's code. There are all these crazy bit-shifting operators everywhere, and crypto code just tends to look weirder because of this. As for LLVM's optimizer, it turns out that it's not breaking our code, as far as we can tell, for x86, and we've sat there and geeked out on this assembly for hours and hours and way longer than I would like. In the future, we're hoping that there would be some way to do an LLVM pass with a sanitizer where we can statically analyze the output assembly. There's a trick for Valgrind that Adam Langley made where you just mark secret data as uninitialized, and then if you ever try to index based on that data or branch on it, the whole thing just poops its pants, and you know that you've done something wrong. (Audience: Are you talking about ctgrind?) OK.
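As a tiny example of the branch-free style this forces, here is a constant-time conditional select, replacing an if-then on a secret bit with masking (a generic sketch, not code from the library):

```rust
// Branch-free conditional select: there is no if-then on the secret
// `choice`, so the executed instructions and memory accesses do not
// depend on the secret.

fn ct_select(a: u64, b: u64, choice: u64) -> u64 {
    // `choice` must be exactly 0 or 1; wrapping_neg turns 1 into an
    // all-ones mask and leaves 0 as an all-zeros mask.
    let mask = choice.wrapping_neg();
    (a & mask) | (b & !mask)
}

fn main() {
    assert_eq!(ct_select(5, 9, 1), 5); // selects a
    assert_eq!(ct_select(5, 9, 0), 9); // selects b
}
```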
So, Rust is capable of targeting a lot of platforms, including extremely constrained environments, using no_std. dalek can use no_std, so if you write your protocol and tell dalek to use no_std, you can present a foreign function interface that can be used in a lot of weird places. Tony Arcieri, who's somewhere around here, got ed25519-dalek, which is my signature implementation on top of curve25519-dalek, running on an embedded PowerPC chip inside of a hardware security module, and is working on getting it running in an SGX enclave. As a side note, to go back to the constant-time thing: I can't guarantee anything if you're running something on PowerPC, because PowerPC is one of these weird chips that I mentioned. For example, its multiply instruction will return early if you're multiplying something by zero. So I actually just can't guarantee that the crypto is constant time at all on your ancient MacBook. Don't do that. Filippo Valsorda had a recent blog post, which was really interesting; I think it made the top of both the Golang and the Rust subreddits. He has a thing where he's calling Rust from Go with minimal overhead, and he used dalek as an example. Interestingly, this is three times faster than calling the pure-Go curve25519 implementation that's in Golang's standard library. So, there are some things we don't like. One of them is what I've been calling the Eye of Sauron. It's this thing that we have; it looks like this. This is not very ergonomic, and as someone who has kind of dogfooded my own library, I don't really like writing it. This results from the fact that Rust's operator traits take arguments of type T, not type &T.
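A minimal sketch of why the ampersands pile up: if you implement the operator trait on references (with a toy Point type standing in for a curve point), every expression needs explicit borrows:

```rust
use std::ops::Mul;

// Toy stand-in for a large Copy curve-point type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point(i64);

// Implementing the operator on &Point (so big values are not moved or
// copied by the operator itself) forces callers to borrow explicitly.
impl<'a, 'b> Mul<&'b Point> for &'a Point {
    type Output = Point;
    fn mul(self, rhs: &'b Point) -> Point {
        Point(self.0 * rhs.0)
    }
}

fn main() {
    let (a, b, c) = (Point(2), Point(3), Point(5));
    // the "Eye of Sauron": ampersands and parentheses pile up
    let result = &(&a * &b) * &c;
    assert_eq!(result, Point(30));
}
```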
So, in order to avoid that, if you want or need Copy for your types, which we do, because we have a lot of constants that are just various things on the curve that get used all over the place in different protocols (for example, the base point of the curve, which is just a point that you pick and that's special; we have a constants file, whatever), so we need Copy, then we have to implement our operations on type &T, and you end up with code that looks like this, where you keep putting parentheses and ampersands around things as you build up an expression multiplying or adding things together. It gets messy pretty quickly. I'm not a compiler dev, but it might be possible to do auto-borrow for Copy types, or some other special marker where we can say: hey, I said the type, but I really meant borrow the type; don't actually copy it. Another thing which would be really cool, which we know is coming and we're really excited about, is const generics. We've already thought of really cool ways to abuse const generics to optimize the field arithmetic. The basic idea would be to statically track the sizes of the intermediate values of these field elements, and to use specialization to insert reductions when necessary. So it would automatically detect: oh hey, you have three bits of carry space and you've done two adds already, let's do a reduction now, instead of forcing users to know when to do it by hand. So next, Henry's going to talk about some of the crypto we've implemented with dalek. So, just a brief overview of some of the stuff that we've done. One thing you might want to do is zero-knowledge proofs. The idea is to prove some statements about some secret values without really revealing anything.
One example that people seem to want to do is called discrete log equality, where you have, say, four points, and you want to prove that a = x*g and b = x*h simultaneously, without revealing your secret value x. These proofs (for people who care, they're Schnorr-style proofs) involve a lot of boilerplate once your expressions get more complicated. So we made a crate that has an experimental zero-knowledge proof compiler implemented in Rust macros. Not procedural macros, just ordinary macros, because it's worse that way. As a user, it's kind of nice, because you just write this, and that corresponds to the example above. And what does that expand into? It turns into a DLEQ module with all the code for creating and verifying these proof objects, and it uses serde to derive a parser into wire format and back. So that's pretty cool. Thank you. And another thing that we did, which Isis can talk about. Sure. So there are other types of zero-knowledge statements that you might want to make. Before, we were proving an equality; you might also want to prove an inequality. You might want to say: I know an x, and it's bigger than y. The problem is that if we're working in a cyclic group, x being bigger than y is the same thing as x being smaller than y, because it can wrap around the group. So you have to do what's called a range proof, which says: I know an x, and it's between y and z. And there are ways to do this in zero knowledge without giving away any information about x, other than that it lies in the correct range. These are often used in confidential transaction systems, and we're also using it in a future anti-censorship system that we designed for Tor, which uses a micropayment scheme that we made, embedded inside of an anonymous credential.
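To make the discrete-log-equality statement from earlier concrete, here is a toy Schnorr-style DLEQ proof over a tiny multiplicative group. This is purely illustrative: the nonce and challenge are fixed for determinism, the parameters are insecurely small, and it is not the API of our proof-compiler crate.

```rust
// Toy DLEQ: prove a = g^x and b = h^x for the SAME secret x,
// in the order-q subgroup of Z_p^* with p = 2q + 1.

fn modpow(mut base: u64, mut exp: u64, m: u64) -> u64 {
    let mut acc = 1;
    base %= m;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % m;
        }
        base = base * base % m;
        exp >>= 1;
    }
    acc
}

fn main() {
    let (p, q) = (23u64, 11u64); // p = 2q + 1
    let (g, h) = (2u64, 3u64);   // both have order q mod p
    let x = 7u64;                // the secret exponent

    // public statement
    let a = modpow(g, x, p);
    let b = modpow(h, x, p);

    // prover: commit with a nonce k (would be random; fixed here)
    let k = 5u64;
    let t1 = modpow(g, k, p);
    let t2 = modpow(h, k, p);
    let c = 3u64; // challenge (normally a hash of t1, t2, and the statement)
    let s = (k + q * c - (c * x) % q) % q; // response s = k - c*x mod q

    // verifier: both equations must reconstruct the commitments,
    // which only works if the SAME x was used in a and b
    assert_eq!(modpow(g, s, p) * modpow(a, c, p) % p, t1);
    assert_eq!(modpow(h, s, p) * modpow(b, c, p) % p, t2);
}
```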
And we use this micropayment scheme, essentially, to store proof of good user behavior, which you can sort of spend later, as a way to filter out bad actors from a system without knowing anything about the users or being able to track them. So the basic idea is that you want to prove x is in the range 0 to b^n. You write x in base b, so that it's the sum from i = 0 to n-1 of x_i times b^i, and you prove that each digit x_i is in the range 0 to b. Traditionally, the way to do this is due to a cryptographer named Schoenmakers: you just write the number in binary, so b in this case is 2, and you prove that each digit of the number is 0 or 1, and then you can derive various statements about the range of the number. Verification essentially amounts to checking a ring signature on each digit's proof, and if each digit is in the correct range, the whole number is in the range. We implemented a recent construction due to Back and Maxwell, who determined that if you do Borromean ring signatures over a ternary system, you can share data between the digits, and this ends up being a lot more efficient. The name Borromean ring signature is actually pretty cool. Borromean rings are, if you imagine the sign for the Olympics, this mathematical thing where you have three rings that are interconnected, and if you cut one, the whole thing falls apart. The signature scheme is named after these interconnected rings because each of the rings in the signature shares some of the data of the other ones, or is dependent on the other ones. Computers don't really like ternary systems. Human brains also don't really like ternary systems. It turns out to be really not nice to implement.
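The base-b decomposition underlying the range proof is easy to sketch: write x in base b, check that every digit lies in [0, b), and confirm that the weighted sum reconstructs x (the proof system then proves each digit claim in zero knowledge):

```rust
// Decompose x into n base-b digits, least significant first.
fn digits(mut x: u64, b: u64, n: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        out.push(x % b);
        x /= b;
    }
    out
}

fn main() {
    // a range proof for 0 <= x < 3^8, over the ternary base b = 3
    let (x, b, n) = (1234u64, 3u64, 8usize);
    let d = digits(x, b, n);

    // every digit is in [0, b) ...
    assert!(d.iter().all(|&di| di < b));

    // ... and sum of d[i] * b^i reconstructs x
    let recon: u64 = d
        .iter()
        .enumerate()
        .map(|(i, &di)| di * b.pow(i as u32))
        .sum();
    assert_eq!(recon, x);
}
```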
But just so you can see an example of what it looks like to use curve25519-dalek: this is using rayon, so the computation of the ring signature for each of the digits can be done in parallel, and this is from the inner loop of the verification of one of these proofs. So you can see that, ooh, the laser pointer kind of shows up. OK, so you can see that, oh, we can subtract two points. Should we just go faster? Well, I mean, this is the last slide, so yeah. All right, well, anyway, this is an example of what it looks like. It isn't the cleanest code, but it turns out to be really, really fast, and it's not the worst to write; we did it in an afternoon. It still looks about as messy as the actual paper. And that's all. Thanks.