I'm Isis, and we're going to talk to you about implementing pure-Rust elliptic curves. For my day job I work at the Tor Project, where I write C; this work is done in my spare time, so it isn't something I'm doing for my normal day job.

This talk is for people who are lightly familiar with Rust. I'm not expecting any of you to be advanced Rust programmers, and this Rust is actually not that advanced: most of the time we're not even doing lifetimes or memory allocations or anything like that. This talk also isn't aimed at cryptographers. I'm not expecting you to have any advanced knowledge of cryptography, or actually any knowledge of cryptography at all. We're also not expecting you to really know any math; with a basic level of high-school algebra, you're going to be fine. And even if you're not fine, we're happy to answer any questions you might have after the talk. You can also email us or talk to us on Twitter or whatever.

So, what is curve25519-dalek? We'll cover that first, and then Henry is going to talk about implementing the low-level field arithmetic in Rust. Then we'll talk about some things about Rust that we found really nice and some things we think could be better, and finally we'll go over some of the other crypto we've implemented using our library.

So what is curve25519-dalek? In order to talk about our library, it's necessary to situate it within the stack it sits in. You have your application at the top, and your application uses some sort of cryptographic protocol: that could be, for example, a signature, a key exchange, or a zero-knowledge proof. Underneath that, you have an abstraction layer called a group. Normally in cryptography you want a prime-order group, which we'll touch on later, but it's just a set of elements with a prime number of things in it, basically. You can think of a group essentially like a Rust trait, in that it's concretely implemented by an elliptic curve. The things in the group, in this case, are the set of points satisfying a certain curve equation defined over a finite field. Usually, and in this case, we're talking about the field of integers modulo a prime p.

Our implementation was originally based on Adam Langley's ed25519 Go implementation, which was itself based on the ref10 reference implementation, just to give credit where credit is due.

In order to talk about what curve25519-dalek is and why we made it, it's important to cover a little of the history of other elliptic curve libraries, their designs, and some common problems. Other elliptic curve libraries tend to have no real separation between the protocol they're implementing and their implementations of the field, the curve, and the group, and this causes a lot of problems. You end up with idiosyncrasies, sometimes in the lower-level pieces of the code, that carry over into the higher-level protocol implementations: things like accidentally flipping a sign, where the protocol comes out implemented correctly because it also accidentally flips the sign in the reverse direction. You end up with the right output, but for the wrong underlying reasons. There are also problems with assumptions about how these lower-level pieces are supposed to behave, and those assumptions aren't necessarily correct if you try to use the field or group implementations to implement a different protocol.
This also results in super-excessive copy-pasta. Cryptographers have this thing where they tend to literally copy-paste each other's code around. This is exacerbated by a lot of cryptographers somehow thinking it's appropriate to ship a tarball of their code, unsigned, inside another tarball of a benchmarking library, and that's how you're actually supposed to get it as an end user. It's mind-boggling to me. Anyway, this leads to large monolithic codebases which are idiosyncratic, incompatible with one another in really hard-to-debug ways, and often highly specialized to perform only the single protocol they're implementing, which is usually signatures or key exchange. There's just no consideration that there's a whole rest of the field of cryptography, and that you might want to do something other than these two protocols.

And it gets worse. Some of the bugs I've personally seen in major, widely used cryptographic libraries (I'm not going to name any names) include using C pointer arithmetic to index an array. As a recap: in C, array indexing works both ways, so taking the sixth element of an array a, written a[5], is the same thing as writing 5[a]. In this case they were writing something like a[p + 5], where p is a pointer; this is equal to (a + p)[5] and to 5[a + p], and there are just so many ways that can go wrong. I've seen code overflowing signed integers in C and expecting the behavior to be the same or similar across different platforms; this is canonical undefined behavior, you just don't do it. And I've seen basically untyped integer arrays (in Rust, it would be something like a [u8; 32]) used, without any help from the type system, as the canonical representation of multiple things in the library, things which are mathematically, fundamentally incompatible. When you have an elliptic curve point, you can usually compress it to 32 bytes by taking either the x or the y coordinate, so that could be an array of 32 bytes; a scalar, a number, could also be 32 bytes. These are things that are not mathematically compatible. They shouldn't be interchangeable, and your type system should be protecting you against making errors like this. I've also seen pointer arithmetic used to determine both the size and the location of a write buffer. And there are still more bugs; I could keep going with a lot of horror stories of things I've seen.

So we didn't want to do this in C, obviously, and we started working in Rust. The design goals of our library were that it should be usable for other cryptographers to implement their protocols. It should be fast to write, essentially the same as writing a Sage script. It should be versatile: you shouldn't only be able to write a signature scheme or a key exchange, you should be able to write almost any type of cryptographic protocol. And it should be safe, by which we mean multiple kinds of safe: memory safe and type safe, and Rust has extra nice things if you build in debug mode, like underflow and overflow protections.
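To make the type-safety point concrete, here's a minimal sketch of the kind of separation we mean. These are simplified stand-in types, not the library's actual definitions: both wrap 32 bytes, but the compiler refuses to let you pass one where the other is expected.

```rust
// Both wrap 32 bytes, but the wrapper types make them impossible to confuse.
struct CompressedPoint([u8; 32]); // an encoded curve point (the x or y coordinate)
struct Scalar([u8; 32]);          // an integer used to multiply points

fn scalar_mul(_point: &CompressedPoint, _scalar: &Scalar) {
    // ... decompress the point, multiply, etc.
}

fn main() {
    let p = CompressedPoint([0u8; 32]);
    let s = Scalar([0u8; 32]);
    scalar_mul(&p, &s); // fine
    // scalar_mul(&s, &p); // error[E0308]: mismatched types, caught at compile time
}
```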
It should also be readable, which is a huge thing, because when you're copy-pasting around all these assembly files, and each cryptographer is making all these tiny tweaks and changes, and there are tarballs with no git history and no way to know why someone changed something, you just end up with this blob of unreadable code that takes forever to understand, and it's not very explicit. Readability also implies that it should be auditable, which is a huge thing for security-critical code. All of these are things we get from a higher-level, memory-safe, strongly typed, polymorphic programming language, a.k.a. Rust. So with that, I'm going to turn it over to Henry, who will start to explain some of the low-level field arithmetic in Rust.

Hi. Okay, so as you saw in one of the previous slides, there's this table of the different pieces, and since that's kind of large, as an example we're just going to go through one thing that we do. As part of this, we have to implement field arithmetic for the integers mod p, where p is 2^255 - 19, and as a worked example, let's see how that works. We're trying to do this using only the operations we have available on our CPU, so in order to figure out how we're going to do it, you need to answer two questions. First, what are our actual primitive operations? And second, what does multiplication look like? When you do a multiplication, you're using a fixed-size primitive type, but when you multiply integers, they get bigger. So how does that get handled?

Basically, there are four possibilities. One is to error if there's an overflow: in this case (I'm using u8 just so the numbers are small), 8 times 40 will give you a panic, and that's what Rust does in a debug build. There's also wrapping arithmetic, where you reduce modulo however much fits into the type; that's what Rust does in release mode. For some things you might want saturating arithmetic, where if the result gets too big, it just clamps to the highest value. And the fourth thing you can do is widening arithmetic, where the result of the multiplication is the next biggest type. In Rust you have intrinsics for the first three, so if you explicitly want one of those, you can pick it. I'm not aware of an intrinsic for the fourth one, but you can just write it out.

Now you might want to know: okay, what does this actually turn into? Let's suppose we're on x86-64. There's a really cool tool you can use, Godbolt's Compiler Explorer. In the top window, I've put an example of a function that does a widening multiplication of two u64s to produce a u128 output, and the bottom shows the actual assembly this produces. There are two windows because you can see that LLVM will give you a nicer instruction on newer processors: on older processors there's this mul instruction where the inputs and outputs go in fixed, predetermined registers, so you have to do a bunch of moving things around, while on newer ones you can pick where they go. But the point is that you can sanity-check that this really does turn into something reasonable.
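As a sketch (not the slide's exact code), here are the four behaviors, using u8 so the numbers stay small, plus a widening-multiply helper like the one in the Compiler Explorer example:

```rust
/// The widening multiplication from the Godbolt example: two u64 inputs,
/// one u128 product. On x86-64 this compiles down to a single widening multiply.
#[inline(always)]
fn mul_wide(a: u64, b: u64) -> u128 {
    (a as u128) * (b as u128)
}

fn main() {
    let (x, y): (u8, u8) = (8, 40); // the true product, 320, doesn't fit in a u8

    // 1. Error on overflow: checked_mul returns None here
    //    (and a plain `x * y` panics in a debug build).
    assert_eq!(x.checked_mul(y), None);

    // 2. Wrapping: reduce the product mod 2^8; 320 mod 256 = 64.
    assert_eq!(x.wrapping_mul(y), 64);

    // 3. Saturating: clamp to the largest value the type can hold.
    assert_eq!(x.saturating_mul(y), u8::MAX);

    // 4. Widening: no intrinsic, but casting up before multiplying has the
    //    same effect, and LLVM recognizes the pattern.
    assert_eq!((x as u16) * (y as u16), 320);
    assert_eq!(mul_wide(3, 5), 15);
}
```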
So, supposing we have the ability to multiply two 64-bit numbers into a 128-bit product, how are we going to implement field multiplication? If you look at the original paper, they suggest using a radix-2^51 representation. What does that mean? It means you're going to write numbers in base 2^51, so you get five coefficients. You might wonder: where does the 51 come from? Where does the five come from? These numbers are going to be basically 256 bits wide, and you could break that up into four times 64. But if you think about the discussion in the previous talk about instructions per clock and out-of-order execution, it's actually much better if not all of the operations depend on each other. If your limbs are full width, then every time you do an operation there's a dependency between them, and it'll be slower. That's why you pick the next size up, five limbs instead of four.

So we can write this in Rust as a tuple struct, and we can use this widening multiplication to implement it. How would we actually do that? Well, you can just write out the coefficients of your naive schoolbook multiplication. If we start by writing out the coefficients of the product from the low term, we get x0·y0, then x0·y1 + x1·y0, and we continue in this way. Writing the coefficients of the output in the left-hand column and the actual digits on the right, you get this nice triangular structure.

Okay, so now you'll notice that our numbers got bigger when we multiplied them, but we're supposed to be working mod p, so we'd like to reduce this back to the original size of the inputs. How do we do that? Notice that this prime has a special form: since p is 2^255 - 19, you know that 2^255 ≡ 19 (mod p). The reason is that mod p, p is zero, so 0 = 2^255 - 19, and you bring the 19 over. Why is this useful? If you write out the product we've just computed, you can see that, for instance, the z5 term has a factor of 2^255, and you can replace that 2^255 by 19. Similarly, the 2^306 term can be written as 2^51 times 2^255, which simplifies to 2^51 times 19. So you get a basically free, pretty fast inline reduction, and when you combine that with the formulas on the previous slide, the triangle below gets folded up into the upper part. You can actually trace the lineage of this technique for fast reduction mod p all the way back to the 15th century; if you're curious, it coincides with the development of early capitalism in Venice. Unfortunately, we've now moved on to late capitalism and things are not looking up, but we can still write this in Rust.

So I put some Rust code on the slide. We're implementing Mul. There's some weird lifetime stuff; that's one of the things we'll get to later, so disregard it for now. I only put it in because it's the real code. I define a little helper function that's like my own little intrinsic for doing widening multiplication. It's marked inline(always), so it just disappears. Remember that on the previous slide we had 19 times some stuff, but that stuff is going to be a u128, and instead of doing a 128-bit multiplication, it's better to just multiply by 19 beforehand. Then you just write down the formula.

Now we have a problem, which is that the coefficients c_i we're computing are 128 bits wide. They're u128s and not u64s, and remember that our original goal is to get back to [u64; 5]. So we have to reduce these c_i, and the idea is that you take each 128-bit value, keep the low 51 bits, take the high part, and add it to the next biggest coefficient. You're just carrying the value up into the next limb.
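Putting the pieces together, here's a condensed sketch of the multiplication. This follows the shape of the code on the slides, minus the operator-trait lifetime boilerplate and the debug assertions, and it assumes the input limbs are already reduced below roughly 2^52 so that none of the intermediate values overflow:

```rust
/// A field element in radix 2^51: x = x[0] + x[1]*2^51 + ... + x[4]*2^204.
pub struct FieldElement(pub [u64; 5]);

/// Widening multiply: the little hand-rolled "intrinsic" from the slide.
#[inline(always)]
fn m(x: u64, y: u64) -> u128 {
    (x as u128) * (y as u128)
}

impl FieldElement {
    /// Schoolbook multiplication mod p = 2^255 - 19.
    pub fn mul(&self, rhs: &FieldElement) -> FieldElement {
        let (a, b) = (&self.0, &rhs.0);

        // Premultiply by 19 up front, so the folded-in reduction terms below
        // stay 64-bit multiplications instead of 128-bit ones.
        let b1_19 = b[1] * 19;
        let b2_19 = b[2] * 19;
        let b3_19 = b[3] * 19;
        let b4_19 = b[4] * 19;

        // Schoolbook coefficients, with the high half of the triangle already
        // folded back in via 2^255 = 19 (mod p).
        let     c0 = m(a[0], b[0]) + m(a[4], b1_19) + m(a[3], b2_19) + m(a[2], b3_19) + m(a[1], b4_19);
        let mut c1 = m(a[1], b[0]) + m(a[0], b[1])  + m(a[4], b2_19) + m(a[3], b3_19) + m(a[2], b4_19);
        let mut c2 = m(a[2], b[0]) + m(a[1], b[1])  + m(a[0], b[2])  + m(a[4], b3_19) + m(a[3], b4_19);
        let mut c3 = m(a[3], b[0]) + m(a[2], b[1])  + m(a[1], b[2])  + m(a[0], b[3])  + m(a[4], b4_19);
        let mut c4 = m(a[4], b[0]) + m(a[3], b[1])  + m(a[2], b[2])  + m(a[1], b[3])  + m(a[0], b[4]);

        // Carry pass: keep the low 51 bits of each u128 coefficient and add
        // the high part into the next coefficient up, so everything fits in u64s.
        const LOW_51_BITS: u128 = (1 << 51) - 1;
        let mut out = [0u64; 5];
        c1 += c0 >> 51; out[0] = (c0 & LOW_51_BITS) as u64;
        c2 += c1 >> 51; out[1] = (c1 & LOW_51_BITS) as u64;
        c3 += c2 >> 51; out[2] = (c2 & LOW_51_BITS) as u64;
        c4 += c3 >> 51; out[3] = (c3 & LOW_51_BITS) as u64;
        out[4] = (c4 & LOW_51_BITS) as u64;
        // The final carry wraps around through 2^255 = 19 (mod p) again; the
        // real code then does one more short carry pass to tighten out[0].
        out[0] += ((c4 >> 51) as u64) * 19;

        FieldElement(out)
    }
}
```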
That's roughly what the carrying looks like in Rust: you construct a mask and do the carries. A nice thing about Rust is that you can rebind the same names, so there's a rebinding happening where we notify the compiler that, now that we've done this reduction, all of these c_i are going to fit into 64 bits: do whatever you want with that information. Once we've done this first pass, we've fixed all of the c_i to lie in u64s, but they're maybe not as small as we'd like, so we can just do another carry pass, and that's what that FieldElement64 reduce function does. It does essentially the same thing, but it's less complicated because you don't have to change types, and it also gets inlined. So that's actually our implementation. It's not the simplest thing, but it's not that complicated. In our actual code there are a bunch of debug assertions to make sure that all the values are the right size, that there's no possibility we're violating preconditions, that none of the intermediate values can overflow, and so on, but that's the actual code that we have. And we kind of just throw LLVM at it and see what happens.

So you might be wondering: how does that compare? I heard you had to do a lot of work to get things to be fast. Well, it turns out it's actually really, really fast. ed25519-donna is an implementation in optimized assembly; it's what Tor currently uses by default. As you can see here in red, our performance is comparable to donna, and slightly better for things like verification. And just to throw in a point of reference so you understand what the numbers are supposed to be: ring is also a Rust library, a higher-level library than ours, for implementing protocols, and ring does that by wrapping BoringSSL's implementations, which are assembly implementations, in Rust. So it's pretty, pretty fast.

So now, some things about Rust that we really like, and other things that we think could be a little bit better. Obviously, Rust's code generation is done by LLVM, and as I just showed, it's really good at generating code. It's not just good at generating fast code; it's good at generating safe code. Historically, there's been a worry that an optimizer could, in theory, break the constant-time properties of an implementation. What does this mean? People have essentially said in the past that you can't use compilers to write cryptography, that you have to write handwritten assembly, because that's the only way to control what a chip is going to do. Not only is that not entirely true (there are chips that do weird things, which I'll get into later), it's just all kinds of insane.

So what does it mean to say that code is constant time? There are these things called side channels, and a side channel is essentially a mechanism by which an adversary can learn some sort of internal program state. For cryptography, this is especially insidious, because learning even a few bits of a secret can often lead to full key-recovery attacks. A concrete example of a side-channel attack: I make a static string (I just hold down the A key for five minutes and make a really huge string), and I load it into your CPU's cache, where some of the cache is shared with other programs that are running. Now say the program you want to do side-channel analysis of has an if/then statement that branches on a bit of your key, on whether that bit is a zero or a one. If the if branch is a really small piece of code and the then branch is a huge chunk of code, you can load your giant static string into the cache, wait a little while and just chill, then try to access your giant string again, and time how long it takes. As you saw in the previous talk, hitting different layers of the cache, or hitting memory instead, results in different timings. So that's a timing side channel.
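The standard defense is to make control flow and memory access independent of secrets, for example by selecting between two values with a mask instead of a branch. Here's a tiny sketch (a hypothetical helper, not the library's API):

```rust
/// Select a if secret_bit == 1, b if secret_bit == 0, without branching,
/// so the branch predictor and cache behave identically either way.
fn ct_select(a: u64, b: u64, secret_bit: u8) -> u64 {
    // mask is all ones when the bit is 1, all zeros when it's 0.
    let mask = (secret_bit as u64).wrapping_neg();
    (a & mask) | (b & !mask)
}

fn main() {
    assert_eq!(ct_select(7, 9, 1), 7);
    assert_eq!(ct_select(7, 9, 0), 9);
}
```

In real code you'd reach for something like the subtle crate's ConditionallySelectable trait rather than rolling your own, but the principle is the same.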
There are other side channels. You can do differential power analysis, where you build up a profile of a device, through its chip, of how much power it draws as it performs certain operations. But basically, this is all bad: you want to write code that does exactly the same thing with respect to secrets, all the time. To prevent this, we've put a lot of effort in. As you saw with the carry propagation in Henry's code, there are crazy bit-shifting operators everywhere, and crypto code just tends to look weirder because of this. As for LLVM's optimizer, it turns out it's not breaking our code, as far as we can tell, on x86, and we've sat there geeking out over the generated assembly for hours and hours, way longer than I would have liked. In the future, we're hoping there could be some kind of LLVM pass or sanitizer that statically analyzes the output assembly. There's also a trick for Valgrind that Adam Langley made, where you mark secret data as uninitialized, and then if you ever try to index on that data or branch on it, the whole thing poops its pants and you know that you've done something wrong. (Audience question, inaudible.) Okay.

Rust is also capable of targeting a lot of platforms, including extremely constrained environments, using no_std. dalek supports no_std, so if you write your protocol and tell dalek to use no_std, you can present a foreign function interface that can be used in a lot of weird places. Tony Arcieri, who's somewhere around here, got ed25519-dalek, which is my signature implementation on top of curve25519-dalek, running on an embedded PowerPC chip inside of a hardware security module, and is working on getting it running in an SGX enclave. As a side note, to go back to the constant-time thing: I can't guarantee anything if you're running on PowerPC, because PowerPC is one of those weird chips I mentioned. For example, its multiply instruction will return early if you're multiplying something by zero. So I actually just can't guarantee that the crypto is constant time at all on your ancient PowerPC MacBook. Don't do that.

Filippo Valsorda had a recent blog post which was really interesting; I think it made the top of both the Go and the Rust subreddits. He has a setup where he calls Rust from Go with minimal overhead, and he used dalek as the example. Interestingly, this was three times faster than calling the pure-Go curve25519 implementation in Go's standard library.

So, there are some things we don't like. One of them is what I've been calling the Eye of Sauron, and it looks like this. It's not very ergonomic, and as someone who has, you know, dogfooded my own library, I don't really like seeing it in print. It results from the fact that Rust's operator traits take arguments of type T, not type &T. We want or need Copy for our types, because we have a lot of constants, various things on the curve that get used all over the place in different protocols (for example, the base point of the curve, which is just a point that you pick and that is special; we have a constants file, whatever). So we need Copy, and therefore we implement our operations on &T, and then you end up with code where you keep putting ampersands and parentheses around things as you build up an expression multiplying or adding things together. It gets messy pretty quickly. I'm not a compiler dev, but it might be possible to do auto-borrow for Copy types, or some other special marker where we could say: hey, I wrote the type, but I really meant a borrow of the type, don't actually copy it.
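To see why the ampersands pile up, here's a toy version of the pattern, with a stand-in Point type rather than the library's real one: implementing the operator on references costs a pair of lifetimes on every impl, and parentheses at every call site.

```rust
use std::ops::Add;

#[derive(Copy, Clone, Debug)]
struct Point { x: i64, y: i64 } // stand-in; the real type holds field elements

// The "Eye of Sauron": operating on &Point instead of Point, so that big
// constants don't get copied, means a pair of lifetimes on every impl.
impl<'a, 'b> Add<&'b Point> for &'a Point {
    type Output = Point;
    fn add(self, other: &'b Point) -> Point {
        Point { x: self.x + other.x, y: self.y + other.y }
    }
}

fn main() {
    let (p, q, r) = (Point { x: 1, y: 2 }, Point { x: 3, y: 4 }, Point { x: 5, y: 6 });
    // You can't just write p + q + r; expressions grow &s and parentheses.
    let s = &(&p + &q) + &r;
    println!("{:?}", s);
}
```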
Another thing that would be really cool, which we know is coming and we're really excited about, is const generics. We've already thought of cool ways to use (or abuse) const generics to optimize the field arithmetic. The basic idea would be to statically track the sizes of the intermediate values of these field elements, and use specialization to insert reductions when necessary. It would automatically detect: oh hey, you have three bits of carry space and you've done two adds already, let's do a reduction now, instead of forcing users to know when to do it by hand.

So next, Henry's going to talk about some of the crypto we've implemented with dalek. Just a brief overview of some of the stuff we've done. One thing you might want to do is zero-knowledge proofs, where the idea is to prove statements about some secret values without really revealing anything. One example people seem to want is discrete-log equality, where you have, say, four points, and you want to prove that A = G·x and B = H·x simultaneously, without revealing your secret value x. Implementing these proofs (for the people who care, they're Schnorr-style proofs) involves a lot of boilerplate once your expressions get more complicated. So we made a crate that has an experimental zero-knowledge proof compiler implemented in Rust macros, not procedural macros, just ordinary macros, because it's worse that way. As a user it's kind of nice, because you just write the statement, which corresponds to the example above. And what does that expand into? It turns into a DLEQ module with all the code for creating and verifying these proof objects, and it uses serde to derive serialization to and from the wire format. So that's pretty cool.
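For a flavor of what the generated code has to do, here's a from-scratch sketch of the underlying Schnorr-style DLEQ protocol. This is not the macro's actual output or syntax, just the idea, written against curve25519-dalek-style Scalar and RistrettoPoint types (and assuming the library's rand_core and digest integration is enabled):

```rust
use curve25519_dalek::ristretto::RistrettoPoint;
use curve25519_dalek::scalar::Scalar;
use rand_core::{CryptoRng, RngCore};
use sha2::{Digest, Sha512};

/// A compact DLEQ proof: c is the Fiat-Shamir challenge, s is the response.
struct DleqProof { c: Scalar, s: Scalar }

/// Hash the compressed points down to a scalar challenge (Fiat-Shamir).
fn challenge(points: &[&RistrettoPoint]) -> Scalar {
    let mut hash = Sha512::new();
    for p in points {
        hash.update(p.compress().as_bytes());
    }
    Scalar::from_hash(hash)
}

/// Prove knowledge of x with A = x*G and B = x*H, without revealing x.
fn prove<R: RngCore + CryptoRng>(
    x: &Scalar, g: &RistrettoPoint, h: &RistrettoPoint, rng: &mut R,
) -> DleqProof {
    let (a, b) = (x * g, x * h);
    let k = Scalar::random(rng);      // random blinding nonce
    let (r1, r2) = (&k * g, &k * h);  // commitments k*G, k*H
    let c = challenge(&[g, h, &a, &b, &r1, &r2]);
    let s = k + c * x;                // response; x stays hidden behind k
    DleqProof { c, s }
}

/// Verify by recomputing the commitments, R1 = s*G - c*A and R2 = s*H - c*B,
/// and checking that they reproduce the same challenge.
fn verify(
    proof: &DleqProof,
    g: &RistrettoPoint, h: &RistrettoPoint,
    a: &RistrettoPoint, b: &RistrettoPoint,
) -> bool {
    let r1 = &proof.s * g - &proof.c * a;
    let r2 = &proof.s * h - &proof.c * b;
    proof.c == challenge(&[g, h, a, b, &r1, &r2])
}
```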
And there's another thing that we did. Sure; so, there are other types of zero-knowledge statements you might want to make. In a lot of applications, where before we were proving equality, you might want to prove an inequality: you might want to say, I know an x, and it's bigger than y. The problem is that if we're working in a cyclic group, x being bigger than y is the same thing as x being smaller than y if it wraps around the group. So you have to do what's called a range proof, which says: I know an x, and it's between y and z. There are ways to do this in zero knowledge without giving away any information about x other than that it lies in the correct range. These are often used in confidential transaction systems, and we're also using one in a future anti-censorship system that we designed for Tor. That system uses a micropayment scheme we made that's embedded inside an anonymous credential, and we use the micropayment scheme essentially to store proof of good user behavior, which you can sort of spend later, as a way to filter bad actors out of a system without knowing anything about the users or being able to track them.

The basic idea is that you want to prove x is in the range zero to b^n, so you write x in base b, that is, as the sum from i = 0 to n-1 of x_i·b^i, and you prove that each digit x_i is in the range zero to b. Traditionally, the way to do this is due to a cryptographer named Schoenmakers: you write the number in binary, so b in this case is two, you prove that each digit of the number is zero or one, and then you can derive various statements about the range of the number. Verification essentially amounts to checking a ring signature on each digit's proof, and if each digit is in the correct range, the whole number is in the range. We implemented a recent construction due to Back and Maxwell, who determined that if you do Borromean ring signatures over a ternary system, you can share data between the digits, and this ends up being a lot more efficient. Borromean ring signatures is actually a pretty cool name: Borromean rings are, if you imagine something like the symbol for the Olympics, a mathematical object where you have three rings that are interlinked, and if you cut one, the whole thing falls apart. The signature scheme is named after these interlinked rings because each of the rings in the scheme contains some of the data of the other ones, or is dependent on the other ones.

Computers don't really like ternary systems. Human brains don't really like ternary systems either, and it turns out to be really not nice to implement. But just so you can see an example of what it looks like to use curve25519-dalek: this is using rayon, so the computation of the ring signature for each of the digits can be done in parallel, and this is from the inner loop of the verification of one of these proofs. You can see that the algebra kind of shows up; you can see that we can subtract two points. Should we just go faster? I mean, this is the last slide, so yeah. All right, well, anyway: this is an example of what it looks like. It isn't the cleanest code, but it turns out to be really, really fast, and it's not the worst to write; we wrote it in an afternoon, and it still looks about as messy as the actual paper. That's all, and thanks.