Okay, good morning everyone. I lost my voice, so I hope I'll make it through the entire talk. I'm going to talk about high-throughput secure multi-party computation: breaking the billion-gate-per-second barrier. This is based on a series of works with collaborators from Bar-Ilan University and NEC Japan.

Secure multi-party computation (MPC) enables parties to compute on private inputs without revealing anything but the output, and there is a huge number of potential applications. We can compare DNA samples without revealing them. We can run learning or data-mining algorithms on distributed databases, for example different hospitals collaborating without revealing their private patient data. We can run secure SQL. We can protect credentials and biometrics by, for example, splitting keys or splitting templates and then computing on them without ever bringing them together, thereby achieving higher protection. There's a lot of interest in MPC lately, and it's even being deployed now, with a few startups working on it and interest from a number of other places as well.

There are two standard models that have been considered for security. One is the so-called semi-honest model, where the adversary is rather benign: it runs the actual code it's supposed to run, but tries to learn more than it should from the transcript. This models things like inadvertent leakage, but it also makes sense in the example of the different hospitals: the only reason the hospitals won't bring their data together is that they're not allowed to because of privacy law, but they're not really suspicious of each other.
Nobody really believes the other hospital is going to actively cheat. However, in many other cases we want security against malicious adversaries, which means that even if an adversary runs arbitrary malicious code, it still can't learn anything beyond what it's allowed to. So, for example, in the case of protecting keys against break-ins, we would generally want malicious security.

Now, secure multi-party computation holds great promise. It's been studied since the late 80s, we've talked about applications for decades, and those applications are now becoming more and more important and more and more viable. But the main question is whether we're actually able to fulfill that promise: can we achieve speeds for MPC that are relevant for applications in practice? There are some applications that we can solve today, and these are things that are being done: cryptographic operations, biometric matching, DNA matching, and the like. But medium- to large-scale MPC, on large data with very large circuits, seems to be way beyond reach, especially if we want to consider malicious adversaries, because the malicious adversarial model is much, much harder to achieve efficiently. This is the question we set out to solve in this work.

So we consider a very specific setting: secure three-party computation with an honest majority, meaning that at most one party is corrupted, and we want to achieve security in that setting. It's important to distinguish between two different goals, two different efficiency measures, for these types of tasks. One is latency: how much time it takes from beginning to end. The other is throughput: how many computations we can carry out per second. Depending on the application, you will want either low latency or high throughput.
Sometimes you may have constraints on both of them. If you're looking for low-latency protocols, the so-called garbled-circuit approach is the one typically taken. It has a constant number of rounds, so even on slow networks it can be very efficient. However, the bandwidth is relatively high, which means it's not so good for getting high throughput. On the other hand, if you want high throughput, then the so-called secret-sharing approach is better. It has fast, simple operations and much lower bandwidth, but you need a number of communication rounds that depends on the depth of the circuit being computed. So on slow networks this will never be good, but on fast networks these computations can actually have relatively low latency as well and can perform very well. We're going to focus on the high-throughput setting, because again we're thinking about carrying out massive computations on large data sets with massive circuits.

With this in mind, and understanding that bandwidth is a much bigger bottleneck than computation in these types of protocols today (crypto has become so fast, with the help of Intel and others), we constructed a protocol for the semi-honest model in which each party sends only a single bit for every AND gate, and XOR gates are for free. It also has a very nice communication pattern: each party sends a bit to the next party, so communication goes around in a ring, and in between you can do computation. The operations are very simple, just a few XORs and a few ANDs per gate. You do need to generate randomness, but one AES operation suffices for every 64 AND gates, so using AES-NI this is almost for free as well.

The protocol is very amenable to parallelization because of its structure, so we utilized the Intel intrinsics and the AVX instruction set and packed many values together in a single register. The semi-honest
implementation used 128-bit registers; current Intel chips already have 256-bit registers, and the next generation even has 512-bit. With a highly optimized implementation, each core running 12,800 executions in parallel, we got very good performance. These are the results on a cluster of three mid-level servers. They're not home PCs, but they're not massive machines either: 20-core machines connected over a 10-gigabit LAN. Looking at just a single core, and actually below a 1-gigabit-per-second connection, so really a very simple setup, we're getting over 540 million AND gates per second. This scales linearly up to 10 cores, where we get over 5 billion AND gates per second. Then you get some degradation because of the networking, but at 20 cores we get over 7.15 billion AND gates per second, which translates to 1.3 million AES operations per second.

Now, this is truly high throughput; this is a very significant computation. You can think of it as an encryption machine where the key is shared, so nobody knows it. But think more generally, forget AES, that's just an example circuit: think of a large circuit computing, say, a learning algorithm on distributed data. At 7 billion AND gates per second you can compute a circuit with a trillion AND gates in under 15 minutes, even less actually.
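To make the secret-sharing approach concrete, here is a minimal Python sketch of a replicated-secret-sharing AND gate of the kind this three-party protocol is built on, together with the packing idea: each 64-bit word below carries 64 independent circuit instances, so a single `&`/`^` evaluates 64 gates at once (the AVX implementation does the same with 128-bit or wider registers). This is an illustrative reconstruction of the general technique, not the authors' exact protocol; in particular, the zero-sharing `alpha` values are drawn fresh here, whereas in the real protocol they come from pairwise AES-based pseudorandom generators, which is presumably the source of the "one AES per 64 AND gates" figure mentioned above.

```python
import secrets

MASK = (1 << 64) - 1  # one 64-bit word packs 64 independent circuit instances


def share(v):
    """Split a 64-bit word v into replicated shares: v = v1 ^ v2 ^ v3.
    Party i holds the pair (v_i, v_{i+1})."""
    v1, v2 = secrets.randbits(64), secrets.randbits(64)
    s = [v1, v2, v ^ v1 ^ v2]
    return [(s[i], s[(i + 1) % 3]) for i in range(3)]


def zero_shares():
    """Correlated randomness with alpha1 ^ alpha2 ^ alpha3 = 0.
    (The real protocol derives these from pairwise AES-based PRGs.)"""
    a1, a2 = secrets.randbits(64), secrets.randbits(64)
    return [a1, a2, a1 ^ a2]


def and_gate(xs, ys):
    """One AND gate: each party computes its local cross terms, masks them,
    and "sends" one word to its neighbour -- one bit per gate per instance."""
    alpha = zero_shares()
    # local computation: t_i = x_i*y_i ^ x_i*y_{i+1} ^ x_{i+1}*y_i ^ alpha_i
    t = [((xs[i][0] & ys[i][0]) ^ (xs[i][0] & ys[i][1]) ^
          (xs[i][1] & ys[i][0]) ^ alpha[i]) & MASK for i in range(3)]
    # each party passes t_i around the ring, yielding a replicated sharing of x AND y
    return [(t[i], t[(i + 1) % 3]) for i in range(3)]


def reconstruct(shares):
    """XOR the three distinct shares together to open the value."""
    return shares[0][0] ^ shares[1][0] ^ shares[2][0]
```

Correctness follows because XORing the three `t_i` values cancels the `alpha` masks and leaves all nine cross terms of (x1^x2^x3)&(y1^y2^y3), i.e. x AND y, while each party communicated exactly one bit per gate per packed instance.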
You can do really, really big computations. I don't have time in this talk to compare to previous work, but this is far faster than any previously reported result. It's the combination of a better protocol, with lower bandwidth and simple operations, and the engineering in the implementation, with the parallelization, that enables such speeds.

We wanted to see how this would fit into an actual application, so we considered the problem of an Active Directory breach. In Kerberos, if you get the hashed passwords you don't need anything else; there's no need to brute-force anything, because the hashed password is exactly what you need to decrypt the ticket-granting ticket you get from the server. So if the Active Directory is breached, everything in your whole organization is completely gone, and we'd like to protect against that. Our idea is to split the hashed passwords and the servers' keys between three different servers with different administrators. Then there isn't a single administrator who can steal everything from the Active Directory, and if an attacker breaches the network, he has to steal more than one administrator's credentials. We rewrote the Kerberos ticket-granting server and the client to work in counter mode instead of CBC mode: CBC mode is inherently sequential, which would be a problem for latency, whereas in counter mode the encryptions can be fully parallelized. It turns out that with all of the encryptions you need to do, of the service keys, with the user's password, and so on, you need 32 AES operations for every user login. The result we got is a latency of 200 milliseconds, which is very reasonable for a human login, especially since it's only at login time. With a single core we can support 3,000 logins per second, and with 20 cores approximately 41,000 logins per second.
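As an aside on the counter-mode point: in CBC, block i's cipher input includes ciphertext i-1, so the AES calls are forced to run one after another, while in CTR mode each keystream block depends only on the key and a counter, so all 32 AES calls of a login can be evaluated as one wide parallel batch inside the MPC engine. A minimal sketch of CTR's structure, using a SHA-256-based stand-in block function purely to keep it self-contained (the real system of course uses AES on the split key):

```python
import hashlib


def block_prf(key: bytes, x: bytes) -> bytes:
    """16-byte stand-in block function (the real system uses AES on a split key)."""
    return hashlib.sha256(key + x).digest()[:16]


def ctr_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """CTR mode: keystream block i depends only on (key, nonce, i), so every
    block-cipher call is independent and the whole keystream can be computed
    in parallel (e.g. 32 AES circuits per Kerberos login, all at once).
    Contrast with CBC, where block i's cipher input is plaintext[i] XOR
    ciphertext[i-1], forcing the cipher calls to run sequentially."""
    out = bytearray()
    for i in range(0, len(plaintext), 16):
        stream = block_prf(key, nonce + (i // 16).to_bytes(8, "big"))
        out.extend(p ^ s for p, s in zip(plaintext[i:i + 16], stream))
    return bytes(out)


# decryption XORs the same keystream again
ctr_decrypt = ctr_encrypt
```

A side benefit of CTR is that decryption is the same operation as encryption, so only the forward direction of the cipher is ever needed.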
This is enough to support the login storm of a huge organization; I don't think there are many organizations in the world that need to support over 40,000 logins per second, and this can be done on a single server. So this is a real high-throughput computation in a real scenario, and we see that MPC can actually support it at a scale far beyond what we would have thought possible until very recently.

What about malicious security? Semi-honest security is good for some applications, but arguably, for the Kerberos application it's a bit questionable: if someone breaches the Active Directory with root credentials, they could change the code that's being run on those servers, and we'd like to prevent that. It's much harder to prevent a corrupted party from somehow tampering with values, to check that the parties are behaving correctly, and in the past this has meant that protocols with malicious security are orders of magnitude more expensive than semi-honest ones, so our high throughput would just go out the window.

We use the so-called multiplication-triple approach. Essentially, you generate a huge number of multiplication triples, randomly shuffle them, open some of them, check them against each other, and so on; I don't have time to go into the details. But I'll say that a lot of the work on this protocol was optimizing every single bit that is sent, to reduce the bandwidth as much as possible. We also found something very interesting: since we're now working on massive arrays of triples, because we want high throughput, the bottleneck actually became cache misses. That was the most expensive thing in the protocol, and it essentially slowed everything to a halt. So we designed a cache-efficient shuffling method suitable for cut-and-choose, and a very optimized and tight combinatorial analysis, because this has a huge effect on the
actual efficiency of the protocol: it tells you how many triples you need to open and check, and so on. Our most optimized protocol sends only seven bits per AND gate. As Nigel said, the protocol is actually really simple; he said "moronic", but I'm just translating that to "simple", and from what I understand, in the real world a simple protocol is supposed to be a good thing, not a bad thing. We have another variant that sends 10 bits per AND gate but has a better online phase, if you want to separate an offline preparation phase from an online computation phase.

Then, on the same cluster as before, utilizing 20 cores, we get what really is a surprising result: we're able to compute 1.15 billion AND gates per second. That's actually even better than one seventh of the semi-honest protocol, because of further optimization in the implementation, and it translates to about 215,000 AES operations per second. Again, think of computing on medical data or other data at over 1 billion AND gates per second. This is where the 15 minutes comes from: a trillion AND gates translates into about 15 minutes, where semi-honest would be about two minutes for a trillion AND gates. For the offline/online variant, we can actually get over 2.1 billion AND gates per second counting just the online time, which is about 400,000 AES operations per second. This is orders of magnitude better than anything that has been done until now.

So, in summary, it is actually possible to achieve very fast rates even against malicious adversaries. We specifically looked at the three-party setting with an honest majority, but this is suitable for a number of different applications, like the key-protection application, hospitals collaborating, and so on; a lot of them can use this type of setting. With rates of around 7 billion AND gates per second for semi-honest and over 1 billion AND
gates per second for malicious security, we're able to truly deal with large computations. There's been a lot of interest in MPC, with the move to the cloud and the desire to collaborate, to carry out computations together, but the throughput bottleneck is something that has to be solved before we can actually solve real problems, and I believe that what we've done here shows that it can be done. Of course, these protocols are fully parallelizable, so adding more servers just linearly increases the throughput. So MPC can be used for much larger tasks than we thought beforehand. Thank you.

We have time for one or two questions.

I'm glad there's time for two, because my question is not really that insightful. I'm not very familiar with multi-party computation and the actual protocols, but you mentioned the malicious and the semi-honest models, and I know that the malicious model is in general what we want. When applying it, to what extent are the protocols that you built for the malicious model just extensions of the ones for the semi-honest model? Do you sometimes work directly in the malicious model, or do you always go through semi-honest to achieve malicious?

You don't always have to, but typically, in terms of the design, that's the way you do it: you take a semi-honest protocol and you add things to prevent the adversary from cheating. In this case we build very heavily on the semi-honest protocol: because you can multiply really quickly, you can generate a huge number of triples really quickly, and then you just check them. So it's based very heavily on it, but it also turns out to be quite a simple protocol, which means that deploying it is not difficult.

What will happen if the servers are in different jurisdictions?
Then the latency is big, right? So this sort of protocol is not suitable for that; you would want to use something like a garbled-circuit approach instead. But if you wanted a high degree of separation, you could think of, for example, having one server in Azure on the east coast and one in Amazon on the east coast. You would then still have low latency, but a high degree of separation between the different servers. So there are things you can do to achieve that and still get low latency.

Yeah, my question was exactly the same, because it seemed to me you're running this in the same cluster, so you're not going to get so much independence between the servers.

Well, the independence you can get by having different administrators on the servers, so that no single administrator controls everything; that's already a reasonable separation. But the 0.13-millisecond ping time, I think you can get that within the east coast; you can set up between different cloud providers and get that sort of ping time as well. It's not completely unrealistic.

How configurable are these results? How much of the optimization was done because you were specifically computing AES, or how easy is it to just plug in whatever circuit?

You can plug in whatever circuit. The only thing is that you want to run the same circuit many times together. But even if you're doing, for example, learning, you'll often do exactly that: you'll be running the same thing on many pieces of data and then continuing through it. So it's not at all optimized for AES.