So I'll be talking about how to garble RAM programs in the parallel RAM model, and this is done in a fully black-box way, depending only on one-way functions. This is joint work with Rafail Ostrovsky; I'm Steve. I'll first motivate the problem, then give some review, definitions, and what we achieve; then I'll go into the main construction, give you a flavor of the proof, and finally conclude.

So just to motivate, and I think we've seen some of this already in the previous two talks: what about computing on encrypted data? If Alice here is a client and she wants to store some data on a server that Bob holds, she wants the server to compute some function f(x), but she doesn't necessarily want the server to learn her input, her function, or her data. And this is done on, perhaps, a really big server: the server can have lots of cores and can execute lots of things in parallel, multi-threaded programs, so we're looking at things in the parallel RAM model.

So what can we do? The overall motivation is secure RAM computation. Secure RAM computation gives you persistent memory: you can store things, you can retrieve things, and they stay there. There is a potentially exponential gap: binary search, for example, is a lot quicker than running one big circuit over everything. You can have input-dependent running time, and furthermore, RAM is a natural model for programs. The first generic compiler was suggested by Ostrovsky and Shoup, and there has been a long line of work on secure RAM computation. Interactive secure RAM computation was, for example, first implemented in the work of Gordon et al., and there has been a long line of work on secure RAM computation in the interactive model. But what about the non-interactive model? In this work, we look at the non-interactive model, we make it parallelizable, and we make only black-box use of an underlying one-way function or PRF. The reason black-box is attractive, in addition to black box being the new black, is that if you wanted to implement this using, say, AES, you could leverage hardware AES or other hardware techniques; with a black-box construction you never need the underlying code itself. So that's the overall subject of my talk: take a RAM program, make it black box, make it parallel, and garble it.

Okay, a couple of quick reviews; you've seen this in the previous talks, so I'll go over it quickly. Garbled circuits are a method of taking a circuit C and garbling it into some garbled form C'; you can take an input x and garble it into some garbled input x'; and then there is an evaluation algorithm. The correctness property says that evaluating the garbled circuit on the garbled input gives the same result as evaluating the clear circuit on the clear input, and privacy says that you should learn nothing beyond this.
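To make the two properties concrete, here is one standard way to write them down; this is my own hedged rendering, since the talk only states them informally:

```latex
% Correctness: garbled evaluation matches plain evaluation.
\mathsf{Eval}(C', x') = C(x)
  \quad\text{where } (C', k) \leftarrow \mathsf{Garble}(1^{\kappa}, C),\;
  x' \leftarrow \mathsf{GarbleInput}(k, x).
% Privacy (simulation-based): the garbling reveals nothing beyond the
% output and allowed leakage such as the circuit size:
(C', x') \;\approx_c\; \mathsf{Sim}\bigl(1^{\kappa}, 1^{|C|}, C(x)\bigr).
```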
What about oblivious RAM? This is the OT and ORAM session, so I should tell you a little about oblivious RAM and why it's important in this context. Oblivious RAM helps you hide the access pattern of the program. If you want to make sure the server learns nothing about your computation, you want to hide the fact that, for example, your program takes this secret if-branch as opposed to that else-branch, because that might depend on secret data. Oblivious RAM was introduced by Goldreich and Ostrovsky, and there have been many, many subsequent works in various models; basically, what it offers is that given a RAM program P, you can compile it into a functionally equivalent program P' with an oblivious access pattern. And because we're looking at parallel garbled RAM: what about parallel oblivious RAM? In the parallel RAM model there have been a couple of works: at TCC last year, Boyle, Chung, and Pass and Chen, Lin, and Tessaro introduced oblivious parallel RAM; later last year at Asiacrypt, Dachman-Soled et al. looked further at parallel RAM in the network RAM model; and some upcoming work by Nayak and Katz looks at improving the efficiency of oblivious parallel RAM. So, given all this nice research on parallel oblivious RAM, let's see how to do it non-interactively.

In the garbled RAM world — the plain RAM model, not necessarily parallel — garbled RAM was introduced in a work with Rafi at Eurocrypt 2013. There have been many subsequent works in this area, both in the regime of minimal assumptions, i.e., one-way functions or PRFs, and in interesting relations to compactness, reusability, and iO. Last year at TCC, Boyle, Chung, and Pass also introduced the notion of parallel garbled RAM, which they achieved using IBE — a construction where you actually have to run the code of the IBE scheme inside your circuits. What we're going to do in this talk is show how to get parallel garbled RAM using only black-box use of any PRF.

Okay, so what is parallel garbled RAM, and what can you do with it? It looks very similar to garbled circuits, except now you have a parallel RAM program P, an input x, and some data D that you also want to garble, and there are algorithms to garble each of these. Again, you want the correctness property that garbled evaluation matches plain evaluation, and you also want a security property saying that everything can be simulated. I'm going to give you the weaker definition here, where the simulator knows not only the security parameter, the size of the database, the running times of the programs, and of course the outputs of the programs, but also the memory access pattern. For full security you don't give it the memory access pattern, but to keep things simple for this talk, let's pretend the simulator has it; in the full construction, we deny this to the simulator. In terms of security, the simulator should generate a simulated garbled database, garbled programs, and garbled inputs that are indistinguishable from the real ones.
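Written out, the weaker (access-revealing) simulation guarantee looks roughly like this; the exact interface is a sketch on my part, since the slide's notation isn't reproduced in the transcript:

```latex
% Garbled database, programs, and inputs versus the simulation:
(\tilde{D}, \{\tilde{P}_i\}, \{\tilde{x}_i\})
  \;\approx_c\;
\mathsf{Sim}\bigl(1^{\kappa},\, |D|,\, \{1^{T_i}\},\, \{y_i\},\, \mathsf{MemAccess}\bigr)
% where y_i = P_i(D, x_i) and T_i is the running time of P_i.
% Full security is the same statement with MemAccess removed.
```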
Okay, now that I've given you the definition, let me state our main theorem: assuming the existence of one-way functions, there exists a fully black-box construction — black box both in the construction and in the proof — of an M-processor garbled PRAM scheme for arbitrary M-processor PRAM programs. So if you have a parallel program that uses M processors, you get a garbled version that also uses M processors, with polylog overhead in the size of the database, the size of the input, the running time, and the number of processors: a polylogarithmic multiplicative overhead on the database size, the input size, and the running time. Unfortunately, it is not compact in the garbled program size: the garbled program is proportional to the running time. We would love a solution proportional to the size of the program rather than its running time, but this is what our construction gives; later on I'll mention the open problem of getting this succinctly.

Okay, now let me jump into the main construction. The overview: start with an arbitrary PRAM program and show how to garble it. First, I apply an oblivious parallel RAM compiler that provides a uniform* access pattern. I put a star on "uniform" because we don't need it to be literally uniform; as long as the access pattern can be bounded using Chernoff bounds, we're fine. You apply this compiler and get a uniform-access oblivious PRAM program. What next? Then we apply a garbled-tree strategy, which I'll describe in a bit, and from it we get a parallel garbled RAM. This is where a lot of the magic is hiding, so that's what I'll focus on for the rest of the talk.

As a starting point, let's look at the garbled-tree idea, which was used in a couple of previous papers and has been quite useful. Both in the previous talk and in Nico's talk yesterday, you saw the usefulness of having a tree of garbled circuits and evaluating down one of its paths, and it also has interesting and unexpected connections, for example to IBE. So let's review what the garbled-tree construction looks like; I'll give you the flavor of the version introduced in the GLO work. There, in order to do garbled RAM — not garbled parallel RAM, just garbled RAM — you have a bunch of garbled CPU steps, and you have garbled memory organized in tree fashion, where each node contains a whole bunch of garbled circuits, and these garbled circuits can talk to each other: each circuit in a node can speak the language of the other circuits in that node as well as of a couple of the child circuits. If you only wanted to garble one CPU step, you would just have a tree with one circuit per node; you would go down one path, and that would be the end of it.
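Here's a minimal sketch of that single-read traversal, with hypothetical names and layout (Node, circuits, and so on are mine, not the paper's): each node along the path consumes one garbled circuit, which hands its keys to the matching child.

```python
# Hedged sketch of one read down the garbled tree (hypothetical layout,
# not the paper's actual data structures).

class Node:
    def __init__(self, left=None, right=None, leaf_value=None):
        self.left, self.right = left, right
        self.leaf_value = leaf_value
        self.circuits = [object()]  # stack of one-time garbled circuits

def read_path(root, address_bits):
    node = root
    for bit in address_bits:
        node.circuits.pop()  # consume one circuit at this node; in the real
        # scheme its output carries the keys ("the language") for the next
        # circuit at the chosen child
        node = node.right if bit else node.left
    return node.leaf_value  # the garbled memory cell at the leaf

# Example: depth-2 tree, reading address 10 (bits [1, 0]) returns leaf "c".
leaves = [Node(leaf_value=v) for v in "abcd"]
tree = Node(Node(leaves[0], leaves[1]), Node(leaves[2], leaves[3]))
assert read_path(tree, [1, 0]) == "c"
```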
But if you have multiple CPU steps, then (a) you need some way to link these CPU steps together, and (b) because you don't know a priori which paths you're going to take, you have to stuff enough circuits into each node so that you never run out. This is called the overflow problem. Looking at the combinatorics: if the parent holds i circuits, you want each child to be able to speak to roughly half of them, about i/2, because with a uniform access pattern, half the time you go left and half the time you go right. So you expect each child to need roughly i/2 circuits, but when you throw balls into bins they don't always land exactly evenly, so you need a little buffer of delta to make sure you don't run out. That's the overall combinatorial shape of the garbled tree; now let's see how we actually go down a path. The circuit logic is really, really simple. Suppose I want to read some location. I have the key for the root, and the memory location is encoded as part of the input to the garbled circuit; all the garbled circuit does is ask: do I go left or do I go right? It looks at the i-th bit of the address — the root looks at the first bit — and if it's zero I go left, if it's one I go right, and so on down the tree. So the logic of these circuits is quite simple: figure out whether to go left or right, pack up the keys you need, and pass them down to the next circuit.

Okay, that's an overview of the garbled RAM scheme of GLO; what about parallelizing it? One thing we can do is just make all the circuits wider: instead of reading one memory location, we want to read M memory locations within one step, so make everything wider. Here are M CPU circuits that want to read M locations within one single step: widen the circuits and carefully use a Chernoff bound to size them. That idea more or less works, and I'll go into where the subtlety comes in and why it doesn't exactly work as stated. Each CPU has a uniformly random location it might want to read; these locations might still collide, but existing works show how to guarantee M unique locations per parallel read. So here are the details of doing this in parallel. The root circuit has inputs for M keys, where M is the number of parallel CPUs, and it routes where each of these keys goes; you expect half of them to go left and half to go right, and that's great. Well, let's first do the stupidest thing possible: widen every single circuit by a factor of M. That certainly guarantees enough circuit space to hold all the keys, but it's not great, as we'll see in a moment.
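Before sizing this properly, here is what the routing itself looks like; a minimal sketch with hypothetical names, just a partition by one address bit:

```python
# Hedged sketch: a node at tree depth `level` splits the parallel requests
# between its children according to one address bit (illustrative only).

def route(requests, level):
    left, right = [], []
    for bits in requests:  # each request is its list of address bits
        (right if bits[level] else left).append(bits)
    return left, right  # with uniform addresses, each side gets about half

# Example: 4 parallel reads split on the first address bit.
left, right = route([[0, 1], [1, 0], [0, 0], [1, 1]], level=0)
assert len(left) == 2 and len(right) == 2
```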
It's not great because it would increase the overall memory size to M times N, when we actually want a polylog dependency; that's too much. Okay, so how about widening each circuit by the expected number of keys that will pass through it? If you route reads left and right, half go left and half go right, and that more or less works — it more or less gets you what you want, except for a subtle interaction with the circuit-consumption rate of the underlying GLO scheme. The issue is that when the number of parallel processors is comparable to poly log N, you consume a lot of the lower circuits, and the original GLO construction wasn't prepared to handle all of these parallel consumptions. So we need to be a little careful here, and use techniques from occupancy bounds and concentration bounds to designate a special level where we switch strategies. We can't use the naive strategy — the root circuit has size M, the next level M/2, then M/4, and so on straight down; done exactly like that, the analysis doesn't work out. But if you introduce this special level, things work out.

Okay, so let's do this carefully: choosing a level. What kind of level do I want? I want a level such that within a single parallel step, no more than B access paths go through any single node at that level. Say it's this level: what is the probability that more than B access paths go through exactly this node? I can bound this probability, and I want it to be negligible. Obviously, if I chose the lowest level, the bound is immediate — at a leaf there's at most one path — but I don't want the lowest level. I want a level that is still high enough that it has fewer than M nodes, where M is the number of parallel processors. Splitting into higher levels and lower levels this way exactly handles the subtle issue of M being small versus large compared to the log N factor.

Now that we have this level, what do we do? Near the leaves, we basically instantiate one copy of GLO per subtree. And above the level? Above the level, we use the strategy from before: the root holds M keys, the next level M/2 keys, and so on, so you expect roughly M/2^i keys at level i. Again, we need a similar half-plus-epsilon factor, plus a small overflow buffer Q. Adding these factors guarantees that the probability that too many CPUs try to read down one path is also negligible. These are the techniques we use to avoid the overflow problem.
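As a rough parameter sketch — my own hedged rendering, where the constant c, the slack eps, and the buffer Q are illustrative placeholders rather than the paper's exact values — pick the level around log(M)/c and size the circuits above it with the half-plus-epsilon decay plus a buffer:

```python
import math

# Hedged sketch of the parameter choices described in the talk.

def choose_level(M, c=2):
    # Special level B ~ log2(M)/c: high enough that it has fewer than M
    # nodes, low enough that > B paths through one node is negligible.
    return max(1, math.floor(math.log2(M) / c))

def width_at(i, M, eps=0.1, Q=16):
    # Above level B, a node at level i expects ~ M/2^i keys; widen with a
    # (1/2 + eps)^i decay factor plus an additive overflow buffer Q.
    return math.ceil(M * (0.5 + eps) ** i) + Q

B = choose_level(M=64)
widths = [width_at(i, M=64) for i in range(B + 1)]
# Below level B: one GLO instance per subtree rooted at that level.
```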
So, putting it all together, how do we perform one of these parallel reads? Near the top, you activate all paths in parallel down to the calculated level: you read the root, then the children, then the grandchildren, and so on, all the way down to the calculated level. Below this level, you run sequentially in parallel. By that I mean the following: say I want to read these four locations, two inside this subtree and two inside that subtree. Within each subtree, I read sequentially: first this location, then that one. But because there are fewer than M subtrees, I can handle each subtree in parallel. So the first parallel-sequential read handles one location per subtree, and the second one reads the remaining two locations — again in parallel across the subtrees, but sequential within each subtree. If you do that, you can retrieve all the keys and still not run into the overflow problem, and everything's great.

Okay, let's take a look at how this works out; I'll give you a flavor of the proof sketch. Building the simulator: the simulator for full security gets only the output and the running time of an arbitrary PRAM program. The first thing we do is feed these through the oblivious parallel RAM simulator, which gives us a simulated memory access pattern alongside the output and running time. Now that we have a simulated memory access pattern, we can feed it into the unprotected-memory-access simulator, and that's exactly what we do. If you go into the details in the paper, we perform a sequence of hybrids similar to GLO, where we replace the garbled circuits on the tree with simulated ones. You walk through the sequence of hybrids, you get a simulated garbled tree and a simulated memory, and now everything is simulated and indistinguishable from the real garbled program and database.

The key technical point is the overflow problem, and this is the main combinatorial challenge we had to deal with in this paper: how do we choose the bounding level B, and how do we analyze the resulting overhead? We still want low overhead, but at the same time we need to choose this level B. Spoiler alert: we choose B to be roughly the logarithm of the number of processors divided by some constant c. So let me give you a sketch of what the combinatorics look like. Why is the overhead polylog? We promised polylog overhead, so let's see what it is. On the top half, you activate all paths in parallel down to level B, and if you let w_i denote the circuit size at level i, you can estimate the total cost: each w_i is basically M keys times a (1/2 + epsilon)^i decay factor, plus the additive key-buffer term. Bounding the sum, you pick up a factor on the order of e^{2·B·epsilon}, and if you choose B and epsilon correctly, you can show this is in fact polylogarithmic.
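A hedged reconstruction of that top-half estimate — the precise constants are in the paper; this is just the shape of the calculation:

```latex
% Levels 0..B, with 2^i nodes of width w_i at level i:
w_i \approx \Bigl(\tfrac{1}{2} + \epsilon\Bigr)^{i} M + Q
\quad\Longrightarrow\quad
\sum_{i=0}^{B} 2^{i} w_i
  \;\le\; M \sum_{i=0}^{B} (1 + 2\epsilon)^{i} + 2^{B+1} Q
  \;\le\; M\,(B{+}1)\,e^{2\epsilon B} + 2^{B+1} Q .
% With B ~ (log M)/c and a suitable eps, e^{2 eps B} is polylogarithmic.
```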
And what about the levels below? Each subtree is activated in parallel, and the cost is just 2^{B+1}, the number of subtrees, times B, the number of access paths you need to evaluate per subtree, times the cost of GLO. And because we chose the level B so that the number of nodes at that level is less than M, you can assign one processor to each of these parallel executions. Okay, great. So why won't the paths overflow the circuits? Well, I gave you a flavor of that during the talk; if you want the details, check out the paper. And with that, I'll conclude. What are the open problems here? One great open problem is how to get succinctness in terms of program length: there is a long line of work on succinctness, reusability, and iO, all from stronger assumptions. Can we get this from weaker assumptions? That's one question. And, as I said, succinctness is one thing you can get from stronger assumptions, but what else can you get? For example, if you assume the computational Diffie-Hellman assumption and combine it with the garbled-tree idea, you can get IBE, as we saw in Nico's talk yesterday. And what about other distributed models of RAM computation? PRAM is not the only one; there are many other models. What about those? Okay, and just to wrap up: in this talk you saw a way to garble RAM programs in a black-box, parallel way, and this allows for non-interactive secure RAM computation in parallel. All right, thanks.