In the first session of the second day, in our track, we have four talks. The first one is on the depth of oblivious parallel RAM, by Hubert Chan, Kai-Min Chung, and Elaine Shi, and Hubert is the speaker. That's my introduction.

So in this talk we're going to discuss the depth of oblivious parallel RAM, or OPRAM. Of course I have lots of things to cover, so let's just start from the basics. What is an ORAM? An ORAM is essentially a very useful construct. There's a part that's secure, which we can think of as the CPU or the client, and it has some constant-sized secret memory. Then there's a larger memory that is untrusted, usually the hard disk or the server. The adversary can observe what's going on in that part; it is insecure.

Now, for this untrusted part, unfortunately, if you want to secure the data there using just encryption, it's not enough. Here's an example of why that's true. Suppose I'm going to store some sensitive medical data on some server, and I do nothing special: I just lay it out so that this part is related to breast cancer, this part to the liver, and this part to the kidney. Then even though the blocks are encrypted, if the adversary sees that the client accesses one region very frequently, it can infer that the client is looking for something related to, say, the kidney. So if you just use encryption, the layout of the memory is naive (you lay the blocks out sequentially), and the adversary can somehow learn which part stores which kind of data, you have a privacy invasion.

ORAM has actually been studied for a while. Abstractly speaking, there are some logical addresses; essentially an ORAM is used to implement a data structure like an array, so the logical addresses are typically indexed from 0 to n minus 1. And you have some storage on the server. Think of the data as a long array; the ORAM wants to emulate this RAM. The storage on the server is O(n) blocks, and it is insecure in the sense that the adversary can observe which part of the server you are accessing. The unit of access is a block; the content of each block cannot be observed, but the adversary can see which locations on the server you access. A more powerful adversary could even distinguish whether you are doing a read or a write, but let's treat that as a minor point for now.

So what does it mean for an ORAM to be secure? This notion was introduced more than 20 years ago. Essentially, for any two request sequences of equal length, the access patterns the adversary observes on the server should be indistinguishable. Of course, if you have a very short access sequence and a very long one, you cannot hope to obfuscate between the two; otherwise the short program would become very inefficient. So the ORAM notion only asks for security between two request sequences of equal length: the adversary, observing the access pattern on the server, cannot decide which world it is in.

There are two notions of security. Statistical security basically means that, over the randomness used in your implementation of the ORAM, the distribution of access patterns is exactly the same for the two sequences. That's statistical security.
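As a concrete illustration of this security notion (my own toy example, not a construction from the talk), here is a minimal sketch of the trivial linear-scan ORAM: every logical access touches every physical location in the same fixed order, so the observable access pattern is identical for any two request sequences of equal length. The encryption is a toy re-randomizing placeholder, not real cryptography.

```python
import os

def toy_encrypt(value: bytes) -> bytes:
    """Toy re-randomizing 'encryption': ciphertext = pad || (pad XOR value).
    A stand-in only; its job here is just to make rewritten blocks look fresh."""
    pad = os.urandom(len(value))
    return pad + bytes(a ^ b for a, b in zip(pad, value))

def toy_decrypt(ct: bytes) -> bytes:
    half = len(ct) // 2
    return bytes(a ^ b for a, b in zip(ct[:half], ct[half:]))

class LinearScanORAM:
    """Trivially oblivious RAM: every access reads and rewrites ALL n blocks,
    so the adversary sees the same physical pattern whatever is requested."""

    def __init__(self, n: int, block_size: int = 16):
        self.server = [toy_encrypt(bytes(block_size)) for _ in range(n)]

    def access(self, op: str, addr: int, data: bytes = None) -> bytes:
        result = b""
        for i in range(len(self.server)):        # scan every physical location
            block = toy_decrypt(self.server[i])
            if i == addr:
                result = block
                if op == "write":
                    block = data
            self.server[i] = toy_encrypt(block)  # rewrite, touched or not
        return result
```

The price of this trivial scheme is an O(n) bandwidth blowup per access, which is exactly what the constructions discussed next try to beat.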
For computational security, we only need to be secure against polynomial-time adversaries. For example, you are allowed to use a PRF, a pseudorandom function, in the design of the ORAM. And there's this notion of failure. Here failure is not a security failure; it means that when I shuffle blocks around on the server, some blocks might be lost. Obviously I want the failure probability to be small, so in this talk the ORAM will have some very small failure probability, with which some blocks might actually get lost.

So that's the sequential version, which has been studied for a while. In this talk, we're doing the parallel version. That means there is more than one client; you can imagine many researchers trying to access this database. The notion of security is similar: the adversary should not be able to distinguish between batched request sequences. If there are, say, N clients, then at each time step they issue N requests, and the adversary cannot distinguish whether they are accessing the same address or different addresses. When you think about it, this is actually quite different: you have to make sure that even if two clients collide, meaning they are actually accessing the same data, the adversary should not be able to tell. That is the challenging part.

This notion of OPRAM was introduced a few years ago. In this model, say there are N clients, and at each time step each of them requests a logical block; it's the parallel version of RAM. We can ask how many clients makes the problem interesting; let's say the number of clients is comparable to the memory size N, which will be an interesting case. For the underlying PRAM, we can assume the clients are of course allowed to do concurrent reads, and if they do concurrent writes, there is some underlying rule to resolve the conflict. But when we implement this PRAM obliviously, we are more restricted: we do exclusive reads and exclusive writes. This is actually the most demanding configuration, because you are implementing something that allows concurrent access using something that is exclusive. So we can think of the PRAM as the abstraction, and the OPRAM as the actual oblivious implementation.

All right, so what is a good implementation? There are some performance metrics; these are the metrics that have been considered before. First, total bandwidth, or total work. In the original PRAM model, if you lay out the memory in the naive way, each CPU needs only a constant number of block transfers to access a block. If you want to do it obliviously, you might have to pretend to access blocks you don't need, or shuffle blocks around on the server, and that introduces extra accesses. So the blowup is defined as the total number of blocks accessed by the CPUs in each time step, divided by N, because with N CPUs the naive way needs to access only N blocks. That's the bandwidth blowup.

Now we're doing a parallel RAM, so what is the parallel runtime? It's the number of PRAM steps using N CPUs. So for the OPRAM, you have N CPUs in the PRAM abstraction, and in the implementation you are also allowed N CPUs. That's the parallel runtime.
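To make these two metrics concrete, here is a minimal bookkeeping sketch (the names and interface are my own, not the paper's) for an OPRAM simulation serving one batch of logical requests.

```python
from dataclasses import dataclass

@dataclass
class BlowupMeter:
    """Tracks the two metrics for serving one batch of n_clients requests.

    Bandwidth (total work) blowup = physical blocks accessed / n_clients,
    since the insecure baseline touches exactly one block per client.
    Parallel runtime blowup = number of physical parallel steps, since the
    PRAM abstraction serves the whole batch in a single step.
    """
    n_clients: int
    physical_accesses: int = 0
    parallel_steps: int = 0

    def record_round(self, blocks_per_cpu: list) -> None:
        # One physical round: all CPUs act simultaneously.
        self.physical_accesses += sum(blocks_per_cpu)
        self.parallel_steps += 1

    def bandwidth_blowup(self) -> float:
        return self.physical_accesses / self.n_clients

    def runtime_blowup(self) -> int:
        return self.parallel_steps

# Hypothetical trace: 4 CPUs each touch 8 blocks per round, over 3 rounds.
m = BlowupMeter(n_clients=4)
for _ in range(3):
    m.record_round([8, 8, 8, 8])
print(m.bandwidth_blowup(), m.runtime_blowup())  # 24.0 3
```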
Previously, there were two main ORAM approaches, and both can be generalized to the parallel version. First, the hierarchical framework, which is quite old. I won't have too much time to go into it; just understand that each of these arrays represents a hash table, and you have to use a pseudorandom function in this approach, so the hierarchical approach achieves computational security. Then there's the tree-based approach, which was discovered a bit later. With this approach we can typically achieve statistical security; we don't need a pseudorandom function in the tree-based approach.

All right, so what are the best results for ORAM, the sequential version? The best blowup achieved by hierarchical ORAM is O(log^2 N / log log N). This is for general block size, which is at least log N bits, because a block needs at least enough bits to store an address. So for block size at least log N bits, that's the blowup. For tree-based ORAM with this general block size, the blowup is actually O(log^2 N). But if the block size is really large, meaning a block can store that many bits, the bandwidth blowup of tree-based ORAM is actually O(log N), which matches the known lower bound.

Now, what about the previous results for OPRAM? When the parallel version of ORAM was first introduced, the constructions had blowups that do not match the performance of the sequential version. Very recently, after a line of work I was involved in on tree-based OPRAM, the blowup matches exactly the sequential version; later in this same session, my co-author will talk about this.

All right, so now we have OPRAMs matching the performance of sequential ORAM. What, then, is this work about? This work is about another performance metric: the depth of the parallel algorithm. When you think about it, since there are N CPUs implementing this OPRAM, what we have here is a parallel algorithm. In the theory of parallel algorithms, there is this question: if you allow more CPUs in the computation, what is the minimum number of parallel steps? You can represent the algorithm as a graph that indicates the data dependencies, and in such a representation the depth is simply the number of layers in the graph. That represents the minimum number of parallel steps required if you have an unlimited number of CPUs. So we're going to investigate: if you implement an ORAM, what is the depth of the implementation?

In this paper, we have several interesting results. We show that for statistically secure OPRAM, the total work blowup is the same as the sequential version (and as the existing OPRAM), but for the depth we can get something like O(log N log log N). For computational security we also have a variant, but I won't have much time to talk about it. And if the block size is really large, what's interesting is that we get quite small depth, O(log N), but we blow up the total work by a little bit: before, for large block size, we had only an O(log N) blowup, but in order to decrease the depth, we have to suffer slightly in the total work blowup.
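Since depth is the central metric here, this is a small self-contained sketch (toy code, not from the paper) that computes the depth of a dependency graph as the number of layers, that is, the minimum number of parallel steps with unlimited CPUs.

```python
from collections import defaultdict, deque

def dag_depth(edges: dict) -> int:
    """Depth of a dependency DAG: group nodes by longest distance from a
    source; the number of layers equals the minimum number of parallel
    steps with unlimited CPUs (each node is one unit-time operation)."""
    indeg = defaultdict(int)
    nodes = set(edges)
    for u, vs in edges.items():
        for v in vs:
            indeg[v] += 1
            nodes.add(v)
    level = {u: 1 for u in nodes if indeg[u] == 0}  # sources form layer 1
    queue = deque(level)
    depth = 0
    while queue:
        u = queue.popleft()
        depth = max(depth, level[u])
        for v in edges.get(u, []):
            level[v] = max(level.get(v, 0), level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return depth

# Example: a+b and c+d can be computed in parallel; the final sum depends
# on both, so the depth is 2 even though there are 3 additions in total.
print(dag_depth({"a+b": ["s"], "c+d": ["s"], "s": []}))  # -> 2
```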
So those are the upper bound results. You can observe that there is a log N term in the depth, and it's actually necessary. That's because we want to obfuscate, in each time step, whether the N clients are accessing the same block or very different blocks. I'll come back to why this causes a lower bound on the depth.

All right, those are the highlights, and I don't have too much time, so let me just tell you what to look for if you're going to read this paper. There are some technical parts, but to be honest, there are three things I want you to take away. That's the take-home message.

The first thing is that we're using tree-based ORAM. If it's the first time you see it, you'll only get a very brief idea here. In tree-based ORAM you have a fetch phase, which reads a path from the tree; this part can easily be made parallel. And there's a maintain phase, in which you remove the block you just read, update it, and put it back; how to parallelize these phases was shown in the earlier tree-based OPRAM papers. What you actually have to understand about tree-based ORAM, and the key point here, is the invariant: in this tree, for a given logical block, you have a position map which records a leaf, and the invariant is that this block must be found on the path from the root to that leaf. And whenever the block is accessed, it gets mapped to a fresh random leaf for the next time.

Now the issue is this: if you can store the position map on the client side, that's good. But what if you cannot store it, because the client has only about one block of memory, so it can only store a limited amount of information? Then you have to store the position map recursively, like this. And you can see why this is bad for the depth: in order to access things at the top level, you somehow have to sequentially pass data between adjacent recursion levels. So if the recursion depth is D, each level incurs some depth, and the total depth will be D times the depth incurred by each level.

Naively, the intuition is that when you route between two levels, there are about N results fetched at this level and N requests at the next level, and in order to maintain obliviousness, this routing between adjacent levels has to be oblivious. That incurs depth O(log N). The other source of depth is that when you read a path, the length of the path is O(log N), and to read something useful (there's only one block on the path that is useful), selecting it automatically takes depth O(log log N). This part is somehow not avoidable, but what this paper tries to avoid is the O(log N) part.

So the first idea is that we do an offline and online routing trick. The trick is this: to compute the oblivious routing, we need depth O(log N). But fortunately, this O(log N)-depth part can be done offline, in parallel over all the recursion levels. The reason is that even though we don't have the fetched information yet, if we pretend for a moment that we do, we can already build up some kind of random routing permutation. So this part, even though it takes depth O(log N), can be done in parallel across the recursion levels.
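Here is a minimal sketch of that offline/online split (my own toy framing; the paper's actual mechanism is more involved). The expensive permutation preparation runs concurrently for all recursion levels, while the sequential online pass only applies each prepared permutation.

```python
import random

def offline_phase(num_levels: int, batch_size: int) -> list:
    """Offline: for every recursion level, prepare a random routing
    permutation before the fetched contents are known. Realizing each
    permutation obliviously costs depth O(log n), but all levels can do
    it concurrently, so this cost is paid once, not once per level."""
    return [random.sample(range(batch_size), batch_size)
            for _ in range(num_levels)]

def online_phase(perms: list, batches: list) -> list:
    """Online: the levels are inherently sequential (each level's fetched
    results feed the next level's requests), but each step now only applies
    its precomputed permutation, which is cheap in depth. Revealing the
    permutation is fine: it is uniform and independent of the addresses."""
    for perm, batch in zip(perms, batches):
        batch[:] = [batch[perm[i]] for i in range(len(batch))]
    return batches

# Toy usage with 3 recursion levels and batches of 4 requests.
perms = offline_phase(num_levels=3, batch_size=4)
batches = [[f"lvl{d}-req{i}" for i in range(4)] for d in range(3)]
print(online_phase(perms, batches))
```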
Then, when the actual routing happens, you have acquired the information, and you can do the routing. At this point the random permutation is revealed to the adversary, but that is okay, because the permutation looks uniformly random. So this is the trick by which we don't need to suffer depth O(log N) at each recursion level; we only pay about O(log log N). That's the first idea.

The second is for the maintain phase. If you are familiar with Path ORAM or Circuit ORAM, the maintain phase needs to do a linear scan along a path of length O(log N). If we do that directly, it takes depth O(log N). Here we use a very simple divide-and-conquer approach; the details are in the paper, but the conclusion is that we can reduce the depth from O(log N) to O(log log N). We do have to increase the total work slightly; that's the cost of having low depth.

Okay, so the two things I just talked about relate to the upper bound. Now for the lower bound, you can ask: why is there this log N term, why should there be log N depth? Remember that the obliviousness property has to satisfy this condition: when we have N concurrent requests, the adversary should not know whether the N concurrent requests are accessing distinct blocks or the same block. In general, we need to hide the partition induced by the addresses. What do I mean by partition? Imagine you have five people. They might all be making the same query, in which case they are all in one group. Or maybe two of them are accessing the same block, another two are accessing another common block, and the last one is accessing yet another block. So you can see that depending on the addresses they access, the requests are partitioned into equivalence classes, and the oblivious PRAM needs to hide this partition. In order to do so, we have a combinatorial argument in the paper showing that, because of this, the depth has to be Omega(log N). The details are somewhat complicated, so we won't have time to go into them.

But let me repeat the three main takeaways of this talk. First, for the upper bound: to reduce the depth across the recursion levels, the trick is that the work you do may have an offline portion and an online portion, and the offline portion can be done in parallel so that it doesn't cause the depth to deteriorate. Second, in the eviction algorithm we use divide and conquer to avoid the linear scan along the path, which reduces the depth. And third, there is the combinatorial argument for the Omega(log N) lower bound on the depth, because we need to hide how the N requests are partitioned. Those are the three main points of this paper.

Of course there's an open question, which is the true trade-off between the total work and the depth: right now there somehow seems to be a trade-off, and it would be interesting to also have a lower bound on the trade-off. All right, thank you. That's my presentation.
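To pin down what "the partition induced by the addresses" means, here is a tiny sketch (toy code, not from the paper) that computes the partition of one batch of requests; the lower bound argument says that any OPRAM whose access pattern hides this partition must pay Omega(log N) depth.

```python
def access_partition(addresses: list) -> list:
    """Partition the CPUs in one batch by the logical address they request.
    An oblivious PRAM's physical access pattern must not reveal this:
    e.g. the batches [0, 0, 0, 0, 0] (all colliding) and [0, 1, 2, 3, 4]
    (all distinct) must look identical to the adversary."""
    groups = {}
    for cpu, addr in enumerate(addresses):
        groups.setdefault(addr, []).append(cpu)
    return list(groups.values())

print(access_partition(["x", "x", "y", "y", "z"]))  # [[0, 1], [2, 3], [4]]
print(access_partition(["x"] * 5))                  # [[0, 1, 2, 3, 4]]
```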
In your work, do you distinguish between read accesses and write accesses, or do you treat them the same?

In the security definition, the adversary shouldn't be able to distinguish between read and write accesses. But in the implementation, the adversary can actually see whether the block you're accessing is read-only or written. This is not too relevant for a sequential ORAM, because if you want to just read something on the server, you can read it, re-encrypt it, and put it back. So even for read-only accesses, if there's only one CPU it's not a huge problem, because you can always re-encrypt. But if your implementation has multiple CPUs, then you have to be careful, because when they are doing a concurrent read, you cannot use the re-encryption trick. So with multiple CPUs you have to be more careful about whether the implementation is really read-only, or whether you would actually get a concurrent write. In our implementation we don't allow concurrent writes; the algorithm resolves it up front, deciding which CPU is actually going to write the physical block on the server. Thank you.

Other questions?

Nice talk. I'm wondering, in a real implementation, how many CPUs are you thinking of?

Well, to be theoretically interesting, the depth question is something like this: if you have, say, an ample supply of CPUs, what is the minimal sequential amount of work you have to do? In practice, if you have fewer CPUs, you can always simulate multiple CPUs with a single CPU, depending on how many resources you have. If it's an SGX cluster, you could have a hundred-machine SGX cluster, in which case you have a big parallel machine.

Other questions? Okay. Thank you. Let's meet again.
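To illustrate the re-encryption trick from that first answer, here is a toy sketch (my own naming, with a toy cipher rather than real encryption): a single sequential client makes reads and writes look identical by always writing back a freshly re-randomized ciphertext, and the comment notes where concurrent readers would break this.

```python
import os

def toy_encrypt(value: bytes) -> bytes:
    """Toy re-randomizing 'encryption': ciphertext = pad || (pad XOR value).
    A placeholder for a real randomized encryption scheme."""
    pad = os.urandom(len(value))
    return pad + bytes(a ^ b for a, b in zip(pad, value))

def toy_decrypt(ct: bytes) -> bytes:
    half = len(ct) // 2
    return bytes(a ^ b for a, b in zip(ct[:half], ct[half:]))

def sequential_access(server: list, addr: int, new_value: bytes = None) -> bytes:
    """Single-CPU trick: every logical access, read or write, performs one
    physical read followed by one physical write of a fresh ciphertext, so
    the adversary cannot tell reads from writes apart. With multiple CPUs
    reading the same block concurrently, all of them would race to write it
    back, which is exactly why the parallel setting needs a conflict
    resolution step up front."""
    value = toy_decrypt(server[addr])         # physical read
    if new_value is not None:                 # logical write
        value = new_value
    server[addr] = toy_encrypt(value)         # physical write either way
    return value
```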