Great, so Benedict, the floor is yours. Okay, so yeah, thanks for introducing me. And yeah, I'm happy to speak here about this project, about data availability sampling. So over the last couple of months, we tried to understand what data availability sampling is and how we can formalize it. So let's set the stage: what is the setting that we are looking at, and why do we care about it? Let's say we have some peer-to-peer network and we have a party, let's call her Alice. She holds some data, for example a block that should be posted in this peer-to-peer network because this network runs a blockchain. This block has some associated header information, and all the parties in this network download the entire block along with this header. Okay, but that's not the end of the story. We also have parties that we call clients, and these clients are restricted in terms of resources. So they are not able to download the entire block. Instead, what they do is they only download this small header, but still they participate in this protocol, so they want to verify that the block is correct in some sense, according to the rules that this protocol specifies. And for that, they rely on other parties that download the entire block to tell them when the block is not correct. And for that, of course, the other parties need to have the block, and that means the clients' first job is to verify that the block is available in the network, because otherwise these parties cannot tell them that it's a wrong or incorrect block. Okay, so the goal of the clients is to verify that the block, or the data in general, is available within the network, and for that they will make some queries. But keep in mind, they don't want to download the entire block because they don't have enough resources to do that. Okay, so let me talk about the threat model for a bit.
So the clients here want to verify some statement, namely that the data is available, and the other parties, the proposer and the network, want to convince the clients of it. Therefore, throughout the talk and in this work, we model the proposer and the network as being malicious. We don't care about any privacy, so we don't model the clients as malicious. Okay, so that's the setting. Now, of course, you could ask the question: what does it mean formally that data is available? This is a very vague term; it is not really clear what it means. And also, say we have a formal definition of that, then how can you construct a protocol? How can you implement this efficiently? Okay, so these are the two main questions that we address in this work. But before I show you some of our results, let's first familiarize ourselves with this setting. I want to do this by looking at two example constructions. Okay, so again, let's say we have our proposer, Alice, we have the network, and now we have two clients. So in general we will have more than one client, and this is important to keep in mind. So Alice has the data, and let's say this data is k chunks of equal length, and now Alice wants to send it to the network. But before she does that, she commits to this data using a Merkle tree, which will give her a Merkle root, and that will be our header. So the clients download this Merkle root and the data is sent to the network. Okay, and now, to check that the data is available, the clients will query the network: they sample an index at random, for example this index i, and then they get back from the network the i-th data chunk and an authentication path that convinces them that this chunk is consistent with the root. And here we already see why we need this Merkle root: we need it to make sure that both clients get consistent responses. This is an important aspect; we will see it all the time.
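To make this first scheme concrete, here is a minimal sketch in Python of the commit-and-sample flow just described: Alice builds a Merkle tree over k chunks, a client samples a random index and verifies the returned chunk against the root. All function names and the tree layout are my own illustrative choices, not from the talk; chunk counts are assumed to be a power of two to keep the tree code short.

```python
import hashlib
import random

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_tree(chunks):
    """Build a Merkle tree over the chunks; returns the list of levels, leaves first."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def merkle_path(levels, i):
    """Authentication path for leaf i: one sibling hash per level."""
    path = []
    for level in levels[:-1]:
        path.append(level[i ^ 1])  # sibling of the current node
        i //= 2
    return path

def verify(root, i, chunk, path):
    """Check that `chunk` at position i is consistent with the Merkle root."""
    node = h(chunk)
    for sib in path:
        node = h(node + sib) if i % 2 == 0 else h(sib + node)
        i //= 2
    return node == root

# Alice: k = 8 equal-length chunks, committed via a Merkle root (the "header").
chunks = [bytes([j]) * 32 for j in range(8)]
levels = merkle_tree(chunks)
root = levels[-1][0]

# A client samples a random index and checks the response against the root.
i = random.randrange(len(chunks))
assert verify(root, i, chunks[i], merkle_path(levels, i))
```

Because both clients verify against the same root, the network cannot hand them inconsistent chunks for the same position without being caught.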
Okay, so this is the system: if the clients get enough of these responses and they are all okay in terms of this Merkle path, then they are happy and they say the data is available. Okay, so the question is now, is this a good system or not? And yeah, let's think about this. What an adversary could do to make the data unavailable with minimal effort is to just delete one chunk of the data. Okay, but now let's check: how likely is it that the clients detect this? For one query, the probability that they hit this chunk is one over k, so that's very small. In particular, it means that to detect this with high or overwhelming probability, they need to make a lot of queries, in fact as many queries as they would need to download the entire data. So this is really not a good idea. It's a first try, but it's not really a good scheme. Okay, and let's think about where this problem comes from. Well, it comes from the fact that we sent the data to the network without adding any redundancy. In that way, if we take the data that is stored in the network and delete just one symbol, then the original data is not available anymore. So what we will have to do is add redundancy. Okay, so let's give it a second try. Let's add redundancy using a code. Again, say Alice has these k chunks of data, and now she treats them as the coefficients of a polynomial of degree k minus one over a finite field. And then she sends to the network not the data itself, but instead evaluations of the polynomial, let's say 2k of these evaluations. In other words, what she does is encode the data using a Reed-Solomon code. Okay, and we saw before that we need some form of commitment to make sure the clients get consistent responses. Here we deal with polynomials, so it's a natural idea to use a polynomial commitment scheme such as the KZG commitment.
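The Reed-Solomon encoding step can be sketched in a few lines: treat the k data chunks as coefficients of a degree-(k-1) polynomial over a finite field and publish 2k evaluations. The field size, evaluation points, and function names here are toy choices for illustration; a real deployment would use a large field and domain-specific evaluation points.

```python
# Reed-Solomon encoding as described in the talk: k chunks become the
# coefficients of a degree-(k-1) polynomial; the network stores 2k evaluations.
P = 2**13 - 1  # a small Mersenne prime, so F_P is a toy finite field

def poly_eval(coeffs, x, p=P):
    """Horner evaluation of the polynomial with the given coefficients at x."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def rs_encode(data, n, p=P):
    """Evaluate the degree-(len(data)-1) polynomial at the points 0..n-1."""
    return [poly_eval(data, x, p) for x in range(n)]

k = 4
data = [5, 17, 42, 99]             # k field elements (the data chunks)
codeword = rs_encode(data, 2 * k)  # 2k evaluations stored in the network
```

Any k of these 2k evaluations determine the polynomial, which is exactly the redundancy the first scheme was missing.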
So the clients will get the KZG commitment, which is just one group element. And then, along with the Reed-Solomon encoding, Alice stores all the opening proofs of the KZG commitment in the network. When clients query random positions, they get back the evaluation along with the opening proof. And if this all verifies for enough queries, then they are happy. Okay, so that's our second try. Now let's think about this scheme. Is this a good way of checking data availability? Well, if you look at this encoding, because of the redundancy, what an adversary would have to do to remove information about the data is to delete more than half of the encoding, right? And if more than half of the encoding is deleted and you make, say, a hundred queries, then you have a very small probability of not noticing this. So this is a good scheme, at least intuitively, but it's not really clear whether this holds from a formal perspective. So we ask the question: is this secure? And to answer this question, we first need a security definition, right? I just sketched an adversary, but it's not really clear what this adversary is allowed to do. And yeah, that's why we need a formal definition. And then, once we have a formal definition, we may use it to quantify how much better this scheme is compared to the first scheme. So how much do we gain from using these codes? Okay, so that's the setting and two examples. I think we have a feeling now of what data availability sampling should be intuitively. Now let's look at our results. I want to split the rest of the talk, where I present the results, into three parts. First I will show you our formal definition of data availability sampling as a cryptographic primitive. Then we will look at the constructions, and for that we developed a framework that I will first explain, and then we will look at specific constructions. Okay, so let's look at the definition now. How can we define this as a cryptographic primitive?
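The gap between the two schemes can be made concrete with a quick back-of-the-envelope calculation. The parameters below (k = 1024, q = 100 queries) are illustrative numbers of my own, not from the talk; the model is a client making q uniform queries with replacement.

```python
# Probability that all q uniform queries (with replacement) miss the deleted
# part of what is stored in the network, i.e. the attack goes unnoticed.
def miss_prob(deleted_fraction: float, q: int) -> float:
    return (1.0 - deleted_fraction) ** q

k, q = 1024, 100

# First try (no redundancy): deleting a single chunk out of k already destroys
# the data, and the attack is almost never caught.
p_first = miss_prob(1 / k, q)   # about 0.91

# Second try (rate-1/2 Reed-Solomon): the adversary must delete more than half
# of the 2k symbols, and q queries miss that with negligible probability.
p_second = miss_prob(1 / 2, q)  # 2^-100
```

So with redundancy, a hundred queries suffice where the first scheme would need on the order of k queries, which is as expensive as downloading the data.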
Yeah, and the first thing you should do when you define a cryptographic primitive is think about the parties that are running this protocol and the algorithms that they run. So you specify the parties and the algorithms. Okay, so we saw this before. We have our proposer. The proposer takes some arbitrary data and encodes it using some algorithm. This outputs a commitment, sort of the header that the clients will download, and also an encoding, and this is what is stored in the network. And here you see one aspect of our definition: we don't model this network as a set of many parties. Instead we say there is this encoding, it is somehow stored within the network, and the clients can query it. Okay, so what do the clients look like? The clients are basically just algorithms that can query this encoding. So they can say, please give me the i-th symbol of this encoding, and then they get back the i-th symbol. And we split these clients into two algorithms, V1 and V2; V stands for verifier, because they are kind of a verifier here. The first part just does all these queries and then outputs a transcript, and this transcript is then verified deterministically by the second part. The reason why we split it explicitly into these two parts is to make sure we can talk about accepting transcripts. Okay, so that's what we have seen so far. I think nothing really special happened, but now the question was really: what is data availability? How can we say data is available? We formalize this via an extractor. This is an algorithm that takes as input some set of transcripts and also the commitment, and it extracts the data. And so from now on, we will always say data is available if this algorithm can extract it. So availability means the extractor can extract it. Okay, and this is not only some concept that we have in the definition; you can think of this as being run by any node that collects enough transcripts.
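The syntax just described (encode, the two client algorithms V1 and V2, and the extractor) can be written down as an interface sketch. This is my own hypothetical rendering of the talk's syntax in code, with assumed type choices; it is not the paper's formal definition.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

@dataclass
class Transcript:
    """Output of V1: the queried positions and the answers received."""
    queries: List[int]
    answers: List[Any]

class DAS(ABC):
    """Syntax of a data availability sampling scheme (illustrative interface)."""

    @abstractmethod
    def encode(self, data: bytes) -> Tuple[bytes, List[Any]]:
        """Proposer: data -> (commitment, encoding). The commitment is the header."""

    @abstractmethod
    def v1(self, commitment: bytes, oracle: Callable[[int], Any]) -> Transcript:
        """Client, part 1: query the (possibly adversarial) encoding, output a transcript."""

    @abstractmethod
    def v2(self, commitment: bytes, transcript: Transcript) -> bool:
        """Client, part 2: deterministically accept or reject the transcript."""

    @abstractmethod
    def extract(self, commitment: bytes, transcripts: List[Transcript]) -> Optional[bytes]:
        """Extractor: recover the data from enough transcripts, or fail (None)."""
```

Splitting the client into `v1` and `v2` mirrors the talk's point: it lets us speak about accepting transcripts as standalone objects that can be forwarded to other nodes.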
So whenever these clients finish their interaction, they can just send their transcripts to everyone they know or everyone they want to, and then if you collect enough transcripts, you run this extractor and get back the data. So the data is still available. Okay, so that's the syntax, and now let's think about the properties that we want. What kind of properties should these algorithms satisfy? The first thing you always want is some form of correctness or completeness definition, which just says: if everyone is honest and behaves as expected, following this protocol, then what you get is what you expect, right? And in our case, this would be: if I take the data and encode it and run these clients, then all clients should accept, because the data is available, and also what I extract in the end should be the same as what I input in the beginning. So this is just completeness, but of course we cannot hope that everyone is honest, right? So we need some security notions, and I already mentioned before that we assume the proposer and the network are dishonest. They want to convince these clients, or verifiers, that some statement is true. Whenever you have something like this, it's kind of a proof system, right? You want some form of soundness telling you that if these clients accept, then the statement is really true. And in our case, this means data is available. So what we want is: in this setting where the encoding and the commitment are controlled by the adversary, if the clients accept, then the data is available. Okay, so what does it mean that data is available? Well, we can extract it. But now if we look at this picture, we see the encoding is controlled by the adversary, and the commitment as well. So there's no original data, okay? That's the first challenge. And let me just explain what it means that the adversary controls this encoding. Well, it can adaptively answer the queries of the clients.
So a client says, please give me position i, and then this adversary sees: okay, there's the first client, he wants position i, so I will answer with something, right? It can decide adaptively what to respond. Okay, and now our definition of soundness says: in this setting where the encoding and the commitment are adversarial, if enough of these clients output one, then we extract something. So some data is available. Okay, and I want to emphasize two things here. The first is this word "enough", right? Typically in a proof system you would require that if one verifier accepts, then the statement is true, so data should be available, but we cannot really guarantee this for any reasonable system. So what we require is that enough clients accept, and let me explain this. Let's say you have an adversary, and this adversary targets the first client, in the sense that it takes some data, encodes it honestly, and responds honestly to the first client. Then by completeness this client should accept. And now, for all the other clients, the adversary just gives garbage or random things or nothing as responses. Then, of course, all the information about the data that can be used in extraction comes from the first transcript. So if the first client makes only a few queries, then you cannot hope to extract. Okay. Excuse me? Is it okay to ask questions now, or should we defer them? Oh yeah, you can ask one question now, sure. Yeah, I have a kind of notational question. You mentioned in the beginning that some data is stored in the network, but I can't fully understand the difference between the network and the proposer. You say that the adversary can answer queries differently, and this actually means that nothing is stored in the network. So the adversary just has some storage and answers arbitrarily, or by some algorithm. So which is it? Yes, so, yeah, good question.
So in the honest case, we have this party that comes up with the data, or somehow has this original data, and encodes it and stores it in the network. In the dishonest case, here in the soundness notion, we don't want to trust the network, right? Because the network is, in the end, the thing that wants to convince us that the data is available. So we should model it as the adversary. And we will see in the constructions that answering inconsistently is not really possible if we do it the right way. So in this model there's no network anymore; there's just an adversary and... Yeah, so there's an implicit network, and the adversary controls whatever this implicit network is doing. Maybe a small add-on: if you consider an adversarial setting, then the setting that Benedict is describing right now is very powerful from the perspective of the adversary. So if we can construct schemes that are secure according to this soundness definition, then they will certainly also be secure in a weaker model where you do have a network and you put some trust into the network. So schemes that are secure according to this will also be secure according to weaker security definitions. Yes, yeah. So that's a very strong notion. Okay, so I hope this somehow answers the question. Yeah, so where was I? Okay, yeah. So the first thing is, we have to assume that there are enough clients, just because of this adaptivity, and how many clients we need is a parameter of the scheme; we will come back to that later. And then, I said we extract something. I don't say anything about what we extract. So you could just output some default data and satisfy this notion. That tells us that this notion alone is not enough, right? It is not enough because we don't say anything about what we extract. And for that we have this notion of consistency.
So again, as I said before, these clients take their transcripts and send them to some other parties, and whenever you receive enough transcripts, you can extract the original data. Now let's say you have two nodes that do this, right? So you run this extraction on two nodes, for the same commitment, with two sets of transcripts. And now, of course, what you want is that they extract the same data, right? You cannot say that they extract some specific original data, because the adversary controls the way this commitment is created, so there is no original data. But what you want is that they really extract the same data, and this is consistency. But now if we look at this picture, this is still a very specific setting, right? There are two disjoint sets of clients, and you try to extract from their transcripts. What we really want is a more powerful notion where we don't care where the transcripts are coming from, because if I try to extract, then I don't really know who made up these transcripts. So in our consistency notion, we give the adversary the power to come up with these sets of transcripts. The adversary comes up with one commitment but two sets of transcripts, and they do not have to be disjoint. Then I try to extract from both, right? I get two candidate data, and my requirement is that they should be the same. So the commitment somehow binds you to the data. And of course, I can only require this if the extractions do not fail. And now we also see that outputting some default data is not a good scheme. It will not satisfy this notion, because the adversary can just be honest in the first set, for some random data that he encodes, and force the second extraction to output the default data, and then this will not work. So this consistency tells us something about what we extract. Okay, so this is our basic definition. Let me summarize it.
We have this completeness notion, which is very natural: if everyone is honest, then all clients accept and we extract the original data. We have a soundness notion telling us that if enough clients accept, we extract something, so some data is available. And then we have a consistency notion, which says something about what we extract, namely that we never extract two different things. So if you get a commitment and some other party gets the same commitment, you are sure that you also get the same data. That's our basic definition. And now we thought about extensions. How can we extend this definition with more functionality that may be useful in practice? The first extension that we looked at is repairability, and here the idea is: what if your encoding is broken and you want to repair it? You want to go back to a stable state where you have something that is close to the original encoding and works as expected, but you don't want to change the commitment. You don't want to redistribute all commitments. You want to do this repairing transparently, so you need some way to repair the encoding such that it still works with the old commitment. And then we have a second extension, which is local accessibility. Here the idea is: what if you are not interested in the entire data, but instead in parts of it, for example one symbol? Of course, you could just wait until you have enough transcripts and reconstruct the entire data, but that sounds wasteful, right? So what if you can just make one query to the encoding and then get the symbol of the data that you want? We define syntax and also security notions for these two extensions, but I don't want to go into detail here, because I want to have more time for the constructions. So yeah, that's it for the definition side. So maybe now is a good point if anyone has another question about the definition. Okay, so if not, then let me continue.
Maybe just one quick question. Can you go back one slide? I have one question about what you mean by repairability, because this way of modeling the network is a bit strange, just to make sure I understand it correctly. So does repairing mean that some of the queries are not answered at all, or what is the setting? Yeah, let's say you have a setting where not all clients accept; then you know that there's something wrong with the encoding, someone messed with the encoding. Now we want to return to a state... Sorry, just to go back: in the way you model this, you only have some index i and then your answer is pi_i. So I'm just not sure what you mean by "encoding" here. So the encoding is pi. Oh, so pi is pi_1, pi_2, pi_3? Yes, it's just a string of symbols. Yeah, but that's a bit weird, because you don't really know that every client gets the same pi_i as an answer, so I don't even know what the encoding would mean. That's kind of what's confusing me here. Yeah, so you're right: in the malicious case, I cannot know that, and that's something my construction has to take care of. So my construction will take care of making sure that the clients get consistent responses. Right. And when I say the encoding is broken, what I mean is just that I see that some clients don't accept, so I know something is wrong. And, okay, you're right, there's no explicit encoding anymore, because now I'm in an adversarial setting, but I see that something is wrong. And then I want to go back to a state where everything is fine, without changing the commitment. Yeah, I would probably really like to see the definition, but yeah. We can look at the definition later.
You see the problem; it's not completely trivial how this would look. That's why I'm asking. Yeah, it also took us some time to define it, so you're right, this is a bit tricky. Okay, so any other questions about the definition? Okay, good. So now let's look at constructions. We have this definitional framework, and now we looked at the existing constructions, and we also thought about new ones. What we saw is that they all follow a sort of similar framework. So we abstracted this and came up with a construction framework, and it allows us to separate the combinatorial parts of the constructions from the cryptographic parts. And because all of the constructions follow this framework, I want to explain the framework first. You can summarize it as data availability sampling from erasure codes. So we want to construct a data availability sampling scheme, as I defined a few minutes ago. For that, we will use three components. The first is a so-called erasure code. Then we have a commitment for it; this is a new cryptographic primitive we define, an erasure code commitment. You can think of it as a generalization of a polynomial commitment scheme. And then we have a third component, a so-called index sampler. So this is our framework, and I want to explain how it works. But we have already seen one example of this framework in the beginning, when we looked at the Reed-Solomon-code-based construction. There, the erasure code was the Reed-Solomon code, the erasure code commitment for it was the KZG commitment, and the clients sampled their indices uniformly with replacement. So this is one example. And now let me give more details. The first thing we want to look at is erasure codes. For that, I need to recall some coding theory, just to get everyone on the same page. So what is a code? A code is just a mapping, taking some element of Gamma^k and mapping it to Lambda^n.
So what does this look like? You take some message, which is k symbols over Gamma, and you map it to a codeword, which is n symbols over Lambda. So you add some redundancy here. What you can think of is: you have this small set Gamma^k, and you injectively map it into a large space Lambda^n. And this is your code. I should say that I typically refer to the code as the mapping, but at some points I will also refer to the code as the image of the mapping, so a subset of Lambda^n; we have to be a bit flexible here. Okay, so what is an example of a code? One very important class is the class of linear codes, where your alphabet is just a finite field, and you map a vector x to its codeword by applying some matrix to it. So you take your vector x, multiply it with a matrix, and you get your codeword. An alternative way to look at a linear code is via the so-called parity check matrix. This is another matrix that helps you identify codewords: a vector is in the code if its product with the matrix is zero. So think of it like this: you have this long codeword, you multiply it with the matrix, and if it's in the code, then it gives you zero. So you can check membership in the code by a linear check, and we will use this later. Okay, so one example of such a linear code is the Reed-Solomon code. Now what are erasure codes? Erasure codes are, again, just any code, but with a specific way of looking at it: we look at it from the perspective of erasures. What we want is to tolerate erasures. What does that mean? Well, we have our message, encoded with the code, and now some of the symbols are lost. This is already close to the setting that we talked about before. And you still want to be able to recover the message, as long as you have enough of these symbols. Okay, so that's an erasure code. And it has one very important parameter, which is called the reception efficiency.
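As a tiny worked example of a linear code with its generator and parity check matrix, here is the [4,3] single-parity-check code over GF(2) in pure Python. The code choice and all names are illustrative; the point is just the two views: encoding via a matrix, and membership testing via a linear check.

```python
# Toy linear code over GF(2): the [n=4, k=3] single-parity-check code.
# Encoding applies the generator matrix G = [I_3 | 1]; the parity check
# matrix H = [1 1 1 1] identifies codewords via H * c = 0 (mod 2).

# Columns of G: the first three copy the message bits, the last is their parity.
G_cols = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]

def encode(x):
    """Map a 3-bit message x to its 4-bit codeword x * G over GF(2)."""
    return [sum(g * b for g, b in zip(col, x)) % 2 for col in G_cols]

H = [[1, 1, 1, 1]]

def is_codeword(c):
    """Membership test: c is in the code iff H * c = 0 over GF(2)."""
    return [sum(m * b for m, b in zip(row, c)) % 2 for row in H] == [0]

c = encode([1, 0, 1])   # -> [1, 0, 1, 0]
assert is_codeword(c)
assert not is_codeword([1, 0, 1, 1])  # flipped parity bit is caught
```

The same two-matrix picture scales up to Reed-Solomon and the other codes discussed later; only the field and the matrices change.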
It tells you how many symbols I need to reconstruct the message. So whenever I have t symbols, and it doesn't matter which ones, I can recover the message. Okay, and of course, in the case where the alphabets are the same, the best you can hope for is reception efficiency k. So you can reconstruct whenever you have k arbitrary symbols; you cannot hope to reconstruct from k minus 1, just based on information theory. Yeah, and one example of such an ideal erasure code is the Reed-Solomon code: whenever I have k points of my degree-(k minus 1) polynomial, I can interpolate the polynomial. Okay, so now we know what an erasure code is. Now let's look at this new commitment scheme that we define, the erasure code commitment scheme. As I said, this is a generalization of a polynomial commitment scheme; you can think of a polynomial commitment scheme as being an erasure code commitment scheme for the Reed-Solomon code. So what does that look like? Again, we have an arbitrary code. And now let's say we have Alice, and she holds a codeword C, which is C(x) for some message x. So this could be the evaluations of a polynomial. Now Alice wants to commit to this codeword, and Bob wants to query positions of it. So Bob will query a position and get back the symbol of the codeword along with an opening proof tau_i, which, of course, he can then verify. So this is just as you would expect, as in a polynomial commitment scheme: you commit to a polynomial, and then you get back evaluations along with opening proofs. Okay, so for this kind of commitment, of course, we need some security notions. For a commitment scheme, you typically consider hiding and binding. We are not caring about privacy, so we only consider binding notions here, and there are two that we consider.
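The claim that Reed-Solomon has ideal reception efficiency k can be demonstrated directly: erase any half of a rate-1/2 codeword and recover everything from the k survivors by Lagrange interpolation. Field size, evaluation points, and which symbols survive are illustrative choices of mine.

```python
# Reception efficiency of Reed-Solomon: any k of the n evaluations determine
# the degree-(k-1) polynomial, so every erased symbol can be recomputed.
P = 2**13 - 1  # toy prime field

def poly_eval(coeffs, x, p=P):
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def lagrange_eval(points, x, p=P):
    """Evaluate the unique polynomial of degree < len(points) through `points` at x."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % p
                den = den * (xj - xm) % p
        # pow(den, p-2, p) is the modular inverse of den (Fermat's little theorem)
        total = (total + yj * num * pow(den, p - 2, p)) % p
    return total

# Encode k = 4 message symbols as a degree-3 polynomial evaluated at 0..7.
data = [5, 17, 42, 99]
codeword = [poly_eval(data, x) for x in range(8)]

# Erase four symbols; the remaining k = 4 suffice to recover all of them.
survivors = [(x, codeword[x]) for x in (1, 3, 4, 6)]
recovered = [lagrange_eval(survivors, x) for x in range(8)]
assert recovered == codeword
```

Any other choice of four surviving positions works the same way, which is exactly the "it doesn't matter which ones" part of reception efficiency.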
When we came up with these, on the one hand you want to be consistent with the notions that already exist for polynomial commitment schemes; on the other hand, the notions should be strong enough to suffice for constructing data availability sampling. Okay, and these are the two notions we came up with. The first is position binding. You also have that in polynomial commitment schemes: the adversary outputs a commitment and then tries to open this commitment at one position i to two different symbols. So that looks something like this: the adversary tries to convince you that there's this codeword that he committed to, but at one position he opens it to two different things. And this should not be possible. So the adversary wins when all openings are valid and the two symbols are different. This is position binding, and it is something you can easily achieve just using a Merkle tree. So this alone cannot be what we want; instead, we need a different notion. Yes? Sorry, a question, yeah. Why have you decided to commit to the codewords? Because I think in the very first slides, when you sketched a possible solution, you seemed to commit to the data rather than to the codewords. At which point did you decide to commit to the codeword? Yeah, so you can also phrase this as committing to the data but opening positions of the codeword. So you input the data. This is sort of equivalent, I would say, because the code is injective: when you commit to the codeword and it's really a codeword, then it implicitly commits you to the data. But what we will do, and we will see this in a few slides, is take the data and input it to the commitment. What is important later is that you want to open positions of the codeword; for the construction, this is important. Yeah, I understand.
When you say we now work with codewords and we decided to commit to codewords, this looks a little bit restrictive, because in general, of course, you don't have to encode with a linear code; you can answer queries that are just functions of the data, and as long as you can recover the data from the function outputs, you should be fine. Okay, but we will see that you can write most of the constructions in this way. So I claim this is not really a loss of generality, especially for the constructions that we consider. And if you have other constructions, sure, you can construct them in a different way. But this helps us avoid redoing certain proof steps for all the constructions. Maybe to clarify: the goal here is not to come up with a definition that covers all possible imaginable constructions; rather, it is to come up with an abstraction that covers a large part of the constructions. And then you show that if you have such a commitment scheme, you automatically get a data availability sampling scheme. And now you just need to focus on building the specific commitment scheme rather than building the more complex primitive. It is not meant to cover all imaginable constructions of data availability sampling. Yes, but it does cover a lot; that's why we're looking at this. Okay, so yeah, good question. So now we have this position binding, right? But as I already said, you can easily satisfy this: just take your data, compute the codeword, and compute a Merkle tree. So this cannot be the end of the story. Instead, we need a notion that relates to the code. And this is code binding. Here we have, again, Alice; she outputs a commitment, same as before. But now what she outputs is a set of openings, so a subset of this codeword. And Alice wins if these are not consistent with the code, right? So there's no codeword that is consistent with these openings. Okay, what does that mean?
It means that whenever Alice outputs openings, then if the scheme is code binding, we know that all the openings are consistent with the code. Okay, so for example, if you have the Reed-Solomon code and you're talking about polynomials, then it cannot happen that Alice outputs k plus 1 points that do not lie on a degree-(k minus 1) polynomial. And this is a notion that the original KZG polynomial commitment paper actually does not have. So this is a notion that will help us, and we will see that these two notions together are sufficient to construct data availability sampling. Okay, so now we know what an erasure code is, and we know what an erasure code commitment scheme for some code is. I don't want to go into the details about the index sampler; this is a very interesting combinatorial object, and we studied it for a while. But let me instead just show you how we construct data availability sampling from these components. We start with the data. The first thing we do is use our erasure code to compute a codeword, and this will be the first part of the encoding that we will store in the network. And now, of course, we saw in the beginning that we needed this Merkle root to ensure consistency, and here we use our erasure code commitment scheme for that. The erasure code commitment scheme will give us openings for these codeword symbols, and we will group together each codeword symbol with its corresponding opening. So one symbol of our encoding for the data availability sampling scheme is a tuple which contains the codeword symbol and its opening. Okay, so now this is the encoding. Let's now look at the clients. The clients get the commitment from the erasure code commitment scheme, and now they should query this encoding to verify that data is available. So what do they do? They first run this index sampler, and now we know what this is.
This is just an algorithm that outputs Q positions. And now they query this encoding at the positions that this index sampler outputs. So this index sampler is just specifying how we sample the positions. And now I query these and I verify all the openings. From every query, I get a code word symbol and the opening. I can verify it with respect to the commitment. If all the checks pass, then I accept and think the data is available. If some check fails, then I reject. OK, so this is the construction. And when we analyze this, we want to use, of course, the properties of the erasure code commitment scheme and of the code. So when we analyze this, we have to think about three properties: completeness, soundness, and consistency. So this is a summary. For completeness and soundness, we will just use the combinatorial properties of these objects. We will not use any computational assumption or anything like that. Completeness and soundness just follow from the reception efficiency and some measure that we define for this index sampler, which we call the quality. And if we combine them, then this will tell you how many clients need to accept such that you can reconstruct the data. So remember, in the definition of soundness, I told you we need enough clients to accept. And how many we need is given by the reception efficiency and this quality of the index sampler. So this intuitively makes sense. If you need to collect a lot of symbols of this code word to be able to reconstruct the data, then you also need more queries from these clients, or more clients in general, to reconstruct the data. But the third notion is consistency. And this one is computational. So we show it based on computational assumptions: the position binding and code binding of this erasure code commitment scheme. So the intuition is as follows. Let's say you have two clients and they query the same position.
Then by position binding of the erasure code commitment scheme, they get consistent responses. And then if you look at all the responses together, then by code binding, you know that they are consistent with some code word. And that means whatever subset of that you take, as long as it's large enough, will give you the same message. So that means it gives you the same data and you have consistency. So this is just a high-level sketch of consistency. OK, so that's our framework. It tells us how to take an erasure code and an erasure code commitment scheme and to construct data availability sampling. So as Mark said, we can now focus just on constructing erasure code commitments for erasure codes. And this is the third part. Let's now look at specific constructions. OK, so here's an overview of the constructions that we looked at in this project. The first one, which we have already seen a few times now, is: take the Reed-Solomon code and any polynomial commitment scheme, for example the KZG commitment scheme, and by the construction that I just showed you, we can use this to construct a data availability sampling scheme. And this already sort of proves that this construction is secure, just by instantiating our framework. OK, then we have a very generic construction where you just take an arbitrary code and you commit to the code word using a vector commitment. And then to get code binding, you use a SNARK on top of it. So this is generic, but it's also very inefficient, because the SNARK is computationally expensive if you, for example, use a Merkle tree as the vector commitment. And I look at this more as an educational example of the general strategy that you want to follow when you construct these commitments. OK, and then there's the Ethereum construction. So this is the construction that is currently used by Ethereum, and we can actually parse this and view it as an instantiation of our framework.
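As an illustration of the vector-commitment half of this generic construction, here is a minimal Merkle tree sketch (assuming SHA-256 as the hash and a power-of-two number of chunks). On its own this only gives position binding; the SNARK layer that would add code binding is omitted here.

```python
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def merkle_commit(chunks):
    """Merkle tree over the chunks; returns (root, all_levels).
    len(chunks) is assumed to be a power of two for simplicity."""
    level = [H(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return level[0], levels

def merkle_open(levels, i):
    """Authentication path for chunk i: one sibling hash per level."""
    path = []
    for level in levels[:-1]:
        path.append(level[i ^ 1])  # sibling of the current node
        i //= 2
    return path

def merkle_verify(root, i, chunk, path):
    """Recompute the root from the chunk and its authentication path."""
    node = H(chunk)
    for sibling in path:
        node = H(node + sibling) if i % 2 == 0 else H(sibling + node)
        i //= 2
    return node == root
```

This is the same mechanism as the Merkle-root example from the beginning of the talk: both clients checking against the same root get consistent per-position responses.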
And for that, we look at the so-called tensor code and do some row-wise commitment. We will see that in a minute. OK, and then, motivated by the fact that these constructions rely on a trusted setup, at least the Ethereum one and the Reed-Solomon one, we ask the question: can you construct this without a trusted setup and ideally from minimal assumptions, such as just using hash functions? And we came up with two constructions, one based on hash functions, one based on homomorphic hash functions, and I will show you one of them later. OK, but now let's look at this tensor code construction, so the construction used by Ethereum right now. And what we should really do is first start with an erasure code. That's the first thing I will explain to you. And then for that code, we construct an erasure code commitment scheme. OK, so let's start with the code. We take our data and we arrange it in a square of size k times k. And now the first thing we do is we assume that we have some underlying base code, for example, the Reed-Solomon code again. And we use this code to extend the data along the rows. So we take every row and apply the code to it and then we get this rectangle. And now the second thing we do is we extend it along the columns. So we apply the code to every column. And you can write this very concisely in this form. So you take your data, and now encoding row-wise means multiplying by G transpose, and encoding column-wise means multiplying by G. And this is the generator matrix that we talked about before. So now if you look at this code, and let's say we use a Reed-Solomon code as the base code, then every row and every column is the evaluations of a polynomial. Every row and every column should be a code word. That's our code. Before I show you how to commit to such a code word using the erasure code commitment scheme, let's first think about the properties of this code.
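The row-then-column extension above can be sketched in a few lines over a toy prime field, using a Vandermonde Reed-Solomon generator matrix; all parameters and helper names are illustrative.

```python
P = 97          # toy prime field
k, n = 2, 4     # k x k data, n x n encoding (rate 1/2 Reed-Solomon both ways)

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) % P
             for j in range(len(B[0]))] for i in range(len(A))]

# Vandermonde generator matrix of an [n, k] Reed-Solomon code with
# evaluation points 1..n: (G . msg)[a] = message polynomial at a+1.
G = [[pow(a + 1, j, P) for j in range(k)] for a in range(n)]

def tensor_encode(D):
    """Extend the k x k data D along rows and then columns: C = G . D . G^T."""
    GT = [list(col) for col in zip(*G)]        # k x n
    return matmul(G, matmul(D, GT))            # n x n

def is_rs_codeword(word):
    """Check that `word` consists of evaluations at 1..n of a degree < k
    polynomial, by interpolating the first k points and testing the rest."""
    def lagrange_at(pts, x):
        total = 0
        for i, (xi, yi) in enumerate(pts):
            num = den = 1
            for j, (xj, _) in enumerate(pts):
                if i != j:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, P - 2, P)) % P
        return total
    base = [(a + 1, word[a]) for a in range(k)]
    return all(lagrange_at(base, a + 1) == word[a] for a in range(k, n))
```

After encoding, every row and every column of the n x n square passes `is_rs_codeword`, matching the claim that every row and column is a code word of the base code.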
In particular, about the reception efficiency of this code. So how many symbols do we need to be able to reconstruct the data? And why do we care about this? Again, this affects the soundness and completeness parameters that we get. And you can sort of see that the worst case is something like this. We have all the highlighted cells, all the highlighted symbols of this code word, but you are not able to interpolate any of these rows or columns. And so it's about three quarters of the data that you need, to be sure that you can reconstruct. This is the reception efficiency. And now let's look at the commitment. So I said every row is in the base code that we use. And now let's assume we have an erasure code commitment scheme for this base code. So if the base code is the Reed-Solomon code, let's assume this is the KZG commitment scheme. And now, because every row is already a code word, why not just commit to every row? That will give us a set of n commitments. And the first idea that we could have is to just group them together. That's our commitment. And now we should think about these two binding notions, position binding and code binding. To understand that, let's first see what an opening looks like. Let's say I want to open this symbol of the code word. Then this is, of course, associated to some row commitment. And what I do is I just open it with respect to this commitment. That's how you open it. Now let's think about position binding and code binding. Position binding is easy, because every position, as you see here, is associated to a row commitment. So if you break position binding of this new construction, you break it for this specific row commitment. And therefore, the position binding here follows just from the position binding of the underlying scheme, for example, KZG. OK, now let's think about code binding.
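The worst-case pattern behind the "about three quarters" figure mentioned above can be made concrete: for a rate-1/2 base code (n = 2k), withholding a (k+1) x (k+1) block leaves every affected row and column one symbol short of interpolation, so an iterative row/column decoder makes no progress. Toy parameters, purely illustrative:

```python
k = 4
n = 2 * k

# Adversarial worst case: withhold a (k+1) x (k+1) block of the n x n
# extended square. Every affected row and column then has only k - 1
# available symbols, one too few to interpolate.
available = [[not (a <= k and b <= k) for b in range(n)] for a in range(n)]

def stuck_lines(avail):
    """Rows and columns with fewer than k available symbols, i.e. lines
    that no row-wise or column-wise decoder can complete."""
    rows = [a for a in range(n) if sum(avail[a]) < k]
    cols = [b for b in range(n) if sum(avail[a][b] for a in range(n)) < k]
    return rows, cols
```

So having n^2 - (n - k + 1)^2 symbols can still be insufficient, and for n = 2k the guaranteed-reconstruction threshold approaches three quarters of the extended square as k grows.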
So for code binding, I want to be sure that whenever the adversary opens a set of positions, this is consistent with the code. What does that mean? It means that every row and every column is consistent with the base code. So whatever the adversary opens, I know that every row is in the code and every column is also in the code. OK, what if a row is not in the code? Well, then, because I commit to rows, I break the code binding of the underlying construction. So this half of code binding is clear, because I commit to rows. Now, the challenge is: what if some column is not in the code? And in particular, if you just do it like this with no additional change, the adversary can form all these rows independently. There's nothing that ties these rows together, and therefore this cannot be code binding yet. But we can make it code binding with one additional check. We will check that each column is in the code by checking it over the commitments. And for that, we will use the fact that the commitments, for example KZG commitments, are homomorphic. So we will do a homomorphic check over all these commitments together that checks that all the columns are also in the code. And here, we will use the fact that checking membership in a code is a linear function: we use the parity check matrix. OK, and then with some care, you can analyze this and you can show that it's actually code binding. OK, but what I want to show you next is not this proof in detail; I think I sketched parts of it now. I want to show you our new construction. And again, the motivation was that here you rely on a trusted setup and expensive public key operations. So the question was really: can we do it just from hash functions? And this is the construction that I want to show you. Yes, question. Sorry, and something like FRI, which looks like Reed-Solomon plus a Merkle tree, wouldn't give you that?
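The linear column check can be sketched with a toy stand-in for the homomorphic commitment: a bare linear map, which has the linearity this check needs but is of course neither hiding nor binding, unlike KZG. The parity-check matrix is hardcoded for a toy [4, 2] Reed-Solomon column code; everything here is illustrative.

```python
P = 97
k, n = 2, 4
s = 5  # public point of the toy linear "commitment" (illustration only)

def com(row):
    """Toy additively homomorphic commitment to a row: a linear map of
    the row. A stand-in for a homomorphic scheme like KZG."""
    return sum(v * pow(s, b, P) for b, v in enumerate(row)) % P

# Parity-check matrix of the [4, 2] Reed-Solomon column code at points
# 1..4: a word (y1, y2, y3, y4) is a codeword iff H . y = 0, i.e.
# y3 = 2*y2 - y1 and y4 = 3*y2 - 2*y1 (evaluations of a line).
H = [[1, -2, 1, 0], [2, -3, 0, 1]]

def columns_in_code(row_commitments):
    """Apply the parity check to the row commitments. By linearity of
    `com`, this passes iff every column of the committed matrix is
    consistent with a codeword of the column code."""
    return all(sum(h * c for h, c in zip(hrow, row_commitments)) % P == 0
               for hrow in H)
```

The point of the design choice: the verifier never sees full columns, yet the parity check, pushed through the homomorphism, ties the independently committed rows together.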
So we are currently actually working on whether FRI gives you that. We are pretty convinced, but the proof is not clear. So this is a good question. I mean, you could always use FRI as a polynomial commitment scheme, but that would be inefficient, computing all of these openings. So what we're looking at is whether we can avoid that and just use the proximity test of FRI. OK. In addition to some consistency checks for particular opened points. So basically not use FRI as a black box, but rather apply FRI directly to the application and avoid the computational overhead that we would have otherwise, looking at the security proof of FRI. Yes, so that's what we're currently working on. OK, so yeah, but let me show you this construction. It is for so-called interleaved codes, and we only use hash functions for it. So what is an interleaved code? Again, I want to start with the code, and then I want to show you how to commit to a code word. So as for the tensor construction, we start with our data arranged in a k by k matrix. And as for the tensor construction, we take some base code and extend all the rows using this base code. But this is where the similarity stops. We won't extend it along the columns. Instead, we group together the columns as symbols. So our new code will have n symbols, and each symbol of the code word is an entire column of this matrix. Again, you can write this in matrix form. And that also means if we look at this code, then the reception efficiency is the same as for the underlying code, but with the caveat that the symbols are now larger. So if you have k symbols of this new code word, you have these symbols in every row, so you can reconstruct every row. But again, the symbols are larger. And that means whenever a client queries a symbol, it will get an entire column.
OK, so now the first thing when we look at this and we want to construct an erasure code commitment is we should ensure that it's position binding. And here, let's say we do the very naive thing: we just commit to all of the symbols using a hash function. And position binding just follows from the collision resistance of this hash function. But now the tricky part is code binding. So how can we ensure that this is code binding? Well, we want to check that every row that you open is consistent with the base code. Of course, I cannot check that for every row, because I don't have the entire rows. But what I can do is compress these rows into one row using a random linear combination. So I sample some random vector r and I take the first row times r1, the second row times r2, and so on, sum that up, and I get one compressed row, w. And of course, if all of these rows are in the code, then w is also in the code. This is just because this code is linear. And the hope is that if one of the rows is not in the code, then w will also not be in the code. That's the hope. So how can we implement this? Well, we can just take a hash function modeled as a random oracle, apply it to the hash values, and get our r. Then we take w to be r times C. So that's just combining the rows. OK, and now our commitment is the hash values and the w. And the first thing you check when you get a commitment is that w is in the code, to sort of implicitly check that all the rows are in the code. OK, so how do you verify an opening then? One aspect of this construction is that there is no explicit opening proof. Instead, I just give you the symbol. A symbol, to recall, is a column of this matrix. So I give you this column, ci. You check that w is in the code; you always check that. For position binding, you check that the hash is consistent.
And then you also check that along this column, the linear combination was formed correctly. So you check that r times ci equals wi. That's how you check it. And now let's think about this: is this actually code binding? The intuition is that it is, but can we prove it? OK, so what does it mean? We have an adversary, and this adversary breaks code binding. So it outputs a commitment, and it outputs some openings. And this adversary breaks code binding if all these openings together are inconsistent with the code, so there's no code word consistent with these openings. And now we want to rule that out. How can we approach this? The first thing we can do is say: if we model this hash function, again, as a random oracle, then these hash values define their pre-images, and we can, in fact, extract them. So this is a pre-image C star, the pre-image of all these hash values. And now, by collision resistance of the hash function, the ci's have to be consistent with the C star. So that means, in particular, that C star has to be not in the code, because that's the winning condition of code binding: whatever is consistent with the ci's is not in the code. But also, all the openings need to verify, so that tells us that w has to be in the code. And now the bad event that we have to rule out during the security proof is that these openings are consistent with the C star, and they are also, via this r, consistent with the w, but the C star is not in the code, and w is in the code. And now if we just look at this, then, believe me, you can bound this, but there's a huge problem. And this is that we have this "there exists" term here, over the set of opened positions. So we need to use a union bound. Why is that? Well, we don't get to choose the set I where Alice opens the commitment, right? Alice can choose this set I in an arbitrary way. And we don't know anything about it, except that it's large enough.
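Putting the pieces together, here is a hedged sketch of this hash-based interleaved commitment with toy parameters: SHA-256 as the random oracle, a [4, 2] Reed-Solomon base code with a hardcoded parity check, and all names illustrative. Note this is the basic variant, the one whose security proof runs into the union-bound problem just described.

```python
import hashlib

P = 97
k, n = 2, 4   # k rows, each extended to length n; symbols are columns

def H_bytes(*parts):
    h = hashlib.sha256()
    for part in parts:
        h.update(str(part).encode())
    return h.digest()

def derive_r(col_hashes):
    """Random combiner r derived from the column hashes (random-oracle style)."""
    return [int.from_bytes(H_bytes(i, *col_hashes), "big") % P
            for i in range(k)]

def in_base_code(word):
    # hardcoded parity check of the toy [4, 2] Reed-Solomon code at points 1..4
    return (word[2] == (2 * word[1] - word[0]) % P
            and word[3] == (3 * word[1] - 2 * word[0]) % P)

def commit(C):
    """C is the k x n row-encoded matrix. Returns the commitment
    (column hashes plus compressed row w) and the columns (symbols)."""
    cols = [[C[i][j] for i in range(k)] for j in range(n)]
    col_hashes = [H_bytes(*col) for col in cols]
    r = derive_r(col_hashes)
    w = [sum(r[i] * C[i][j] for i in range(k)) % P for j in range(n)]
    return (col_hashes, w), cols

def verify_opening(commitment, j, col):
    """No explicit opening proof: the symbol (column) itself is checked."""
    col_hashes, w = commitment
    if not in_base_code(w):                    # w must itself be a codeword
        return False
    r = derive_r(col_hashes)
    return (H_bytes(*col) == col_hashes[j]     # position binding: hash check
            and sum(r[i] * col[i] for i in range(k)) % P == w[j])  # r . c_j = w_j
```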
And that means we have to rule out this event for every set I. And the only thing that we can do here is use a union bound. But there's an exponential number of such sets, and that means the union bound will kill us here. So this is not a construction that we can prove secure. Now, how can we change it? The problem really is that Alice chooses this set I. And an idea that you could have is: what if Bob chooses this set I, or somehow this set I is chosen at random, and Alice cannot choose it? This is how we fix the problem. Again, we have the same hashes for position binding. We compute this random linear combination. This is everything as we had before. And now our fix is that, in addition to that, we sample some subset J of all the positions. This can be a rather small subset, of size roughly the security parameter or something like that. And then within the commitment, we force Alice to open the commitment at the indices in this subset. So now, if we look at the security experiment that we had before, Alice no longer chooses the set. Instead, the set is chosen at random, and so we don't have to do this union bound. That's roughly the intuition of this construction. Okay, so that's our hash-based construction. And before I wrap up, I'd like to compare these constructions and tell you about the trade-offs that we have here. So we have seen this Reed-Solomon plus KZG construction. And here the real advantage is that the commitment is constant size. It's just one group element, independent of the data size. And the same holds true for the communication complexity per query. So if you query, then you get one field element and one opening. So that's also independent of the data size. Then if we look at this tensor construction, so the Ethereum construction, we still have these advantages, but the commitment is larger now. It scales with the square root of the data size.
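The fix, deriving the forced-opening subset J from the commitment itself rather than letting Alice pick it, might be sketched like this (parameters illustrative):

```python
import hashlib

n = 16        # encoding length (toy)
J_SIZE = 4    # in practice roughly the security parameter

def derive_subset(commitment_bytes):
    """Sample the forced-opening set J pseudorandomly from the commitment
    (random-oracle style), so the committer cannot pick which positions
    get opened and the union bound over all sets is avoided."""
    J, ctr = set(), 0
    while len(J) < J_SIZE:
        d = hashlib.sha256(commitment_bytes + ctr.to_bytes(4, "big")).digest()
        J.add(int.from_bytes(d, "big") % n)
        ctr += 1
    return sorted(J)
```

The openings at the positions in J then become part of the commitment itself, and the verifier checks them before accepting.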
But we have an additional nice feature, which is that you can do the reconstruction in a local way, in the sense that you can take one row, reconstruct this row, and you don't care about the rest of the encoding. And now if we compare it to our new hash-based construction, then the clear advantages of this construction are that you don't need a trusted setup, we have sort of minimal assumptions, and, because you don't need public key operations, you can implement this over small fields, which gives you computational efficiency. Of course, at the same time, you have a larger commitment size and larger communication per query. But there's one final advantage that I want to mention, and this is the total communication needed to reconstruct the data. So what we can do is look at how many samples we need to make in order to be sure, with a certain probability, that we can reconstruct the data. And then you multiply that by the communication per query. So you get the total communication complexity over all clients that you require to reconstruct the data. And we computed this, and it turns out that for this construction, it's smaller than for the other ones. The main reason is that, for example, if you compare it to the tensor scheme, there you need three quarters of the symbols, right? So you need to make more queries to reconstruct. Okay, so that was some comparison. Let me now summarize what we have seen and what the next steps are for us. So I showed you the definition of data availability sampling that we came up with, and I showed you this framework for how to construct data availability sampling from erasure codes. And then we looked at specific constructions of erasure codes and erasure code commitments. Okay, and the next steps: well, we want to finish this paper. The technical part is done; it's more about writing the intro and text and things like that.
And then in the beginning of May, we want to submit it. And then we want to look at more constructions. For example, this FRI-based construction would be interesting, to see if we can improve what we have so far. So yeah, with that, thank you for your attention. If you have any questions, yeah. Okay, so I see there's a question in the chat: are you working in an asynchronous or synchronous setting? Okay, that's a tricky question. So we don't define the communication between these parties; the only thing we define is the clients as oracle machines that get oracle access to the encoding. So what we assume is that you make a query and you instantly get back the response. So you query position i, and you get back the i-th symbol of the encoding, or nothing, or something that the adversary chooses. So in that sense, you should think of it more as a synchronous setting. Yeah, that makes sense. Did that answer your question? Yeah, no, no, it does actually. And I guess for the definitions, that makes a lot of sense. For the constructions, you might want to start thinking about what happens if your network is asynchronous, because I think it's fair to assume that it is. I agree, yeah. Yeah, so I mean, obviously not for this paper, that'd be crazy, but this could be quite a nice follow-up. Yeah, I agree. So it's not directly clear what happens in an asynchronous network, because in the soundness definition the adversary is already allowed to adaptively answer the clients' queries. So that's even worse than an asynchronous network in some sense, because you can just give completely adaptively chosen, independent, incorrect answers to the different clients. No, asynchrony isn't really about adaptivity. Asynchrony is sort of saying that you don't know when the clients are going to respond.
Like, you know that you'll get a response, assuming they're online and available, but whether you get a response a minute from now or 10 minutes from now would be something that you would need to actively model. Well, for partial synchrony, you would need to consider what the actual time bounds are. And for full asynchrony, you would need to say we can't model time at all; we just need to assume that we're going to have some kind of interaction where we do it message-based, and when things arrive, we say done or not done. And yeah, I think that's very, very complicated, not for this. Yeah, so this, I guess, tackles a slightly orthogonal problem to what we defined in this paper, at least. All I was saying is that you could also see asynchrony as you're not getting a response, but in terms of: does it affect security or consistency or soundness? It does not. Liveness issues can exist, but security issues should not. That's all I was saying. Yeah, I think I agree. Because if you, for example, simply put a cutoff time, say if it takes more than 10 seconds and they haven't answered, then we say it's not available, then that would basically give you that, right? Then we would just say, yeah, the adversary hasn't answered. That is not to say that... I mean, you probably need to set that timeout to be very high if you're doing it like that, because otherwise you have to start considering the setting of, well, sometimes they're not going to be able to respond within 10 seconds, but they will respond within 20. So, yeah, it probably needs to be a ridiculous upper bound. Yeah, but yes, yes, but yes, as Mark said, that affects liveness. So if you want to also consider: can this actually work? Then yes. Maybe a short comment on what Francesco wrote in the chat. So that's a question of how you model it.
Like, you could say that completeness says: under good network conditions, under honest participants, under honest encoding, everything works well, in which case it would just be an additional assumption in your completeness definition. And then the messages not being delivered would be part of soundness. But you could also model it as saying you have completeness in an asynchronous network, and then you would have an extra definition. Basically, it's just a question of how you want to split it. There's not one right answer; it's just a question of your preference for modeling it. Any other questions? Or possibly a less technical one: how much longer are you planning to be doing your internship with us? Oh, technically it ended. Ah, technically it's ended. Okay, okay. So if I was to turn around and say this would make a great blog post, then you'd probably be around afterwards for that? But Mark... After the paper, yeah. That's great. Maybe one small comment regarding the fact that we abstracted away the network. So in the talk, Benedict was talking about the index sampler. One thing that we did try to model, at least a little bit — it's not explicitly modeled, but implicitly it is in there — is the following. In practice you would do the encoding and you would split it across a bunch of machines. Now, if I do fully uniform sampling, so I'm one client and I collect 50 samples, I might potentially need to contact 50 different machines to get my samples. So that's why we have the abstraction of an index sampler, where we say: maybe you don't want to have 50 uniformly random indices, but you want to have one random index and then 50 consecutive samples, in which case you would probably need to contact much, much fewer servers.
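That trade-off between uniform and consecutive sampling can be illustrated with a tiny simulation; all parameters are made up (1024 symbols striped contiguously over 64 servers, 50 samples per client).

```python
import random

N_SYMBOLS, N_SERVERS, Q = 1024, 64, 50
PER_SERVER = N_SYMBOLS // N_SERVERS   # each server stores one contiguous range

def servers_contacted(indices):
    """Distinct servers a client must talk to for these sample indices."""
    return len({i // PER_SERVER for i in indices})

def uniform_sample(rng):
    """Q uniformly random distinct positions."""
    return rng.sample(range(N_SYMBOLS), Q)

def consecutive_sample(rng):
    """One random start, then Q consecutive positions (wrapping around)."""
    start = rng.randrange(N_SYMBOLS)
    return [(start + d) % N_SYMBOLS for d in range(Q)]
```

A run of 50 consecutive indices touches at most a handful of contiguous server ranges, while 50 uniform indices typically spread over dozens of servers.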
So we will provide benchmarks and comparisons of the different sampling strategies and how long they take to collect the desired number of samples that you need for the different constructions. Yeah, and as far as I understand, this is also what is done in the Ethereum construction. You group together symbols and you don't query one symbol alone; you always query, I don't recall exactly, I think it was 16 symbols or something like that. So some contiguous segment. Yeah, and to understand how this compares and what is really better, we also simulated this and did some analytical work on it. But I think on its own, this index sampling is a very interesting problem. And the other thing, regarding what Gotti asked in the beginning about repairability, is maybe also worth mentioning. Informally saying "I want it to be repairable" is very easy, but actually writing down any sensible definition turns out to be extremely hard, because you have an adversary and you want to have some kind of repairability notion that says something is correct, but you didn't start with a correct object. So it's very tricky. We have one in the paper, formally written down, that does seem to model something that seems sensible to us, but even that definition may still be a bit too weak. So for example, if you want to distinguish the Ethereum construction from a standard Reed-Solomon code with KZG construction, this definition doesn't necessarily highlight some practical differences that you would have. But making a stronger definition that does highlight them is hard, because, I mean, you want to make a definition that is expressive, but also understandable, because it can be very expressive and useless as a definition, or the other way around.
So we tried to find the middle ground, and there's still maybe some work to do to find an even stronger but still useful definition of repairability. That's a very tricky topic, actually. It seems very innocuous, but it's very hard to define. Okay. Very nice work. Thank you so much, Benedict. Yeah, it was fun. This was a really good presentation. Thank you. Yep, thank you. Yeah, awesome. Great work. I'll have to run, so yeah, thanks. And yeah, bye bye. Great. Nice seeing you. Thanks a lot, Benedict. Bye.