Okay, so our next speaker is Dankrad Feist. Dankrad is an Ethereum researcher with a background in theoretical physics and technology. Since joining the Ethereum Foundation in 2019, he has worked on topics involving applied cryptography, sharding, statelessness, the proof of custody and other related topics. His talk today is on data availability commitments with distributed reconstruction, thanks to KZG commitments, and how they were able to construct a unique data sharding solution that supports high data bandwidth while preserving security properties, without requiring powerful actors beyond normal validators. Dankrad, I believe you're up.

Okay, fantastic. Thank you. Let me try to share my slides here. Okay, we can see your screen. Okay, great.

Cool, thanks for the introduction. Yep, so I'm going to talk about data availability commitments that allow distributed reconstruction, and how we're able to do that using a two-dimensional KZG scheme. As an outline, I will be talking about data availability sampling and how it works, the basic principle; then the Merkle-tree-based construction that was first suggested in 2018; then how we arrived at fraud-proof-free constructions using KZG commitments; and finally, how we improved those using a 2D scheme that also allows distributed reconstruction and removes the dependency on super nodes for everything except liveness. Well, potentially you could even remove it for that.

So let's talk about data availability sampling. The idea of data availability sampling is that we want to know that O(n) data is available while doing much less than O(n) work. What we basically want is to scale data availability, so we need to somehow do less than O(n) work to ensure this. And the basic idea is: what if we distribute the data into n chunks, and each node selects k random chunks, downloads them, and uses that to check whether the data is available?

The basic problem with that idea is: what if one of the chunks that you aren't checking is missing? In some applications that might not be such a big problem; you might be okay if 99.9% of the data is available. But in blockchains, where for example one single transaction could print a billion ether, that's not sufficient. We really need to be sure that all the data is available, otherwise we can't use it.

So in order to be able to use this technique, we need to make it more powerful using something called erasure coding. Erasure coding means that we take our original data and extend it, for example using a polynomial. If we have these four data points on the left, we can always put a polynomial of degree three through them and evaluate it at a number of additional places, for example four more places, in order to double the original data. Using the property that any polynomial of degree three is fully determined by four evaluations, we know that if you have any four of these eight samples, you can recover the full data: you know everything about the polynomial and can thus compute all eight samples. This is called a Reed-Solomon code, and in this example the coding rate is 0.5, because we extended the data by a factor of two.
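To make the erasure-coding step concrete, here is a minimal sketch of a Reed-Solomon extension over a toy prime field. This is my own illustration, not code from the talk: production schemes work over the BLS12-381 scalar field and use FFTs rather than naive Lagrange interpolation.

```python
# Toy Reed-Solomon extension via Lagrange interpolation (illustration only).
P = 65537  # small prime modulus chosen for the example

def lagrange_eval(xs, ys, x):
    """Evaluate the unique degree-(len(xs)-1) polynomial through (xs, ys) at x."""
    total = 0
    for xi, yi in zip(xs, ys):
        num, den = 1, 1
        for xj in xs:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

data = [7, 13, 2, 9]                       # four original chunks
xs = list(range(4))
extended = [lagrange_eval(xs, data, x) for x in range(8)]  # rate-0.5 code
assert extended[:4] == data                # the code is systematic here

# Any four of the eight samples recover the whole polynomial:
subset = [1, 3, 4, 6]
recovered = [lagrange_eval(subset, [extended[i] for i in subset], x)
             for x in range(8)]
assert recovered == extended
```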
And if we do this, then the sampling that we described earlier becomes efficient. Before, we had no way of using sampling to efficiently make sure that all the data is available: to catch a single missing chunk with high probability, we would have to sample almost everything. For example, if you wanted 99% probability, you would have to sample 99% of the data, which is not efficient. But once we erasure code, it becomes different. If an attacker wanted to hide any data here, they would have to provide less than 50% of all the samples. And in that case, if you do, for example, 30 random samples, the probability that all of them pass is only 2^-30.

So that's great: erasure coding can make data availability sampling efficient. But now we have another problem: we need to ensure that the coding is correct. If the coding is incorrect, what could happen is that an attacker, instead of providing eight correctly encoded samples as in this example, just provides random samples. Then each set of four of these would not give you back the same polynomial; each would give a different polynomial. And then clearly your sampling doesn't help, because you still don't know what the original data is, unless you get the original data itself.

There are mainly three possible approaches to ensuring the correctness of an erasure code. A: you can use fraud proofs. B: you can prove that an encoding is correct, in a sort of naive way, using a snark. And variant C, which is actually similar to B, uses polynomial commitments; this is also a form of proving that the encoding is correct, but it uses the cryptography more natively and therefore becomes much more efficient, as we're going to see.

Using fraud proofs: let's assume that we commit to a data availability root, which is basically a Merkle root of our data samples. Then what would be required for a fraud proof? Say these samples are not all on the same low-degree polynomial. We need at least degree plus two pieces to construct a fraud proof: enough pieces to reconstruct the polynomial, which takes degree plus one, and then one more, so that the node can see, oh no, these are not all on one polynomial. So the fraud proof would, in this naive construction, be the same size as the data block. That's pretty terrible, because the worst case stays the same as it originally was: you still have to send the whole data. So that's not practical for our application; we would only get an average-case improvement in efficiency. Here's an example of this: you would need to give these five pieces, in our example, in order to prove that the data is not on a low-degree polynomial.

The solution to that is to instead use 2D codes. Basically, you encode the data into a two-dimensional polynomial, which means that each row and each column is in itself a one-dimensional polynomial of low degree. This is nice because if there's an incorrect encoding anywhere, you can prove it by giving just the row or the column where the mistake happened. So the fraud proofs now become order square root of n of the data size, instead of order n. And this was basically the first practically efficient data availability scheme.
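A quick sanity check on the sampling numbers above (my own back-of-the-envelope sketch, not from the talk): to block reconstruction, an attacker must publish less than half of the extended samples, so each independent random sample passes with probability below 0.5, and 30 samples all pass with probability below 2^-30.

```python
import random

def pass_probability(k, available_fraction=0.5):
    # Chance that all k independent random samples land on published
    # chunks, i.e. the chance a sampling client is fooled.
    return available_fraction ** k

print(pass_probability(30))      # ~9.3e-10, i.e. 2**-30

# Monte Carlo cross-check: 8 extended chunks, only 4 published, 3 samples.
trials = 100_000
fooled = sum(all(random.randrange(8) < 4 for _ in range(3))
             for _ in range(trials))
print(fooled / trials)           # ~0.125 = 0.5**3
```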
The 2D construction is also nice because it still only uses hash functions; all of this can be built with Merkle trees. This construction was proposed by Al-Bassam, Sonnino and Buterin in 2018.

The coding efficiency of 2D schemes is a bit lower, though. That is because if we extend by a factor of two in each dimension, so we keep the same coding rate of 0.5 per row and column as before, then the data actually gets extended by a factor of four, and we now require three quarters of the data to guarantee that all of it can be reconstructed. Basically, imagine I am trying to hide data. What I could do is hide a little bit more than one of these squares, for example the lower right square. If I hide just a little bit more than that, then you wouldn't be able to reconstruct the data, because each affected row and each affected column would be missing a little bit more than 50%, so it would not be reconstructable. So we need three quarters to be sure that everything is available (see the small simulation below). If we compare that to the 1D scheme, which extends by a factor of two and requires one half for reconstruction, the 1D scheme is more efficient in the amount of data that you have to put on the network.

But the main downside of this ASB-18 scheme is that it requires fraud proofs to verify consensus correctness. Since we don't know that an encoding is correct just by looking at the root, we need to wait for fraud proofs before we can be sure that a chain is correct. And that's impractical for consensus nodes, i.e. the stakers who construct the chain: if they always had to wait for fraud proofs, you could construct a chain like that, but you would be waiting at least several minutes each time, because a fraud proof could take a while to arrive and you really want to be sure there isn't one. So it would be a very, very slow chain, and that's not what we want. So instead, anyone who uses this design is probably going to make a different decision, which is, I believe, what Celestia, relying on this design, is doing in practice: you use super nodes as the consensus nodes. You require everyone who actually participates in consensus by constructing the chain to not just sample the data but download the full data, so that they are sure there is no fraud, because if you download all the data yourself, you know that the encoding is correct. This is not a practical solution for the Ethereum design, because what we are really keen on is that the consensus can remain distributed, so that staking can run on, say, a Raspberry Pi at home, which cannot process these amounts of data, for example because it also relies on a much smaller internet connection than would be required for this.
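Coming back to the three-quarters threshold: here is a toy simulation of mine, assuming simple greedy row-and-column decoding, showing that hiding a (k+1)-by-(k+1) corner of the 2k-by-2k extended square defeats reconstruction, even though roughly 61% of the square is still available in this small example.

```python
# Toy 2D availability simulation: k x k data extended to a 2k x 2k square,
# where any row or column can be decoded once k of its 2k cells are known.
k = 4
n = 2 * k
known = [[True] * n for _ in range(n)]
for r in range(k - 1, n):            # hide a (k+1) x (k+1) corner,
    for c in range(k - 1, n):        # a bit more than one quadrant
        known[r][c] = False

def fully_reconstructable():
    progress = True
    while progress:
        progress = False
        for r in range(n):           # decode any row with >= k known cells
            if k <= sum(known[r]) < n:
                known[r][:] = [True] * n
                progress = True
        for c in range(n):           # decode any column with >= k known cells
            col = sum(known[r][c] for r in range(n))
            if k <= col < n:
                for r in range(n):
                    known[r][c] = True
                progress = True
    return all(all(row) for row in known)

hidden = sum(row.count(False) for row in known)
print(hidden / n**2)                 # ~0.39 of the square withheld
print(fully_reconstructable())       # False: every affected row and column
                                     # has only k-1 known cells left
```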
Okay, so the next idea for how you could get around this fraud proof problem is to take a Merkle root of an encoding and, instead of using a fraud proof to ensure its correctness, construct a full snark that shows the encoding is correct. But doing this the naive way is very expensive. One way to somewhat alleviate that is to use modern snark-friendly arithmetic hashes, but we aren't really confident in them yet. They're not that well proven, so currently we wouldn't really know which exact hash function with which parameters to use, so that we could still be confident in 10 or 20 years that it is a safe function. If we instead use well-proven hash functions, then it is very expensive. Either way it would be a pretty big, data-center-like operation to compute these roots, so we don't really consider this practical at the moment.

But what has now become practical is a third option, which is directly using the polynomial commitment schemes on which these snarks are built. For example, a commitment scheme like KZG (Kate-Zaverucha-Goldberg, 2010) allows us to directly commit to a polynomial, which is, in other words, a Reed-Solomon code, and the correctness is enforced by the commitment scheme itself. And that's orders of magnitude faster than using a general-purpose snark on a Merkle root.

The KZG commitment scheme takes a polynomial, defined here by f(x), and based on that you can compute a commitment C(f) to the polynomial. For any evaluation y = f(z) of that polynomial, a prover who has the polynomial itself, i.e. the coefficients, can compute a proof pi that f(z) = y. Using the commitment C(f) and pi, as well as the values y and z, a verifier can confirm that indeed f(z) = y. And C(f) and pi are just elliptic curve elements; on BLS12-381, which we are using, that's 48 bytes each. So they are very nice and compact.

The way we use them as data availability roots is that you take points on this polynomial: you set the samples as f(0), f(1), f(2) and so on, and then you compute the KZG root of that polynomial. You can basically think of this as something similar to a Merkle root, but it's always guaranteed to be on the same polynomial.

In terms of efficiency: someone needs to compute all these samples, and in particular they need to compute their KZG proofs, because a sample is only valid if you have the corresponding KZG proof. This is a lot more expensive than a Merkle proof. Naively it takes O(n) work to compute one such proof: unlike a Merkle proof, where you only need to touch log n elements, a naively computed KZG proof needs to touch all the coefficients of the polynomial. If you did that for all the proofs it would be O(n^2), which clearly wouldn't be practical. But luckily there is a technique that Dmitry Khovratovich developed together with me in 2020, which allows using FFTs in the group to compute all these proofs in n log n time, and that makes it practical to use this as a data availability scheme and compute all the proofs.
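To make the commit/prove/verify flow concrete, here is a minimal sketch of mine, assuming the py_ecc library (a BLS12-381 implementation used in Ethereum tooling). The setup uses a known secret purely for demonstration, which is exactly what a real trusted setup ceremony must avoid, and none of this is the production code.

```python
# Minimal KZG commit/prove/verify sketch using py_ecc (BLS12-381).
# Toy setup only: the secret s is known here; a real ceremony discards it.
# The non-optimized py_ecc API is used for clarity; it is slow but runnable.
from py_ecc.bls12_381 import G1, G2, add, multiply, neg, pairing, curve_order

def setup(s, max_degree):
    # Returns ([s^i]_1 for i = 0..max_degree, [s]_2).
    powers_g1 = [multiply(G1, pow(s, i, curve_order))
                 for i in range(max_degree + 1)]
    return powers_g1, multiply(G2, s)

def commit(coeffs, powers_g1):
    # C = [f(s)]_1, a single G1 element (48 bytes compressed).
    acc = None                       # None is the identity point in py_ecc
    for c, p in zip(coeffs, powers_g1):
        acc = add(acc, multiply(p, c % curve_order))
    return acc

def evaluate(coeffs, z):
    y = 0
    for c in reversed(coeffs):       # Horner's rule
        y = (y * z + c) % curve_order
    return y

def prove(coeffs, z, powers_g1):
    # pi = [q(s)]_1 with q(x) = (f(x) - f(z)) / (x - z), by synthetic division.
    n = len(coeffs) - 1
    q = [0] * n
    q[n - 1] = coeffs[n]
    for i in range(n - 1, 0, -1):
        q[i - 1] = (coeffs[i] + z * q[i]) % curve_order
    return commit(q, powers_g1)

def verify(c, pi, z, y, s_g2):
    # Single pairing check: e(C - [y]_1, [1]_2) == e(pi, [s - z]_2).
    lhs = pairing(G2, add(c, neg(multiply(G1, y))))
    rhs = pairing(add(s_g2, neg(multiply(G2, z % curve_order))), pi)
    return lhs == rhs

powers_g1, s_g2 = setup(s=123456789, max_degree=3)
f = [3, 1, 4, 1]                     # f(x) = 3 + x + 4x^2 + x^3
c = commit(f, powers_g1)
z = 5
y = evaluate(f, z)
assert verify(c, prove(f, z, powers_g1), z, y, s_g2)
```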
Now if we use this in the way we've described, as a commitment scheme for one polynomial, then we need to consider one further thing, which is reconstruction. If not all the samples are available, we need to be able to reconstruct the polynomial, so that we can recover all the samples where that is possible. And that's for two reasons. One is that we need the original data: applications, whether it's executing the chain or rollups, will of course need the actual data, so if some of the original data is missing, we clearly need to reconstruct it in order to have the data. But from a consensus point of view, the more important reason is the convergence property.

We clearly want all nodes to eventually agree on whether a block is available or not, because this is now one of our additional validity conditions. So we need all nodes to come to the same conclusion. Either less than our threshold, less than three quarters, is available, in which case not enough samples are known and all nodes will agree the data is unavailable; or an attacker could make, say, a somewhat higher proportion available, and in that case some nodes would see the data as available and some would not. Reconstruction is what resolves this: once enough samples are out there, someone can reconstruct and redistribute everything, so all nodes converge. And this brings us to the problem with the 1D KZG scheme: being able to reconstruct all the samples would still require super nodes. It would require nodes that are able to download all the data and compute the individual proofs for all the samples in order to distribute them. So reconstruction in this scheme requires super nodes, and that means that in the absence of honest super nodes, a malicious actor could split the chain.

So this is where we come to the two-dimensional KZG scheme. In a way this looks similar to what we had earlier with the original scheme proposed in 2018 using Merkle roots, but now we commit with KZG to each row, and those eight KZG commitments in this example themselves lie on a polynomial. So you can do a polynomial check on these. And this now has the property that the rows and columns can be individually reconstructed. A node that wants to help in reconstructing all the samples can download one row or one column, and if more than 50% of it is available, it can recover all the samples in that row or column. That way, you can actually reconstruct the full square of data using only nodes that process rows and columns. And you can see the similarity to the Merkle-root-based 2D scheme, where our original motivation for the 2D scheme was to minimize the fraud proofs. In this case we aren't worried about fraud proofs anymore, because the KZG scheme guarantees correctness, but we can now reconstruct in a distributed way, which is another important property.

We can even use this to construct the block in a distributed way. So nobody actually needs to process all the data and compute all the proofs and all the extensions and all the KZG commitments for it, because you can also do that by rows and columns. The nice property is that even the KZG commitments can be extended: you can just compute a polynomial extension on the data KZG commitments and get the extension KZG commitments below. That's also a really cool property, so we can also support distributed block production, which in my opinion is a little bit less important than distributed reconstruction, but it's still a nice property to have. This scheme was proposed by Vitalik Buterin in 2020.

So what it achieves, basically, is this: if we don't do distributed block production, the last point, then super nodes are only required for liveness. If we do distributed construction, then theoretically we can construct a scheme that does not require super nodes for anything. And at the very least, all the safety properties can be ensured by nodes that can only process rows and columns.
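The claim that even the commitments can be extended follows from linearity: a KZG commitment to a linear combination of polynomials is the same combination of the commitments, applied in the group. Here is a sketch of mine, under the same py_ecc assumptions as above, where `row_commitments` would be outputs of the earlier `commit` helper; it derives the extension rows' commitments from the data rows' commitments alone.

```python
# Extending row commitments directly (same py_ecc assumptions as above).
# By linearity, C(sum_i a_i * f_i) = sum_i a_i * C(f_i), so the Lagrange
# coefficients that extend the data rows also extend their commitments.
from py_ecc.bls12_381 import add, multiply, curve_order

def lagrange_coeffs(xs, x):
    # Coefficients a_i such that p(x) = sum_i a_i * p(x_i) for deg p < len(xs).
    coeffs = []
    for xi in xs:
        num, den = 1, 1
        for xj in xs:
            if xj != xi:
                num = num * (x - xj) % curve_order
                den = den * (xi - xj) % curve_order
        coeffs.append(num * pow(den, -1, curve_order) % curve_order)
    return coeffs

def extend_commitments(row_commitments):
    # Input: commitments to data rows 0..k-1 (e.g. from commit() above).
    # Output: commitments to extension rows k..2k-1, computed without
    # ever touching the underlying row data.
    k = len(row_commitments)
    xs = list(range(k))
    extended = []
    for x in range(k, 2 * k):
        acc = None
        for a, c in zip(lagrange_coeffs(xs, x), row_commitments):
            acc = add(acc, multiply(c, a))
        extended.append(acc)
    return extended
```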
So, as a conclusion, I've made an overview here to show the downsides and upsides of the different schemes. On the left we have the ASB-18 Merkle fraud-proof-based scheme. In the middle I have any scheme that is basically a proven Merkle root, so either a snark on a Merkle root or a 1D KZG scheme; they have very similar properties. And on the right I have the relatively new 2D KZG scheme that allows distributed reconstruction. The big additional thing that we get is the convergence property from distributed reconstruction. The original ASB-18 scheme also had this property, since it was also a 2D scheme, so that's not a completely new property; but unfortunately it requires super nodes in the consensus, because otherwise the consensus would depend on fraud proofs, which would be problematic.

Cool, so that is the end of my talk. Thank you, and I'm open for any questions if there are any.

Okay, thank you so much for your talk, Dankrad. If you have questions, now is a good time to ask them. Again, you can ask them in text or raise your hands if you want to speak. Keep in mind that if you do speak, you will be on the recording, so if you would rather that I read your question, just note that in the text. So far I see one question in Slack, and I will read it out because Marko is having some tech issues. The question from Marko is: what are super nodes? I might have missed it. And could you explain the difference between light nodes and super nodes?

Right, so super nodes are any nodes that need to process all the data. A super node would be a node that is able to download and process the full block data, without doing any sampling. And you're right, I didn't define the terminology before. A light node would be one that only does data availability sampling. And in between, we would have nodes that download rows and columns, which is still much less than a super node, but a bit more than what a light node does, which is only some random sampling of the data.

Okay, next we have a question from Andrei. Andrei, would you like to read your question aloud? Okay, my question was: how do you compute the public parameters for the KZG commitment? You mean the trusted setup? Yeah. Yes, so we're currently in the process of arranging that. We are going to have a trusted setup ceremony later this year, so that is in the works. The good thing is that we only need a very small trusted setup: I think the current design only needs 2^12, and we'll go a bit further in order to have some room for future expansion, maybe 2^15 or something like that. So it's going to be a super small one and a super quick ceremony, so we can have lots of participants.

Could you elaborate on why you need only a small trusted setup? So the reason is that in the 2D KZG scheme, you only need commitments for one row, basically, and the rows are relatively small. In the current design, for example, a row is a bit more than 4,000 field elements, so that's a very small trusted setup that you need in order to commit to that.

Thank you, and I have one more question. So the 2D KZG commitment, is it like a black-box construction out of 1D KZG commitments, or is it separate? Okay, so there are different ways. The way we're proposing it is to simply give the individual commitments: in the second dimension, you basically give a list of commitments. So in a way it's not a 2D KZG commitment, it's a list of 1D KZG commitments.
You could also directly use 2D KZG commitments, which would make things slightly smaller, but it would provide much less flexibility in terms of how you can distribute transactions and so on, so this small overhead seems worth it to us. So sorry, it's probably a stupid question: this list is small, I mean, isn't it square root of N? Sure, yeah. Okay, sorry, thank you. That's fair, okay, yeah.

Okay, I have another question from Marco. Do you expect this to go into production, and if so, do you have concrete plans for when? Yeah, so it's the current plan for Ethereum sharding. A first version that will use the same commitment scheme, but won't actually do sampling yet, so something like a more traditional scheme, will go into production probably next year, I think. That's EIP-4844, which we're planning for the next hard fork after sharding, sorry, after the merge. And then, yes, I would say within hopefully one or two years after that, this will go into production on Ethereum.

And we seem to have another question from Alfonso. Yeah, really quick. I mean, we have a bunch of light nodes doing these data availability proofs. Would we be able to recover the original data from these proofs? If I'm a light node and, I mean, the samples are out there, would I be able to recover the data as well, like with traditional Reed-Solomon coding? Sorry, what is the question? Whether I would be able to recover the original data from the samples that are distributed. Yes, if you know enough samples, you can reconstruct the data. Okay, so as long as I have enough samples, I would be able to reconstruct the original data used for the proofs. That's right, that's right. So like traditional Reed-Solomon, we would have the same properties with these commitments. That's right, yeah.

I'm not seeing any other text questions, if there is a question that you would still like to ask. Maybe just a quick question. Yes, go ahead, Juliana. Yeah, a very basic question, still early morning. So that's pretty much it, but can you go again over the very basic motivation behind those features that you presented? Not how they're realized, but why do we want them? Which one in particular do you mean? Just really basic: why would I want light nodes that can only verify data availability, but maybe not reconstruct the data, or things like that?

Right, well, why do you want those light nodes? I mean, that's the core paradigm of scaling, right? What we want is blockchains that have the same properties as now. Right now, miners cannot construct an Ethereum chain that's not available, that has missing blocks, because full nodes would just not accept it. And that is one of the properties that we want to preserve as we scale the chain. So we want to keep the same security properties as we have now, but be able to scale Ethereum, and what we need for that is to come up with constructions that scale. And why would you want a light node, I guess, is another possible point of your question. Why would you be interested in a light node that can only tell you, yes, this is the head of the chain, but cannot tell you what the data is, that for example would not be able to construct the actual state? Because you can get your state somewhere else.
For example, say you have this light node that tells you: here is the correct tip of the Ethereum chain. Now you could go to someone else and ask, hey, can you give me the balance of my account, or tell me whether this transaction went through? And they could give you a proof based on your latest state root, right? So they couldn't cheat you. They could give you the data, and they could add a witness that shows this data is correct based on the latest tip of the chain that your light node has provided you. Does that answer your question? Yeah, it helps, thanks.

Okay, I think we still have time for one last question, if there are any. I have a small question about the motivation. There was a part of the talk where you talked about the availability proof, but I'm not sure I quite understand what the purpose of the availability proof is when a node can refuse to serve the data when it is actually needed. So it can have the data, but it can refuse to help reconstruct it when it's needed.

Right, so I think the best way to see this is as a gadget that prevents someone from withholding data. So yes, I agree, we do require actors: like, say I'm a light node and I don't actually know the data; I will need to get the data from someone when I actually want to find out something about the chain, right? Of course that's true, but my assumption is that Ethereum data is interesting enough that there will always be people around who will provide it for you. Maybe you have to pay them, but it's possible. What we need to make sure is that these actors, even if there's only one honest one left among them, get guaranteed access to this data: that they are guaranteed to be able to download this data if they want to. It's a little bit like a proof of data publishing, which is maybe also a good name for this. Basically, whoever is constructing the chain, even if the majority of consensus nodes is malicious, cannot force others to accept the chain while withholding the data. Does that make sense?

I'm not sure how the sampling helps, because there may be a node who has all the data, and you sample it and you see that it has the data, but then when you want to reconstruct it, it refuses. Okay, so if only one person does the sampling, then it doesn't work. What you rely on is that there are lots of these light nodes that all do sampling. Now if the node tries to cheat, it could target you: if it knows who you are and knows your IP address, then it could give you all the samples while giving nobody else any, and then it could trick you. But for an average person that would not happen, because it requires taking over the whole consensus to make it vote for such a chain. So basically this mass of samplers protects you, because in aggregate they would hold enough of the samples that the data could be reconstructed. So after they sample, do they store the samples locally? Yes, you need to keep the samples that you've downloaded around, and provide them later if someone is trying to reconstruct. Okay, no, that makes sense.

Okay, fantastic, perfect timing. We are exactly at time. Thank you so much, Dankrad, for your talk and for your comprehensive explanations.