My name is Brandon Ramirez. I'm a co-founder of The Graph and of Edge & Node, one of its core developer teams. Today I'm going to be talking to you about the importance of decentralizing the blockchain data supply chain, and specifically why it's important to the Ethereum vision and to the Ethereum roadmap moving forward.

Before we get into that, first things first: what do we mean when we say the blockchain data supply chain? Simply put, it's everything that happens between you, as a user, sending a transaction to modify the state of the blockchain, and another user reading that modified state based on the transaction you sent. At a high level, we can break that up into a few stages. On the left, we have writing state: everything that happens between you sending a transaction and it getting included in a block on the canonical chain through consensus. Then downstream of consensus, you have everything that's required for that state to be read by the various applications you might use that state for, for example a decentralized application.

I think most of us are probably more familiar with the write side of the supply chain, because it's gotten a lot of attention in the last few years with things like MEV and proposals like Proposer-Builder Separation and MEV-Boost. So this is an example of what the write side of the blockchain data supply chain will look like in Ethereum in the future: you have a transaction going to a private or a public mempool, you have searchers bundling that into larger transaction bundles, you have block builders, these specialized actors trying to extract as much MEV as possible, and then eventually it makes it to a proposer and gets validated as part of consensus.

Downstream of consensus, we have everything that's required to get that data into a useful place for you to do something with it as a user. Broadly speaking, we can break that up into extract, transform, and load. You don't always need to go through all those steps. For example, JSON-RPC is a form of extraction, and there are many use cases that just consume JSON-RPC directly. Many people today consume blockchain data through subgraphs, in which case they're using subgraph mappings as a transform step and then loading into a database or store that's optimized for GraphQL queries. Later in this talk, I'll touch briefly on two technologies, Firehose and Substreams, which are next-generation versions of the extract and transform steps in the supply chain. Got some autonomous lights over here.
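To make that extract-transform-load framing concrete, here's a minimal TypeScript sketch of the two read paths just described: raw extraction over JSON-RPC versus querying an already-transformed subgraph over GraphQL. The endpoint URLs and the `transfers` entity are hypothetical placeholders, not anything from the talk.

```typescript
// Two ways of reading the same chain, at different points in the supply chain.
// NOTE: both endpoint URLs and the `transfers` entity are hypothetical examples.

const RPC_URL = "https://example-rpc.invalid"; // placeholder JSON-RPC endpoint
const SUBGRAPH_URL = "https://example-subgraph.invalid/graphql"; // placeholder

// 1. Extract: pull a raw block straight out of a node via JSON-RPC.
async function extractLatestBlock(): Promise<unknown> {
  const res = await fetch(RPC_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_getBlockByNumber",
      params: ["latest", false], // false = return tx hashes, not full txs
    }),
  });
  const { result } = await res.json();
  return result; // raw data: any transform/load is left to the consumer
}

// 2. Extract + transform + load, done upstream: query a subgraph whose mappings
// have already turned raw events into entities stored for GraphQL queries.
async function queryRecentTransfers(): Promise<unknown> {
  const res = await fetch(SUBGRAPH_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: "{ transfers(first: 5, orderBy: blockNumber, orderDirection: desc) { from to value } }",
    }),
  });
  const { data } = await res.json();
  return data;
}

extractLatestBlock().then((block) => console.log("raw block:", block));
queryRecentTransfers().then((t) => console.log("transformed entities:", t));
```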
At a high level, we can categorize the different approaches to decentralizing the blockchain data supply chain. The first approach, and I would say for a long time this was the conventional thinking, is what we call Web 2.5. In Web 2.5, basically what you're saying is: we do all the things we've been talking about the last few years to decentralize the write side, so protecting against MEV, keeping good stake decentralization and good validator decentralization, but downstream of that, we let the reads go through proprietary APIs, centralized servers, et cetera. I think the reason there's been so much more attention on the Web 2.5 vision is that failures on the write side of the supply chain tend to be a little more visible, or at least a little more intuitive. If your transactions are being censored from inclusion in consensus, that's something that's going to be very apparent, something you can point to. Similarly, if there's a determinism issue upstream of consensus, like some kind of data or execution inconsistency, that's going to lead to a consensus failure, potentially an unplanned chain split, and that's a very visible type of failure. So, rightfully so, there's been a lot of attention on everything upstream of consensus. But you can also have data inconsistencies downstream of consensus; those are things that indexers in The Graph have come across and helped debug, but they just don't get as much attention, because inconsistencies in the way you read data downstream of consensus impact you as a user without showing up as, say, a consensus failure.

The other approach is what we call the Web3 approach. Web3, simply put, is full-stack decentralization in the context of the blockchain data supply chain. What that means is decentralized writes, like what we just talked about, but also decentralized reads. A lot of my thinking on this is obviously influenced by my work on The Graph, but I want to note that there are other projects in this space, like Portal Network, Pocket Network, and TrueBlocks, that are all taking varied approaches, and I definitely recommend checking out this landscape of decentralized reads.

So why do decentralized reads matter to the Ethereum vision, or to the Web3 vision? I think we can start by reasoning about it from first principles and what it looks like to access a decentralized system through a centralized intermediary. By a quick show of hands, does anyone see something wrong with the diagram on the right side of the screen? I'm glad I can educate you today. What we're showing on the right is a decentralized system, on the top right, being accessed through a centralized intermediary. So much of the work that you've heard about from the Ethereum researchers is really about maintaining decentralization at the blockchain layer, at the consensus layer. And then what we've effectively done in many instances is take this decentralized system, with all its benefits and all its advantages, put it inside of a box, and access that box through a centralized gatekeeper. And at that point, what's inside the box really becomes irrelevant, right? All the work that's been done to build this beautiful decentralized system is opaque to the end user and to the applications built on top.

Afri, one of the early prominent Ethereum core developers, put it really concisely: if DApps are still accessing the blockchain through centralized hosting, then the Ethereum vision has effectively failed. I was working on an analogy to describe the Web 2.5 dilemma, and hopefully this isn't too soon, but one analogy that made this stick for me is: if Web 2 is someone sneezing on you during the pandemic, and Web 3 is a properly fitted N95 mask, then Web 2.5 is kind of like a chin diaper, right? I know there are a lot of different opinions and ideas on the efficacy of mask mandates, et cetera, but I think we can all agree that the chin diaper really didn't make a lot of sense. And that's Web 2.5, right? It just doesn't make sense.
It's incongruous with the goals of blockchain and Web3. Specifically, it undermines a lot of the value propositions that the Ethereum founders and a lot of the early builders in Web3 united around in the first place: being able to build unstoppable applications with composability. When you start building on centralized infra run by a single service provider, your app is no longer unstoppable; it stops as soon as that platform gets shut down. And once you stop building on open standards, the composability of your applications and the stuff you've built goes down quite a bit. Even if these centralized servers were standardized, there's a limit to how much composability you can get when you're building on centralized building blocks, because the more centralized building blocks you compose, the more brittle your system becomes: if one centralized building block in your giant composed system goes out, that ripples into breaking your entire system. Contrast that with unstoppable smart contracts running on Ethereum, where a new DeFi protocol can come along, integrate with an existing unstoppable DeFi protocol, and that composition can keep playing out until you get this really rich tapestry of decentralized applications. It definitely hurts usability too; you're just doing one-offs all the time.

Then on ownership, censorship resistance, and fairness, I'm going to try to paint this with a few examples. The first: how many people saw the article that Moxie Marlinspike from Signal wrote earlier this year? Okay, let's say 30%. He wrote, and I recommend you look it up, a very well-intentioned, well-thought-out critique of Web3 from someone actually making a good-faith attempt at building decentralized applications in the space. One of the things he did was build an NFT app, and to kick the tires on the guarantees we think we have when we're building with NFTs, he built a single NFT that, depending on where you viewed it, whether on OpenSea, on Rarible, or in your own wallet, showed one of the three images on the screen. So when you're actually buying the NFT or looking at it in a collection, you're seeing this cool piece of artwork, and by the time it makes it to your wallet, you get a poop emoji. That's possible because the NFT was not hosted on decentralized data using a decentralized data supply chain; it was using centralized servers to serve that NFT. And for a lot of us who are inspired by the ownership vision of Web3, that Web3 enables digital ownership, how can we own things when the actual information artifacts we own are controlled by centralized intermediaries?

This experiment also highlighted a way in which we have censorship in quote-unquote Web3 today, because once OpenSea realized what Moxie had done, they delisted his NFT from OpenSea. If OpenSea were doing that as an individual application, for their own purposes, I think that's fine; we believe in choice at the level of projects and individuals. But the problem was that so many other projects in the ecosystem were building on the OpenSea API that getting delisted from the OpenSea application also made the NFT stop showing up in wallets and all the other places where you would expect to see an NFT you had purchased.
So this is a real example of censorship that happened just this year in the context of the blockchain data supply chain, for a use case that I think we all care a lot about. Another one, and this is a little more speculative, but I think it's interesting as a thought experiment, is indexer extractable value. This was an idea put forth by someone in our community. You can think of it as being a little analogous to MEV, or to payment for order flow, which got a lot of attention this past year or two when people realized that Robinhood's business model for offering free stock trading to retail investors was to sell the order flow to giant hedge funds like Citadel, and then Citadel, with its privileged access to this order flow, could extract all this value at the expense of the retail investor. You could imagine a world where, if we were all accessing the blockchain, where we expect more of our financial lives to exist over time, through a centralized intermediary, that intermediary might gain a lot of alpha from being the one seeing the Google-Trends-style data around how we access and query our financial lives as they exist on the blockchain.

Okay, so that's where the Web3 vision is today, and why keeping the blockchain data supply chain decentralized matters. Let's talk about where Ethereum is going and the ways in which a decentralized blockchain data supply chain supports that future vision. For those who saw Vitalik's EthCC talk earlier this year, you might have come across these new, let's call them work streams, that are all happening in parallel: they're not sequential milestones, but parallelized work streams happening in the Ethereum ecosystem: the merge, the surge, the verge, the purge, and the splurge. The diagram on the left can look a little intimidating. We're just going to focus on a few aspects of this roadmap: parts of the verge, the purge, and parts of the light client vision of Ethereum.

A brief overview. The verge: one of its sub-goals is supporting stateless clients. Stateless clients allow validators to be small and light while having heavier, specialized builders on the write side of the blockchain data supply chain. There are different approaches to this: weak statelessness and strong statelessness. Weak statelessness is where builders provide the witness for the state being used by a transaction; with strong statelessness, the end users themselves, when they submit a transaction, would also have to submit the state that transaction is going to access as a witness to the network.

The purge is about history expiry as well as state expiry. EIP-4444 covers the history expiry side of things, which is the idea that after a certain amount of time, in contrast to the way Ethereum works today, full nodes and validators would be allowed to drop, to prune, historical blocks, keeping the storage footprint of these nodes lighter. And then state expiry attacks the unbounded state growth of Ethereum, where certain state, if it's not touched frequently enough, would likewise be allowed to expire.
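To make concrete what history expiry would mean for anyone reading the chain, here's a small hedged sketch. It assumes a hypothetical local node that has pruned pre-expiry history per EIP-4444, plus some external history provider to fall back on; both endpoints are placeholders, and which network actually fills that provider role is exactly the open question the talk is raising.

```typescript
// Sketch: reading a historical block once full nodes are allowed to prune old
// history (EIP-4444). Both endpoints are hypothetical placeholders.

const LOCAL_NODE = "https://local-node.invalid"; // node that may have pruned history
const HISTORY_PROVIDER = "https://history-provider.invalid"; // hypothetical external
// history source; which network fills this role is exactly the open question

async function getBlock(rpcUrl: string, blockNumber: number): Promise<unknown> {
  const res = await fetch(rpcUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_getBlockByNumber",
      params: ["0x" + blockNumber.toString(16), true], // true = full transactions
    }),
  });
  const { result } = await res.json(); // null if the node no longer has the block
  return result ?? null;
}

// Try the local (possibly pruned) node first, then fall back.
async function getHistoricalBlock(blockNumber: number): Promise<unknown> {
  const local = await getBlock(LOCAL_NODE, blockNumber);
  if (local !== null) return local;
  console.log(`block ${blockNumber} pruned locally; asking history provider`);
  return getBlock(HISTORY_PROVIDER, blockNumber);
}

getHistoricalBlock(1_000_000).then((block) => console.log(block));
```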
And then there are light clients, which we'll get to on the next slide. They're interesting to bring up here because they share a lot of the same requirements as those previous two milestones, and they're also part of the solution space, right? Because a lot of this boils down to being able to access state and data. But the problem with light clients is that today, at least, light clients rely on altruism from the full nodes in the Ethereum network. If those full nodes decide to support the light client protocol and serve data to these light clients, then it works out great. But the reality is that because of that lack of incentives, as Piper Merriam, one of the researchers working on this, noted, the light client protocol in Ethereum is basically a vast desert of starved clients desperate for data. It's not working, right? It works nicely in theory, but in practice you need incentives, or some other way to make that data available, to support these light client protocols, which, mind you, are very important to Ethereum's founding vision of decentralization.

So the common threads: without going too deep into the weeds of any of these proposals in the roadmap, there are some common requirements we can paint here. Basically, all of these proposals are going to depend on some way of reliably, efficiently, and verifiably accessing either historical or pruned blocks from the blockchain, or expired or uncached state. In the case of state expiry, it's expired state. In the case of stateless clients, it might just be state that the builder or the user needs that isn't cached locally.

The solution space here, I think, can be broken into two larger categories. One is financially incentivized approaches. We mentioned those light clients needing cooperation from full nodes, which right now have no incentive; things like The Graph and Pocket Network could fit into that category. The other category is using tit-for-tat incentives, which are not financial incentives. This is how the BitTorrent network works, for example: if you're being a good citizen in that network, uploading as much data as you're downloading, then there's a reciprocity that you get. That's the approach being explored by projects like Portal Network, using things like gossip protocols and DHTs, which again are techniques from BitTorrent and now IPFS as well.

And why are we doing all this? Why are we getting rid of all this state and history? Are we doing it just for the fun of it? No, we're getting rid of this state and history for a purpose, and that purpose is scaling Ethereum in a decentralized way. Simply put, that means being able to fit more transactions into a block, being able to have bigger blocks, while maintaining a small footprint for validating clients and light clients. As an example, one of the EIPs out there right now is EIP-4488. It proposes an approximately 5x reduction in gas costs for calldata. This is about supporting Ethereum's rollup-centric future. But the reason the authors feel comfortable doing this, which is inevitably going to lead to bigger or fuller blocks, is that it's intended to be paired with EIP-4444, one of the proposals in the purge, which drops the requirement for full nodes and validator nodes to keep all these historical blocks around past a certain point. So that's what this all really boils down to: decentralized scalability of Ethereum.
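As a rough worked example of that roughly-5x figure: under today's rules (EIP-2028), calldata costs 16 gas per nonzero byte and 4 per zero byte, while EIP-4488 proposed a flat 3 gas per byte, paired with a per-block calldata cap. The byte mix below is made up purely for illustration.

```typescript
// Back-of-the-envelope: calldata gas under today's rules vs. EIP-4488.

const NONZERO_GAS = 16; // gas per nonzero calldata byte today (EIP-2028)
const ZERO_GAS = 4;     // gas per zero calldata byte today
const EIP4488_GAS = 3;  // proposed flat gas per calldata byte

// Hypothetical rollup batch: 100 KB of calldata, 90% nonzero bytes (made up).
const totalBytes = 100_000;
const nonzeroBytes = Math.round(totalBytes * 0.9);
const zeroBytes = totalBytes - nonzeroBytes;

const costToday = nonzeroBytes * NONZERO_GAS + zeroBytes * ZERO_GAS; // 1,480,000
const costEip4488 = totalBytes * EIP4488_GAS;                        // 300,000

console.log(`today:     ${costToday} gas`);
console.log(`EIP-4488:  ${costEip4488} gas`);
console.log(`reduction: ~${(costToday / costEip4488).toFixed(1)}x`); // ~4.9x
```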
Okay, so this last one's kind of a bonus section. I think this is really interesting from a couple of standpoints. One is that the Firehose, which we briefly noted earlier, is a next-generation extraction technology. I think it's interesting, A, because I think it should impact the way Ethereum clients get built in the future, but also because it's an example of a positive externality coming from an ecosystem that's tackling the blockchain data supply chain in a decentralized, open-source way. Firehose is a technology originally created by StreamingFast, which is another core developer in The Graph ecosystem.

Before we get into how it works, it's worth calling out some of the problems with JSON-RPC and the way it works today. Show of hands, how many of you are familiar with JSON-RPC? Okay, most people in the room. So this is how you access data from a Geth or an Erigon client today. It basically depends on a running program to read data. If you look at the diagram on the left, you have this fan-in architecture where all the users are hitting a running process; that process consumes CPU and becomes the bottleneck for getting data out of the node. And because you need heavier nodes to serve all the data people want, like archive nodes, potentially running the Parity trace API, it's also incredibly intensive on memory and solid-state disks, because in order to efficiently access that data, you need to provision a lot of both.

It's also difficult to query intermediate states. JSON-RPC really only lets you query data as of a block. You can get a little more by using the Parity trace API, but it's still incomplete, and also very cumbersome. It can be difficult to debug. If you're using a subscription to something like eth_getLogs via Infura, and we've encountered this in our own stack, sometimes messages will just get dropped due to transient network events or partitions, and there's no way to debug that. You just miss the message, and it's very difficult to figure out that a message you should have gotten never made it to you. And there's a pretty incomplete verifiability story. Some of the data you want to get out, you can get a Merkle proof for. But for other data, like if you're using the eth_getLogs interface, for example, there's no easy way to get a compact proof that says you didn't miss any logs within some range of blocks.

So what's the solution? We think the solution is the Firehose approach. It's streaming-first: if you look at the diagram on the left, data is broadcast out as soon as it's available, in a fan-out approach rather than this fan-in. It's distributed as flat protobuf files, so these can be distributed across commodity hardware using things like Google Cloud Storage or Amazon S3. And because we now have these distributed flat files, we can run parallelized workloads on those files, all without even touching the blockchain nodes. This is much more akin to what you would see in a Hadoop-style big data architecture, where you have all these flat files distributed across commodity hardware, and then compute clusters that can be scaled independently and spin up concurrent, parallelized access for doing the transform steps on that data.
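To give a feel for that fan-out, flat-file pattern, here's a hedged sketch of a consumer that processes per-block files straight out of object storage in parallel, without ever touching a blockchain node. The bucket URL, the file naming, and the `decodeBlockFile` helper are all hypothetical stand-ins; real Firehose files are protobuf-encoded, and the actual formats are covered in the Firehose documentation.

```typescript
// Sketch of the fan-out pattern: transform workloads read flat, per-block files
// from commodity object storage in parallel, instead of fanning in on a node.
// The bucket URL, file naming, and decodeBlockFile are hypothetical stand-ins.

const BUCKET = "https://example-bucket.invalid"; // an S3/GCS-style object store

// Stand-in for decoding a protobuf-encoded block file; a real consumer would
// use the published Firehose protobuf schemas here.
async function decodeBlockFile(bytes: ArrayBuffer): Promise<{ number: number; txCount: number }> {
  void bytes;
  return { number: 0, txCount: 0 };
}

async function processRange(startBlock: number, count: number): Promise<number> {
  // Fetch and decode `count` consecutive block files concurrently. Flat files
  // make this embarrassingly parallel, and no node CPU is involved at all.
  const txCounts = await Promise.all(
    Array.from({ length: count }, async (_, i) => {
      const res = await fetch(`${BUCKET}/blocks/${startBlock + i}.pb`);
      const block = await decodeBlockFile(await res.arrayBuffer());
      return block.txCount; // a real transform step would happen here
    }),
  );
  return txCounts.reduce((a, b) => a + b, 0);
}

// Scale out by handing disjoint ranges to independent workers.
processRange(15_000_000, 100).then((n) => console.log(`${n} transactions seen`));
```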
And when you do that, compared to the JSON-RPC approach today, you get a one-to-two order of magnitude increase in read performance, depending on the use case.

There are a couple of integration strategies I'm going to move through real quick. The first is integrating this as a drop-in replacement for JSON-RPC: basically have it be something that runs locally on the node, so when you're running your node, instead of accessing data via JSON-RPC, you'd be using the Firehose stack. That's a really basic integration. But in the future, you might also want to improve the verifiability guarantees you get from Firehose. You could imagine the Firehose instrumentation logic running as a WASM process implemented as a read-oriented rollup, so an optimistic rollup or even a ZK rollup, which would actually give you some guarantees on the validity of the data extracted through that Firehose instrumentation process. And then the final approach, and again we're still very early in thinking about this, is that you could even integrate Firehose files into the consensus process itself. You probably wouldn't want to do that initially, because initially there's a strong benefit to having what's called data agility: being able to evolve the schema and the instrumentation logic based on a changing understanding of what's needed. But as those integrations stabilize, there might come a point where blockchain core developers decide to say, hey, we'll reference the Firehose files from the previous epoch at some future block, so that you have some guarantees natively in consensus.

I'm happy to announce that we recently had our first example of a blockchain core development team self-serving their Firehose integration. I think it's going to be really important for every blockchain core development team, including in the Ethereum ecosystem, to eventually own their Firehose deployment so they can maintain it for their own use cases. That also has secondary benefits, like basically automatic integration with protocols like The Graph. There's another integration story here related to EIP-4444, called shim clients, but we're out of time, so if you're interested in that, feel free to come up and find me afterwards. I highly recommend checking out Alex B from StreamingFast's talk this week on Substreams. And Vincent from Messari has a talk on standardized subgraphs; they're heavily using Firehose and Substreams as part of their dogfooding. So yeah, thank you guys very much. That's my talk. Do we have time for questions? I guess we have time for a couple of questions. We have a mic in the back.

Is there a public specification for the Firehose?

There's pretty detailed documentation. I wouldn't say it's at the level of an IEEE spec or anything like that, but there's very detailed documentation on the architecture and how it works. The repos are open source, so if you just search "firehose the graph docs," you'll find it real quick. Dave in the front.
For the stateless clients and EIP-4444, is there any way to estimate or predict... basically, what I'm trying to understand is whether The Graph network as a whole can essentially earn income from providing that service: current Ethereum nodes retiring the state, but then graph nodes serving it up, and the network as a whole getting some profitability from that.

Yeah, it's a really good question. These are emerging ideas in our ecosystem, but increasingly my view of The Graph is that it's going to be a multi-service ecosystem. Today the primary service is that you do ETL into a subgraph and query via GraphQL, but there might be a range of services: for example, accessing Firehose data directly, or JSON-RPC data directly, or data in a key-value store or a SQL store. I think all of those could potentially be supported side by side in different namespaces, if you will, within The Graph ecosystem. And in that context, then yes, you could imagine Graph indexers and service providers being the endpoint that light clients use to get their data. It could also be what blockchain nodes themselves use to hydrate data in connection with these other milestones. So yeah, I think it will be a source of query volume on The Graph in the future. Great question.

Cool, I think that's it. Thank you guys so much for your time and attention.