Hi, my name is Alexander. I'm CTO at StreamingFast, and I'm also a pianist and a data scientist, whatever that means. I'm a father of eight beautiful children, two of whom are here. I love designing and crafting software, which I've done since I was 12, and I'm here today because one day in 2013 I read the Bitcoin whitepaper, and that changed the trajectory of my life. Fast forward to today: StreamingFast, a company based in Montreal, Canada, is now one of the core developers of The Graph. We joined The Graph a bit more than a year ago in a kind of bizarre M&A 2.0 fashion; our lawyers didn't understand what happened. But anyway, we said thanks and goodbye to our VCs and shifted our focus to making The Graph the greatest data platform on earth. So today I'm here to introduce Substreams, a powerful new parallelized engine to process blockchain data. Before I do that, let me set a bit of context. Raise your hand if you know what subgraphs are. Oh, you're good. Okay, so a subgraph can be thought of as an ETL process, right? Extract, transform, and load, and subgraphs add that little Q there, the GraphQL layer, on top. Subgraphs today provide a simple, approachable, end-to-end solution to blockchain indexing, and graph-node is responsible for all of these components: the extraction is done by hitting JSON-RPC nodes; for the transformation, you provide some AssemblyScript, which compiles to WASM and runs in a distributed environment; and then there's the load aspect, where graph-node puts everything into Postgres and offers you a rich, beautiful GraphQL interface on top. And one of the reasons we were brought in was that we could push The Graph to new heights in terms of performance.
To do that, the first thing we brought was the Firehose, something at the extraction layer, our take on boosting performance by one, two, three orders of magnitude at that first layer. It's a method of extracting data from blockchain nodes. Imagine prying an egg open, where the data is exfiltrated as fast as possible, and all the juicy data then gets thrown into a gRPC stream, as well as into flat files. You can think of it as the binlog replication stream for blockchains, like what you'd find in a master-slave replication setup in databases. We'll get back to the Firehose in a minute. Then Substreams is a rethinking of the second box, the transformation layer. Here, instead of the traditional subgraph handlers in AssemblyScript, you write Substreams modules in Rust, and those can be executed in real time, as well as in parallel, with unprecedented performance. So let me first give you a primer on the Firehose, because a lot of the benefits of Substreams come directly from it. At StreamingFast, for many years, we've been thinking hard about all these indexing problems from first principles. We needed, first, a robust extraction layer. We wanted something with extremely low latency, meaning we would push data out the moment a transaction was executed within a block, inside the blockchain node. JSON-RPC was not going to cut it. And we didn't want to have to deal with those large, bulky nodes, hanging on a thread, occupied with managing high write throughput, keeping everything in a key-value store behind a JSON-RPC interface, really heavy in RAM and CPU, needing super optimized SSDs. It's really annoying, and all these things are much costlier than needed when our goal was just to get to the data inside.
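To make that "binlog for blockchains" idea concrete, here's a tiny Python sketch, with names of my own invention rather than the actual Firehose API: a producer exfiltrates each executed block both into a live stream of subscribers and into flat files, so history and real time share one data model.

```python
import json
import os
import tempfile

class FirehoseLikeProducer:
    """Toy producer: every executed block goes to live subscribers AND to a flat file.

    Purely illustrative of the architecture described in the talk; the real
    Firehose speaks gRPC and writes protobuf-encoded merged block files.
    """

    def __init__(self, archive_dir):
        self.archive_dir = archive_dir
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def push_block(self, block):
        # 1) durable flat file, one per block: cheap, cacheable storage
        path = os.path.join(self.archive_dir, f"{block['number']:010d}.json")
        with open(path, "w") as f:
            json.dump(block, f)
        # 2) low-latency push to every live consumer, the moment it executed
        for cb in self.subscribers:
            cb(block)

received = []
archive = tempfile.mkdtemp()
producer = FirehoseLikeProducer(archive)
producer.subscribe(received.append)
producer.push_block({"number": 1, "transactions": ["0xabc"]})
producer.push_block({"number": 2, "transactions": []})
```

The point of the dual write is exactly what the talk describes: consumers replay history from the flat files and switch to the stream at the head, instead of hammering a JSON-RPC node.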
We also wanted proper decoupling between the processes producing the data, meaning the blockchain nodes with their intricacies and their request-response models, which are all different, and the data itself. We wanted the data to be the interface. And we wanted something extremely reliable, in the sense that we could avoid hitting load-balanced nodes that each had a different view of the world, forcing you to write client-side, latency-inducing code to resolve what's happening. If there's a fork, you need to query the nodes again with reorganization heuristics, for example. We also wanted something better than even the WebSocket streams the nodes have implemented, which pretend to be linear, because when they send you a signal saying, let's say, this block was removed, they can leave you hanging: if you happened to be disconnected for just half a second, you'd reconnect having missed the signal. The reliability was not built in, so we wanted something to address that. And above all, we wanted something able to process a whole network's history in 20 minutes. Well, okay, an hour or two, but never three weeks, never anything where we're waiting linearly, and that's still our goal today. By network history, I mean executing the chain and extracting the data into flat files, the extraction layer, but also any sort of indexing after the fact. We wanted massive parallelization; there was no other way to get reliable, durable performance without it. So our solution was the Firehose, and the Firehose solved all of these issues in a radical way. We took a radical approach because we wanted to solve these problems definitively, meaning no further optimization would be possible, except attempting to bend the space-time continuum itself, right? So with streaming, even with multiple nodes pushing out data, the nodes are actually racing to push the data.
The first consuming process takes whatever gets out first; you can't really remove more latency than that, and nothing can be faster than the moment the transaction has just executed on your node. Then, regarding state, processes and costs: flat files, flat files for the win, we have a hashtag for that, right? Flat files are the cheapest, much cheaper than running processes, and easier to work with; there's nothing simpler nor cheaper in terms of computing resources, and these storage facilities have been optimized like crazy. It's also where data science is headed these days. And there's one thing common to every blockchain protocol: data. Data is also the right abstraction for this technology, not an API common to all chains. So the Firehose clearly delineates responsibilities, and the contract between the extraction and transformation layers is, again, the data model. For every chain, you can imagine the best, most complete data model, and that's what we've done in the Firehose for Ethereum, for example. The Ethereum data model within the Firehose is the richest there is. You have in there the full call tree, the internal transactions. You have the inputs and outputs as raw bytes. You have the logs, obviously; you have the state changes, like you see on Etherscan, down to the internal-transaction level. You have the balance changes the same way, with the prior value and the next value, so whether you're navigating backwards or forward, you have the data you need. You also have gas costs at different places, and there's that important notion of total ordering between things happening within the logs, state changes and calls: all of these things happening during execution are totally ordered. So you get in there everything Parity traces would give you, and more, and everything you would need to rebuild a full archive node from flat files.
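To illustrate that total-ordering point, here's a hedged sketch with field names of my own choosing, not the exact Firehose protobuf: calls, logs and balance changes all carry an ordinal, so events of different kinds can be replayed in the exact order they happened during execution.

```python
from dataclasses import dataclass

# Illustrative event types only; the real Firehose Ethereum model is far
# richer (full call tree, raw input/output bytes, state changes, gas, ...).
@dataclass
class Call:
    ordinal: int
    to: str

@dataclass
class Log:
    ordinal: int
    topic: str

@dataclass
class BalanceChange:
    ordinal: int
    old_value: int
    new_value: int

def replay_in_execution_order(calls, logs, balance_changes):
    """Merge heterogeneous events by their shared ordinal (the total ordering)."""
    events = list(calls) + list(logs) + list(balance_changes)
    return sorted(events, key=lambda e: e.ordinal)

ordered = replay_in_execution_order(
    [Call(1, "0xfactory"), Call(4, "0xpool")],
    [Log(3, "PoolCreated")],
    [BalanceChange(2, 10, 7)],
)
```

Because balance changes carry both the prior and next value, you can walk this ordered sequence forward or backward, as the talk notes.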
And everything there is scoped to the transaction level, not rounded to the block level, which is crucial if you want to index with precision. Rounding blockchain information to the block level is meant to help with consensus, but it doesn't mean that what happens mid-block is of less value than what happens at the boundaries. Okay, so that's very interesting. Now regarding reliability... whoops, no, not so fast. Regarding reliability, the Firehose gRPC stream provides reorg messages, like new block, undo this block, or this block is now final, accompanied by a precious cursor with each message, and I think that's really key here. If you get disconnected, then upon reconnection you give back that cursor and you'll continue exactly where you left off, potentially receiving the undo signal that you would not have seen while you were disconnected. So you get the guarantee of linearity of the stream. No WebSocket implementation does that, because it doesn't make sense for a single node to track all the possible forks even two days after the fact. And undo messages come with full payloads, so you get all the deltas: you can just turn around to your database and apply the reverse of a block, since you have the full payload of what happened in it. It doesn't put the burden on the reader to store what happened at that previous block, as it would if the signal were just "removed block 7000", right? And when you commit that cursor to your target database, you finally get some consistency guarantees within your back end. Some of our users told us they could cut 90% of their code reading the chain, because they were relying on that reliable stream. And it also lays down the foundation for massively parallelized operation: flat files plus a stream.
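Here's a minimal Python sketch of that reliability contract, with invented message shapes: every message carries a cursor, undo messages carry the full payload so the consumer can reverse its effects without having stored anything, and reconnecting with the last committed cursor resumes exactly where you left off.

```python
def apply_stream(messages, state, last_cursor=None):
    """Consume (step, deltas, cursor) messages, honoring NEW/UNDO and the cursor.

    `state` maps account -> balance. The message shape is illustrative, not
    the real Firehose protobuf; the contract it models is the one described.
    """
    resumed = last_cursor is None
    for step, deltas, cursor in messages:
        if not resumed:
            # skip everything up to and including our committed cursor
            resumed = (cursor == last_cursor)
            continue
        if step == "NEW":
            for account, amount in deltas:
                state[account] = state.get(account, 0) + amount
        elif step == "UNDO":
            # the full payload comes back, so we just apply the reverse
            for account, amount in deltas:
                state[account] = state.get(account, 0) - amount
        last_cursor = cursor
    return last_cursor

stream = [
    ("NEW",  [("alice", 5)], "c1"),
    ("NEW",  [("bob", 3)],   "c2"),   # this block gets forked out...
    ("UNDO", [("bob", 3)],   "c3"),   # ...and the undo carries its payload
    ("NEW",  [("bob", 4)],   "c4"),
]
state = {}
cur = apply_stream(stream, state)            # uninterrupted run
state2 = {"alice": 5}
cur2 = apply_stream(stream, state2, "c1")    # reconnect with a committed cursor
```

Note that the reconnecting client still receives the UNDO it would otherwise have missed, which is exactly the guarantee a plain WebSocket stream cannot give.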
So this is the future of The Graph's unbeatable performance, and it's core to our multi-chain strategy, because any blockchain can have such a data model. Now let's dig into Substreams. Substreams is a powerful clustered engine to process blockchain data, a streaming-first engine, and it's powered by the Firehose underneath and its data models of the chains. So let's dig in. Here are a few quick facts. It's invoked as a single gRPC call, and within the request you provide all the transformation code. You'll have in there... oh, it's too long. You'll have in there the code, some WASM modules, the relationships between the modules, all the transformations, within the request. It's not a long-running process, except if you run it for long; it's not a service you spin up, right? And the backing nodes are stateless, which provides nice scalability properties. The transformation modules are written in Rust; they compile to WASM, and they run in a secure sandbox on the infrastructure, similar to subgraphs. And the ultimate data source, the blockchain data, being deterministic, all the transformation outputs are also deterministic. If the request you send involves processing prior history, even if it's 15 million blocks, the Substreams runtime will turn around and orchestrate the execution of a multitude of smaller jobs in parallel, fuse the results on the fly for you, and aggregate them to simulate a linear execution, so you wouldn't see the difference. All results are streamed back to you as fast as possible, with the same guarantees provided by the Firehose: a block-by-block cursor, and a transparent handoff from batched historical processing to the real-time, low-latency, reorg-aware stream at the head of the chain. So let me show you, if you're interested, how we create one of these things. Raise your hand if you're curious. Okay, you're good. Okay, so let's start.
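Here's a toy Python sketch of that orchestration idea, with made-up names: a big historical range is cut into fixed-size segments, each segment is processed in parallel, and the partial results are fused back in block order so the caller sees what looks like one linear execution.

```python
from concurrent.futures import ThreadPoolExecutor

def process_segment(start, stop):
    """Stand-in for one parallel job; pretend each block yields one item."""
    return [("block", n) for n in range(start, stop)]

def run_parallel(start, head, segment_size=100, workers=4):
    """Split [start, head) into segments, run them concurrently, fuse in order."""
    segments = [(s, min(s + segment_size, head))
                for s in range(start, head, segment_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves submission order, so fusing the partials back
        # together is indistinguishable from a linear execution
        partials = list(pool.map(process_segment, *zip(*segments)))
    return [item for part in partials for item in part]

out = run_parallel(0, 350, segment_size=100)
```

The real runtime does much more (scheduling by module dependency, caching, handoff to the live stream), but the "parallel inside, linear outside" contract is the part this models.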
We start with a manifest like this. Do you see that from there? Can I move the podium? I can't. So there's package information, some metadata there. You have pointers to the protobufs that you'll use; again, contracts between modules are about data, so these are protobuf models, similar to the protobuf models of the root chains, the layer ones. You have pointers to the binary you're working on on your drive, and all that. And you have imports, which is actually very interesting, because you can import third-party Substreams packages. These YAMLs can be packaged, so you can import from someone else's package, write your own, or combine both. That means Substreams enables composition at transformation time, which I think is pretty unique and a pretty big game changer. Then further down you have the modules section, which defines the relations between the different modules, and you see it defines a directed acyclic graph: you have modules that slowly refine the data. There are two types of modules. One is the mapper, the first one up there, map_pools. It takes inputs, does a transformation, and outputs; it's parallelizable down to its core, block-wise, so massively parallelizable. And then there's the store kind, store_pools there, which I think is awesome. It takes any inputs and outputs a key-value store, accumulated in a stateful way, and stores can then be queried by downstream modules. Okay, we'll see a bit more after. The name corresponds to the function in the WASM code. And the inputs can be a few things: either the raw Firehose feed, so for example the source here, which means the block with all transactions for that particular block; or the output of another module, like you see down here with the output of map_pools as an input, where you'll get the data as bytes; or a store, which would be a reference, as we'll see on the next slide.
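Since the modules section is really just a module DAG, here's a hedged Python sketch, with module names borrowed from the talk but a structure I've invented, that represents `map` and `store` modules with their inputs and checks the graph is acyclic by producing a topological execution order:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each module declares its kind and inputs, like the manifest's modules
# section; "source" stands for the raw Firehose block feed.
modules = {
    "map_pools":   {"kind": "map",   "inputs": ["source"]},
    "store_pools": {"kind": "store", "inputs": ["map_pools"]},
    "map_prices":  {"kind": "map",   "inputs": ["source", "store_pools"]},
}

def execution_order(modules):
    """Topologically sort the module DAG; raises CycleError on a cycle."""
    graph = {name: [i for i in spec["inputs"] if i in modules]
             for name, spec in modules.items()}
    return list(TopologicalSorter(graph).static_order())

order = execution_order(modules)
```

A cycle (say, a store wired back into its own ancestor) would make `static_order()` raise, which is the "directed acyclic" constraint the manifest enforces.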
And on store_pools here, you see there's an update policy, which sets constraints on what you can do with the store, and it defines a merge strategy for when you're running parallelized operations; I'll get to that a little later. The value type field helps anyone decoding understand what the bytes in that store are, so a UI can JSONify them, and your consuming code can automatically decode them with protobufs, all languages supported. Otherwise, the key-value store is just keys as strings and values as bytes, very simple. One thing to note here: because a module has deterministic inputs, it's possible to hash it, its kind and all of its inputs, the pointers to its parents, including the initial block; you hash the WASM code too. So you have a fully determined, hashable cache location, similar to Git, for all of the history of data produced. That makes it an extremely cacheable system, with highly shareable and cross-verifiable module outputs, which opens really interesting possibilities for collaboration within The Graph ecosystem. Imagine that one participant has large disks, and another has large CPUs, or sleeping CPUs: they could pool resources together to build something bigger than themselves. Okay, you see the relation there, so this gets piped into that. And if we add another module here, you see how the graph comes together. This one computes the price; this is a Uniswap V3 thing. It computes the price, but you want to get it for certain pools, because maybe you want to use the decimal placements in the pool; we'll see a bit more of that. And let's say you're running that at block 15 million: the runtime guarantees that the store you'll have when your code executes at block 15 million will have been synced, linearly or in parallel, but you wouldn't know which.
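The cacheability argument can be sketched like this: hash a module's kind, its initial block, its WASM code, and, recursively, the hashes of its parents; identical definitions then resolve to identical cache locations, much like Git objects. The names and exact recipe here are illustrative, not the real Substreams hashing scheme.

```python
import hashlib

def module_hash(name, modules, wasm):
    """Deterministic identity for a module: kind + initial block + code + parents."""
    spec = modules[name]
    h = hashlib.sha256()
    h.update(spec["kind"].encode())
    h.update(str(spec["initial_block"]).encode())
    h.update(wasm[name])                      # the module's compiled WASM bytes
    for parent in sorted(spec["inputs"]):     # recurse into upstream modules
        if parent in modules:
            h.update(module_hash(parent, modules, wasm).encode())
    return h.hexdigest()

modules = {
    "map_pools":   {"kind": "map",   "inputs": ["source"],    "initial_block": 100},
    "store_pools": {"kind": "store", "inputs": ["map_pools"], "initial_block": 100},
}
wasm = {"map_pools": b"\x00wasm-v1", "store_pools": b"\x00wasm-v2"}

h1 = module_hash("store_pools", modules, wasm)
wasm["map_pools"] = b"\x00wasm-CHANGED"       # touching a parent changes the child
h2 = module_hash("store_pools", modules, wasm)
```

Two operators who compute the same hash can trust, share, and cross-verify each other's cached outputs, which is the collaboration angle the talk points at.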
But it'll give you a full in-memory store, eventually backed by some disk, but whatever, and you can query that key-value store at each block; it's guaranteed to be synced for you. That's exciting, no? Okay. So you see the DAG fully being built, the dependencies. Now this leads us to composability. See, each color here means a different author. Modules written by different people, ideally the most competent for each, as we would hope, analyze what's on chain, refine the data, and abstract it to new heights. And the contract between the handoffs is always data, a model of data: you take a module, it's bytes in, bytes out. So you see here, we can get the prices from Uniswap V2, and prices from Uniswap V3, and Sushi, and Chainlink, and whatever, and have someone write a module that takes these as inputs at transformation time and averages them out and whatnot. Then you'd have one beautiful universal price module that you can hook on top of and feed into something; who knows, maybe someone feeds that back onto the chain for some reason. And soon enough, someone wants to build on top of it, something like this: if someone wants to compute the USD-denominated aggregated volume of NFT sales on OpenSea, you take some sales module and merge it with the price module. And we see here that little Trader, Inc.; maybe he wants to feed that into his trading bot, because this is a streaming engine, we're not storing anything into a database yet, right? But this begs the question: where does all that beautiful data land? Where does it get piped? That's where sinks come in. Substreams is limited to the transformation stage of the ETL analogy, remember? It doesn't really care where you load the data, and that could be anywhere; these are just a few examples. You can load it into databases: we already have sinks for Postgres and Mongo. You hook them to Substreams and they just load everything into Postgres with a data model that we've agreed upon, right?
If you write it a certain way, it just syncs over there. Or message queues, or data lakes, or some bots, or some trading algorithm, I don't know, some whale detector you want to hook directly onto the stream. It's also something big, I think, for doing ad hoc data science, because now you have a really fast engine that allows you to process the whole history in a few minutes to pluck out some new insight. You can write your code, send it to the network, and stream out the results. Similar to BigQuery, for those who know it, the big cluster service by Google: you send the request, it shards everything, and it sends you back the results. Well, all of a sudden, the Substreams engine allows you to do things like that, ad hoc. And it can feed any program that supports gRPC and Protobuf, which are many. And the last one here, not the least: subgraphs, through graph-node. We're working to make Substreams feed directly into graph-node, to provide the same loading experience and querying experience you've come to know and love. You'll be able to deploy a subgraph, this time containing not AssemblyScript but a Substreams package with an entry point, and it will process the history in parallel and load it into your database at crazy speeds. So stay tuned for that; it's not out yet, but soon. Okay, and this is a simple example in Python; it's not really longer than that. You have one or two dependencies, like gRPC, so that you can issue the query; we're leveraging a lot there. And you see that .spkg file there: we can use it to codegen Python classes and helpers and all of that, because it turns out the manifest, packaged as that .spkg, is, for those who know Protobuf, a file descriptor set. It contains all of the Protobuf definitions.
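As a hedged sketch of the sink idea, here's a tiny Python consumer that takes entity changes off a simulated stream and loads them into SQLite; a real Postgres or Mongo sink follows the same pattern, just speaking gRPC and using the agreed-upon data model on the wire. Table and change shapes are mine.

```python
import sqlite3

def sink(conn, changes):
    """Upsert each (table, key, fields) change into the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pools "
        "(address TEXT PRIMARY KEY, token0 TEXT, token1 TEXT)"
    )
    for table, key, fields in changes:
        if table != "pools":
            continue  # this toy sink only knows one table
        # replace-on-conflict keeps the latest streamed value for a key
        conn.execute(
            "INSERT OR REPLACE INTO pools (address, token0, token1) VALUES (?, ?, ?)",
            (key, fields["token0"], fields["token1"]),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
sink(conn, [
    ("pools", "0xpool1", {"token0": "WETH", "token1": "USDC"}),
    ("pools", "0xpool1", {"token0": "WETH", "token1": "USDT"}),  # later update wins
])
rows = conn.execute("SELECT address, token1 FROM pools").fetchall()
```

In a production sink you would also persist the stream cursor in the same transaction as the rows, which is what gives the consistency guarantee mentioned earlier.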
The .spkg also contains all the WASM code, the module graph information, the dependencies, the inputs and all that, and even some documentation. Everything that's needed is in there, so you can pass it down to the modules: you take it from disk and boom, you send the request to the server and it's running. So you can deploy packages very easily and consume them very simply this way. There are a few imports we've omitted there, but it's simple; it's just for show. Okay, let's look at the simplified data model for Uniswap V3, and I'll show some code making use of it. So this here, a Pools message holding a list of Pool, is actually what gets handed off from our mapper, which finds the pools that were created, down to store_pools, which we're also going to look at. It has a list of pools, and each pool, you can imagine, has an address and the two tokens concerned, and we have a reference to the Token model too, which is going to be very useful for enriching the data downstream: it has the decimals right at hand, so we won't need to do much loading, and we can enrich all those huge uint256 values and put the decimal point where it belongs. So let's see what happens in the mapper. This is sample Rust code. Raise your hand if you love Rust. Oh ho ho. Raise your hand if you know Rust. Okay. It's very simple here, I'll go through it: you have the map_pools function, corresponding to the manifest, with one input, the block. This is the Firehose block: all transactions, all logs, all state changes; you can craft your own triggers in there as you wish, but we have a simple version. See that line there: the block's events give you something that goes through the transactions, and it's going to trigger on PoolCreated, and that PoolCreated object in Rust was actually code-generated from the JSON ABI. So you can just give it an instruction: we're going to filter for only the V3 swap factory, and then that beautiful filter_map will give us the logs.
Then we'll output: we're going to collect some of these things into one list of pools, assigned to the pools object there. And notice that little thing, the RPC create-Uniswap-token call: this actually performs an eth_call against a node behind the scenes, similar to what we have in subgraphs. That's actually very important: it means that once we've processed this layer once, and we've done it for the whole history, it can be cached very efficiently, so anyone relying on it will never need to reprocess it. You can give the package to someone, and they can immediately access the stores that have been cached by other people. So you could go to block 15 million and have the list of all pools created, queryable super fast, and you can depend on it too. I think that's pretty cool. And here is the store module. The store module is pretty simple: it receives the pools from the output of the prior module, loops through the little pools there, and calls set on the output store. And see, the key there is "pool:" plus the address, and it stores the protobuf-encoded pool, with the token decimals for both tokens, right? Did I say the store would be constrained? The store here is constrained in two ways. First, to preserve parallelizability, the stores are write-only: you cannot read your writes, otherwise that could make them cyclic. Second, they expose only the function defined by the update policy, in this case set. So let's see what happens when we run things in parallel. Here we have two jobs covering two segments of the chain, say one million blocks each. And see those ugly arrows there: they correspond to PoolCreated events. In our code, as you've seen, we would write a key for each of them. So in the first partial run, we would have what we call a partial store with four keys, and in the next one we'd have two keys.
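The write-only constraint can be sketched like this in Python, as an analogy for the Rust API rather than the real one: the store exposes only the method its update policy allows, and offers no read path, which is what keeps modules parallelizable.

```python
class SetStore:
    """Write-only store with the 'set' update policy: last write to a key wins.

    Deliberately no get(): downstream modules read it, the writer cannot,
    which preserves parallelizability (no read-your-writes, no cycles).
    """

    def __init__(self):
        self._kv = {}

    def set(self, ordinal, key, value):
        # the ordinal ties the write into the block's total ordering
        self._kv[key] = (ordinal, value)

    def finalize(self):
        """Only the runtime (not the writing module) materializes the keys."""
        return {k: v for k, (_, v) in self._kv.items()}

def store_pools(pools, store):
    """Mirrors the talk's store module: key 'pool:<address>' -> encoded pool."""
    for ordinal, pool in enumerate(pools):
        store.set(ordinal, f"pool:{pool['address']}", pool)

store = SetStore()
store_pools([{"address": "0xabc", "decimals0": 18, "decimals1": 6}], store)
snapshot = store.finalize()
```

A real store would hold protobuf-encoded bytes rather than dicts, but the shape of the constraint is the same.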
And when we run the merge operation, we apply the set merge policy, which basically says: take one store, take the other store, cycle through the keys, and the last key wins. If you do that, you can parallelize endlessly. So we'd have here a complete store with six keys, and at that point we have a snapshot. We can have periodic snapshots, so if you want to go and explore the chain at any point in time, you take a snapshot plus a little partial, and you can have the state synced at any block height. So this one has the last-key-wins policy, but there are a few others, like min, max, add, and another one where the first key wins, which when merging behaves as set-if-not-exists. That allows us to build different aggregations. You'd like to see that running live? I have a few minutes; I need to bring that other window up. So, you see that? Okay, that's good enough, huh? Okay, so let's imagine we want to see that pools-created thing. Do you see that? I want to see the output. I'm going to run it; I hope everything is good, you know how demos go when you're connecting. Okay, okay, whoa, not too fast. So this is going through, starting at the beginning, and we have there a PoolCreated event, and see that we have everything decoded, because it's a protobuf thing. It arrives on the wire as bytes, and it properly deserializes them, and we see that the token address is there, we have the decimals, we have the address. What that means is pretty crazy already. Oh, do I see that? Okay. What it means is that you can inspect the chain with your code anywhere in time; from a mapper especially, you can go there, and I could run it again here and say: I want to run the mapper at, let's say, a block... I don't know, something more recent. Give me a recent block. What was the block yesterday? 15.7 million, something like that, okay? And see, is there anything recent? So there's some stuff, right?
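Those merge policies are easy to sketch. Assuming two partial stores produced by two parallel segments, with the earlier segment first, each policy below resolves key collisions the way just described: set lets the last key win, set-if-not-exists lets the first win, and min, max and add aggregate.

```python
def merge(earlier, later, policy):
    """Fuse two partial key-value stores from adjacent chain segments."""
    out = dict(earlier)
    for key, value in later.items():
        if key not in out:
            out[key] = value
        elif policy == "set":                  # last key wins
            out[key] = value
        elif policy == "set_if_not_exists":    # first key wins
            pass
        elif policy == "min":
            out[key] = min(out[key], value)
        elif policy == "max":
            out[key] = max(out[key], value)
        elif policy == "add":
            out[key] = out[key] + value
        else:
            raise ValueError(f"unknown policy {policy!r}")
    return out

part1 = {"pool:0xa": 3, "pool:0xb": 7}   # segment [0, 1M)
part2 = {"pool:0xb": 5, "pool:0xc": 1}   # segment [1M, 2M)
```

Because each policy is associative across adjacent segments, the runtime can keep merging pairs of partials endlessly, which is exactly what makes unbounded parallelization safe.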
Some things are recent; can I see that? Someone just created a new pool, and I can inspect my code to make sure it works. Where is that? Come on, right? This one was wrapped ether and Infinity; they just created that address as a new pool. So you can go and test your code anywhere, and once that's done, well, you're set, right? You can then move on to the next data dependency. So this really changes the dynamics of debugging, and it also works for stores: you can ask for a store, it's going to process it in parallel, and then you can inspect all the keys that exist there, or watch the deltas coming through. Okay, and let me show you something running in parallel: graph_out. Now this is very interesting. Oh no, let's start it at 15 million. Let's say I want graph_out at 15 million; I didn't run it before, so let's run it. This starts a whole bunch of parallel processes, and you see up there the number of blocks per second. Yesterday I had about 8,000; on Solana blocks I had 16,000; it depends on the power you put behind it. But all of this, like I said, the pool counts have a dependency on the pools, and the pools are further down, so we're able to schedule all of that massively in parallel, and once that's ready, once everything is done, I start streaming and get all the content. So let me show you graph_out; it's very interesting, because graph_out has refined the data up to entities, meaning we're talking about database tables and fields, and that's what you get out of it. Do you not see it? Okay, that... wait a second here.
So let's imagine... see, we have a token and an update: you have the field derived_eth, and you have the old value and the new value. I don't know if you've seen this in the data science world, but this looks very much like change data capture, CDC, which powers a lot of large-scale systems. You have the prior and after values, so you can feed that to, let's say, Postgres, and apply the changes; and when you have an undo signal, it gives you back that payload, you just flip everything, and with the cursor you have guaranteed linearity into your store. It's flawless; it's just extremely simple to keep your things in sync. Even a Slack bot could handle an undo message, removing the message it posted when the thing came through, right? So I think that's pretty cool. What do you think? No, wait, wait, wait, okay, okay. Okay, that's cool. Can I shut that down here? Do I have another window? I'm close to done. Prepare your questions, please ask them succinctly. We have half a minute; I just wanted to end on a final note. As a final note, I wanted to share with you a little bit of my vision for The Graph, okay? I don't know where the window is. So whatever, it's fine.
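That CDC-style payload, carrying both the old and the new value, is what makes undo trivial. Here's a hedged Python sketch; the record shape is my own, not the actual graph_out entity format.

```python
def apply_changes(table, changes, undo=False):
    """Apply CDC-style (entity, field, old, new) records; undo=True reverses them.

    Because each record carries the prior value, undoing a block is just
    replaying its changes backwards, writing the old value back.
    """
    for entity, field, old, new in (reversed(changes) if undo else changes):
        table.setdefault(entity, {})[field] = old if undo else new
    return table

token = {}
block_changes = [
    ("token:WETH", "derived_eth", None,  "1.0"),
    ("token:WETH", "derived_eth", "1.0", "1.0002"),
]
apply_changes(token, block_changes)              # forward, on the NEW signal
state_after = dict(token["token:WETH"])
apply_changes(token, block_changes, undo=True)   # the undo signal arrives
```

Committing the stream cursor alongside these writes, as described above, is what turns this into a consistent, reorg-safe pipeline into your database.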
I'm imagining The Graph becoming that sort of huge worldwide cluster of processing and storage capabilities, something like Google's BigQuery, but where people join because it's better together, right, instead of running it alone, where you need to have all the resources yourself. I see a new era of composability, which means more collaboration in a tighter community, working together more intimately through those data contracts. I also see a new mix of collaboration between indexers, exchanging data or sharing resources in terms of compute and storage and whatnot, therefore introducing new value flows. And I'm seeing new products and new services being offered directly on the network, satisfying needs that perhaps couldn't be addressed before. There's a place for you in there: as a developer, as an indexer, as someone who realizes the radical benefits of such a platform, who builds on it and promotes it. My ask to you is: go try Substreams; put pressure on your favorite layer ones so that they integrate the Firehose natively, like Aptos has done recently, and some others are starting to, I think, so that everything you've seen today becomes immediately available to them; sell them on the goodies. Also join our Discord; I would love follow-up questions, and come see me afterwards, I love feedback on these sorts of things. All of this is open source, so let's dig in and build something together: the biggest blockchain data platform on earth. Thanks for your time today. We have time for two, three questions; we're the last ones, so if you have a question. Hey, so one question: the modularity and composability of these Substreams is super, super powerful, but still, if I look at this compared to SQL and, like, dbt models, it's a lot more complex. So how can we enable people to really learn this and build these kinds of hyper-modular data streams?
So, it's a good question, but the transformation layer is not the SQL layer. This is powering going through history; it's an ad hoc transform with stateful storage, but you would pipeline into SQL storage to do other things, right? You'd have refinement, you'd have knowledge from the community on how to analyze this or that protocol, ever-increasing refinements, and then you might store that alongside your off-chain data, if that's the best fit for you; or maybe you feed it into a subgraph, if that's what you need, and then you have a totally decentralized solution where you don't need to host anything. So this is an enabler at a lower level; it doesn't seek to replace SQL, but it puts itself at a place where it can feed all the systems on earth with enriched data, which you would otherwise need to produce in SQL, and that's really not fun. So you leave that to the community, right? Gotcha, thank you. We have an old subgraph which is pretty slow, and we would like to transform it to the new type of subgraph; should I just rewrite some code in Rust and that's it, or is there something else? So, it is not the same paradigm. To enable parallelization, you need to distinguish the data dependencies, and that determines the number of parallelization stages that are needed. It's not easy at all, actually; it's pretty crazy to try to parallelize subgraphs as they are. We tried that, and that's what led us to design Substreams, by cutting up our Uniswap work. So you will want to go and rewrite your logic as Rust modules, and it's a different paradigm, so it's not just an easy switch, I admit, but it brings us to the next stage in the evolution of blockchain indexing. Yeah, I see, thanks. Thank you so much, Alexander. My pleasure.