This project has been going for a little less than a year. It started as a small curiosity, but then it meandered into all sorts of rabbit holes, and it is currently supported by three different organizations: the Ethereum Foundation in the form of grants, Infura, and contracts with the Interchain Foundation, which you probably know via the Cosmos project.

So let's start. I'm going to skip the introductory slides, because you've probably seen them, and start right from the meat: what is the difference between Geth and Turbo-Geth? On the left-hand side of the slide you can see how, in Ethereum, the state of accounts and contracts and their storage is modeled as a Patricia tree, which is essentially a radix tree with radix 16, with some refinements to avoid wasting space. When you model it like that and want to persist it in a database, the persistence looks like the right-hand side: the whole structure is split up, and every node of the Patricia tree becomes one record in the database. For that record, the key is the hash of the serialization of the node, and the value is the serialization itself. Here is what that allows us to do. Let's pretend we do not have the pink diamond node at the top, and instead we simply have its hash, which you can see as the small dark rectangle. We can take that hash, use it as the key in the database, look up the serialization of the pink node, reconstruct it, and plug it into the tree. As far as I know, that is how pretty much all the clients do it.
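To make this storage model concrete, here is a minimal Python sketch of a hash-keyed node store like the one just described. JSON and SHA-256 stand in for the RLP serialization and Keccak-256 hashing that real clients use, and the node layout is invented for illustration, but it shows how a child node is reconstructed by using its hash as the database key:

```python
import hashlib
import json

db = {}  # the key-value store: hash of serialization -> serialization

def serialize(node):
    # Stand-in for RLP encoding; real clients serialize trie nodes with RLP.
    return json.dumps(node, sort_keys=True).encode()

def store_node(node):
    # The record's key is the hash of the node's serialization.
    blob = serialize(node)
    key = hashlib.sha256(blob).digest()
    db[key] = blob
    return key

def resolve(node_hash):
    # Use the hash as the database key and reconstruct the node.
    return json.loads(db[node_hash])

# A leaf node, and a parent that refers to it only by hash:
leaf_hash = store_node({"value": "V5"})
branch_hash = store_node({"child": leaf_hash.hex()})

# Fetching V5 requires two lookups, and they are sequential:
branch = resolve(branch_hash)
leaf = resolve(bytes.fromhex(branch["child"]))
```

Note that `leaf` can only be fetched after `branch` has been loaded, because the child's database key is stored inside the parent; that data dependency is exactly what forces the sequential lookups discussed next.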
When I first saw this, I was a bit surprised, because it is not how I would have done it, and that is actually why I started this project: whenever I see something that doesn't look the way I expected, that's where I dig in. I didn't expect to see this, but it probably felt natural to other people. The main problem I saw is the following. Say you only have the root in memory and nothing else, and you want to fetch element V5, the purple diamond at the bottom right. You actually have to do one, two, three lookups to the database, because in order to retrieve that diamond you first have to retrieve the node above it, since that node contains its hash, and so on up the chain. So you have to do three sequential lookups, and you cannot parallelize them, because there is a data dependency between them.

That gave me the idea of not doing it that way. What you see in red is "depth should not matter": how can we do this without three lookups, or five, or seven, or whatever? I think at the moment it would be eight or nine lookups on average if you go down from the root. So what I tried is this: let's just put the data in the database the way I would expect it to be. You have a key, which is the actual key of the value you want to store, and the value is the actual value you want to store. If it's a historical value, we append some encoding of the block number to the key, just for simplicity. The main question then is how we are going to compute the Merkle hash of all this. I think when the first clients were built, the assumption was that it would be super expensive to have this flat representation and then to build the whole tree.
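A minimal sketch of this flat layout (the suffix encoding and the in-memory dicts are my own illustration, not Turbo-Geth's actual format):

```python
current = {}   # key -> value (current state)
history = {}   # key + block-number suffix -> value as of that block

def put(key: bytes, value: bytes, block: int):
    # Record the value both as current state and as a historical entry;
    # the block number is appended to the key as a big-endian suffix.
    current[key] = value
    history[key + block.to_bytes(8, "big")] = value

def get_current(key: bytes) -> bytes:
    # A single lookup, regardless of where the key would sit in a trie.
    return current[key]

def get_at(key: bytes, block: int) -> bytes:
    # Historical read: the latest recorded entry at or before `block`.
    # A sorted key-value store answers this with one cursor seek;
    # the linear scan here is just for illustration.
    best = None
    for k, v in history.items():
        if k[:-8] == key and int.from_bytes(k[-8:], "big") <= block:
            if best is None or k > best[0]:
                best = (k, v)
    return best[1]

put(b"account:V5", b"balance=100", block=1)
put(b"account:V5", b"balance=90", block=7)
```

With this layout, a current-state read never depends on a chain of parent nodes, so the depth of the key in the Merkle tree no longer matters.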
And then everybody decided: okay, let's just put the Patricia tree in the database. But I said, well, why not try? I tried it, and it actually turned out that you can do things much faster, in a lot of cases, if you simply rebuild the hashed tree on the fly, with some caching and things like that. Here is how it looks. Imagine we need to get to this pink diamond: now we simply fetch it directly. But say we want to compute the hash of that dashed oval. In that case we do a range query, which fetches all the keys and values starting with 1 (the picture might actually be wrong, but you probably get the idea): we get the range of keys and values, apply our Merkleization to them, and we get the hash. This only really works if you want to compute the Merkle hashes of the current state; for historical state it is a lot of work, because you don't usually keep historical state in your cache. But it works reasonably well for the RPC queries that I have tested.
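The range-scan Merkleization can be sketched as follows. Hashing the sorted key/value pairs under a prefix is a deliberately simplified stand-in for the real hexary-trie hashing rules, but it shows the mechanism: one range scan over the flat store replaces chasing child hashes node by node:

```python
import hashlib

# A flat store; a real key-value database would give us an ordered cursor.
state = {
    "10": b"a", "11": b"b", "15": b"c",  # keys under prefix "1"
    "20": b"d",                          # outside the range
}

def subtree_hash(prefix: str) -> bytes:
    # One range scan: visit every key starting with `prefix`, in sorted
    # order, and fold the pairs into a single hash.
    h = hashlib.sha256()
    for key in sorted(state):
        if key.startswith(prefix):
            h.update(key.encode())
            h.update(state[key])
    return h.digest()

root_of_1 = subtree_hash("1")  # covers "10", "11", "15" but not "20"
```

Only the current state can be hashed this cheaply; hashing a historical state would first require materializing that state from the history entries.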
So let's go on. As I mentioned, Infura has been supporting this project for some time, and one of the things we did is that instead of me only running Turbo-Geth on my machines or some cloud machines, they ran it from the source code, and this is what happened. I was pleasantly surprised. I must have been running it on really bad machines before, because I usually needed about two weeks to sync it, but this synced in about six and a half days, to a pretty recent block. And it is an archive node, so it has the full unrolled history, and it was only about 200 to 250 gigabytes, which is cool, because that's probably six times less than you would normally get.

As an interesting aside, here is the memory profile. The yellow oscillating part is the allocated heap; it's Go, so there is garbage collection, which is why it oscillates so much. It goes up to about 13 GB at maximum, and you can see the spike around the spam attacks, where there was a big load. The green part is the number of nodes. One of the things Turbo-Geth does differently is that you can limit the number of tree nodes stored in the cache, and my hope was that you could use that to calibrate the heap space, although I think the heap might still be constantly increasing.

Now look at this graph. It is quite recent, maybe one month old, and it shows where the cost of storage actually comes from. Most of it is now block bodies, which is essentially all the blocks combined, for about six million blocks. The second biggest structure, the grey one, is called "block number to changed keys". This is a structure specific to Turbo-Geth, and it allows it
to know which accounts and which storage indices changed at which block. Why do we need this? Because it allows us to rewind the state quickly. Say we have been syncing along the main chain and suddenly see a reorg of four blocks. Turbo-Geth looks at this mapping, figures out which keys changed in those four blocks, picks up the values as they were before those four blocks, unwinds the tree of hashes and computes a new root hash, and then reapplies the state on the other fork.

What you do not see here is the receipts. You can keep them; they would take about 70 to 80 gigabytes for all receipts from the beginning, and I will show you later why you might want to do that. Another interesting thing to note is that lots of people are confused about the size of the state in Ethereum, and the reason they are confused is that different clients store the state differently. For example, Go Ethereum's current state representation is probably about 80 or 90 gigabytes; in Turbo-Geth it is probably about 12 gigabytes; I don't know what it will be in Parity. Different clients store it differently, and sometimes people confuse the current state with the total historical state, or with the pruned historical state. What you can see in this picture, at the very top, is accounts: 3.89 gigabytes. That is the accounts without any storage items: simply balances, nonces, hashes of the storage roots, and the code hashes. And it is not the whole history, it is the current accounts; the history of accounts is the yellow part on the left, about 44 gigabytes. Then the current state of contract storage is about 10 gigabytes, the purple-violet part somewhere down the middle, and the history of it is about 26
gigabytes. So you can see that the current state is still only about 12 gigabytes by now.

This is where I should say something about receipts. I haven't tested the entire set of RPCs, and these numbers are very rough, I didn't do scientific studies, but this is what I saw after a very quick run. I ran both archive Geth and Turbo-Geth on the same machine, and obviously there is a lot of noise, but through that noise I could see that things are generally faster, except for the receipts. The reason is that I chose to prune the receipts from the database and instead recompute them: when somebody asks me for a receipt, I go to the state as it was at that block, just before that transaction, re-execute the transaction, produce the receipts, and return them. That turns out to be slower, maybe ten times or a bit more. Somebody told me that this should perhaps just be a flag, because some people do nothing but query receipts, and those people might want to pay 70 gigs for the speed of that RPC.

Now, the other thing I did quite recently in Turbo-Geth is that I have prepared it for pruning. I haven't implemented pruning itself yet, but I made sure it can be done pretty trivially. The idea is this. In this diagram, every little circle is a record in the database, and the thick arrows represent the fact that each circle is a representation of the state for a period of time. Say the green circle on the right is the current state: if you look at the top row, it means this has been the current state for the last seven or eight blocks (I can't see exactly), and it hasn't changed for those seven or eight blocks, right?
And before that there was this yellow value, valid since some earlier block, and so on. The red circle is where something gets deleted, for example an account; I mark it as red and store that as a record as well. The blank blue one is where something is first created and there was nothing before.

This is what I call reverse diffs, as opposed to forward diffs, and here is what it allows you to do. If you simply chop off the left part, you can just keep running your node, because everything is still consistent: you can still query the state at any point for which you have history. If you tried to do this with forward diffs, it would be difficult, because when you chop off the left part you would have to rebuild the current state at the point of the chop, since everything is relative to some previous snapshot. As I said, I haven't implemented it yet, but it is trivial now.

Now, light clients, and here I actually mean a light server. At the moment it is not implemented, and the reason is that I simply don't store hashes in the database; well, I do, but not many, and only for the current state. What light clients actually need you can see on the right-hand side: all those ticks are what the LES/2 protocol requires. The ones I currently cannot provide are the ones I pointed at with the red arrows: node data and proofs. Node data will be very difficult either way, because it turns out to be specific to the way current clients store the state. It would be very hard for me to implement, because a light client can hand me an arbitrary hash, from the history or from the current state, and I have no idea which block it is from, yet I am supposed to
give back the node for it, and I don't know where to look for it in my database, because my database is structured by blocks. GetProof is easier for me, because it at least tells me at which block and in which account to look, so I can actually find it; that is easier to implement. And I know that light clients can ask for proofs not just against the current state but against somewhat delayed states, but that could be handled.

Then there is another problem I encountered recently, which I call the CREATE2 revival problem. This is also to inform people about something coming up in Constantinople; it is not on mainnet yet, by the way, so don't get scared just yet. The CREATE2 opcode is essentially introduced to allow efficient counterfactual instantiation. If you look at the formula for the address computation, it includes the address of the creator of the contract, a salt that can be chosen at will, and the init code, which is what will be executed to generate the actual deployed code. That means that, in theory and in practice, you can recreate a contract after it has been self-destructed. In this diagram, the big circles are the contract and the small circles are its storage items. You can see how the contract was created and its storage modified, and so on, and then you get to the point where it self-destructs and you get all these red things. Later on, say three blocks later, it is recreated using CREATE2, with the same code or a different code but with the same init code. At this point it is assumed that the storage is cleared and the balance is cleared, so it is a completely empty contract, and you can start again. The problem for Turbo-Geth is that, because it stores the state not as a tree but in this relational way, it
has to either insert all these red bubbles into the database at the point of self-destruction, which could be millions of records, or fetch the contract's storage items in a much more nuanced way. The alternative would be that whenever anybody asks me for contract storage, I also check whether the contract has been self-destructed in the past, and when the last time was. So it is more searching; I can implement it, but it is a bit of a hassle.

Now, this next part is highly experimental. It is something I came up with while thinking about all these problems with the light clients and with CREATE2, and at the same time I was working with the Ethermint team. We did some tests loading Ethereum transactions into Ethermint, and it turned out to be a bit slow, so I said: we need to get faster at that. The idea is to create a specialized database that provides a few desirable properties, and I will explain why they might be good. It is still very experimental, so it could all turn out differently, but I did a proof of concept on this, and I will tell you roughly what the results were.

So let's get into the details. As I told you before, Ethereum currently uses a Patricia tree for Merkleization, but before inserting anything into the Patricia tree it also applies a hash function to the key, so the keys are not inserted as they are. This is because back in 2015, I think, when there was a security audit of Ethereum, it was pointed out, by Andrew Miller I believe, that an attacker could create very, very long branches in that tree and thereby introduce some bad behavior. As a mitigation, it was suggested to hash all the keys before
inserting them into the Patricia tree, the idea being that this would roughly balance them out. It still does not completely solve the problem, but we are not going to touch that right now. For Ethereum 2.0, for example, there is also the suggestion of a sparse Merkle tree, which is essentially also a radix tree, but with radix 2 instead of radix 16; the idea is very similar, and for exactly the same reason the keys will have to be hashed. The reason I don't like hashed keys is, firstly, that I believe the problem pointed out in the security audit has not been completely resolved and will keep persisting if we stick with sparse Merkle trees, and secondly, that you also have to keep the pre-images to be able to iterate through the state. The pre-images are not really heavy at the moment; I just chopped that bit out of the diagram, but they are 15 gigs.

So what I have been experimenting with, inspired by the fact that Ethermint uses IAVL trees, is balanced trees. The main objection to balanced trees was that the order of insertions and deletions actually matters. I have been looking not only at AVL trees but also at weight-balanced trees; really little is known about them, but for some reason these structures work really well in functional languages. The idea is this: let's try to encode the structure of the tree as a string of bits. You will probably want to look at the slides later, but this is to demonstrate that you can in fact encode any binary tree into a string of bits at a cost of at most two bits per item, and then decode it with a very simple state machine, which I provide here. The decoding allows you either to rebuild the whole tree or to compute the Merkle hash efficiently. Then, what you do next is that you have this huge tree, which is balanced,
and you Merkleize it according to this huge tree. Then, in order to store it efficiently in the database, you split it into pages of a fixed size; this is roughly the schematic of how you would do it. After that, you encode the pages. This is the structure of the page I am using in the proof of concept, with various kinds of elements in it, and what is interesting here are the page pointers: two kinds of elements are stored in the pages, the values (you can see the "fp" there) and the circles with arrows, which are the pointers. After that, you can add the history.

So let's talk about what this gives us. To explain the main idea in one short sentence: you use the same structure for the database index as the structure you use for Merkleization. Because it is the same structure, whenever you commit something to the database it does not move anymore; it sits in exactly the place where it is supposed to be according to the database index, which means that write amplification basically does not happen. The database just grows; it does not keep rewriting things, it simply appends, which has some nice properties. At the moment, with the proof of concept I have been running, I was comparing against Turbo-Geth, which is already quite a high bar. The write efficiency has been about seven times better than in Turbo-Geth, meaning it does about seven times less I/O. Unfortunately, the space efficiency has decreased compared to Turbo-Geth: it is about four times less space efficient with the current numbers, but I know some ways to improve that. Access efficiency I have only tested superficially. To put this in perspective: people usually talk about the trade-off of three things in
databases: write efficiency, space efficiency, and access efficiency. You can see that you really want to be inside this triangle, not outside of it, but you never really know whether you are above the triangle, which would mean you are not actually exercising an optimal trade-off, or inside it. What I posit is that non-optimized systems are actually above the triangle: they are not really exercising trade-offs at all, they are inefficient in every possible way. What I am trying to do now is, hopefully, to be inside the triangle already and to navigate between the corners. I will take questions at any time. Sorry, I rushed. Any questions?

Q: Just so that I understand: this 200 gigabytes, or 200 plus 70 for the receipts, that is actually an archive node, right?

A: Yes, an archive node.

Q: Nice. And it can fetch any data for the entire history through any RPC queries? Can you talk more about this range request?

A: Oh, the range request. It is quite simple, because in most of these key-value stores you can open a cursor and seek to a certain key. In this case it would be 1 followed by 0 0 0 0, or whatever the first key starting with 1 is, and then you just use that iterator to iterate until you see 2 in the first position. That is a range query. It is not like SQL, but it is still possible, and efficient.

Q: Alexey, can you talk a little bit more about your experiment with Ethermint, and why?

A: Oh yes, I forgot about that, sorry. The Ethermint experiment is interesting for me because at the moment I am only trying to optimize Turbo-Geth, and there is the Patricia tree, which, as I said, I don't like very much, but it is encoded in the Yellow Paper.
But I want to look beyond the Yellow Paper: maybe because it is cool, because I want to search for different structures, maybe because they are going to be used in Ethereum 2.0, or because we might be able to create something more efficient in Ethermint and then transplant it into Ethereum. It is good for experimentation, and I also quite like working with the Ethermint team. That is the reason. Was there a question in there?

Q: Is there any performance difference when you are doing Ethermint versus Ethereum?

A: It is at a very early stage, and the whole reason I started this database work is that the first performance tests were not satisfactory to me, so I wanted it to be faster.

Moderator: We probably have time for one more question.

Q: First of all, I just wanted to ask: do you realize how amazing you are?

A: Well, thank you very much.

Q: And secondly, can you go into a little more detail on how you get from 1.3 terabytes to 250 gigabytes with what feels like the same performance, but a very significant decrease in size? How exactly?

A: It comes from two things. First, I don't store any hashes, and the hashes are probably one of the biggest contributors to the space in existing clients. Second, when you model the history as a tree structure, it repeats a lot of the elements close to the root every time you create a new version, and that repetition also contributes to tree-based storage being less space efficient. Those are basically the two reasons, and if you remove these two things, it becomes super space efficient.

All right, well, thank you very much. Thank you for coming.