And introduce — that's right, applause — our next two speakers: we have a tag team coming up, so things are getting more and more exciting and momentum is building. Jamshid Shorish and Andrew Clark. Dr. Shorish is CEO and founder of Shorish Research and a senior advisor to the Uberlink Corporation, a technology firm which developed VOSON, the Virtual Observatory for the Study of Online Networks. He also works with the Australian National University's VOSON Lab, where VOSON is used for research, research tool development, teaching and training. Andrew Clark is a data economist at BlockScience with a background in financial and technology auditing, with a focus on building machine learning auditing systems. Together they will present an empirical study of Filecoin gas and some ongoing research. I'll turn the mic over to Jamshid and Andrew.

Great, thank you very much for the introduction and welcome, and thank you for the invitation also to ZX and Alex. It's a pleasure to be able to present some of the ongoing work that BlockScience has been doing in cooperation with the CryptoEconLab. My colleague Andrew should be online momentarily, but in the meantime I will go ahead and give an overview of what we're doing from the BlockScience side and how it applies to two ongoing research directions: one having to do with the gas dynamics of the Filecoin network, and the other feeding a simulation framework we have into an assessment of a particular type of protocol update around the batch fee and the batch balancer.

So, in the time available today, we'll start with a section on gas dynamics, where we'll look at the way the data was surfaced, how gas usage was decomposed into its pieces, and how that decomposition informs the way we extrapolate gas usage forward given a particular time series of the network. That in itself is extremely useful for building out a simulation framework that steps through the Filecoin network at a macro level and acts as a digital twin. The digital twin, which is built in the cadCAD simulation framework at BlockScience, is one of the workhorses by which we can understand how, for example, different types of activities that may have occurred, or different scenarios we would like to explore, pass through the network and give us actionable insights. Utilizing both the digital twin and the gas dynamics analysis, we'll then go into a very quick application area, which is the batch balancing system: we'll look at what batch balancing is in a very quick snapshot, and then wrap up with ongoing research on using the digital twin and gas dynamics framework to select, from a very large class of possible batch balancing mechanisms, one which meets the criteria we're most interested in — being able to assess when it is optimal to aggregate different messages and when it is optimal to keep them as separate messages. So that's the overview of where we're going. I'm checking to see now whether Andrew is available; otherwise I will jump in and take over a section. He's coming online now, very good. Okay, Andrew, I'm gonna give the floor over to you.
I'll continue to drive the slides, just in the interest of time, to make sure we have ample time available for the rest of the talk. Okay, that sounds good, thank you. So what we started with was an exploratory data analysis around the gas mechanism, to understand what was driving gas usage, gas limits and other things, and to understand a lot of the key components driving them. We tried a lot of different methods and different aggregations — by blocks, by seconds, by days — and looked at how the trends were behaving, all of that kind of information, to really understand what the key drivers were.

The types of analysis we did, if you go to the next slide please, included checking the change in actor methods. We created our own data dictionary based on what, for instance, storage miner method 5 means; we went through the Filecoin code and figured out what those methods were, so we could see the percentage change in message counts between different periods. This right here is the week-over-week percentage change for the week of September 2nd, where, as an example, storage miner method 7 had a massive increase in the number of messages over that span. We did these types of analysis to see what was driving activity, combined with macroeconomic variables. We also used things such as the Fourier transform, shown on the right-hand side here, to decompose the data into the frequency domain so we could see the components and the trends, and then we did phase shift analyses and the like, all to try and understand the key drivers. The point was to build intuition for predictive models of the system: as we get into the digital twin and how that whole system works, we wanted to make sure we understood the key signals, so we could build specific models that are more predictive. That feeds into the system identification, which we'll get into in a little bit, for how we forecast states forward as the different actors interact, and how we get very accurate gas usage from a bottom-up approach.

One of the main techniques we used, if you go to the next slide please, was Granger causality, which is a very interesting method we used to inform our vector autoregression (VAR) models. Granger causality lets us understand whether certain variables in a VAR model — which is just an autoregressive model where you determine the number of lags and can include many different factors — help predict other variables. What this heat map on the right shows is, for example at the top, base fee burn: does it cause one of the actor methods to have a higher message count? Granger causality is essentially a way of inferring whether one variable causes another, based on the lags from a VAR model. So, based on the p-values and the statistical relationships, the green cells at the intersection of a column and a row show which variables Granger-cause which. What we were trying to do was work with the time period we had, which is one of the key considerations we'll get into in a moment: we were using the Sentinel Filecoin database, which is truncated at specific periods of time.
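To make that kind of Granger-causality scan concrete, here is a minimal sketch using pandas and statsmodels. The data source and column names are illustrative assumptions rather than the actual Sentinel fields; the idea is simply to compute, for every pair of daily signals, the smallest F-test p-value over a range of lags and read the resulting matrix like the heat map described above.

```python
# Minimal sketch of a pairwise Granger-causality scan over daily network signals.
# The CSV path and column names are illustrative assumptions, not the real schema.
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

df = pd.read_csv("daily_network_signals.csv", index_col="date", parse_dates=True)
# e.g. columns: ["base_fee_burn", "miner_penalty", "storage_miner_m5_count", ...]

MAX_LAG = 7  # up to one week of daily lags

def min_pvalue(cause: str, effect: str) -> float:
    """Smallest SSR F-test p-value over lags 1..MAX_LAG for 'cause' -> 'effect'."""
    data = df[[effect, cause]].dropna()  # statsmodels expects [effect, cause] order
    results = grangercausalitytests(data, maxlag=MAX_LAG)
    return min(r[0]["ssr_ftest"][1] for r in results.values())

cols = df.columns
pvals = pd.DataFrame(
    [[min_pvalue(c, e) if c != e else float("nan") for c in cols] for e in cols],
    index=cols, columns=cols,
)
# Cells with small p-values (say < 0.05) correspond to the "green" intersections
# in a heat map of candidate drivers.
print(pvals.round(3))
```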
So we didn't have a massive back-history, especially for daily data on how these different method classes behave — and since we're looking at an operational digital twin to drive business insights, we really want daily values for these things. We had a dimensionality problem: far too many different actor methods and their gas usage, so the data was much too wide for the number of rows we had, and that dimensionality issue makes it very hard to train models. Traditionally we would use something called a VARMAX model, where the vector autoregression covers the endogenous variables — what we're trying to predict, the gas usage of these different actor methods — and we pull in macroeconomic variables as the exogenous ones. In this case the macroeconomic variables would be base fee burn and other network-level signals from the Filecoin network — miner penalties, pledges, that kind of thing, anything we think is relevant and can derive here — to predict what the gas usage would be. Because we ran into this difficulty — and you can actually see on the left-hand side here some plots from the model we ended up with; we're not going to go into all the specifics, but there's a bit more information in the appendix and we can definitely follow up offline — we used, if you go to the next slide please, a VARX approach that Dr. Shorish came up with. Essentially, using the Granger causality results we identified the key macroeconomic variables and then included them in a standard VAR model, instead of fitting the full VARMAX model. Based on that same analysis we also found the key drivers, because there are so many different actor method classes: following the Pareto principle, we only need some of them to build an accurate digital twin that's performant. And we can see here the gas usage per day from these different actor methods. So this method was faster, got around the computational limitations, and allowed us to build an integrated digital twin that takes the previous state and predicts the next state in our operational DT.

As we were building this out — you can go to the next slide — we started running into issues: issues with data, with longitudinal data for backtesting and things of that nature. To really, fully build the model we needed a longer period of data, and Sentinel was truncating it. For the operational digital twin we want to compute health metrics, inform decisions, and check observed behavior against expected behavior, and we need a long period of high-fidelity data to do that. That was the goal all of this analysis was building towards: a gas dynamics extrapolation digital twin that helps make decisions for the Filecoin network and helps with parameterizations and so on. We knew our initial analysis had limitations in how far back we could go in the data, given the goals here. So, if you go to the next slide please, we created this operational digital twin infrastructure ourselves, where basically what we did was go back and get all the Lily data fields.
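Before turning to that data infrastructure, here is a rough illustration of the VAR shortcut just described: instead of fitting a full VARMAX, the Granger-selected macro drivers are folded into a plain VAR alongside the gas-usage series. The column names, file name and lag cap below are hypothetical placeholders, not the actual model specification.

```python
# Sketch of the "VARX" shortcut: fit a plain VAR on the gas-usage series plus only
# the Granger-selected macro drivers, rather than a full VARMAX with exogenous terms.
# Column names, the file name and the lag cap are hypothetical placeholders.
import pandas as pd
from statsmodels.tsa.api import VAR

df = pd.read_csv("daily_network_signals.csv", index_col="date", parse_dates=True)

endog_cols = ["storage_miner_m5_gas", "storage_miner_m7_gas"]  # gas usage to predict
key_drivers = ["base_fee_burn", "miner_penalty"]               # picked by the Granger scan

data = df[endog_cols + key_drivers].dropna()
fit = VAR(data).fit(maxlags=7, ic="aic")  # let AIC pick the lag order, up to a week

# One-step-ahead forecast of tomorrow's state, as the digital twin would consume it.
recent = data.values[-fit.k_ar:]
forecast = fit.forecast(recent, steps=1)
print(dict(zip(endog_cols + key_drivers, forecast[0])))
```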
We pulled in all these different data sources alongside Sentinel, because we lacked the fidelity and the longitudinal data we needed for our analysis. Filecoin did also have an internal research database, but it didn't quite have the right fields, the aggregation layers we were using, or all of the signals we needed based on our EDA of the macroeconomic variables from the VARX analysis and the heat map we showed. So we had to build our own internal infrastructure, which is now up and running: basically, every night we batch data from Sentinel — after backfilling with Lily — and aggregate it into the digital twin data fields and aggregations we need. Then, when Filecoin stakeholders use this operational digital twin, they're pulling the refined data fields directly from the BlockScience RDS, which is a lot more efficient than trying to ETL and redo all of these analyses each time, because the ETL from the raw Sentinel production parsed messages into, for instance, the daily-level or epoch-level gas usage data we need is a long process and takes a long time to run. We've now ETL'd all of that here, so we can build this operational digital twin with health metrics and the ability to do what-if analysis, counterfactuals and parameterizations, and have it in a solid state. So I'll hand back over to Dr. Shorish to talk about how some of the gas dynamics then feed into the batch balancer.

All right, thanks very much, Andrew. What we're going to look at now is a very quick application of this digital twin infrastructure, with the gas dynamics extrapolation built into it, to understand one particular feature of the Filecoin protocol that has the potential to be updated. This is an active area of research: we're currently building out the functional form representations that we're testing in the digital twin. As a quick overview of what the batch balancer is doing: in this environment, as many of you know, there is the opportunity for messages to be aggregated, or batched, in order to save gas — in particular for the largest gas-usage messages, the pre-commit and prove-commit messages for the proof of replication. The intuition is that when network use is high, batching is incentivized in order to improve efficiency. The mechanism by which this is done, introduced initially in FIP 13 and then updated once in FIP 24, is a batch fee surcharge with two degrees of freedom. One is the batch balancer, which lets individuals trade off what the base fee would be for a single message submission against what happens if you aggregate things together. The other is the batch discount, which applies a bit of a shaving, a haircut, on that surcharge depending on how the system is performing. What we're going to see is that currently these two degrees of freedom are set once and then updated through governance, and what is being investigated is whether we can endogenize both of them so that they respond to network conditions.
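To make those two degrees of freedom concrete, the following is a hedged sketch of the general shape of the aggregation surcharge described here: a per-sector network fee priced at the larger of the current base fee and the batch balancer, scaled down by the batch discount. The numeric constants are illustrative placeholders, not the values actually set by governance in FIP 13 or FIP 24.

```python
# Stylized shape of the batch fee surcharge (two degrees of freedom).
# All numeric constants are illustrative placeholders, not governance-set values.

BATCH_BALANCER = 5e-9                    # price floor, in FIL per gas unit (illustrative)
BATCH_DISCOUNT = 1.0 / 20.0              # haircut applied to the surcharge (illustrative)
SINGLE_PROOF_GAS_ESTIMATE = 65_000_000   # per-sector gas estimate (illustrative)

def aggregate_network_fee(num_sectors: int, base_fee: float) -> float:
    """Surcharge paid on top of gas when sector proofs are batched into one message."""
    # The batch balancer acts as a floor: when the base fee is low the surcharge
    # stays relatively expensive, which discourages batching on an idle network.
    effective_price = max(base_fee, BATCH_BALANCER)
    return effective_price * SINGLE_PROOF_GAS_ESTIMATE * num_sectors * BATCH_DISCOUNT
```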
But just to give a very quick idea of what the Filecoin batch balancer is set up to implement in this context: we have a pool of messages coming in from storage onboarding, as sectors are being sealed. They arrive at a batch decision — whether or not to aggregate — which depends on the state of the system at that time, the prevailing base fee, and what the batch balancing surcharge would be. Then it's a decision to batch or not to batch. If you batch, you use less gas per message, but the batch surcharge is applied; if you don't batch, you send messages singly, you use more gas per message, but there's no surcharge. These lead to different outcomes for the stress, let's say, on the network: if you aggregate and batch things, you use less gas, utilize less of the network, and there's a lower chance of network congestion; whereas if individual messages are submitted singly when there is already a high level of network utilization, this adds to the congestion burden.

The idea behind the batch balancer is that if there is, for example, high network use, then the choice made — in an environment where the parameters of that batch balancer have already been specified — is to say: yes, I would prefer to batch in that situation, collect everything together, use less gas, and lower the network pressure. That's what we would like to see from the point of view of the network, and of course we want to ensure there's an underlying incentive for each of the storage providers to do so. By contrast, in a low network utilization environment we would say batching is not so interesting: don't incentivize it as much, let individuals use the network with single messages, and perhaps raise network usage. So we end up with a kind of equilibrating, regulating system, but it's predicated on the two degrees of freedom I mentioned: the batch balancer value and the batch discount.

Now, how are those updated at present? They're updated by governance, by introducing a FIP. As we moved from FIP 13 to FIP 24, what was introduced was a change to one or both of those parameters, on the basis of what had been observed since they were last updated. You look over the interval and say, well, maybe it makes sense now to increase the surcharge by a slight amount, or perhaps decrease the discount by a certain amount. This is a way to use expert knowledge to drive the update, but of course it means you cannot respond immediately to a change in network conditions, because you have to wait for an appropriate review period and for the FIP to be accepted. By contrast, the active research we're looking at is whether we can put in place a functional form representation that explicitly depends upon the state of the network and dynamically adjusts one or both of those parameters. So if there is, say, high network utilization, this dynamic batch balancer would take that into consideration, adjust the total batch surcharge, and influence the incentives of individual storage providers accordingly.
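As a stylized version of the batch-or-not decision just described — not the actual on-chain logic — a storage provider can be thought of as comparing the cost of sending its prove-commits singly against the cost of one aggregated message plus the surcharge sketched earlier. The gas figures here are again illustrative assumptions.

```python
# Stylized batching decision for a storage provider (not the actual on-chain logic).
# Gas figures are illustrative assumptions.

SINGLE_MSG_GAS = 65_000_000          # gas for one un-batched prove-commit (illustrative)
BATCHED_GAS_PER_SECTOR = 10_000_000  # far lower per-sector gas once aggregated (illustrative)

def aggregate_network_fee(num_sectors: int, base_fee: float) -> float:
    """Same illustrative surcharge shape as in the earlier sketch."""
    return max(base_fee, 5e-9) * 65_000_000 * num_sectors * (1.0 / 20.0)

def should_batch(num_sectors: int, base_fee: float) -> bool:
    """Batch when the aggregated cost (gas plus surcharge) beats sending messages singly."""
    cost_single = num_sectors * SINGLE_MSG_GAS * base_fee
    cost_batched = (num_sectors * BATCHED_GAS_PER_SECTOR * base_fee
                    + aggregate_network_fee(num_sectors, base_fee))
    return cost_batched < cost_single

# When the base fee is high (a congested network) the gas savings dominate the
# surcharge, so batching wins; when the base fee is very low, the batch-balancer
# floor inside the surcharge makes single messages comparatively attractive.
print(should_batch(100, base_fee=1e-8))   # high base fee  -> True
print(should_batch(100, base_fee=1e-11))  # very low base fee -> False
```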
So by introducing a dynamic batch balancer framework you don't necessarily eliminate the need for governance — you may still want to change the form of the mechanism depending on its performance — but it does mean you don't have to change those parameters every single time through a governance process. The regulating system has now been endogenized. The challenge, of course, is to create a dynamic batch balancer that responds to network behavior autonomously while continuing to incentivize batching for the storage provider when network usage is high, and our current research agenda is to assess a parameterized dynamic batch balancer functional form within the operational digital twin of the Filecoin network that Andrew introduced. The goal is to select, using the digital twin, a parameterization informed by the simulations we run for various storage onboarding rates and various types of demand for Filecoin services, and that in turn supports a recommendation for a FIP to implement the dynamic batch balancer once those simulations conclude. These simulations are ongoing at the moment.

Finally, the idea for batch balancing is to incorporate all the information we can about the existing network: to understand not just the laws of motion of the system as a whole, but also what the system itself would like to have. We know, for example, that from a storage provider's point of view the batching decision is predicated on cost, whereas for the network as a whole, optimal batching is predicated on network efficiency. There's a trade-off between the network becoming too congested on one side, and not enough protocol revenue — gas being burned — on the other. The digital twin implementation lets us model the entire network as a macro system that selects message batching in the simulated framework based on network use and the gas usage from message traffic, according to this batch balancer functional form. So we dovetail the gas dynamics on one side into the digital twin on the other, conditional on this functional form, examine the scenario simulations, and, using different metrics, assess which parameter constellations are optimal for the particular implementation of the batch balancer functional form we would be suggesting. In the simulations we look at various scenarios — high storage onboarding, low storage onboarding, high network congestion, low network congestion — to understand the impact of different message traffic rates, and we combine those scenarios with the gas dynamics laws of motion that Andrew uncovered in the first part of the talk, so the simulations run with as close an understanding as possible of how those gas dynamics influence things like the burn rates for different messages. To combine the two, we then perform, as part of the engineering design process, parameter selection under uncertainty, to figure out a range of parameter values for which the dynamic batch balancer can then be recommended. Okay, I think that's everything we have time for.
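To give a sense of what a parameterized dynamic functional form might look like inside the digital twin, here is one purely hypothetical candidate — the batch balancer tracking a moving average of recent base fees, with a floor and a sensitivity parameter that the scenario simulations would sweep over. This illustrates the research direction only; it is not a proposed FIP or the form actually under evaluation.

```python
# Hypothetical dynamic batch balancer: the surcharge floor tracks recent network
# conditions instead of being fixed once by governance. Purely illustrative.
from statistics import mean

def dynamic_batch_balancer(recent_base_fees, floor=1e-9, sensitivity=0.5):
    """Candidate functional form: a floor plus a fraction of the recent mean base fee."""
    return floor + sensitivity * mean(recent_base_fees)

# Toy scenario sweep standing in for the digital-twin simulations: each scenario is
# a base-fee path, and parameter selection under uncertainty means scoring each
# candidate parameter setting across all scenarios rather than tuning to one history.
SCENARIOS = {
    "high_onboarding": [4e-9, 6e-9, 9e-9, 12e-9, 15e-9],
    "low_onboarding":  [1e-10, 2e-10, 1e-10, 3e-10, 2e-10],
}
for sensitivity in (0.25, 0.5, 1.0):
    for name, path in SCENARIOS.items():
        balancer_path = [dynamic_batch_balancer(path[max(0, t - 2): t + 1],
                                                sensitivity=sensitivity)
                         for t in range(len(path))]
        # In the real exercise, balancer_path would feed the simulated batching
        # decisions and be scored on congestion and protocol-revenue metrics.
        print(sensitivity, name, [f"{b:.2e}" for b in balancer_path])
```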
I know we went through the different research threads fast and furious, but thank you for your attention, and naturally if you want any more information you can definitely contact BlockScience or either of us individually. So thank you very much.

Great, thanks to Jamshid and Andrew for a great illustration of the power of the digital twin model for modeling and simulation, and for a nice illustration of the relationship between empirical observation and mechanism design in the Filecoin ecosystem. We have time for a couple of questions to the speakers. I'll run up to you like Oprah. Don't be shy. Okay, I'll get it started then. I noticed in one slide you made reference to an intuition you had about the system you were modeling, when you were describing the design of the batching mechanism. Where do these intuitions come from? Are there any real-world or digital systems that you find are fruitful sources of intuition when you're building these kinds of models and doing these kinds of studies?

That's a great question. Some of the kinds of environments we look at are control systems — whether open-loop or closed-loop feedback systems. A feedback system says that we would like to respond to a particular state of the system by taking an action that reinforces a criterion such as stability or minimal volatility. These become ways in which we can understand the trade-offs that must exist for an individual, of their own volition, completely of their own choice, to make decisions that affect the network as a whole in a positive fashion. It is, of course, a longstanding open question whether individuals doing their own thing will act in ways that don't end up working against the community as a whole, so there is a lot of tension when you try to aggregate from micro-decisions up to macro outcomes. The idea within the batching system is to make sure people understand that they can batch whenever it suits them, but that on the margin they would decide to batch more often under conditions that benefit the network — for example, under high network congestion. There is actually an empirical fact that some participants like to batch even when there is absolutely no network congestion; they batch a certain amount all the time, perhaps thinking it's simply an efficiency gain, even though they aren't necessarily earning any savings, because network utilization is so low and the base fee is so low. This may raise strategic questions: are they looking forward and saying, maybe my impact today changes the base fee in the future, and therefore when I engage in message traffic tomorrow, I actually save money then? Part of what we're investigating in this macro model is how to model those types of strategic decisions, which are motivated by game theory but built into a macro model of the trade-off between current benefit and future benefit. So that's one of the ways in which we use that intuition about driving the micro level up to the macro level to build such systems. Thank you for that. We have another question.
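For the closed-loop feedback intuition mentioned in that answer, a minimal, entirely illustrative sketch — not part of the Filecoin protocol — would observe a state such as network utilization and nudge a control parameter toward a target:

```python
# Minimal closed-loop feedback sketch: observe a state, nudge a control parameter
# toward a target. Entirely illustrative; not part of the Filecoin protocol.

def feedback_step(control: float, observed_utilization: float,
                  target_utilization: float = 0.5, gain: float = 0.1) -> float:
    """Proportional controller: raise the control when utilization is above target."""
    error = observed_utilization - target_utilization
    return max(0.0, control + gain * error)

control = 1.0
for utilization in (0.9, 0.8, 0.6, 0.5, 0.4):  # a toy trajectory of observed states
    control = feedback_step(control, utilization)
    print(f"utilization={utilization:.1f} -> control={control:.3f}")
```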
What kind of changes to the system would make it much easier to measure the things you want to measure, and easier to run experiments or simulations? I imagine a lot of this is rate-limited by the ability to do experiments or design different kinds of potential systems. Are there any changes that could be made to the tech stack itself to make this easier?

Good question. I'll defer also to Andrew if he has an answer. From my side, the more information that can be exposed about the distributions of things occurring in the system, the better. A lot of the challenge on the data side is building a model of the distribution of the things that are not totally under your control — exogenous effects, for example — and the tighter the bounds you can place on those, and the greater the data fidelity for running some sort of parametric or non-parametric estimation, the better off you are, because then you can narrow the error bounds over a wide range of simulations. But let me also defer to Andrew if he wants to add something.

It's a great question. Yes, I completely agree with what you just said. The other thing we're working on, to make the analysis on the tech stack more accessible and easier to use, is moving it to Docker images, because setting this type of thing up involves a lot of different dependencies. One of the next steps for the operational digital twin — since we already have the data set up in a specific way — is to keep building in the different distributions and whatever prior knowledge we have, but also the ability to run it from a Docker container without the full setup, which lets us crowdsource some of the analysis a bit more. If we just say, here's a repo with all of this code, it takes a long time to set up, whereas with Docker we can package all of it so you can go in, change params, and do more experimentation. So from a tech stack perspective, moving the full digital twin to Docker will definitely aid the crowdsourcing of analysis. Great, thanks. Do we have any other questions for the speakers?

In the architecture you showed, a lot of the data originated in the Sentinel database and Redshift or something — what would be the trade-offs if you were sourcing that data from Filecoin or IPFS directly? Would that help at all, or hurt?

That's the key thing: we also used Lily, which gave us S3 dumps of data from the actual Filecoin chain, and we used a lot of that as the basis of our backfilling. For the digital twin we're often looking at a somewhat more aggregated view than data taken straight from the chain, and internally at BlockScience we are looking at creating a system where we can ETL directly from different blockchains, at which point we could go directly to the Ethereum, IPFS or Filecoin systems. But if we're running an operational DT for making decisions about how to structure the economics, going to that level of fidelity, versus relying on an aggregation at the epoch level, daily level, or even second level — that aggregated data is more what we need for the operational DT.
So at the moment, the additional work required to hit the chain directly wouldn't outweigh the costs — it wouldn't create enough benefit given the aggregated data we need — but it's definitely something we're looking toward long term: the closer we can get to the actual data, with an aggregation layer on top, the better. That's a long-term research project we're working on across all BlockScience clients. Great, thanks again to Andrew and Jamshid for that presentation.