Hey everyone, I figured I'd start, even though I still have a minute. So, every now and again somebody sends me a message on Messenger or wherever, and the message goes along the lines of "How do I debug Geth?" I always have a follow-up question: what do you mean, debug Geth? Just submit an issue. And it turns out that they actually don't have an issue. Rather, they want to know how they can tell whether Geth is actually working, whether it is working properly, or how healthy it is.

Usually what we tell people is: well, just look at the logs. How hard can that be? Okay, I'm being a bit ironic here. The problem with looking at the logs is that we either end up in the situation where Geth is just silent, there are no logs, and you're just waiting and waiting, or maybe there's a log line every 15 seconds but you still have no idea whether that's good or bad. Or we have the more ingenious people who raise the log level, and they still have no idea what's going on. So the problem is that logs are really, really useful if you know specifically what you're looking for, but if you just want a general overview of what your node is doing, then logs are not useful at all.

So what can we do instead? First things first: people are mostly interested in whether their node is working or not, and this is not such a simple question to answer, because it's never binary. Of course, if the node is not working, we see that it is not working. But if it's working, then that's an entire spectrum: it might be working very well, or it might be working just barely, but it's still working. If it's working very well, then everything is fine; if it's just barely working, then you might have an issue at 3 a.m.
in the morning, and you want to be able to tell how healthy your node is.

Now, let's suppose that we start with the latest release. It kind of works, but a lot of things can push your node in one direction or the other. For example, every time we do a bug-fix release, hopefully it improves the situation. And of course, if you run it on more powerful hardware, that also papers over any issues that we might have with flaky algorithms or suboptimal code. Network connectivity and other externalities like workload again influence how well your node operates. Essentially, if you run a Görli node on your notebook, it will perform really, really well, whereas if you run mainnet, you might run into trouble. These are kind of natural.

Now, the thing is, when you start running production infrastructure, just knowing that your node works is not really enough. You want to know how well it works, because it's a game of trade-offs. Yes, you can put the most powerful virtual machine available on Amazon underneath it, and it will probably run perfectly, but it will cost you a lot. Or you can put a Raspberry Pi underneath it, and it won't run. So how do we find the middle ground? The answer is metrics and monitoring.

From a very abstract perspective, metrics and monitoring means: measure everything that you can and visualize it. This is not something new in the Ethereum ecosystem. I mean, the ethstats page has been available since forever, and everybody knows how useful it is. However, when you want to measure something more, it's really hard to figure out what exactly to measure. I mean, take a software system.
There are gazillions of things that it does, so what is important and what is not? To answer that, you need to have specific questions: what kind of questions would you like answered? For example, in the Geth team we usually have three questions. The first is: how do the nodes behave across the globe? The second is: if some node looks weird, what is it actually doing? And last but not least: if we found an anomaly and fixed it, then it is really important for us to see that, okay, the old version did something weird, so does the new version fix it, yes or no? And does the new version really not ruin anything else, yes or no?

So let's dive into actual examples. The Geth team runs eight boot nodes globally, across various continents. The first question we want to answer is: are these boot nodes healthy, and what is the cost? Whether the boot nodes are healthy or not is really simple for us: are they in sync with the network or not? So we simply visualize how many blocks behind the chain head they are, and if some boot node falls behind, we immediately know that something is wrong. But this gives us just the binary answer. We know whether it's healthy or not, but we don't know what the cost is, or how healthy it is.

To find out how healthy it actually is, we also need to chart its resource consumption. As a computing system, you have four major resources: CPU, memory, disk and network. Of course, you can split disk and network up a bit further into input, output, etc. But if we chart all eight of our boot nodes across these four metric types, then we can immediately see anomalies. For example, on the memory chart we can see that the yellow boot node is using 2 gigabytes more memory than the rest, and CPU-wise it's a similar thing. Immediately you have something to take care of. The other interesting aspect is that if you know
your nodes have 32 gigabytes of memory and you're at 8, then you don't care about it. You're really far away from crashing. However, once you start reaching your limits, that's when things will start going wrong. So just by looking at this dashboard, you can immediately see how close you are to failure.

Cool. Now, just to dive into one of these examples: our DevOps team told us a couple of weeks ago that, hey, you're kind of using too much bandwidth. I mean, it costs too much. They care about the money, not the bandwidth. So we looked at our charts and yes, indeed, the boot nodes are pushing out five or six megabytes per second of data. And we were wondering: okay, what does the boot node do? This chart tells us how much it costs resource-wise, but we don't know what it's actually doing with those resources.

So those are our intermediate-level monitoring questions: let's try to figure out what an Ethereum node is doing with its disk or other resources. Now, in the case of networking, what can we do?
Well, first step: previously we measured the total bandwidth usage, but we would also like to measure the bandwidth usage of individual protocols. Since the boot nodes are running the Ethereum and light client protocols, those are the two we want to measure. But we can go even further down and measure the network usage of each individual packet type within those protocols. If you chart these, you get these really nice, fancy spikes, and immediately we have three things that are really, really obviously strange.

For example, the top two charts are the upload and download speeds of the Ethereum protocol, and you can immediately see that the light blue thingy is causing quite a lot of traffic. I'm not sure whether it's visible or not, but I will tell you that the light blue has the label of transaction propagation. So just by looking at these charts, you can immediately see that there's something wrong with transaction propagation: it's taking about 1.1 megabytes per second of traffic. Maybe this is enough for you to decide what the next step is, maybe it is not. In our case, yes, we are aware of how transaction propagation works, and this means that we need to roll out a new version of the Ethereum protocol with an alternative way to propagate transactions, because otherwise there's no way to get this down. But we immediately know what the fault is, and we have an idea of how to fix it.

The other anomaly in these charts is that the boot nodes are uploading a significant amount of data. Our guess, again, is that since these are the boot nodes, everybody's trying to synchronize from them.
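As a rough illustration of the per-protocol, per-packet-type metering described above, here is a minimal Python sketch. It is not Geth's implementation: the `TrafficMeter` class, the protocol and message labels, and the byte counts are all hypothetical, chosen only to show how breaking total bandwidth down by message type makes one dominant consumer stand out.

```python
from collections import defaultdict


class TrafficMeter:
    """Accumulates byte counts per (protocol, message type, direction).

    Hypothetical helper for illustration; not part of any real client.
    """

    def __init__(self):
        self._bytes = defaultdict(int)

    def record(self, protocol, msg_type, direction, size):
        """Add `size` bytes to the counter for one message."""
        if direction not in ("in", "out"):
            raise ValueError("direction must be 'in' or 'out'")
        self._bytes[(protocol, msg_type, direction)] += size

    def total(self, protocol=None, direction=None):
        """Total bytes, optionally filtered by protocol and/or direction."""
        return sum(
            n
            for (p, _, d), n in self._bytes.items()
            if (protocol is None or p == protocol)
            and (direction is None or d == direction)
        )

    def top(self, protocol, direction):
        """Message types of one protocol, sorted by bytes, largest first."""
        rows = [
            (m, n)
            for (p, m, d), n in self._bytes.items()
            if p == protocol and d == direction
        ]
        return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample traffic: only the labels matter for the illustration.
meter = TrafficMeter()
meter.record("eth", "transactions", "in", 1_100_000)
meter.record("eth", "block_headers", "in", 200_000)
meter.record("les", "get_proofs", "in", 50_000)

print(meter.top("eth", "in"))
print(meter.total(direction="in"))
```

The total alone only tells you the cost; the sorted per-type view is what points at the message type worth investigating.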
So if you want to stop people from synchronizing from them, well, of course we can take the boot nodes halfway offline so that they refuse to give you the data, but that's not nice. Alternatively, maybe we need a better discovery protocol so that you can find peers faster, find better peers, etc. Again, we just looked at what the node is spending its network bandwidth on, and we can immediately make really nice guesses.

Now, there are four resource families, so we can do a similar exercise for CPU usage. If somebody were to ask you what an Ethereum node uses CPU for, the no-brainer answer is block processing, and this was our false assumption for many years too. About half a year ago, sometime in January, we decided that, okay, something's not right, so let's try to split that block processing up. We saw that block processing takes a hundred milliseconds, but what's inside that? When you run a transaction, there is an execution component, we also load data from disk, we also write it to disk, and we also do some hashing. So let's break it up and meter all these components individually.

The resulting chart was a bit surprising. It turns out that if you run a full node, transaction execution, I mean the CPU actually running and computing stuff, is 25% of block processing at most. So 75% of block processing is shuffling data around. This is an extremely important thing to know, because all of a sudden you realize that optimizing the EVM is not that important. Optimizing the database makes a lot more sense than caring about how much one opcode or another costs. And in the case of the boot nodes this gets even worse: since everybody is constantly hammering us with requests, actual EVM execution is ten percent of the block processing time.
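The idea of metering each component of block processing separately can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PhaseTimer` class and the phase names are hypothetical, and the demo uses a fake clock with made-up tick values so the result is deterministic. Only the 25% execution share mirrors the figure quoted in the talk; the split among the other phases is invented.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates elapsed time per named phase (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.elapsed = defaultdict(float)

    @contextmanager
    def phase(self, name):
        """Time the enclosed block and add it to the phase's total."""
        start = self._clock()
        try:
            yield
        finally:
            self.elapsed[name] += self._clock() - start

    def shares(self):
        """Fraction of the total metered time spent in each phase."""
        total = sum(self.elapsed.values())
        if total == 0:
            return {}
        return {name: spent / total for name, spent in self.elapsed.items()}


def fake_clock(ticks):
    """Return a clock that yields the given readings one by one (demo only)."""
    readings = iter(ticks)
    return lambda: next(readings)


# Made-up readings in milliseconds: each pair is one phase's start and end.
timer = PhaseTimer(clock=fake_clock([0, 25, 25, 60, 60, 70, 70, 100]))
with timer.phase("execution"):
    pass  # stand-in for running the transactions
with timer.phase("disk_read"):
    pass  # stand-in for loading state from the database
with timer.phase("hashing"):
    pass  # stand-in for computing hashes
with timer.phase("disk_write"):
    pass  # stand-in for committing state to the database

print(timer.shares())
```

In real use you would drop the fake clock and wrap the actual work in each `with` block; charting the shares over time is what reveals where the processing time goes.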
So that's kind of scary. But again, we have a really good idea of where we can optimize. From just a single chart, just some information, we can immediately see how to proceed.

Another interesting fact that these charts allowed us to see concerns transaction validation and propagation. The Ethereum network throughput is, I don't know, something like 20 transactions per second, so you would expect that shuffling 20 transactions per second around is no big deal. Except when you chart it out, it turns out that the boot node is receiving 11,000 transactions per second. Most of them are duplicates, and some big portion of them are invalid or underpriced and get rejected. But it's really strange, and when you look at that number, you all of a sudden realize that, wait, the transaction pool needs to be really optimized for this throughput. It's not the throughput of the blockchain but rather the throughput of the junk data coming in that defines how much processing we're going to do.

And of course, we have two other big consumers of CPU. One of them is RPC calls, which we haven't charted yet, so that's something we want to do in the future. The other is network handshake requests, which might seem like: why does that even take CPU?
Well, if you run it on a Raspberry Pi, it will murder one core of your Raspberry Pi just doing the cryptographic handshakes in the network. So again, we need a new discovery protocol to fix it.

The third big category is disk. In the case of disk, it is extremely hard to measure, because when you run a program you have all these layers on top of one another: operating systems, containers, libraries. They all like to be smart and they all like to cache, and this means that it's really hard to measure the thing that you actually want to measure. We could measure how much data we're pushing to disk, but that depends on how much RAM the operating system has and how aggressively it caches. We could ask the operating system to tell us, but that also depends on whether you run it in Docker, because Docker also starts screwing around with all kinds of caches.

After we figured out that we cannot measure this reliably, we decided: okay, screw it, we're going to measure how much data we are pushing into the database. Well, it turns out that that's completely useless too, because LevelDB has all these background processes around compactions and fancy storage models. So in the end we actually needed to patch LevelDB and put our metrics inside LevelDB itself so that we can get an accurate measurement. We're really grateful to the author of goleveldb for allowing us to keep patching his database.

This one I will probably not go into, but the idea is that after you have found an actual issue, after you have found that specific metric that is really out of place, you need to find out why it is out of place. Here I gave you some ideas.
So those ideas are good enough. But if you don't have any ideas, then the only thing you can do is look into the details of the algorithms internally, try to map out what the algorithms are doing, and try to make better guesses. For example, this is maybe one-third of the light server charts that Gary put together, and there are lots and lots of these just to figure out individual tiny bugs. I won't go into these because that's completely out of scope here.

However, let's suppose we have an idea of what the cause of the anomaly was, and we have a pull request to fix it. So the next thing is: we just open the pull request, merge it in, and everything is fixed. Yeah, no. Wrong. The issue is that there are simply so many moving components in an Ethereum node that fixing one thing might actually break others. So the lifetime of these performance or anomaly-fixing pull requests looks like this: people open a PR, and then we run a one-week-long benchmark to see what it does.

Now, this is the perfect case. We see something like this maybe once a year. These are the miracles of development. This was a pull request created by Gary, disabling some internal data shuffling within LevelDB. Here the green lines were master and the yellow lines were Gary's experimental PR. Essentially, by just swapping a few variables, he managed to cut down the disk I/O by an order of magnitude. I mean, you don't get these kinds of charts at all. You simply don't believe them when everything is so much better than before.

What you get a lot more often, however, is these kinds of charts. This was from a previous monitoring system that we had, where somebody opened a pull request arguing that Windows really chokes on folders that have many files in them.
So let's change LevelDB to use larger files. I mean, it seems like a pretty good idea. Why would you use two-megabyte files if you can use 100-megabyte files? So he implemented the PR, submitted it, and also submitted benchmark results showing that, yes, the number of files goes down and the number of used file descriptors goes down. The PR got two reviews, everybody approved it, and it almost got merged in. And then we said: okay, you know what, let's benchmark this, because it's touching scary stuff. Then you start to look at the benchmarks and you realize that the disk writes simply exploded. They blew up by two orders of magnitude. And again, it was a perfectly good PR; it just didn't take into consideration some weird internal behavior of LevelDB. These are the PRs that can really, really bite your performance, where everything that you do is perfectly logical except the result. And of course these are still a good case, because it is a PR that you can at least close.

Then you get the really nasty ones, and this is the case that we see all the time. In the charts that you see here, that blue peak is actually the Shanghai DoS attack. I just zoomed in on the part where we are doing a full sync and processing the Shanghai denial-of-service attacks, and that's the memory usage. Previously, in the 1.8 release family of Geth, every time you reached the Shanghai attacks we had this huge memory peak, simply because we had various caches in there that used a lot of memory. Essentially, your machine needed 13 gigs of RAM to be able to process those blocks. Of course, that's a lot.
So we decided to replace that caching algorithm with something completely different, and yeah, it was a complete success. It completely wiped out the peak, and even the general memory usage of Geth went down. Except the disk usage went up by 30%, and now you have this huge problem and question: what do you do? I mean, 30% extra disk I/O is horrible. The extra gigabytes of memory usage are equally horrible. So you need to make trade-offs; you need to make really hard decisions about which is better. In the end we decided that the extra disk usage is the better option, simply because if you want to use more memory than is available, you crash. I mean, it's a hard failure. Whereas if you use more disk than you would like, then things get slower, but they still function. So this is why the 1.9.0 release uses more disk than the last 1.8 release, whichever that was, but all in all we think it was worth it. However, Gary has since fixed it, so now we're much better. So this thing was fixed already.

Cool. Now, that is kind of how the Geth team monitors stuff, benchmarks PRs and pushes them out. If you would like to repeat something similar in your infrastructure, you need to decide how you want to monitor things. You can pay for monitoring as a service via Datadog. We did that for two years, but Datadog had limitations: it didn't allow us to push in all the metrics that we had, so in the end we switched to our own setup and just ran Grafana. If you decide to run your own on-premise instances, then you also need to decide on a database to push your data into: either Prometheus or InfluxDB, whichever you prefer. The reason I explicitly mention them is that monitoring infrastructure boils down to these four things.
So you don't care about anything else in the world. For our part, Geth can export data in the Datadog format, and we can get data into Grafana via InfluxDB and Prometheus too, so whatever floats your boat, we can do it. And just for the sake of completeness, I also linked our own dashboards. We exported them; we use Grafana with InfluxDB, but you are welcome to use them, integrate them, do whatever you want with them. All our configurations are published there. They will probably also land in the repo eventually.

Cool. Now, what are the lessons that we learned while creating these charts?

You must measure at as low a level as you can, meaning that every abstraction transforms your data. Everybody tries to be smart, and that usually ruins your metrics.

You must always measure your worst-case numbers. I get it that the operating system is really smart and makes things more optimal for you, but if you run out of memory, then your worst-case numbers will be the ones actually hitting you. So always be aware of the worst that can happen.

And measure everything that you can afford. As I have shown you previously, you can gain a lot of information if you have lots of tiny, detailed metrics, but eventually it will become too expensive. For example, Martin did some really awesome measurements where he measured the cost of each individual opcode. It's really insightful, but obviously you cannot run it in a live production environment. Still, if you have the numbers, you can debug an issue. If you don't have the numbers, you will have to gather them the next time you reproduce that issue. So there's no way to fix it without numbers.

Yeah, and that's about it. Thank you very much.
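For readers who want to see what pushing a sample into InfluxDB looks like on the wire, the following Python sketch formats one data point in InfluxDB's line protocol. It is an illustration only, not how Geth exports its metrics: the `to_line_protocol` helper, the measurement name, and the tag and field names are hypothetical, and the helper handles only integer and float field values.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp.

    Only int and float field values are supported in this sketch.
    """
    if not fields:
        raise ValueError("at least one field is required")

    head = [_escape(measurement, ", ")]
    for key in sorted(tags):
        head.append(f"{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}")

    body = []
    for key in sorted(fields):
        value = fields[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("only int and float field values are supported")
        # Integers carry an 'i' suffix; floats are written as-is.
        rendered = f"{value}i" if isinstance(value, int) else repr(value)
        body.append(f"{_escape(key, ',= ')}={rendered}")

    return f"{','.join(head)} {','.join(body)} {timestamp_ns}"


# Hypothetical metric and host names, made-up value and timestamp.
line = to_line_protocol(
    "geth.p2p.ingress",
    {"host": "bootnode-1"},
    {"bytes": 1100000},
    1000,
)
print(line)  # geth.p2p.ingress,host=bootnode-1 bytes=1100000i 1000
```

Tags such as the host name are what let a dashboard overlay all nodes on one chart and make a single outlier visible.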