Really, this is just our story: how we've developed Lodestar over the past three years.

About us: my name is Cayman, and I work at ChainSafe. It's a great company, and we're hiring. We basically build blockchains; we have a whole bunch of layer-one projects, and Lodestar is one of them. If you're interested in getting into core development, we're a really great place with a great culture.

So Lodestar is one of these projects at ChainSafe. It's basically an eth2, or Ethereum consensus, ecosystem written entirely in TypeScript. We really take the open source ethos to heart: all of our meetings are public, everything we do is public, and we really encourage anyone who wants to contribute. We try to help people out and get them involved. A lot of our team members actually started their work on the project through their open source contributions, in a kind of open-source-to-contract-to-hire situation. We really like that ethos.

What we're talking about today is specifically our consensus client, and how we got from typical prototyping code to something that's production ready: something you can really rely on, something that doesn't just blow up all the time.

Before I step into our story, I want to give a very high-level overview of what metrics are. I'm not going to go too deep; if you really want to go further, you can read the docs on these things. In our case we're using Prometheus. Prometheus basically gives you the tools to build time series data for specific elements you want to track in your code. The three main things you track are counters, gauges, and histograms; we'll go through some examples of all of these. Basically, you have some value in your code that you want to track, Prometheus queries your software over time, and you get a sense of how things are changing, how things are moving. Then you use Grafana to visualize these metrics. A picture is worth a thousand words: it's the difference between looking at a bunch of logs and looking at pretty graphs that give you a little more insight, something at a higher level than just squinting.
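To make those three metric types concrete, here's a minimal sketch of what registering them and exposing them for Prometheus to scrape can look like. This is an illustration, not Lodestar's actual code: it assumes the prom-client Node.js library, and the metric names are made up.

```ts
import http from "node:http";
import { Counter, Gauge, Histogram, Registry } from "prom-client";

const register = new Registry();

// Counter: a value that only ever goes up (e.g. blocks processed)
const blocksProcessed = new Counter({
  name: "beacon_blocks_processed_total",
  help: "Total number of blocks processed",
  registers: [register],
});

// Gauge: a value that can go up and down (e.g. connected peers)
const peerCount = new Gauge({
  name: "beacon_peer_count",
  help: "Current number of connected peers",
  registers: [register],
});

// Histogram: a distribution of observed values (e.g. request timing)
const requestDuration = new Histogram({
  name: "beacon_request_duration_seconds",
  help: "Distribution of request durations in seconds",
  buckets: [0.01, 0.1, 0.5, 1, 5],
  registers: [register],
});

blocksProcessed.inc();
peerCount.set(55);
requestDuration.observe(0.23);

// Prometheus scrapes this endpoint on an interval to build the time series
http
  .createServer(async (_req, res) => {
    res.setHeader("Content-Type", register.contentType);
    res.end(await register.metrics());
  })
  .listen(8008);
```

With an endpoint like that in place, a Prometheus server scrapes it on an interval and builds the time series that Grafana then graphs.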
So, our story. We basically started with not very much infrastructure and not very much discipline in how we tackled metrics, in how we tackled the data we track. Over time, through a lot of pain and suffering, we realized that metrics are the way to go. Metrics give you insights that you will never get just by looking at logs. They aggregate a bunch of information and give you the tools you need to make better decisions about how you release your product and what to prioritize. And they're also just really pretty, so they're cool.

This is a picture from the early days of Lodestar, from the interop lock-in in Muskoka in 2019. Here's our message: "yeah, we're stuck syncing, some p2p issues." We didn't really know what was going on or why things were breaking. We were looking at logs. On the other side of the screen here we look happy, but it was like, oh, things are breaking. Here are some funny issues we had; you can see "priority: low, adding Prometheus monitoring." We didn't even know what we were doing.

Yeah: out of memory, OOM. It was just plaguing a lot of our development throughout 2019 into 2020. Lodestar was very much a prototype, and we ran into the same issues time and time again. The same thing would happen multiple times, regressions kept happening, and we weren't really learning our lesson. The turning point was starting to take the time to really build out these metrics and build out these dashboards; one of our contributors really pushed this forward. Throughout 2021 we went really hardcore into adding a bunch of metrics and taking time to think through what questions we wanted to answer. What pieces of the code are dark? Where are we spending time in our software? Where is a bunch of memory being spent? We were really uncovering all of the complexity of this running software. Blockchains are really complicated software; there's a whole bunch going on, and unless you're tracking the key pieces, you're going in blind. Now we're finally at a point where, for any new feature we add, we ask for metrics. You'll see review comments like: "it'd be very important to track the retry behavior," meaning, add metrics for it.

That brings me to metrics-driven development. The key thing we've learned is that every large feature should be documented with metrics. If you're adding, say, a retry mechanism, some new feature, you need to make sure it's actually working, and a really great way of doing that (not the only way, but a really great one) is being able to show it visually. Being able to say: my cache is now bounded, let's see that in the metrics; let's see how that cache is growing over time; or let's see how many times the retry mechanism is actually being activated. Those are the sorts of things you can really only find through these graphical methods, through these metrics. You're going to have a really hard time digging that out of the logs.

The other thing about metrics-driven development is that any deployed software you use needs to be monitored very carefully for regressions. Look here: this is the process heap increasing over time. This is a comment from Lion: "for the last two days the leak is clearly visible, an impact of 75 megabytes a day." That's going to be a problem: if your blockchain client just keeps growing at 75 megabytes a day, continuously, you're going to run out of memory. And we were able to catch this issue before we actually cut a release. It makes your release process a lot better.
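As an illustration of what "document a feature with metrics" can mean in practice, here's a hedged sketch, again assuming prom-client. The retry helper, the cache, and the metric names are hypothetical, not Lodestar's actual code:

```ts
import { Counter, Gauge, Registry } from "prom-client";

const register = new Registry();

// Count every retry, labeled by outcome, so a dashboard can show
// how often the retry mechanism is actually being activated
const retriesTotal = new Counter({
  name: "request_retries_total",
  help: "Retry attempts, labeled by outcome",
  labelNames: ["outcome"] as const,
  registers: [register],
});

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) {
        retriesTotal.inc({ outcome: "exhausted" });
        throw err;
      }
      retriesTotal.inc({ outcome: "retried" });
    }
  }
}

// Sample the size of an internal cache at scrape time, so a leak or an
// unbounded cache shows up as a line that only ever goes up
const blockCache = new Map<string, Uint8Array>();
new Gauge({
  name: "block_cache_size",
  help: "Number of entries currently in the block cache",
  registers: [register],
  collect() {
    this.set(blockCache.size);
  },
});
```

If the cache is supposed to be bounded, the graph of that gauge flattening out is exactly the visual proof that the feature works as intended.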
Here's another example of a regression. We deployed a version of our software and we saw the cache size grow, and it's like: oh, is that a bug? Is it okay? Well, we only actually caught it because we were tracking these things. You probably aren't even going to have a log that shows you your cache sizes, so unless you actually measure this, you're going to be blind.

The other great thing about dashboards and metrics is that they let you correlate different pieces of the code to find out where the problem is. Here you can see a bunch of different graphs all at the same time: okay, something is affecting the event loop lag, and I can see at the same time that the number of active handles is growing and the number of requests is growing. So you can solve the mystery in a way that you're not going to be able to do in the logs.

Another great strategy is using different versions of the software you're running: running a staging version against a production version, or against several recent versions, and overlaying the data on top of one another. You can see: okay, it looks like our beta version is using a whole bunch more memory than our unstable version. Being able to compare and contrast like that gives you the tools you need.

Now, drilling down to a few very practical tips. Don't abuse the histograms. I said there are three different types of metrics; use the simplest tool possible. Histograms are really only useful if you want to see the distribution of something, like request timing: maybe certain requests are happening very quickly, but others take longer. The classic mistake with histograms is using a label that's unbounded. If you had a metric tracking something per peer ID on the network, well, you might have thousands of peers, and each new peer is going to use a new peer ID. That's going to blow the metrics up, it's going to affect the performance of Prometheus, and you're not going to be able to run the queries you want to run.

So how do you know which metrics to add? I showed some of these pretty graphs, but in your own software, how do you know what you want to track? Really, what you want to do is think about questions you couldn't ask otherwise. A really good example is the size of your internal caches. Another thing to think about: if a feature, or a part of the code, would degrade or explode in the bad case, that's a prime candidate for adding metrics around it. And then just keep asking questions. If you have the data to answer the question, you're good; otherwise, keep adding metrics until you can answer it. Some examples we've come across: how often are our streams being reset? How many peers do we have? These are the kinds of questions that metrics can help answer.
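Here's a sketch of that labeling advice, using prom-client again. The bounded "method" label is fine because it can only take a handful of values; the commented-out per-peer version is the cardinality trap described above. The metric names and methods are illustrative only:

```ts
import { Histogram, Registry } from "prom-client";

const register = new Registry();

// Good: "method" is a small, fixed set of values, so the number of
// time series stays bounded
const reqRespDuration = new Histogram({
  name: "reqresp_duration_seconds",
  help: "Req/resp duration, labeled by protocol method",
  labelNames: ["method"] as const,
  buckets: [0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register],
});

async function handle(
  method: "status" | "blocks_by_range",
  fn: () => Promise<void>
): Promise<void> {
  const stopTimer = reqRespDuration.startTimer({ method });
  try {
    await fn();
  } finally {
    stopTimer(); // observes the elapsed seconds into the histogram
  }
}

// Bad: a label whose values are unbounded. Every new peer ID creates a
// whole new set of time series (one per bucket), which blows up
// Prometheus storage and makes queries unusably slow.
// new Histogram({
//   name: "reqresp_duration_seconds_by_peer",
//   help: "Do not do this",
//   labelNames: ["peerId"] as const,
//   registers: [register],
// });
```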
So now I'm going to show you a live demo, a live example of our dashboards, and we can run through some things. If anyone has any questions while I'm going through it, we can do that too.

Okay, so this is our dashboard for our fleet of beacon nodes and validators; this is the high-level dashboard. Right now we're looking at an unstable node. We can go to a stable version of Lodestar: this one is running version 1.1 on the Goerli network. We're synced, and it looks like it got restarted about a day ago. Some interesting things: the peer count is looking good; everything looks stable there. We track different types of peers, inbound and outbound, and for us what really matters is the number of peers we'll be receiving messages from continuously. Those are our gossip peers, the number of mesh peers on our core topics, staying at about six or seven. It looks like we're connected to a lot of Lighthouse nodes and a lot of Prysm nodes, and there's another Lodestar out there.

Let's drill into something a little more detailed. One area where we really struggled, and where adding a lot of metrics paid off, is our gossipsub implementation. We found, through adding a few metrics, that we weren't getting blocks on time, and it turned out we didn't have enough peers who were sending us blocks. Then we started asking questions. All right, why are they not sending us blocks? It turned out our peer scoring was too low. Well, why was the score too low? It turned out there was one specific component of the score that was too low. So you can keep asking questions, and you start building up the answers. What you're seeing here is some of what I was talking about: we break the score out into its different components, the different parameters that go into the aggregated score for a peer. What we were seeing before was a graph where the values just went down and down over time, to very negative numbers.

Another really useful set of metrics that we check a lot is things related to the VM: memory, CPU, and disk storage. It looks like our memory is stable, that's great, and GC is only taking roughly seven and a half percent. Things are looking good.
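For a sense of what can back graphs like those, here's a hedged sketch: prom-client's collectDefaultMetrics covers the Node.js runtime numbers (heap, GC, event loop lag, handles), and a labeled gauge with a small fixed label set can break an aggregate peer score into its components. The names and score parameters here are illustrative, not Lodestar's actual metric names:

```ts
import { Gauge, Registry, collectDefaultMetrics } from "prom-client";

const register = new Registry();

// Process heap, GC pause time, event loop lag, handle counts, etc.
collectDefaultMetrics({ register });

// Break the aggregate gossipsub peer score into its parameters.
// "component" is bounded: it only ever takes these few known values.
const scoreComponent = new Gauge({
  name: "gossipsub_score_component_avg",
  help: "Average contribution of each scoring parameter across peers",
  labelNames: ["component"] as const,
  registers: [register],
});

function reportScoreComponents(avg: Record<string, number>): void {
  for (const [component, value] of Object.entries(avg)) {
    scoreComponent.set({ component }, value);
  }
}

// e.g. sampled periodically from the scoring state
reportScoreComponents({
  timeInMesh: 1.2,
  firstMessageDeliveries: 0.8,
  meshMessageDeliveries: -3.5, // the one component dragging scores down
  behaviourPenalty: -0.2,
});
```

Graphing each label as its own line is what lets you see which specific parameter is dragging the aggregate score down, the way the talk describes.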
Yeah, I think that's enough of a demo. At this point I'm open to questions. And we're also hiring, so if any of this looks cool or interesting and you know TypeScript...

Hello, and thank you for the talk. I see that you have a ton of metrics, and I believe most of them are useful, but do you have some kind of alerting system, so you know when a metric has gone wrong? I think it's unfeasible to go through each of them daily or something like that; you need some automated way to know that something is going wrong.

Yeah, absolutely. That's something I didn't touch on, but it definitely feeds in. We use PagerDuty. You can set up thresholds for certain metrics that trigger either Slack notifications or Discord notifications, or, in the case of a critical issue, it will call you or text you. Yeah, we do have that set up, and I would definitely recommend it if you're running something in production.

Thank you. Again, there are lots of metrics. Did you develop over time a mental model or framework to decide where to put them whenever you have a new graph or metric, or do you just decide to put it somewhere and then go back later? To make it easy to read and browse the dashboard, if that makes sense. It's a question about organizing.

Yeah, exactly. I think you probably have to grow it organically. What we did is we started with one dashboard that grew and grew and grew, and eventually it became unsustainable to have everything piled together, so we started breaking it apart. The other issue we had initially is that we would create a bunch of metrics but not add them to the dashboard, so there was no visual cue for a lot of it, and it just went unnoticed. One thing we would definitely recommend: if you add a metric, also add it to the dashboard at the same time.

Thank you. A question about storage: do too many metrics affect Prometheus, or what does it do to storage?

Well, some of that depends on the architecture of how you're deploying this. What we do is have one Prometheus and Grafana server servicing our entire fleet of beacon nodes. We have a bunch of different machines, and then one machine for the monitoring, so it doesn't cut into the nodes' storage. We retain data for a month.

Quick question: are the metrics relevant for other EVM chains besides Ethereum mainnet, or are they specific to Ethereum mainnet?

Yeah, good question. I'd say it's about half and half. Half the metrics are more about monitoring the chain, something like a block explorer kind of thing. But a lot of the metrics are specific to the implementation, and those are the ones that I think are probably more helpful: the internal caches, and the timings of things we're really wanting to get right.

Hi, I wanted to know if you think these metrics are useful for other clients, not just the TypeScript one. And if so, would you maybe consider putting them in the public domain, so anybody can see them and apply them to new clients that might be getting built?

Right. I know that other clients do have metrics, and I think the ones that are more user-focused, the ones that look more like a block explorer, maybe with validator information like showing your balance ticking up over time, those are public. And there is an effort to standardize metrics across the consensus clients.
So I think some of that is happening. But again, as with the previous question, a lot of the metrics will always be implementation specific.

In terms of roles: which team members generally look at which metrics and graphs? Just to get an idea of which metrics and graphs are useful to which team members, depending on their role or the type of work they do.

You're asking whether only certain members contribute to these, or more about who looks at the graphs and browses the metrics themselves?

Right. For example, there are lots of metrics, so not everybody looks at everything. It's more about which team members look at which kinds of graphs.

Right. So for any of these larger feature pull requests, the person who is authoring, or owning, that feature is responsible for making the case that it's correct and that it works as intended, so they're responsible for looking at the metrics and building that case. As far as our release process goes, we have a checklist of basic things we look at before we actually release. We have a testing period: we cut a beta release, deploy it to our fleet, and watch it for a few days. After a few days we go through the checklist, plus any other ad hoc things we've been noticing.

Okay, so in web2 there is a standard for how to consume logs, metrics, and traces, run by the Cloud Native Computing Foundation, to make the same pipeline you have here more standard, in a way that's consumable by other tools as well. Did you take a look at that, and would you recommend it?

The standardization effort here is to build out some kind of baseline of functionality, so you can have one Prometheus and Grafana running and plug in whichever client you want, with some guarantees that it's not breaking and it's working.

That's exactly what OpenTelemetry does, so take a look at it.