Hey, everybody. Thanks for coming to the ETH2 client summit. I know there have been a lot of ETH2 talks, so maybe you're sick of them, but it's really special to be here. This is my first DevCon. I didn't go last year, though I really wanted to; all the rest of the Prysmatic team went. So it's truly an amazing opportunity to be here. What I want to talk about is our testnet. We've been running one for some time now, and we've learned a lot. So let's dive in.

First of all, who are Prysmatic Labs? We're a self-organized group of dedicated blockchain engineers, people who really wanted to be part of scaling Ethereum and helping it reach global adoption. There are six of us today, and most of us are here, so come say hi if you see us. If you don't know us already, we'd love to chat. What's interesting about this group is that we formed completely organically online; nobody knew anybody before any of this. We were inspired by Vitalik's wiki article about sharding and wanted to start making moves on it, because we felt that at the peak of 2017, when things were really maxing out, we needed something soon to keep the momentum going. We were one of the first teams to come together to start building sharding, and then proof of stake with Casper, as all these things merged together. It's been a pretty remarkable journey so far.

One of our biggest achievements is running an actual testnet. We started doing this back in April of this year, so we've been running it for about six months now. It's comprised of several beacon chain nodes running in different environments and different parts of the world, and it's open to anybody: anyone with a computer can join. It typically runs for several weeks or months, depending on whether we hit some fatal bug or a new spec update lands and we have to rewrite everything, which has happened many times. It's also the full end-to-end experience; nothing is mocked out here. There are real deposits coming from a real ETH1 chain, the proof-of-authority Goerli network. We've had hundreds of people join. At one point we accidentally merged with the IPFS network, so we learned not to use default libp2p parameters, because that's what libp2p does. We were wondering why 500 nodes were suddenly connected to us. That was a kind of cool thing to learn. But most of all, it's just been really, really fun. And it's actually really easy to get involved: if you want to be a validator and see what it's like to stake with the Prysm client, we have a testnet website that is super, super straightforward.

Speaking of testnets, I wanted to ask Vitalik: what is the best testnet of all time? I haven't gotten a chance to ask him yet, so this is what I think he's going to say: Prysmatic Labs, best test network on the entire planet and in the entire universe. I challenge someone here to find him, ask, and see if he actually says this. But Vitalik has run it, as far as we know, and he's really enjoyed it, as have many other people.

So the first lesson we learned is that participation has to be very easy. You have users coming from all kinds of technical and non-technical backgrounds who are interested and want to check it out.
But they're really intimidated by having to add a lot of tooling to their system just to build the project, and then running into weird errors that depend on their local environment. So we built this testnet website where, within six easy steps, you can be validating on a real beacon chain and actually seeing rewards, balances changing, and even penalties if you go offline.

The approach we took was to use Docker images, which is typically pretty easy for users to get on board with. It still requires some installation on your machine and some command-line input, but overall you don't have to spend 10 to 15 minutes building the project, as some of you may have experienced at the interop event if you tried to build Prysm for the first time; it takes a really long time. Downloading these images, which are 15 to 20 megabytes each, is a pretty straightforward approach, and we haven't had too many problems.

We also made it so that within the website you can get some test ether from the Goerli network. We have a faucet built in, and an interface where you can just paste in your deposit data. Then, with MetaMask or Portis, you send that transaction to the ETH1 chain, wait for it to be included in a block, go through the voting period and all the other steps, and eventually you get activated as a validator and start receiving assignments from the beacon chain. It has a really cool progress bar so you can see that something is happening even when nothing seems to be happening. This has been our biggest value add for users, and one of the first things we learned, because before we got here it was just running the client on your own computer, then two clients on the same computer, then on different computers. You crawl before you can walk, and now we're starting to run. So making onboarding easier has been my favorite lesson so far.

Probably the biggest lesson, though, is that when we're running multiple instances of the beacon chain in the cloud or wherever, they are really hard to monitor. As you've seen from screenshots of the interop event, when there are seven terminals on one screen it can be really hard to spot an error, or even to catch that one of those nodes isn't talking to any of the other six, which did happen. Having good monitoring in place is absolutely critical to understanding when errors are happening in your system and to finding their root cause. So let's take a look at a few strategies we use for monitoring.

From the very beginning, we built out metrics collection and monitoring on top of those metrics. Here you can see a test network that's been running for 229,000 slots. It's starting to look a little weird, because these graphs should be pretty flat, straight lines, especially things like CPU usage. On top of the metrics we set up alerting: we might say, if there hasn't been a new block in 30 seconds, fire an alarm, make your phone ring, whatever. Then, when you're woken up in the middle of the night and you see something like this, you have some clues about where to start looking.
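To make that concrete: Prysm is written in Go, and the metrics-collection pattern looks roughly like the sketch below, using the prometheus/client_golang library. The metric names here are made up for illustration; they're not our actual ones.

```go
// A minimal sketch of the metrics pattern: expose counters and gauges over
// HTTP so a Prometheus server can scrape them and a dashboard can graph them.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Incremented each time the node accepts a block; an alert can fire
	// when this stops moving.
	blocksProcessed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "beacon_blocks_processed_total",
		Help: "Total number of blocks processed by the beacon node.",
	})

	// Tracks the current head slot, handy for spotting a stuck chain.
	headSlot = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "beacon_head_slot",
		Help: "Slot of the current chain head.",
	})
)

func main() {
	// In a real node these would be updated from the block-processing
	// pipeline; here we just touch them so the example is complete.
	blocksProcessed.Inc()
	headSlot.Set(229000)

	// Expose /metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The "no new block in 30 seconds" alarm then becomes a simple alerting rule over a metric like this one, firing when the counter's rate drops to zero, rather than something you have to eyeball across seven terminals.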
Back to that dashboard. To digest it for you: on the bottom right, you can see chain reorgs happening on just about every single block. That ended up being a bug in the fork choice rule, and we had deviating balances, so nobody was agreeing on anything. We had hundreds of thousands of goroutines running at the same time, all these concurrent processes, and that led to the CPU maxing out, and this nice up-and-to-the-right line starts to fall apart. It was a nightmare scenario. So having these resources is super, super helpful.

But once you know roughly what's going wrong, how do you dive into the specific details? Maybe block production is taking longer than expected and you want to identify the root cause. For that we use a practice called code instrumentation: we annotate the pieces of the code that we think are expensive or interesting to track. Here you can see the entire workflow for a block proposal. Starting at the top, the validator client, which is a separate process from the beacon chain, sends the block to the beacon chain. Something along the way is taking 236 milliseconds where we expected maybe 50 milliseconds, so we can dive into each part and see why it took so long. This one took a long time because it was processing a lot of slots. So we might ask: was there a long run of skipped slots? Can we cache this somehow so we don't do the work redundantly? Where can we find ways to make things faster? Some of these spots you know from intuition, and you can write tests and benchmarks for them; other times you won't know until it happens in the real world. So this has been super helpful for us (there's a rough sketch of what it looks like after this section).

Speaking of benchmarking, we also do profiling. When we see 100,000 goroutines, or something taking a long time or using a lot of CPU, we'll take, say, a 30-second snapshot of the CPU profile and look at it. The visualization is called a flame graph because it looks kind of like a flame. You can dive into the most expensive parts, the pieces that took the longest, then reproduce the problem in a local test and iterate on it to make it better. That's been super helpful too.

Another key thing, and this one is fairly recent, is log ingestion. When we're running multiple processes and trying to understand errors, say users are filing bugs that signatures are not verifying, we can query a central repository of logs, see at what frequency an error is happening, and check whether we really solved the problem. Here you can see an error count that should be zero, because it's an error we don't want to have, but it spikes at specific times, and maybe we can correlate that with other interesting data to identify the root cause. It's usually easy to make a problem go away; really understanding what the actual problem was is how you get the best solution.

Lastly on monitoring, we do something called canary analysis. The testnet is running almost all the time, and we want to push code out as frequently as possible without catastrophically breaking it. So we take the new binary we've just cut, the new image we've just created, run it alongside the existing binary, start them at the same time, and watch for major regressions. In this case, we see that memory usage is up by 56 percent, which is kind of concerning, and the canary report got a score of 77. Based on your tolerance, you might fail the release immediately or require manual approval before pushing it out. Our auto-approve criterion is 90, so anything almost perfect goes through; anything above 50 we look at; anything below 50 stops immediately, and we'll take a look at it later. That's been really great at keeping us from tearing down our testnet by accident.
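Here's a rough sketch of that code instrumentation step in Go, using OpenCensus-style spans. The function and span names are illustrative, not Prysm's actual code; in real use you'd also register an exporter (Jaeger, Stackdriver, etc.) so the spans end up somewhere you can view them.

```go
// Span-based instrumentation: wrap the proposal flow in a parent span and
// the expensive steps in child spans, so a trace viewer shows exactly
// where the 236 ms went.
package main

import (
	"context"
	"time"

	"go.opencensus.io/trace"
)

// processSlots stands in for the expensive state-advance step.
func processSlots(ctx context.Context, fromSlot, toSlot uint64) {
	_, span := trace.StartSpan(ctx, "state.ProcessSlots")
	defer span.End()
	// Record how many slots were advanced, e.g. to spot long skip-slot runs.
	span.AddAttributes(trace.Int64Attribute("slots", int64(toSlot-fromSlot)))
	time.Sleep(10 * time.Millisecond) // placeholder for the real work
}

// proposeBlock is the parent operation; child spans nest under it.
func proposeBlock(ctx context.Context, headSlot, proposalSlot uint64) {
	ctx, span := trace.StartSpan(ctx, "validator.ProposeBlock")
	defer span.End()
	processSlots(ctx, headSlot, proposalSlot)
}

func main() {
	// A proposal after a long run of skipped slots, the slow case above.
	proposeBlock(context.Background(), 100, 164)
}
```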
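For the profiling step, Go has this built in: importing net/http/pprof registers profiling endpoints on a live process, and you can pull a 30-second CPU profile from it. A minimal sketch:

```go
// Expose Go's built-in profiler on a local port. The blank import registers
// the /debug/pprof/* handlers on the default HTTP mux.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Then something like `go tool pprof -seconds 30 http://localhost:6060/debug/pprof/profile` grabs the 30-second snapshot, and the tool can render it as the kind of flame graph shown on the slide.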
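Log ingestion only works if the logs are structured enough to query. A sketch of the idea with the sirupsen/logrus library; the field names are illustrative, not our actual schema:

```go
// Structured logging: with consistent, machine-readable fields, a central
// log store can count how often a given error fires and when.
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// JSON output so the ingestion pipeline can index fields.
	log.SetFormatter(&log.JSONFormatter{})

	// Attach queryable fields instead of burying details in a string.
	log.WithFields(log.Fields{
		"slot":   182340,
		"pubkey": "0xabc123",
		"reason": "signature did not verify",
	}).Error("Could not process attestation")
}
```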
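And a toy sketch of the canary scoring idea, using the thresholds from the talk. This is entirely hypothetical; a real canary system compares many metrics statistically, which is how a single +56% memory regression can still land at an overall score of 77.

```go
// Toy canary gate: score a single metric's regression and apply the
// 90/50 auto-approve and fail thresholds.
package main

import "fmt"

// metricScore is 100 with no regression and decays as the canary gets worse.
func metricScore(baseline, canary float64) float64 {
	regression := (canary - baseline) / baseline
	score := 100 - regression*100
	switch {
	case score > 100:
		return 100
	case score < 0:
		return 0
	}
	return score
}

func main() {
	score := metricScore(1000, 1560) // canary memory up 56%
	switch {
	case score >= 90:
		fmt.Printf("score %.0f: auto-approve\n", score)
	case score >= 50:
		fmt.Printf("score %.0f: hold for manual review\n", score)
	default:
		fmt.Printf("score %.0f: fail and roll back\n", score)
	}
}
```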
Okay, and this is something we really learned pretty recently. Our team is distributed across the world; nobody lives in the same city, and it's uncommon that we're all in the same time zone. Actually, being here at DevCon, we're all in the same time zone for once. So this monster that comes out at night is even scarier, because what actually happens is that when all of us in North America go to sleep, the testnet starts to fall apart, and the people over in Asia are stuck with it, kind of on their own, maybe without the knowledge we had picked up that day. We go to sleep, we try to hand things off, but it's always when everything sounds really good and really nice, when you least expect it, that things start to fall apart. When we landed here for DevCon on day one, all in the same time zone at last, we woke up in the morning to a testnet that had been stuck for two hours with no one around to see it, and we couldn't fix it in time; it was an unrecoverable amount of time since finality, so we ended up just restarting it. The lesson is that someone should effectively be awake 24/7, or, the actual solution we went with, alerting that goes to your phone, passes through do-not-disturb, and wakes you up in the middle of the night, so that when finality hasn't happened in 100 epochs, someone is there looking at it, because that is a really, really bad thing to happen.

So lastly, why did we run a testnet so early? Like I said, we've been doing this since April, before we even had a networking spec; we kind of made one up to get our clients talking to each other, knowing it wasn't going to be final and that we'd have to rewrite everything in that regard. So why do it? We learned a lot of things beyond those few lessons. We were able to find race conditions, which you can check for with Go's race detector, though it's not always perfect (there's a small example of the kind of race it catches after this section). One of the biggest finds has been long-lasting memory leaks, things that grow over many days: we'll have a binary that runs for maybe a week, and we'll watch memory usage slowly, slowly creep up. You never see that on your laptop, and you never see it in a unit test, so catching these kinds of things is super helpful.

We also catch UX problems. Users come to us and say, hey, I don't understand why you won't let me validate, what's going on? That's been great for making the client more usable for everyone, because we learn what frustrates users and what they want. And we've even run into aggregation issues. One recent thing we learned is that you can't merge two aggregated attestations whose participation bits overlap. We were trying to aggregate those, and they were failing, but failing silently for some reason, and we were getting an inconsistent graph of finality. That's something we only caught by running the test network for a few weeks.
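To make that aggregation lesson concrete, here's a simplified sketch, not Prysm's actual types: two aggregates can only be merged if no validator appears in both participation bitfields, because otherwise the merged signature would count the overlapping validators twice and fail to verify.

```go
// Check whether two aggregation bitfields overlap before merging them.
// Assumes equal-length bitfields, one bit per committee member.
package main

import "fmt"

func bitsOverlap(a, b []byte) bool {
	for i := range a {
		if a[i]&b[i] != 0 {
			return true // some validator is in both aggregates
		}
	}
	return false
}

func main() {
	agg1 := []byte{0b00001111} // validators 0-3, already aggregated
	agg2 := []byte{0b00111100} // validators 2-5: overlaps on 2 and 3

	if bitsOverlap(agg1, agg2) {
		fmt.Println("cannot merge: overlapping aggregation bits")
	}
}
```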
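And going back to the race conditions mentioned a moment ago: the Go race detector catches patterns like the one below when you run `go test -race`. This is a generic illustration, not Prysm code.

```go
// race_test.go: a minimal data race that the race detector flags at runtime.
package demo

import (
	"sync"
	"testing"
)

func TestConcurrentWrites(t *testing.T) {
	var balance int // shared state, no synchronization
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			balance++ // racy read-modify-write; `go test -race` reports this
		}()
	}
	wg.Wait()
}
```

The catch, as the talk notes, is that the detector only reports races that actually occur during a run, which is why it's "not always perfect" and why a long-running testnet still surfaces races that unit tests miss.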
Oh, and lastly, testing gaps: things we can do to make the process better, like the canary analysis and other long-lived tooling, to make the code safer, to let us cut releases more frequently, and to make everything a lot more secure. So that's it for me. Thanks, everybody.