So, hey, I'm going to be talking about the quest for the best tests today. We're going to look at all the different testing tools that we used, and I'm going to be talking about the bugs that we didn't find on mainnet. If you were here two talks before mine, Marius spoke about the bugs that we did find, so I'm going to do the contra.

For the first part, let's talk about what we're actually doing with testing. Why is the merge complicated to begin with? Well, we have roughly 20 different client combinations, and regressions sneak in very easily. You might see a regression in one of the client combinations but none of the other ones, and that's very tricky to pinpoint. The specs were in active development; there were quite a few of the early testnets where we didn't pin the spec version, so someone was implementing something on a different commit, and of course the testnet broke. So we had to actually figure out how to deal with modifying specs, along with modifying testnets, along with 20 different combinations, all changing at the same time.

Communicating and debugging: it's great that we have a decentralized environment, and it's horrible if I have to wait for the Australians to wake up for anything. It does take quite a bit of effort and quite a bit of planning on our side. We had to schedule around people picking up their kids from school, around people waking up in Australia, around Americans, a lot of different things. Figuring out how to do all of this in a reliable manner and on a timeline was crazy.

The last one was debug knowledge. We were really surprised: the types of debugging you need to do for CLs and ELs are totally different. We had to see how to bring all of that competence into one place, and how to actually figure out when something goes wrong, and things did go wrong a lot in the beginning. The nice part is, once you figure this out, you've figured it out for future testnets. So I'm happy that worked out. What could possibly go wrong?
So the merge has two parts, the consensus layer and the execution layer, and they communicate via this thing called the engine API. If we mess up the consensus layer, you're going to have a network that can't agree on anything; if we mess up the execution layer, you have a network that can't do anything. Those are the two high-level problems, and we're trying to ensure that neither of them ever happens.

Just to enumerate a bit from the regular testing world: what sorts of tests even exist? We have unit tests, and at least in our decentralized world, the client teams take care of these themselves. We don't have to do anything; no coordination work from our side. These make sure that there's no regression, but they're localized regressions, things a team might have seen on their client based on how they've built it.

We have integration tests. Part of them are done by the client teams; for example, the Nimbus team spins up a local Nimbus network to make sure that their stuff can always talk to each other. Then at a higher level we do some interop tests, so these might be devnets or whatever else.

Then we have system tests, and this is where external coordinators come in; I'd say the EF was part of the external coordination for the testing efforts. These test end-to-end functionality: we spin up a testnet, we run transactions, we withdraw, we set up faucets.
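To make the engine API idea above a bit more concrete, here is a minimal sketch of the two calls a consensus client makes to its execution client, based on the public Engine API spec. The port, the hash values, and the helper names are illustrative assumptions, and a real exchange is authenticated with a JWT secret, which I'm leaving out here.

```python
import json

# Default authenticated engine port; an assumption for illustration.
ENGINE_URL = "http://localhost:8551"

def new_payload_request(payload: dict, request_id: int = 1) -> str:
    """Build the JSON-RPC body the CL sends the EL to validate a new block."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "engine_newPayloadV1",
        "params": [payload],
    })

def forkchoice_updated_request(head: str, safe: str, finalized: str,
                               request_id: int = 2) -> str:
    """Build the forkchoice update telling the EL what the CL considers head."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "engine_forkchoiceUpdatedV1",
        "params": [
            {"headBlockHash": head,
             "safeBlockHash": safe,
             "finalizedBlockHash": finalized},
            None,  # no payload attributes: we are not asking the EL to build a block
        ],
    })
```

If either side mishandles these two messages, you get exactly the failure modes above: a chain the CL can't agree on, or blocks the EL can't execute.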
We look at whether explorers work, and so on. And then we have production tests. This is something you might have seen with the shadow forks. Production tests work on a prod-like environment; I'll go into what shadow forks are later on, but at a high level we inherit all the complexity of mainnet. This also includes public testnets, so if you remember Kiln or Kintsugi, those fall under this bucket. You have everyone all over the place testing their things, layer-two tools deploying over there, random DeFi protocols deploying over there, and these find issues that only happen under real-world workloads; you probably can't simulate them any other way.

For the second part, I'm going to talk about the different tools we had and give you a brief overview. Just to tell you how this is going to go: once you have an overview of what sorts of testing tools we have, I'm going to talk about what we actually did with them and what we didn't find.

The first one, starting at a high level: we have spec tests. That's the great thing about the consensus layer; they have an executable spec, specifications with a ton of tests. Client teams can then import these spec tests and run them in their local CI. This means whenever they're making a release, at least you know that it's coherent with the spec. This is largely a sanity check; it's not meant to find any massive bug, but it ensures that no regressions happen. We currently have the spec tests running every night on a new CI machine.

The second one is hive tests.
You might have seen these referenced a couple of times. Hive tests run with a simulator: they essentially start up the clients and then run tests against a predefined interface. There are a couple hundred tests, and they take anywhere between, I think, a day or two to run everything. A brief example of how this works: it starts up a tiny instance of a Nethermind node, sends it terminal blocks, and asserts how the transition happens. This is a lot of awesome work by Mario; he should be somewhere in the crowd, so shout out to him, and follow him on Twitter. We found a lot of edge cases with this, and once we do find an edge case, it always gets added here, so that we make sure it doesn't happen again in future updates, if people check the website, of course.

Then we have this thing called Kurtosis. It's an external tool that we're working with, and Kurtosis abstracts away all the complexity of setting up a testnet. You don't have to worry about how genesis works; you don't have to worry about what format Nethermind needs its genesis file in, nothing. You just define it in a YAML file.
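As an illustration of that YAML-driven flow, a network definition could look roughly like the fragment below. The field names here are made up to show the general shape of such a config, not the exact Kurtosis schema:

```yaml
# Hypothetical network definition: field names are illustrative only.
participants:
  - el_client: nethermind        # execution client image to run
    cl_client: lighthouse        # consensus client image to run
    count: 5                     # five-node network
network_params:
  seconds_per_slot: 12
  terminal_total_difficulty: 100 # merge early so the transition is exercised
```

The point is that the person writing this never touches a genesis file; the tool generates the per-client genesis artifacts from one declarative description.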
You say "I want a five-node network with this Docker image," and that's what happens for you. We actually run this nightly, and I think we've gone through the merge at least a couple hundred times this year, a lot of them with issues, but a lot of them without, which is great to see. We also use this to rapidly iterate on ideas: once there's a new spec version, or if we want to try out a new testing tool, we throw it in there first. We also had mev-boost integrated into Kurtosis, so we could test it out and do a lot of cool things there. The general view of Kurtosis is that it checks the happy case; if you can't figure out the happy case, there's no point checking the rest of it. It just starts up a testnet and makes sure everything is fine, and if it's fine, it shows you green on your CI.

The next one is sync tests. There's no point in a network if you have nodes that can't sync up to it. So what we do is spin up nodes, I think on a week's notice right now; they sync up to the head of the chain and then assert whatever we specify over there. If you notice over here, you can say: is execution healthy, is consensus healthy, are both synced, are they both reaching head? You can define what type of syncing you want to do here, and as you can see on the right, there are a ton of different options you can use and a ton of networks. The cool thing is you can also assert bad cases: you can say start your EL, stop once it reaches head, then start your CL, and build weird scenarios like that. Shout out to Sam for building this.
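The health assertions a sync test makes boil down to polling both layers' standard APIs. Here is a small sketch using the execution layer's `eth_syncing` JSON-RPC method and the beacon API's `/eth/v1/node/syncing` endpoint; the URLs and the helper names are assumptions for illustration.

```python
import json
from urllib.request import Request, urlopen

def el_synced(rpc_response: dict) -> bool:
    # eth_syncing returns False once the execution client is at head;
    # otherwise it returns a sync-progress object.
    return rpc_response.get("result") is False

def cl_synced(api_response: dict) -> bool:
    # /eth/v1/node/syncing reports an is_syncing flag on the beacon node.
    return api_response["data"]["is_syncing"] is False

def check_node(el_url: str = "http://localhost:8545",
               cl_url: str = "http://localhost:5052") -> bool:
    """Assert both layers are healthy and at head (performs real HTTP calls)."""
    body = json.dumps({"jsonrpc": "2.0", "id": 1,
                       "method": "eth_syncing", "params": []}).encode()
    req = Request(el_url, data=body,
                  headers={"Content-Type": "application/json"})
    el = json.load(urlopen(req))
    cl = json.load(urlopen(f"{cl_url}/eth/v1/node/syncing"))
    return el_synced(el) and cl_synced(cl)
```

The "bad case" scenarios mentioned above are then just different orderings of starting, stopping, and re-checking these two conditions.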
He was also running it and sending out a summary; I think he even presented it once on AllCoreDevs. So we could make sure that at least when we're making releases, and when we merged, we could sync the network.

This is the meaty one: we had testnets and shadow forks. A shadow fork and a testnet help us coordinate all the client teams in one place and check compatibility; largely, we take whatever assumptions we have in the spec and assert whether those assumptions are true. At a high level, what we're doing is taking the genesis configuration of any one network and modifying a couple of values here and there. What happens when those modified values are hit is that we split away from the main network, but we continue staying connected on the gossip network. So we're importing all the transactions and we have the real load, but it's just a side fork. It runs parallel to the main network, and no one cares about it except for us, because we can find dozens and dozens of bugs there. This allows us to stress all of our assumptions, and we've done a lot of them; I have a summary of how many we actually did in the end. I sort of look at this as a release test: it's one of the last things we would do on any future fork, when we're almost at the point of figuring out, are we ready to go ahead with this? Are we ready to commit to this fork? Are there any unknowns that we don't know about yet?

And then we have fuzzers and an external organization called Antithesis. Antithesis is a deterministic hypervisor that allows us to perform network splits, packet loss, all sorts of really weird edge cases. We don't necessarily expect the network to be put in these edge cases, but if it is, we know that the clients can handle it. On the right side, you can't even actually see it, but each of those is a 256-thread machine; we have three of them, all running fuzzers. It was insane. We had way more.
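To make the shadow-fork genesis tweak described above concrete: conceptually, you copy a live network's config and override a value so your nodes fork off early while staying on the same gossip network. This sketch assumes a geth-style genesis JSON and uses the terminal total difficulty as the modified value; the function name and the numbers are made up.

```python
import copy

def make_shadow_fork_config(live_genesis: dict, new_ttd: int) -> dict:
    """Copy a live network's genesis config and override the merge trigger.

    Lowering terminalTotalDifficulty makes our nodes hit the transition
    (and split away from the live chain) before the real network does,
    while they keep importing the real network's gossip traffic.
    """
    shadow = copy.deepcopy(live_genesis)  # never mutate the live config
    shadow["config"]["terminalTotalDifficulty"] = new_ttd
    return shadow
```

Because the fork only diverges at the overridden value, everything before it, including the full mainnet state and transaction load, comes along for free.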
I've never seen that much compute in my life. We actually had the IBM data center go out and buy more CPUs because we bought up all of them. Various fuzzers, various teams running them, all of it super cool. We found a ton of bugs; refer to Marius's talk if you want to know what we actually found there. And we hope that some of those bugs change the spec to make it a lot more stable, or sometimes it's an implementation issue.

To give you a brief idea of the testing lifecycle, let me tell you, it's never that clean, but this is what we hope to achieve. You have the client releases happening, and once the client releases happen, they go into the integration tests; we make sure that testnets can be run and that the integration tests work fine. Once that's done, we move on to testnets, and then we move on to stress tests, and we push whatever we find in those stress tests into the specifications. We do regression tests, and then we do fuzzing, and hopefully whatever we find in the last couple of stages goes back into client releases. That's a nice way to look at a full lifecycle of how testing would work. We're hoping to adopt the same lifecycle for future testnets; maybe we move a tool around here or there, but that's the general idea.

And the endgame: what did we actually do? It was really hard to find a graph that actually fit all the testnets we had; that's how many. We started off in April of 2021 with Rayonism, and since then we've had four public testnets that anyone could permissionlessly join, six devnets meant for all the client teams, five Goerli shadow forks, and in the end 13 mainnet shadow forks. After all of this, we had three testnet forks, and only then did we hit mainnet. The whole number of testing hours in the merge is insane; I'm quite sure if we added it all up, it would be at least a couple of tens of thousands of hours put into this.

So, what didn't we find?
This is a really interesting part, because even though we have all of these cool tools, there's still always going to be something we didn't find. Seeing 99% participation, 98% participation, or great blocks being produced is awesome, but I want to know what we still didn't find.

The first one: we had in-memory databases that were too small to process mainnet blocks. It just so happened that we did too good a job of deciding which machines run our testnets, which means we didn't have any resource constraints. We missed the people running nodes on 8 GB RAM machines, or people running on 16 GB RAM machines; we didn't account for that, or for older kinds of RAM being used, whatever it is.

Another one that we missed is non-optimal block production. We were super focused on making sure that we didn't see any zero-transaction blocks on the network, but we didn't compare that to what the optimal blocks on the network could be. It is an optimization problem, of course, and it's not great that on mainnet we have some reduced load, but it is something we missed and something we probably should look into in the future.

The next one: there was a really specific way in which the terminal blocks could arrive that broke Nethermind, by causing missing receipts. And we only noticed that issue when there were deposits being made. So we completely missed this, and we've still yet to figure out why, but this was mainly an issue for the Lodestar and Nethermind combination and no other; at least we didn't get the memo from other breaking combinations.
So Lodestar and Nethermind figured that out on day one of the merge, and I think it was patched about a day later.

Another one that we didn't think of: none of the shadow forks had any nodes syncing up to the network after the shadow fork was done, which means they weren't serving sync data; all they were doing was keeping up with the chain, which is great, but constantly serving syncing peers on mainnet also adds load on the machine. So we weren't accurately simulating that, and potentially on future shadow forks we're going to have to add a bunch of syncing nodes that start up later on, to see what could happen there.

And the last one that we didn't find was the failover beacon node scenario. A lot of people are really obsessed with making sure they don't miss a single attestation, which means they have multiple beacon nodes set up, and that's something we just never did. A lot of the requests were only being sent to the primary beacon node and not the backup, so when the failover happened, the backup wasn't ready. I think this has also been patched by now, but it was a really tricky thing, and we should try more failover and backup scenarios in the future, I guess.

One cool thing about why all of this wasn't found: most of them are optimization bugs. As you can see, much still went fine, but there's still some room for improvement. And it's a really hard trade-off to make: do we want to spend more dev time fixing these optimizations, or do we want an earlier merge? That's the tricky thing. Do we want earlier forks, or do we want completely bug-free forks? No idea. But we're definitely going to be adding resource-constrained machines to the mix next time; we're thinking of just getting a bunch of Raspberry Pis and doing some shadow forks on there.
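The failover scenario described above reduces to logic like the following sketch: a validator client walks its list of beacon nodes, primary first, and uses the first healthy one. The function names are hypothetical; the bug we missed was that the backup was never exercised (and so never warm) until the moment it was actually needed.

```python
from typing import Callable, Sequence

def pick_beacon_node(nodes: Sequence[str],
                     is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy beacon node, trying the primary first.

    The untested path on mainnet was the case where the primary fails:
    if all traffic only ever went to nodes[0], nodes[1] could be
    unsynced exactly when this fallback fires.
    """
    for node in nodes:
        if is_healthy(node):
            return node
    raise RuntimeError("no healthy beacon node available")
```

A test for this would deliberately kill the primary mid-epoch and assert the backup can serve duties immediately, which is exactly the scenario we never ran.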
We'll see what happens there. Another reason some of the bugs appeared on mainnet but didn't appear on the shadow forks is last-minute commits: the last shadow fork was, I think, a week before the merge, and the last releases also happened a few days before the merge. So yeah, it's a hard trade-off.

If you want to join the testing efforts, please reach out to Mario Vega; his email address is here, so send in your information there. And I think... okay, I was supposed to have one more slide, but I'll just talk about that other slide over here.

We have a bunch of testing tools that we completely open sourced: if you're someone who wants to run testnets on Kubernetes, if you're someone who wants to just run Kubernetes nodes, if you want to set up your own networks, we have easy ways to set up genesis. You can use the same stuff we use for the shadow forks; all of it is completely open source. There are quite a few organizations that are actually reusing our stuff, so shout out to them. Contribute back whenever you can; you can find most of the tools either on the ethereum GitHub repo or on the ethpandaops GitHub repo, one of the two.

And yeah, that's about it. That's most of what we did and did not find on the merge. Thank you.