All right, so the second part of the master-is-shippable strategy was eradicating, I'm never able to say this word, getting rid of the flaky tests. What do I mean by a flaky test? A test is flaky if it passes sometimes and fails at other times, without any noticeable change to the code or to the environment of the test: it fails, you run it again, and it passes. The whole purpose of having an automation suite in your CI is to make sure you get a consistent, reliable quality signal, and that it detects bugs early in your cycle. If you have flaky tests, that corrupts the entire suite, because now you can't trust that signal: you don't know whether there is a genuine failure in the product or a failure in the test.

So this was the other thing we wanted to deal with: not only should the tests take less time to run, like unit tests, but when we run them and see the results, we should believe the results. And if we believe the results, we take action, in terms of detecting bugs early on. It also gives developers confidence to make changes rapidly, because they know the system will catch problems, and will catch them reliably. So that was the basic idea.

Remember the old reliability system we had, where we'd run the tests in a loop, 500 times, on a good build? We repurposed that system and started running it for the L2 tests. These are the new tests we are writing, and we want to make sure the quality of the new L2 tests is really good from the ground up, so we don't end up in the same situation we were in with the older tests. The way the system worked, it would pick a green build and do the reliability run on the L2 tests. If a test fails in any of the runs, we say that test is flaky; it needs to pass every single time. If it's flaky, you file a bug and expect the team to fix it. It shows up in the team scorecard.

With a significant push, we improved the reliability of that L2 run to well over 90%. Now remember, these are new tests. And even with that system, we couldn't get them to pass 100% of the time, because a test would just fail every now and then. The effort to flag bad tests, file bugs, and ask teams to fix those bugs worked to an extent, but it didn't really get us to 100%, or even close to it.

The old system had a few flaws. The main flaw was that the reliability signal coming out of the reliability run was not used immediately. It was used to file a bug, and then at some later point an engineer would go and fix the bug, so there was no natural incentive to fix the system as it goes. We fixed that in the next crank of the solution. This is kind of the details of the algorithm, but fundamentally what we did is the following. We said the flakiness signal from the reliability run needs to flow immediately back into the official run. We take that signal, we know this test is flaky, and we immediately mark the result of that test as flaky in the official run. We don't remove the test from the run; we just mark its result as unreliable. So what the CI run does is ignore the result of that test and give you a cleaner signal. And the system immediately files a bug, the bug gets surfaced to the engineer, and the engineer is expected to fix it.
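To make the mechanics concrete, here is a minimal sketch of that feedback loop: a reliability run on the side flags flaky tests, and the official CI run ignores (but still records) their results. All the names here (run_test, file_bug, the 2% failure simulation) are hypothetical stand-ins for the real tooling, not the actual system described in the talk.

```python
import random
from dataclasses import dataclass

@dataclass
class TestResult:
    test_id: str
    passed: bool

def run_test(test_id: str, build: str) -> TestResult:
    # Hypothetical stand-in: simulate a test that fails ~2% of the time
    # even on a known-good build, i.e. a flaky test.
    return TestResult(test_id, random.random() > 0.02)

def file_bug(test_id: str) -> None:
    # Stand-in for filing a bug that surfaces on the owning team's scorecard.
    print(f"bug filed: {test_id} is flaky")

def reliability_run(test_ids, green_build, iterations=100):
    """Run every test in a tight loop on a green build, on the side.

    A test must pass every single iteration; one failure marks it flaky
    and files a bug immediately.
    """
    flaky = set()
    for test_id in test_ids:
        for _ in range(iterations):
            if not run_test(test_id, green_build).passed:
                flaky.add(test_id)
                file_bug(test_id)
                break  # one failure is enough; no need to keep looping
    return flaky

def official_verdict(results, known_flaky):
    """CI verdict: flaky tests still run, but their results are marked
    and ignored, so they cannot turn a good build red."""
    genuine = [r for r in results
               if not r.passed and r.test_id not in known_flaky]
    return ("red", genuine) if genuine else ("green", [])

# Example: flag flaky tests on the side, then judge an official run.
flaky = reliability_run([f"test_{i}" for i in range(20)], "build-1234")
verdict, failures = official_verdict(
    [run_test(f"test_{i}", "build-1235") for i in range(20)], flaky)
print(verdict, [r.test_id for r in failures])
```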
But now, because the CI signal is very clean and you're getting these bugs from the reliability run, engineers started fixing those bugs in a much more expedient manner. The new approach worked: as you can see, when we were trying to stabilize the old L2 tests, we were never getting close to 100% pass rates across all L2 tests. With this new system, the CI signal became very green, because as soon as we find a test that is flaky, it's just taken out of the results. Yes?

I'm sorry, I think you asked me something: what is this reliability run and official run, is that a run that you...? So the reliability run is done on the side. The official run is our CI, the batch CI runs that are happening in master. In parallel, we take one of the builds that had a good run before and run our L2 tests on top of that build in a loop, on the side. It doesn't change what's happening in master; people keep working in master. On the side, we just run our L2 tests in a loop on a good build, and any time a run fails, we mark that test as flaky. So that's what I mean by a reliability run: it's just happening on the side, as a way to identify which tests are flaky. Because when you run them in a loop, 50 times, 100 times, and they fail, you know that in an official run in master they might fail at some point in the future.

What could be the possible reasons that a test is flaky? Could it be because of the data? It could have weird sleep patterns, where the sleep is sometimes sufficient for a particular call to complete and sometimes not. Just as your product can be susceptible to resiliency issues, your test can have a resiliency issue. It could have made assumptions about the environment, or not be properly isolated. It could be any number of things. But the key is that we identify those flaky tests on the side, by running them continuously in a loop, instead of waiting for a test to fail in an official run. Because when you wait for a test to fail in an official run, first of all, you may have to wait longer. Second, it's too late by then, because it's already failed that particular build, and you don't want that to happen. So we try to detect flakiness as early as possible by running the tests in a tight loop. Okay, thank you.

So this is what our pipeline looks like in action. As you can see, a CI build comes out, and we run N test runs. Most of them pass 100%. Earlier you saw a diagram where they were all green, which happens most of the time, but sometimes it doesn't. Typically it looks like this: mostly green, with a couple like this one at 99.84%, where a handful of tests failed. But again, those get caught in the reliability run, get marked flaky, and get removed from the official test results. So the key takeaway here is that you have a CI signal that is extremely fast and reliable. You can trust your CI signal now.

And you can see some of the times it takes to do this. PR to merge today takes about 30 minutes. You asked whether we run L0 and L1 tests: yes, we run about 60,000 tests in the PR, and we do about 600 PR builds a day.
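The "weird sleep patterns" mentioned above are the classic flakiness pattern: a fixed sleep racing against an asynchronous operation. Here is an illustrative sketch, contrasting the flaky pattern with a polling fix; the FakeJob class and all the timings are invented for the example, not taken from the talk.

```python
import random
import threading
import time

class FakeJob:
    """Hypothetical async operation whose completion time varies with load."""
    def __init__(self):
        self.done = False
        threading.Timer(random.uniform(0.02, 0.07), self._finish).start()

    def _finish(self):
        self.done = True

def flaky_test():
    # Flaky pattern: a fixed sleep that is usually, but not always, enough.
    job = FakeJob()
    time.sleep(0.05)           # race: pass or fail depends on timing
    assert job.done

def reliable_test():
    # Reliable pattern: poll for the condition under a generous deadline,
    # so timing variance can't flip the verdict.
    job = FakeJob()
    deadline = time.monotonic() + 1.0
    while not job.done and time.monotonic() < deadline:
        time.sleep(0.005)      # short poll interval, long overall timeout
    assert job.done

# A tight loop, as in the reliability run, exposes the flakiness quickly,
# whereas a single official run would only fail occasionally.
failures = 0
for _ in range(50):
    try:
        flaky_test()
    except AssertionError:
        failures += 1
print(f"flaky_test failed {failures}/50 runs; reliable_test never should")
reliable_test()
```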
You can see some of the other stats: merge to CI is about 22 minutes, across about 3,000 projects. Merge to self-test, the first test suite that runs all our P0 tests, takes about 58 minutes, so within an hour you get an answer on whether your commit passed the P0 runs. And within two hours, you get the answer on whether it passed all our tests. Within two hours, whereas in the past you had to wait two days, and even then you couldn't get a reliable signal out of the system, because remember, there were all those failures. Now, within two hours, you know that the change you made is good. That gives you tremendous confidence to just keep pushing more and more changes through the system, and that's really the key takeaway here. Yes? Do these happen in sequence? No, these runs all happen in parallel.

Talking about the scorecard, this is what our team scorecard looks like. The types of things we track here are livesite health, engineering health and debt, and velocity; we track both quality and the debt you're carrying. So things like the bug cap per engineer, which I think Aaron might have talked about, show up in the team scorecard. In the case of livesite, we're interested in our time to detect, our time to mitigate, and how many repair items a team is carrying. A repair item comes out of a livesite incident: you do a retrospective and identify a set of work to prevent that incident from happening again. We track those to see that teams are closing their repair items within a reasonable amount of time. Notice there are no individual names on this scorecard, just team names, and a metric shows up as red if you're outside its allowed threshold.

Yes? Are there any performance tests as part of those 60,000 tests? Yeah, there are performance tests. But our overall philosophy on performance has evolved over time. If we know we're changing a part of the system that tends to have significant performance implications, we'll write tests and run them in the lab, but we do a lot of performance data collection in production. We collect a lot of performance metrics, and Tom will talk about some of that tomorrow in his talk about telemetry. Because there is no place like production, we get a lot of data just from customers running real workloads.

Can you show us any of these tests in action, like a demo or something? Tests in action... sorry, I don't have that as part of my deck; maybe we can do that offline. Yes, thank you. What about security testing? I'm sorry? Our security tests. All the tests are here. It's not that they're missing; when I talked about the other chart, that was just the team scorecard, tracking different metrics of how well we are doing. But when I talk about those 60,000 L0 and L1 tests and another 4,000 L2 tests, that includes all the tests: performance tests, security tests, localization tests, everything. There are no other tests. Like I said, the core principle we have is that if you write a test, it must run in our CI run. There are no tests run after the fact.
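To illustrate the scorecard mechanics described above, here is a toy sketch of how a metric cell might turn red when a team is outside its allowed threshold. The metric names and threshold values are invented for the example; the talk doesn't specify them.

```python
# Hypothetical thresholds: any value over the limit turns the cell red.
THRESHOLDS = {
    "time_to_detect_min": 15,    # livesite: minutes to detect an incident
    "time_to_mitigate_min": 60,  # livesite: minutes to mitigate an incident
    "open_repair_items": 5,      # retrospective work items still open
    "bugs_per_engineer": 4,      # engineering debt: the bug cap
}

def scorecard_row(team: str, metrics: dict) -> str:
    # Team names only: the scorecard never calls out individual engineers.
    cells = []
    for name, limit in THRESHOLDS.items():
        value = metrics[name]
        status = "RED" if value > limit else "green"
        cells.append(f"{name}={value} [{status}]")
    return f"{team}: " + ", ".join(cells)

print(scorecard_row("Team A", {
    "time_to_detect_min": 9,
    "time_to_mitigate_min": 75,  # over threshold -> shows up RED
    "open_repair_items": 2,
    "bugs_per_engineer": 3,
}))
```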