All right, so that brings me to this last point. This is the third part of our big shift on the core strategy, and that is: there is no place like production. I talked about how we de-emphasized running functional tests in the lab environment. At the same time, we understood that this kind of testing has a place, and that place is production in particular, because the production environment is unique. First of all, the full breadth and diversity of the environment you have in production you simply cannot replicate inside the lab, and you get real workload in production, which is also very hard to replicate in the lab. We used to put pretty significant effort into, for example, performance testing in the lab, and we found that most of the time we were just dealing with results that were very noisy and didn't really give us a good indicator of what the actual performance was going to be. We would take some metrics and measurements in the lab and then extrapolate what that would look like in production. It was not even a good system. So we rely on production. Now, when we talk about production testing, people go, whoa, what do you mean? Literally, you run tests in production? And the answer is yes, we do run tests in production, but testing in production is really two things. The first is a set of practices that safeguard production: the safe deployment practices, exposure control, feature flags, and some of the stuff you heard in Buck's talk that I will go into in more detail. Those are practices that allow us to deploy change in a progressive manner in production. That's part of production testing. Then you have telemetry. Telemetry is test results: failures, exceptions, performance data, security. Buck talked about how we detect certain security issues in production.
These things are, in a sense, tests running in production. So that's what I also include as part of what happens in production: simulating failures. The production environment is interesting. First of all, it's changing all the time, and failures are happening all the time. Take a service like VSTS: we've got so many dependencies. We have SQL Azure, we have DNS, we have AD, we have storage. Each of those dependencies gives you three nines of availability. So if your system was designed such that if SQL Azure failed, you fail, well, you're never going to get to three nines, because when you combine the availability of all the other dependent systems, you simply cannot achieve that. That's why you build all the resiliency mechanisms into the product that Buck talked about. But how do you know those mechanisms will actually work? You have to test in the production environment. You need to see that the fallback mechanism you've designed actually works. You need to see that a failure that starts in one subsystem doesn't cascade and become a major catastrophe for the entire product. These are the types of things we do in production with what we call fault injection testing, or chaos-monkey-style testing. I'll talk about that in a second. And yes, like I said, we do run tests in production. These are L3 tests, and we run them to help us with that kind of service-level testing. Yeah, a question: so what data do you use for L3, engineering data or fake data in production? We use a test account for that; I'll show you when I get to the L3 part in a second. So let's walk through an example of this fault injection testing that I talked about. Buck talked about circuit breakers. They have a big role to play in production; they're very important. But how do we know whether the circuit breakers we've put in place actually work?
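As an aside, the three-nines arithmetic mentioned above is easy to sketch: if your system fails whenever any hard dependency fails, your best-case availability is the product of the dependencies' availabilities. This is a hypothetical illustration in Python, not anything from the VSTS codebase:

```python
def combined_availability(dep_availabilities):
    """Availability of a system that is down whenever any hard dependency is down."""
    result = 1.0
    for a in dep_availabilities:
        result *= a
    return result

# Five hard dependencies (say SQL Azure, DNS, AD, storage, ...), each at 99.9%
deps = [0.999] * 5
print(round(combined_availability(deps), 5))  # 0.99501, already below three nines
```

Which is why resiliency mechanisms like fallbacks and circuit breakers, rather than more reliable dependencies, are what get you back above three nines.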
And there are really two questions we're going after. One: does the fallback behavior work? When the circuit breaker opens, you're supposed to go to your fallback. Does it actually fall back, and does the fallback work? That's question number one. The second question: does the circuit breaker open? Does it have the right sensitivity to open when it needs to open? These are the two things we try to test. Since you asked about a test, I'm going to show you in a demo. This is a PowerPoint demo, but I guess it gets a little close to what you're asking, not for the L0 tests, but for something else. So let's do a case study of testing a Redis circuit breaker. Now, Redis is a non-critical dependency in the product. It's a distributed cache, which means if it's down, the system should continue to just work; it should just go back to the source. And we have a circuit breaker wrapping the Redis calls, so we want to make sure that circuit breaker works. The hypothesis is that if Redis goes down, the circuit breaker should kick in and switch to the fallback, and the fallback should take over. That's what we are trying to test. Here is how it works. You have VSTS, you have a Redis breaker wrapping the calls to Redis, and SQL is the fallback. Buck talked about how we use a config change to open or close a circuit breaker, and that's what we do here: through a config change, force the circuit breaker to open, see that the call goes to SQL, the fallback, and then, through another config change, reset the circuit breaker and see that the call returns to Redis, okay? So that tests that the fallback worked, but it doesn't answer the other question, which is: did the circuit breaker open? Because remember, we forced it to open. You actually want it to open on real failure.
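The config-change walkthrough above can be sketched in code. This is a minimal, hypothetical Python sketch of a circuit breaker with a forced-open switch standing in for the config flag; every name here is made up for illustration and is not the actual VSTS implementation:

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.forced_open = False  # flipped via a config change in the test

    @property
    def is_open(self):
        return self.forced_open or self.failures >= self.failure_threshold

    def call(self, primary, fallback):
        if self.is_open:
            return fallback()      # breaker open: go straight to the fallback
        try:
            result = primary()     # normal path: the cache
            self.failures = 0
            return result
        except Exception:
            self.failures += 1     # real failures also count toward opening
            return fallback()

breaker = CircuitBreaker()

def get():
    return breaker.call(lambda: "from-redis", lambda: "from-sql")

print(get())                 # from-redis while the breaker is closed
breaker.forced_open = True   # config change: force the breaker open
print(get())                 # from-sql: the fallback takes over
breaker.forced_open = False  # config change: reset the breaker
print(get())                 # from-redis again after the reset
```

Forcing the flag exercises only the fallback path; the `failures` counter is what the fault-injection step has to exercise.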
So this is where fault injection comes into play. Through a fault agent, we can introduce a fault for the calls going to Redis. With the fault injected, Redis requests are blocked, and then we see that the circuit breaker opens and the fallback works. Then you remove the fault. Now the circuit breaker will send a test request to Redis, and if the test request passes, the calls revert back to Redis. So this is an example of the type of testing we can only do with fault injection, because we can simulate the failures that would happen in real life and be prepared for when they actually happen. Through this, we've found all kinds of problems. I actually think we can do a lot more in this realm; we are just scratching the surface, but to the extent that we have done this, we have found all kinds of interesting issues. Things like: the fallback doesn't work. You introduce a fault and see that the service doesn't fall back to its fallback mechanism. Or the circuit breaker doesn't open; it's not sensitive enough, meaning you thought that in a real-world failure scenario the circuit breaker would open, but maybe the threshold is too high, so it waits through the failure much longer than you anticipated. Or there are other system timeouts that interfere with the circuit breaker behavior; we have found those kinds of issues too. The point is, these are all real issues we have found, and the examples I'm giving here would otherwise have become real live-site issues. In some cases it could have become a Sev 1 or Sev 0 issue, which we were able to prevent by doing this kind of testing in production. Again, it's very hard to do this type of testing in the lab, so, going back to the role of some of these integration and end-to-end tests in the lab.
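The fault-injection sequence just described, inject a fault so real Redis calls fail, watch the breaker trip on its own, remove the fault, then let a test request close it again, might look roughly like this. The fault agent and breaker below are simplified stand-ins with invented names, not the real tooling:

```python
class FaultInjector:
    """Stand-in for the fault agent: when active, it blocks calls to Redis."""
    def __init__(self):
        self.active = False

    def maybe_fail(self):
        if self.active:
            raise ConnectionError("fault injected: Redis request blocked")

class CircuitBreaker:
    """Opens after `threshold` real failures; probes with a test request when open."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def call(self, primary, fallback):
        if self.open:
            try:
                result = primary()   # half-open probe: one test request
                self.open = False    # probe passed: close the breaker
                self.failures = 0
                return result
            except Exception:
                return fallback()    # probe failed: stay open
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True     # enough real failures: trip the breaker
            return fallback()

fault = FaultInjector()
breaker = CircuitBreaker(threshold=3)

def redis_get():
    fault.maybe_fail()
    return "from-redis"

def get():
    return breaker.call(redis_get, lambda: "from-sql")

fault.active = True                      # inject the fault
results = [get() for _ in range(3)]      # real failures trip the breaker
assert breaker.open and results == ["from-sql"] * 3
fault.active = False                     # remove the fault
print(get())                             # probe passes: prints from-redis
```

The two questions from earlier map directly onto the assertions: the fallback answered every call while the fault was active, and the breaker opened from real failures rather than a forced config change.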
This is where it has a diminished role, and this kind of testing in production is favored instead. There are some lessons here. Yes, do chaos engineering, but do it in your ring zero. I think the ring zero concept was introduced earlier; this is the ring where we run our own service, our own team's engineering system. So if something fails and goes haywire, then we are only hurting ourselves, not customers. To your earlier question, are we not doing this in production? Well, we do, but we do it carefully. You also want to automate these experiments. These are expensive tests; this kind of testing I just talked about, where you inject a fault and watch the failover take place, is not easy to do the first time. But over time, by building the right tooling and the right automation, you can institutionalize this kind of testing throughout your organization. By the way, there's a whole set of practices around this; you can find more information at that link. All right, the other type of failover or failure testing we do in production is this thing called BCDR. I'm sure you're all familiar with it; this is the business continuity and disaster recovery program that most companies have, and we have one too. It's very formal: we track for each service how important it is to the business if that service goes down, we rank our services, and we have formal disaster recovery plans. But one of the things I've learned over the years is that just documenting a disaster recovery plan is never good enough. I'll tell you a little story. On my previous team we had a major outage going on. This was the advertising system for Bing, so think millions of dollars flowing through it. We were two hours into the outage with no solution in sight, so I looked to my ops lead and asked him, hey, let's just do the failover.
And he looks me in the eyes and goes, no, I don't want to do the failover. I'm like, what do you mean you don't want to do the failover? We have the whole failover procedure. We built it, and you just don't want to do it? And it happened again: when the second outage came, I said let's do the failover, and he wouldn't do it then either. Afterward I learned that the reason he didn't want to do it is that the failover was only formally documented. It had not been exercised in recent times, so he simply lacked the confidence to go do it. So he said, no, it's better to fix the problem; let's not do a failover. The key point here is that you want to test your failover procedures in the real world. It's not enough to have a documented plan; you want to exercise it in regular drills.