Now, the second part of the big shift that happened is this mindset that master is always shippable, which has two parts, and that is the shift-left notion. So I want to take you back to September 2014. We are three years into our cloud cadence, and we're still running our tests the way we ran them before. Remember that initial approach: you're trying to do things faster, we are trying to optimize our automation, but we are struggling. We are struggling big time. We are living in squalor at this point. Our tests took too long. We had a test suite called NAR, which stands for Nightly Automation Run, and that took 22 hours. So by definition, we are telling engineers to write tests that take all night to run. We had another test suite called FAR, another interesting choice of name, that stood for Full Automation Run. That took two days to run. In retrospect it looks absurd that you name these things like that, but that's what we had. Tests failed frequently. We looked at test reliability, and you would never find a test pass that passed 100%. In fact, somebody gave me this factoid that we never shipped a product in TFS that had a 100% pass rate. We always had a test run with a certain percentage of failures. But the sadness of the situation was this: we first tried to just improve the reliability of the P0 run. There were a couple dozen tests that we thought were super valuable, that we wanted to run all the time. These are the P0 runs. So let's try to get these tests to pass 100% all the time. What we did was take a good build and run this P0 test suite in a loop, say 500 times. And if a test failed, we would call that test, or that test pass, unreliable. The idea was to get that P0 test suite to run consistently with a 100% pass rate, with a very focused effort. Any time a test failed, it would file a P0 bug, and management would try really hard to get people to fix those bugs. So we tried to run with that. And with significant, concerted effort, we could only get the reliability to about 60% to 70%, even with that kind of Herculean effort. Because the tests would fail in a variety of ways: either the deployment would fail, or just publishing the results would fail, or the tests themselves would fail, or the product would have bugs. So it was very painful. We couldn't trust the quality signal coming out of master. You do a commit, and you get a run back 12 or 22 hours later. Then you've got a bunch of failures, and it takes a long time to sift through those failures, which the team didn't have time to do. So people just ignored those failures until the end of the sprint. Then the whole process of sifting through those failures and certifying the test run would begin, and that would take days. As I mentioned, sometimes it took us three more weeks just to get to that point. So there was a huge cliff to get to release quality. It was just a very bad situation, and we realized that this was not going to work. So in February 2015, we published this quality vision.
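As an aside, here is a minimal sketch of that P0 reliability loop, just to make the mechanics concrete. This is not the team's actual tooling: the suite runner is passed in as a function so the sketch stays self-contained, and all of the names are hypothetical.

```typescript
// Sketch of "run the P0 suite in a loop and measure reliability".
// In practice the runner would deploy a known-good build and execute the
// couple dozen P0 tests; here it is an injected function.

interface SuiteResult {
  passed: boolean;
  failedTests: string[];
}

type SuiteRunner = () => Promise<SuiteResult>;

async function measureReliability(runSuite: SuiteRunner, iterations = 500): Promise<number> {
  let passes = 0;
  for (let i = 0; i < iterations; i++) {
    const result = await runSuite();
    if (result.passed) {
      passes++;
    } else {
      // In the real process, every failure here became a P0 bug for the owning team.
      console.warn(`iteration ${i}: failed tests: ${result.failedTests.join(", ")}`);
    }
  }
  // Pass rate across all iterations; per the talk, even concerted effort only
  // pushed this to roughly 0.6 to 0.7.
  return passes / iterations;
}
```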
If there is one slide you want to take away and show to customers in terms of what we did, I think this is it. This is literally taken from what we came up with at the time. We drew this diagram, and by the way, it is a completely conceptual diagram. It is not drawn to scale; it was drawn to illustrate the point. And the point was that we were going to turn our entire test portfolio on its head. First, just on nomenclature: the L0 and L1 tests are the unit tests, and the L2 and L3 tests are the functional tests. I'll get into that in a minute, but for now, understand that that's what they are. In the old world, just about all of our testing was done through those functional end-to-end tests. They're broad in nature. That's what our portfolio looked like, and we said we wanted it to look like that instead. In conjunction with that, we came out with a set of principles that the team would adhere to, and these are all fairly self-explanatory. The first one is: tests should be written at the lowest level possible. That speaks to this push to the left, write more unit tests. Write once, run anywhere, including the production system: you don't write one set of tests for your local box, another set for your integration environment, and a third set for production. No, you write one test and it should run in any environment. Product is designed for testability: this speaks to the point that in our old world, the test team had to write tests in roundabout ways because they didn't have the appropriate test hooks; they would call the server object model directly from the client and exercise the product in very unnatural ways, ways it was never meant to be tested. So we decided the engineering team would invest in the appropriate test hooks so that we could write the tests the right way. Test code is product code; only reliable tests survive. This speaks to the mindset that we're going to treat tests the same way we treat our product code, with the same rigor, the same reviews, and all of that. Test infrastructure is a shared service: anything we build, first of all, we try to build in a way that improves the product rather than just building extraneous tools, and it is a shared service for the entire team, very well supported, with the same reliability as the product code. And the last point is that test ownership follows product ownership. This speaks to the point that the tests sit right next to the product code: if you have a component, it's tested at that component boundary, and you're not relying on somebody else to test your component. Again, pushing the accountability to the person who's writing the code. That was the basic concept, and I'll show you in a few slides what actually happened. This is what we came out with, and it's interesting to see what happened. Yeah, question. So it seems like your decision was to have more unit tests versus functional tests. But we all know that there is actually value in all levels of tests, right? Especially when you're just changing some code and you want to make sure that you didn't introduce anything that broke something else. Functional tests would probably be the ones that actually catch that kind of issue.
So are you saying that, from your experience, those functional tests are not that valuable and you made a decision that... No, I'm not saying that. What I'm saying is that there is a balance shift. In the past, we relied heavily on functional tests, those broad end-to-end tests, and we knew that was wrong. We are not saying that there is no place for functional tests; we are saying they have a very diminished role. And I'll show you particularly when I talk about testing in production, because that's the other half of the story here. The first half of the story is that we pushed more unit testing, but we are not saying no functional tests or no end-to-end tests. Looks like Brian wants to chime in. Yeah, I just wanted to comment on that. What we would say about functional tests is that they're more expensive to write, more expensive to run, and more fragile to changes. It's not that they don't find issues; they do. The ROI on them is just lower, and so shifting left in that world, you move to a higher ROI. We still have functional tests; the L3 bucket isn't empty. We just have a lot fewer of them and we rely on more efficient testing. That's all we're saying. So do you continue to write those functional tests? Yes, and all of those functional tests are new. We threw away all our tests and had to write new functional tests, because the way we were writing functional tests before sucked, too. Are you going to talk about specific frameworks that you guys are using? Yeah, I will. All right, so this is just the same quality vision depicted in our pipeline. The basic idea was that we would run a very large number of L0 and L1 tests, which are the unit tests, even before a commit is pushed to master. Then we have L2 and L3 tests, the functional tests to your point, and we run them in CI as well as in our deployment environments, but there are just fewer of them. That was the basic concept. I've got a question about what you and Brian just mentioned. Maybe this is a question for Brian, but was there any kind of feasibility or cost analysis between throwing everything away and redoing it from scratch? Because it's just fantastic that this was done at that scale, right? But I don't see many companies doing it, and if I were to ever even mention that to them, how would I approach it? Did you do some kind of feasibility study? Do you have some kind of general best practice for doing that kind of analysis? Was it an act of despair? I would say it was a couple of things. Act of despair, a little bit, is there. But we had seen this movie before. I can say personally, from my experience, that I went through a very similar exercise in the team I was in before. We also observed how some of the other companies, the ones born in the cloud era, work, and we saw that it is possible to have a portfolio balance that looks more like this. Ours was skewed because of the old setup that we had. So did we do a formal analysis of the cost versus benefit? No. But we had a pretty good intuition that this was the right path. And remember, we survived with those 27,000 old functional tests for three years, and we saw how painful it was to maintain that collateral. You will see in a minute that when we hit the delete button on that, we did a very careful analysis of those 27,000. I'll talk about that in a second. Yeah.
Just a follow-up on that. So now you're writing more unit tests that were not there before, right? Did you go back to the code that was not covered and start writing unit tests for it, or did you just make a decision that from this point on, whatever we write, we have to write unit tests for? And the other aspect is the cultural change that goes with it. So I'm going to speak to all these points in the next three to four slides. Ask me the question again if I don't answer it, okay? Please remind me. All right. So your first question: how did you do this? How did you get the flywheel going? How did you get the team to buy into this? These are all valid questions, and we grappled with all of them. As soon as we rolled out this vision to the org, we started getting those questions back. Questions like: are you crazy? How is this going to work? We have never been successful writing unit tests before; why do you think this is different this time? Can you really test the product with this kind of portfolio turned on its head? We got all of those questions. Like I said, the unit test war broke out. You can see the tug-of-war here, where there are people who are extremely skeptical, not only about the approach but about the management commitment behind it, and then there are people who are passionate about the new direction. And we had a lot of philosophical discussion about the types of unit tests, the classicist versus mockist debate: people who take a very purist approach, wanting unit tests that are completely isolated, versus unit tests that take some amount of dependency. On each of these questions, we took a middle ground, a very pragmatic stand. So, for example, on this one we would say: if you have the ability to refactor your code, or if you are in a greenfield environment, take a more purist position on your unit tests. But if you are authoring unit tests for existing code that's been around, it's okay to have unit tests that take a dependency. For example, you'll see that our unit tests do take a dependency on the SQL resource provider, because a significant portion of our product uses SQL and we just didn't want to mock that layer. So we didn't take a dogmatic position on this; we took a more pragmatic stand, and I'll show you the progression of our conversion in a couple of slides. I'm going to tell you a little bit about the other things we did to redefine the way we talk about these things. First and foremost, we needed a new test taxonomy. Remember, our old taxonomy was based on duration: nightly runs, full runs, things like that. We got rid of all that, and we said we are going to define tests based on the number of dependencies they take, the measure of dependencies they have. So you have unit tests, the L0 and L1 tests, which take a minimal amount of dependency. In fact, L0s take no dependencies at all; they can be run in memory, and you don't even need to deploy the product. Your L1 tests can take a dependency on SQL, but again, they don't require the deployment of the full product. L2 and L3 are functional tests, but they run against a testable service deployment so that you can run them in a properly isolated way.
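To make that classicist-versus-mockist middle ground concrete, here is an illustrative Jest-style sketch in TypeScript. PriceCalculator and TaxTable are made-up examples, not product code; the point is only the contrast between using a cheap, deterministic collaborator as-is and replacing it with a mock.

```typescript
// Hypothetical collaborator: cheap and deterministic, so the classicist style
// can use it directly without mocking.
class TaxTable {
  rateFor(region: string): number {
    return region === "WA" ? 0.065 : 0.0;
  }
}

class PriceCalculator {
  constructor(private taxes: TaxTable) {}
  total(net: number, region: string): number {
    return net * (1 + this.taxes.rateFor(region));
  }
}

// Classicist style: take the dependency when it is fast and reliable
// (analogous to L1 tests taking a real SQL dependency).
it("adds WA sales tax (classicist)", () => {
  const calc = new PriceCalculator(new TaxTable());
  expect(calc.total(100, "WA")).toBeCloseTo(106.5);
});

// Mockist style: isolate the unit completely by replacing the collaborator
// (closer to a pure L0 test with no dependencies at all).
it("adds WA sales tax (mockist)", () => {
  const fakeTaxes = { rateFor: jest.fn().mockReturnValue(0.065) } as unknown as TaxTable;
  const calc = new PriceCalculator(fakeTaxes);
  expect(calc.total(100, "WA")).toBeCloseTo(106.5);
  expect(fakeTaxes.rateFor).toHaveBeenCalledWith("WA");
});
```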
So that's how we defined the levels. These are some other names that we use. Self-test is a run comprising L2 and L3 tests that we run in our CI; these are our P0 runs. Remember that P0 test suite I tried to run before; we created a new one called self-test. And self-host is all the runs. One of the philosophies here was that if we are going to write a test, it needs to be run all the time, so it needs to be in one of these runs. And the TRAs are just the legacy tests, for your reference. Yeah, question. So Munil, this is good. This actually helps classify the various tests, and as you're showing here, you have L0 and L1. Now, are there any specific best practices or guidelines to ensure that you identify the right tests? Because I don't want to end up with a laundry list of unit tests that just keeps on increasing. I have a client who has, I think, more than about 80,000 unit tests; it takes about 16 hours or something, which is just terrible, and we are trying to take them down. Good question. We have pretty strict guidelines for L0 and L1 that we rolled out. I'm not going to go through all the details here, but the basic idea was that these unit tests need to be super fast and super reliable. So there are performance requirements on the unit tests: each test must run on average in less than 60 seconds per assembly. That's the guardrail for the team to design the unit tests against. And there are other requirements on the unit tests, what you're allowed to do and what you're not allowed to do, that we rolled out to the team. I'll show you in a second how long it takes for us to run 60,000 unit tests; it's actually less than five or six minutes. We'd like it to be more like a minute; right now it's six minutes. And one of the ways we did that is by not only coming up with requirements like this, but tracking all our unit tests to see that they don't violate any of our principles. So when you're executing these unit tests, do you execute them in parallel? Yes. These are the guidelines on the L1 tests; again, very similar. There are performance requirements, where the code should reside, what dependencies it can take, things like that. And requirements for the L2s. The key concept for L2 was test isolation. A properly isolated test can be run in any sequence; you don't need to run test A after test B and so forth, which causes problems. A properly isolated test can run in any sequence because it has complete control over the environment it runs in. If you have a test that dirties the data in the database and leaves it that way, and the next test you try to run expects something else in the database, that test gets corrupted and you get flakiness. So the key design principle for L2 was test isolation: tests need to be able to run in any order. One of the things we did is this dynamic identity work. Our old functional tests, in order to run, clearly needed an identity, and they would call our external authentication providers to get one. There are a few problems with that. First, you have an external dependency that could be flaky or have problems, which can break the test. But it also violates the test isolation principle, because you can have an identity whose state changes: some previous test has changed the permissions on that identity, the next test comes along, and now it runs into a problem. So there are a few places like this where we built support for fake identities.
We did this in a product-aligned way, meaning we took advantage of the extensibility of the product and built test extensions. This was part of the framework — there was a question on frameworks — that we added so that tests can take advantage of these identities, use a new identity for each test, and run in complete isolation. But there are only a few examples like this; we were very careful in picking where we use this kind of fake in the product. All right, so this is what... yeah? Can you go back to the L0 tests? Okay, so that is 60 milliseconds; I thought you said 60 seconds. Oh, milliseconds, sorry if I misspoke. So this is what happened, and I think this will probably answer a bunch of the earlier questions. What you're seeing here is that the orange bar is the old tests, the old TRAs. The graph starts at Sprint 78 and ends at Sprint 120, so this is over 42 sprints, or 126 weeks, about two and a half years of effort. When we started, we started with 27,000 TRAs, the old tests. So this is what we were looking at: those 27,000 tests, very difficult to manage, to run, and to deal with all the failures. The first thing we did was say: let's not worry about those 27,000 tests; there is just no way we're going to be able to deal with that problem right now. What we want to do is get the team to buy into the idea of writing unit tests. So let's start with writing unit tests for the new features that are being developed. Forget about the existing code and the existing tests; start with the new features. That's what we did. We wanted to get the flywheel going, so we made it super easy to author the L0 and L1 tests, and Bill walked you through that yesterday: it shows up right next to your code. We wanted to build that muscle first. And you will see that we didn't bother writing any L2 tests all the way up to Sprint 101; you can see the L2 tests, that's blank. Somebody asked, hey, do you just not believe in functional tests? No, that's not the case; we do write L2 tests. We just left the TRAs alone and started with the unit tests; that's the main point. And you can see the unit tests starting to build up. Now the teams start to see the benefit, because not only are these tests easier to write, easier to maintain, and faster to run, they have fewer failures, which means I'm spending less time sifting through failures. Moreover, these unit tests run in the PR; they are wired into the PR, which means even before my commit makes it to master, I already get a result back. So the moment teams saw the benefit of those unit tests, it became a nice cycle, where a team would just start writing more and more unit tests. In the meantime, notice how the TRAs went down from 27,000 to 14,000 between Sprint 78 and Sprint 101. All we did was start deleting the TRAs, because we would go look at these old tests and notice that either they were not good, or they were not providing the right coverage, or the unit tests that we wrote were a good replacement for the TRAs we had.
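To tie the isolation principle and the fake-identity extensions together, here is a minimal Jest-style sketch in TypeScript. It is purely illustrative: the in-memory store and the fake identity factory are hypothetical stand-ins, not the product's real test extensions.

```typescript
// Every test mints its own fake identity and its own data, so the suite can
// run in any order. All names here are hypothetical.

interface FakeIdentity {
  id: string;
  displayName: string;
}

// Stand-in for a test extension that creates a brand-new identity on demand,
// instead of calling an external authentication provider.
let identityCounter = 0;
function createFakeIdentity(displayName: string): FakeIdentity {
  identityCounter += 1;
  return { id: `fake-${identityCounter}`, displayName };
}

// Stand-in for the system under test: permissions keyed by identity id.
class PermissionStore {
  private allowed = new Set<string>();
  grant(identity: FakeIdentity): void {
    this.allowed.add(identity.id);
  }
  canEdit(identity: FakeIdentity): boolean {
    return this.allowed.has(identity.id);
  }
}

describe("permission checks (isolated, fresh identity per test)", () => {
  let store: PermissionStore;
  let user: FakeIdentity;

  beforeEach(() => {
    // Fresh state and a fresh identity for every test: nothing a previous test
    // did to some shared account can leak in and cause flakiness.
    store = new PermissionStore();
    user = createFakeIdentity("Test User");
  });

  it("grants edit permission", () => {
    store.grant(user);
    expect(store.canEdit(user)).toBe(true);
  });

  it("starts with no permissions", () => {
    expect(store.canEdit(user)).toBe(false);
  });
});
```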
So we didn't have to redo all 27,000 tests; a big part of the transition to this new world was just deleting the old tests we had. Then we focused on writing the L2 tests, starting in Sprint 101, and notice that at the end of Sprint 120 the TRAs are finally at zero. We had gotten rid of all our old tests at that point. I showed this graph to one group, and somebody was paying very close attention to the numbers. He said, wait a minute, what happened in Sprint 110? I said, what do you mean? He said, look, your TRAs jumped from 2,100 to 3,800; does that mean you suddenly started writing old legacy tests? Any guesses what might have happened there? We actually found more old tests. They were just hiding somewhere in the source tree; we did not write new old tests. And by the way, that happened multiple times; it's just that because we were constantly deleting more than we were finding, it didn't show up in the graph except in Sprint 110. So that was very interesting. I think I spoke to this, but this is the process of the journey we went through, and it probably makes sense, if you were to take on an effort like this, to progress through it this way: you want to start by creating the new behavior first. You want to build the new muscle and train it before you worry about what you already have. Otherwise, you get a lot of inertia; you try to tell teams to go fix all these old tests, and nobody's going to sign up for that. I hope this isn't tedious for anyone, but would you clarify where the user interface automation tests now occur, where stored procedure testing occurs, JavaScript library testing, and other types of assets that aren't .NET based, or C++ based for that matter? Yeah. So let's talk about the first one, the UI tests. We had a lot of functional UI tests before, and in the new world we discourage writing UI tests; we don't like to test at the UI layer. We do have some L2 tests that are UI tests, but a lot of the functionality is tested at the L0, L1 level. We use the CasperJS framework and run it in a headless browser, PhantomJS, and we have a unit test framework for the L0 tests also. So that's how we deal with overall testing for the UI. Sorry, what was the other question? Stored procedures and JavaScript libraries. I mean, you answered the JavaScript libraries in a general sense, but there's a big fear, or focus, with clients who are into automation already that they need to do UI automation testing, and I think they would look at this curve with some skepticism, perhaps, and say: did you really cover everything? Because how can you know, unless you're pretending to be the user and covering the integration from end to end? And so that's the gap. Yeah, but remember, we are not saying there are no functional tests, no L2 tests. There are quite a few; in fact, we have 4,000 of them, so it's not a small number. It's just that at this scale, because we are counting tests, and the problem with counting tests is that not all tests are the same, a graph like this almost makes it look like all we are doing is unit tests. That's not the case. We have 51,000 unit tests, sorry, L0 tests, another 5,000 L1 tests, and then another 4,000 L2 tests. All we did is shift the balance of the portfolio.
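Since the UI question keeps coming up, here is a minimal sketch of what a thin L2-style UI smoke check might look like. The stack described in the talk was CasperJS on PhantomJS; this sketch uses Puppeteer purely as a present-day stand-in, and the BASE_URL, route, and selector are hypothetical, not the product's real pages.

```typescript
// Thin, coarse-grained headless-browser check: is the page up and the main
// widget rendered? Deliberately avoids fine-grained UI assertions, which are
// the flaky ones.
import puppeteer from "puppeteer";

async function checkDashboardLoads(): Promise<void> {
  const baseUrl = process.env.BASE_URL ?? "http://localhost:8080"; // hypothetical
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(`${baseUrl}/dashboard`, { waitUntil: "networkidle0" });
    // One coarse assertion: the main grid shows up within a generous timeout.
    await page.waitForSelector("#work-item-grid", { timeout: 10_000 });
  } finally {
    await browser.close();
  }
}

checkDashboardLoads().catch((err) => {
  console.error("dashboard smoke check failed:", err);
  process.exit(1);
});
```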
We are not taking a religious point of view that you just don't write any functional tests. In fact, you'll see that right now there are no L3 tests here; we are only now starting to write L3 tests, which are functional tests we run against production. We think they are very valuable, for reasons I'll explain later; we are not shunning them. Bill, you want to add something? Yeah, hi. So a couple of details. One, we use Chutzpah, which is an add-in for the test framework, to do unit testing of the JavaScript stuff, and CasperJS and PhantomJS are up in L2; that's where we do the headless browser testing, so it's in that smaller class. And then the L1 tests are where we do the SQL validation. We've got tens of thousands of lines of code in sprocs and a lot of logic down at that layer, which was all being tested through functional paths before. That really is what L1 is about; it's focused on the SQL side of things. We've got a very lightweight path: create a DB, set up my schema. We call it long-chain integration testing because it goes all the way from the units that you'd think of as being pure, down into the DB, exercising the sprocs, and then coming back out. But you don't have to do a full-on deployment for that. And then another important point, which in some cases may give some comfort to your customers who are thinking about UI testing. We strongly discourage UI tests. They're flaky; they're just terrible. The button didn't show up; I waited long enough and my button isn't there. I can't tell you how many times I've seen that. We hate them; I hate them, anyway. But you have to have some degree of that kind of coverage. The other important thing to keep in mind here is the other side of the cloud pivot, right? The way we're thinking about testing is shifting a little bit. In the boxed product days, a bug that shipped had a huge cost, so we would polish to the nth degree, and coverage of the customer scenarios at that level was very, very important. In the cloud world, the way I try to explain it sometimes is: in the old world, we were trying to get to 99.9% perfection with our tests, and tests are an approximation of reality. In the new world, we drop that last 0.9% and use progressive exposure. We deliver to ourselves first, and then to a set of friends, and by the time we get out to a place where it actually matters, we're back up to that 99.9%, and probably further than we would have been. So it's a little bit of a shift in thinking that goes along with the things Munil was talking about. All right, let me keep rolling; we've got about 35 minutes and some more material to go through. This is sort of "no test left behind." In going through this journey, we were constantly tracking different things. I mentioned how we found those old tests; in the same way, you would find L0 and L1 tests that people had authored but hadn't tagged correctly, so they didn't show up in a run. So this is like: the search team moved 25,000 unit tests to L0, L1 — applause. Things like that. We are particularly concerned about the growth of the L2 tests because they just take longer to run, so we're constantly asking: is the portfolio balance right for this feature team? How come they are writing more L2 tests than L0 and L1 tests?
There is really no scientific answer to this, or at least we didn't have one, other than: does it pass the sniff test? One team looked very different from another team, and we would ask, why is it this way? That sort of thing. The other thing we were concerned about was speed. You mentioned that customer who has a lot of unit tests that take a long time. We constantly track the time it takes to run our unit tests, and we flag anything that takes more than the time we allow: for example, 60 milliseconds for the L0 tests in an assembly, 400 milliseconds for the L1 tests in an assembly. If tests are outside of that, we flag it and we expect teams to go fix it. Total execution time for L0 is six minutes; for L1, three minutes. Combined, there are about 60,000 tests that we run. All right, so now... any questions here? So are these unit tests executed every time a check-in happens? Yeah, they are part of the PR pipeline. With every PR, the 60,000 tests are run. Okay.
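Here is a minimal sketch of that "flag anything over budget" tracking. The per-level budgets mirror the numbers in the talk (60 ms per L0 test and 400 ms per L1 test, averaged per assembly), but the input shape and names are hypothetical, not the team's actual tooling.

```typescript
// Check per-assembly average test duration against the level's time budget
// and report anything the owning team would be expected to fix.

interface AssemblyTimings {
  assembly: string;
  level: "L0" | "L1";
  testDurationsMs: number[];
}

const BUDGET_MS: Record<"L0" | "L1", number> = { L0: 60, L1: 400 };

function findBudgetViolations(results: AssemblyTimings[]): string[] {
  const violations: string[] = [];
  for (const r of results) {
    const avg = r.testDurationsMs.reduce((a, b) => a + b, 0) / r.testDurationsMs.length;
    if (avg > BUDGET_MS[r.level]) {
      violations.push(
        `${r.assembly} (${r.level}): avg ${avg.toFixed(1)} ms exceeds ${BUDGET_MS[r.level]} ms budget`
      );
    }
  }
  return violations;
}

// Example run with made-up assemblies: the second one would be flagged.
console.log(
  findBudgetViolations([
    { assembly: "WorkItems.Core.Tests", level: "L0", testDurationsMs: [12, 20, 45] },
    { assembly: "Search.L1.Tests", level: "L1", testDurationsMs: [350, 900, 600] },
  ])
);
```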