All right. My name is Ed Glass. I'm a group engineering manager on shared services on the Visual Studio Team Services team. One of the things my team does is look after our deployments and the way we deploy across our services, so that's what I'm going to talk about. My agenda is to show you what deployment looked like in the early days of Team Services (and it wasn't pretty), then some key transitions we made along the way, then where we've come to today, some of the future investments we're making, and lessons learned.

In the early days we came up with some principles, and they've been good ones; they've stood the test of time and we still abide by them. One was that the tools we use to deploy to the service are the tools we use every day in our dev and test environments. This is pretty critical. I've heard of services where there's some special sauce the ops team uses to deploy and nobody else knows how to do it; sure enough, you get to production and bad things happen. Our way, we're exercising these bits all the time. If a developer makes a mistake authoring deployment, they find out on their dev box, not once we go to production. A second principle is that the quality signals we use to greenlight a deployment are signals we're looking at all the time, every day. You've heard our test story, and I'm going to talk some more about that; we held to this principle, but the quality signal wasn't always very high fidelity, and I'll get to that too. The other principle is no-downtime deployments. One thing this enables is deploying during working hours, which is pretty key for work-life balance for our engineers. We don't want to be deploying at midnight on Saturday and calling people in when things go bad.

So what did these early deployments look like? This is a screenshot of a OneNote we had back then. For each deployment we'd fill out these questionnaires, and they were pretty long; here's another page from it, and you can see from the scroll bar that it goes on and on. These are the things we did to prepare for a deployment. Then the actual deployment script had all these steps to run, and we deployed from the command line. The operations team had special access to the deployment machines and our production secrets, and they'd go fire up this command line. You'll notice the background of this command line is red. We did that so the operator knew they were operating against the production environment; if it had a black background, it was a test environment. That helped them avoid mistakes. So our deployments really looked like a Skype call with about 60 people on it, an ops person sharing their screen and going through all these steps, and everybody crossing their fingers that it would work.

We stood up the service in April of 2011 and waited about four months before doing our first deployment, so it carried a ton of changes, and it did not go well. We were gearing up for the Build conference, where we were going to announce our service, and it took everything we had to get things stitched back together. The problem with big deployments with lots of changes is that lots of problems are now intertwined, and pulling them apart to figure out what went wrong gets harder and harder. So we learned that.
You can see that within a year we went to sprintly deployments, and how poorly those big deployments went was really the impetus to deploy more frequently. So, summarizing the problems we had and the early transformations we made: the first problem was big payloads, so we went to sprintly, three-week deployments. Another problem, which Buck alluded to, was that we had one single service, so when we hit an issue it impacted everybody. We were in early talks with the Windows team, who wanted to start using Team Services for work item tracking, and clearly that wasn't going to fly if our service stayed flaky like that. So we introduced the ability to have multiple instances of TFS, which was our main service at the time, and as we introduced new features, we introduced them as microservices.

Another clear problem with all those manual steps was that people made mistakes, and some of those mistakes were really costly. I remember one in particular where an operator ran a step out of order and it took us a couple of days to recover. So we adopted tooling. If you've been around, you know about RM V1. There were some problems with RM V1, but it actually delivered a lot of value for us. One thing we said was: from now on, we're getting rid of the OneNote. If you want to make a change in production, you need to author it into our automation, into the deployment. We'd had that capability before, but we had a culture of "gosh, it's easier to just paste that into the OneNote than to actually code it into the deployment." That got out of hand, so we said we're not going to do that anymore.

Another change was enabling engineers to run config changes against production. A config change is basically running a PowerShell script, so it's pretty powerful; you can do a lot with it, and we have to be careful with it. But rather than calling an operator and saying "hey, can you run this change for me in production," we enabled that through our release management, and that gave us an audit trail of the changes that have been made. The other problem was serializing everything through the operations team, with only ops doing deployments. RM gave us access control around who could queue deployments and who could approve them. This was really enabling for the engineering team: we could go ahead and deploy ourselves, especially as we spun up more microservices. We wanted the team that owned a microservice to have control over when they deployed, not to have to rely on a central team or some higher-up to approve their deployments. So it enabled those workflows, which was very good.

The other problem was that sometimes we'd have bad deployments where issues manifested immediately, but we would just proceed on. We have two phases to our deployments: one is updating binaries, the other is updating our databases. Once we've upgraded the databases, there's no rollback; we can't go back. We've talked about the binary compatibility: new binaries know how to talk to the old database, but old binaries don't know how to talk to the new database. So once we service the databases, there is no rollback. So we introduced a 15-minute health check period after the binary update: we do the health check, and if we detect an issue, we can VIP swap back and roll back the deployment.
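To make that health check concrete, here is a minimal sketch of what a post-deployment health gate like the one described could look like. It is illustrative only, not our actual tooling: the telemetry interface, the thresholds, and the VipSwapBackAsync operation are hypothetical stand-ins for whatever monitoring and traffic-swap mechanism a service actually has.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of a post-deployment health gate: after the binary
// update (but before irreversible database servicing), watch error and
// slow-request rates for a fixed bake window, and roll back via VIP swap
// if the signal looks unhealthy.
public sealed class DeploymentHealthGate
{
    private readonly TimeSpan _window = TimeSpan.FromMinutes(15);
    private readonly double _minSuccessRate = 0.999; // e.g. a 99.9% SLA target

    public async Task<bool> PassAsync(ITelemetry telemetry, ITrafficManager traffic)
    {
        var deadline = DateTime.UtcNow + _window;
        while (DateTime.UtcNow < deadline)
        {
            // Sample the last few minutes of request telemetry.
            RequestStats stats = await telemetry.GetRecentStatsAsync(TimeSpan.FromMinutes(5));
            double healthy = (double)(stats.Total - stats.Errors - stats.Slow)
                             / Math.Max(1, stats.Total);

            if (stats.Total > 0 && healthy < _minSuccessRate)
            {
                // Only binaries have changed so far, so rollback is still
                // possible: swap the VIP back to the previous deployment.
                await traffic.VipSwapBackAsync();
                return false;
            }
            await Task.Delay(TimeSpan.FromMinutes(1));
        }
        return true; // healthy for the whole window; safe to start database servicing
    }
}

public record RequestStats(long Total, long Errors, long Slow);

public interface ITelemetry
{
    Task<RequestStats> GetRecentStatsAsync(TimeSpan lookback);
}

public interface ITrafficManager
{
    Task VipSwapBackAsync();
}
```

The ordering is the design point from the talk: the reversible step (binaries behind a VIP swap) bakes first, and the irreversible step (database servicing) starts only after the gate passes.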
We still had a fair number of issues. We had a poor quality signal; you've heard all about that from Manil, and even Bill talked about it yesterday. It really manifested itself when we went to deploy. The problem was, we'd finish the sprint with what we thought was a pretty good pass rate, maybe 98%, and then come in on Monday morning to something like 30 test failures. So we'd have to get the owners of those tests to come in and figure out: is this a real problem or not? Should we block the deployment? And as they dig in, we uncover new issues, or it just takes time; typically we weren't ready to deploy until Thursday of the sprint week.

We also still had this singleton service, shared platform services, which holds our identity, account, and group information: a single point of failure, and one my team contributed to. That's a very, very bad place to be, because there's no room for error. Some of my best engineers introduced bugs into that service that took all of VSTS down. You're doing your team a big disservice when a little mistake that's hard to catch outside of production can take your whole service down. So having an architecture with no single points of failure is definitely something to design in from the start. The other thing, and it's no surprise, is that a lot of issues take longer than 15 minutes to surface. We'd move on with our deployments, bugs would surface, they'd start impacting customers, and we had no means to roll back.

Today we keep evolving. We're not done yet, we still have a lot of work to do, but we keep getting better. I think that's another key message: we are always improving the way we do our deployments, always investing in making things better. We've moved from RM V1 to Visual Studio Team Services Release Management, which has been great. We work very closely with the Release Management team on new features, and that keeps evolving. We've grown the service up, so we have a worldwide service now. We talked about the 31 microservices, some of them with up to 15 instances, so we are doing a lot of deployments; if you think about that, it doesn't scale without being fully automated. We've also democratized it by enabling each team that owns a service to deploy at its own cadence.

We now have a high-fidelity quality signal. We're running our tests constantly throughout the sprint, and if things turn red, we jump on it. We've got the class of unit tests running as part of the build, so builds won't get through without those passing. Then we have our L2 tests running constantly, the two buckets Manil talked about, self-test and self-host, and they stay green. Bill showed you one screenshot that wasn't quite realistic, but they really do stay green for the most part. So we come in on Monday morning to a green dashboard, and that's the signal we use to deploy. Most sprints we're able to deploy on Monday morning. Ryan talked yesterday about our cadence: we bring a new sprint payload every three weeks. Then every day, say a customer reports a bug, or we want to tweak a feature in production for usability or whatnot: as engineers push changes to the release branch, those get deployed the next day. And then there's a class of issue where we've got an active LSI.
We need to address it with a code fix, so we do what we call a hot fix deployment, and that's done as needed. Here's my screenshot, a little more realistic. You can see the one red cell is really a clear indicator of a test reliability issue, because it turned green immediately after. But largely we stay green on our tests, and we've worked really, really hard to drive out reliability issues. All right, any questions on that so far?

Question: I know it's maybe a detail here, but we covered so many things. You said the environments are changed by PowerShell scripts, and we've talked about databases. Is there a tool, or a part of Release Management, that runs the environment update scripts once and only once? What's the method for that? Is it part of using Azure? How do you run that config change?

OK, a config change looks just the same as our prod deployments, and I'm going to talk a bit about our deployment rings; it works the same way. Basically we have a release definition you can plug a PowerShell script into. The PowerShell script is checked into source control, so it comes from the repo; you configure that script into the release through a parameter, and then you kick off the release.

OK, so it's a human decision to include that script in that particular release? Yes, it's a separate release definition to do a config change. Got it, thank you.

Brian? Yeah, OK, so we do have a rule there. We've got the binary updates, which are really all or nothing: you get the new binaries or you don't. And then we have servicing steps that plug into a deployment, which are actually different from a config change. Servicing steps run as part of the deployment, and the rule is that your servicing step has to be rerunnable. We have tests that will kill servicing at random points and then rerun it, to make sure it is in fact rerunnable. The scenario is: you've got a step that's trying to update a database, it can't get the lock, so it fails, and we need to be able to rerun it. That's a key principle. Any other questions?

Good question: are we still doing change management processes for deployments? We do have some change management in terms of what goes into the release branch. The policy is that changes going into the release branch require M1 approval, and that's pretty much it. For the actual deployment, pretty much anybody can queue a deployment, and then M1s, the leads, can approve it. It used to require M2. We keep getting less and less process, is how I'd describe it, as we go.
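Since the rerunnable rule for servicing steps came up just above, here is a minimal sketch of what an idempotent servicing step could look like. It is an assumption-laden illustration, not VSTS code: the ServicingProgress marker table and the step name are hypothetical, but the pattern (record completed steps, guard each change so redoing it is safe) is exactly the property the kill-and-rerun tests check for.

```csharp
using Microsoft.Data.SqlClient;

// Hypothetical sketch of a rerunnable (idempotent) servicing step.
// If the step is killed partway through and rerun, it must converge to
// the same end state without erroring or double-applying work.
public static class AddPriorityColumnStep
{
    private const string StepName = "AddPriorityColumn_M124"; // hypothetical step id

    public static void Run(string connectionString)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();

        // 1. Skip entirely if a previous run already completed this step.
        using (var check = new SqlCommand(
            "SELECT COUNT(*) FROM dbo.ServicingProgress WHERE StepName = @name", conn))
        {
            check.Parameters.AddWithValue("@name", StepName);
            if ((int)check.ExecuteScalar() > 0)
                return; // already done; rerunning is a no-op
        }

        // 2. Guard the schema change itself, so a rerun after a partial
        //    failure doesn't blow up on "column already exists".
        using (var alter = new SqlCommand(@"
            IF COL_LENGTH('dbo.WorkItems', 'Priority') IS NULL
                ALTER TABLE dbo.WorkItems ADD Priority INT NOT NULL DEFAULT(2);", conn))
        {
            alter.ExecuteNonQuery();
        }

        // 3. Record completion last, so a crash before this point just
        //    means the step runs again, which is safe by construction.
        using (var mark = new SqlCommand(
            "INSERT INTO dbo.ServicingProgress (StepName, CompletedUtc) " +
            "VALUES (@name, SYSUTCDATETIME())", conn))
        {
            mark.Parameters.AddWithValue("@name", StepName);
            mark.ExecuteNonQuery();
        }
    }
}
```

A test harness like the one described can then kill Run at arbitrary points and call it again, asserting that the database always ends up in the same final state.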
All right, so now I'm going to talk about safe deployment. The term was born out of the Azure group, who looked at the way we deploy Azure and came away with principles, rules, and guidelines that help protect our customers as we deploy. That's really what it's about, and why we do it: to protect our customers from LSIs and bugs. Most of the LSIs that happen on our service are introduced by us, by deployments, by changes we make in our code that deploy out and don't work right.

So what we do is deploy changes first to a very small set of customers who have signed up for the risk, who've said "yeah, give me the changes first," and then progressively roll out to larger and larger sets of customers. Incorporated in that are automated health checks and rollback, and bake time is another pretty important principle.

We segmented our services into deployment rings. This definition is for our biggest service, TFS, but all of our services define a set of deployment rings. Ring zero is customers with a high tolerance for risk and bugs. That's where mseng is; that's where we live. Our mseng account is on our scale unit zero, so when we deploy binaries on Monday morning, they go to scale unit zero. An interesting thing about scale unit zero is that it's in West Central US. It used to be that we'd be cruising along and all of a sudden get an incident in West Europe: what's going on in West Europe? There's some hiccup with SQL, and we haven't deployed any changes recently; we don't know of anything going on. Well, that just happened to be the first region SQL Azure deployed new bits to, but we didn't have visibility into that. So we've aligned with Azure to deploy to West Central US first. Now if the SQL Azure team introduces a regression, there's a really good chance it impacts us on mseng first, and that's been really good for catching platform-level issues.

Next, we use a lot of VSTS, but we don't use all of it. For the areas we don't use, we want to go to a small scale unit where we can get enough customers using the breadth of the product, so that if we break something in a part of the product we're not using, we can uncover it there before going out to a large scale unit. The other principle here is that we want that scale unit to be roughly in a US time zone, because some bugs only surface during peak hours. If it's in the US, we'll hit the issue during our daytime and can work any incidents that come up.

Quick question: I was under the assumption that the rings map to data centers. How do you make sure that ring zero has accounts that use TFVC, that use hosted builds? Do you move them together so they have certain characteristics, so you can run certain tests?

For ring zero, we don't. We don't do anything special to get accounts in there that use the breadth of the product, so some issues will slip to ring one. Ring one today happens to be Brazil. It's not a huge scale unit, but it's big enough that if we broke something, we'll find out. And it's Brazil today; we could change that, it's not a fixed thing.

Then we deploy to a medium-to-large US data center, where we can catch a class of scale issues. We tend to find any other bugs there that didn't get uncovered in ring one, any functional bugs, but it's mostly there to catch scale issues with our servicing; there are about a million and a half accounts in that data center, so it's good for catching that class of issue. Then we deploy to one of our internal scale units; we have three of them, and we pick one to deploy to. There are scale and load characteristics we see on our internal accounts that we don't see on public accounts, so this gives us a chance to catch those classes of issues. For example, MS Azure is in this scale unit.
They really hammer us in a lot of ways, so if we've regressed something, we can catch it there before we go to the next scale unit, which has the Windows team on it. We'd rather catch it with the Azure team than the Windows team. And then we go to everyone else.

OK, so I mentioned bake time. This is an important principle: we want to allow bake time between the phases of a deployment and between the rings. For our sprint deployments we allow at least a day in between; I've got a graphic that will show a little more clearly how we manage our sprint deployments. For our daily deployments and config changes, we typically do those over two days, with delays between the rings, so that latent bugs have time to surface.

We had one incident where we broke a particular kind of build task. Before we introduced these delays we were deploying faster, so in the course of four or five hours we deployed to the entire service. It wasn't a hugely widely used build task, and by the time customers noticed "hey, my build's failing," called support, and it got to us, and we figured out that anybody using this task was broken, we'd gone out to everybody. So the developer got the fix in and we deployed it, thinking: oh man, this is killing me, it's going to take five hours to get this fix out to everybody, we need to make our deployments faster. Really, the converse was true: if we hadn't gone so fast, the bug would have surfaced before it got out to everybody, and it wouldn't have become the huge issue it did. And we had other classes of incidents that really only surfaced during peak times, so that's another important principle.

So this is a typical sprintly deployment schedule. Along the top we have... oh, question.

Sorry, just on that note with the delays: are those delays hard-coded in, or are there approvals, or do you do any telemetry monitoring? How are those delays built in?

I'll talk more about that, but yes, they are hard-coded in right now.

So we have the days of the sprint across the top, 15 days in a sprint, three weeks, business days, and the rings down the side. On day zero, Monday, we do binaries to ring zero. Then we'll do a full deployment on Tuesday, if we didn't hit any bugs. If we hit bugs, sometimes we roll back on Monday; it's not uncommon. Then we'll do it again on Tuesday. We want to get clean on our binary-only deployments before moving on. Then typically we have some bake time; Thursday we go to ring two, bake time over the weekend, and on Monday we go to ring three. It's not wholly uncommon to uncover new bugs on ring three; sometimes we even have to do a hot fix. Actually, the most common hot fix is on these two days. If we did a hot fix, we'd take it through ring two, because that's as far as the new bits had been deployed. And then each day we're doing daily deployments from the previous sprint for any bug fixes that went in. So that's what that looks like.
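As a rough illustration of the rings and bake times just described, here is one way the structure could be modeled. The scale unit names, counts, and delays are approximations assembled from the talk, not the real RM configuration.

```csharp
using System;
using System.Collections.Generic;

// Approximate model of the deployment rings described in the talk.
// The names and bake times are illustrative, not the real RM config.
public record Ring(int Number, string Audience, string[] ScaleUnits, TimeSpan BakeAfter);

public static class TfsRings
{
    public static readonly IReadOnlyList<Ring> Definition = new[]
    {
        // Ring 0: internal, high tolerance for risk (mseng lives here).
        new Ring(0, "Internal, risk-tolerant (mseng)", new[] { "SU0" }, TimeSpan.FromHours(1)),
        // Ring 1: small public scale unit exercising the product's breadth.
        new Ring(1, "Small public scale unit (Brazil)", new[] { "Brazil" }, TimeSpan.FromDays(1)),
        // Ring 2: medium-to-large US data center; shakes out servicing at scale.
        new Ring(2, "Medium-to-large US data center", new[] { "CentralUS" }, TimeSpan.FromDays(1)),
        // Ring 3: the internal scale units (MS Azure's is one of them).
        new Ring(3, "Internal scale units", new[] { "SU3a", "SU3b", "SU3c" }, TimeSpan.FromDays(1)),
        // Ring 4: everyone else; 4A (the Windows scale unit) is kept separate
        // so a targeted fix can go ring 0 then 4A without touching the rest.
        new Ring(4, "Everyone else", new[] { "WestUS2-1 (4A)", "and ~10 more" }, TimeSpan.Zero),
    };
}
```

The point of the structure is blast radius: each hop increases exposure only after the previous ring has had time to bake.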
And then the hot fix process: we can do those on demand. Question? Oh, sorry. Can you go back one slide? Yeah. So you do the hot fix to ring two because that's where you have the deployment? Yes.

But the hot fix is done on the release branch? It is, always.

So on day seven, the hot fix is on the release branch, and then you deploy the release branch further to rings three and four? On day seven we're just doing a daily deployment; on day eight, usually. But that already includes the hot fix, because you're only deploying the release branch? Right, because we're always releasing the latest green build.

And what if, and I don't know if it happens but probably it does, the deployment error is not in the binaries but in the database update? Well, each of these deployments does database servicing too. Yeah, but you mentioned earlier that when you do a binary update it's easy to roll back, you just pick the previous version of your binaries, but you can't roll back the database. Correct. So if the error is in the database update? Then you've got to put in a fix and deploy the fix. Yeah, fair question.

In other words, when somebody prepares a hot fix, it's going to include other changes too: any changes introduced since the time we queued the previous build. We don't cherry-pick a one-off hot fix and deploy just that; we pick the latest. Whenever you put something in the release branch, you've got to know it could get deployed at any time.

For hot fixes: if we have a SEV 0, typically that's going to be on one of our singleton services, and we don't have ring definitions for those, by fiat. If it's a SEV 1, it's not uncommon for us to hit data shape issues in a particular account, like the Windows account; some accounts have a weird data shape that makes an issue very impactful, a SEV 1 for them. For that class of issue, we have the capability to go to ring zero first, validate that things look good in production, and then go straight to the impacted scale unit. Most of our incidents are SEV 2 or SEV 3, and in that case you put the fix in and we still go through all the rings, with a few delays, but it's expedited to finish within that day.

All right, so we have some future work slated. Manil talked about the preflight scale units; we'll have a preflight for every service, and then configure accounts so that the preflight of a service runs against the prod of all the other services, and we get compat testing that way. And then longer term, to your earlier question about the signal: right now, during these delays, we have the 15-minute check after binaries, and then we're sitting there, and rolling back is a manual thing. Somebody's got to say, "hey, wait, we've got a bad problem," and then we go do the rollback manually. What we're working on is getting our L3 test signal, outside-in, as well as signals coming from our telemetry, so that when we can detect things are red, we'll be able to roll back automatically. That's something we're working on with the RM team.

Question: could you describe the report mechanism when users say "I found something" and you're in the middle of your deployment? Are they contacting you, reporting it through a web service, or are you just using telemetry? How are you aware there's a problem?

Yes. It could be internal users, external users on Twitter, external users emailing Brian.
It could also be through telemetry; we have all kinds of alerts configured against our telemetry. Tom's going to talk more about it, but if we drop under our SLA, for example, we can detect that, alert on it, and then get in contact with the right folks to stop the deployment.

OK, I'm going to talk a little about our tooling. We've moved to Release Management in VSTS, which has been very good, and I'll talk about tooling futures, the work we still need to do. This is what our deployments look like. Currently, rings zero, one, and two each have only one scale unit in them; ring three has three scale units; ring four has about 11. Right now, this is how it's modeled in RM; this is the actual screenshot.

So what the heck is 4A? It's an interesting thing. 4A contains the Windows scale unit, West US 2-1. The reason we keep that as a separate ring is what I talked about earlier: if we have an issue affecting a particular scale unit, we have the capability to expedite a fix. It's not uncommon for us to hit issues in ring 4A, in the Windows account, so we can queue a fix to go to ring zero and then straight to 4A, without impacting all the other customers in ring four. That's the idea there.

Now, we're working toward breaking all of our scale units into environments, and there's a short-term thing coming that will enable us to do that. You can imagine that with 11 scale units, we want to cut out the manual work. Long term, there's going to be a notion of rings in RM; short term, we'll do it by naming convention, but having that facility in there will be good. Because currently, if a deployment fails in ring four, what shows up in the tool is "ring four failed," which is not very helpful. This way, we'll be able to see exactly which scale unit failed and get the logs for it more cleanly. So that'll be good.

OK, here's what an environment looks like. We have the update-binaries step and the update-databases step; these guys do the heavy lifting. They run the same command-line script you saw in that very first screenshot with the red command line, the same thing we run in our dev and test environments. And then this is how we have our delays configured. Right now we configure the wait times with variables, and you can see it says "manual intervention task." What the heck? It's not manual at all. The RM team put in a special feature for us where you can put a timeout on the task, and if nobody does anything within the timeout, it just continues. We've configured it to continue, so really these are just delays built into the deployment. Just last sprint, the RM team delivered a delay task, so we'll be replacing these with that.

This is a deeper look at the health check. Question?

Ring one had an additional phase and task; what would that be? Which one? Between ring zero and ring one, there was a difference in the number of phases and tasks.

Oh yeah, good question. That's a manual pause between rings, which we have on ring one. We've implemented it as waiting before the next ring: you saw we do ring zero, wait 60 minutes, then ring one. That task is the wait-60-minutes, and there isn't one of those in ring zero. That was a good pickup. Any other questions?
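For the curious, that manual-intervention-with-timeout trick is essentially a race between an operator signal and a timer. Here is a hedged sketch of the pattern in isolation; it is not RM's implementation, just the shape of it.

```csharp
using System;
using System.Threading.Tasks;

// Sketch of a pause gate that behaves like the manual intervention task
// described above: an operator may resume (or reject) early, and if
// nobody acts before the timeout, the deployment continues on its own.
public sealed class PauseGate
{
    private readonly TaskCompletionSource<bool> _operatorSignal =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public void Resume() => _operatorSignal.TrySetResult(true);   // operator: proceed now
    public void Reject() => _operatorSignal.TrySetResult(false);  // operator: stop deployment

    public async Task<bool> WaitAsync(TimeSpan timeout)
    {
        Task completed = await Task.WhenAny(_operatorSignal.Task, Task.Delay(timeout));
        // Timeout with no operator action means auto-continue (a pure delay).
        return completed == _operatorSignal.Task ? _operatorSignal.Task.Result : true;
    }
}

// Usage: pause up to 60 minutes between ring 0 and ring 1.
//   var gate = new PauseGate();
//   bool proceed = await gate.WaitAsync(TimeSpan.FromMinutes(60));
```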
What are the typical values for the pause, from ring zero all the way through ring four? What's a typical time duration?

I'll go back to... sorry, maybe I shouldn't have done that. This slide shows the delays we have. To put it in context: an hour between binaries, and we wait longer earlier on. By day two, this takes about a day, so we put in a longer delay there because we have the time; might as well.

Question: was that the build definition or the release definition? Release definition. OK.

And one thing, sorry, maybe I didn't fully get that: here it says that basically in two days you update all four rings, right? Yes. But on the previous slide, with the chart, it seems you're doing this over a period of many days.

So this is showing one of those days, where we go through all the rings with a daily deployment. Let's pick this day: we're going to rings zero and one with a new deployment, and we're doing a daily deployment to two, three, and four. Basically we start right here: ring two binaries, wait an hour, ring two servicing. Then the next day, we kick that same release off for rings three and four. Does that make sense?

So probably what I'm missing is: what's the difference between a full deployment and a daily deployment? OK, typically when we talk about a full deployment, it's a deployment to a ring that includes binaries and servicing. So really, all the deployments are full deployments from that standpoint. OK, I'll ask later. OK.

Got a question? Other questions? Just a quick question about this: when you give just an hour, I'm talking about ring zero, do you have some way of saying, "hey, guys, a new version is out, go kick the tires and let us know"? It's just one hour, right? How can you get a statistical sample that shows nothing's going on?

Yeah, that's a good question. Let's say you have a fix going into that build and you want to verify it. We use a Teams channel; we have a Teams chat to say, "hey, we're doing the deployment," so anybody can come in there and look. Also, anybody can go into VSTS and see where the deployment's at. They can reach out and say, "hey, we want to stop after ring zero to validate." But normally we just go through without stopping. And that was an interesting thing: I said teams have a way to file deployment-blocking bugs, and a lot of times what a team thinks is a deployment-blocking bug isn't really deployment-blocking, because it's a big team with a lot of changes going in, and we've got to keep the train moving.

So for the smaller teams, it's more of a heads-up: people will say "hey, there's a deployment going on" if you're really invested in it, and then there's the casual user during that hour who might run into the issue? Yes, exactly.

And considering that this is something as big as the VSTS team, what's the sample? What's the number of people in ring zero that might be affected? Like 600, if you can share that. It depends on the time of day; maybe 400. That's a lot, a big number. And it's really, really bad when we take mseng down. It really sucks for everybody.
So I was just thinking: 600 is still a very good number to hit. Statistically you might have some people who don't even know about the deployment hitting that particular functionality. Right, and then they'll help test. Yeah, I was just more concerned when you go through the other rings; one hour, two hours just didn't seem like much time for you to engage.

I see. Yeah, it is a good question: going through a deployment, how long is long enough? Longer term, and I'm going to talk about this, we're splitting up binaries and servicing so we can roll back at any time. Right now what it looks like is: OK, I'll hit the floor, and if no alarms fire, I keep going. That's basically it. We do have the basic perf counter health check too, and we've got alerts firing; that's another signal we get. We've got a 99.9 SLA, and if we drop below that through errors and slow requests, it triggers an alert, which would then trigger us to stop the deployment. So it's a combination of user action and telemetry.

Question: I hope I'm not reading too much into the slide, but between this slide and the previous one it's a little confusing. The dailies over here have a different color, and after the full deployment is done, the dailies are in a different color again on this slide. On the previous slide, we're saying that on day one we deploy to ring zero, and then to ring one as well as ring two, and ring three and ring four get it on day two. But on the next slide, you're not.

Yeah, good point. I had a slide with actual build numbers in here, which was just a lot of information. But the build here is a 12 build, and the build going here is an 11 build. So actually, we take this build and deploy it like that. On this day, the 11th, we take a green build and deploy it through ring two. That deployment sits overnight, and then the next day it continues on to three and four.

So on that very same day, day 11, you have two other deployments, on rings three and four. Is that the previous build? Yes. And is there a particular reason for color-coding these before and after the full build? The dailies are different colors? Oh, it's different sprint bits. Correct. So that's the previous build there? Yeah; this is, like, the 123 build, and this is the 124 build. Understood, OK.

So just so I understand the slide: for that hot fix between day six and seven, on one of those days you did two deployments? Exactly, yeah. Just curious why you wouldn't do the hot fix as part of the daily build? Because there's an LSI that we need to mitigate, and it can't wait until the next day. That would be a SEV 0, 1, or 2; even at SEV 2 we want to get a fix out. OK, fair enough.

All right, yep. And if this is too in the weeds, push me off, feel free. You've got two-day deployments and multiple environments.
Is your configuration for Release Management somehow in version control, or do you use some other method to keep someone from changing a task, a setting, or a parameter that would make the ring four deployment different from the ring zero deployment for the same build version?

Yeah, great question. We do not have protection for that. It's the same when we move scale units between the rings; you've just got to do it carefully.

Could you also share the slide with the numbers, the one with the builds and the sprints? That would actually help a lot, because I had to think about it for a minute. OK, that's fair.

And one more question: is it necessary to have these five rings, or can I reduce the number of rings?

Yeah, that's a good question. You can definitely reduce it. Some of our microservices have four rings, so it really just depends. But I think if you think about this in terms of your deployment and how you deploy your services, it's a good thing to think about. Initially we just thought about setting up new TFS instances whenever we felt one was full; now we think about it more in terms of blast radius and safe deployment. Having three or four rings should be just fine. OK, those are good questions. Thank you.

I think we've talked about this and this, and I've covered most of these things, but I'll reiterate. We're going to model each scale unit as an environment and let RM control the parallelization; we get a bunch of wins in our tooling out of that. And we have the new delay task, and we'll be replacing our manual intervention tasks with it. The manual intervention task is confusing for people who aren't used to seeing it; they come in and say, "wait a minute, what's that?" The other thing is that if you set the delay to zero on the manual intervention task, it means wait forever, which is really confusing. And then we're incorporating the outside-in and L3 test signals into our decision to roll back, so that any time during the wait period between binaries and servicing, we'll be able to roll back.

OK. I know Buck talked this morning about controlling the exposure of features to customers. I want to talk about it in the context of these deployments and rings and how to think about that. We have four mechanisms for controlling exposure to customers: safe deployment, where bits move through the rings; our feature flag tools; customer opt-in and opt-out; and A/B testing. As a team, we really don't use A/B testing. I'm not sure why; it's just not something we've done a whole lot of, so I'm not going to talk any more about it. We can do it, but we don't.

So, feature flags. We have this notion of stages, and stage zero is internal accounts. Just recently we had people who said, "hey, as soon as that stuff's on mseng, I want to enable these feature flags, but when I do, I get errors." That was because not all the stage zero accounts were on SU0. So we fixed that: we moved all those accounts onto SU0. Stage zero is really the mseng account, plus some test accounts; I've got some, other people on the team have test accounts, and Brian's site where he runs his farm is in stage zero too.
And then there's stage one; some of you might have your accounts in stage one. The idea is that when we have a new feature we want to enable for customers, we do stage zero first and then stage one. Stage one has maybe 175 accounts in it.

So what does that look like with our deployment? Let's say you've got a new feature in sprint 124, and we deploy sprint 124 to ring zero. Now we're able to enable the feature flag for stage zero accounts, because we know they're all in ring zero. But the other accounts could be anywhere, so you really have to wait until the deployment proceeds through all the rings before you can light up the feature flag for customers in stage one. That's how those two things compose together.

Question: is that again automatic, via PowerShell, or is it manual? The feature flags? You can do it either way; you can enable and disable feature flags as part of the deployment. For stage zero and stage one, we have a feature flag tool configured with the list of stage zero and stage one accounts. Buck showed a screenshot of that, kind of a crappy internal web UI.

If we're not already a stage one account, how do we go about getting included? That's a good question; you could shoot me an email. Is it open for anybody to sign up for a stage zero or stage one account? Anybody, no. Are you partners? Yes. OK, then if you would like to get added to that, shoot me an email.

All right. And then we have customer opt-in and opt-out. This is a relatively new thing; we rolled out the new nav this way, and it's been really good. If you go to your profile, there's a "preview features" menu item there. This is me on mseng: I can hit that dropdown and, if I'm an account admin, turn features on and off for the account. You can see some of these are on and some are off. The engineer, usually working with the PM, decides on a rollout plan for their feature. A lot of times we'll initially roll things out default-off, then go into some test accounts, turn it on, and test it out in production. Then we might turn it on for the mseng account, start getting feedback, say "oh, that didn't work very well," turn it off, and iterate that way. We can control it for external accounts as well: it's default-off, we give it a couple of sprints of iteration on mseng, and then when we feel ready, we turn it on by default for customers. They can always opt out, and when a customer opts out, we pop up and ask, "hey, why are you opting out?" We collect that data from our customers and iterate on it. Eventually these toggles go away and that's just the way the product works. It's been a good way to introduce new features.
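Here is a small sketch of the composition rule described above: stage zero flags can light up as soon as ring zero has the new bits, while stage one has to wait for the full ring rollout. The types and the minimum-ring mapping are invented for illustration; the real gating lives in the internal feature flag tool mentioned in the talk.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: gate feature-flag stages on deployment progress.
// Stage 0 accounts are all on SU0 (ring 0), so their flags can be enabled
// as soon as ring 0 has the sprint's bits. Stage 1 accounts live on
// arbitrary scale units, so their flags must wait for the last ring.
public enum Stage { Stage0, Stage1 }

public static class FlagRollout
{
    private static readonly Dictionary<Stage, int> MinimumRingDeployed = new()
    {
        [Stage.Stage0] = 0, // mseng plus team test accounts, all on SU0
        [Stage.Stage1] = 4, // ~175 opted-in accounts, could be on any scale unit
    };

    // True if it is safe to enable a flag for the given stage, given the
    // highest ring the sprint payload has fully reached.
    public static bool CanEnable(Stage stage, int highestRingDeployed) =>
        highestRingDeployed >= MinimumRingDeployed[stage];
}

// Example: sprint 124 has only reached ring 0 so far.
//   FlagRollout.CanEnable(Stage.Stage0, 0)  -> true
//   FlagRollout.CanEnable(Stage.Stage1, 0)  -> false (wait for ring 4)
```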
So is this driven by feature flags as well, or is this a different mechanism? There are feature flags on the back end, and we call this a contributed feature. With extension management there are contribution points where you can plug into menus and different points in the UI, and the way we discover those contributions is how these get surfaced here. So it's loosely coupled for the developer: we discover these through the generic contribution model by looking for contributions to the feature service.

When you surface that to customers, do you give them warning that it's going to go away at some point, or at some date does it just wake up and be gone? I guess it depends. For most things, no, we just make it so. The nav was a very disruptive change, just because it interrupted people's motor memory, so changes like that we're a little more careful about. It's also hard to communicate those things in a way that doesn't become annoying and noisy, so it's kind of tricky. Any other questions?

OK, I also want to talk about compatibility in the context of deployments. This has come up in Buck's talk and in Manil's talk, so I'm just going to summarize it here. There are actually a lot of different compatibility considerations we have, and I'm going to talk about three of them. The real takeaway for you is to think through, for the services you're developing, what types of compatibility you have to be concerned about, and then develop coding patterns that enable you to deal with them.

First: we deploy new binaries before we deploy a new database. This is what the code might look like. Say there's a property that's newly serialized in 124; you code defensively and check whether it's null. This example happens to be a serializer, and it goes ahead and fills something in so that we don't get null refs when callers expect that property to be populated. And then here's a quick unit test where we force this condition on a definition, rather than actually standing up different instances of the services; doing it via unit test is kind of clever.

The second level of compatibility is service-to-service compat. We can have 124 TFS calling 123 SPS, and similarly 123 TFS calling 124 SPS. When a call is made, the JSON artifact that comes back includes a version on it, so similarly you can write code that says, "hey, if this is an older resource," and then whatever you do to protect yourself is specific to that particular thing. It ain't pretty, but that's what the code is going to look like. Somebody asked whether we come back around and clean this stuff up: yes, we do, but we don't enforce it. We don't have any metrics that say how much 120-era code is still sitting around.

The last thing to think about is persisted data. Take build definitions, for example; this is the build definition binder. We store build definitions as JSON in the database, so we might have a 124 database and 124 binaries but be reading a 120 build definition. There are different ways to handle that. Sometimes, as part of upgrade, we'll choose to upgrade the assets as well; sometimes we'll just deal with different versions of the assets. So there are different strategies you can employ to handle the compatibility of your resources. All right, any questions about that?
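To make those compatibility patterns concrete, here is a hedged sketch of two of them: a defensive default for a property that only newer binaries serialize, and a version check on a persisted JSON artifact. The type and property names are invented for illustration; the talk does not show the actual code.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Pattern 1: new binaries, old data. Sprint 124 binaries add a Priority
// property, but a 123 database (or a 123 peer service) won't have
// serialized it. Backfill a sane default instead of null-reffing.
public class WorkItemSnapshot
{
    [JsonPropertyName("priority")]
    public int? Priority { get; set; } // null when produced by pre-124 bits

    public int EffectivePriority => Priority ?? 2; // defensive default
}

// Pattern 2: persisted data. Build definitions are stored as JSON with a
// version stamp; 124 binaries may read a definition written by 120 bits.
public class BuildDefinitionDocument
{
    [JsonPropertyName("schemaVersion")]
    public int SchemaVersion { get; set; }

    [JsonPropertyName("body")]
    public JsonElement Body { get; set; }
}

public static class BuildDefinitionBinder
{
    public static BuildDefinitionDocument Load(string json)
    {
        var doc = JsonSerializer.Deserialize<BuildDefinitionDocument>(json)!;
        if (doc.SchemaVersion < 124)
        {
            // Older resource: apply whatever per-version fix-ups this
            // asset needs (the ugly-but-necessary branch from the talk),
            // or migrate the stored asset as part of upgrade instead.
        }
        return doc;
    }
}
```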
All right, and that is my last part of content, so to wrap up, some key takeaways. You might think deployment is painful, so you'll do it less often. The truth is the opposite: the more you do it, the easier it gets. So deploy often. Get a clear and reliable test signal so you can stay green throughout the sprint; that makes it so that when you're ready to deploy, you can actually go deploy, and you're not hit by surprises. Have consistent tooling across dev, test, and your production deployments, so there's no special sauce happening when you go to deploy, because when you have special things happening at deploy time, they break, they don't work. It's a good technique for hardening your deployment tooling. It doesn't have to be RM, but use a tool to automate and orchestrate your deployments. We've gotten a lot of benefits from that, as I've talked about, and a key one is enabling your engineering team to do deployments. My last takeaway is to follow those safe deployment practices. If you're implementing services, think about how you can have isolated instances of your service and incrementally deploy to them, so that when you introduce bugs as part of your deployment, you're able to shield most of your customers from them.

All right, that's the conclusion of my talk, if there are no more questions. Yep?

Quick question here about your usage of RM: there's no hidden feature, you're using the same thing everybody else has access to, right? Correct, yes, the same thing. Well, I was going to say a little more about that: sometimes we'll get features enabled first, before you. But yeah, no secret sauce there. In fact, I've got a pretty long list of requirements I've sent to Gopi, and I think one of them was this notion of a ring, or a deployment group, or whatever. They were like, "oh, that's too complicated, that's just another thing for our users, I don't know how useful it is." Then Brian said something to them about it, and now they're planning to do it, so that's good. Yeah, nice. Thank you. It's a great case for RM, actually. Yeah, it's been really good.

I want to share this before I forget again; Buck mentioned it. We deploy from mseng: we do our builds on mseng, our build artifacts are stored there, and we do our deployments from there. So what happens when we break mseng, when we deploy to ring zero and our deployments break? Well, we go back to that red command line, actually. We hop onto one of the deployment machines, get the code fix onto the deployment machine, and then we can deploy. We practice that every other sprint or so, just to make sure it still works. But yeah, that's kind of a funny thing.

So, there were some questions. Can we go back to the slide where we're showing all the rings, one of the tasks? In RM? Yeah. OK, I'm clicking. And, following up on his question, could you also show that slide in the back with all the build numbers? Yeah, I have it hidden, so I've got to edit my deck; bear with me. Did I go by it already? No. OK. Dang, this is slow. All right. Yeah, we can stay on that. This one. Yeah, OK.

So: we check in the code, and we have a build; we have one repo. Actually, we have several build definitions that build the product in different flavors, and one of those is what we use to deploy. Specifically, we sign the binaries that we deploy to the service. Then we have other builds that don't sign, which we can pick up more quickly for our test environments, and other builds that package up TFS for on-prem. So, same code, same repo, different build to go build the MSI.
OK, so once you've got the build artifacts, does that automatically trigger the release definition to pick up those artifacts and start deploying? No, actually that's manual. A deployment driver decides which build to pick up, creates a new release, and picks the build they want to deploy. We go to that screen with all the green, find one that's green, and then we also have the blocking bugs facility I mentioned. So we'll find a green build, check there are no blocking bugs, and then queue the deployment.

And are there any functional tests being run on ring zero? No, that's the L3 testing we're working on. That's the future. Yeah, we're actively working on it; it's coming soon.

There are some more questions. One of the things about the DevOps mindset is having engineers not only write the code but actually run it. Can you talk about how that's handled within your team? Who's in charge of creating those builds and those releases? I'm talking about not triggering them, but actually creating the release definition, the build definition. Is that created by the engineering team on their own, or is there another team? And when you're in that deployment phase and planning how you're going to run your application, you need to be on a common platform or framework; you don't want one team deciding to run their service one way while another team does it another way. How do you standardize or align on those things? Is there a team doing that, or, since we talk about autonomy, how autonomously do our teams decide on those kinds of things?

Yeah, OK, good question. So what does it look like when I'm on a team and want to stand up a new service? We have the framework we've talked about, and a sample service that runs on the framework. Typically, if you're going to start a new service, you start with the sample, make a copy of it, and start going; it's pretty bare-bones, but now you have it. We have one build, and each independent service is rooted in the source tree, so you create your service within that, and it becomes part of the build, part of CI, and all of that. Now, to get tests running, the team that owns the service has to set up the test runs for it. We've got facilities for that, machine pools and test machine pools, so you can get your test runs up and running on your service.

Then you say, "OK, I think I'm ready to go to production." Typically you'll have been working on this for two or three sprints. Once it goes into production, it's like an airplane in flight: you've got to do all your servicing on the flying airplane from that point on. So I tell people to be careful about when they do that, because there's a new commitment you're making once you do. But now you want to deploy the service. There are different levels of support, and we've got the different geos: stuff in the US and Europe and kind of all over. Typically we're going to stand it up on SU0 first and try it out in production. And for integrating back into TFS and the other services, we do that through extensions; Release Management runs as an extension, for example.
And we'll configure that extension on the mseng account, so Release Management lights up there and other accounts don't get it. That's how we control exposing the new service, the new UI.

So just to follow up on that: let's say I'm part of a team, and as part of what I'm doing I need to stand up something new, a new cache or something that probably doesn't exist currently. Do I have to go to another team to get that done? Do I do it myself? Do I have the code infrastructure to actually do that and deploy it?

Yes, that would be you. Now, we do vet new architectural patterns, so if you're going to take on a new architectural pattern, there's a bunch of work around that. One thing is, if your bits are eventually going to run in TFS on-prem and you take a dependency on some service in Azure, you've got to figure out how that's going to work. We've got some services running on table storage; there isn't a one-to-one mapping of table storage to SQL. You might think there would be, but there's work to do to make that happen, and you've got to do that work. We took on Redis on my framework teams in a way that, if Redis isn't there, things just keep working on-prem. If you're a microservice and you want to take on some new architectural pattern, there's a ton of security work, and you've got to figure out on-prem, the deployment, the telemetry. If you just follow the architectural patterns that have been established, you get all that for free; if you're going to do something new, there's a lot of work to get it going. And you're free to do that: if you want to take on that work, you're empowered to do it. We've got an architectural v-team, and we want to take new patterns to that v-team. You're probably going to want buy-off from your director and to go through that architectural v-team so we can vet it, but ultimately it's up to you. Yep. Question?

Are you able to show this actually in VSTS? Seeing it on the slides is good, but can we see the build and release definitions in VSTS itself? OK, I'll show you.

Here's the Teams channel I talked about. People can come in here and chat if they've got issues going on or whatever. And we actually had to do a hot fix this past Friday night, so the DRI put a note in the channel to the deployment driver saying, "hey, we had to do a hot fix." So here are our TFS prod deployments. This one, you can see, is 122. This is a little confusing; I honestly don't know what happened here. This is our 123 deployment happening right now; we're going to ring two today, you can see that's going on right now. And then we're taking the 122 bits through rings three, four, and 4A. All right, now this is kind of weird, this doesn't usually happen; I'm not sure what happened. That must have been while I was preparing for my talk today. Let's see what else is interesting here. You can see we went to ring zero with the 123 build and to the rest of the rings with the 122 build. So if I go look at this one, look at the history.
You can see that on 9/18 we succeeded. And 9/16, which was Friday night... oh, that was the Friday night one, at 1:03 AM. That was not a fun night, but we did it. And then you can see that on Monday we carried it through to the rest of the rings.

All right, and here's our Runs dashboard. This is master. These are almost certainly flaky tests; you can see a little bit of flakiness going on here, but it's mostly green. So when we go to deploy: this morning, this is our 123 run. If I hover here, you can see... sorry, where'd my clicker go? This is showing 123; it's the 123 build. And this panel is 122. So if we're going to deploy 122, we'd come over here and find a good 122 build; we would deploy the .15 build. So that's how it goes. Does that make sense?

Can you show the build and the release definitions, please? The release definition? Sure. It was my screenshot, I promise you; let's see. So I'll go here. Here's that screenshot I showed you. It was just so good, we wanted to confirm. Trust but verify.

And here are the tasks. So I can show you, let's see: the run-on-agent, so let me see here, our variables. If I go here... see, I'm used to looking at this when I approve a deployment, so with these variables this just looks different to me, hang on. What's that? Tasks. OK, yeah, thank you. So these are our manual intervention tasks here, configured with a variable for the pause. Where is that? Yeah, it's this thing. And I can't figure out... I'm used to approving a deployment, where I can switch the variables. Gopi, can you help me out here? Maybe just open the variables tab and click on it. Process... oh, click on the grid? OK, yeah. I'm not used to this view, sorry. But here you can see the pause time configured by default for each of our pauses between deployments. What's that? That's a 123 feature. It looks different when you're approving deployments versus in the editor. All right, does all that make sense? Definitely. I'll show you this one: this is zero. This is our ring three, set to zero to stop after ring two. That's what that zero means.

Any other questions? I think... like, here's the deploy command. It just runs that PowerShell script. Anything else interesting you want to see?

I'd like to see the relation with the release branch, because you talked about many teams being in control of their own releases, but since I assume there's only one release branch per sprint, that's where they do their bug fixing, I think.

Right. Yeah, so here's how it relates. This is release 123; we're getting lots of check-ins throughout the day, and these are going to different services. Then when you want to do a deployment, we've got the build. This is the release build, vso.release.ci, so this is the one doing the signing. You can see here's a .36 build, so if I come over here, I can see the .36 build and the tests that ran for that build. The tests are configured as release definitions as well; it's a little confusing, but each of these, like the self-host, is a release definition that then goes and runs tests. That's how that's working.

And to add to that: the release branch is created at the start of the sprint. In the sprint, yeah.
So all the check-ins that are done on the release branch are basically bug fixes then? Yes, it's up to the lead. We'll do some feature tweaks in there too. Or, say we've got the Build conference coming up and we promised some big feature and there's a lot of feature work to do; then we can do feature work in the release branch too. It just depends on the circumstances, but typically it's going to be bug fixes or polishing for features.

All right, let's see if there's anything else I want to show you about this. Here's my blocking-bugs query; I'll just show you that. We have these different tags people can put on a bug to mark it as a deployment-blocking bug. All right, cool. Who asked me for the demo? That was good. OK, any other questions for me?

You were still going to show that slide at the end of the... Oh man, OK. I hope it's what I think it is, so let me see. Yeah, there it is. See? What happened? What the heck? I didn't do that. He had to shut you down. That's weird; you jumped into presenter mode, you've got to duplicate your screen or something. I didn't. OK, yeah, so here you can see. I guess I should have shown that slide; I dumbed it down. But here's the 8.2 build going through on day seven, and it hops over to the next day. Now it makes perfect sense. Thank you. Lesson learned, there you go. All right.

OK, I'll talk a little bit more about this. Maybe there weren't any bug fixes or changes; it's pretty late, by this time it's week four, we've been out there for three weeks, so maybe there's a day when we don't have a deployment. It just depends on whether there are fixes to go out. But the 112s: this is showing the progression of 112 builds, and these are 111 builds. It's also showing the different numbers, .1, .2, .12, whatever; our builds could be .1, .2, or .12, it just depends on how many commits we had that day. The number over there at the bottom, in blue, isn't very clear, but I hope that is... yeah, that is 6.1, so 112.6.1.

So you're basically deploying 6.1, and then rings three and four on day six, is that correct? That's a color bug; it should be this light green shade. Correct, but it's the same build that's going over there, right? Yeah, this build is going here, yes. OK, and then you have a new set of builds starting in rings zero to two, which eventually keep on sliding. Exactly. So somewhere down there, is that the hot fix, after day nine? Yeah. So that hot fix was deployed to all the environments on the same day? Well, it depends on the severity, but typically it's going to be the same day. Same day, yes, because the same build, whatever it is, 9.2 or something... 9.12, yes... gets rolled out to all the environments. Exactly; they're all experiencing this bug. Good, this makes sense. Thank you.

OK, yep. All right, I think that's a wrap. Have you got any more questions for me? Nope, and we've got 30 minutes back. That's cool. I thought we'd have more, but I'm glad you asked all the questions. That was good. So thank you.