So, good morning. Thanks, everybody, for coming out. I know folks have traveled quite a way, and we appreciate it. This whole event has been about looking back at our journey and sharing what we've learned, and as I was preparing for this presentation I thought back to when I joined the group in 2012. At that point our scale was much smaller than it is today: we had a single scale unit of TFS deployed in the US, with a bunch of build machines hung off of it.

About seven months into my tenure with the service, we had our Sprint 45 deployment, and it really destabilized the service. It introduced a lot of performance issues and errors, and there was a complex set of stuff we had to troubleshoot. Reflecting back, it was quite a challenging time for us. We didn't have the visibility we needed to really understand what was going on in the production system, and that slowed down how we isolated these issues and eliminated them from production. I spent a good ten days just sitting at my desk: come in in the morning, troubleshoot, collect a lot of telemetry, try to understand what was going on, go home at ten or eleven at night, and cycle over again. The whole team was going through this; we were struggling to understand these issues.

One of the things that strikes me when I reflect back, and contrast it with what it would be like today if we had the same kind of instability, is how long it took us to identify the problem. It just kills me. It was basically a disk latency issue. We deploy in Azure on VMs, and back then the underlying disks were thin provisioned: as you wrote data to them they would expand, and the disk latency during that expansion would get quite extreme. If you think about performance engineering and read any white papers, it varies, but around 50 milliseconds is the standard latency where you should be concerned. To put that in context, the old floppy drives ran at about 300 milliseconds of latency.
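To make that concrete: with the telemetry we have today, spotting this would be one short query. Here's a minimal sketch in the Kusto query language (the log analytics tooling I'll talk about later), assuming OS perf counters land in a Log Analytics-style Perf table; the table, counter name, and threshold here are illustrative of the idea, not our exact setup:

```kusto
// Minimal sketch: surface machines whose disk latency runs past the
// ~50 ms concern threshold, in 5-minute windows.
Perf
| where CounterName == "Avg. Disk sec/Transfer"
| extend LatencyMs = CounterValue * 1000
| summarize P95LatencyMs = percentile(LatencyMs, 95)
    by Computer, bin(TimeGenerated, 5m)
| where P95LatencyMs > 50
| order by P95LatencyMs desc
```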
What we were seeing was one thousand, two thousand milliseconds of latency, and it was causing issues in .NET, affecting our code, and impacting our customers. Disk latency is something that's pretty easy to spot if you have the right telemetry and you're looking at it, but back then we had some gaps in our monitoring and telemetry. So we were jumping on the machines in production, RDPing into them, setting up batch files to collect perf counters, setting up ping matrices that dump out to a text file, bringing all that back, loading it up into our SQL system, and trying to correlate all these trends: when were we having impact, and what was causing the issue? If we had just had that disk latency telemetry and been able to see it, we could have cut days off that specific issue, and there were a bunch of issues across the environment. That really brought home to me how important it is to have a really rich set of telemetry, curated visibility into it, and alerts that let you know in an automated fashion when things aren't healthy.

Another thing I experienced, still being new to the group: historically I come from an IT ops background, where you're on-prem, the devs are off in their tower writing code, they pitch it over the fence to you, and you go try to run it. The devs aren't really involved in the production live site. In this group it was quite different. This is a group of people building the product who are very personally committed to it. It's online, you have that direct feedback with the customers, and there are no walls between you and the experience they're having in the system. People really take pride in the service and care about their customers. So when we had this issue, it wasn't just me working those long hours; it was the whole group. Developers, even leadership, all rallied around this live site issue, and we eventually worked through it.

In retrospect, we should have had better telemetry; that would have really helped us compress the time to understand and mitigate. Also, when we first realized this issue, it spun up a lot of energy and people working on it, but we didn't quite know how to coordinate. In a situation like that you want to be super efficient in how you delegate tasks, coordinate the investigations, and plan the mitigations. We slowly figured out how to coordinate as a group, but there were inefficiencies in how we responded, in our process, I guess you could say. But the thing I alluded to that really let me know I was in the right group was this extreme focus on live site. There's the term "live site first," which a lot of online services adopt as a mantra, and I really saw it in this group. It was a good thing to see.
So, I'm Tom Moore. I'm the group manager for the site reliability engineering team for VSTS, and today I'm going to go through live site, monitoring, and telemetry.

One of the things you need for a really healthy live site is for your organization to all march in the same direction. You want a common set of beliefs about how you support live site in production. For us, I alluded to that live-site-first culture; you'll see the term up on the board. It's one of the key tenets that permeates the group, and it really comes from Brian. He's very involved in live site. It's not just features; it's the quality of the service in production.

Another thing that's very important and influences our organization is the term "feel the pain." With DevOps it's different from that IT ops model where the devs throw the code over the fence: the devs who build the code support it in production. If there are resiliency issues or gaps in telemetry and an incident happens, the devs are the ones who are going to get called, and they're going to get woken up in the middle of the night. That's a very direct loop for the feedback you need to make your app more resilient. And if it's a false alert, they own their alerts; they're choosing to call themselves in the middle of the night, so they're going to fix that quite fast. In fact, I supported a bunch of different services within Microsoft, and on one of them we had a tiered model where ops still abstracted the dev team. There were three alerts that just drove me up the wall. They kept paging my team and waking us up, there was no real action to take, and I would beg the devs to fix them, trying everything to get that alerting fixed. Finally we got to where we'd built a health model with all this correlation and could route alerts directly to the devs, which meant I didn't need my team in the middle. As soon as we routed those alerts to the devs, all three of those monitors got fixed, I think within two weeks, maybe one. It was just boom. I knew about DevOps and the power of having people support their code in production, but that really drove home the importance of having the people who own the code be accountable for it: they get that direct feedback loop and naturally improve it. So "feel the pain" is one of the principles that guides our group and how we structure.

Another one is "drive with data." I talked about this Sprint 45 experience; even when I hear the words "Sprint 45" it still makes me cringe. Back then there was a lot of darkness in our system. After that we really built up our telemetry and our alerts and turned the lights on in production. And as I'll talk about later, when you start to bring back too much telemetry and fire off too many alerts, it can actually start to overwhelm you.
So there's a pendulum you go through. But having this data offline, in a central location you can query, ask any question, and get it answered fast, is absolutely essential for live site, and we'll talk to that quite a bit.

Another thing for us, and this is another one that comes down from Brian, is that root cause is key. When you have an impacting incident, your natural inclination is "let's get this fixed as fast as we can, go reboot everything." Sometimes that will fix the issue, but you won't have captured the state to really understand technically what happened, and that thing is most likely going to come back and keep causing impact. So as a principle in our org, we all know that getting to root cause is key, and we'll actually take a little bit of extra time to mitigate something because we want to capture that state. We want to reduce time to mitigate, but it's very important to eliminate these issues from production; we've got to get to root cause.

Another one I'll skip down to is detecting incidents before customers do. It's really embarrassing when you're running a service, you're not aware there's an issue, and your customers escalate it to you. Especially for my org, site reliability engineering: we're really focused on live site, that's central to us, and one of the measures of our success is knowing when things go wrong. So investing in very precise alerts that let you know when the customer experience is degrading is key, and it helps build trust with the customers. If they have to tell you that the service is degraded, they're wondering whether you have your eyes on the ball.

We could probably spend all day talking about these, but one more that's very important to us is automation. I talked about when we first went live: we had one scale unit, and I knew all the names of the build agents; there was a handful of them. At that scale you can manage with some manual debt in the system. Passwords, secrets, and expiring objects are a debt I've seen at a lot of services: they go live, they want to get their features launched, and they don't automate secrets or other aspects of the service. With secrets and expiring objects you have to rotate them all the time. If you have a small scale you can do that, but as your scale grows, as your business grows, if you are being successful, that success and scale growth is actually going to kill you. We've gone through that. We went live with two services, hosted build and TFS; now we have, I think, 31 services, and I think there are 125, 127 scale units around the world. If you just think about our SSL certs, there are thousands of them all over the place. Because we didn't invest in automating secrets, my team would sometimes spend weeks rotating these things. So any little flaw you've got in terms of operational debt and manual tasks, you have to really watch, because as you grow to scale it can overwhelm you.
The final thing I'll talk about, one of the key tenets for our org, is really being open and learning from mistakes. We're all engineers, and we know you're going to have bugs from time to time; incidents happen. You want to be ready and you want to avoid them, but when they do happen, once they're over, really take the time to reflect on what went wrong. There are the technical aspects, but there's also how you responded: did you detect it, did you find the issue quickly? Really try to pull all the learnings out of it, open up work items, and improve your service and your process. That's key to growing and getting better over time. And it ties into customer trust: customers need to trust that you're committed to this service they're taking a bet on, and being able to share your understanding of the root cause and how you're getting better is something that's very important to us.

I think I just heard it in Brian's talk: he mentioned people, process, and technology. So I've talked about the people and how we've got these beliefs that align us to live site. But there's also our incident process framework, which ties together all the people and the tools we have to make us really efficient when there are issues: efficient at detecting them, routing them to the right teams, mitigating them, and then learning from them. Like I mentioned with Sprint 45, you had a bunch of dedicated folks with a lot of skills and energy and that live site focus, but we weren't super structured in delegating out the tasks and organizing how we dealt with the issue. Over time we've built up this process flow that has really helped us be super consistent in how we respond to incidents, and we'll go through aspects of it later in the presentation.

So why is live site important? These headlines illustrate it, and it's somewhat self-apparent. With the old on-prem infrastructure, you're running your own thing, and if it goes down it's not super exposed out on the internet. With online services, customers have a ton of choices. They choose to move from on-prem up to the cloud, and they can move across service providers fairly easily. So being able to trust that a service provider has quality and resiliency, and isn't having performance issues and errors all the time, is a big thing for customers. When things happen, it's very visible, and that can tarnish your reputation, impact customers' trust in your service, and impact your business.

All week we've been meeting and talking about the features and the great things with ALM, and that's a big part of it. But ensuring that this online service is up, high quality, and resilient is just as important as the features.
If folks can't access the features, they're not working, and it's not a good service for them; you're not going to win customers. Any questions at this point? Let me deal with the clicker here.

So yeah, trust is a big thing for us, and this is another thing that comes from Brian. He's very transparent on his blog, talking about when there are issues and making sure our customers understand where we're at with live site, and that permeates down throughout the org. There are two flavors of live site communication we do that are in part aimed at building trust with customers.

First, there's incident awareness. We know issues will happen; we try to avoid them, but when they do happen, as a customer you're wondering: is this me, or is it the service? Is somebody aware of this? Like a lot of folks, I watch Netflix, and I've got a wireless network that sometimes is not the most stable. Friday night I'll be trying to fire up a movie, and at times it's erroring out, and I'm wondering: is it my network or is it the service? When they pop up a message and I know it's the service, I'm like, okay, cool, they're aware of it, they'll fix it. So letting customers know as soon as you can when there's an issue builds trust, and it's a better experience for that customer.

Then there are the postmortems. After you get out of the incident, we really work to pull out all the learnings I talked about, follow up on them, and show customers that you're dedicated to understanding these issues and improving over time. That's another thing we've found customers really respond to and value. So, don't try to read this slide, please, but these are recent postmortems we've written for different major outages. These originally started on Brian Harry's blog.
We'd have an incident, and Brian would write a very thorough postmortem on it and explain what happened. Over time, as we matured, we couldn't rely on just Brian to do this, so the level-two managers all decided: hey, when there's an outage, let's do this postmortem and post it out to the customers. We set a goal of doing it within three business days. Sometimes it's hard to gather all the data needed for a meaningful postmortem, but that's what we strive for, and for every outage we're committed, and we post these out on our live site blog. Again, it's really meant to show we understood the issue: here's what went wrong technically, here's what went wrong with our response and how we're going to reduce the time to respond in the future, and the things we commit to in order to make the service better. If you look at Brian's blog, his postmortems draw a lot of comments from customers, and the overall theme you see is that customers really appreciate this. They appreciate the transparency. Nobody wants incidents to happen, but knowing that we're committed to learning and being open about it is something customers value. Another interesting thing: in these postmortems we're dissecting what happened, and there are lessons in there that customers also value, as we hear during customer visits and can see up on the blog, because they can learn from our lessons and hopefully avoid the same issues themselves.

So communications is important, and time to notify is important: communicating quickly. And I mentioned automation is important. If I look back through our journey, we've always tried to communicate with customers. Back in 2013 we weren't as mature as we are now. We had an encrypted spreadsheet that held all of our logins: the login for our blog, the login to set service status, all of that. An incident would happen, you're firing up the incident bridge, and you're trying to get these comms out as fast as you can, but it was very manual, clunky, and frustrating, to be honest. It would sometimes take us 45 minutes to get communications out to the service status page, and a lot of times the incident would already be mitigated. So we weren't doing very well back then, though the intent was good.

In 2016 we developed a tool called Mission Control. It's written by our SRE team, and it's outside the service intentionally, so that we can automate our operational processes. One of the things it did for us was help streamline our comms. It's an MVC website with a little workflow that we go through. We took all those passwords and connection strings out of the spreadsheet and built web pages where we could enter the blog data and whatnot and post it out through the tool, and that made us faster. Our time to notify got down to about 30 minutes on average. But it was still slow.
You're duplicating a lot of data, and it's not super efficient. So most recently we did what we call one-touch comms, which is really about automating this workflow. We built a communication tool that blasts out to all these channels and pulls in all the relevant data from our incident system, which we call IcM. IcM is the ticketing system where we create a ticket for each incident with data like when it started, what feature was impacted, what scale unit, and what the impact statement is. So now we're able to press a button and blast out our service status, and our public blog if it's impacting public scale units. And we've got a lot of internal customers, Windows, Office, and so on, who all want to be communicated with in different ways, so it sends a bunch of internal emails for that. We're looking at ways to evolve this, but we feel better at this point: we're able to get our time to notify down to 15 minutes, sometimes faster. It may not be a rich communication that goes out; it's something like "hey, we're aware of this, here's how it might match the symptoms you're seeing, we'll follow up soon," but we're able to do it quickly. So this is in the spirit of automating.

Going back to that Sprint 45 experience, we had some monitoring out there. We had SCOM, which is System Center, an on-prem monitoring solution we'd wired up to the service, and it gave us some level of visibility. We had a lot of "under the desk" monitoring, running on a box under the desk, scripts and whatnot, that wasn't super mature but was an attempt to plug some gaps. We had some outside-in solutions that would ping the service and let us know if the front door was open. Since then we've really ramped up our telemetry, and we realized that you've got to bring back all the data, all the time. So for live site and for our business we're collecting all kinds of data types from across the service and flowing it all back into a central solution we'll talk about, which we call Kusto. We now have seven terabytes of data a day; if you look back to 2015, I think we were at 60 gigs a day. The scale of our service is growing, the scope of the telemetry is growing, and we've got a lot of data. That's good, but it also creates challenges. Do you have a question?

"What kind of skills did your team have to build to actually make sense of this data?"

Well, it's going to vary. For live site, I think one way to approach it is this: it's a lot of data, and before you get to the skills, you have to have a mental framework for what you're trying to do with all of it. From a live site perspective, it's really about customer experience: is the customer happy with the service?
That's the basic question. Once you know that some folks are unhappy, you want to isolate the source of that unhappiness: traverse through this data and figure out where the slowdowns or the errors are occurring. Then, once you've found that within the system, you want to drill deep and understand root cause as best you can, to mitigate and ultimately fix. So one skill that's needed is having that end-to-end conceptual model you build up of the architecture and the stack, and understanding how the telemetry maps to it.

Then, tactically, when we have an incident, we've got a tool we call Kusto, which I'll show; Log Analytics is the public offering. It's powerful: a SQL-like tool that you can stream everything into, and you write queries that join all this data, and it's like magic, the answers come back very quickly. So being able to write those queries is important, and we do that. As Bill calls it, Kusto is our language internally; we answer all of our questions through Kusto, business questions and live site questions. So that's the most tactical skill you need, the ability to query this stuff. But when you're in a live site incident, do you really want to be crafting up a bunch of queries? You really want alerts over the data that proactively analyze it, run these queries, and tell you what's wrong, or some curated views, health models and whatnot. So there are also some development skills you need, to create dashboards and the like over the top. Does that answer your question at a high level?

"Yeah, and do you have a slide that actually shows the kind of tooling you have in place that helps you do all these things you're talking about?"

In the appendix I've got an architectural slide, but to be honest, the key thing we're going to focus on today is more how you think about it. Okay, yeah, that's a good call. We'll show and tell.

So we're not going to drain all these telemetry types; I'll talk through them quickly within the context of how we understand customer impact. That is central to live site and central to how we think about our telemetry for live site, and it's what we call these activity logs. Once a customer request, or any request, gets up into our stack, any REST API or web page, we capture it, every single request, and decorate it with a bunch of other data. It lets us know, for each individual command: was it successful from a user perspective? Was it fast enough? Did it fail? That gives us a tracer that follows each user request through the system.
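To give a feel for what that enables, here's a sketch of the kind of Kusto query we run over activity logs; the ActivityLog table and its columns here are illustrative, not our actual schema:

```kusto
// Illustrative activity-log query: failed and slow command rates per
// feature and scale unit over the last hour.
ActivityLog
| where TimeGenerated > ago(1h)
| summarize
    Total = count(),
    Failed = countif(Status == "Failed"),
    Slow = countif(DurationMs > 10000)   // 10 s default performance goal
    by Feature, ScaleUnit
| extend FailedPct = 100.0 * Failed / Total, SlowPct = 100.0 * Slow / Total
| order by FailedPct desc
```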
It's like a dragnet: we then start to harvest all this other data, like traces. For us, a trace is where, as a developer, something significant happens and you raise a trace and attach it into the context of the activity ID, which identifies the user request. We've got a bunch of counters we collect from the OS, IIS, and ASP.NET, we write our own counters, and we flow those back into the central system. We've got customer intelligence data; I was dry-running this presentation with my family, and one of my kids said, "How do you know how smart your customers are?" We can't really measure that, but customer intelligence data is measuring the experience on the client side, in JavaScript. We've got synthetics; that's outside-in, and that's where a lot of folks start with their monitoring, pinging the service from Gomez or App Insights GSM. And then there's platform and network telemetry. Coming from on-prem, everything is packed into one box or one rack, and you don't have the number of dependencies and the surface area you've got in the cloud. In the cloud we've got these very distributed systems with a lot of dependencies; the network connects everything, and there are load balancers everywhere. So we pull all that telemetry out of the platform; in Azure, all the different platform services have really rich management APIs and telemetry you can pull out. We harvest all of it and, again, throw it back into this log analytics solution so we can join it into the overall customer flow and understand where things are having issues.

So, a game changer for us is something we call Kusto. For the public offering, if you go up to Application Insights or Azure monitoring, there are log analytics solutions; it's all the same technology on the back end. Arthur C. Clarke has a quote I really like: any sufficiently advanced technology is indistinguishable from magic. Kusto is magic. It's this big log analytics system; you bring back all this data, I can write these queries, and it answers them in seconds, joining all kinds of data. The same question on our old on-prem infrastructure, which was based on SQL and whatnot, I'd run a query, come back an hour later, and it would still be running. With Kusto it's really like magic; we can answer these questions. I'd encourage folks to check out Log Analytics. It's really powerful. I've been late to more meetings than ever, and I blame Kusto.

So, that conceptual model: how do you understand what's going on in live site? It really pivots around the customer experience.
That's what we're trying to manage, and these activity logs are the key for us. The user request comes in the front door and starts up the stack. We run on Windows, so it comes in through HTTP.sys and up into our w3wp process, and once it's in IIS we've got context on that user request. Very similar to the IIS logs, we've effectively got a row for every request that flows through the system, but then we decorate it with a lot more data: we've got the user context, we know where it is within our deployment, the scale unit, even the feature they're calling and where it is in code.

We've got a very distributed architecture, so a lot of times the request will flow across service tiers. We've got this activity ID; that's how we track the request. When it passes across service tiers, we take that activity ID and pass it in the header, it comes up the stack on the downstream service, and we log another activity ID there. Because we've got a correlation across these, and all this data is in Kusto, we can trace every request across the system. That's powerful; joining this data and tracing requests throughout the system is key.

Then we've got dependencies, and for us it could be anything: storage subsystems you call out to, queuing systems. We use a lot of SQL Azure, and at times we have challenges where SQL Azure drives too much load or something is wrong in the query plans, and we need to know when it's slowing down. So within that activity ID transiting through the system, when we call out from our code into SQL, we log the duration for that call. If the command is slow, we can see that the bulk of it went to SQL and realize that's the source of the issue.
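As a sketch of what that cross-tier correlation looks like in a query, assuming each tier's activity log records the propagated parent activity ID (all table and column names are illustrative):

```kusto
// Illustrative cross-tier join: for slow front-end requests, pull the
// downstream calls that carried the same propagated activity ID.
FrontEndActivityLog
| where TimeGenerated > ago(15m) and DurationMs > 10000
| join kind=inner (
    DownstreamActivityLog
    | project ParentActivityId, DownstreamCommand = Command,
              DownstreamDurationMs = DurationMs
) on $left.ActivityId == $right.ParentActivityId
| project ActivityId, Command, DurationMs,
          DownstreamCommand, DownstreamDurationMs
| order by DurationMs desc
```

A breakdown like this is what lets you say the time went to the remote call rather than your own tier.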
Once you've isolated where the impact is coming from, slowness or errors, you're going to want to dive deep and really understand specifically what's causing it. As I mentioned, Azure, AWS, all these platforms have a bunch of telemetry you can harvest, so within our framework we're constantly pulling back all this data and stuffing it into Kusto, with ways to correlate a request or the calls from the web tier down to SQL, and dive deep. And as a dev you're putting in trace statements all over the place, and those are very meaningful. If we have an exception, we roll it up into a trace statement: we capture the stack, we know the thread ID, the process it was running in, a bunch of rich data that helps us investigate, and we attach it into the context of the activity ID. The activity ID is the user request, and now we can attach very meaningful information to it with our traces.

This is powerful. I've been on systems where you don't have this type of, I call it, real user monitoring, and you don't have all the telemetry for the platform. Think back to Sprint 45: we didn't have these types of views or the ability to join this stuff. Having it doesn't make it easy to understand what's going on, but it makes it easier than running in the dark.

So let's take an example. We've got TFS on our front end; that's the service a lot of interactive users are coming into, maybe opening a work item or queuing a build, whatever it is. They come in the front end of TFS, we start an activity ID, and we can measure the end-to-end response time for that request. It may transit down to SQL Azure, or over to SPS, our identity service, and SPS may have its own dependencies, calling out to storage or SQL Azure. By passing these keys and tracking the overall time, we can really decompose it. In this case the user request took 12-point-something seconds, and if we query Kusto we can see it wasn't TFS calling down into SQL Azure; it was actually the remote call to SPS, and because we passed the key we can correlate it. Then we can traverse in our Kusto queries, look at SPS, and see that it was actually the database that was the issue. This is a very common pattern for us in live site.

Then, like I said, there's very rich platform telemetry, and we harvest it. Specifically for SQL, there are database-layer metrics you can look at: dynamic management views (DMVs) that show you the overall state of the database over time, whether we're using too much CPU, too much memory, etc. That gives you an aggregate view of whether the database is healthy, and because we know what database we're calling into for this activity ID, we know how to join down into it. Then there's QDS, the query data store; that's more at the object layer. We've got a bunch of SQL procedures, they've got query plans, and things can go wrong and affect performance and user experience, so we collect all that data and can look at our sprocs in aggregate and see how healthy they are. At the lowest layer, and this is somewhat new for us, there are XEvents. If you've worked on SQL you know about SQL Profiler: you can fire up traces and see every single statement that goes through the SQL stack. When we connect to SQL, in the context object we take that activity ID, the key for the user experience, and pass it down into XEvents, so that when we eventually extract that data and stuff it back into log analytics, we can join it back to the activity ID and see the actual commands that were run.
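Assuming the extracted XEvents data lands in a table keyed by that propagated activity ID (names illustrative), the final join is a short query:

```kusto
// Illustrative sketch: for one slow user request, break its database time
// down by procedure using XEvents rows tagged with the activity ID.
SqlXEvents
| where ActivityId == "<activity-id-of-the-slow-request>"
| summarize Calls = count(), TotalMs = sum(DurationMs) by ProcedureName
| order by TotalMs desc
```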
I mocked this up; it isn't a real thing, but it was a real query. In this case you could see that QueueBuild was the SQL procedure call that contributed most of the time to this slow command, so if you're investigating, that's the sproc you're interested in; the other ones look pretty good. So yeah, this is a really powerful pattern for us that we built up over time, and again, it's the pattern that really matters. We've got dependencies on storage, Redis, a bunch of different stuff, and we don't have all the data yet, but we've been working to accumulate it so we can do the same type of traversal queries. Any questions before I move off activity logs?

"So, capturing all this data, did you experience any impact on performance at all?"

Can you repeat that? I think you need to turn your mic on so folks can hear it.

"Did you experience any impact on performance?"

Oh, from collecting this data. I would say in general, no. This is built into the framework; Buck gave a talk about the framework. It's a wonderful thing. On overall impact, Brian's right, there is impact, but the agents are doing a lot of buffering locally, so it does a smart job of pushing the telemetry: it flows off the box through a local cache and into blob storage, and that helps. But if we're not careful we can certainly put the service on the floor by turning every trace statement on. That's true. Over time, though, the telemetry channel has gotten pretty wide; right now we don't really have a serious problem with perf, but we do have to think about it.

To repeat the comment: does collecting all this telemetry impact performance? One thing we do is automatically turn off traces after a certain period of time, because we found, before we did that, the traces would accumulate and accumulate and nobody was really paying attention. And we have had cases where somebody turned on too much tracing.

"So how long are we talking about here? A couple of weeks after something is pushed, you turn off the low-level telemetry? How often?"

I think it's a week; I'm not sure exactly. And it's worth clarifying that trace statements are on demand: when you're trying to investigate a specific issue, you turn them on. Some of the other telemetry is constant; we're constantly collecting CI, customer intelligence, data, and constantly collecting activity IDs. So it depends on what you're talking about. Trace statements end up collecting a lot of data, so you do that on demand, and then, as I said, we turn them off. Yeah, there's one more... oh, I'm sorry, go ahead.
Oh, I'm sorry Yeah, one more important point which is that of the three sources that Tom was showing for sequel The the QDS and the DMV are actually out of bay and they're not in the context of that of that specific Request, so there's a job that's that's collecting that data It's still very valuable from the macro level to get you know insight into the health But not everything is tied to the the actual, you know request itself We do go as deep as we can but that's generally where we've got dials on on the depth of information another anecdote there we Do rate limiting with the resource limiting that buck talked about yesterday to do that We got to collect a lot of data about what you're doing, which you can imagine has some overhead And we've had to be really careful about that. So when we initially try to do you know high-fidelity sequel tracing it had two big impacts So we're you know really careful about measuring that and then recently went to Ex-events to get this data and that's had like a less than five percent overhead To collect the data, but it's been super powerful because we can tell you know for Up to the individual user what exactly you're doing to the database. So it's it's a powerful thing Go ahead. Yep Have you hit any limitations with Cousteau in terms of you know sending metrics across to it? You know are you pushing in directly like are you talking about the ability to scale? I guess so yeah, you know, it's we haven't had any issues and I'll get issues with things getting jammed up on the on the collection side on the well That's that's not you're talking to agents on the computer. It's related to the pipeline There are teams that are yes, thank you. So yeah, the we as team put it. Yeah, that's what I was gonna say Look, we don't have problems with scale of Cousteau. There are teams 10x bigger than us using Cousteau and not having problems Yes, there's some limit You could have enough data to overwhelm it but Cousteau also would allow you to partition that data into different databases So Cousteau scales incredibly well, there's cost to it, but it scales incredibly well Yeah, I guess one thing to think of you know in terms of impact there's performance overhead You know when you're collecting all this stuff, but the way I think about it is You know you're building that into your performance model and you're investing in this telemetry and so it's there's value to it You know we do have issues if we flood too many traces and whatnot, but um, you know It's something we really factor in to you know our scale Okay, there was a question over there. Sorry. I missed it. Oh go ahead Is activity ID the only correlation type of ID that you flow through the system or are there other? Correlation ID type of that's the the primary you know ID we use and we got different kind of flavors of it But yeah, it's activity ID, you know bill kind of mentioned You know we can loosely join other stuff like when we call in to the DMV views that kind of database view or the QDS we know the database and we know the time range. It's not a direct correlation for that activity But we can at least associate the data, you know through those those those parameters. Thanks Yes, I wanted to know I saw in I think it was bucks talk yesterday had some App insights he was using the custom telemetry to actually send up the telemetry So I'm kind of trying to figure out what's what custom telemetry. Do you guys typically send? 
in order to build all of this up? Because you can right-click and add App Insights, and it gives you some of the stuff, correlation IDs and so on, but then there's also the custom telemetry."

Yeah. So with Application Insights, which we don't use, it's a very similar capability. With Application Insights you can take a reference on their library and you get a bunch of stuff for free, but then you can go ahead and add your own custom events and metrics and whatnot. The analogy for us: if you think about the metrics we collect, we harvest all the perf counters off the OS and IIS, then we write our own metrics up in code for things we want to understand, exception rates, function calls. But the activity IDs and the tracing, all that type of stuff, we're rolling on our own; we write it in code, and we came up with our own framework for it. But it's very similar to Application Insights. Oh, sorry, go ahead.

"Sorry if you answered this, but I don't think I fully understood: what was the reason you didn't use Application Insights?"

Bill, do you want to field that one? Yeah, well, one is that it wasn't around when we first started doing the instrumentation; we have looked at the agent, but one issue was the timing. And two, we're running at large scale, and we'd basically grown beyond the capabilities that were present when we needed to invest, so we just took two different paths. That's right. We would use App Insights again; it's more of a timing thing than anything else. Okay.

So, live site: it's about the customers. We want to make sure they're happy with our features and that live site is running well, and we've got this very rich real user monitoring that we implement through these activity IDs. We harvest that data and use it in many different ways to ensure we're focused on live site. It's very common for a service to have an availability model, and for us availability is the overall key KPI for service health. We look at it monthly, we look at it weekly, we look at it daily; it lets us know if we're having any big regressions in customer experience. But that's an aggregate view. Then we've got the same data per customer, and we can aggregate any customer's availability using all these activity IDs.
We can look at them monthly, weekly, down to the hour, and at times we'll peek into those individual customers to make sure that for the ones missing goal, we understand why, and we're not losing that in the aggregate.

Then there's real-time live site: it's happening right now. You want to understand when the service isn't healthy and fire up this whole incident process we'll talk about. There are many ways to detect in real time whether customers are unhappy, but the most direct way is to use their direct experience, which is our activity IDs. This is somewhat of a new thing for us, but we're starting to alert on availability within a five-minute grain, and that lets us precisely detect when there are issues in live site and fire up our incident process to mitigate the problem.

Getting the availability model right didn't just pop out on day one; it was a journey for us, and I think a lot of services go through a very similar evolution. We started off with Gomez as our outside-in: it's got synthetic users placed around the world, hitting the login page and maybe opening a work item. But very soon you realize that doesn't scale: as the number of web pages, scenarios, and flows in the system expands, you just can't keep up with re-recording these web tests; it's a lot of work. There's some value outside-in, to make sure the front door is open, but it's not what we wanted to use to measure customer experience in an accurate way.

So then, since we had all these activity IDs, we said, okay, cool, let's aggregate these up: take all the commands in the system, look at the ones that failed or were slow, look at the total commands, and compute a percentage. That worked at first, but you don't really have the per-customer experience; you really want to think about the customer. You might have an automation system generating a bunch of light commands, and as that command volume grows, it can wash out issues. It doesn't really represent customer impact.

So we switched to the model we're using right now, which looks at each individual user or account experience over time: in different quantums, are they having failed or slow commands? Then we look at the number of accounts that had failed or slow commands against total active accounts, and we get a more empathetic user model. If you look back to around 2013, when we were comparing the command model and the real user model, you can see the Phase 2 line, where we had a pretty severe outage. From a command perspective, most of the commands were succeeding and you don't see the impact; it's a flat line. But with the real user experience model you can see it really drops out, and that matched the experience our customers were having. That's been a big evolution for us, and I haven't seen a lot of services adopt this.
It's really a model that I think merits looking at. So yeah, we learned through this experience, and we're pretty happy with the real user experience monitoring.

Brian's got a blog post about our customers as a bag of sand, and that term had been floating around for a long time; to be honest, I didn't know what it meant for a while. But then I went through and read the blog, and really it's saying that if you're looking at things in aggregate, the overall service availability, you can lose track of individual customers who fall through the cracks. So in our live site reviews, and when our DRIs, our engineers on call, are doing proactive investigations, they'll peek down into that bag of sand and look for the users that are impacted individually. Those users typically aren't affecting the overall service availability, but we want to understand them: (a) we want them to have a good experience, and (b) sometimes those investigations show us something in the service we need to fix. There was one example where we had some accounts with a bunch of team projects, over a hundred. When users on those accounts would come in, we would do so many auth and identity calls that it started overwhelming the database; I think there were something like 1,400 proc calls every time a user visited the site. Not many customers had a hundred projects, but when we looked at those accounts we saw this, refactored the code, got it down to a single call, and reduced the latency from something like 400 milliseconds down to 50 milliseconds for these customers. That helped those individual customers, but as our service grows, as people adopt it and the data shapes get bigger and bigger, it probably also saved us some incidents that would have occurred for other users in the future.

So this is really looking at customers and which buckets they fall into individually. Most customers, when you look at their individual availability, are above goal, which for us is three nines. Then we've got the second bucket, customers who miss goal; it's not a horrible miss, but it's concerning. And then there are the folks falling well below goal. Bucketizing the user experiences lets us find the users who are falling through the cracks.

Then I talked about the third way we use this real user monitoring, our activity logs, and that's alerting in real time. Right now we've got monitors running in the system, and for these alerts we actually work at a per-user level. We look at all the active users in the system: anyone running some type of command, we consider an active user. Then we look at all the commands they ran and ask: were any of these commands too slow? That's frustrating for users; we consider it a bad outcome for that user. Or did a command fail? That's obviously a bad outcome too. Then we take all the active users as the denominator and the users that had a bad experience as the numerator, and that gives us percent of impact in real time, looking back in five-minute quantums. We can do counts or percentages, and that lets us set up alerts for when we go out of goal.
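Here's a sketch of that model as a query, again with illustrative table and column names rather than our actual schema: per five-minute window, count users with at least one bad command against all active users.

```kusto
// Illustrative per-user impact model: fraction of active users with a
// slow or failed command in each 5-minute window.
ActivityLog
| where TimeGenerated > ago(1h)
| summarize BadCommands = countif(Status == "Failed" or DurationMs > 10000)
    by UserId, Window = bin(TimeGenerated, 5m)
| summarize
    ActiveUsers = dcount(UserId),
    ImpactedUsers = dcountif(UserId, BadCommands > 0)
    by Window
| extend ImpactPct = 100.0 * ImpactedUsers / ActiveUsers
```

An alert then fires when the impacted-user count sustains above a threshold, which is the 50-user signal on the graph coming up.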
You can see it in this graph, which I think is from about a week ago. We had an incident where the impacted users climbed above 50 in one of our scale units. There's always a little bit of noise; it's a pretty aggressive availability model, but at this point we know something is wrong in the system, and we can fire an alert to the SRE team, a very generalized one, fire up a bridge, and try to figure out what's wrong. We'll talk about alerting; this has been a very precise alert for us. A lot of times you create a bunch of alerts and not all of them are actionable, and you may not know the impact on the system, so this is something we're investing in heavily.

So this gets into... let's go back to Sprint 45. Oh, yeah, go ahead.

"For the SLAs, what are the kinds of SLAs that are being monitored?"

Well, I wouldn't call these SLAs; SLAs are typically public-facing, so I apologize if I said that. These are our internal goals. This whole model is very aggressive; it's not something you would use for a public SLA. It's really about holding us accountable and giving us a signal into user experience, and it triggers on a single failed command or a single slow command, so it's hypersensitive. We call it a service level objective, and internally, right now, where we're at is this: on any scale unit, for TFS or Release Management, whatever it is, if we get above 50 users impacted under this very aggressive model, and it sustains, we'll fire up an alert. Go ahead.

"Yeah, so are you going to talk about a scenario where you saw something like that and it was alarming, and what you've done from that point on to actually address it? Would you have to involve other teams?"

Yeah, we'll talk through our incident process in a couple of slides. Remember I talked early on about the signals that let us know, hey, there's something potentially going on in live site; we route it to somebody, they validate that there's customer impact, and then we fire up this whole incident process that we'll go through.

In the old model, a lot of times you'd have people watching dashboards. You have a NOC or a service desk, somebody who watches these trends, right?
And they're watching for impact. That's better than not watching, but ideally you want alerts, and we distinguish what we call pages: there are alerts that might go to email, but when you page somebody, you actually wake them up in the middle of the night. That's the highest level of alert. If you're a DevOps shop, you want to wake people up when there's an issue and have them get on it. But when you wake people up in the middle of the night on a false alert, that makes folks unhappy. So it's really important to have precise monitoring that's actionable, so that when I get called, it's meaningful: I need to get online and do something. That's one side of the balance: don't wake up your devs unless you really need to. On the other side, I talked about one of our key beliefs being that we always want to detect issues before customers do. For that, you naturally want to write a bunch of alerts. We'd go through all these incidents, realize we had a monitoring gap, and say, hey, next time this gear gets hot, let's create an alert and fire it off. So you start firing up all these alerts, and you create a lot of noise. It's good intent, you're trying to capture customer issues, but you have to be very thoughtful about how you balance your alert volumes and make sure they're actionable.

If you look at our journey: 2013 is when we had the Sprint 45 issue, and I don't have the data for that, but coming into 2014, these are our alert volumes, and you can see just in the volumes that we were investing in alerting, creating all kinds of alerts and really trying to light up production. That's good to a degree; that's good intent. Down below are the customer-impacting incidents: are we detecting them, and what's the rate per month? Our goal is 80, 90 percent detected via automation. You'll see that we were detecting a lot of incidents, but there's a hidden problem in that: it's a needle in a haystack up top. We were having up to 5,000 alerts a month. That's so many that you may have a good, meaningful alert in there, but there's a lot of noise in the system. Plus you're waking folks up and generating DevOps noise, and that's not good.
It's frustrating for your DevOps team. So then you naturally react to that and start driving down those alert volumes: you turn off alerts you think aren't actionable, a lot of job agent failures, you crank those down, and you loosen up your thresholds so they don't fire as aggressively. That can have the effect that you stop detecting customer-impacting events, and we actually saw that. You can see in our alert volumes that we invested heavily in cranking these things down. In part, we created new tooling that is stateful: instead of the monitor waking up every time and sending an email, we open a state object for that alert, and that helped reduce volumes. But you can see where we plateau out. One of the biggest things we did was DevOps feel-the-pain: we got to the point with our tooling where we could route these alerts to the devs who really owned them, who authored the alert, and all of a sudden it's that story I told you. When you start routing alerts to the dev who owns the code, if an alert is noisy, they're naturally going to fix it. So even though our scale was growing tremendously throughout this time, deploying scale units all around the world, our alert volumes have held steady.

But we really got into this percent-detection issue: we just kept missing customer-impacting issues, and our customers, internal and external, kept telling us about them. It was frustrating, and we tried a bunch of different ways to plug this gap. This is the balance as we were thinking about it before: you're balancing DevOps health against customer satisfaction. And we realized that's really not the way to think about it. There's this concept of precise alerting, where you're only detecting actionable things: either a customer-impacting event is coming and you're proactively alerting, or a customer-impacting event is happening and you're alerting. Those things are very actionable, and it's very valid to send a page to DevOps for them. So the trade-off framing isn't the way to think about it.

So what we've started experimenting with, I wouldn't say zeroed in on, for our web scenario, and there are a bunch of different scenarios in the product, is this: we have these activity IDs, we have this user experience impact model, and we've started alerting on that, because it's directly what you're trying to detect: customer impact.

"On the subject of customer impact, how do you determine what 'slow' is?"

So within our framework, in the command model, every command is well typed, and as part of that command you've got a performance goal. Our default, I believe, is ten seconds, and then you can override that, make it more aggressive or loosen it up. For some commands, like streaming a big file, that's going to take time, so we won't look at performance on those. So we've got a default of ten seconds, and then we adjust it per command, based on understanding what the user expects.
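Our goals live in code as part of the command's type, but to sketch the idea, here's how a per-command override with a ten-second default might look if you expressed it as a query-side lookup; everything here, table, commands, and goal values, is hypothetical:

```kusto
// Hypothetical per-command performance goals with a 10 s default.
let CommandGoals = datatable(Command: string, GoalMs: long) [
    "QueueBuild", 30000,    // looser goal for a heavyweight command
    "OpenWorkItem", 5000    // tighter goal for an interactive page
];
ActivityLog
| join kind=leftouter (CommandGoals) on Command
| extend EffectiveGoalMs = coalesce(GoalMs, 10000)
| extend IsSlow = DurationMs > EffectiveGoalMs
| summarize SlowPct = 100.0 * countif(IsSlow) / count() by Command
```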
These customer impact alerts: is the customer alerted at any point too?

Let me play back your question to make sure I understood: we've got this customer-based alerting; do we share it back with our customers, external or internal? Yeah. This is something we talk about a lot, and let me answer as precisely as I can. Internally, we've started to share this with Windows and other groups. It's a very aggressive model, so we're a bit hesitant to share it out, and we always say, hey, this is very aggressive, it's not like a public SLA. But we started sharing it with internal customers, and we really like that. Going back to transparency and trust with customers, we really would like to share this stuff externally too: communicate issues in a timely way, maybe have a robot do it, give customers these metrics so they can see their actual performance and issues. But (a) it's a little too aggressive, we think, and customers may not understand it, so we'd have to figure out how to frame it properly; and (b) we don't currently have a customer portal where we can stream metrics and expose this. It's something we talk about constantly and would like to approach at some point. Thank you.

So yeah, these customer impact alerts, looking at the end-to-end scenario and using it to alert, have really helped us be precise. We're still evaluating it, but it's promising. There are other scenarios in the system where we don't have the same end-to-end model; some of our orchestration, for releases and builds and such, has alerting over it but not the same kind of model. So I think we'll start looking at expanding the framework into those areas. Any questions about customer impact telemetry?

Once you get the alert, what corrective measures are taken? Can you give us some examples of how you respond?

Yeah, thank you. That's where we're headed, so let me answer it over the next couple of slides. First I want to go back to the old model. As a product group we've been writing TFS for a long time, selling it to customers who install it and run it. And internally, Microsoft is a software company, and we would deploy TFS on-prem; we still have it, and Windows, Office, different groups would use it for their ALM. The team that supported it was a traditional IT ops team. We used to have a group called MSIT, our internal IT group, separate from the product groups, providing services. So the TFS team (this is simplified) would write the code and pitch it over the fence to the MSIT team. They would then do what a lot of IT ops shops do: take the code and provision a bunch of infrastructure. They know how to set up the load balancers and get the certs and the storage and the compute. Then they craft up the config: enter all the endpoints they set up, put in the cert thumbprints, start twiddling the config file, web.config or whatever it is. And that's outside of source control. Then they deploy that out to the infrastructure they set up, writing scripts or doing it manually, with big Word docs.
I've seen 20-page Word docs on how to deploy stuff. They deploy it out, and things are going to be slightly wrong, configs drifting, but they get it up and going. Then you bolt on the monitoring: you set up System Center or some kind of monitoring that's outside the product, and you start writing rules to pull in perf counters from the OS or IIS (TFS has some counters too). When you have issues, you look at the event log and figure out the red X's, the bad events, and you set up alerts for those because they correlate to incidents. It's ops trying to take care of the system, understand it, and monitor it, while the dev teams focus on features. That's really how we used to organize, and there's a gap between the live site and the devs who write the code. They're not getting the feedback when things go wrong about how the code behaves in production, and they're not incented; they're not the ones getting woken up in the middle of the night, IT ops is. So they're not learning how to make the code more resilient, and not prioritizing that. They're not learning how to create great monitoring that's precise, measures customer experience, and lets you drill into issues quickly.

It just wasn't a model that was going to scale for us up in the cloud. With cloud (it's Azure for us; it could be AWS) the devs are unleashed. Setting up infrastructure on-prem, you have to know a bunch of process and people; it's not easy. In the cloud you just deploy it out and it all magically appears. The devs can deploy out and set up the network and the storage and all of that through their release, and that lets them get closer to production. So this old model isn't really needed anymore.

Over the last couple of sessions, people have talked about how we've progressed to combined engineering. Historically, thinking about TFS, we had the feature team devs, we had testers, and we had IT ops as a separate org. The devs started moving really fast and test couldn't keep up; then you start deploying this stuff up to Azure and ops can't keep up. So for a bunch of reasons we moved to combined engineering: we combined test and development into a feature team and folded in the concept of the operations team. You've still got two flavors within combined engineering: the feature team and the live site engineers. And these aren't necessarily different people; you're wearing a different hat at a different point in time. You're going to have feature devs that are on call (I think we have around 25-ish across all of our different services), and then you give them points in time where they're focused on live site. They're not writing features then; they cycle in and out. In part they're on call for any alerts or issues in production; we'll bring them into a bridge, and we'll talk about how we do that. They're also doing proactive investigations and working the repair items to improve monitoring and resiliency. So that's what we call live site engineers. And if you subdivide the live site engineers, there are two basic flavors.
We've got the feature teams' on-call folks, the LSEs, and then we've got the site reliability engineers, the SREs, and that's my group. Site reliability engineering is a newer concept for cloud that you're starting to see across the industry. For us, the SREs are really focused on the platform. All of our different services adopt Azure very consistently, and we've got a consistent framework. So the SRE team really understands the networking, understands how to pull all that platform telemetry out of SQL Azure and storage, understands how to troubleshoot and traverse down into platform issues, understands our whole monitoring flow, and is really focused on the platform. When issues occur with the platform or the network, you don't want to call a dev in the middle of the night and wake them up; at least we don't. We've got a small team, around 22 FTEs for VSTS in my group, staffed around the world, and they can respond to these platform alerts with the expertise and ownership to fix them. Then the feature-team live site engineers own their code. They're writing features that get deployed to production; at times they'll have bugs, at times they'll introduce performance issues, and those alerts get routed to them. They're going to get woken up and fix it, and now you have the whole feedback cycle where they learn to write precise monitoring and more resilient code. We feel these two roles really complement each other. Any questions?

Who actually writes the infrastructure as code and configuration as code and all those things? Is that the feature team or...?

So your question is who writes the platform infrastructure code. That's where we're a little different: a lot of SREs, when you talk to folks in the industry or go to a conference, are writing the storage subsystems, writing the network load balancers, and supporting those in production. For us, we have Azure, and the folks writing that are the Azure development group. And they do a good job; the platform has matured a lot over time.

But an application is going to need infrastructure, right? When you're running applications you're going to need things like VMs. Do you have infrastructure as code, or do you just go to the Azure team for that?

That's part of our deployment. We deploy out through Azure, through ARM, Azure Resource Manager, and it's part of your deployment; it provisions. The feature teams write and own their deployments, and they provision the resources they need as part of their code deployment. And Azure makes that pretty easy from a deployment and provisioning perspective.

Going back to that earlier slide: you talk about feature team engineers, which is dev and test combined, and live site engineers is just a role?

It's a role, yes. If you're on the feature team, you'll go through a live site rotation, and when you put on the live site rotation hat, you're putting on your live site engineering hat: you'll be on call or focused on live site. That's a feature-team live site engineer.
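To make that provisioning answer concrete, here is a minimal sketch, assuming the Azure CLI is installed and logged in, of a release step handing an ARM template kept in source control to Azure Resource Manager; the resource group and file names are made up:

```python
import subprocess

def provision(resource_group: str, template: str, parameters: str) -> None:
    """Create or update this service's infrastructure declaratively via ARM."""
    subprocess.run(
        [
            "az", "deployment", "group", "create",
            "--resource-group", resource_group,
            "--template-file", template,       # VMs, storage, networking, ...
            "--parameters", f"@{parameters}",  # per-environment settings
        ],
        check=True,
    )

# Called from the same release that deploys the bits,
# so infrastructure and code move together.
provision("vsts-scale-unit-eu", "azuredeploy.json", "eu.parameters.json")
```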
Yeah, but the IT ops guys, the traditional IT folks: are they now part of the feature team, or what happened to them?

Well, the role is evolving. We still have the on-prem infrastructure, and we've got a team supporting that as part of our group, and that's evolving too; we're even trying to move a lot of that infrastructure up. Ultimately we want to get all the TFS on-premises users up into our VSTS product.

Tom, let me jump in; I got some of these questions before the talk, so I think I know where this is coming from. He's showing you the evolution of our IT ops. As he said, the ops team used to be completely separate; it wasn't even part of our org, it was in the MSIT org. We'd hand the code over to them and they'd deploy. Then came the first change, where we said, no, let's bring that IT ops team in and make it part of Brian's org. That was the first change. They were still IT ops, doing the work he described on the previous slide. Then we looked at the work they were doing and asked: what do you do? Okay, we do deployments, we monitor all these alerts, and we do all this change management, and so on. So we said, let's see if we can simplify your life. The first thing we did was automate the deployment. When he said the feature teams wrote the deployment: it used to be that the IT ops team wrote the deployment scripts and pushed the bits out to production. That responsibility moved to the engineering teams, and once it did, IT ops was no longer in the business of doing deployments. So they had to up-level their skills to do something other than deployments, which had been 80 percent of their time. Then they focused on looking after the alerts and responding to them.

But even that changed in the next phase. And you'll notice we changed the names; this is Tom's team. It came in as an IT ops team, but we're evolving the team, and with the evolution the names change too. It used to be the ops team; then it became the service delivery team; then the service engineering team; now he's calling it the site reliability engineering team. That's just a way of describing that they're moving up the stack in terms of skill set. He's hiring developers into the site reliability team at this point. But remember, that's not where we started. Your customers may be way down in the IT ops realm; you just need to explain that there's a phased approach to moving up the stack. And where they are now: even the monitors and alerts don't go to his team. The application alerts in particular flow straight to the engineering teams, so they're even out of the business of responding directly to alerts. They're now focused on: what do we see globally in terms of the reliability and performance of the system? They're debugging; they're understanding the code at that level of detail. So what you're seeing here is an evolution of his team. Make sense? Yeah, okay.
One addition, though maybe that's for after this session. What I also see a lot, within the developer teams I work with: they have a lot of knowledge about coding, but all the stuff around networking, VNets, and so on...

You're absolutely right, and that's exactly it. I always ask Tom: what is it that your team adds in the org? Because I don't need you to go fix the feature code; we have a feature team for that. But your team can be really, really good at understanding the ecosystem of production: the networking, the platform he talks about, SQL Azure, the load balancers, the DNS, all of those things. When the service is running in production, the engineers know how the code works, but they're not intimately familiar with the entire ecosystem. That's where he brings in this expertise. If there's a platform issue, his team can take point on going deep into it; whereas once we realize the issue is with a specific piece of code or feature that somebody wrote, the best thing is to ask the feature-team engineer to figure out what's going on. That's one tangible example.

And one example from me is networking: most feature devs are not really steeped in how to troubleshoot networking. They don't know how to run Netmon and interpret it and so on. The SRE team has deep expertise there, and when we think it might be a networking issue, this small SRE team has that skill and can apply it across any of the feature teams in a live site situation. Same with storage and the rest.

One more question: how do you actually get that knowledge back into the feature teams so they can write better code going forward?

That's really our post-mortem process. Say there's a networking issue and we realize our code isn't resilient to it. It might be my team that troubleshoots it, isolates the network issue, and works with Azure networking or whomever to fix it. But then, as part of our post-mortem process, we're asking: why isn't our code more resilient? Why did this impact customers? How do we improve to make the code more resilient? The devs are always part of the RCA, the root cause analysis, whether it's an infrastructure issue or an app issue, because either one has impacted the customer, and there are things we can do in code to improve that. A lot of those improvements we build back into our framework. Buck gave a talk about the framework; I think of it, in part, as the glue that sticks us to Azure. It has libraries for how we connect to storage, and those have retries built in, plus all the telemetry we collect to understand the health of that remote dependency. So any time we have, say, a storage issue, we analyze: was the service resilient to that failure? Did we have the telemetry we needed?
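A sketch of that framework glue idea: one wrapper that every remote dependency call goes through, retrying transient failures and emitting telemetry on every attempt. The function names and telemetry fields are illustrative, not the actual framework API:

```python
import time

def emit_dependency_telemetry(dependency, operation, duration_ms, success, attempt):
    # In the real service this would land as rows in the telemetry store.
    print(f"dep={dependency} op={operation} ms={duration_ms:.0f} "
          f"ok={success} attempt={attempt}")

def call_dependency(dependency, operation, func, retries=3, backoff_s=0.5):
    """Invoke a remote call with retries; log every attempt, success or failure."""
    for attempt in range(1, retries + 1):
        start = time.monotonic()
        try:
            result = func()
            emit_dependency_telemetry(dependency, operation,
                                      (time.monotonic() - start) * 1000, True, attempt)
            return result
        except Exception:
            emit_dependency_telemetry(dependency, operation,
                                      (time.monotonic() - start) * 1000, False, attempt)
            if attempt == retries:
                raise  # resilience has limits: surface the failure
            time.sleep(backoff_s * 2 ** (attempt - 1))  # exponential backoff

# Usage (hypothetical): call_dependency("AzureBlob", "Read", lambda: blob.download())
```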
If we see any gaps, we log them as repair items and build them back into the framework, and because the framework is shared across all the services, that goodness spreads across the whole ecosystem. So that's the basic model.

Since you're capturing and analyzing all this data, have you noticed any patterns? You have a lot of information now; I imagine you could use machine learning and so on to dig into it.

Let me answer that with a specific example. There's a concept of problem management that I'm showing here. Incident management is dealing with the tactical issue: you want to detect it, isolate it, resolve it, and understand root cause; that's one instance. Problem management, in one of its flavors, is looking for repeat issues or repeat patterns and volumes. So we have an incident ticketing system where we track all of our incidents with all the details, including root causes, and we spent a fair amount of time looking back over the last nine months or so of incidents. We went specifically to SQL Azure, because we could see it's one of the big drivers of the incidents we're having. One of the things we looked at is how we mitigate. We saw that for a lot of these incidents, regardless of root cause, the mitigation was scaling up the database. It might have been temporary or permanent, but it mitigated the impact, or would have avoided the incident entirely. Ed, do you remember the percentage of incidents where scaling was the mitigation? I think it was around 35-40 percent; let's say 30 percent. So knowing that we have a fair number of SQL Azure incidents, and that the mitigation is manually scaling up the database, we looked hard at that. We'd been toying with the idea of automating that scale-up, and that data really helped us raise its priority. I don't know if we have a plan for it yet.
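A rough sketch of the pattern-mining behind that example: pull closed incidents out of the ticketing system and count how often each mitigation appears, so a repeat mitigation like scaling up the database surfaces as an automation candidate. The incident records here are invented:

```python
from collections import Counter

incidents = [  # in reality: exported from the incident ticketing system
    {"service": "SQL Azure", "root_cause": "query plan regression", "mitigation": "scale up database"},
    {"service": "SQL Azure", "root_cause": "tenant hotspot",        "mitigation": "scale up database"},
    {"service": "Compute",   "root_cause": "bad deployment",        "mitigation": "roll back"},
    {"service": "SQL Azure", "root_cause": "storage throttling",    "mitigation": "scale up database"},
]

by_mitigation = Counter(i["mitigation"] for i in incidents)
total = len(incidents)
for mitigation, count in by_mitigation.most_common():
    print(f"{mitigation}: {count}/{total} = {count / total:.0%}")

# If one mitigation covers a third of incidents regardless of root cause,
# that's a strong signal it's worth automating.
```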
Tom, let me jump in and give a slightly higher-level answer to your question, and see if it helps. There are two ways we look for patterns. One is the post-mortem process he'll talk about: once a week the leadership team, along with the engineers from Tom's team, gets together and looks at the issues that occurred the previous week. When we do the root cause analysis, we're not just asking what happened with that specific incident and what to do about it; we're asking the broader question: is this a class of issues we have in the system? You've got the collective brainpower of the leadership from all the areas in the room, plus Tom's team, following the data and asking those kinds of questions. The second way is that once a month we have a monthly service review, where we look at the issues for the entire month and ask a very similar question. For example, we look at our time-to-detect metrics and say, wow, we had 30 incidents and only 10 were detected by automation; clearly that's not healthy. Why is our detection so weak? And then we find, oh, we have pretty weak detection for this class of issues, say job agent issues; we don't have good detection there. That turns into an action, a repair item. So that's how we work on it; he refers to it as problem management.

Thank you. You're welcome. Yes?

What's the difference between a DRI and an SRE?

DRI is a term we use internally; it stands for designated responsible individual. Basically it means you're on call: you're focused on live site, you're out of the feature rotation, and you're doing proactive live site work. The simple way to think of it is as synonymous with being on call. SRE is site reliability engineering; that's how we've staffed the expertise for the platform and the incident process, and that's my group. And we go on call too, so we'll have SREs who are DRIs, meaning they're the ones on call.

Thank you. Yep. Yes?

Is there a reason you use a separate ticketing system for your incidents, instead of putting them directly into VSTS as work items?

That's a leading question; I think you've touched a sore point for this group. If you ask Bill, he'll say, yeah, we should be using that. In fact, we were using TFS work items, VSTS work items, for live site ticketing. Since then, we've become part of the broader Cloud and Enterprise division, where the Azure team developed a ticketing system, and they've invested quite a bit in it: it has the on-call rotation, managing the bridge, a lot more functionality than just managing tickets. So we embraced that system. However, we've had ongoing conversations about making the actual tickets be work items, and maybe someday we will. That's just how things evolve.

Okay. So SRE does a bunch of things; in the context of this presentation we're really talking about live site monitoring and, to a degree, problem management. And coming back to your earlier question: what's the process, and how do we respond to these issues in a structured and efficient way? This is our world map of the process. It brings together all of our people and our tools and moves us toward mitigating issues. It starts over here on the left with inputs. The strongest input, the one we want, is actionable alerts, where we proactively detect issues before customers do. There are also fuzzier signals: we're starting to watch social, trying to set up alerts over social monitoring. That's customer impact we're trying to measure, and listening to the customer is a good source, but it's a bit fuzzy. And there are a bunch of other ways things come in: through email, ALM champs, TLs, all kinds of channels, and our customers escalating. So those are the inputs that come into the system.
Ideally we put them through our ICM incident system, which has auto-routing rules. We can parse different properties and then set up the auto-dialer or a notification, which, depending on severity, will either call or email the DRI, the on-call. We also have what you can think of as an internal tier-one support team, our VS Online live site team. If inputs aren't flowing into the automated routing system, they can interpret the requests from users (we get a lot of internal folks escalating) and get them routed properly. And if something comes in from CSS, say a product question, we have a DTS process for that; it's outside of live site, more of an individual case.

Then there's the actual incident flow. Say I've got a customer-impacting alert or a feature team alert. It comes into the ICM system, and we set up rules to route it properly to the team. Every engineering team, and my SRE team, maintains a call list in the incident tool with a primary and a backup, and there's a whole escalation path set up. So ICM will start calling you; you'll get a call from the auto-dialer with its recorded voice: we need you on a live site incident. You pick up and acknowledge, and it stops calling through the call tree. Depending on how we route it, it goes to the SRE team for platform alerts, or, for a generalized customer-impacting alert (those availability alerts we talked about, where we don't know exactly what code or what cause), it also comes to the SRE team. And I'll just say we're staffed 24/7; we're always watching live site. We don't write features (we write some non-functional code), but we're watching live site with a follow-the-sun model: we've got folks in India, who pass off to Ireland, a newer team we're building out, and Ireland passes off to Redmond. So we're continually staffed in eight-hour shifts with folks ready to respond to incidents. If it's an app issue, the dev authored that alert and it gets routed directly to them.

Whoever gets the alert, the first thing you want to do is understand: are customers being impacted, or is there material risk that they will be? Either of those conditions in the impact assessment triggers a live site bridge, and this is where we spin up a heavy but very valuable process. The SRE team, from within our ICM tool, spins up a bridge from the ticket, and everybody joins it in real time. We've gone through iterations where folks tried to do this over chat and the like; getting on a bridge, sharing screens, and talking in real time is the best way to triage these issues. So we fire up the bridge, and initially it might just be the on-call engineer. They start working to isolate the issue: traverse the telemetry, figure out what's going on, dive deep, and then determine the mitigation. And the mitigation is about mitigating customer impact. A lot of the time there's a longer-term fix you'll have to do, but right now you're focused on stopping the customer impact.
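A toy sketch of that routing idea: platform and generalized customer-impact alerts go to the SRE on-call, app alerts go to the team that authored them, and severity decides auto-dial versus email. The team names and rule structure are illustrative, not the actual ICM rules:

```python
ON_CALL = {
    "sre":        ["sre-primary", "sre-backup"],      # platform / generalized impact
    "build-team": ["build-primary", "build-backup"],  # dev-authored app alerts
}

def route(alert: dict) -> tuple:
    # Platform issues and generalized customer-impact alerts go to the SRE
    # team; an app alert goes to the feature team that authored it.
    if alert["kind"] in ("platform", "customer-impact"):
        callees = ON_CALL["sre"]
    else:
        callees = ON_CALL[alert["authoring_team"]]
    # Severity decides the channel: high severity auto-dials down the
    # escalation path until someone acknowledges; lower severity just mails.
    channel = "autodial" if alert["severity"] <= 2 else "email"
    return callees, channel

print(route({"kind": "platform", "severity": 1}))  # SREs get woken up
print(route({"kind": "app", "authoring_team": "build-team", "severity": 3}))
```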
There are a bunch of different ways we'll mitigate things. Once the incident is mitigated (and you've also pulled root-cause state out of production so you can understand the deep technical reason), we go into the improve phase, which is really the RCA process I'll talk about: the learnings you pull out and how you improve over time.

I'll also mention that some alerts or tickets that come in aren't high severity, and we don't spin up this whole bridge for them. It might be an investigation: we've got a database that's filling up, but there's plenty of runway and we need to look into it. Or a customer escalates something but isn't asking for it to be resolved in real time. That goes into an 8x5, business-hours path. And depending on whether it's my team or the feature teams, business hours differ: for me, business hours are 24/7 because I'm always staffed; the feature teams are in India, North Carolina, Redmond, sprinkled throughout, so their eight-hour window, whatever you consider business hours, varies around the world.

So this is our world map, and the value of it, if you go back to that Sprint 45 scenario I shared: we didn't quite have this rigor then, and the roles weren't as well defined. You had a lot of people coordinating activity, but not in a prescriptive way.

So let me drill into the incident bridge. We've determined there's impact, and we say, let's fire up the incident bridge, and we bring in a bunch of roles. Initially this may be somebody wearing multiple hats, but if it's a big enough incident you'll have individuals for each of these. Sometimes I'll jump on a bridge and there might be 30-40 people on it; not all of them are actively investigating, though a lot might be, but it's also a place where you can learn a lot about live site.

First we've got the incident manager, which is a very special and important role. I think of them as the quarterback for the overall incident response. They're not going deep investigating things; if they do, they can get lost. They're watching the overall flow of the incident: do we have the right people on the bridge? Are we blocked? Do we need to pull anybody else in? They're technical, and they're asking: do we have a systematic way of isolating the impact? Have I assigned tasks so we're working in parallel as fast as we can? Does anybody need help? Those are the kinds of things they do. They use a tool we call the Outage Hub (we used to use a whiteboard and Skype), and they keep it updated with the current impact, the list of tasks, who each task is assigned to, the due date and status, and the current focus of the incident.
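A sketch of the kind of state an Outage Hub-style record captures, so anyone joining the bridge can orient at a glance; the schema here is invented, not the internal one:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    description: str
    owner: str
    due: str
    status: str = "in progress"

@dataclass
class OutageRecord:
    incident_id: int
    current_impact: str   # what customers are experiencing right now
    current_focus: str    # what the bridge is working on at this moment
    tasks: list = field(default_factory=list)

    def summarize(self) -> str:
        """A one-glance answer for a leader joining the bridge mid-incident."""
        open_tasks = [t for t in self.tasks if t.status != "done"]
        return (f"INC{self.incident_id}: {self.current_impact} | "
                f"focus: {self.current_focus} | {len(open_tasks)} open tasks")

rec = OutageRecord(42, "web commands slow in EU scale unit", "isolating SQL latency")
rec.tasks.append(Task("pull dependency telemetry for EU SQL", "on-call SRE", "14:30"))
print(rec.summarize())
```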
By putting all of this in the Outage Hub, when Buck comes in (and he comes in all the time on these incidents) and asks a bunch of questions, and then maybe Brian joins and asks a bunch of questions, and I jump in and ask a bunch of questions, we can organize the information and answer those questions as folks come in and out. It also helps us analyze the incident after the fact, and again, it holds the tasks that are delegated out. So the incident manager is critical, and it's a skill you learn.

Then we've got the site reliability engineers, the technical experts digging into the end-to-end flows as well as the platform telemetry and issues; or the feature-team live site engineer digging into an app issue, or into how to mitigate an app issue that might even be caused by a platform problem. They're the deep technical people investigating and deploying the mitigations. For us, if we have an Azure issue, we've got an escalation path over into Azure and all the different teams (we know them quite well at this point), and we can bring them into our bridge. They help confirm whether it's us or them, tell us what's going on with the platform, and perform mitigations we can't do on our own.

We've also got the concept of a communications manager. This gets back to what we were saying: transparency and openness and building trust with our customers are really important to us, and when there's an incident we want to get those comms out. With one-touch tooling, the comms manager drives that and continually provides updates; go look at our live site blog and you can see the updates over time. Around six months ago we started staffing this role with full-time employees, managers or PMs who have a lot of context on the service and strong communication skills, rather than technical resources. Before, we were asking the SREs to do it, and asking someone to go deep technically while also communicating is challenging for anybody. So we split up that role.

And then we've got this thing we call the CEN: classify, escalate, notify. It's our internal agreement on how we trigger live site and how we think about impact. It also tells us: if an incident goes on this long, or is this severe, pull in leadership, because they'll want to be aware, reach out to top customers, and make sure everything is on track themselves. So for major incidents we'll start pulling in Buck and Munil and other folks.

That's the incident bridge at a very high level. And again, the general phases you go through: if you're the incident manager, you're making sure you've got the right folks engaged; you're trying to isolate the issue, traversing, ideally, the customer impact telemetry to find where the slowdowns or errors are being introduced; you're assigning tasks to make sure people go deep on how to mitigate; you're always capturing state before mitigating; and then you're making sure the mitigation gets out fast and the comms are happening. Any questions on our incident process? Yes?

What do you use to manage all of this?
That's, again, our ICM...

The Outage... you said the incident manager creates an outage-something?

Yeah. We have this ICM tool that the alerts come into, and it creates an incident ticket in our incident tooling. We fire up the bridge from within that incident; you come in and find it there. And then we create something called the Outage Hub, which is associated with that incident ticket. It's, I'll say, a form where you can enter all the state on the incident: what's the impact, what are all the tasks I've delegated, what's their status, when are they due. All those things an incident manager does to understand the issue and drive isolation and mitigation are encoded in this Outage Hub workflow, and it's saved with the incident.

On the question about the tool: it's part of ICM, this ICM tool he keeps referring to. It's a module inside ICM.

Yeah, it's a module. It's nice; it encodes the incident workflow and helps us structure that incident manager role. Previously we used a whiteboard; we've tried OneNote. In general you want some kind of tooling that captures all the state, defines the tasks, and logs what's happening in the incident, so you can communicate with others and then mine it after the incident and look at your time-to metrics. Unfortunately we're showing you some things here that aren't externally available. Yeah; ideally this would be built into the product.

I have a couple of questions. This incident manager: is that a dedicated role, or just somebody wearing that hat?

It's both. When an incident starts out at, say, a Sev 2, the lower end of impact, the SRE plays the incident manager role. They'll be investigating, coordinating with the feature teams and with Azure, and also doing the incident manager job. It's a bit challenging; you tend to want to go deep, but you've got to structure the tasks. Then if it's a bigger incident, or it goes on too long, we'll actually pull in an incident manager.

Like a role?

A role, yeah. Senior managers from around the org will come in and take over that incident management role. These are not dedicated people doing incident management; a very large service like Azure may have people who do this for a living, but for Tom's team it's not a job, it's a role. You put on the incident hat; we've actually got a little fire hat for the incident manager.

What about the communications manager, is that the same thing?

Same thing.

Another question on the previous slide: what is that system, actually? How much effort did it take to put something like that together, and what tools did you use? I don't know if you can share that.

Again, at this point it's all based on this ICM incident management tool. That's an internal tool.
We use it; it's got the auto-dialer and the on-call lists (it maps to public tools like PagerDuty), and it's got the incident and Outage Hub modules and everything. It's all integrated into this tool our division runs; Azure essentially created it. There are analogs out in the industry, but we use an internal flavor. Okay, how are we doing on time? Fifteen minutes left, so I'm going to try to fly through this next part.

So: I like people. But when it comes to online systems, they make mistakes, they need to sleep, they forget things. For us, as we've discussed, automation is critical. When you're at scale, anything you have to do manually is going to overwhelm you, and people aren't as reliable as automation: if you code it up and automate it, you can test it, and it keeps doing what it's supposed to do. So we're really focused on automation at this point and invest a lot in it. I talked about secrets: it was very manual; it literally got to the point where every time we needed to rotate them, most of my team spent a week on it. We've automated the vast majority of that. So we're learning our lesson and getting the manual debt out of our service.

There's also a live site angle: there are things that can reduce your time-to metrics and help you execute more efficiently. One is, I think, a common practice; I've seen it in every form of live site I've done. You have what you'd call a troubleshooting guide, or a knowledge base article, tied to an alert. When the alert comes in, it gives you the prescriptive knowledge you've captured on how to respond, how to troubleshoot, how to mitigate. Our group really believes in this: most of the time, for a given class of issues, there's a best way to mitigate or understand it. You may need to go into a deeper investigation, but we want to capture that 80-percent-rule knowledge. But is it fast? You've got this big wiki article with a bunch of Kusto queries in it, and when an alert comes in, it's hard and slow to read through it and cut and paste those queries into Kusto. So we've really started to invest in automating this. We like the knowledge and we want to capture it, and this is Robo Remy, our automated robot. He, or it, runs these Kusto queries for us: we take all the Kusto queries out of the wiki article and put them into a runbook for the alert, in this Calypso tool we have. Then, when that alert fires, or we pump in an alert from another path, we kick off all those queries and decorate the results into the incident ticket.
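A minimal sketch of that runbook automation: the troubleshooting guide's queries live in a runbook keyed by alert name, they run when the alert fires, and the results get attached to the ticket. The query strings and both helpers are stand-ins for the real tooling:

```python
RUNBOOKS = {
    "web-slow-commands": [
        "commands | where slow | summarize count() by command",
        "dependencies | where duration_ms > 1000 | summarize count() by target",
    ],
}

def run_kusto_query(query: str) -> str:
    # Stand-in: really this executes against the telemetry store
    # and returns a rendered result table.
    return f"<results of: {query}>"

def attach_to_incident(incident_id: int, title: str, body: str) -> None:
    # Stand-in for decorating the incident ticket with the query results.
    print(f"INC{incident_id} [{title}]\n{body}")

def on_alert_fired(alert_name: str, incident_id: int) -> None:
    """Run every saved query for this alert and attach the results."""
    for query in RUNBOOKS.get(alert_name, []):
        attach_to_incident(incident_id, query, run_kusto_query(query))

on_alert_fired("web-slow-commands", 42)
```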
You can actually go into ICM, and when you get woken up in the middle of the night, groggy, instead of having to go run all these queries yourself, the results just show up right there in the incident ticket. Hopefully very quickly you can understand what's going on and what you need to do. That makes us faster to understand the issue, isolate it, and ultimately mitigate it. So that's one type of automation. (I don't know why his hands turned purple.)

We've got other things too. In live site there are very common mitigations we perform over and over again, possibly for different root causes: rolling back a deployment; scaling things up or out when capacity is overwhelming the system; draining a bad node out of a load balancer; rebooting things; recycling things. We always collect state before we do these mitigations. But we used to do them very manually. When we wanted to pull something out of a load balancer, I'd RDP onto a box, and there was a little service-management command you could run: a bad front-end node, you pull it out of the load balancer and try to figure out what's going on. That's not fast, it's not secure to jump onto that box, and it's inconsistent; other people might do it differently. So for all these common tasks, what we've mostly done is script them, and we built it into the framework so we can invoke all these different mitigations remotely. But it's still a human operator making the call, and that adds time. That's good; we call it mechanization; but it's not fully automated.

So again, here's Robo Remy, and this is our first step into full automation. Robo Remy is out there as part of our framework; the health agent is the module, and it watches different counters. We've got one counter wired up right now, the ASP.NET queue length. When it goes up, it can be from different root causes, but we know it's causing slowness for users; we know we want to dump the memory and upload it so we can analyze it offline; and we know that by recycling that role we can mitigate it most of the time. So Robo Remy constantly watches that counter on compute, and when it goes above a certain threshold for a certain amount of time, he takes all those steps. He takes a lock on a blob so that only one instance pulls itself out; we want to be careful there. Then he clones the process, and then he pulls the node out of the load balancer. At that point the customer impact is mitigated, and a human didn't have to do anything; nobody even had to wake up. Then he dumps the cloned process, uploads it to storage, and sends an alert through the ICM system so we can look at it and investigate during business hours. Then he cycles the app pool, runs a health check (we've got a health model that makes sure the node is healthy), adds the node back into the load balancer, and customer traffic comes back in. We've got our capacity back.
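Roughly, that flow might look like the following: a minimal, hypothetical sketch of a per-node health agent. All of the framework helpers are stubbed and the thresholds are invented; it is not the actual agent:

```python
import time

QUEUE_THRESHOLD = 100    # requests queued; made-up value
SUSTAINED_SECONDS = 120  # how long the breach must persist before acting

# ---- stand-ins for framework functionality --------------------------------
class Lease:
    def release(self): print("lease released")

def acquire_blob_lease(name): return Lease()  # None if another node holds it
def read_counter(node, counter): return 0
def clone_process(pid): return b"<memory dump>"
def remove_from_load_balancer(node): print(f"{node} drained")
def upload_to_storage(dump): print("dump uploaded")
def file_icm_incident(severity, summary): print(f"sev{severity}: {summary}")
def recycle_app_pool(node): print(f"{node} recycled")
def health_check(node): return True
def add_to_load_balancer(node): print(f"{node} back in rotation")
# ----------------------------------------------------------------------------

def mitigate(node, worker_pid):
    # Only one instance may pull itself out at a time: a lease on a shared
    # blob guarantees we never drain several nodes at once and lose capacity.
    lease = acquire_blob_lease("robo-remy-mitigation-lock")
    if lease is None:
        return                               # another node is already mitigating
    try:
        dump = clone_process(worker_pid)     # capture state *under load*
        remove_from_load_balancer(node)      # customer impact mitigated here
        upload_to_storage(dump)              # analyzed offline in business hours
        file_icm_incident(4, f"{node}: queue length auto-mitigated")
        recycle_app_pool(node)
        if health_check(node):               # health model says the node is good
            add_to_load_balancer(node)       # capacity back, nobody woken up
    finally:
        lease.release()

def watch(node, worker_pid):
    breach_started = None
    while True:
        if read_counter(node, r"ASP.NET\Requests Queued") > QUEUE_THRESHOLD:
            breach_started = breach_started or time.monotonic()
            if time.monotonic() - breach_started >= SUSTAINED_SECONDS:
                mitigate(node, worker_pid)
                breach_started = None
        else:
            breach_started = None
        time.sleep(15)
```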
That's powerful for us. We don't have to wake up humans, we compress our time-to metrics, and it's reliable: Robo Remy does it very consistently. He doesn't sleep, he doesn't need any overhead, and he scales. It's a little scary to have this happen automatically; it's like Skynet, the computers start making all the decisions. But it's the way we want to go, and we're edging toward it, making sure we really learn before we go too far.

Sorry, is it becoming scary? Are you planning to use artificial intelligence?

Like everybody, I think, we've been looking at machine learning; for alerting, it's one thing we've been investigating, and Kusto gives you a lot of these functions; Kusto's got all our data. We're trying to use machine learning, and to be honest, we haven't been super successful yet at coming up with practical solutions. Our partners over in SQL Azure actually are getting some strong, very precise signals that detect issues pretty well using ML. For us, we're going after it, but we're still figuring it out.

I have a question. Just wondering, at what point do you take the server out of the load balancer? If you take it out too early, there won't be any requests; if you take it out too late, then...

Yes. Specifically on that mitigation, we don't take it out of the load balancer first. We clone the process under load, so that we're capturing the state with the load on it, and then we pull it out of the load balancer. That way it's a good repro when we dump it.

Another type of automation: I mentioned you can run all these Kusto queries and join them, and we can traverse all these systems and go deep. But we've also invested in a lot of different visualizations that expose this data in a way that helps us understand, at a glance, what's going on with live site, and then drill through. We've got a health model that shows our overall deployment, and we can decorate in all the alerts with their different severities; at a glance you can see everything that's active. Everybody writes these alerts and they're valuable; we don't necessarily want to wake everybody up for every one of them, but showing them in a health model when you have an issue lets you see everything quickly. That's one kind of investment. We've got these DevOps reports that take all the metrics we send back, and the queries, and show them at a high level (what's customer impact, what does compute look like, what does SQL look like), and we can drill through them; it makes us more efficient and quicker. And then there's change: whenever we deploy through Release Management, we log a change marker to this FCM tool, federated change management, another Azure solution we use internally, and we can show those changes in the DevOps reports. Change often causes issues, so it's a nice way to compress the time-to metrics.
I'm going to go through root cause now; this is our last slide, and we've got seven minutes, so we're doing well. We talked about learning. Once you're out of the incident: first, you've dumped the state, so you can understand the deep technical root cause. And then there's how you responded, which is the other thing you want to analyze; in that Outage Hub we've captured all the decisions we made, the tasks, and the time-to metrics. Every week we look back at the previous week and do these post-mortems; it's the dev teams and my team collaborating to pull everything we can out of each incident in terms of learning. And speaking to the culture of the org: it's a very safe culture. It's open, everybody is self-critical and questions things, and there's no blame. We sincerely want to understand what we could have done better. We talk about it in our live site review, and we document it in formal post-mortems so we have a trailing history: the impact, the technical root cause, the detection and the mitigation, how we responded.

But the most important thing, I'd say, is that we pull a bunch of repair items out of it. These are work items we log in VSTS, and we can link them to the incident and set delivery goals on them. They could be for monitoring gaps; they could be for resiliency improvements we need to make. We then scorecard them; we've got our own internal scorecard we developed that tracks these repair items. Scorecards can be helpful when you've got a lot of things to prioritize and manage, and all the different feature teams, and my team, have items on it, so we can see whether we're falling behind on our commitments to close out these live site repair items. That closes the whole loop. It helps us improve over time: respond better, respond faster, improve resiliency in the product, drive out false alerts, and get to meaningful, precise alerts. Any questions on that? Yes?

Since you've been doing post-mortems and collecting all this information, are there any best practices that have been posted anywhere, or white papers published? A common question from clients is: what are the best practices for implementing telemetry, or monitoring in general?

From a live site perspective, we have not aggregated those best practices, I'd say. It's something we've talked about. We do post the post-mortems publicly for the big outages, and customers read through those and pull out a lot of the lessons we learned and apply them as general patterns, but we haven't centralized them in one place yet. That's an interesting idea. Yep.
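As one concrete illustration of that repair-item step: work items can be created over the VSTS REST API with a JSON-patch document. The account, project, tag, and token below are placeholders, and the api-version may differ by deployment:

```python
import requests

def create_repair_item(title: str, incident_id: int) -> None:
    """File a repair item as a work item and tag it for the scorecard."""
    url = ("https://myaccount.visualstudio.com/MyProject/_apis/wit/"
           "workitems/$Task?api-version=4.1")
    patch = [  # work item creation takes a JSON-patch document
        {"op": "add", "path": "/fields/System.Title", "value": title},
        {"op": "add", "path": "/fields/System.Tags",
         "value": f"LiveSiteRepairItem; INC{incident_id}"},
    ]
    resp = requests.post(
        url,
        json=patch,
        headers={"Content-Type": "application/json-patch+json"},
        auth=("", "PERSONAL_ACCESS_TOKEN"),  # a PAT, not a real credential
    )
    resp.raise_for_status()

create_repair_item("Add disk-latency alert for the DB tier", incident_id=42)
```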
Adding to that question: in the previous slide we saw DevOps reports. I'd like to see the detail of those charts; can we get that in App Insights?

Yeah, App Insights has dashboarding, and it's a lot of the same stuff. Even the health model I showed: they've got App Map, which for free shows you the whole flow of your code and gives you a logical deployment view with the telemetry and state. So a lot of these analogs are built into App Insights, and they've got composable dashboards where you can find metrics and put them together. It's very similar to what we've done internally; we just happened to do it through internal tooling.

But Tom, my understanding is that even though you're not using the App Insights dashboards, App Insights Analytics, Kusto, is underlying everything; all your fancy dashboards are coming from Kusto.

Yep, that's all coming from Kusto, and everyone can use Kusto and build their own dashboards on top of it; Kusto drives all of this. Kusto is the heart (and it's magical); it's the heart of our live site telemetry. And it is something folks can use externally: Azure Log Analytics and App Insights Analytics are the exact same engine with the exact same capabilities. App Insights has the same kinds of views you can build over the top, and alerts too: we actually schedule queries over Kusto and generate alerts from them, and App Insights has exactly the same feature. You can set up a scheduled Kusto query and generate alerts, dashboards, and so on.
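For that scheduled-query-to-alert pattern, here is a rough sketch using the public Kusto Python client (azure-kusto-data). The cluster, database, table, columns, and threshold are all placeholders, and a managed setup would more likely use the platform's built-in scheduled alert rules than a loop like this:

```python
import time
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

QUERY = """
commands
| where timestamp > ago(5m)
| summarize slow = countif(duration_s > perf_goal_s), total = count()
| extend slow_rate = todouble(slow) / total
"""

def page(message: str) -> None:
    print("PAGE:", message)  # stand-in: hand off to the auto-dialer

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.kusto.windows.net")
client = KustoClient(kcsb)

while True:
    row = next(iter(client.execute("telemetry", QUERY).primary_results[0]))
    if row["slow_rate"] > 0.01:  # more than 1% of commands missing their goal
        page(f"slow command rate is {row['slow_rate']:.1%}")
    time.sleep(300)  # evaluate every five minutes
```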
So, in summary: we're really focused on the quality of our service; we view it as a competitive feature. If we have great features and they're up and performant, that's the formula to build a good business and to win and retain customers. And transparency matters: things do happen, and we try to hold ourselves accountable, pull out all the learnings, and share them with customers, so they understand we're committed to improving over time. From a telemetry perspective, it's easy to collect a lot of telemetry, but it's very important to have a conceptual model for organizing it. For us that's the customer impact telemetry, the availability model, which lets us traverse across the system and find issues; it's really helped our alert precision and made troubleshooting more effective. We've got common beliefs that align our group on how we think about and approach live site, so we're marching in step. We've split out roles: we're a DevOps shop, the feature teams are on call and learning from their code in production, but we see value in a deep team that understands the platform and the networking and owns the live site process 24/7. And then automation; sorry, manual debt is the killer, and as you grow to scale it will crush you. We've learned we just can't have manual debt in the system, and we're very aggressive about automating it all. We're also automating our live site process itself, reducing those time-to metrics and getting comms out quickly; the workflows outside the service, we want to automate those too. And finally, getting to root cause: always dumping the state before we mitigate is key, so we can really understand the technical issues, make sure they don't repeat, and resolve them; and also looking back at our incident response, asking whether we could have been more efficient, and driving those improvements back into our cycle.

So that is the end of the talk; it's noon. Again, I'm Tom Moore. If you have any questions, my email is up there, and thank you for your time.