Hey everyone, welcome to IT Ops Talk: All Things Hybrid. In this session, OPS116, we're talking with Eric Nemvet about how to respond to on-prem events using cloud services. So stay tuned. All right, let's go.

Hey everyone, welcome back. Hey Eric, how are you doing? Pretty good. How are you? I'm good. So what exactly are we looking at in this session? I'm going to give you an overview of Microsoft Retail's cloud monitoring framework, which enables on-prem and hybrid environments like the Microsoft Retail stores to monitor and react to events that happen on-prem as well as in our hybrid IaaS environments. So we're using cloud services to monitor and alert on things happening on-prem, and there's no secret sauce in there. It's all commercially available, 100 percent public offerings from Azure, from Microsoft; no internal secret-squirrel stuff. Okay. All right. So take us through it.

All right, sounds good. We'll start by running through the history: where we were when we started this adventure in Microsoft Retail, some design considerations, our implementation, a high-level overview of the entire framework, and the integration points. We'll run through a demo and then cover a few references to get you started. Okay. Perfect.

Okay. Microsoft Retail stores had a roughly 10-year legacy of on-prem infrastructure challenges. We had a large SCOM farm. We had hundreds of duplicate alerts; as new IT teams rolled in, we never really deprecated the old alerts, we just added new ones. We had a huge reliance on vendors. I remember when I first joined the team, I said, "Hey, this machine's low on drive space," and they said, "Oh, just create a ticket." Did the ticket magically fix the drive space? It turns out there was an army of vendors doing a lot of things on-prem. We also had tough challenges around the fact that this was such a high-visibility marketing presence for us that we were scared to make any changes that could potentially cause customer impact. And given the infrastructure and the way things were, when things went bad you had these single-person weekend-warrior heroics, where somebody stayed up 48 hours to rebuild a SQL table or whatever. So we weren't in a great place. We were treading water. We wanted to find a solution that helped us bridge the gap from 100 percent on-prem to hybrid, and eventually to what it would look like if we moved our entire operations to the cloud.

Well, this sounds very familiar. Maybe it's all my years in IT, but you get so many alerts that they just become white noise; eventually you get tired of those alerts and start ignoring them, or there are so many for the same thing that you say, "Okay, I've already fixed that, so I'm not going to bother with the second one," when in fact it's a new one. Are those the kinds of issues you were trying to address? Yeah, exactly. It's the traditional "can't see the forest for the trees." We'd get hundreds of drive space alerts for the same drive, on the same machine or multiple machines, and it would all be one common issue across all of them. So we really needed to look at how to reduce the noise first, because you couldn't be on call with that; it would just be 24/7. And the other issue when you get that many alerts is that you lose focus on the critical ones.
So we had a significant number of major incidents, which represented something visible or impactful to the customer. Imagine being signed up for a distribution list and trying to find the one email that says, "Hey, I really need you to call me," when you've got thousands of emails coming into Outlook, or whatever your email program is, every day. It just becomes unmanageable, and eventually it leads to really bad things. Yeah. And I think a lot of our audience are in the same boat. So that's great.

Yeah. So, some of the design considerations. Obviously our first goal was to prevent customer-impacting events. We call anything that would prevent the customer from being able to come into the store, get a product, purchase it, and leave with it a major incident. Then there were things like the video wall being down, or any kind of brand-damaging event, like the power being out so we couldn't even open the store. The video wall is a huge marketing asset, an awesome advertising platform; if it had issues, it'd be a bit of egg on our face. The other thing we really wanted to do was eventually make this available as a showcase to other customers, retail and otherwise, and base it 100 percent on Azure public offerings.

Okay. I've talked to a lot of other folks, and often there's a vendor component somewhere in the middle: you get the alerts from SCOM, and there's a vendor product you've had to buy that massages things in the middle and then sends them out. In this case, this is completely out of the box, commercially available to anybody who's listening right now. They could go set this up as-is, right now. Yep, yep. There's no secret sauce in this solution.

All right, so Eric, why don't you take us through what this solution looks like, an overview of what we're talking about here. Yeah, definitely. So here we have the entire framework, and it's made up of multiple components. First, the monitored resources. These are both on-prem and cloud-based: your local machines, your SQL servers, any custom apps, application logs, et cetera. We leverage the Azure Log Analytics/SCOM agent to monitor the on-prem resources. For the cloud-based resources, you generally have Application Insights, or a Logic App that goes and interrogates something and sends that data into a Log Analytics repository. Next, we have your Azure monitoring. This is your standard Azure portal monitoring solution: your Log Analytics source, your Azure alerting, and your Azure Action Groups. And then we have the cloud monitoring framework. Again, this is just a framework; it's designed to give you an almost plug-and-play solution where you focus on the data you care about to generate an alert, and then on what you need to do to resolve that alert, through a remediation runbook (basically an Azure Automation PowerShell runbook) and/or a ticket. Obviously, if something can't be solved automatically, you want to notify your users, send email, generate a ticket in your own ticketing system, et cetera. And finally, everything in this framework is sent out to Application Insights: whether it succeeded or failed, and what it did under the hood, it's all in Application Insights.
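Before moving on: as a concrete illustration of the kind of alert that flows through this pipeline, here's a minimal sketch of a classic low-drive-space Log Analytics query against agent-collected performance data. The 10 percent threshold is an arbitrary assumption for illustration:

```powershell
# A minimal sketch of the kind of Log Analytics (Kusto) query an Azure
# alert rule might evaluate against agent-collected performance data.
# Any row returned below the threshold becomes an alert condition.
$lowDiskQuery = @"
Perf
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space"
| where InstanceName != "_Total"
| summarize FreeSpacePct = min(CounterValue) by Computer, InstanceName
| where FreeSpacePct < 10
"@
```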
With everything in Application Insights, you can go back through and review where and what went wrong, help yourself resolve those errors, and see whether your remediation was successful or not. Okay. And when we call this a framework, it's not like the .NET Framework, where people have to download and install something. It's just how we've organized and built this solution to address the issues we had. Correct, yes.

Some of the things we wanted in order to address those issues: we wanted a framework simple enough that we could, with very little effort, migrate our infrastructure from a dedicated on-prem solution like SCOM to something usable in the cloud, and bridge that hybrid gap. To that point, we've provided an API, or PowerShell cmdlets, to help you migrate your alerts off of traditional SCOM and into this infrastructure. So it has APIs to build Azure alerts that will, over time, essentially replace your legacy SCOM alerts. Okay. So it doesn't matter whether an alert is generated by SCOM or by Log Analytics or Azure Monitor; they're all handled the same way through this framework. Correct, yep. Ultimately they all end up in a Log Analytics source, and from there they're queried through the Azure Monitor alerting system. Okay, all right.

All right, so let's look at the implementation. Obviously we needed something that brought us from SCOM to the cloud; at the time, that was OMS. So we had the SCOM/OMS agent, which we're still using, plus Azure alerting and, as I mentioned, Action Groups. We need Log Analytics, log search, Functions, and ultimately we can leverage Blob Storage. One of the challenges is that there's no maintenance mode once you go to a solution like this. So what we did was store our maintenance modes in Azure Blob Storage as a JSON file. Through some log searches we can pull that Blob Storage in, and that exposes our maintenance modes at alert time. So we can say, "Hey, this server's down." We go check Blob Storage: "Hey, that server's currently in maintenance mode. We're good; we can ignore the alert."

The other thing we looked at is Azure Functions. With Azure Functions we can put load balancers in front, so multiple regions can host this cloud monitoring framework. Yep. That gets you out of the all-my-eggs-in-one-basket problem where a region potentially goes down; we can spread across multiple Azure regions by leveraging Functions. And then we have PowerShell under the hood. There's very little, if any, compiled code in this solution; it's all based on Azure Automation. We don't have Durable Functions; it's all Azure Automation jobs. And in the Azure Automation piece we have PowerShell runbooks and Hybrid Workers to reach your on-prem resources. Okay. So there are no executables in there; it's all PowerShell, which means anyone in our audience can basically write a remediation package to fit their problem, and the framework can deliver it and resolve their issue as if it was written by somebody else. Correct, yep. The entire framework is designed to enable you, as the end user, to code your own remediation and your own alert condition.
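Returning to the maintenance-mode idea for a moment, here's a minimal sketch of an alert-time check against a JSON file in Blob Storage. The file schema, URL, and function name are illustrative assumptions, not the framework's actual format:

```powershell
# Hypothetical maintenance-mode check: a JSON file in Blob Storage lists
# servers and their maintenance windows; the framework consults it before
# acting on an alert. Schema and URL are illustrative assumptions.
$maintenanceUri = 'https://example.blob.core.windows.net/cmf/maintenance.json'
$entries = Invoke-RestMethod -Uri $maintenanceUri
# e.g. [ { "Computer": "STORE01-VW3", "Until": "2020-06-01T08:00:00Z" }, ... ]

function Test-InMaintenance {
    param([string]$Computer)
    $now = (Get-Date).ToUniversalTime()
    # In maintenance if a matching entry exists whose window hasn't expired.
    [bool]($entries | Where-Object {
        $_.Computer -eq $Computer -and [datetime]$_.Until -gt $now
    })
}

if (Test-InMaintenance -Computer 'STORE01-VW3') {
    Write-Output 'Server is in maintenance mode; suppressing the alert.'
}
```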
It really puts the emphasis on you, who should know your infrastructure best, to not only alert on but also address the things that go bump in the night in your ecosystem. Okay. And then finally, we wanted telemetry. We looked at what good telemetry systems were available out of the box, and we landed on Application Insights. Traditionally your Azure environments, Azure services, et cetera, write to Application Insights, and since this is hosted in Azure: okay, let's use Application Insights. Still with me? Yeah, yeah. All right. Ready for me to switch to the next one? Yep.

Okay. So I'm going to go over a quick demo. Unfortunately it's not an interactive demo; we recently closed our final stores. But I'm going to go over what we've done to help facilitate setting up the cloud monitoring framework, and then look at some of the solutions we had in place for things like a video wall outage. If you've never been in one of the stores, we literally had wall-to-wall monitors running wall-to-wall advertising. It was pretty awesome. So we're going to look at one of the challenges we faced: how do we identify, react to, and recover from a visible outage on a screen in the video wall system? Yeah, because I assume in that situation, if one of the video wall monitors actually showed a blue screen, it would be damaging to the brand and to the store itself. Which is very important, and for our audience this could be anything: a locked-up print server in a branch office, a SQL server that's down, anything of that sort, right? Absolutely, yep. And this framework obviously isn't just for on-prem; it really shines in on-prem-to-cloud hybrid solutions. We can do the same thing with a website hosted in Azure as a service, et cetera; we can scale up, scale down. But where it really helps is filling the gap for, "Hey, I've got an on-prem environment and I really want something a little more robust." It's a stepping stone from solely on-prem, to hybrid, to Azure: a single solution that bridges all those gaps. Okay, perfect.

So, as mentioned before, we provide an API through PowerShell cmdlets. One of the things we found really challenging in our migration from SCOM to Azure was automating and bringing over the select few of those hundreds of alerts that were meaningful and actionable. We discovered quickly that we needed an easier way to bring that stuff over. So we engineered a few cmdlets, and we're going to run through some of them to give you a feel for what we've done and how, on this journey from a legacy SCOM environment over to this new cloud monitoring framework. Okay. Are those cmdlets you created, or are they available right now in PowerShell? Right now they're part of our cloud monitoring framework solution. We're currently working on removing a couple of the internal, retail-specific code paths, and we'll be publishing this on GitHub soon. It will include these PowerShell modules and cmdlets that you see. Okay. So for everybody listening, myself included, keep an eye on itopstalk.com.
Whenever this goes public on GitHub, we'll make sure to announce it. Yep. So basically, this leverages the Azure public APIs to generate an action group inside the Azure alerting system. We look at things like your subscription because, initially, we wanted to make sure the cloud monitoring framework worked outside of an existing environment, ecosystem, or even subscription. You can isolate it to provide common functionality across all of your resource groups, your different subscriptions, et cetera. So we really focused on making this a solution that works even across multiple Azure subscriptions. What we want to do here is create an action group in a specific subscription to handle our video wall ecosystem. We use the subscription ID that houses the video wall solution, a resource group we call "cloudmonitoring," and then we create one action group per service. In this case, the video wall is a service, so we have a single action group for it: all alerts involving the video wall will ultimately call this single action group. Okay. So from our audience's perspective: let's say I'm an independent contractor looking after two or three customers that I've onboarded using something like Azure Lighthouse, so I have access to their subscriptions. I could set this up in each of those subscriptions, so each customer's stuff is monitored within their own subscription? Correct. Perfect.

Then we also have the remediation. This is the runbook that runs as a result of the alert occurring and, hopefully, resolves the issue. Again, for isolation, this remediation runbook could live in the video wall subscription or in another one; we could have a subscription just for on-prem resources, or a subscription per area of responsibility within the operations team. So again we say: here's the subscription, here's the resource group, here's the Azure Automation account, and finally the runbook we want. Then we go through and create the action group, and now we're identifying the remediation runbook to run whenever that alert fires. Keep in mind, these are cmdlets, so we're ultimately generating an assignment of these remediation settings; a few slides down you'll see where we put it all together. Okay.

Then we have the ticket details, should anything go wrong. Obviously we want to produce a troubleshooting ticket that says, "Hey, operations, go address this alert, because we couldn't solve it through automation; remediation didn't succeed." Here we define the environment where the alert is occurring; in this case, production. Then our ticket details: we have a component ID, which is essentially the resource or service name that the alert is generated for. And for those of you familiar with Azure alerting, you'll recognize #alertrulename. That's a wildcard inside Azure: the Azure alerting system replaces #alertrulename with the alert name you enter in the Azure alerting system.
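To ground the action-group step above: the framework's own cmdlets wrap the Azure public APIs, but the core operation can be approximated with the stock (classic) Az.Monitor cmdlets. A hedged sketch, with all names, IDs, and URIs hypothetical; one action group per service, whose webhook receiver points at the framework's Azure Function endpoint:

```powershell
# Approximation using the classic Az.Monitor cmdlets (all names and URIs
# here are hypothetical). One action group per service; its webhook
# receiver points at the cloud monitoring framework's Function endpoint.
Import-Module Az.Monitor

$receiver = New-AzActionGroupReceiver -Name 'cmf-webhook' `
    -WebhookReceiver `
    -ServiceUri 'https://cmf-func.azurewebsites.net/api/ProcessAlert'

# Set-AzActionGroup creates the action group, or updates it if it exists.
Set-AzActionGroup -Name 'ag-videowall' `
    -ResourceGroupName 'cloudmonitoring' `
    -ShortName 'videowall' `
    -Receiver $receiver
```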
Our framework also has the option to render the results from your Azure alert as an HTML table. Normally you'd use #searchresults, which gives you a list of all your results, but we've added extra logic inside the framework: if you use #searchresultstable, it puts the results in a nice HTML table that makes drive space figures and other data look reasonably friendly inside a ticket or an email. You can also see where we set the raising location as the ticket location. And we include the search results themselves; this is really key. Inside the ticket, we want to include the raw data returned from the Azure alert for debugging and other purposes. Again, the whole purpose is to either solve the issue through automation or give you enough root-cause information that you aren't guessing when you get the email. We try to include as much information as we can to help you quickly identify and address the issue. Okay, so the error code, or whatever was generated by the alert itself, would be included in the ticket. And when we talk about a ticket here, I know what we use internally, but this could be tied into any ticketing system? Pretty much. It's essentially a PowerShell runbook that gets called through Azure Automation. Currently we have it geared toward our internal ticketing system at Microsoft, but you can easily extend it to leverage any of the other major players out there. Essentially, if you have access to their APIs, or it's your own system, it's PowerShell: you can modify it and make it work. Okay.

Then finally, the alert creation. This is where we generate the alert in Azure for you. If you're familiar with Azure alerting, you'll know there's an option called custom JSON, where you add a payload to the alert; normally you'd send it off to a webhook, and in this case that webhook is our cloud monitoring framework. So we include the remediation settings (again, identifying which runbook to run and where), and the ticket details that were previously created. Then we create the Azure alert itself. We have a generic name here, "VideoWall issue." It points to the specific action group you created for the video wall system, and we say we're going to look at the old video wall OMS workspace for our alert data. Then we specify the workspace type; right now we support not only OMS Log Analytics but also Application Insights sources. And then we have the alert query. This is classic Azure alerting: we run a query, and when that query comes back with a result, we generate an alert. In this case, we're looking for anything with a render issue of true, and we return the computer name where that issue occurred. And then we include the custom JSON in that alert package.

Okay, so just so I'm clear: normally in Azure Monitor, you have access to a Log Analytics workspace, and you create alerts by saying, if this query returns results, or if this performance metric or counter (whatever your trigger is) crosses a threshold based on the information in Log Analytics, then raise the alert and call the action group to do something.
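Pulling those pieces together, here's a hedged sketch of what that alert creation might look like using the classic scheduled-query-rule cmdlets. The custom webhook payload's property names are guesses at the framework's schema, the custom-log table and columns are hypothetical, and #alertrulename / #searchresultstable are the tokens discussed above:

```powershell
# Sketch with the classic Az.Monitor scheduled-query-rule cmdlets.
# The custom webhook payload carries the framework-specific settings;
# its property names here are hypothetical.
$payload = @{
    RemediationSettings = @{
        SubscriptionId    = '<videowall-subscription-id>'
        ResourceGroupName = 'cloudmonitoring'
        AutomationAccount = 'cmf-automation'
        RunbookName       = 'Repair-VideoWall'
    }
    TicketDetails = @{
        Environment = 'Production'
        ComponentId = 'VideoWall'
        Summary     = '#alertrulename'       # replaced by Azure with the alert name
        Description = '#searchresultstable'  # framework renders results as an HTML table
    }
} | ConvertTo-Json -Depth 4

$source = New-AzScheduledQueryRuleSource `
    -Query 'VideoWallStatus_CL | where RenderIssue_b == true | project Computer_s' `
    -DataSourceId '/subscriptions/<sub>/resourceGroups/cloudmonitoring/providers/Microsoft.OperationalInsights/workspaces/videowall-oms' `
    -QueryType 'ResultCount'
$schedule = New-AzScheduledQueryRuleSchedule -FrequencyInMinutes 5 -TimeWindowInMinutes 5
$trigger  = New-AzScheduledQueryRuleTriggerCondition -ThresholdOperator 'GreaterThan' -Threshold 0
$aznsAction = New-AzScheduledQueryRuleAznsActionGroup `
    -ActionGroup @('/subscriptions/<sub>/resourceGroups/cloudmonitoring/providers/microsoft.insights/actionGroups/ag-videowall') `
    -CustomWebhookPayload $payload
$action = New-AzScheduledQueryRuleAlertingAction -AznsAction $aznsAction -Severity '2' -Trigger $trigger

New-AzScheduledQueryRule -ResourceGroupName 'cloudmonitoring' -Location 'westus2' `
    -Name 'VideoWall issue' -Enabled $true -Description 'Video wall render failure' `
    -Schedule $schedule -Source $source -Action $action
```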
So what you're doing here is basically replacing or creating that alert. Right: if it exists, we overwrite; otherwise we create. Normally you'd go into the portal and set all this up in a wizard, next-next-next-done fashion. What we've done is incorporate all of those Azure API calls into this cmdlet. Again, the focus is on onboarding to this system in an automated and, air quotes, "easier" way. So we can dump all of our legacy SCOM alerts, put them in a loop leveraging PowerShell, and rebuild those alerts inside Azure, and those alerts would then call the cloud monitoring framework. Okay. But if somebody already has a Log Analytics workspace, they've been collecting data from some servers for a while, and they've already got some alerts, this isn't going to stomp on those; it just appends to the alert list. Correct, yes. It doesn't destroy what's there; it recreates. For instance, if you already had an alert out there named "VideoWall issue," this would recreate that alert: essentially a destructive overwrite. The goal is that anything you manage with this is calling into the cloud monitoring framework, so if you've got other alerts that send emails, et cetera, you don't want to reuse the same alert name. That would be a bit more destructive; we would, unfortunately, overwrite your previous alert. Okay, good thing to know. So you want to make sure you have a proper nomenclature, a naming convention, for your alerts and queries; that's always a good idea, because you don't want to overwrite something you've already got. Yep.

You can also specify your time span window in here (30 minutes, 5 minutes, et cetera), how frequently the query runs, and what your result count threshold is. For instance, if you're familiar with Azure alert definitions, you enter your query and say: run this over the last 30 minutes, but run it every five minutes, and if I get more than one result, generate an alert. There's additional functionality in the add cloud monitor alert cmdlet to provide those details. By default, it looks back five minutes, every five minutes, and if it gets any results it considers that an alert condition. Okay. All right.

And then, again, unfortunately the retail stores are closed, but I wanted to walk you through what we did to address the video wall screen or rendering issues that we occasionally had in some of our older stores. A quick overview of the video wall service: we had wall-to-wall monitors, and every monitor had a video sourcing box that took two inputs, one from a static demo source and the other from our dynamic rendering farm. Essentially, think of it like an HDMI switch with a little intelligence: "Hey, I want to watch the DVD today," or "I want the computer source," and we could switch it from one to the other. Okay. And for every four screens, essentially, we had one physical machine behind them driving the image, and ultimately those machines were all onboarded into Log Analytics.
And obviously, at 60 frames a second, we had some near-real-time requirements; if any screen was out, it was extremely visible to customers. So we had to come up with a custom way to review these screens. We wrote a service that sits on the box that draws the image to the monitors. It looks at the frame buffer going out and asks: do I have content, do I have dead pixels, or am I showing a Windows blue screen? We wanted to be able to identify when one of these machines falls over. So the service looks, from each machine's perspective, at the entire video wall and breaks it down into the sections rendered by that machine. The graphic you see across the top is the entire video wall for one store, and the subsection below it is the three or four screens rendered by a single box. When we look at these at the machine level, it becomes much easier to say: is there a gap, or do we see a render failure, inside this display? In the case here, where we clearly see a gap, the service on that box says, "Hey, I'm on box X and I see a giant gap in my display," and it publishes that into Log Analytics. And that starts our alert generation process. So basically, in this particular scenario you've written a service to do this, but if it were SQL or IIS, you could just interrogate the logs or any other telemetry that the software package you want to monitor provides out of the box. Yep, absolutely. If you really wanted to, you could even have a service that interrogates the internal infrastructure components of the server; a lot of manufacturers have their own little management software built into the BIOS, et cetera. You can interrogate that for things like drive heat, failures, and so on, report it to Log Analytics, and act at that level too. Yeah, I remember years ago I did something similar when intelligent printers were just starting to emerge. We had a little program that would call the printer's admin portal and do nothing but refresh once a minute, and if it got a 404 or any other error from the printer's HTTP stack, it would alert us that the printer was down. Yep, absolutely. So our audience can adapt anything they have in terms of monitoring and figure out ways to identify whether their service is up or down. In your case, you're inspecting every frame that goes to that wall. Correct, yep. It's a rather unique problem: across 100-plus monitors and 25-plus servers, how do we determine which screen isn't working? It's a tough problem, and it required a bit of work with our software and hardware vendors to figure out, but the results are pretty cool.

So, a real quick look at the solution. We have the video wall issue here, and we're going to show how it flows through the framework. The issue gets picked up by the box and sent into Log Analytics; the Azure Monitor log search grabs it, says, "Hey, there's an alert," and sends it into our framework.
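Under the hood, publishing a custom record like that into Log Analytics from an on-prem service is typically done with the HTTP Data Collector API. A minimal sketch, assuming a hypothetical workspace and record shape; custom records land in a <LogType>_CL table, with type suffixes on the fields (e.g. RenderIssue becomes RenderIssue_b):

```powershell
# Hedged sketch: posting a custom record (e.g. a render-status event) to
# Log Analytics with the HTTP Data Collector API. Workspace ID/key and
# the record shape are illustrative assumptions.
$workspaceId = '<workspace-id>'
$sharedKey   = '<workspace-primary-key>'
$logType     = 'VideoWallStatus'
$body        = [Text.Encoding]::UTF8.GetBytes((@{
    Computer    = $env:COMPUTERNAME
    RenderIssue = $true
} | ConvertTo-Json))

# Sign the request with the workspace's shared key (HMAC-SHA256).
$date = [DateTime]::UtcNow.ToString('r')
$stringToSign = "POST`n$($body.Length)`napplication/json`nx-ms-date:$date`n/api/logs"
$hmac = New-Object System.Security.Cryptography.HMACSHA256
$hmac.Key = [Convert]::FromBase64String($sharedKey)
$signature = [Convert]::ToBase64String($hmac.ComputeHash([Text.Encoding]::UTF8.GetBytes($stringToSign)))

Invoke-RestMethod -Method Post `
    -Uri "https://$workspaceId.ods.opinsights.azure.com/api/logs?api-version=2016-04-01" `
    -ContentType 'application/json' `
    -Headers @{
        'Authorization' = "SharedKey ${workspaceId}:$signature"
        'Log-Type'      = $logType
        'x-ms-date'     = $date
    } `
    -Body $body
```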
From there, our framework says: you defined a remediation, so we run it. That remediation jumps on-prem, runs our runbook, which goes in and flips the video wall switch. Again, we had a setup where each screen could either take its display feed from the servers or switch to alternate content. The runbook switches to the alternate content, goes in and restarts the software on that box or reboots the box, and when the box comes back up and we detect it's successfully re-rendering, we switch the content back. So the solution was very specific to the video wall, but it represents a really good pattern: on-prem detection, logging it to the cloud, and using Azure Automation to come back and do something on-prem to resolve or recover from the event. Okay, so we're using something visible as our example here, but it's like when I'm presenting in front of an audience of a thousand: you hit the logo button on the switcher, display a nice logo on the screen while you're rebooting your box, and people never see that your machine just crashed. Yep, exactly. Again, we want to avoid any kind of visible brand tarnishment. If we couldn't keep the lights on and the video wall crashed all the time, that would just look bad. There are some things we simply can't solve for, but we want to get in front of this stuff sooner rather than later. Obviously we want to avoid customer-impacting events; the video wall is more of a branding and marketing impact. But prior to this, the video wall would go down and it'd be down for half the day before a store employee called to say, "Hey, FYI, your video wall has been down all day." They had a job to do, which was engaging with the customer; they weren't, and should never be, your ops on the ground. This solution gets ahead of that: we know within minutes if we have an issue, and we can recover and address it through automation faster than having an army of people physically monitoring it in the store. Yeah. So we really want to avoid the airport experience, where you walk up to the departure gate and one of the screens is a blue screen. Yes, yes. Not only because we're Microsoft, but obviously we want things to be as seamless for the customer as possible. We understand that stuff goes off the rails all the time; how do we address it and get ahead of it? This also buys time to put out the small fires and then go focus on fire prevention. Yeah, which is really important to understand: this whole automation effort is about giving people more time so they can do their real jobs and not just fight fires. Absolutely. The goal is to enable you to do more by removing a lot of the noise from your ecosystem through automation, so you can focus on things like redundancy. All of this is designed to buy you time so you can focus on fixing the root cause, not just recovering from it.
And it helps with recovery; a byproduct is, hey, if I don't have to go free up drive space every five minutes, now I've got an hour to go fix it the right way. Yeah, okay. And the remediation we keep talking about, this "issue a remediation," is just, again, a PowerShell script sitting in Azure Automation that you run against a target that's on-prem. Correct, yep.

So I'm going to quickly switch gears and show you a very high-level example of a remediation script. Obviously we do a lot more inside the real remediation script for the video wall, and some of it's proprietary, but I want to show you that what you essentially get with this framework is stubs that enable you to address your issue, get time back in your day, and focus on the real hard problems. What we have here (I'll zoom in) is essentially a PowerShell Azure Automation runbook that gets called by the framework, and the framework passes additional properties into it. Then we've got cmdlets that generate what we call a return object. It's essentially a JSON payload, but with specific properties we care about. Among them: the ability to write output back to the framework, which ultimately gets sent back into our telemetry, our ticket, anything you want; the ability to mark whether the remediation was successful or not, and whether to go ahead and generate a ticket; and, if your ticketing system supports it, we can manipulate the ticket summary and add other attributes, like whether this event was customer-impacting or a security risk. We can even pick up and attach log files. Here we generate a simple sample file that says "this is a sample file," and if we need to generate a ticket and your ticketing system supports it, we attach that test.txt file, or any other log file you collect during remediation, back to your ticketing system, so you have all the root-cause logs, et cetera. Say the only way out of a situation is to reboot the box: well, it'd be nice to grab some log files or a memory dump before you flatten the box and start it back up. This script lets you generate those, attach them, and return them to the framework, which ultimately puts them into your ticket. Or, hopefully, you resolve the issue and can just say, "Hey, it was successful," and the framework goes, "Okay, nothing more to do." So we've got these stubs, and what happens is you insert your own remediation code in here, do the video wall switch, free up the drive space, et cetera, and based on those results you call the other cmdlets to say "I was successful" or "I wasn't." That generates a JSON result that's sent back to the framework when the runbook finishes, and the framework looks at it and decides what to do going forward. In this case we're successful, so the framework says, "Great, nothing left to do." That's fantastic. Because we know from experience, and I think everybody listening has had this before: you get a call, somebody says, "My server's been hanging," and you say, "Okay, can you go check the log?" "Oh, I already rebooted it." Well, okay, so now I can't see what was going on on that server before you rebooted it; if it happens again, call me.
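Stepping back to the stub Eric described, here's a minimal sketch of what such a remediation runbook might look like. Everything prefixed Cmf- below is a hypothetical placeholder for the framework's actual return-object helper cmdlets, and the service name is invented:

```powershell
# Hedged sketch of a remediation runbook stub. The framework passes alert
# context in, your remediation runs, and a JSON "return object" goes back.
# All Cmf-* helper names are hypothetical placeholders.
param(
    [Parameter(Mandatory)]
    [object]$AlertContext   # properties handed in by the framework
)

$result = New-CmfReturnObject   # hypothetical helper

try {
    # --- your remediation logic goes here ---
    # e.g. switch the video wall to alternate content, restart the
    # rendering service, then switch back once rendering recovers.
    Restart-Service -Name 'VideoWallRenderer' -ErrorAction Stop   # invented service name

    Write-CmfOutput -ReturnObject $result -Message 'Renderer restarted.'
    Set-CmfSuccess  -ReturnObject $result -Success $true
}
catch {
    # Remediation failed: collect evidence and ask the framework for a ticket.
    Add-CmfAttachment    -ReturnObject $result -Path 'C:\Logs\renderer.log'
    Set-CmfTicketSummary -ReturnObject $result -Summary "Restart failed: $_"
    Set-CmfSuccess       -ReturnObject $result -Success $false
}

# Emitting the return object hands control back to the framework, which
# decides whether to close out or raise a ticket.
$result | ConvertTo-Json -Depth 4
```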
Yeah, exactly. And so, putting on your problem management hat: this helps with root cause analysis. The more data you can get before you reset the service or reboot the box, the greater the value in helping you identify the true root cause and then getting a handle on it and solving it. In the meantime, you've got this framework to help you recover quickly but also grab that content. That's great.

Great, so, to move on. What we have here now: we flip the switch, and you can see the video content is now good. Once we flip that content, we say everything's good; the runbook result comes back to the framework; the framework says, "You reported success, I have nothing else to do," writes it all out to Application Insights, and at that point the framework steps away; everything's done. Should the runbook fail, so success equals false, there's another path here where we'd go back and generate a ticket with all that information available in the ticketing system. Well, this is really interesting. We keep talking about "the framework," but it's really leveraging all the pieces from Azure (Log Analytics, Monitor, alerts, runbooks, Automation, Hybrid Workers) to make sure our on-prem environment stays healthy by leveraging all of these cloud services. Setting the PowerShell framework aside, this could basically be replicated by anyone in our audience, even without the framework you're going to publish; they'd just have to write their own scripts to do it all. Yep, and again, it's not rocket science. For people new to Azure it may feel like it, but in reality the framework is there to help you through some of these challenges; it can be challenging to connect all these dots. We're getting better, but fundamentally this is just leveraging Azure monitoring, Azure alerting, and hybrid connections for those of us who still have on-prem resources. Which is probably about 80 percent of our audience at this point. Yeah, exactly, and the framework just gives you a few cmdlets here and there to help you connect those dots in a more seamless, automated way. Great.

In the case where we do fail, here's an example. Again, everything the framework does, it writes out as telemetry to Application Insights. In this case, one of our solutions failed, and it wrote out a piece of telemetry that actually generated a ticket in our ticketing system for the video wall service. Internally, Microsoft uses its own ticketing system, but this is an example of the type of telemetry you can expect from the framework. Yeah, and that could be fed into anything from ServiceNow to Remedy, or any of the commercially available ticketing systems. Yeah, absolutely. We've got some experience going to ServiceNow here internally. Okay, great. So, actually, perfect timing; I was just going to ask: have you thought about applying this type of framework to other scenarios? Yes, actually. We've got additional uses and POCs that we've done.
A POC we set up early on used a smart outlet, the kind you can get from a lot of vendors, that responds to commands over the internal network. We had IP-based cash drawers and receipt printers, and occasionally they'd stop responding. So as a POC, we leveraged this framework to have the remediation runbook run on-premises, tell that smart outlet to turn off and back on, wait 30 seconds, and then try to ping the cash drawer and receipt printer. So instead of a store employee saying, "I can't open the cash drawer to give this customer change," or "I can't print a receipt, I have to call the helpdesk," the framework would detect the problem and automatically power these devices off and on. It reduced both the likelihood and the duration of customer impact. Ultimately we didn't go forward with that solution because of the shift toward tap-to-pay and the like; everybody uses a card and wants an email receipt, so cash drawer and receipt printer usage just phased out over time. By the end it was extremely rare.

We also leverage it for what's left of our on-prem SQL cleanup: tempdb, et cetera. And we still use it for drive space; we like to generate logs, so drive space cleanup is a huge thing. In the cases where you can't solve the issue, you can leverage it to essentially automate collecting root-cause data, and then restart the service or reboot the box or whatever you need to do. Again, the root-cause data that helps you find the issue is really important, and that's one of the big wins with this framework.

And then obviously we looked at infrastructure protection and failover. We've had multiple incidents where you can't pick the environment your servers live in; you end up in a broom closet, a water closet, et cetera, and there are ultimately multiple things that can fail outside your control. One thing we had was a water leak in a server room. So one of the things we can do with this framework is put water and humidity sensors in that room. If an Azure IoT device with a water sensor detects humidity, or an inch or more of water on the floor, it sends a signal into Log Analytics. That alert gets picked up and goes into the self-healing framework, and the framework comes back on-prem and does two things: it tells all the machines in the rack to power down, and then it tells the UPS to shut down. So we kill all the machines before we have an electrical issue, and we also shut off the UPS. That gives us separation between the main power and the rack power, and all the machines are safely shut down before significant heat or water damage, or whatever it may be, occurs in that infrastructure. And then, as you migrate and move more toward the Azure side of things, you can also spin up additional resources, leveraging ASR (Azure Site Recovery) and the like, to clone your infrastructure should you have a failure somewhere. So basically automating the failover from on-prem to a cloud instance while you figure out what happened to that broom closet. Exactly, yep. Oh, that's great.
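As a flavor of that smart-outlet POC, here's a hedged sketch of the on-prem remediation logic, run via a Hybrid Worker. The outlet's REST endpoints and all addresses are invented, not any specific vendor's API:

```powershell
# Hypothetical smart-outlet power-cycle remediation, run on-prem via a
# Hybrid Worker. The outlet's REST API and all addresses are illustrative
# assumptions, not a specific vendor's interface.
$outlet  = 'http://10.0.10.42'              # smart outlet feeding the devices
$devices = @('10.0.10.50', '10.0.10.51')    # cash drawer, receipt printer

Invoke-RestMethod -Method Post -Uri "$outlet/relay/off"   # cut power
Start-Sleep -Seconds 5
Invoke-RestMethod -Method Post -Uri "$outlet/relay/on"    # restore power
Start-Sleep -Seconds 30                                   # let devices boot

# Verify the devices came back before declaring success.
$allUp = $true
foreach ($ip in $devices) {
    if (-not (Test-Connection -ComputerName $ip -Count 2 -Quiet)) {
        Write-Output "Device $ip still unreachable after power cycle."
        $allUp = $false
    }
}
Write-Output ('Power-cycle remediation ' + $(if ($allUp) { 'succeeded' } else { 'failed' }))
```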
I definitely see a lot of potential scenarios where our audience can use that whole architecture, and also the framework once you release it. Just the way you explain all of these pieces and how they talk to each other is important for our audience, because this takes care of a lot of day-to-day headaches, like the Monday morning where you go through your inbox and find the 100 alerts from over the weekend that you now have to address because whoever was on call didn't get to them. Yep, absolutely. All right, perfect. So where can we get more information?

Here's a collection of resources we've leveraged for the framework. It's not an exhaustive list, but it covers the high-level pieces. Whether you use the framework or just want to get started with something homegrown, these are essentially the technologies we've leveraged: Azure alerting, Action Groups, Azure Automation (which includes the PowerShell runbooks), and Hybrid Workers, which let you connect to your on-prem resources and run that runbook internally. And if you're looking at load balancing or diversification across Azure regions, I highly recommend looking at Azure Functions; you can put a friendly name and a load balancer in front of them and literally redirect traffic across multiple Azure Automation endpoints. Okay, so what's the difference between Azure Automation and Azure Functions in your scenario? Azure Automation actually runs the runbook against the resources, and the Azure Function does what, again? The Azure Function handles our decision trees. If you remember back in the cmdlet examples, we defined a remediation location and we defined ticket properties. The Azure Function, in our case, takes the Azure alert that's sent to it with that JSON payload containing the remediation location and ticket information, and decides: if a remediation is specified, I'll leverage Azure Automation to run it; if that fails, or no remediation is specified, I'll use the ticket properties to call your ticketing solution. Okay. I just wanted to be clear, because you can run PowerShell scripts on both of them, but in this architecture they serve two very different functions. Yeah, and the Azure Function is also written in PowerShell, and the cmdlets talk directly to the Azure Function. So in the case where you have multiple implementations of the framework across Azure regions, the framework can run in both regions and interact with the same common logs, Azure Automation locations, et cetera. You get complete duplication and resiliency across the board, and the Azure Functions can live in those regions as well. You've got a Traffic Manager between them that says, "This region went down, we're going to direct everyone to this other region," so you have duplication; and when the other region comes back up, it can round-robin. That's great. That's really interesting, and it opens up a lot of possibilities for anybody who has to make sure the lights stay on in a hybrid environment. Yep, absolutely. Okay.
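To make that division of labor concrete, here's a hedged sketch of the decision tree inside a PowerShell-based Azure Function. The payload property names and the New-Ticket helper are hypothetical, while Start-AzAutomationRunbook is the stock Az.Automation cmdlet:

```powershell
# Hedged sketch of an HTTP-triggered PowerShell Azure Function (run.ps1)
# acting as the decision tree. Payload property names and New-Ticket are
# hypothetical; Start-AzAutomationRunbook is stock Az.Automation.
param($Request, $TriggerMetadata)

$payload = $Request.Body   # the alert's custom JSON payload

if ($payload.RemediationSettings) {
    # A remediation is specified: hand off to Azure Automation.
    Start-AzAutomationRunbook `
        -ResourceGroupName     $payload.RemediationSettings.ResourceGroupName `
        -AutomationAccountName $payload.RemediationSettings.AutomationAccount `
        -Name                  $payload.RemediationSettings.RunbookName `
        -Parameters @{ AlertContext = $payload }
}
else {
    # No remediation defined: go straight to the ticketing path.
    New-Ticket -Details $payload.TicketDetails   # hypothetical ticketing helper
}
```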
I think we're at the end of our session. For anybody watching right now: if you have any questions or would like more information, follow the link below. There you'll find a Discord chat group where we'll discuss this session; if you have any questions, post them there and I'll make sure that either I, or Eric, or someone answers them for you. So Eric, thank you very much for spending the time with us. This was very informative, and it opened up a lot of potential solutions in my head in terms of where things like this can be applied. So thank you very much for your time. Thank you for having me. All right. And again, be part of the conversation at the link below, and check out all of the other content we have for IT Ops Talk: All Things Hybrid. Thank you very much. Bye.