Hey, everyone. Welcome to another episode of Azure Unblogged. Today, we are talking with Aditya Balaji. He's the PM for Azure Backup, and he's going to talk to us about some improvements in alerting when you're doing backup. If something fails, something goes wrong, how do you get your team to know about it so that you can act on it? Stay tuned. Hey, Aditya, how are you? Hi, Pierre. I'm doing well. How are you? I'm great. I'm great. We've got some news, some improvements with Azure Backup. Can you tell us a bit more about it? Yes, definitely. Azure Backup has a bunch of monitoring features. As you might know, we have things like Backup Center, we have backup jobs, we have reports and things like that. One thing that we have always had is alerting, where we used to send email notifications when backups failed or when there were certain security scenarios that we felt users might want to be made aware of. But what we have also started doing is integrating more natively with Azure Monitor so that we can provide all the goodness that Azure Monitor offers to our end customers. With these Azure Monitor-based integrations, customers have much more flexibility and a wide range of choices in terms of notifications, and they can take more proactive action on their alerts to achieve much greater operational efficiency. So if we're looking at integration with Azure Monitor, does that mean that I could write my own Kusto queries in Azure Monitor? To query, let's say, if the backup of my production environment fails, send me an alert, but if the backup of my development environment fails, based on resource groups or tags, then alert a different group? Is that something we can do? Yes, that's a great question.
So what you're saying is very much possible. Just to take a step back, there are many different ways in which we integrate with Azure Monitor today, because there are different kinds of requirements that might come up in the context of alerting. One of the ways we integrate is called built-in Azure Monitor alerts for backup, where for certain default scenarios that all users believe a backup product should fundamentally have, we provide a seamless way for customers to get alerts without needing to do too much additional configuration. So probably what I can do is show you my screen to give a better idea of what I'm talking about. Okay, let's bring it in. Great. So if you can see, we have the Backup Center here, and what you'll also see now is a tile for active alerts. This actually refers to the alerts which are coming from the Azure Monitor-based systems. So we have two things here. One is Azure Virtual Machines and one is called Global Alerts. To put it in simple terms, Global Alerts refers to those alerts which don't have an association with a particular backup item; let's say it's at a vault level. For example, let's say somebody had disabled the soft delete functionality for an entire vault, because of which any backup item under that vault which gets deleted, either accidentally or maliciously, will no longer go through the 14-day soft delete period. So Global Alerts are things which are associated with the vault as a whole and not an individual data source. But on the other hand, there could also be alerts on an individual data source. Let's say somebody has explicitly gone and deleted backup data for a virtual machine, or let's say, like you mentioned, a backup or a restore for the virtual machine has actually failed. Those come up as alerts under Azure Virtual Machine Backup. And as you can see here, there are two types: one is security alerts and one is configured alerts.
So security alerts are alerts that Azure Backup sends by default and surfaces via Azure Monitor and Backup Center. These are alerts that all users receive irrespective of their configuration. Why this is important is that there could be certain scenarios where, due to some unfortunate incident in an organization, a malicious admin wants to disable security features so that the right people are not made aware. The benefit of surfacing these security alerts by default, irrespective of the user's configuration, is that these alerts can't be subverted by a rogue admin. And you might ask the question: if the alerts are there by default for every user, won't there be a lot of noise with respect to the large number of emails generated? Sorry. I was just gonna ask, yes. Okay. That's great. Yes. So in terms of that, one important distinction to make is alert versus notification. An alert is something that we generate, and it is shown on the portal without needing any additional configuration. The notification is something that is still in the hands of the user. So let's say certain subscriptions are test subscriptions, and users need not actually get notified if something is deleted; they can choose not to enable notifications for vaults in that subscription. But let's say there are certain production subscriptions where they do want notifications for every failure. What the customer can do is create a notification, which in Azure Monitor terminology is called an alert processing rule. So let me also walk through the process of creating an alert processing rule. If I switch my tab in the browser, essentially this is a vault which has a couple of backup items.
The way these are set up today is that we have one VM which belongs to an application called Web App One, and a second VM which belongs to a second application called Web App Two. The application associated with both of these VMs is identified with a VM tag. The scenario we're looking at here is that whenever an alert is generated on any of these backup items, it's very possible that the different applications are owned by different people. Let's say a critical alert on the virtual machine in Web App One needs to be resolved by one set of people, whereas a critical alert on the virtual machine in Web App Two needs to be resolved by a different set of people. So this is an example of a sample logic app which the organization has created to decide where to route these alerts. I'll walk through this logic app, and then I'll talk about how we can hook the alert to this logic app so that you can get the notification to the right channel. Essentially, the logic app is configured to receive any HTTP request with the schema of the alerts. And since we are using Azure Monitor-based alerts, we leverage the standard common alert schema which Azure Monitor supports for all resources. From that point of view, customers don't need to learn any new schema just to be able to parse and manage backup alerts. So the logic app gets the details of the alert; it identifies things like what the affected resource was, and what the subscription and resource group are; it gets all those things from the alert body. And then what it does here is check the type of the item: is it a virtual machine or is it a database?
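To make that parsing step concrete, here's a minimal sketch in Python (purely illustrative; the demo's Logic App does this with its built-in parse and compose actions) of pulling the subscription, resource group, and resource name out of an affected resource ID found in the alert body. The sample resource ID below is hypothetical.

```python
def parse_resource_id(resource_id: str) -> dict:
    """Split an Azure resource ID into its components.

    IDs follow the pattern:
    /subscriptions/<sub>/resourceGroups/<rg>/providers/<namespace>/<type>/<name>
    """
    parts = resource_id.strip("/").split("/")
    return {
        "subscription": parts[1],
        "resource_group": parts[3],
        "provider": parts[5],
        "resource_type": parts[6],
        "name": parts[7],
    }

# Hypothetical affected-resource ID, as it would appear in an alert payload.
vm_id = ("/subscriptions/00000000-0000-0000-0000-000000000000"
         "/resourceGroups/demo-rg/providers/Microsoft.Compute"
         "/virtualMachines/webapp1-vm")
print(parse_resource_id(vm_id)["resource_group"])  # demo-rg
```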
Because the user has a scenario where alerts on virtual machines need to be routed to the infrastructure team, whereas alerts on databases need to be routed to the database team. And even within infrastructure alerts, the VMs could belong to different applications. So what this part of the branch does is check the tag of the affected VM, and based on that tag, it chooses to route the alert either to the first team's channel, which is for application one, or to the second team's channel, which is used by members of application two. So this is essentially the logic that the organization wants to achieve. And now we look at how you can actually route your alerts to this logic app so that the alert reaches the right notification channel. If I go back to the tab where I had the Recovery Services vault, I'll go to the alerts tab, and this shows me the list of all the alerts that were generated. And what I'll do is go create an alert processing rule. In very simple terms, there are different concepts which you might see here: one is the alert rule, one is the action group, and one is the alert processing rule. An alert rule is essentially a rule which defines under what conditions an alert should be fired. In the case of these built-in Azure Monitor alerts, since we are generating these alerts by default, customers don't need to create an explicit alert rule for the creation of these alerts. Alert rules would apply in other kinds of scenarios like metric-based alerts or log-based alerts, which we'll probably come to a little later. But for these default Azure Monitor alerts, we don't need to create an alert rule, which makes it much simpler. So the only thing that customers need to do is create a notification. And for the notification, there are two concepts.
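The branching logic just described can be sketched like this, again in Python for illustration. The tag name `application` and the channel names are hypothetical; a real Logic App would express the same decisions with its condition and switch actions against the common alert schema payload, whose `alertTargetIDs` field carries the affected resource IDs.

```python
def route_alert(alert: dict, vm_tags: dict) -> str:
    """Return the notification channel an alert should be routed to."""
    target = alert["data"]["essentials"]["alertTargetIDs"][0].lower()
    if "/providers/microsoft.compute/virtualmachines/" not in target:
        return "database-team-channel"   # non-VM workloads go to the DB team
    # Infrastructure alerts: route by the affected VM's application tag.
    app = vm_tags.get("application")
    if app == "WebApp1":
        return "webapp1-team-channel"
    if app == "WebApp2":
        return "webapp2-team-channel"
    return "default-ops-channel"         # fallback for untagged VMs

# Truncated, hypothetical common-alert-schema payload for a VM alert.
sample = {"data": {"essentials": {"alertTargetIDs": [
    "/subscriptions/xxx/resourceGroups/demo-rg/providers"
    "/Microsoft.Compute/virtualMachines/webapp1-vm"]}}}
print(route_alert(sample, {"application": "WebApp1"}))  # webapp1-team-channel
```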
One is called the action group, which is the actual notification channel. It could be an email address, it could be an SMS number, it could be an ITSM channel. And in this case, it'll be a logic app, because what we're doing is first sending the alert to the logic app to do some business logic and then sending it to the end channel, which is the Teams notification channel. So that is essentially what an action group is, and an alert processing rule basically specifies which alerts should be routed to which notification channels under which conditions. For example, I can create an alert processing rule such that all alerts with severity zero are sent to this particular email address or this particular logic app. So that is essentially what I will first create. If I go to alert processing rules, I can click on create, then I can select a scope. In this case, just to keep it simple, I will scope it to the vault that I'm looking to monitor, but in reality, you can scope an alert processing rule to span all alerts within a subscription. For now, just to keep it simple, I'll do it for a single vault. Then comes the rule setting. Alert processing rules have many different capabilities. In our particular scenario, what we want to do is make sure that the alert gets routed to a notification channel, or an action group. So this is what I will select here. And now I can either choose to add an existing action group which I might have already created, or create an action group inline right from this experience. So what I'll need to do here is define the action group to point to that logic app which we just saw. Yeah, so action groups are reusable. If you've already created one for application A, you can basically just reuse it for application B. Exactly, that's a good point.
So if we already have an action group, we can use that same action group in the alert processing rule without needing to create a separate one for each application or resource. Yeah. So what I'll do here is select the action as logic app, and it shows me the list of all logic apps already in the scope, and this is the one I want. And then, yeah, in our case, we do use the common alert schema, so you can optionally set this to yes, though it doesn't make a difference in this particular scenario. I click on OK and give it a name. And yeah, my organization requires me to add tags for every resource, otherwise the resource creation fails. So just like any other resource, I can also assign tags to the action group. And then I go to review and create. I can choose to even test the action group before I actually go ahead and create it. That's another cool feature which Azure Monitor has added recently. And then I create the action group. So while this is creating, this is pretty much exactly the same as if you go directly into Azure Monitor to create the alert in there and the notifications in there. Exactly. What we are looking to do is essentially leverage standard Azure-based services for different management-at-scale requirements, whether it be governance, monitoring, or things like reporting. And since Azure Monitor is the standard solution for managing all Azure services, which includes Backup, using Azure Monitor-based alerts helps users have a consistent experience for managing all their alerts. Yeah, so it doesn't matter if you start the process from the Azure Backup management pane or the Backup Center management pane; when you click on alerts and keep going, you're still getting the same experience as if you were in Azure Monitor creating that alert and that processing rule. Exactly. Yes, that's a good point.
And one thing which we also do, just to help the customers of Backup, is that in Backup Center we try to provide a more contextual experience, so that customers can filter alerts not just by the top-level properties but also by some of the backup-specific properties, like the workload type being backed up, the vault, and things like that. So customers have a more contextual experience in managing the backup alerts. But having said that, what you mentioned about this being on par with actually going to Azure Monitor and configuring the same alert is spot on. Okay, perfect. So once I create this action group, I go ahead to the next steps of actually creating an alert processing rule. So yeah, in this case, I want the alert processing rule to apply all the time, whenever a new alert is generated. And here I basically give a name and a description for the alert processing rule. And again, I will need to tag these resources so that the policy doesn't deny the creation of this resource. So I will go ahead and do that. Tagging is very important for operations purposes, so I'm glad that you're doing it, even if it's in a demo environment. I'm a big fan of that. Yes, I agree. All right, so we create the rule. So now it has everything and it's created, okay. Yes. The alert processing rule has now been created, and essentially this vault has been hooked to this logic app. So now what I'll do is try to simulate an alert and show how that actually gets routed to the required notification channel. For simplicity, what I'll do is trigger a security alert, which in this case will be the deletion of backup data for one of the VMs. The deletion in the real world could either be accidental or malicious, but it is a critical scenario that needs to be alerted on.
Having said that, these notifications will also apply for any backup or restore failure that is generated for this vault, because we also surface backup and restore failure alerts as default Azure Monitor alerts. And just one nuance before I go into that: for the job failure alerts, it is possible that some users may not want an alert for every single job failure, and they might be using other things like custom queries to generate alerts for those failures. So that users don't get duplicate alerts for these job failures, we provide an opt-in, where customers need to register a flag if they want to get alerted for these job failure scenarios via the default Azure Monitor alerts as well. So that is something the customer can control, which is why you might have seen that the job failure alerts show up under configured alerts and not under security alerts: those can be turned on or off by the user, but security alerts are something which can't be turned off. Only the notification is in the hands of the end user. So what I'll do here is go and trigger a delete backup data operation. It might be that there is a script in the organization that is unintentionally deleting a production resource due to some issue. Here, I'll just try to simulate that in the portal to show an accidental deletion scenario. Let me go on and click on stop backup, and this operation will run. And once it runs, we should see an alert. So while we're waiting for the alert, as you mentioned, your logic app will decide where the alert gets routed. But what if somebody doesn't want that? With the action group, do you still have the capability of sending the alert, let's say, to a very specific email address, or to a Teams channel, or to anything, on top of being processed by your logic app? Yes, that's a great question. So what you're saying also does work. Let me go back to the action group options which we saw earlier.
So just give me a second. Yes. So what you can also see is that when you create an action group, we have options to send an alert directly to an email address. You can either email it to an Azure Resource Manager role (for example, if you want to email the subscription owner or the subscription admin of the subscription where the alert is generated, you can choose the option of emailing an Azure Resource Manager role), or, let's say you want to email it to a specific email address, like my email address or your email address, you can choose the option which allows you to specify an email address that you want to send it to. And apart from the email options, we also have many other kinds of actions available. What we saw earlier was the logic app, but you can also directly send it to an ITSM channel, like ServiceNow, or if you want to run some scripts before actually sending the alert elsewhere, you can send it to a runbook or a function, and similarly to other custom endpoints like a webhook. So in that way, action groups offer a lot of flexibility in terms of which notification channel you want to send alerts to. And yeah, Azure Backup onboarding to that helps us leverage all the goodness of these action groups. Okay, so you could have multiple actions that run in parallel: you'd have your alert processing rule that goes through your business logic and alerts the right people, but you could also have an action that runs a runbook with a PowerShell script in it, to basically restart the backup if you've stopped it, for example. Yes, exactly. All right. Okay. So did we get our alert yet? Yeah, it looks like this operation has completed, and now let me just check if I'm able to see it in the alerts. So as you see, there is now an additional alert, and hopefully this should also show up in the Teams channel.
So yes, since I had deleted the data of the VM belonging to Web App One, you can see the alert is getting generated in the Teams channel which is for Web App One. And this has all the data related to the alert, like the description of the alert, the recommended action, the affected VMs, the affected vaults, and things like that. So in this way, the logic app actually helped us decide what to do with the alert, how to identify which application it's attached to, and, depending on that, send it to the required notification channel so that the right people can look at it and the alert gets traction. That's very cool. This is really something I had not thought of before: using a logic app to basically parse through the body of the alert and make decisions, based on which machine threw the alert and where the failure occurred, in order to route it to the right people, and then send it to a Teams channel versus putting something in somebody's email that might be missed if that person's not at the office, for example. I hadn't really put those two things together before. I would have just done a whole bunch of Kusto queries in Azure Monitor and created it there. But I find that your way of using a logic app is very streamlined and also very easy to manage, because of the graphical nature of logic apps. Plus it does allow you to plug into multiple connectors, whether it's Teams or other channels, or whether you're running a Slack channel or any of those communication methods. So that's really cool. Let's say you have a company that has a lot of different subscriptions and tenants, and they have the same backup procedures across all of them because of corporate standards. Is there a way, as part of your infrastructure as code, to programmatically deploy those types of alerts and those types of notifications? Yes, definitely.
So again, you can use the existing programmatic interfaces which Azure Monitor supports. We have a lot of detailed documentation on how to create an alert processing rule and an action group using PowerShell, the CLI, and other interfaces. So essentially all of these things, in terms of rule creation and notification channel creation, can be done via the different programmatic interfaces that Azure supports. Okay. You showed us earlier the configured alerts versus the security alerts. In the Azure portal, is there a way to see all of the alerts, regardless of where they were generated or what type of alerts they are, in one kind of listing? Yes. So again, if we go back to the earlier classification that we talked about, right? We have these built-in Azure Monitor alerts which are fired by default. Then we have the metric-based alerts. And customers can always write custom alerts on the data in Log Analytics if they have configured the vaults to send data to a Log Analytics workspace. So today, in Backup Center, we surface information on the built-in Azure Monitor alerts and the metric alerts. Both of these alerts can be viewed from within a single pane of glass. Custom log alerts based on the Log Analytics data are something that we haven't integrated into Backup Center as yet, but that is definitely on our roadmap. If users have written alerts on their data in Log Analytics, yes, they will continue to work, and users can manage them by going to a Log Analytics workspace or via Azure Monitor. Actually integrating with Backup Center to be able to see custom log alerts in a single place is something that we're actively working on, but the current state is that we show the built-in Azure Monitor alerts and the metric-based alerts.
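As a rough illustration of what that infrastructure-as-code path looks like, the sketch below assembles an approximation of the request body for an alert processing rule (a Microsoft.AlertsManagement/actionRules resource) that routes every alert in a vault's scope to one action group. The resource IDs are hypothetical, and the exact field names and shape should be treated as an assumption; check the current API reference before deploying.

```python
def alert_processing_rule_body(scope_id: str, action_group_id: str,
                               description: str) -> dict:
    """Approximate ARM body for an alert processing rule that adds one
    action group to every alert fired within the given scope."""
    return {
        "location": "Global",
        "properties": {
            "scopes": [scope_id],        # a vault, resource group, or subscription
            "enabled": True,
            "description": description,
            "actions": [{
                "actionType": "AddActionGroups",
                "actionGroupIds": [action_group_id],
            }],
        },
    }

# Hypothetical vault and action group IDs.
body = alert_processing_rule_body(
    "/subscriptions/xxx/resourceGroups/demo-rg/providers"
    "/Microsoft.RecoveryServices/vaults/demo-vault",
    "/subscriptions/xxx/resourceGroups/demo-rg/providers"
    "/microsoft.insights/actionGroups/demo-ag",
    "Route all backup alerts to the demo action group")
print(body["properties"]["actions"][0]["actionType"])  # AddActionGroups
```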
So all of the metric-based alerts also come under configured alerts, because those are things that the user configures by explicitly creating an alert rule. And in the alerts pane, users can select the signal type as either metric or log. Log in this case refers to the built-in Azure Monitor alerts, because those alerts are more verbose and resemble logs; but if customers want to see metric-based alerts, they can filter for metric alerts and see all the metric alerts, based on their alert rules, that were fired in the last 24 hours and up to the last 30 days, based on the current retention. And I think this is also a good segue to talk about this concept of metric alerts, since we have so far been talking about the built-in Azure Monitor alerts. Where metric alerts come in is, let's say users have additional scenarios for which they want to generate alerts. For example, in the built-in Azure Monitor alerts, I mentioned that we send alerts for security scenarios and failure scenarios, but very often there could be users who also want to get informational alerts if the backups have succeeded, just to make sure that the backups are healthy and the right people are informed about it. Or let's say they may not want to get alerted on every single backup failure, but might want to get alerted only if there were two or three consecutive backup failures, so that only the most critical ones which affect their RPO are actually alerted on. So metric alerts give customers more flexibility in terms of which backup health scenarios they want to get alerted on. How this works is that Azure Backup surfaces a set of default metrics for people to use, and alerts are not generated by default here. Customers can choose to create alerts based on the metrics and thresholds they are interested in.
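The "only alert after N consecutive failures" idea can be illustrated with a toy evaluation function. The real service evaluates your metric threshold server-side, so this is just the reasoning, not the implementation:

```python
def should_alert(recent_results, consecutive_required=2):
    """Decide whether to alert, given job outcomes oldest-first
    ('Succeeded'/'Failed'), by counting the trailing failure streak."""
    streak = 0
    for result in reversed(recent_results):
        if result != "Failed":
            break
        streak += 1
    return streak >= consecutive_required

print(should_alert(["Succeeded", "Failed", "Failed"]))  # True
print(should_alert(["Failed", "Succeeded", "Failed"]))  # False
```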
So if I go to Backup Center and click on metrics, I can go select a scope. In this case, I'll select it at a single vault level, just for simplicity. But in the real world, users can also see metrics for multiple vaults within the same subscription and region, which is the maximum scope that we support today. As you can see, there are two metrics we have now. We do plan to add more in the future, but right now there are two health metrics: backup health events and restore health events. The way to think about this is that whenever a backup job completes, a backup health event is emitted. And depending on whether the job succeeded or failed, the dimensions associated with that metric change. So I'll walk through that. By default, the metric is at the level of a vault, which is the top-level ARM resource supported by Azure Backup. And by default, when the user selects backup health events, the chart here essentially shows a count of all the backup jobs that executed for this vault in the selected time range. Now let's say I want to find out how many of these were successful jobs. What I'll do is add a filter, which basically helps me filter on the health status, and I can choose a value of healthy. Essentially, whenever a backup job succeeds, a backup health event is emitted with the status healthy. So if I want to see a trend of all the successful backups that happened in the last 24 hours, for example, I'll choose backup health events and filter for health status healthy. And suppose I want to look at the jobs which didn't succeed. You can see multiple different states here, and I'll talk a bit through this. We have something called degraded and something called unhealthy.
Essentially, unhealthy refers to the scenario where there is some issue with the Azure Backup service, and the customer's vault is unhealthy because of a service issue which they really can't fix themselves. Job failures which are associated with such scenarios, where the Azure Backup service itself is not healthy, get the health status unhealthy. But let's say the job is failing because of some user error, and a small fix on the user's end can actually make sure the backup succeeds. Those backup failures are emitted with the status degraded. So this is the distinction between unhealthy and degraded: degraded covers things which the users themselves can fix by making sure the configuration is correct, and unhealthy covers scenarios where, hopefully not very frequently, there might be an outage or maintenance and an issue with the Azure Backup service itself that the user can't fix manually; those get emitted with the status unhealthy. And one more useful thing here is the concept of transient and persistent. Let's say a user wants to track items which have long-running job failures. This field of persistent versus transient will actually help them there. Essentially, whenever a job fails for the first time, and let's say it is due to a service error, the health event will be emitted with the dimension transient unhealthy. But let's say there is another failure for the same backup item in the same vault. Then the next time the health event gets emitted, it'll be emitted with the status persistent unhealthy, because there is now persistently an issue which has affected multiple backups. So essentially that's the distinction between transient and persistent.
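The dimension values just described can be summarized in a tiny illustrative sketch; the exact strings the service emits may be formatted differently, so treat these as approximations:

```python
def health_status(succeeded: bool, user_fixable: bool,
                  prior_failure: bool) -> str:
    """Map a job outcome to the health-status dimension described above."""
    if succeeded:
        return "Healthy"
    kind = "Degraded" if user_fixable else "Unhealthy"     # who can fix it
    span = "Persistent" if prior_failure else "Transient"  # repeated failure?
    return f"{span} {kind}"

# A second service-side failure for the same item in the same vault:
print(health_status(False, user_fixable=False, prior_failure=True))
# Persistent Unhealthy
```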
And users can use that to distinguish items which had transient failures from items which are having persistent, long-running failures. So this is on the health status. One more important thing is that all of this so far is at the level of a Recovery Services vault. But there are many users who have hundreds of backup items per vault, and they actually want to know which backup items within the vault were unhealthy and which backup items caused the value of the metric to rise or dip. So users can make use of other dimensions as well. We have something called the backup instance name, where the user can say, I want to see the count of health events for this particular file share which was backed up. Now they'll actually be able to see the metric at the level of a backup item associated with the vault, so that they can drill down into which item actually had the failure and decide if they need to take any action. In a similar way, there are other kinds of parameters they can select: if they want to see the health metrics only for VMs or only for file shares, they can select the required data source type so that the chart below gets filtered. So all of this so far was about how people can view and consume these metrics in the portal. Now let's say users also want to generate an alert on this metric. What they'll do is click on new alert rule, and some of this gets pre-populated. Let's say they want to get alerted whenever there have been multiple backup failures for an item.
What they can do is select the health status as either persistent degraded or transient unhealthy, meaning that there has been a failure for this item which could be due to any reason, either degraded or unhealthy. And they can say: whenever the count of it is greater than one, meaning there were multiple such health events with these undesired statuses. Then they can select the frequency and period of evaluation. Let's say they want the alert rule to run every day and look back over the last 24 hours: they can select a look-back period of 24 hours and configure the alert rule to run at the frequency they're looking for. And essentially, once they do that, they can go and wire this alert to an action group as well. As we discussed in the earlier part, if they have an existing action group for routing alerts, they can use the same action group, or they can create a new one. Then they can specify which severity they want the alert to be fired with, and enter details like the alert rule name, the resource group it should be created under, and things like that. One more interesting feature is the ability to automatically resolve alerts. So let's say you had an alert, and then, due to some underlying issue, failures keep happening again and again. This will actually cause alerts to be sent again and again. But in the case of metrics, there are two concepts: one is the concept of stateful alerts and one is the concept of stateless alerts. Stateful alerts essentially mean that when an alert has been generated and, let's say, in the next cycle of the alert rule evaluation the condition is still true, a new alert is not generated.
And whenever, let's say, somebody has fixed the issue and the next three evaluations of the alert rule all look fine, with no failure found in that period, the alert gets automatically resolved by the service, and only when the next failure occurs is a new alert generated. So it's kind of like a way to reduce noise. So if you... And that's very important, because in my career I've seen so many times where people set up alerts and, once they get to a certain number of alerts, they just basically stop looking at the alerts, because it becomes noise. They say, oh, I looked at this and it's fine, but for some reason I keep getting the alert. Yes, that is true. So essentially what they can do is select automatically resolve alerts if they want the alert to be stateful and reduce noise. But let's say users do want to get alerted for every single failure, just to make sure that the alert gets the required traction. They can also choose to make it stateless by clearing this checkbox. And when it's stateless, every time the metric condition evaluates to true, a new alert gets generated. So this is essentially how we create a metric alert rule based on the metrics that you're interested in, and similar to the built-in Azure Monitor alerts, these can be routed to an action group. Wow. I see so much potential in there. So much potential. That last one specifically, in terms of managing the noise in inboxes and Teams channels, or whatever way you notify your people and your teams that actions or events have happened or not happened, I think is going to be huge. Again, the way you're leveraging logic apps to filter and parse the information, to decide programmatically where that information is going to be sent, to which channel, to a Teams channel versus somebody's inbox, I think is fantastic. Aditya, this was very, very useful. If somebody is trying this and they want more information, where should they go?
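The stateful-versus-stateless behavior described above can be sketched as a toy simulation. The assumption that a stateful alert auto-resolves after three consecutive clean evaluations follows the description in the conversation; the exact number here is illustrative.

```python
def fired_alerts(evaluations, stateful=True, resolve_after=3):
    """evaluations: per-cycle condition results (True = condition met).
    Returns the indexes of the cycles at which a new alert is fired."""
    fired, active, clean = [], False, 0
    for i, condition_met in enumerate(evaluations):
        if condition_met:
            clean = 0
            if not stateful or not active:
                fired.append(i)   # stateless fires every time; stateful
                active = True     # only when no alert is already active
        else:
            clean += 1
            if clean >= resolve_after:
                active = False    # auto-resolve after enough clean cycles
    return fired

failures = [True, True, True, False, False, False, True]
print(fired_alerts(failures, stateful=True))   # [0, 6]
print(fired_alerts(failures, stateful=False))  # [0, 1, 2, 6]
```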
Yes, so we have documented these scenarios in the Azure Backup documentation. If customers go to the monitoring and alerting section, they can see an overview of the different monitoring, alerting, and reporting options, along with guidance on when to use what. And having said that, if there are any other questions, please feel free to write to AskAzureBackupTeam@microsoft.com, and one of us will be very happy to answer your questions. Well, you've taken the words out of my mouth. I was gonna ask, if somebody has feedback or questions, how do they get in touch with you? And there it is. And all of those links and information are going to be in the description below. So Aditya, thank you very much for taking the time and walking us through setting up those alerts and those notifications for Azure Backup. This was very informative and useful. So thank you very much. Thank you so much, Pierre. All right, and for you at home watching, stay tuned to the IT Ops Talk channel for more content like this. And if you have any suggestions or scenarios you want us to explore, make sure to tell us in the comments below. Thank you very much. Have a good day.