Welcome to Ops 106, how to be an AD Hybrid Health Hero. Let's go. Today's session is Hybrid Health Hero, and we have Grace, who is a Program Manager in Identity Engineering, and Mark, who's also a Program Manager in Identity Engineering. Mark and Grace, please take it away. Thanks, everybody. We're Program Managers in the Identity Division here at Microsoft. We're responsible for Active Directory, Active Directory Federation Services, and Azure Active Directory. Grace and I are on a team that works with customers on their deployments of Azure Active Directory. We take the learnings and feedback from those deployments and work them back into the product to make it easier and better for everybody else. Let's get into this session. But before we go into the hybrid health part, there are a few things we want to make sure everybody knows about first. First, please turn on MFA for your admins. We should not have to keep talking about this, but unfortunately we do. How can you do that? First, you just click on the user and you turn MFA on. That's a really good start. If you do nothing else, go turn MFA on for those admin accounts. A better way to do this is using Conditional Access, because you can apply it to the roles. Anyone who gets added to Global Admin or Exchange Admin will automatically have to do MFA. But the best thing you can do here is Azure AD Privileged Identity Management. There's no standing admin access. You can do workflows, scheduling, all kinds of good stuff. We're not going to cover all of this in this talk, but please go take a look at it, and you can buy those P2 licenses just for your admins. You don't have to license all of your users to get this benefit. You can just go buy those 10, 20, 30 for those admins and go use PIM. The second thing I want to make sure people are aware of is that we've published some guidance on how to be resilient to any issues with authentication or with Azure AD in the cloud.
So go to aka.ms slash resilient AAD to read that guidance. Something that's important for this talk is that I want to make sure everybody has at least one cloud-only admin account. It's an important part of that documentation, but also, with some of the stuff we're going to talk about today, if you make changes in the incorrect way, you might lock yourself out of your tenant. We want to make sure you have one cloud-only admin account before you get into it. Okay, so what are we going to cover here today? First, we're going to cover the authentication stack and how to think about it from a health and operations perspective. Then Grace is going to talk about Azure AD Connect sync health, then we're both going to cover some logs, and then we're going to leave you with some to-dos. Okay, so the first thing you should take a look at from a health perspective is a feature we have called Azure AD Connect Health. There are agents available to install for ADFS as well as your domain controllers, as long as you're on Server 2012 or later. And good news, it's built into Azure AD Connect sync if you're on version 1.0.9125.0 or later. If you go look at our Azure AD Connect version history, that's not even on the list. I had to go way back into the archives to figure out how old that was, and it's from November 2015. Now you might be thinking, why are you talking about this? Doesn't everybody have this? Are people using things older than this? The answer is yes, and Grace is going to talk about that a little bit later from the Azure AD Connect Health perspective. The thing you're going to want to do once you go into Connect Health is set up email notifications in the portal so people are getting those notifications. From a licensing perspective, I'm not a licensing expert, but you can go here to read through it.
But basically, for every 25 Azure AD P1 licenses you have, you can install one agent. Read that for more. This is some basic stuff, but okay. So with that being said, a question for our audience: out of the tenants that are using ADFS and have the proper licensing, what percentage do you think have enabled ADFS Connect Health? 17%, 36%, 58%, or 83%? What do you guys think? I'm going to go with 36. Okay. I'd like to think it was higher, but in reality, I don't think a lot of people realize that this is a thing they should be doing, or that they can even be doing it yet. So I'm going to go A. So it's 36%. That means 64% of people are not using ADFS Connect Health when they could be, and there's some really good information in here. That's what we're going to talk about a little bit today. So how do you get this set up? It's really straightforward. The first thing you need to do is enable advanced auditing on the ADFS servers. You don't need to do anything on the WAP servers. If you don't have these logs enabled, you're going to want to do this anyway, because if you have to do any kind of incident response, you're going to need this log information. From a connectivity standpoint, you only need port 443 outbound. There's a PowerShell command here you can run to validate this. And if you go to this link here, aka.ms slash AD Connect Health agent install, you can install the agent on your ADFS servers and start getting this data. We're not going to spend any time walking through that. You're all operations folks. You understand how to do this. It's really straightforward. Now, as Grace and I were putting together this talk, we started asking ourselves some questions. One of the questions we asked is, what is the most common ADFS Connect Health alert we see on those tenants that have Connect Health on? Is it that the health service data is not up to date?
Is it that the ADFS service is not running at all? Is it that extranet lockout is not enabled? Or is it that the Windows transport endpoint is enabled? That one should be disabled, which is why it's an alert. So what do you guys think? What's the most common alert we see? I'm going to say health service data. And I'm going to say extranet lockout. Extranet lockout. So the most common one is actually health service data. This usually means there's some sort of domain name resolution or connectivity problem happening. So take a look at that and go fix it. That's an easy thing to fix. The second most common one, actually, it's in order. The second one is that the ADFS service is not running, but I think folks are pretty good at that. The one that really concerned Grace and me is that extranet lockout is not enabled. If you have ADFS, you really need to have this on. It's super important. So let's talk a little bit about what that is and how to get it turned on in your environment. If you're on ADFS 2012 R2, you have a feature called extranet soft lockout. What this does is stop trying to authenticate the user against Active Directory once you've hit a certain threshold. The idea is that you set your soft lockout on ADFS to be about half of what your Active Directory lockout is. So if your AD lockout is 10, you really want to set your soft lockout on the ADFS side to five or six. You don't want to make it super close, like eight or nine. You want to give it some space. So if someone's trying to guess an account's password, they get stopped at the ADFS side. ADFS won't pass those attempts back to Active Directory, and thus it's not going to lock out that account. That's what it does. There's not too much to it.
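To make the "about half, with some space" rule of thumb concrete, here is a minimal Python sketch. It is an illustration only and does not touch ADFS. The function names and the `min_gap` default of 3 are assumptions made up for this example, not values from the talk or from the product.

```python
def suggest_adfs_soft_lockout(ad_lockout_threshold: int) -> int:
    """Suggest an ADFS extranet lockout threshold of about half the AD value."""
    if ad_lockout_threshold < 2:
        raise ValueError("AD lockout threshold must be at least 2")
    return max(1, ad_lockout_threshold // 2)


def has_enough_headroom(adfs_threshold: int, ad_threshold: int, min_gap: int = 3) -> bool:
    """Check that ADFS stops bad attempts well before AD would lock the account.

    min_gap is a hypothetical safety margin chosen for this sketch.
    """
    return ad_threshold - adfs_threshold >= min_gap


if __name__ == "__main__":
    ad_threshold = 10
    print(suggest_adfs_soft_lockout(ad_threshold))
    for candidate in (5, 6, 8, 9):
        print(candidate, has_enough_headroom(candidate, ad_threshold))
```

With an AD threshold of 10, the sketch suggests 5 and flags 8 and 9 as too close, which matches the guidance given above.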
But if you're on 2012 R2, you want to make sure you at least turn this on, and then start migrating to 2016 or, even better, 2019, because then you'll get what's called extranet smart lockout. This is built into 2019. On 2016, there's an update you need to apply, you make sure you're at the 2016 farm level, and then you run a PowerShell command to update the artifact database. What this does is keep track of the known IP addresses we've seen you log in from. It uses that to decide: if you're coming from an unknown IP, we block you, but if you're coming from a known IP, we let you through, okay? I'm going to go into that a little bit deeper in one more slide. Something to note when you turn this on is that the database will grow by about one gig per 100,000 users. The other really cool thing about this feature is that it also works with your LDAP identity stores in ADFS. That's sometimes a reason why customers still need to use ADFS: their main identity store is actually not Active Directory. Maybe AD is only for their information workers, while their employees are in an LDAP directory. You can use smart lockout even if your identity store is in LDAP. So how do you turn this thing on? The first thing you need to do is turn it on in log-only mode, so ADFS can learn the valid IP addresses that people are logging in from. You want that to run for about three to seven days, which is the recommended amount of time. But if you're under some sort of attack, try to keep it on for 24 hours so it can learn the valid IPs before you actually switch it into enforce mode. Again, that lockout threshold should probably be half of the Active Directory value. A really common question we get is, how many IPs does it store? It's 20 IPs per user. When they hit that 21st IP, the oldest IP in that database gets removed. So that's how that part works.
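The 20-IPs-per-user behavior just described can be modeled as a small bounded list. This is a rough Python sketch of the idea, not ADFS code. The class name `FamiliarIpList` is invented for the example, and the choice not to refresh an IP's position when it is seen again is an assumption, since the talk does not say.

```python
from collections import deque


class FamiliarIpList:
    """Per-user list of familiar IPs that keeps at most `capacity` entries."""

    def __init__(self, capacity: int = 20):
        # deque(maxlen=...) silently drops the oldest entry when it is full.
        self._ips = deque(maxlen=capacity)

    def remember(self, ip: str) -> None:
        # Assumption for this sketch: an already-known IP keeps its position.
        if ip not in self._ips:
            self._ips.append(ip)

    def is_familiar(self, ip: str) -> bool:
        return ip in self._ips

    def __len__(self) -> int:
        return len(self._ips)


if __name__ == "__main__":
    ips = FamiliarIpList()
    for n in range(1, 22):  # 21 distinct addresses
        ips.remember(f"10.0.0.{n}")
    print(len(ips), ips.is_familiar("10.0.0.1"), ips.is_familiar("10.0.0.21"))
```

After the 21st address is added, the list still holds 20 entries and the first address has been dropped.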
But I know this talk is supposed to be a little bit deeper, so we're going to go one layer deeper and talk about how this actually works under the hood. When the authentication comes in, we first check the network IP, the forwarded IP, and, if we have it, the X-Forwarded-For IP. If all three of those IPs are on the familiar list in the database, we consider that you're coming from a known location, and we let the authentication proceed against Active Directory. If any of those are not on the familiar list, we consider this to be an unknown network. The next step is that we check whether the bad password count is less than the threshold, or whether the last failed attempt is older than the observation window. Those are the settings you can configure for smart lockout. If either one of those is true, meaning you're not at the bad password count limit or you're outside the observation window, we let the authentication proceed. If both of those are false, the account is considered locked out. Then, when that observation window passes, we give you one more attempt to authenticate. If it fails again, the window gets updated and the account stays locked out. Okay, so let's assume that one of those is true. We continue down the path, and if the authentication is successful, all those IPs are added to the familiar list and the bad password count for that user is reset to zero. If the authentication fails, the bad password count is increased and the bad password time is updated to the current time, to be used for that observation window. So this is what's actually happening when you have smart lockout in enforce mode. And if you ever need to look at this data about your users for troubleshooting purposes, you can get it out via PowerShell.
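The decision flow walked through above can be sketched in a few lines of Python. This is a simplified model for illustration, not the ADFS implementation. The names `AccountActivity` and `attempt_sign_in` are hypothetical, and the sketch applies the same bad-password bookkeeping on the familiar and unknown paths, which is a simplifying assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Iterable, Optional, Set


@dataclass
class AccountActivity:
    familiar_ips: Set[str] = field(default_factory=set)
    bad_password_count: int = 0
    last_bad_password_time: Optional[datetime] = None


def attempt_sign_in(activity: AccountActivity, request_ips: Iterable[Optional[str]],
                    password_ok: bool, now: datetime,
                    threshold: int, window: timedelta) -> str:
    """Return 'locked_out', 'success', or 'failed' for one sign-in attempt."""
    ips = [ip for ip in request_ips if ip]  # X-Forwarded-For may be absent
    familiar = bool(ips) and all(ip in activity.familiar_ips for ip in ips)

    if not familiar:
        under_threshold = activity.bad_password_count < threshold
        window_passed = (activity.last_bad_password_time is None
                         or now - activity.last_bad_password_time > window)
        if not (under_threshold or window_passed):
            return "locked_out"  # never forwarded to Active Directory

    if password_ok:
        activity.familiar_ips.update(ips)
        activity.bad_password_count = 0
        return "success"

    activity.bad_password_count += 1
    activity.last_bad_password_time = now
    return "failed"


if __name__ == "__main__":
    acct = AccountActivity()
    t0 = datetime(2021, 1, 1, 9, 0)
    for minute in range(3):
        print(attempt_sign_in(acct, ["203.0.113.7"], False,
                              t0 + timedelta(minutes=minute), 2, timedelta(minutes=15)))
```

With a threshold of 2, the first two bad attempts from an unknown IP are evaluated and fail, and the third is rejected as locked out without being evaluated at all.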
If you run Get-AdfsAccountActivity and pass it the UPN, you can see all the status about that user in the database. Okay, so the next question Grace and I asked ourselves is, how many tenants have zero ADFS Connect Health alerts right now, meaning at the time when we ran that data? So bear with us here. The choices are zero, meaning everyone that has ADFS and ADFS Connect Health has a problem that they haven't addressed yet, or 588, 1,424, or 2,650. What do you guys think? How many tenants are in perfect health? They're super clean when it comes to ADFS. I'm going to let Aaron go first this time. He's been nailing it, so I don't know. All right. Mr. Arden says it's zero, but I would say it's 1,424. All right, any other guess? I'm going to go with B, let's say 588. I think someone might have peeked at our slides here, if I didn't know any better. Yeah, I'm right, yeah. So 1,424 tenants that have ADFS Connect Health have zero alerts right now, which is good, but we have quite a bit of room for improvement. And these alerts are really important. You're probably saying, who cares? There are alerts, there's stuff going on. We'll get to it when we get to it. The thing to really hammer home here is that your operations matter to how well you can do enterprise security. There was a thread from SwiftOnSecurity a couple of days ago talking about the SolarWinds stuff. I thought they summarized this really well, because I think a lot of people in operations don't realize the impact they have when it comes to enterprise security. So I'll just read it here so everyone's on the same page: I say it over and over, enterprise security is operational excellence in action. They're the same thing. These attackers blend in with simple misconfigurations and troubleshooting done by admins, things that are boring and classified as not security. That's why this worked.
I really agree with this tweet thread, and that statement specifically, because how do you determine if you're under attack or if it's just a misconfiguration in your environment? These Connect Health alerts will help you with this. To give you an example, with one of the customers I was working with, we were looking through the Azure AD sign-in logs, and we noticed there was a handful of clients showing up as Windows 8 and Windows 8.1. I said, well, what's going on with these? Aren't those Windows 10? And he said, no, we're 100% Windows 10 here. I said, well, why did these folks get missed, or what happened here? They looked at the list and said, no, those people are on Windows 10. I don't know what's going on. And they kind of ignored it. I really think that's a mistake, because either something is misconfigured on that machine so that it's reporting Windows 8 or 8.1 to Azure AD, for example in the user agent string, or maybe someone's trying to sneak by, thinking there's less security applied to these devices because there's an exception that isn't applying the same Windows 10 rules. So when you have really good operational hygiene on this stuff, it makes those types of things stick out like a sore thumb on the enterprise security side. This is really important stuff. Don't ignore these alerts. This is good operations fundamentals that you can go back and do. Just a question on that. Yeah, go ahead. Were the Windows 8 clients actually problematic? Was it a security incident, or was there something just stuffed up? They never went forward to investigate. They kind of just put it on the back burner.
It's interesting, because that's a really big aspect of what IT pros do in terms of the work we want to get done in organizations: keeping those operating systems up to date and patched. A lot of the decisions, the things we need to move forward, and even the things we block, we do because of security. That is at the heart of everything that IT ops does: keeping things secure, keeping things running, and keeping things stable. So I loved this point too when I saw it on Twitter, absolutely. Yeah, I mean, I know that with some of the security stuff, everyone's drawn to the new shiny, right? And that's great, and we've got a little new shiny coming up in here. But a lot of these core fundamental things are so important to enterprise security that I think sometimes people lose sight of that, or they don't realize that keeping things patched, keeping stuff up to date, and chasing down these weird things really matters in the grand scheme of how you do enterprise security, especially at a very large scale. So it's super important. I think it was, I'm trying to remember, Bruce Schneier. Yeah. I think he said that it's not the Hollywood attacks that actually get through most of the time. It's the really boring meat-and-potatoes attacks that are the most effective. Yeah. I think there's a lot of that discussion going on, especially with some of these advanced attacks we've seen. People get distracted by that while some basic stuff is running with incorrect privileges or not using least privilege, and Grace is going to talk about that from an Azure AD Connect perspective, which is really important. That's the stuff that ends up really hurting you, right? Because it's not contained. We're not using least privilege, and it just makes a bad situation way worse. So yeah, I totally agree. Okay.
So something we provide for you in ADFS Connect Health is a bad password attempt report. We show you the top 50 accounts, and this gets updated every 12 hours. So if you're just messing around with it and putting in bad passwords for your account, you won't see it right away. It takes 12 hours. We'll show you multiple IPs as well as the forwarded client IP if we have it. But the thing I really want to stress is that you need to go look at this and take action on it. Is a key account under attack? Or did somebody change their password a week ago, and they have an iPad or some other device in a drawer that they pull out on the weekends, that has the old password, and every 15 minutes it's just banging away on that account, raising that counter? Go look through this stuff, chase it down, and then use it to take action, because maybe we want to do something with that account if it is under a targeted attack or it is on some list somewhere. So take a look at this. It's in there. You don't have to do anything extra. Once you turn on Connect Health, you'll start to see this data. It's super important. Another one we have is the risky IP report. This is an aggregate view from the web app proxy servers that shows the risky IPs based on where they're exceeding these different thresholds. It's super useful, because if you look at the bottom two rows, you see two different IPs and you can see the unique users they attempted. The second one is one and the third one is two. So it looks like somebody probably doesn't remember their username. They put their username in, they put their email in, whatever, and they have these different password attempts. But that first row is 14.
That's probably not a user who's trying to log in and just can't remember. This is probably somebody trying to do something malicious. So is that IP address an indicator of compromise for you on something else? Is there something you want to do at your border? Maybe shun that IP for a couple of days until they move on to the next target. This is really rich information that you can take and then act on with your security teams, with your SOC, with your NOC. Go look at this stuff, because it's provided right there. So please go back and turn on ADFS Connect Health. You get this rich information just by having it on. You don't have to do any other advanced configuration. It will just start populating. Super useful. Now, new shiny: in case you missed it, we recently announced that Defender for Identity now supports ADFS. There's a sensor you can install on the ADFS servers, just like the one you can install on your domain controllers. It requires ADFS 2016 or 2019, and you have to turn on that advanced auditing, just like you do for Connect Health. You're going to want to turn that on anyway. If you look at our security blog here, it walks you through all of this. This just came out, as of this recording, I think about two weeks ago, maybe not even. So if you have Defender for Identity and you have ADFS, go take a look at this and go use it. All right, so some ADFS parting thoughts from me. You have to treat ADFS like it's a tier zero resource. With the stuff we talked about today we're really just scratching the surface, but your ADFS infrastructure is just as important as your domain controllers, and I think from an operations perspective people understand what that means.
When you say you have to protect the domain controller, and how you monitor and patch and all that kind of stuff, people really understand that, but I don't think people are treating their ADFS the same way, and some of the other stuff we're going to talk about here needs to be treated that way too. So you want to make sure you're dedicating the right resources to the care, feeding, and protection of all of these ADFS servers, just like you would for a domain controller. Now, if you are going to be on ADFS for the foreseeable future, for whatever reason, please move to the latest ADFS, which is 2019. I don't think you want to hang around on 2012 R2 for the next five to six years. There are some really good features, like the smart lockout we talked about, and you really want to be on that latest version of ADFS. Look to store your ADFS signing keys in an HSM. If somebody is able to get hold of those signing keys, they can mint tokens as whoever they want. This has been talked about for some time, and we saw it with some of the SolarWinds, Solorigate stuff. It's a good rule of thumb: if you lose your certificates, bad things are probably going to happen to you in your environment, and this is a really good example of that. If you lose those signing keys, an attacker can really do some bad stuff to you. The other thing here, since these are just a few of the things we're talking about: go look at our ADFS hardening guide at the aka.ms link here. There's a bunch of other recommendations you should follow, especially if you're going to be on ADFS for the long term. Now, the other thing, which I just talked about, and this is going to be a lot of you, because two thirds of you aren't even using Connect Health to monitor your infrastructure at all: you probably don't want to be doing this.
So if that's the case, look to move to cloud-based authentication, like password hash sync or pass-through authentication. You can use our staged migration tool here, which basically lets you put some users in a group and moves them from being federated to managed. You can try it out, make sure things are working the way you want, and then go ahead and flip that whole domain name over. But really, the last thing here is: treat ADFS like a tier zero resource if you need it, and if you don't need it, look to move off it as fast as you can to PHS or pass-through authentication. Okay, so for the next part, let's say you've moved off ADFS and you're on something like PTA or PHS. Let's talk about those things. The first thing here is a feature called seamless single sign-on. What this does is give you that single sign-on experience for domain-joined devices when they're on the corp net, without ADFS. That was one of the big reasons why people had ADFS to begin with: they wanted that single sign-on experience. You can now get that experience with pass-through authentication or password hash sync. How this works is well documented if you go to aka.ms slash seamless SSO under the hood. There are a lot of links in this talk, so we'll make them available, but there's a lot coming at you. If you haven't read this, I can summarize it quickly: think of it as getting a Kerberos ticket for Azure AD. That's kind of what's happening under the covers. I think people are pretty familiar with Kerberos, or at least somewhat familiar with it, so think of it as getting a Kerberos ticket for Azure AD. Now, a few other things that people may not realize: if you have hybrid Azure AD joined devices or Azure AD joined devices, you get a special type of token called the primary refresh token.
If you have that and you have seamless SSO, the primary refresh token from hybrid Azure AD join or Azure AD join will take precedence over seamless SSO. And if all of your machines are hybrid Azure AD joined, which you should be trying to do because there are lots of good benefits, or you're moving to Azure AD joined, you don't need to have seamless SSO in your environment. Think about that, because there's some stuff you need to do in terms of maintenance and health from a seamless SSO perspective. So if all your machines are hybrid Azure AD joined, you don't need to use seamless SSO. The big thing I want to make sure people understand is that the computer account used for seamless SSO needs to be protected, because it can create a Kerberos ticket for any identity, and Azure AD is going to accept that. I've seen a lot of people's Azure AD SSO account (AZUREADSSOACC) sitting in the Computers container. That's not good. We want to make sure this is in an OU that only domain admins can access. We also want to turn off delegation on that account, and we probably want to protect it from accidental deletion. This is a very privileged computer object. We want to make sure only domain admins can access it. The second thing you probably want to do is update that computer object to use AES 256 or AES 128 for the encryption type. Depending on your Active Directory environment, how old it is, and what configuration you've set, it's probably going to use RC4 by default. If you can support something higher, that's better. So go ahead and update the msDS-SupportedEncryptionTypes attribute, and here's the hex value to put in if you want to support AES 128 or 256. But please test this out and make sure it's not going to break anything in your environment. Everyone's environment is a little bit different, so test it. But if you can, definitely move to a higher level of encryption.
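The hex value on the slide is not captured in this transcript, but `msDS-SupportedEncryptionTypes` is a documented bit mask, so the value can be worked out. The Python sketch below only does the arithmetic and decodes a value. It does not read or write Active Directory, and the helper name `describe` is made up for this example.

```python
from enum import IntFlag
from typing import List


class EncType(IntFlag):
    """Bit flags used by the msDS-SupportedEncryptionTypes attribute."""
    DES_CBC_CRC = 0x01
    DES_CBC_MD5 = 0x02
    RC4_HMAC = 0x04
    AES128_CTS_HMAC_SHA1_96 = 0x08
    AES256_CTS_HMAC_SHA1_96 = 0x10


def describe(value: int) -> List[str]:
    """List the encryption types enabled in an attribute value."""
    return [flag.name for flag in EncType if flag & value]


if __name__ == "__main__":
    aes_only = EncType.AES128_CTS_HMAC_SHA1_96 | EncType.AES256_CTS_HMAC_SHA1_96
    print(hex(aes_only), int(aes_only))
    print(describe(0x1C))
```

AES 128 plus AES 256 works out to 0x18 (decimal 24), and AES 256 alone is 0x10. As the talk says, test any change before rolling it out, since which value is safe depends on what else in the environment still relies on RC4.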
And after you do that, you need to roll the keys for this computer object. We also recommend that you roll these decryption keys at least every 30 days. We have some work on the backlog so that this is going to happen for you automatically, but in the meantime you need to roll this account at least every 30 days. Now, if you're thinking right now, I can't remember the last time we rolled this, or I don't think we've ever rolled this, how do I do that? Let's talk about it. We have really detailed steps here if you go to aka.ms slash seamless SSO key rolling. This needs to be done per forest. You're going to run some PowerShell, give it some global admin credentials, give it some domain admin credentials, and then you get to the point where you run the PowerShell command Update-AzureADSSOForest and pass those domain credentials through, and it will go ahead and do that for you. Now be warned about rolling this twice, because I know sometimes people say, well, I want to do it again just to make sure it took. I don't know if I've worked with anyone like that before. If you do that, you might break SSO, because the existing Kerberos tickets need to expire before the new ones can be issued. You can do a klist purge and go get new tickets if you need to, but you might not be able to do that for all users in your environment. So you don't need to run it twice. The documentation calls this out, but I just want to make you all aware that you only need to run this one time. Okay, now on to the actual authentication part, which is pass-through authentication. If you haven't looked at this, we cover how it works under the hood, but basically, when the user puts their credentials in, the request goes from Azure AD down to Active Directory and gets tried against the domain controllers directly.
So we cover how this works here, and we do a deep dive on it here as well, but you only need ports 80 and 443 outbound. That's the gist of how this works, and you can go read the full thing if you want to know more. Okay, so what do you need to do from an operational perspective? First, we recommend that you have at least three of these PTA agents installed for high availability, and the first one is going to be installed on the Azure AD Connect server when you do the install. From a sizing perspective, each PTA agent can handle about 300 to 400 authentications per second on a standard box, which we say is four cores and 16 GB of RAM. Something to keep in mind is that all PTA agents, when they're healthy, are used for authentication, but not simultaneously. There's nothing you need to do from a configuration standpoint that says, oh, these four PTA agents are going to be in this region for these users and these six are going to be in that region for those users. It doesn't work like that. As long as the agent is available and healthy, it can be used as a PTA agent to do authentication. The second thing I want to make sure everybody's aware of is that there are smart lockout settings for this. It's on by default, which is really good, because now we're taking the credentials you're putting into Azure AD and applying them against your on-prem domain controllers, and you could lock out the account if you have too many bad password attempts, right? That's how lockout works. So it's on by default, but you want to make sure the smart lockout settings in Azure AD are aligned with what you have in Active Directory. Again, the Azure AD threshold needs to be less than what you have in Active Directory. So if your on-prem threshold is 10, you probably want this to be five or six.
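The sizing guidance above (at least three agents for high availability, roughly 300 to 400 authentications per second per agent) turns into simple arithmetic. The Python sketch below is a back-of-the-envelope helper, not an official sizing tool. The function name, the conservative default of 300 per agent, and the `spare_agents` parameter are assumptions made for this example.

```python
import math

MIN_AGENTS_FOR_HA = 3  # "at least three" from the guidance above


def pta_agents_needed(peak_auths_per_second: float,
                      per_agent_capacity: int = 300,
                      spare_agents: int = 1) -> int:
    """Estimate how many PTA agents to run for a given peak sign-in rate.

    per_agent_capacity uses the low end of the 300 to 400 range.
    spare_agents is a hypothetical allowance for one agent being down.
    """
    if peak_auths_per_second < 0 or per_agent_capacity <= 0 or spare_agents < 0:
        raise ValueError("inputs must be non-negative, with a positive capacity")
    for_load = math.ceil(peak_auths_per_second / per_agent_capacity)
    return max(MIN_AGENTS_FOR_HA, for_load + spare_agents)


if __name__ == "__main__":
    for peak in (100, 900, 1200):
        print(peak, pta_agents_needed(peak))
```

For most tenants the high-availability minimum of three dominates. The load term only matters once the peak sign-in rate gets into the high hundreds per second.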
The last part about this is, please treat these agents like a domain controller, like a tier zero resource. That makes sense if you think about it, because this agent is getting the passwords, encrypted, from Azure AD and calling Win32 APIs to validate them against the domain controllers. If an attacker were to get onto that box, they could do horrible things in your environment, just like an attacker who gets onto your domain controller can do horrible things to your environment. So you want to make sure you're using those tier zero principles. We have a couple of aka.ms links here to look into, and aka.ms slash SPA really talks about how to do this from a tier zero perspective if you've never heard of that, but you really need to treat these agents like a domain controller. Okay, from a monitoring perspective, what should we do? On the agent machines themselves, if you see an event ID of 1000, this means the connector re-registration failed, and the certificate has either expired or is going to expire. If you see any event IDs from 2000 through 2034, to summarize, something bad is happening, usually with connectivity. Either the agent can't talk to the domain controllers, the agent can't talk to Azure AD, or some other general badness is happening with the agent. Go investigate that. On the Azure AD side, there are a couple of event IDs you can look for that indicate a connectivity issue from Azure AD's perspective, or that it's unable to decrypt the password, or that something else is wrong, and you can investigate and take action on that. So you want to be monitoring on the Azure AD side as well as monitoring your PTA agents. From a troubleshooting perspective, not a lot of people know that we have a PowerShell command you can run that will walk you through this and do a test PTA authentication.
So if you run Invoke-PassthroughAuthOnPremLogonTroubleshooter, which is a mouthful, so tab-complete that one, it'll walk you through it. You put the credentials in, you can see what you're getting back, and it really, really helps you with this. We also have a really good, detailed troubleshooting guide for PTA that you can take a look at. Now I have a new feature to announce here. Michelle is the feature PM that's building PTA and seamless SSO to work with disconnected forests. It's currently in private preview. So if this is you, and you have a bunch of AD forests with no network connectivity and you really need this to work across them, feel free to email Michelle. She's happy to talk to you about what the preview looks like and where we're going with this, and to have you try it out in a test environment. So go and email Michelle. She's the feature PM for this area. Okay. Last thing here before I turn this over to Grace is password hash sync. I am mostly out of things to talk about when it comes to password hash sync. I've talked about it a lot over the years. We're actually at 92% of tenants with password hash sync enabled, up from 91% at Ignite 2019. At that session I talked all about why you want to do it, how it works, all the good stuff. So if you've never heard me talk about that, go check that out, but I'm pretty much out of things to say about it. Really, please just go turn this on. It's super, super important. And for those that do have it on, please, please treat this like a domain controller. It's a tier zero resource. It might even have the same types of permissions as a domain controller when you have password hash sync on. So please, you need to treat this like you would treat a domain controller. It's super, super important. And with that, I'm gonna hand it over to Grace to talk about Azure AD Connect sync health.
Mark, I just want to interrupt for a second and say that I love that you've brought up how critical it is to secure all of these resources. When we think about security, it's quite common to think about securing the perimeter, especially from the on-prem days when the perimeter was our main security boundary. Even now we tend to think about security as being about attacks from the outside and how we can harden that shell, but there very much is strength in making sure that those critical resources inside our organizations are as secure as they can be. So if something does happen, we are limiting the amount of lateral movement that people can get through our systems and the damage they can do. We make sure that we've got extra levels of protection on those super critical resources. Then it's just that extra layer of security that we need, right? Yeah, I mean, that's zero trust, right? Assume breach, least privilege, verify explicitly. Those are those principles in practice, absolutely. Awesome, great. Thank you so much for all of that Grace. Great, thanks. So we've just run through a little bit around how to keep your auth stack healthy. Now we're gonna move over into thinking about Connect sync health. The first thing that I just wanna bring to your attention is that I know when you're in an ops capacity it can be really easy to let version history slide. And that's okay sometimes. I understand that you're busy fighting fires, but I just wanna call out that DirSync and Azure AD Sync were deprecated in 2017. So potentially, when you were first forced to implement Azure AD Connect for synchronization, you may have just gone in and done the express implementation, or maybe made some config changes to the defaults, left it, and have been monitoring it every now and again just to make sure it's up.
And I think it's really important to emphasize here that you need to make sure these servers are current and have the latest releases of Azure AD Connect, 'cause we are constantly making upgrades to Azure AD Connect. They include not just fixes for security issues and bugs, but also a lot of serviceability, performance and scalability improvements. I'm gonna touch on a few of those new shiny things, but actually it's a bit of a way of making your life easier, because as we've upgraded Azure AD Connect we've built in features like auto upgrade, so that if there is an update that we think you should definitely have, it's gonna update itself. That doesn't mean it updates to every single new version, just the ones that we deem necessary, and that includes the likes of the security things that we've found. We know in practice that customers are on extremely old versions, and those can actually prove problematic. The issues may not be directly related to Azure AD Connect, because servers that have been in production for several years typically have had several patches applied to them, and not all of these can be accounted for. We see people generally about a year to two years behind. So as a go-do, as you go and check the health of your ADFS and of your sync, take a look at the latest and greatest version history that's available at aka.ms slash aadcdocs, and also take note that if you are running 12 to 18 months behind, you should consider a swing upgrade, as this can be the most conservative and least risky option. So when we talk about sync, what is sync? Azure AD Connect is actually a kind of umbrella for a group of services, including synchronization and health monitoring. And the sync service actually has two major components.
There's the on-prem aspect, and then we also have the service side in Azure AD, called the Azure AD Connect sync service. And we've actually deployed a new endpoint, or API, for Azure AD Connect that improves the performance of the sync service straight through to Azure Active Directory. By using the new V2 endpoint you'll get better performance on both export to and import from Azure AD, and it also supports syncing groups with up to 250,000 members. If you want to use the new V2 endpoint, which I definitely recommend you at least put in a staging environment, you will need to use Azure AD Connect version 1.5.3 or later, and all the deployment steps that you should need are in the document linked here, aka.ms slash aadcdocs, to step you through how to make use of that new endpoint. So I want to quickly show you, from an architectural view, what's actually inside Azure AD Connect and how that synchronization flow works. The first step here, as you can see with the orange number one, is the import, and this works from AD or from Azure AD depending on the attribute. So for example, if you've got an attribute which is homed on-prem and you make a change to it, that's going to sync through into the connector space. However, if you make a change to something in Azure Active Directory, such as an M365 group membership, that's going to come into the Azure AD connector space from Azure AD. These imported objects are all staged in the connector space, and there is intelligence to determine what has changed from each data source. In step two, all of these objects get assembled in the metaverse. This is where we have a real consolidated view, and you can see here with the dotted lines how those attributes flow, whether it's copying or transforming data from one stage to another. And you can actually define sync rules to influence that flow, such as inbound or outbound.
And I would make a point of saying that if you have changed those default rules, it's important to make a note of that, so that if you do have to create a backup with a DR server, you can, and also so you know where these behaviors come from if you have to investigate an issue. So with step three, this is the final export from the sync engine to the destination, as per the direction of flow. This pattern of sync from on-prem to Azure AD is key for identity and access management and is at the core of how we get our identities into Azure AD and exist in a hybrid state. It provides a consistent way to handle the joiner, mover, leaver (JML) flows, keeping those attributes consistent and preventing clashes, and it helps with provisioning for Microsoft services and third-party apps, and of course with providing access to those apps. So hopefully you've understood there from the architecture how important the sync is. So please, I cannot emphasize this enough, secure your sync. And this covers not just the basics of the machines themselves. We're talking about the accounts that are used. Specifically, when we talk about the accounts used as part of sync, we have three accounts. We have the AD connector account, which is what's used to write to Windows Server Active Directory. We have the AD sync service account, which is what's used to run the synchronization service and read and write the data to the SQL database. And of course you have the Azure AD connector account, which is used for that ongoing sync with Azure. Now when we talk about least privilege as part of zero trust, as Mark's previously touched on, I cannot emphasize enough how important it is to bring that frame of reference from the cloud into on-prem as part of your hybrid mindset. These three accounts, when they're provisioned, are over-provisioned with permissions.
And depending on how you have configured your Azure Active Directory Connect services, whether you are using things like PTA or PHS, those accounts will probably have more privileges than they need, either at the initial setup or once you finish setup. So please check the privileges of those accounts, reduce them to what's required, and check with our AADC documentation to make sure they have the least privilege that you need. We'd also recommend that for your AD sync service account you use a group managed service account, to make sure it can't be used for things like interactive logon. The key thing here as well is that I see a lot of over-permissioning of accounts, because at the time, especially if you're under a lot of pressure or you've got SLAs to meet, you just give an account, say, domain admin, or you give directory sync accounts a role like global admin, so you can go in and investigate what's happening without hitting any permission errors. However, it's important that you remove those permissions when you don't need them, because if that account was to be compromised, for example, then the attacker is now the keeper of the keys, not only to potentially the whole of Azure AD but to all of your sync infrastructure. And depending on how you've set up those accounts, if you're sharing accounts, or if you haven't reset passwords and you're sharing passwords, then suddenly you've got a lot of lateral movement, and that is a big problem. So please secure your sync accounts. Okay, so let's talk about health. Hopefully by now you're fully bought into the idea that you need to make sure the servers are up to date and you're using the latest and greatest versions of Azure AD Connect. You're using auto upgrade where you can, keeping an eye on our upgrade paths, and also using the V2 endpoint. So once you're in a healthy state, how do you monitor what is actually going on within Azure AD Connect?
So of course it's important, as I mentioned, that you install those Connect Health agents, but also turn on your notifications. I know sometimes it can be a pain to go proactively looking for services that are down, but if you turn on your notifications and make sure that you're monitoring your alerts, these alerts are actually updated every 30 minutes. We have quite a nice statistic, that 96.6% of sync tenants have Azure AD Connect Health for sync. But as much as that sounds good, we have just over 13,000 tenants that have never closed an alert, which for someone who loves to have inbox zero, or attempts to have inbox zero and no Teams notifications, would drive me mad. And that also means that if you do have that many alerts open, going back to Mark's point earlier, how do you know if an alert needs to be investigated, or if it's a misconfiguration, or you've got something that's down and needs to be looked at? Don't take these health alerts for granted. We've also got around 1,123 tenants that have only one raised alert but zero resolved. In terms of our top sync alerts, the most popular, if you can call it that, or the most alerted alert, is the AAD sync AAD auth failure, which is that the connection to Azure Active Directory has failed due to an authentication failure. Coming in second, we've got data freshness, so the health service data is not up to date. That's shortly followed by the AAD sync import and export status alerts, that the import has failed or the export has failed. So it's important that you keep an eye on these health alerts because, as you can see here, you can access them in the portal or receive notifications, and they show you really, really useful stuff. Okay, so I just got a question on that. With those notifications, can I pump those through into, say, a Teams channel or into a separate IT ticketing system? Yeah, that's a great question.
So when we talk about the email notifications, of course, with Teams channels, if you have them configured to receive emails, you could use the email address for that Teams channel as the notification address. Or if you wanna be a little bit more proactive and connect to, say, a third-party or first-party monitoring system, we're actually going to touch next on how you can control your logs and export them, so you can set it up that way. Awesome. So just to show you here, double-clicking a little bit more into some of those sync errors. You can see here I'm nice and healthy. Well, at least my Azure Active Directory Connect servers are. I've got no active alerts. I'm up to date with my export, but I have eight sync errors. I can look at these sync errors and see what each error is. This one here is actually a duplicate attribute error. It shows you exactly what the clash was, which attribute it was conflicting on, and can even help you through how you would resolve it. Okay, so let's talk about failover and backup. This comes with some of the latest versions of AADC, and you may not have really looked at Azure Active Directory Connect since you upgraded or implemented it, given that we see people 12 to 18 months behind, generally, on average. If you have a server, you can actually put it in staging mode. What you can then do is make changes to the configuration and actually preview those changes before you make the server active, which of course, as I always say, you've got to measure twice and cut once. And when we're talking about a critical service that you need to be treating like tier zero, as Mark mentioned before, given how critical this infrastructure is, it's important to make sure that you've got business continuity and disaster recovery available for Azure Active Directory Connect. And that's where you can use staging mode.
It also allows you to run a full import and do a full sync to verify that any changes are as expected before you make these changes in your production environment. So just to walk you through how our customers have been using it: customers have been setting up new staging servers based on the active server configuration. And then we've also got a new feature, which I'm going to run through next, which allows you to export and version the configuration, and they use it for things like config changes, or periodic validation that the config is consistent between active and staging. So not only does that help you if you are proactively trying to make a change, it also helps if something accidentally gets changed. You have a staging area where you can do that, and also the ability to import and export your configuration, which I'll talk about later. Typically, we recommend having at least a warm server in staging that's ready for DR purposes. You can use that during a swing migration, say for an upgrade to a new version, and also to help you validate config before pushing it to production. Now I do want to call out that the most common misconfiguration is that a customer creates a staging server and makes a mistake in the list of scoped OUs. Later, when they enable this server in production, they either see new objects being synced, which is not great but not the end of the world, or they see mass deletions for OUs that weren't enabled, which of course is really bad and has a huge customer impact and a very long recovery time. So we built this feature for staging so that customers can create an almost carbon copy of an existing production server and can make sure that it will sync in the same way as their current production server. And of course there are additional benefits for versioning and logging of the server configuration. If you have a staging server, how do you get it up to scratch?
How do you make sure that it's representative? I'm a big fan of being able to copy, paste and undo anything that I do. I'm only human, and I appreciate that when you're in an IT operations role you can be under a lot of pressure, working long hours, or it may be 2:30 in the morning. So if you do make a mistake, you want to be able to quickly get back to what was the last known good. That's where the import and export of Azure AD Connect configuration settings, which has just gone into public preview, comes in handy. What this feature does is introduce the ability to catalog the configuration of a given sync server and then import the settings into a new deployment. The different sync settings snapshots can be compared to easily visualize the differences between two servers, or the same server over time. It's really easy to set this up, and of course make sure you've got the latest version so you can see it. You can see from the screenshot here that you tick the import synchronization settings box and choose a location. Then, each time the config is changed from the wizard, a new timestamped JSON settings file is automatically exported to the ProgramData folder for Azure AD Connect. The settings file is named in the form of applied sync policies, with the last part of the file name being the timestamp, and that's of course a JSON file, which you can then use in reverse on import when you set up that staging server. So now I'm gonna hand over to Mark to kick off talking about logs. Super, super helpful. Now let's talk about the last section here, which is logs. Logs, I love talking about logs. There are two types of people in the world. There are people that don't like logs at all, and those people are wrong, and there are people that like logs, like myself, and they are right.
'Cause there is so much valuable information in here that we're gonna go ahead and dig into. We've covered this at previous events, so if you don't know anything about Azure AD logs, the first thing is there are two parts. There are the sign-in logs and there are the audit logs. The audit logs you can think of as any state change in the directory. I covered this at a SANS logging summit in 2018. If you go to this YouTube video, it's about 20 to 25 minutes on everything you need to know about Azure AD logs. So I'm gonna assume that you either know this already, or you'll go watch that and come back to go through this next part here. The key thing I wanna make sure people understand is getting these logs into the rest of their environment, because as you're gonna be in this hybrid state, you have things in your on-prem environment and you're gonna have things in the cloud. How do you marry that data together to make sure you're seeing the whole picture? So how do you get your Azure AD logs into your SIEM? The first way you can do this is you can go in the Azure portal and click download. It comes down as a CSV file, and you can do whatever you need to do with that to get it into your SIEM. The second way is you can use the reporting API, where you make Graph calls against Azure AD and pull the audit logs and the sign-in logs that way. You have to make sure that you're protecting those credentials, and you have to make sure you're pulling the right time range of logs, meaning that you're gonna have an overlap so you're not missing any events. You can do it that way today, and that's the way some people have been doing it if maybe you set this up four or five years ago. But the better way to do this is with Azure Monitor, because you can do a few different things.
The first thing is we keep 30 days' worth of logs in Azure AD, but some customers wanna keep 90 days, some people wanna keep six months, some wanna keep a year, and some people wanna keep it forever. We don't let you do that in Azure AD, but you can send all of your log data to Azure Blob Storage, and you pay for the amount of storage that you use. So if you wanna keep logs for five to 10 years, you can do that. It gets stored there as JSON files. The second thing is you can send the logs through an Azure Event Hub. This is the best way to get the logs into the SIEM, because they are just pushed directly into the SIEM from the Event Hub. So no more querying, no more trying to keep track of which logs you've ingested and which ones you haven't. As the logs are generated, they will be pushed into the SIEM directly, and I'll cover how that works here on the next slide. The third thing is you can send all these logs to Azure Log Analytics. So let's say that maybe you hate your SIEM team, or maybe your SIEM team hates you. I don't know, all this IT stuff has a lot of those political boundaries. You can use Log Analytics to look at the data directly there. You don't have to go through your SIEM folks to do this. And in here, we have pre-built workbooks for you that show you things like conditional access insights. So which policies are being applied? Which ones are not being applied? We show you legacy authentication. We show you sign-ins that have errors, and lots and lots of really good stuff that you can use to start with and then build on top of. So this is the way you get those analysts looking through the logs in your environment. And then lastly here, if you're using Azure Sentinel, that's actually built on top of Log Analytics. Okay, so how do I get this into my SIEM? On your SIEM side, you're gonna have to use one of the pre-built tools that they have. We have all the common SIEMs here, like Splunk, Sumo Logic, QRadar.
So if you go to these aka.ms links, they walk you through what you need to do. Basically, you're gonna configure the Event Hub on the Azure AD side. Then on the SIEM side, you're gonna configure the pre-built integration, and the logs will start flowing that way. If you don't have a SIEM and you wanna get started, you can use Azure Sentinel. And if you have a SIEM that's not on this list and you would like them to use Azure Event Hub, tell them you want this and have them reach out to us. We wanna get them onboarded so we can pump those logs through the Event Hub into your SIEM as easily as possible. So how do you do this? On the Azure AD side, it's pretty straightforward. You go to diagnostic settings, you click new, give it a name, and you click stream to Event Hub. You can also select a storage account if you wanna do that, or you can send it to Log Analytics. Then you pick the logs you want to send and where you want to send them to. If you've done this before, you've probably clicked on the audit logs and the sign-in logs. And if you haven't looked at this in a while, you're gonna notice four new log types that you can send to a storage account, your Event Hub, or Log Analytics. So this is super, super exciting stuff for us log nerds. Let's go through what you get in these different types of logs. The first log here is non-interactive user sign-ins. This is when a user is signed in and the application or an OS component completes that sign-in on behalf of them. To give you an example of this, and we're not gonna cover all the token stuff here, when you first log in, you get that prompt for initial authentication, you put in your username and password, you hopefully have to do MFA, and then you're logged into your mailbox. And then from there, you still keep getting mail, an hour later, two hours later, eight hours later. It keeps working.
So under the covers, what's happening is the refresh token for Outlook is getting you another access token for Exchange Online, and it keeps doing that for you silently. Those are the types of events that are gonna show up here in these non-interactive user sign-in logs. Same thing with hybrid Azure AD joined devices, Azure AD joined devices and the Authenticator app. These get you a special type of token called the primary refresh token, which gets you an access token as well. These are also gonna show up in the non-interactive user sign-in logs. So super, super, super warning. If you haven't figured this out, that is a lot of data, right? That's every authentication request, every hour, for all your users. It's going to be a lot of data, sometimes five to 10 times the amount of data. So before you check the box to just send this into your SIEM, make sure you talk to those people so they understand how much data they're going to get. And the question you need to ask is, do you really need that data in your SIEM? Maybe it needs to go to long-term storage. Maybe when you have to troubleshoot something, you just come to the Azure AD portal and troubleshoot it there. But just beware, because some people have gotten some sticker shock when it comes to how much their bill is now that they're sending all this data to Log Analytics or into their SIEM. Now, everything I just talked about here when it comes to tokens, if you have no idea what I'm talking about, that's okay. We have this well documented, covering access tokens, refresh tokens and session tokens. Take a look at this: go to aka.ms slash ad token lifetimes, as well as aka.ms slash ad PRT, which covers how primary refresh tokens work.
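Before checking that box, a back-of-envelope estimate can help the conversation with the SIEM team. The sketch below is only an illustration: it assumes the "five to 10 times" figure is relative to your existing interactive sign-in volume, and the average event size of 2,000 bytes is a made-up placeholder that you would replace with a measured value from your own tenant.

```python
def estimate_daily_events(interactive_per_day, low=5, high=10):
    """Range of non-interactive sign-in events per day, using the 5x-10x
    multiplier mentioned in the talk (assumed to be relative to the
    interactive sign-in volume)."""
    return interactive_per_day * low, interactive_per_day * high


def estimate_daily_gb(events_per_day, bytes_per_event=2000):
    """Rough daily ingestion in GB. bytes_per_event is a hypothetical
    placeholder, not a documented figure."""
    return events_per_day * bytes_per_event / 1e9


if __name__ == "__main__":
    low, high = estimate_daily_events(100_000)
    print(f"{low:,} to {high:,} events/day")
    print(f"{estimate_daily_gb(low):.1f} to {estimate_daily_gb(high):.1f} GB/day")
```

Multiplying the result by your SIEM's or Log Analytics' per-GB price gives a first guess at the bill before any data starts flowing.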
So what you'll see in the portal, and it looks a little bit different than what you're probably used to seeing in the traditional sign-in logs or the audit logs, is that these are actually going to be grouped together where they share the same characteristics, like IP and which resource they're trying to get to, but the time is different. Here I've done a filter for this user Clay in our tenant, and you can see there on the right-hand side the number of sign-ins, where you see more than one: three, four, two there. You click the little chevron there on the left-hand side and it will expand out, and you'll see those different refresh token sign-ins for that user. Okay, the next type of log is service principal sign-in logs, which is the applications. These are non-user accounts, and they authenticate to Azure AD using a credential of some kind. They're hopefully using a certificate, or they're using a shared secret, which is basically a password. This is the stuff that ends up in GitHub repos and places like that, so we don't want to use those. We want to use certificates where possible, or managed service identities, which I'll talk about here next. And this is really, really useful as well. The size of these logs will depend on how many applications you have. But obviously, if you have a lot of applications getting a lot of access tokens, this can be quite a bit of data. It's grouped in a similar way to those non-interactive sign-ins. Here you can see on the right-hand side the number of sign-ins, as it's going up, for those applications. And this is a really, really good thing to monitor from a security perspective, because your applications are probably going to behave pretty statically. You'll know those patterns. They're probably coming from the same IP ranges and going to the same resources. If you see things change, that's a really good indicator that you may want to go investigate, because something may be happening with that application.
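The "applications behave statically" idea above lends itself to a simple baseline check. The following Python sketch is an illustration only: the event dictionaries use simplified, hypothetical keys (`app`, `ip`) rather than the real service principal sign-in log schema, and a real detection would compare IP ranges and resources rather than exact addresses.

```python
from collections import defaultdict


def build_baseline(history):
    """Map each application to the set of source IPs seen in past sign-ins."""
    baseline = defaultdict(set)
    for event in history:
        baseline[event["app"]].add(event["ip"])
    return baseline


def find_anomalies(baseline, new_events):
    """Return sign-ins whose source IP was never seen for that application.
    An application with no history at all is also reported."""
    return [
        event for event in new_events
        if event["ip"] not in baseline.get(event["app"], set())
    ]


if __name__ == "__main__":
    history = [
        {"app": "payroll-sync", "ip": "10.0.0.5"},
        {"app": "payroll-sync", "ip": "10.0.0.6"},
    ]
    today = [
        {"app": "payroll-sync", "ip": "10.0.0.5"},     # known address
        {"app": "payroll-sync", "ip": "203.0.113.9"},  # never seen before
    ]
    for hit in find_anomalies(build_baseline(history), today):
        print("investigate:", hit)
```

Exact-match comparison keeps the sketch short; it will be noisy for applications behind changing egress addresses, which is why range-based matching is the more realistic choice.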
Next, we have managed identities. If you don't know what managed identities are, these are a special type of service principal where Azure will manage the credentials for you. So no more having to keep track of the shared secret, no more trying to roll certs. Azure will do this for you. And you can use this against any Azure service that supports Azure AD authentication, including Azure Key Vault, which is super, super exciting. Again, this is grouped in the same way. We could do a whole talk on managed identities in the future. Go to aka.ms slash ad managed identities to read more. This is really, really useful, and it's much easier from a management perspective than just using service principals, especially when you have to do certificate management and roll those shared secrets. And the last type of new log is provisioning logs. These aren't new, but they're new to the Event Hub. Provisioning logs are the logs in Azure AD for when you're provisioning users to these different SaaS applications. So the first thing I'm gonna tell you here is please convert your Timothy API to AAD provisioning. This is well worth your time. If you don't have a Timothy API, you might have a Betty API or a Jonathan API. That's basically a person that goes and does this stuff manually. Maybe they go into the SaaS app and create the user themselves. Maybe there's a CSV spreadsheet that has to get uploaded every three or four days. But there's some manual process that some person is having to do to provision users into these different SaaS apps. Stop doing this. This is well worth your time to automate. You can onboard those users much quicker, right? When they get synced up, they automatically get put into a dynamic group that grants them access to this application, and they automatically get provisioned to that application. It's a really, really nice way to do this.
And then more importantly, these users actually get removed from this application when maybe they change roles or they don't work there anymore. We don't have to rely on somebody going in to manually remove them from that application. Some stuff that you'll see in here is that you can monitor for those create, update and delete events. If you see skipped events, this means those users were not in scope for provisioning. There are lots of applications we have that support this. If you go to aka.ms slash azuradappgallery, take a look at this and go convert your applications to make sure they're using provisioning. And if you have a SaaS app that supports SCIM, you can have them get added to our app gallery. If you go to aka.ms slash azuradapprequest, we have a whole team of people that are happy to onboard this application for single sign-on as well as SCIM. But if the vendors don't support SCIM, please tell them to support it. This is what everyone needs to be using, and then we can start doing these provisioning and deprovisioning events through Azure AD. It makes it easier for everybody else. And then there's a new thing here that Grace is gonna talk about, because this has impacted her customers, a real-life customer story. Yeah, so recently, and it kind of went under the radar actually, I think, we released a new API that's currently in beta. That's the Graph API for last sign-in date time. This is where, for each interactive sign-in that was successful, you get an update to that underlying data source. This doesn't live in the audit or sign-in logs. It's a separate attribute. And what that allows you to do is detect inactive accounts, so you can evaluate that last sign-in date. It's a really nice Graph call. I've put it on the screen here. This is looking for somebody that's got my display name, and it can show you that, yes, I was logged in at 23 minutes past two.
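The slide with the Graph call is not reproduced in this transcript, so here is a hedged Python sketch of what such a query and an inactive-account check might look like. It assumes the beta `users` endpoint exposes a `signInActivity` object with a `lastSignInDateTime` field; the helper names and the 30-day cutoff are illustrative choices, and the code only builds the URL and evaluates an already-fetched response, without handling authentication or paging.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

GRAPH_USERS = "https://graph.microsoft.com/beta/users"


def build_query(display_name_prefix):
    """Build a Graph URL that selects sign-in activity for matching users."""
    flt = f"startswith(displayName,'{display_name_prefix}')"
    return f"{GRAPH_USERS}?$filter={quote(flt)}&$select=displayName,signInActivity"


def parse_timestamp(value):
    """Parse an ISO 8601 UTC timestamp such as 2020-09-01T14:23:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def inactive_users(users, now, max_age_days=30):
    """Return display names whose last sign-in is older than the cutoff.
    Users with no recorded sign-in are treated as inactive (a design
    choice in this sketch, not documented behaviour)."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for user in users:
        last = (user.get("signInActivity") or {}).get("lastSignInDateTime")
        if last is None or parse_timestamp(last) < cutoff:
            stale.append(user["displayName"])
    return stale


if __name__ == "__main__":
    sample = [  # hypothetical response payload
        {"displayName": "Active User",
         "signInActivity": {"lastSignInDateTime": "2020-09-25T14:23:00Z"}},
        {"displayName": "Stale User",
         "signInActivity": {"lastSignInDateTime": "2020-08-01T09:00:00Z"}},
    ]
    print(build_query("Grace"))
    print(inactive_users(sample, datetime(2020, 10, 1, tzinfo=timezone.utc)))
```

Keeping the evaluation separate from the HTTP call means the same check can run against data pulled by Graph Explorer, a script, or an existing reporting job.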
And the reason why this is really helpful is that as we move to a hybrid state, especially at the minute where we're working remotely in a COVID world, I had a customer, in fact two customers with two different scenarios, who didn't know this existed, and it really helped them. They've got a bunch of scripts for reporting, auditing, and deprovisioning of accounts that are all connected to the last logon timestamp on-prem. These are set to, say, 30 days after last logon: if that date is over 30 days old, they actually start deprovisioning that account. They block sign-in. They reallocate licenses, because that will potentially be connected to some JML (joiner, mover, leaver) processes that haven't quite been automated end-to-end or aren't being triggered manually. They're being triggered by the length of time an account hasn't authenticated. That's no good if a user has simply not been logging into AD. Especially in the cloud-first world, if a user doesn't need to log in through, say, the VPN and doesn't need line of sight to a domain controller, then that date or time is not going to be updated on-prem, but the last sign-in date time in Azure AD will be. So I really recommend that you have a look at this, see where you are potentially reporting on that sign-in date time on-prem, and see where you might need to adjust your reporting or alerting to have a look at this new Graph API call. 
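One way to adjust such a script, sketched here under stated assumptions, is to take the later of the on-prem and cloud timestamps before deciding that an account is stale. The sketch assumes the on-prem value is stored as a Windows FILETIME integer (100-nanosecond ticks since 1601-01-01 UTC), as the AD `lastLogonTimestamp` attribute is, and that the cloud value is an ISO 8601 UTC string. The function names and the 30-day window are illustrative, not the customers' actual scripts.

```python
from datetime import datetime, timedelta, timezone

# Windows FILETIME counts 100-nanosecond ticks from this epoch.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a FILETIME integer to an aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def last_activity(on_prem_filetime, cloud_sign_in):
    """Return the most recent of the two timestamps, or None if both are missing."""
    candidates = []
    if on_prem_filetime:
        candidates.append(filetime_to_datetime(on_prem_filetime))
    if cloud_sign_in:
        candidates.append(
            datetime.fromisoformat(cloud_sign_in.replace("Z", "+00:00"))
        )
    return max(candidates) if candidates else None


def should_deprovision(on_prem_filetime, cloud_sign_in, now, max_idle_days=30):
    """True only when neither directory has seen the account inside the window."""
    latest = last_activity(on_prem_filetime, cloud_sign_in)
    # Accounts with no recorded activity at all are treated as stale here;
    # a real script may want a grace period for newly created accounts.
    return latest is None or now - latest > timedelta(days=max_idle_days)
```

With this shape, a user who has not touched a domain controller for months but signed in to Azure AD last week is left alone, which is the failure described in the customer story.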
It actually got to the point, unfortunately, with one of my customers where, about three months after they went fully remote, users who weren't connecting back on-prem were suddenly losing their access, getting blocked out of their accounts, or having passwords reset to something they didn't know. That was all because these scripts weren't looking at the Azure AD authentication, where those users had been interactively logging in quite happily for X number of days to, say, Exchange Online through the Outlook client. So take a look. You can use the Graph API and call it however you want, whatever your flavor is, whether that's Graph Explorer or some apps that you've developed. You can search for users by name or by date, depending on your reporting system. So definitely take a look at that and see where you can leverage it as part of your reporting. Okay, so we're at the end of our session. Here are some go-dos. Of course, I cannot emphasize this enough. We started off this session with Mark hammering this home, and I'm going to finish the session the same way. Please, please, please turn MFA on for your admins. Just do it, just do it now. Just go per user, or put them in a conditional access policy as a group. There's even a conditional access policy setting where you don't have to have the users in a group: you can target anybody with a directory role, that is, a privileged account. So just go do that, please. Next up, of course, if you want to be a hybrid health hero, as hopefully you all do, please use Azure AD Connect Health for ADFS and Azure AD Connect. Make sure you're looking at those alerts and fixing those alerts, and of course, turn on smart lockout for ADFS. Make sure that you're protecting those accounts as well, those sync accounts and domain accounts, by not giving them any more privileges than they need and by rotating your passwords. 
Also, please protect the Azure AD seamless SSO computer accounts, and move to hybrid Azure AD join or Azure AD join if possible. And of course, these are tier zero. You want to treat them the same way as you would your domain controllers. Another, a penultimate go-do, is to make sure that you're getting all of the Azure AD logs into your SIEM through the Event Hub, both interactive and non-interactive. Of course, watch out for the increase in data by five to 10 times, so that your SOC teams don't come knocking on the door when they've got a huge Azure consumption bill or a SIEM consumption bill. Still, it's best to have more data and not need it than not be able to investigate something. And last but not least, as a Brucey bonus, as we'd say in the UK, please go and read the Azure AD operations guide. You can find it at aka.ms-aadopsguide. Thank you very much for listening. I hope you now all feel empowered to be a hybrid health hero. I just love that you ended with a Brucey bonus. I'm expecting the potter's wheel and the soft toy, and I'm sure our UK viewers will get that reference. There was so much information in there. I don't know anybody who would watch the session and not have at least one thing to go and do now, or at least double check in their environment that it has been done, if they think they really are on top of things. We will certainly pull a list of all of those URLs so that people can access those easily as well. And if you're interested in following on with the conversation, come and join us at aka.ms-ops106-chat. I'll put that URL below. You can come and join our channel and have a chat, ask questions, talk about this session. And if you're interested in finding the recording of the session, a copy of the slide deck, and all of the other related sessions for our IT Ops Talks: All Things Hybrid event, come and find us at aka.ms-ops-talks. Mark and Grace, thank you very much for your time today. Thanks for having us.