All right, hi everyone. Welcome to this AWS and PagerDuty session on how to take data and turn it into actionable information using AI and ML. It's great to meet you all today. My name is Julian Alcer, and I'm a product manager in our AIOps group here at PagerDuty.

For today's agenda, first we'll cover how PagerDuty and AWS partner to deliver relevant insights across the incident response lifecycle to help drive your teams to action. Then we'll talk about how to reduce noise across an entire organization's technical ecosystem using AI. Next, we'll talk about how to use AI and ML for more efficient triage in order to drive faster resolution times. Throughout this session we'll be sprinkling in some interactive polls so we can hear from you, and we'll finish with about 10 minutes of Q&A.

Before diving into the world of data, AI, and ML, let's take a step back and ask: why are we looking at AI and ML to help us? I'm sure it's not news to you that operational complexity has reached a breaking point. Despite macroeconomic woes that are driving the need to do more with less, digital transformation initiatives aren't slowing down. So issues have become more difficult to manage in an even less forgiving environment than the one we found ourselves in, say, three to six months ago.

And when disruption hits, customers feel it, and they want to know: why can't this vendor fix this faster? Why am I paying this price for this experience? The answer from us is usually, well, it's not that simple, and complex problems can't fix themselves in a matter of seconds. But when incidents impact customers, customers don't care about excuses; they want results. We all remember the moments you see captured on the right-hand side: major airline disruptions, infrastructure failures. But what is on the other side of that for the companies involved? Probably time-strapped teams, like many of you listening in today, sifting through complex systems and large data volumes to deliver the best experience possible.

What we see here at PagerDuty, when we look across all of our customer data, is that data volumes are growing at an exponential rate, around 70% year over year. As a result, the business suffers from too much noise and too much toil while sifting through those processes manually, which results in longer time to resolution and more customer impact. So dev leaders are focused now more than ever on reducing the cost of operations by streamlining processes, consolidating tooling, and leveraging automation.

Organizations today will thrive or decline based on the experience they deliver. Those that consistently serve up outstanding experiences will earn and re-earn customer loyalty, while those that don't will struggle to meet their customers' needs. And so the vision of the PagerDuty Operations Cloud is to enable teams to automate and accelerate their critical work. How do we do this? First, it means being able to plug your entire ecosystem into PagerDuty, leveraging over 700 integrations. And with PagerDuty AIOps, along with sending all of that data, you'll be able to do things like noise reduction, data normalization and enrichment, and use machine learning that recognizes patterns in both your data and key human actions, so we can surface the right information to you at the right time.
Then you can use incident response to organize and mobilize your team, combined with process automation to handle those redundant tasks, and finally ensure seamless collaboration with your support counterparts in your tool of choice.

Out of those 700-plus integrations I just mentioned, one of our top integrations is AWS. AWS is a key partner for many of our customers who are either lifting and shifting into the cloud or already cloud-native and growing. With over 250 services, AWS makes it easier for teams to take advantage of what the cloud has to offer, and together, PagerDuty enables AWS customers to operate more efficiently and effectively by improving their processes and operations. For the purposes of this webinar, we'll examine how we leverage AWS data to reduce noise, identify the root cause of a problem, and then automate incident response, all in real time, so your teams can transform your massive and ever-growing data volume into actionable insights.

PagerDuty AIOps solves some key pain points that we see across all customers, no matter their industry or size. First, everyone is drowning in noise, and we don't need more data; I don't think any of us would say we need more data. What we need is the right data, data that is actionable. Second, once you have the right data, it's becoming increasingly difficult to gain situational awareness: what other critical information are you missing? You need actionable information to help you triage effectively when seconds count. And finally, tedious, manual processes are hampering productivity; it's a trade-off between innovation and firefighting. For today's webinar we're going to focus on the first two problems, or boxes, that you see here, but for any of you interested in the third category of toil and automation, we had a fantastic webinar about a month or so ago that I'll link to towards the end.

Okay, so let's go ahead and start with noise reduction, because separating signal from noise is the first step in creating a strong foundation to improve metrics like MTTA and MTTR. This is normally the first area our customers focus on in AIOps, because if you're drowning in noise, you can't think about how to drive those metrics down for your team.

And how do you know if you're drowning in noise? Here are a few indicators. First, do you hear your teams complaining about getting notified over and over again with low-priority incidents? Another thing we hear from customers is an issue with flapping alerts: alerts that come in from AWS, notify you, and distract you, only to quickly auto-resolve minutes later. And once you start seeing a lot of these false positives, you might start responding a couple of minutes later, which is pretty natural, and it works for those transient alerts, but what about the alerts that aren't transient? So we always want to focus on driving down metrics like mean time to acknowledge. Another huge issue we hear from customers is that similar or duplicate alerts coming in from monitoring tools are also distracting them, so they're having to perform manual correlation or manual actions like merging, which is another pain point. And the last big pain point is cascading incidents, or alert storms, which is alerts firing off within the same system or across systems.
Okay, so I've talked a great deal, but now I'd love to hear from everyone here: what kind of noise is impeding your teams? Is it that you're being bombarded with low-priority incidents? Is it flapping alerts, those transient alerts? Duplicate or similar incidents? Alert storms? Or all of the above? We'll give everyone just a minute to respond and then we'll look at the results.

All right, the results are in. We have 29% bombarded with low-priority incidents, that's good to know. Flapping alerts, 24%. Duplicate and similar, 18%. Alert storms, 18%. And all of the above, 12%. Okay, good news: we're going to cover each one of these today, and I'll try to highlight that as we go. We'll actually start with those low-priority incidents.

Real quick, before diving into each of those areas and how we reduce noise for each of the answers you gave, let's take a step back and look at how we think about noise reduction at PagerDuty. It actually happens across multiple stages of our event funnel. So when you think about, okay, how would I reduce noise, and what would I actually realize, what we see is that customers can get up to 87% noise reduction by adopting one, or ideally multiple, ways to reduce noise.

How do you get to that 87%? First, deduplication: when multiple events are triggered for the same issue, your team can use things like a dedup key to group them into the same alert (there's a quick sketch of this below). Then suppression, which is front-of-pipe rules to suppress known non-actionable events; think about those low-priority incidents that were the first choice in the poll a minute ago. Then service routing, which ensures that the events coming in are actionable and map to a service in PagerDuty. And finally, the circle you see here, which is where we use machine learning to help with issues like flapping alerts or duplicate alerts. We're actually going to focus on this area a lot today, because things like suppression and deduplication are pretty well known and well leveraged in the industry. Where we do see customers missing out is in machine learning techniques to reduce noise, mostly because they're worried about things like missing a critical incident, or maybe they don't understand the exact value it can drive for their organization. So what I'm going to do is spend a few minutes opening up that black box of the ML, and talk about how we use ML and how exactly it works to solve some of these problems.

But the first piece I'll cover quickly is those low-priority alerts that are bombarding your team. Typically these are known events that you can easily suppress via a rule, and you can use things like regexes matching a whole field or part of a field. You can create those rules and suppress those known non-actionable events in just a couple of clicks, so it's a very simple but very powerful front-of-pipe technique.
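To make that dedup key idea concrete, here's a minimal sketch using PagerDuty's Events API v2. The routing key is a placeholder for your service's integration key, and the alert details are made up:

```python
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_KEY"  # placeholder: your service's integration key

def send_alert(summary: str, source: str, dedup_key: str) -> None:
    """Send a trigger event; events sharing a dedup_key collapse into one alert."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        # Two CloudWatch alarms firing for the same resource can reuse this key,
        # so PagerDuty deduplicates them into a single alert instead of two.
        "dedup_key": dedup_key,
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "critical",
        },
    }
    response = requests.post(EVENTS_API, json=payload, timeout=10)
    response.raise_for_status()

# Both calls reference the same underlying issue, so only one alert is created.
send_alert("CPU > 90% on prod-db-1", "cloudwatch", dedup_key="prod-db-1-cpu")
send_alert("CPU > 90% on prod-db-1 (repeat)", "cloudwatch", dedup_key="prod-db-1-cpu")
```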
Okay, now back to those pesky flapping or transient alerts. Again, this is where an alert arrives, notifies you, and then quickly resolves itself. For those of you who've been woken up at 2 a.m. by one of these, I'm sure you can attest to how truly annoying they are. I do want to point out that we support a rules-based approach, similar to what you saw a minute ago, to suppress these flapping alerts.

However, we've also created a capability called auto-pause incident notifications, which automatically identifies those flapping alerts for you. We apply machine learning to automatically detect and pause transient alerts that historically auto-resolve themselves within a specific time period. The way it works is you select your desired pause time, and when the model identifies an incoming alert as transient, it pauses it for that specified time. If we get the resolve event, the alert resolves on its own without ever waking you up; if we don't, it goes on to trigger a notification. What I want to point out here is that this is not permanent suppression; it's just pausing that event to allow it time to heal on its own.

Let's go a bit deeper on this one, because if you're like me, you don't trust it when someone says, hey, our ML just automatically works and it fixes the issue. So yes, we use ML, but let me walk you step by step through how this model works. The following workflow defines a high-level view of our prediction process. The main input for our prediction model is the alert title. We don't use additional information such as time to acknowledge or number of responders, because that information lives on the incident, and we want to identify an alert as transient when it's incoming, before it ever creates a notification. The main logic for identifying transient alerts resides in the two components in the middle: what we call the alert title pre-processing module and, of course, the prediction model.

The top area of the workflow represents the online pipeline. When a new alert arrives from AWS, we do some textual pre-processing to create the template for that alert. That template is then used as the input to the prediction model, that brain you see in the middle. The alert's prediction score is then compared to a minimum threshold that we use to decide whether the alert is transient. If the prediction score is higher than the threshold, we predict that the alert is transient and we pause the notification for the specified pause period. If we don't get the resolve event in that window, the alert goes on to trigger a notification to the on-call responder. Whether or not we get that resolve event for the alert will influence the threshold. And that threshold I keep mentioning is calculated using the process on the bottom, the offline retraining process you see here: the threshold is recalculated every day, so it's a dynamic process that makes sure the way we classify an alert as transient or not stays current and relevant.

Okay, I want to pause here and make a quick note for those of you who may have data science teams, or who are thinking about how to leverage this. Obviously this is a proprietary model, but it would definitely be possible for you to build your own ML to solve transient noise or some other unique problem you're facing. The only thing I would say is that in these economic times, where resources are strapped, these are capabilities you have to build, own, and maintain, and to some degree you also have to understand the data models across the different teams in your organization. To make the shape of this pipeline concrete, here's a simplified sketch.
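This is not PagerDuty's actual model, just an illustrative sketch of the pipeline's shape: normalize the title into a template, score the template against historical auto-resolve behavior, and pause the notification if the score clears a threshold. The masking rules, data structures, and numbers here are all hypothetical:

```python
import re
from collections import defaultdict

# Hypothetical historical stats: for each alert template, how often it auto-resolved
# within the pause window. In the real system this is recomputed offline each day.
history = defaultdict(lambda: {"seen": 0, "auto_resolved": 0})

def to_template(title: str) -> str:
    """Pre-process an alert title into a template by masking variable parts."""
    title = title.lower()
    title = re.sub(r"\b[\w-]+\.[\w.-]+\b", "<host>", title)  # mask hostnames first
    title = re.sub(r"\d+", "<num>", title)                   # then mask numbers
    return title

def transient_score(title: str) -> float:
    """Fraction of past alerts with this template that resolved on their own."""
    stats = history[to_template(title)]
    return stats["auto_resolved"] / stats["seen"] if stats["seen"] else 0.0

def handle_alert(title: str, threshold: float) -> str:
    # The threshold is dynamic in the real system: recalculated daily offline.
    if transient_score(title) >= threshold:
        return "pause"   # hold the notification for the configured pause window
    return "notify"      # page the on-call responder immediately

def record_outcome(title: str, auto_resolved: bool) -> None:
    """Feed back whether the alert resolved on its own (offline, in the real system)."""
    stats = history[to_template(title)]
    stats["seen"] += 1
    if auto_resolved:
        stats["auto_resolved"] += 1
```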
The nice thing about vendors like PagerDuty is that we do that heavy lifting for you, so you don't have to build it, maintain it, and hold all of that domain knowledge. And that's an overview of how we solve transient noise; I hope it helped open up the ML black box a bit when it comes to these pesky flapping alerts.

Next, let's discuss another problem: that third issue of similar or duplicate noise. This is one of our oldest but most-used ML techniques, which we call intelligent alert grouping. Something we hear all the time from customers is that an alert may not be exactly identical to another alert, but it's very close. Think of an AWS server down 500 error followed by another AWS server alert with a close but slightly different title. It's one of the most frequent types of noise, but it's also not something you can easily write rules for, because there are so many small variations in the data.

In the spirit of opening up the ML black box, the way intelligent alert grouping works is by comparing the incoming textual summaries of alerts; this is textual similarity. We identify alerts that are textually similar and cluster them together into the same incident. We also have a grouping time window parameter, a rolling time window that is configurable from five minutes to one hour, which is used to determine whether an alert is eligible to be grouped. In addition to the textual similarity, there's a background learner that adapts to user feedback, such as when an alert was missed or grouped incorrectly, and that learner adjusts future grouping decisions based on those user behaviors.

Here's a hands-on example of how that works. Here you see a sequence of alerts, and there are actually three unique alert templates: permissions host fail, no data stale email, and Splunk alert follow up 500. Now, you'd expect intelligent alert grouping to group these into three distinct, textually different alerts. However, because the user had previously indicated that no data stale email is actually related to permissions host fail, the learner was able to remember that over time and grouped those alerts into the same incident. So all in all, what you see here is that it grouped them into two separate incidents, not three, because of that user feedback. There's a simplified sketch of this grouping logic below.

Okay, so we've talked a lot about how the model works and given an example, but we understand that teams are still nervous about adopting ML, mainly because they're worried about the risk of missing an alert, and maybe they don't exactly understand the savings that are possible. For this reason, at PagerDuty we're big believers in explainable ML, and we've invested heavily in reporting that shows you, as soon as you start sending us data, exactly how the model would group your alerts and the savings you would see. Think of this as a simulator, but one using the same model that will run when it's live. As you saw a minute ago, our ML works out of the box based on textual similarity, and it starts doing this analysis as soon as you start sending us alerts. What this means is we start grouping on day one, and your users are really just fine-tuning the model; you don't have to do upfront training before adopting the feature. As a result, most of our customers see noise reduction in days to weeks.
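As a rough illustration of grouping on textual similarity plus a rolling time window plus learned user feedback, here's a toy sketch. The real model is proprietary; simple token overlap (Jaccard) stands in for its similarity measure, and the threshold and window values are made up:

```python
from datetime import datetime, timedelta

GROUPING_WINDOW = timedelta(minutes=15)   # rolling window, configurable 5 min to 1 hour
SIMILARITY_THRESHOLD = 0.7                # hypothetical cutoff
# User feedback: template pairs a user has explicitly linked (e.g. by merging),
# like the example from the talk.
user_linked: set = {
    frozenset({"permissions host fail", "no data stale email"}),
}

def similarity(a: str, b: str) -> float:
    """Toy textual similarity: Jaccard overlap of title tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def should_group(new_title: str, new_time: datetime,
                 incident_title: str, incident_time: datetime) -> bool:
    if new_time - incident_time > GROUPING_WINDOW:
        return False  # outside the rolling window, never eligible
    if frozenset({new_title, incident_title}) in user_linked:
        return True   # the learner remembers past user merges
    return similarity(new_title, incident_title) >= SIMILARITY_THRESHOLD

# Example: the learner remembers a past user merge, so these group together
# even though their titles share no tokens.
t = datetime(2023, 5, 1, 12, 0)
print(should_group("no data stale email", t, "permissions host fail", t))  # True
```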
Okay, so that's the process before you adopt the feature, right? Understanding how it would group and the savings you would see. But what about once you adopt the feature? When it's live, we also have multiple ways of showing you when alerts are grouping. First, we show how many alerts are grouped in areas such as the incident list page, so users can very quickly see which alerts, and how many, were grouped into an incident. The incident itself also provides counts of the number of alerts in the alerts table area. We have a 'grouping now' pill that is active while the model is clustering and doing that grouping. And finally, we output the reason the alert was grouped in the incident timeline, to ensure your users don't miss any alerts while grouping is happening. Again, you could build all of this yourself, but resources are crunched right now.

And here's a story from an actual PagerDuty customer. Chicago Trading Company, or CTC, is a derivatives trading firm that specializes in market trading across a variety of products and services. CTC had a few challenges before implementing PagerDuty: a legacy dashboard cluttered with unactionable events and alerts, which created delays in incident acknowledgement and resolution times; alert storms that reduced teams' ability to understand the makeup of incidents and respond to them effectively; and a lack of automation embedded in their incident response process, which led to more manual work for on-call responders. When CTC used PagerDuty to kickstart their automation processes, they started with noise reduction, saw great results, and were then able to automate, and all in all saw faster acknowledgement times and faster time to resolution.

Okay, I think the last area we had in our poll was alert storms, or cascading alerts, so I want to transition to an exciting noise reduction capability launching this year. Now, you may have noticed that everything I've discussed so far in terms of noise reduction was within one service in PagerDuty. For many customers that might be enough; maybe one service is receiving all your relevant alerts, and applying noise reduction there will do what you need. But what we tend to see with DevOps teams or central IT teams is that you usually own a few, or many, services, and you may see noise that is co-occurring or cascading across those services.

So we have a few goals for this upcoming release, which we're calling global alert grouping. The first is to enable grouping across services. Think of a time where one database service goes down, followed by cascading alerts on other database or infrastructure services. Or what about a backend service that tends to trigger a related front-end service, like a payments UI or some other customer-facing service? Reducing noise and correlating in these scenarios is our first goal. Our second goal is to deliver a holistic experience when it comes to grouping, whether on a single service or multiple services. That means allowing our users to combine different alert grouping methods. For example, we have a capability called content-based alert grouping, which is more of a rules-based approach that lets you group on a specific field or fields. But maybe you want to use that in conjunction with the ML: maybe you want us to check region or source, or a specific part of the payload, before applying the machine learning. A rough sketch of that combination follows.
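Here's one way that combination might look, purely as a sketch building on the grouping toy from earlier: hypothetical payload fields (region, source) gate the grouping with rules before the ML similarity check runs. None of this is PagerDuty's actual implementation:

```python
def global_group(new_alert: dict, candidate: dict) -> bool:
    """Sketch of combined grouping: rules-based field checks gate the ML step."""
    # Content-based checks first: only consider grouping alerts from the same
    # region and source (the field names here are hypothetical payload keys).
    if new_alert["region"] != candidate["region"]:
        return False
    if new_alert["source"] != candidate["source"]:
        return False
    # Only then fall through to the ML textual-similarity comparison,
    # e.g. the should_group() toy sketched above.
    return should_group(new_alert["title"], new_alert["time"],
                        candidate["title"], candidate["time"])
```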
So with this release, all of these capabilities will work in harmony with one another to give you maximum performance and control. Let's take a look at what this will actually look like. First, you'll notice that you enter here on this noise reduction homepage, a new area where you can see all your noise reduction settings in one place. At the bottom, going back to that preview idea of being able to understand the savings you're going to see, there's also an area that identifies the settings with the highest impact that you may want to consider enabling.

All right. Remember the problem I gave you, of multiple services with co-occurring noise, maybe infrastructure noise or other related services? Here's where you can specify those services and select the type of grouping you'd like to perform when an issue comes up, so you can correlate alerts across those services. Here is where you'd specify exact fields, if you'd like; you can use that on its own, combine it with the intelligence, or use the intelligence by itself. It will really be up to you, based on your use case. And finally, you'll also have the option to tailor the time window, that grouping parameter which determines whether an alert qualifies to be grouped.

Okay, so that covers noise reduction. Now let's dive into our last area for this session: how to use data to create actionable insights during triage. Triage is probably one of the most complex jobs for a responder; it's almost the holy grail of incident response. Here are some indicators that there's probably room for improvement in your triage process. First, you have difficulty finding the right SME, or trouble onboarding junior employees. Do you have difficulty knowing what other incidents may have contributed to yours, or understanding potentially related incidents on other services, meaning what other incidents are ongoing at the time of your incident? And finally, we always hear that an incident was due to a recent change or recent deployment, so: understanding what changes have happened.

Okay, it's that time again, the last poll for this session. Which of these pain points do you experience most when it comes to triage? Is it difficulty finding the right SME; understanding what recent changes have been deployed; unknown related problems or other ongoing incidents; finding potential origin points for the incident; or all of the above?

Okay, the polls are in. Difficulty finding the right SME: 16% of you replied with that. 11% can't see what recent changes have been deployed. Unknown related problems, about 5%. Problems finding potential origin points, 47%, and 21% all of the above. Some good responses there; we'll cover each of these areas in the next few minutes, so thanks for submitting that poll.

Okay, let's start with getting immediate contextual awareness: understanding how often an incident occurs, which gives your responders quick context when they look at an incident. We classify the frequency of an incident into three main categories based on your historical data: anomalous incidents, which have not been seen on a service within the past 30 days; rare incidents, which occur less than 5% of the time; and frequent incidents, which occur more than 20% of the time. A toy version of that classification is sketched below.
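As a toy illustration of those buckets, assuming the percentages mean an incident's share of the service's historical incidents (the in-between range isn't defined in this talk, so the "occasional" label is purely hypothetical):

```python
def classify_incident(occurrences_last_30_days: int, total_incidents: int) -> str:
    """Classify incident frequency per the categories described above."""
    if occurrences_last_30_days == 0:
        return "anomalous"   # not seen on this service in the past 30 days
    rate = occurrences_last_30_days / total_incidents
    if rate < 0.05:
        return "rare"        # occurs less than 5% of the time
    if rate > 0.20:
        return "frequent"    # occurs more than 20% of the time
    return "occasional"      # hypothetical bucket for the in-between range
```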
Probably a lot of your incidents are going to fall into that frequent category, and what this means is that your responders can get quick context on which category an incident falls into as soon as they look at it, and, in the case of those frequent incidents, take a look at what was done previously. This is where past incidents can help, because they can quickly tell you information such as when an issue happened, how often it's happened, and who resolved it, so the responder can see what was done, learn from it, and take immediate action. This is an especially useful tool when onboarding employees or when you have more junior employees, and it can also help you stop reaching out to that same SME over and over. Think of it as a quick catalog you can look at to understand what was previously done, and then act.

All right, and what about understanding other ongoing incidents that might be related to yours? We often hear about a responder working in a silo, troubleshooting an issue for minutes, sometimes hours, only to find out later that there was another issue related to their incident; it would have been good to know about that when they first started responding. Here, you can actually see multiple incidents that might be related to yours. We show things like the responders on those incidents, the business services, and the technical services, so it gives you a full view of the blast radius based on those service relationships and other ongoing incidents.

Okay, and once you understand the blast radius of your incident, there might be cases where you suspect that your issue was not the root cause, but that some other incident preceding yours was. Here we use machine learning to point to potential origin points for your incident, with information such as the service, the incident, its status, what time it happened, and a confidence score.

And finally, we always hear that incidents are frequently caused by recent changes or deployments. So instead of manually having to correlate changes to your incident, pinging someone to understand what was recently deployed, or digging into some tool to review recent changes, we surface that for you. We have three ways that we correlate recent changes to your incident. The first is based on time: did this change happen close in time to the incident you're looking at? The second is topology: did a change event occur on a service related to your service? You may know all of your own changes, but what about a related service that might not be under your ownership? It would be good to know what changes have occurred there as well. And finally, we use ML to detect similarity between incidents and changes that might have caused them. Here's a rough sketch of those three signals.
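Purely as an illustration of combining those three signals (not PagerDuty's actual scoring), a toy relevance counter might look like this; the window, threshold, and field names are hypothetical:

```python
from datetime import timedelta

TIME_WINDOW = timedelta(hours=1)  # hypothetical "close in time" window

def change_relevance(incident: dict, change: dict,
                     related_services: set, text_similarity) -> int:
    """Count how many of the three correlation signals fire for a change event."""
    signals = 0
    # 1. Time proximity: the change landed shortly before the incident.
    if timedelta(0) <= incident["time"] - change["time"] <= TIME_WINDOW:
        signals += 1
    # 2. Topology: the change hit this service or a service related to it.
    if change["service"] in related_services:
        signals += 1
    # 3. ML similarity: the change description resembles the incident summary
    #    (the Jaccard toy from the grouping sketch would do here).
    if text_similarity(incident["summary"], change["summary"]) >= 0.7:
        signals += 1
    return signals
```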
Okay, so that wraps up the triage portion. With that, we'll move to Q&A and take some questions from the audience. I don't see any questions in the Q&A yet, but I'll give everyone a minute to drop them in.

Okay, I do see that our first question has come in: for change correlation, does PagerDuty integrate with other tools, like ServiceNow? Well, we support a number of change event tools out of the box, GitHub, GitLab, and multiple others. We also have a really great and powerful feature called the Custom Change Event Transformer. By using this, you can integrate with any change tool of your choice, even a proprietary one: we take the JSON payload from that tool, whether it's ServiceNow, something proprietary, or any other tool, and transform it into a JSON payload that PagerDuty recognizes as a change event. So take a look: we have multiple supported change tools, and the Custom Change Event Transformer you can leverage as well. Great question. The sketch below shows the gist of that transformation.
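The actual transformer is configured inside the PagerDuty integration, but the mapping idea looks roughly like this. The incoming keys are hypothetical ServiceNow-style fields, and the output follows the Change Events API payload shape:

```python
def to_change_event(raw: dict, routing_key: str) -> dict:
    """Map an arbitrary change payload into the shape PagerDuty's
    Change Events API recognizes. The incoming keys below are
    hypothetical ServiceNow-style fields."""
    return {
        "routing_key": routing_key,
        "payload": {
            "summary": f"{raw['short_description']} ({raw['number']})",
            "timestamp": raw["sys_updated_on"],  # ISO 8601 timestamp expected
            "source": "servicenow",
            "custom_details": {
                "assigned_to": raw.get("assigned_to"),
                "state": raw.get("state"),
            },
        },
    }

# The transformed payload would then be POSTed to the Change Events API
# endpoint, https://events.pagerduty.com/v2/change/enqueue, the same way
# as the alert example earlier.
```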
Okay, we'll give a minute for any other questions. All right, oh, one came in right before we were about to move on. Here's the question: how much data do we need to "learn" the models correctly? Let me answer that live, it's a great question. First, as long as you're sending us data, say alert data, and assuming it's a healthy service, our ML actually starts grouping immediately, because the model is looking for textual similarity. Sometimes what we have are test services, or services with very low data volume; if you have five alerts in a month, there's not really a lot of opportunity for alert grouping. But assuming a healthy service with normal data volumes, grouping starts right away, and our customers tend to be very happy with that. We saw some of the examples earlier, like permissions host fail: we're grouping based on textually similar alerts that have a very high likelihood of being the same issue, and if we see those coming in within minutes of each other, they probably should be grouped together. So that happens immediately.

Then, in terms of learning from user behavior, you can almost think of that as learning by exception. In the case where users wanted very different alerts grouped together, something like permissions host fail with Splunk alert follow up 500, two very different alerts, as soon as we start getting the signal that you also want those grouped, we can start learning from that pretty quickly. So what I would say is the model works out of the box: we see compression rates improving within days to weeks, and your users at that point are really just fine-tuning the model.

Okay, all right, so we'll move on now to a few resources. First is noise reduction and auto-remediation with AWS; that's the previous session from about a month ago that really focused on that third box, if you remember, auto-remediation, reducing toil and those manual tasks. We also have a link here to an AIOps interactive product tour, which will show you a lot of what you saw today. And there's also a blog post, if you want to take a look, on the top three incident response problems in AIOps.

So with that, thank you so much for tuning in today, I appreciate it. I'll go ahead and hand off to Candace. Thank you so much, Julian, for your time today, and thank you everyone for joining us. As a reminder, this recording will be on the Linux Foundation's YouTube page later today. We hope you join us for future webinars, and have a wonderful day.