It's always iffy if you're talking in the last session of the conference, so I'm glad so many people are here. I'm going to talk about anomaly detection and root cause analysis, and the particular case study I'll present is based on data from a mobile 4G network. Here's an outline: some motivation, an overview, then the actual case study, and then a summary. I'll try to leave enough time for questions at the end, but it's not such a large group, so feel free to ask me something while I'm talking.

So what's the motivation? The idea is that if we can detect problems in a network proactively, then maybe there's time to fix them before they become big problems. And, like everybody else, we want to use machine learning to do this. By 4G networks I mean LTE; I think most of you have Jio, so you're all getting service from a 4G network right now. It's complex because it has a lot of capabilities: there's the radio side, and then the core side, where most of the intelligence lies. It generates a lot of data, which is different from earlier networks; every time you do something on your phone, a big record is generated. There's data from the radio side, handoff data, cell data, some experience indicators; then on the MME, which is the real brains of the network, other information is generated, like signal quality, error codes, and so forth. So that's the data we work with: it's large, and it's both categorical and numerical.

Currently a lot is done to keep networks working well, but the methods tend to be reactive, and it would be better to have more predictive models. So that's our motivation for this work. The approach we've chosen is to come up with fast non-parametric methods to detect anomalies, and then, once you've detected an anomaly, to look at the other information the system is generating to infer the root cause. If you can't infer the root cause, just detecting the anomaly is not enough from the operator's point of view; they'll say, "So what? Help us pinpoint the problem."

Some challenges: the network is complex and it does generate a lot of data. Customer expectations are high; the idea is that the network is always available, and telecom networks typically are very reliable. Maybe your view is that you lose signal, but that's often because you're in a building or an elevator; the network itself is usually pretty reliable. Another thing: with simpler networks you could use KPIs and say, "If I cross a threshold, maybe I have a problem." Now the problems are more complex and more subtle, so KPI thresholds alone are not sufficient; those methods continue to be used, but there's a need for new methods. Operators want to know before the outage happens; nobody wants unplanned outages.

Also, because these networks are very reliable and stable, there are not many failure events. That's good, but it means we don't have a lot of failure data to build a model with. And some small failure events are considered normal.
What I mean by that is that in these large telecom networks, you're not concerned if one call gets dropped; we're not going to do much about it, because either you'll redial, or some protocol may even retry setting up the session for you. But if nobody in this hotel could make a call, then it gets serious. That's what I mean by some failure events being considered normal, so we don't look at every individual record. These anomalies are rare, and some of these small problems could be a precursor to a bigger event. So the idea is to detect them and then apply measures based on the root cause.

Okay, now a quick, very high-level overview of anomaly detection, just to set the context. The dictionary definition of an anomaly is a pattern in the data that doesn't conform to what you would expect; different terms are used for this, like outliers, exceptions, and so on. In a good system, anomalies should be rare compared to the normal instances; most of the instances should be normal. Anomaly detection is actually quite a mature field in terms of theory, and it's used in many domains: intrusion detection, where you monitor network traffic to detect security violations; network monitoring, where you look at network traffic and performance logs, which is what I'm focusing on; fraud detection, like credit card fraud, where you look for anomalies in spend patterns; databases and file servers, for detecting data leakage; medical applications, which are becoming more common; and video surveillance, where, with all the cameras recording video now, it's too much for a person to watch, so there are algorithms for detecting anomalous behavior. Many domains.

A simple example of how to think about it: if you have data with just two attributes, you plot them. You might see a cluster, and then a point that doesn't belong to any cluster, which you'd call a global outlier. You might see a small cluster with some nearby data that's possibly a local outlier, because it's close to that cluster. And a small cluster itself can often be an anomaly; when we use clustering for anomaly detection, we look at the small clusters, because we suspect those are the anomalies.

[Audience question about time series] Actually, for time series data there are a lot of established methods. There are algorithms that take a time series and separate out its components: weekly variation, daily variation (you'll have less usage at night and more in the daytime), maybe even seasonal effects, where every quarter there's some fluctuation. After all of that is removed, you see whether there's a real anomaly left. So yes, anomaly detection is very typically applied to time series; I'll talk about one example, but any data you collect on an ongoing basis, you can run anomaly detection algorithms on.
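To make that concrete, here is a minimal sketch of the decompose-then-inspect idea in R (the case-study work later in this talk was also done in R). The data, the injected spike, and the 4-sigma cutoff are all invented for illustration; `stl()` is base R's seasonal-trend decomposition.

```r
# Minimal sketch: decompose a series, then look for anomalies in the remainder.
# The data is simulated purely for illustration.
set.seed(42)
hours <- 24 * 30                                 # one month of hourly data
daily <- 10 * sin(2 * pi * (1:hours) / 24)       # day/night usage cycle
x <- 50 + daily + rnorm(hours, sd = 2)
x[300] <- x[300] + 25                            # inject one anomaly

fit <- stl(ts(x, frequency = 24), s.window = "periodic")

# Whatever is left after removing trend and seasonality is the remainder;
# flag points far from its median in robust-SD units.
rem <- fit$time.series[, "remainder"]
which(abs(rem - median(rem)) / mad(rem) > 4)     # flags the spike near 300
```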
Yes, and I'm going to give some examples; I think that's the next slide. Before I get to the case study, some brief examples. We had data for about 60 routers: hourly data for one month, so 30 days times 24, which is 720 data points per router. We didn't know much about the system, and hourly data is not very fine-grained, but you can still learn a lot from it. We took the average CPU usage over the entire month for each router and plotted it: router number on one axis, average CPU usage percent on the other. Each point is an average of 720 observations, and right away you see a group of routers with very low average CPU usage and a group with high usage. This points out the importance, in any project like this, of first doing basic exploratory data analysis to see what's going on. Right away you have two groups of routers, and depending on what you want to do with the data, you can act. If you're doing capacity planning, you'd ask: why do I have routers that are barely used? Maybe I remove them, or reroute traffic. And some routers have a pretty high monthly average, which may deserve attention.

Then, with the same data, we created a box plot for each router from its 720 points. This is a bit finer-grained than the previous plot, and now you see three groups of routers: one with very low CPU usage, where even the maximum values, the whiskers, are not very high, so these are hardly used, and you'd want to ask what they're doing in the network; one with medium usage; and some, the blue ones, with very high usage. If you're having problems, maybe you focus on those. When we looked at this data and saw routers hitting nearly a hundred percent usage at certain hours, we went to the log for that router number and that hour of the day to see what was happening in the system. So you detect an anomaly, then you go looking for the root cause. Of course this is all manual, but it's something you can do at the beginning to understand your data. Never discount the value of this exploratory analysis.
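As a concrete sketch of those exploratory steps in R: assume a hypothetical data frame `router_cpu` with columns `router`, `hour`, and `cpu_pct` (60 routers, 720 hourly observations each); the column names and the 99% drill-down threshold are mine, not from the talk.

```r
# Sketch of the exploratory analysis described above, assuming a data frame
# `router_cpu` (hypothetical) with columns: router, hour, cpu_pct.

# 1. Monthly average CPU per router -- separates the lightly and heavily
#    loaded groups at a glance.
avg_cpu <- aggregate(cpu_pct ~ router, data = router_cpu, FUN = mean)
plot(avg_cpu$router, avg_cpu$cpu_pct,
     xlab = "Router number", ylab = "Mean CPU usage (%)")

# 2. One box plot per router over its 720 points -- a finer-grained view
#    that exposes the low / medium / high usage groups and extreme hours.
boxplot(cpu_pct ~ router, data = router_cpu,
        xlab = "Router number", ylab = "CPU usage (%)")

# 3. Drill into suspicious router-hours (near saturation) and then go read
#    the corresponding logs for the root cause.
subset(router_cpu, cpu_pct > 99, select = c(router, hour))
```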
You were asking about the different types of anomalies. There are point anomalies, where individual points are anomalous. There are contextual anomalies, where data is anomalous only in some context; think of temperature. The average here these days is 27 or 28, so if it suddenly went up to 40 tomorrow, that would be an anomaly. In Delhi, where I think it's closer to 40, it would not be. So the context matters. And then there are collective anomalies, where a sequence of events or observations, taken together, is anomalous; you would see this, for example, in an ECG reading. I'll also talk about subsequence anomalies later.

There are many methods of anomaly detection; which one to use depends on the kind of data you have, the problem you're trying to solve, and whether your data is labeled or unlabeled. There are classification-based methods, which typically apply to labeled data; clustering-based methods, which often apply to unlabeled data; statistical analysis for time series data, which can be parametric or non-parametric; and nearest-neighbor-based methods, where you get into the question of distance measures. There are other methods too, one of which is visualization: I've done some work using self-organizing maps and visual analytics. I won't talk about that today, but there are visualization methods you can use for anomaly detection.

Before I go further, if you want to learn the basic concepts and theory of anomaly detection, there's a very good survey paper by Chandola and others. To summarize some challenges: one is that you need domain expertise to define what is normal. The way you detect an anomaly is that you know what's normal and you measure the deviation from it, so there's context involved. Sometimes the normal behavior evolves over time, so you have to keep re-measuring normal. And for things like intrusion detection, the people you're detecting adapt, so you have to stay a step ahead of them; if you run Microsoft Windows, you're always getting security updates because something happened, or if you have a spam filter for your mail, for a while you'll hardly get any spam, then the filter grows stale and you'll get a lot again. As I said, the notion of anomaly depends on the application domain and the context. There's the big issue of whether you have labeled data for training; often you don't. You have to be careful not to treat noise as the anomaly. And the boundary between normal and outlying is not very precise.
Sometimes it's a spectrum. So these are all typical challenges, and they vary by domain.

Okay, now I'm going to go into a specific case study we did with real data. I've already pretty much described the network, so let me describe the log data. Here we started with the data: sometimes, if you're building a new theoretical model, you build the model first and then test and validate it with data, but here we were driven by the data; we have to make do with what the network generates. What we have access to is what I call network log data, generated on a per-procedure basis. A procedure means that when you do anything on your phone, a procedure of some kind runs and a big record is generated. Or if there's a handover, say you're driving in your car, you'll be transferred from one cell tower to another; or when you get off an airplane and turn your phone on, you attach to the network. These are all procedures. We get a lot of NAs, because depending on the type of procedure, not all the fields may be generated, so the data is not so clean.

We are not concerned with one procedure failing; we're interested in system-wide failures. But the data is on a per-procedure basis, which is why I said it's not labeled: per procedure we get an error code saying success or failure, but very few system-wide failures happen, because it's a reliable system. The network behavior also changes over time because of new technologies: there's 3G, 4G, and now we're into 5G. So there are not enough failure cases to develop a supervised model; that's why this work is unsupervised, and we have to deal with the network evolution. This data is called per-call measurement data, PCMD. It's on a per-procedure basis; from one machine, one MME, we get about 700,000 records per minute. And the data is not totally clean; many NAs. I think that covers all of these points.

So here's our framework, which is what we developed: a methodology for doing the data preprocessing formally. As most of you know, a lot of the time goes into data prep, and we tried to do it methodically. First we aggregated the records: they're generated at millisecond level, and we took chunks of 10 seconds' worth of data and aggregated them, so each chunk becomes one observation. This way we could also reduce the number of NAs. Then, since we're doing unsupervised anomaly detection, we apply PCA, principal component analysis, to characterize the normal data; we develop the model, and then we can apply it to new data as it comes in.

[Audience question] Like I mentioned, we're not interested in whether one observation, one procedure, is in error, so in that sense we're not losing precision. And that's a good point: the focus here is not intrusion detection; it's on failures in the network, ordinary failures. There are other methods that catch what you're talking about.
Okay, you lose a little something, but let me go through this, and then you can ask the question again if you still have the concern.

Once you detect an anomaly, you go into root cause analysis. What we used for that is a series of messages available in the same record; those are the messages executed to carry out the procedure. You can think of that like a finite state machine, and we use it to detect anomalous subsequences.

The first part is the preprocessing of the data. We aggregated into 10-second windows. For categorical features, we generated dummy variables for each category and computed the relative frequency of each category within the window; for numerical features, we computed the median value. These are pretty standard methods. Then we dropped features that were a hundred percent NAs, or had very little variation, or were not relevant to what we were doing, like IDs. After that we still had a fair number of NAs in the data, so for categorical features we replaced them with zero, meaning that category occurred with zero frequency, which is fair; and for numerical features we replaced them with the mean, which is okay since we're doing PCA. Then we standardized the data, because the variables represent many different things: we subtracted the mean for the categorical features, and subtracted the mean and divided by the standard deviation for the numerical features.

[Audience question about why standardize] The issue is that one variable could be in milliseconds and another in seconds; signal-to-noise ratio is a ratio; throughput is in megabits per second. Very different units, and if I run models on that directly, I'll have problems. So you apply these standard procedures: for a categorical variable you subtract the mean, which brings the data together, and for a numerical variable you subtract the mean and divide by the standard deviation, per variable. Then the values are close together. Yes, normalization.

[Audience question: wasn't the categorical data already one-hot encoded into zeros and ones?] Not zeros and ones. Say I have a category with five possible values. In the original data set it's one variable, saying this record was red, this one green, this one yellow. What I do in my data set is create five variables, red, green, yellow, blue, and so on, and then I count, because remember I aggregated the data over 10 seconds. If I have five reds in the window, I put five; if I have ten greens, I put ten. So it's not zero-one; it's the frequency.

[Audience question: once it's numbers, why not also divide the categorical ones by the standard deviation?] This is the standard procedure applied in these methods; I didn't make it up. Did that answer? There was another question; did that answer it too? Yes. Okay, you're good.
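Before moving on to PCA, here is a rough sketch of that preprocessing pipeline in R, under stated assumptions: a hypothetical raw data frame `pcmd` with a millisecond `timestamp`, one categorical field `msg_type`, and one numeric field `proc_duration_ms` standing in for the roughly 250 real fields.

```r
# Rough sketch of the preprocessing, assuming a hypothetical data frame
# `pcmd` with columns: timestamp (ms), msg_type (categorical),
# proc_duration_ms (numeric).

# 1. Assign each record to a 10-second window; each window = one observation.
pcmd$window <- floor(pcmd$timestamp / 10000)

# 2. Categorical -> per-window relative frequency of each category.
counts <- table(pcmd$window, pcmd$msg_type)
cat_freq <- counts / rowSums(counts)

# 3. Numerical -> per-window median.
num_med <- aggregate(proc_duration_ms ~ window, data = pcmd, FUN = median)

obs <- data.frame(window = as.integer(rownames(counts)),
                  as.data.frame.matrix(cat_freq))
obs <- merge(obs, num_med, by = "window", all.x = TRUE)

# 4. Impute what's left: missing categorical frequencies are already zero
#    counts; missing numericals get the column mean.
obs$proc_duration_ms[is.na(obs$proc_duration_ms)] <-
  mean(obs$proc_duration_ms, na.rm = TRUE)

# 5. Standardize: center the categorical frequencies; center and scale
#    the numerical variables.
cat_cols <- setdiff(names(obs), c("window", "proc_duration_ms"))
obs[cat_cols] <- scale(obs[cat_cols], center = TRUE, scale = FALSE)
obs$proc_duration_ms <- as.numeric(scale(obs$proc_duration_ms))
```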
Now we want to do PCA. First we take the normal data, do the preprocessing I described, then the standardization (or normalization, if you prefer), and then we run the PCA model. From that we create a detection model, which I'll show on the next slide, by computing a chi-square statistic. Once I've built this model with the normal data, when I get target data I do the same processing and standardization, compute the principal component scores, and make the prediction.

For those not familiar with PCA: it's a tool to reduce dimensionality. If you have a lot of variables, many of them possibly correlated, PCA creates a few new variables that are uncorrelated while retaining the information; it finds the projections with the largest variance. It's a standard method.

Based on an extensive literature review of how to apply it to our problem, and as I'll show in the results, we want to exclude the first few principal components, because they capture things related to the technology and so forth, and drop the last few, because they capture very little variation, mostly noise. We use the middle principal components to detect anomalies, because they're time-independent. We required that each retained component explain a certain range of the variation; the specific numbers we used, after some iteration, were at least two percent and not more than five percent of the variance. That's what we used for this study; for your study you'd use something different, but in that range.

So this is the PCA model. There's a paper that had already used a model like this; what we did differently is that we don't use the first two principal components, we use the middle ones, three and four. You select the components that satisfy the variance limits, then derive the chi-square statistic: for the i-th observation, T_i = sum over the middle components j of y_ij^2 / lambda_j, where y_ij is the score of observation i on principal component j and lambda_j is the sample variance of that component. You declare the i-th observation an anomaly if T_i is greater than the critical chi-square value you'd read from a table, where alpha is the significance level and the degrees of freedom m is the number of middle principal components. In this particular study m is two. We tried different alpha values, typically 0.05, 0.025, and so forth, and the results were pretty consistent, so I'll only present one set.
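As a concrete illustration, here is a minimal sketch of that detection model in R. It assumes the standardized normal-period observations are in a matrix `X_normal` and the new observations in `X_target` (both hypothetical names), and hard-codes the 2–5% variance band and alpha = 0.05 from the talk.

```r
# Minimal sketch of the PCA / chi-square detector, assuming standardized
# observation matrices X_normal (training) and X_target (new data).

pca <- prcomp(X_normal, center = FALSE)   # data was already standardized
var_prop <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the "middle" components: 2% <= explained variance <= 5%.
mid <- which(var_prop >= 0.02 & var_prop <= 0.05)
m <- length(mid)                          # degrees of freedom (m = 2 in the talk)
lambda <- pca$sdev[mid]^2                 # sample variance of each kept component

# Project target data onto those components and compute, per observation,
# T_i = sum_j y_ij^2 / lambda_j.
Y <- X_target %*% pca$rotation[, mid, drop = FALSE]
T_stat <- rowSums(sweep(Y^2, 2, lambda, "/"))

alpha <- 0.05
anomalous <- T_stat > qchisq(1 - alpha, df = m)
```

As a sanity check, the same statistic computed on the training scores (`pca$x[, mid]`) should put roughly a fraction alpha of the normal observations above the cutoff.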
For the root cause analysis: once you have an anomaly, I have these messages. In each record there's a sequence of messages, maximum 20, because that's the buffer size allocated for it. What we see is a number, and the number has a meaning: it tells you message number 20 is such-and-such. Also, with every sequence we know whether it was a success or a failure. We analyze these sequences and find the subsequences that are anomalous. The way we do that is: in the normal data there were a few hundred unique message sequences; we extracted those that always resulted in success and created a probability transition matrix from them, meaning, if you have just executed message 20, what's the probability that the next message is 31, or whatever? So we created this matrix (there's a sketch of this idea after the results below). When we get the anomaly data, we look at the dominant failure message patterns and identify the abnormal subsequences. Associated with those, other fields in the data give the error codes, and the two things combined, the subsequence and the error codes, pinpoint the cause of the problem quite finely.

Okay, since we're a little low on time, I'm going to jump to the results. The data has about 250 fields, with information on the various things of interest in a wireless network. We had a lot of normal data, from three different years (not whole years, but selected periods), and we had outage data from two of those years; so, two outages.

We took the normal data and plotted PC1 against PC2, and we found that, as I mentioned, the first principal components represent time-dependent characteristics: one cluster is the 2014 data, one is 2015, one is 2017. That clearly tells me they're not going to be good for anomaly detection, which was our suspicion. When we looked at PC3 and PC4 for the normal data, we didn't see any grouping by time; it's more random, no pattern. This led us to believe they represent time-independent characteristics.

Then we took PC3 and PC4, the middle principal components, and plotted all the data, normal as well as the two outages. The outages (the red and the purple; the green is the normal data) are distributed separately. There's some overlap, but you can see them separating out. So we applied the detection model I described to the third and fourth principal components.

Just for information: when you run principal component analysis, you can also get the names of the variables that explain the variation and generate plots of them. Looking at ours quickly: procedure duration is an important variable, and there are a number of variables, which we've lumped together here, that represent network throughput; those stand out. This is good for explaining to your client, because a principal component is abstract; they'll ask "so what?", and it's important to show them variables they can relate to.

Here are the results. The way to read them: the red line is the critical chi-square value, based on the degrees of freedom and the alpha level; it's a little less than ten. For the normal data, most observations fall below the chi-square line, so they're classified as normal. The two outages had very different characteristics. For the first, it was known that the first one or two minutes were good data and then the outage began; and indeed most of the data is above the chi-square line except at the beginning, as we expected. The second outage was quite long and complex, but again, after the first few minutes most of the data is above the chi-square line. So the model is validated.
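Going back to the message-sequence side for a moment, here is the sketch promised above: a minimal illustration in R of building a transition matrix from always-successful sequences and flagging steps that normal traffic essentially never takes. The message IDs, the sequences, and the cutoff `eps` are all invented.

```r
# Sketch of the message-sequence idea: estimate transition probabilities
# from always-successful sequences, then flag transitions in failure
# sequences that successful traffic (almost) never makes.
ok_seqs <- list(c(20, 31, 45), c(20, 31, 47, 45), c(20, 34, 45))  # hypothetical

edges <- do.call(rbind, lapply(ok_seqs, function(s)
  cbind(from = head(s, -1), to = tail(s, -1))))

# Row-normalized counts -> P(next message | current message).
trans <- prop.table(table(edges[, "from"], edges[, "to"]), margin = 1)

# Positions in a sequence where an (almost) never-seen step occurred.
flag_subseq <- function(s, P, eps = 0.01) {
  f <- as.character(head(s, -1)); t <- as.character(tail(s, -1))
  p <- mapply(function(a, b)
    if (a %in% rownames(P) && b %in% colnames(P)) P[a, b] else 0, f, t)
  which(p < eps)
}
flag_subseq(c(20, 45, 31), trans)   # 20->45 and 45->31 never seen: flagged
```

In the real system, the flagged subsequences are then combined with the error-code fields to pinpoint the failing network element.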
We trained it, and it's validated with the two outages. I know that if you were doing this academically, I could not submit this kind of paper to KDD or somewhere, because they'd say you don't have enough outage data; but in our business we don't have that many outages, and for our internal purposes this was good enough.

Then we created the message patterns. For the first outage, we looked at the ratio of hundred-percent-success patterns over time. The outage occurred after one minute, which represents six time intervals, remember, six 10-second windows. At the start the success rate was highest, then there was an abrupt drop because of the outage, and then the system slowly recovers and you see it getting better. Here we identified the subsequences; we're saying, for instance, this message shouldn't happen after that one. Then we get the error codes, which you look up in the manual; they have a particular meaning in that context. For the second outage, which was actually more complex, with much more data, again you can see the red sequences that are anomalous. You can read what these messages mean, and they indicate which element in the network was involved; and the error codes tell you that you had a problem at this node. That basically summarizes the results. So, the summary of the case study: we did the data preprocessing, used PCA, and then did root cause analysis.

Just to summarize overall: although there's a lot of theory on anomaly detection, when you start doing it in your own context it's quite messy, and it really does require strong domain expertise. In our case, when we first got access to this data, we were very excited: now we can do all kinds of things, we get close to a million records every minute, we'll just throw it all in, run random forests, do this and that. And we were not getting anywhere. Then we talked to the people who build these machines and do field support and troubleshooting; we were actually working with one of them. We did all the exploratory analysis and showed them what the data looks like, and they said, yeah, that's fine. Then this person told us: don't look at all 250 variables; focus on a few things. Our job is to improve network reliability, so look at the error codes. And one key piece of information he gave us: look at the variable I showed you, procedure duration, the time it takes to set up the procedure. If you think about it logically, if a computer is taking longer than normal to do things, it's overloaded, or there's a loop, or something is happening; so it makes sense. So before we did this big root-cause analysis, we looked at only five variables: procedure duration, those error codes, and one or two other things. We have another paper on that, a published one, listed at the bottom; it's a univariate analysis where we just looked at procedure duration, created distributions, and it was all very statistically significant; we were able to detect the outages with that one variable.
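The published paper describes the actual procedure; purely as an illustration of the one-variable idea, a robust threshold on procedure duration might look like the following in R. The simulated data and the 6-MAD cutoff are mine, not the paper's.

```r
# Illustration only (not the published algorithm): learn a robust upper
# bound for procedure duration from normal windows, flag windows above it.
set.seed(1)
normal_dur <- rlnorm(5000, meanlog = 3, sdlog = 0.3)  # simulated normal medians (ms)

# Robust location/scale so occasional failures don't inflate the threshold.
thr <- median(normal_dur) + 6 * mad(normal_dur)

new_dur <- c(20, 22, 25, 80, 95)                      # incoming window medians
which(new_dur > thr)                                  # flags the slow windows
```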
If you can find a key variable like that, the algorithm runs very efficiently. They can actually code it in C, though this work was in R, and put it inside the machine. There's a benefit to coming up with simple algorithms that run efficiently, because then you can put them right at the source; you don't have to move your data around.

So our first discovery came from somebody who doesn't do machine learning at all; he solves field problems, but he could guide us, and that was good for us. Again, it's important to do exploratory analysis to understand the data rather than jumping straight into building models. Most data is unbalanced; you seldom have data where failure events are frequent, because then you don't have a very good system, right? Root cause analysis is very application-specific, so you have to do it for your own domain. And lastly, as I mentioned, sometimes simple models can give you very good results.

That's all I have, so I'll stop there. When I put the slides up, there are links to related work and references, so you can see we didn't just do this out of nowhere; there was work we built on top of. I'm happy to take any questions.

[Audience question about preventing outages] I won't be so arrogant as to say I can prevent the outages. Our goal is to detect them very quickly, or to detect the small events early, so that something can be done to prevent the big outages.

[Audience question about the data source] It's not just router data; it's compiled in what's called the MME, from multiple elements in the 4G network, so it's much more than router data. No, it has all these fields; I'm using "log" in a generic way. Oh, okay. Sorry. All right, thank you.