Cool, so we both work in the CTO office, in the AI Ops group. One of our projects is the log anomaly detector, which I lead; Michael Clifford works on the core ML part. We want to share some lessons learned from running machine learning in production for one of our internal customers.

So we're going to spend the next 30 minutes or so talking about some of the challenges of building a machine learning system and running it in production. This project started with Michael Clifford's Jupyter notebook and became a whole application, with a lot of other features requested by the internal customer that we ended up building. We'll go over the architecture, and we'll go over some core ML concepts in natural language processing, which shows up in a lot of places: voice assistants and many other systems use NLP.

Don't all raise your hands, please: who knows what logs are? Does anybody know what logs are? OK, even if you know, I'll still tell you. Logs have a timestamp. They have a type: they can be debug, they can be info, and there are other flags in terms of severity. The interesting part is the message. A lot of systems, operating systems for example, write logs; whenever you have a product and you want feedback, you might log some information about how your customers are using it. So logging is used widely, all over computing. If you have a log and you have two applications that are deterministic, you can probably recover your application back to the state it was in by replaying the same events that happened in the log. So logs have a lot of uses.

I guess none of this is new, and logs might be boring for some people. When you're debugging systems, sometimes you have to read a lot of logs. Logs exist in databases as transaction logs, to track failures for example; people have to dig through logs and check error messages to remediate issues. There's web analytics: how many users you're getting from different geographies. You can pull metadata from your logs and build dashboards. There's product analytics, getting feedback about how well your product is doing with your customers; and when a product crashes, it might send some logs back for further analysis.

For our purposes, we wanted to use logs to help with root cause analysis, and we'd like to share some of the lessons learned when we decided to use machine learning to solve some of this. As I mentioned, logs are very boring to read; there are just too many log lines. Who here loves reading log lines? One person. OK.
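To make that structure concrete, here is a minimal sketch of parsing one log line into the timestamp, severity, and message fields described above. The format string is a made-up example; real applications each define their own layout.

```python
# Split a raw log line into its three structured parts: timestamp,
# severity flag, and free-text message. The layout here is hypothetical.
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"\s+(?P<severity>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
    r"\s+(?P<message>.*)$"
)

def parse_log_line(line: str) -> dict:
    """Return a dict with timestamp, severity, and message keys."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"Unrecognized log format: {line!r}")
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    return record

print(parse_log_line("2019-05-01 12:00:03 ERROR connection to db lost"))
```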
I don't. If I can automate it with machine learning and save myself some time, I'd rather work on more interesting things. So one thing we wanted to think about is: can we use machine learning to identify patterns in our logs, anomalies for example? Say our application is working, and suddenly an event happens where the system fails. In our case, this would be considered an anomaly. It rarely happens, but when it does, I want to know about it, and maybe I want to do something about it. And like all machine learning systems, it may sometimes produce false positives, false predictions, and we had to innovate on how to solve that problem as well. We'll share what we had to build in-house to solve it.

Just at a high level: things that use data are considered data products. Facebook's news feed is a data product, for example. A data product may have one machine learning model, or it may have many. When you deploy a machine learning system in production, you want to make sure you track metrics: how well is it performing, how many predictions is it producing, how many false positives are you getting? And if you can get a subject matter expert to give feedback, that further enhances the system. You may also want to do some A/B testing and try different experiments. And yeah, Michael, maybe you can share some of the motivations.

Yeah, sure. I'm going to talk a little bit about a higher-order goal I keep in mind when thinking about real-time log anomaly detection, and why somebody would actually want to do something like this. Obviously Zach talked about reading fewer logs, but the higher-order goal is that there's a cost, in time, money, and general resources, to applications being down. When something goes wrong with your application, there are two big costs. One is developer time: both figuring out what went wrong and then fixing it. And while that application is down, if you're, say, an e-commerce site of some kind, those transactions are not occurring, so you're losing money; or if you just run some service, you're getting user-confidence erosion. So down applications are a very bad thing.

Both of these issues are directly correlated with time, so the way I like to think about this problem is: how can we reduce the amount of time an application is down? Automated log anomaly detection is the tool by which we decrease that time, and the particular implementation we're working on now focuses on the sleuthing time a developer has to spend figuring out what exactly went wrong.
So ideally, instead of looking at a whole stream of logs and figuring out which one went wrong, you develop a system that can highlight, alert, and order the logs that came through by level of severity or anomalousness. That's the near-term goal, and directly what this project is focusing on. The far-term goal would be something with automated fault recovery built in. That's outside the scope of this particular project, but ideally, if this goes well, it would be the foundation for that type of system in the future. And now Zach will tell us a little more about the design challenges we had to overcome to solve these problems.

So, when you're collecting a lot of data: how do you create meaning from a log message? And how do you do unsupervised machine learning on it when you don't have labels for the data coming in? Another part was having subject matter experts give feedback, to confirm whether the model was actually predicting the results you expect. And then there are system scalability challenges. With Elasticsearch, for example, if you try to pull more than about 10,000 log messages in a single API call, the client may run into scalability issues. There's also the sheer volume of data. And when you're designing a system like this, sometimes there's no blueprint for exactly how to solve, say, the feedback-loop problem. But that's what open source is all about: innovating, and inventing what doesn't exist.

This is the high-level architecture of the system. The orange box with model training and inference is the component Michael Clifford created and works on, the core brains of the system. Michael is going to go more in-depth on that training piece and give a high-level view of natural language processing and how those bits tie together. The other component is called the fact store, an application that tracks the log events and also tracks the feedback the user provides. For example, when a user gets a prediction, they get an email with a link that directs them to the fact store to give feedback: was this an anomaly, yes or no? They can also provide notes with more detail on why.

I mentioned on a previous slide that you should monitor the performance and quality of your machine learning system, how well it's performing. That gives the data scientists more insight into how well their code is actually running, pushes us toward better results, and lets us run experiments and weigh in on how well each one is working out. Finally, the way we send notifications with prediction results is through a system called ElastAlert. ElastAlert listens to an Elasticsearch index, and whenever data matches a particular query, it sends an email notification to the user. Now I'm going to pass it over to Michael to give us more in-depth detail on the wild, crazy world of NLP.
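As a rough illustration of the slice-based log pull mentioned above, here is a sketch using the elasticsearch-py 8.x client. The endpoint, index name, and `@timestamp` field are placeholder assumptions, not the project's actual settings; the point is that each request stays under Elasticsearch's default 10,000-hit result window.

```python
# Pull one bounded time slice of logs at a time so each query stays
# under Elasticsearch's default 10,000-hit result window.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def pull_log_window(index: str, start: str, end: str, max_entries: int = 10_000):
    """Fetch at most one result-window's worth of logs for a time slice."""
    response = es.search(
        index=index,
        size=max_entries,
        query={"range": {"@timestamp": {"gte": start, "lt": end}}},
        sort=[{"@timestamp": "asc"}],
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]

# Pull the last 60 seconds of logs as one slice.
recent_logs = pull_log_window("app-logs", "now-60s", "now")
```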
Cool, thank you. So I'm going to talk a little about the problem statement around log anomaly detection and how you actually do it. But first I want to step back and help everyone get an intuition for what log anomaly detection is. Here we have a data set. Can anyone tell me where the anomaly is? Yes, that's right: there's an anomaly up in the upper left-hand corner. If you have some statistical background, you might be able to tell this is a two-dimensional Gaussian cloud of points, and that point was clearly not generated from the same underlying probability distribution. But even without knowing that, humans are very good at solving this kind of problem in 2D.

But what if we had a different type of data set, one that wasn't spatially distributed or represented geometrically like this? For example, let's say it requires subject matter expertise. If you're understanding the talk we're giving right now, I'm going to assume you're all subject matter experts in English. So if I gave you this small data set of character strings, red, yellow, green, blue, and chair, I'm hoping you can tell me which one is the anomaly. Any volunteers? Yes, that's correct: obviously chair is the anomaly. So I'm glad you're getting an intuition for this type of problem. The question is: how do you formalize it in a way that a computer can do it for you? What if you're trying to solve the same problem in eight dimensions, geometrically, or you have a million words to look through? How exactly would you do that? It becomes a task well suited to automation, if you can find a way to get this kind of data representation into a machine learning tool.

The way I think about the log anomaly detector (LAD) core is as existing in three parts. First, there's your source of logs, a semi-structured stream of logs. I say semi-structured because, as Zach noted earlier, logs do have some inherent structure: there's a timestamp, there's a flag, and there's a message. But a lot of the rich information is in that message, and the message is an unstructured data type, essentially a variable-length list of character strings, which, barring some pretty cutting-edge deep learning techniques, is just not well suited to most machine learning algorithms. So we needed a method to go from the semi-structured stream of logs to some kind of text encoding: an actual fixed-length vector representation of our logs that ideally retains some kind of semantic meaning we can use to represent them. And finally, logs are application-specific. There are tons and tons of them, and it's really expensive to get a labeled data set of these things; you'd basically have to have the application developers sit there and manually label each log, which is just not feasible. So you have to think about this as an unsupervised learning problem.
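To make the geometric intuition above executable, here is a toy sketch: score each point in a 2-D Gaussian cloud by its distance from the mean, and the planted outlier wins. Nothing here is specific to the project; the same arithmetic works unchanged in eight dimensions, where eyeballing the data is no longer an option.

```python
# Score each point by its distance from the cloud's mean; the planted
# outlier ends up with the largest score.
import numpy as np

rng = np.random.default_rng(42)
cloud = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # normal points
cloud = np.vstack([cloud, [[8.0, 8.0]]])               # one planted outlier

distances = np.linalg.norm(cloud - cloud.mean(axis=0), axis=1)
print("most anomalous point:", cloud[distances.argmax()])  # roughly [8, 8]
```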
The three particular tools we're implementing in this instance: Elasticsearch is our source of logs. There are a few different ways to get the logs in, but that's what we're using currently. Then we're using Word2Vec as our encoding method, and finally a self-organizing map as our inference engine, to actually make decisions about these anomalies.

I'm going to talk a little about what Word2Vec solves, again so you can understand what's going on here. This is a super-simplified example of Word2Vec. Essentially, you have some data set, some corpus of natural language: here, "the cat sat on the mat." It identifies each of the individual elements in that data set and sets up your vocabulary, one row for each unique word. Then, through a deep-learning encoding process, it determines dense, fixed-length numerical vectors that ideally retain the semantic meaning of the natural language you put in.

If you go to the next slide: the way it does this, over many, many iterations and tons of data, is by looking at the words that surround a particular word and attempting to predict that middle word. After doing this millions or billions of times with a neural network of some kind, you can extract one of the hidden layers and use it as an encoding tool to convert these variable-length character strings into fixed-length vectors that ideally maintain their semantic meaning.

If you go to the next slide, I'll give you the classic example of what it means to retain semantic meaning. I don't know if you're familiar with Word2Vec at all, but this is a very canonical example: if you take the word vector for "king," subtract the word vector for "man," and add the word vector for "woman," you end up somewhere in your n-dimensional word-vector space very close to the vector for "queen." In this sense, we can visualize the data and do arithmetic on it that is actually meaningful to subject matter experts in English.

The goal is to do the same thing with our logs. Instead of words, we have logs that exist in this n-dimensional log space we've created using Word2Vec, and hopefully we get the same kind of thing: we're able to identify clear outliers from the rest of the data set. So now we've gone from variable-length character strings to a space in which we can represent our logs, do arithmetic on them, and visualize them. More importantly, we can feed them into machine learning algorithms to make decisions about them.

The unsupervised machine learning algorithm we decided to use for this project is called a self-organizing map, or Kohonen map. What you're looking at here isn't logs; it's a color space, used as an analogy to show how high-dimensional data is processed by this tool. Basically, you start from some unorganized, random map, and through many iterations it learns the ordered, underlying distribution of the data it's being trained on. Here, it's essentially learning the color spectrum, which is what we want: we want to do the same kind of process with logs.
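Here is a minimal sketch of the Word2Vec step just described, assuming gensim 4.x. The toy corpus is far too small to learn real semantics; the analogy query at the end only becomes meaningful with a large training corpus.

```python
# Train a tiny Word2Vec model and pull out a fixed-length word vector.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=25,  # fixed length of every word vector
    window=2,        # how many context words surround each target word
    min_count=1,     # keep every word, even ones seen only once
)

print(model.wv["cat"].shape)  # (25,): a dense, fixed-length encoding

# With a real corpus, the classic king - man + woman analogy looks like:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```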
So ideally, after taking these vectorized logs and running them through the self-organizing map, we're left with a map that reflects the underlying distribution of what our normal log space should look like. Once we've done that, we have a trained map that, like I said, represents what "normal" is in our log stream. Then we can take in a new log. Maybe we've seen it before, maybe we haven't, but it's new. Let's say we do something kind of silly and use a three-dimensional Word2Vec encoding, so we can represent the log as this very specific shade of red. We can compare it to every other node on the map and find its best matching unit, and from there we can simply measure the distance, in this log space we've created, to determine how far away it is from its nearest neighbor on the map. From that we can generate an anomaly score, then set a threshold: anything over 0.95 we consider an anomaly, or however we want to do it. From there we can alert the user, or whatever we want to do. So we've now gone from a stream of unstructured character strings, to a fixed-length vector embedding that hopefully retains some semantic meaning, to an inference engine we can use to tell us something about anomalies as they're rolling through.

A question from the audience: do you only check the distance to the nearest neighbor, or the distance to the centroid itself? So there's no centroid? Sorry, let me explain my question: if you have two outliers, they both have nearest neighbors that might be very close, but they're both really far out from the main mass of the data.

Yeah, so the self-organizing map kind of self-corrects for that problem. You have so many logs, but only so many nodes on the map, so the nodes aren't replicas of your data set; they're approximations of things it has seen before. Only things actually captured on the map are relevant, and those should generally not represent outliers.

Yeah, I'm just going to go ahead and talk about running this in production. As I mentioned before, we do have a system currently doing this training and running all this stuff, and I'm going to go into more detail on what we've learned in production. One thing that's very important is to have subject matter experts who can validate the anomalies. One thing we've learned is that when there are no anomalies, the system doesn't report any; when there is an anomaly, it may report one. As I mentioned before, emails get sent with a link the subject matter expert can respond to, giving us feedback. We want a system that self-corrects and learns from past mistakes, like false predictions. So Michael Clifford had a hypothesis: if we increase the frequency of that noise in the training data set, we can decrease the score for the log that was reported as an anomaly. And that did cause the score to decrease and go under the threshold, so on the next training run the model no longer reports this log as an anomaly, because the feedback system we had in place provided that feedback.
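A minimal sketch of that scoring step, under the assumption that the trained map is simply a (rows, cols, dim) NumPy array of node weights and each log is already a fixed-length Word2Vec vector. It also includes the mean-plus-N-standard-deviations threshold rule that comes up later in the configuration walkthrough; none of this is the project's actual code.

```python
# Score a log by the distance to its best matching unit (BMU) on the map,
# and derive the anomaly threshold from the training-set score distribution.
import numpy as np

def anomaly_score(som: np.ndarray, log_vector: np.ndarray) -> float:
    """Distance from a log vector to its best matching unit on the map."""
    nodes = som.reshape(-1, som.shape[-1])  # flatten the node grid
    return float(np.linalg.norm(nodes - log_vector, axis=1).min())

def threshold(som: np.ndarray, training_vectors: np.ndarray, n_stdevs: float = 3.0) -> float:
    """Mean plus N standard deviations of the training-set scores."""
    scores = np.array([anomaly_score(som, v) for v in training_vectors])
    return float(scores.mean() + n_stdevs * scores.std())

# Toy usage: a random 30x30 map over 25-dimensional log vectors.
som = np.random.rand(30, 30, 25)
train = np.random.rand(1000, 25)
cutoff = threshold(som, train)
new_log = np.random.rand(25)
print(anomaly_score(som, new_log) > cutoff)  # True would trigger an alert
```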
As you can see, the score was up here, and then it went down here, because we increased the count to 10,000 log lines for that particular message. And one thing that's very important: when you have hypotheses about how these systems work, you want a CI system and unit tests in place, so that when you build your code, these things get validated.

A little more detail on the fact store: it's a Flask web application that uses SQLAlchemy as the backend to connect to a database. We're currently using MySQL, but you could plug in PostgreSQL or any other database provider with just a connection string, so it's very easy to switch databases. It's a really good way to validate any concerns.

The other thing that was important for us to build was dashboard metrics, something graphical. This is an older dashboard, but I'm going to show you a live demo of the new dashboard, which is still in development. This is the email that gets sent to a user, saying that this log line was an anomaly, and there's a link here. We automatically generate an ID for every prediction, and that prediction gets stored in our fact store. Then we have a foreign key in the other table, the feedback table, which connects the event, the prediction, and the feedback given by the subject matter expert. The fact store is a super simple web application: it's a form you submit, and it saves the metadata to the database.

Let's do a demo, because after all this talking and presenting, it's better to see something real. Wouldn't you agree? OK, so we'll show you the demo. This is the dashboard. We have logs coming in; these are normal logs; this is an anomaly. This is the current ingest of data, so right now it just ingested four logs. This is the total number of anomalies found, this is the average score, this is the threshold, and this is the number of false positives. For example, when a message is reported as a false positive by our subject matter experts, one, we track it as a metric if we see it happening again, and two, we stop the subject matter expert from getting another email, because once you give feedback, you should never get that false prediction again. So we're checking the model in two places. The first check is at training time, making sure the model doesn't produce that false positive prediction anymore. The second is that if we do see that same message come in again, one from the list of messages we've seen before that were false positives, we track it. So this is a pretty simple dashboard that just shows some information on all of that. And this project is in development, so let's simulate, say, a lot of anomalies happening.
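As a rough picture of the fact store described above, here is a sketch of a feedback endpoint using Flask and Flask-SQLAlchemy. The table and field names are illustrative guesses, not the project's real schema; the connection string is why swapping MySQL for PostgreSQL or SQLite is a one-line change.

```python
# A toy fact store: one table linking SME feedback to a prediction ID,
# and one POST endpoint that the emailed feedback form would submit to.
from flask import Flask, request
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///factstore.db"  # or mysql://...
db = SQLAlchemy(app)

class Feedback(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    prediction_id = db.Column(db.String(64), nullable=False)  # ID from the email link
    is_anomaly = db.Column(db.Boolean, nullable=False)        # the SME's yes/no verdict
    notes = db.Column(db.Text)                                # optional free-text detail

@app.route("/feedback", methods=["POST"])
def submit_feedback():
    """Persist a subject matter expert's verdict on one prediction."""
    db.session.add(Feedback(
        prediction_id=request.form["prediction_id"],
        is_anomaly=request.form["is_anomaly"] == "yes",
        notes=request.form.get("notes", ""),
    ))
    db.session.commit()
    return "feedback recorded", 201

if __name__ == "__main__":
    with app.app_context():
        db.create_all()
    app.run()
```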
No smoke and mirrors here; it's good to have no safety net, to just do stuff and hope your demo doesn't fail. I have a lot of things on my screen, please don't judge. And I lost my terminal... and I found my terminal, here it is. Is the font big enough? OK.

So this is the regular YAML file. The code is Python, of course, so I run python app.py, and there's a configuration file that comes with it. Let's look at that configuration file. It takes certain parameters: the name of the model file; the time span, the slice window of data you want to pull down at a particular time; the max entries (as I mentioned before, Elasticsearch has that limitation where if you pull too much data it may crash, so we only pull down a certain amount); and how many times we run inference. We do the first training run, where we pull the data down, train, and produce a model; then we do the inference part, where we pull new logs down and run inference, and there's an inference loop that happens. This is all configurable, so if you want to configure it, you can.

And then the threshold: the bar I was showing you earlier, for how high a score should be before it counts as an anomaly. This threshold is a calculation; maybe Michael Clifford can tell us more about that.

Yeah, so to set the threshold, we gather some information about the distances of things from the map. If we can assume our data is Gaussian-distributed, a good rule of thumb for what counts as an outlier is three standard deviations away from the mean. So basically this is just a number you can tune, depending on how many standard deviations away from the mean you want to consider anomalous.

And furthermore, these are the endpoints for connecting to Elasticsearch, and that's pretty much the configuration file. There are more configuration options, but you can check out our GitHub; we document all the different parameters you can set.

OK, so let's mess with our demo a bit and simulate more anomalies. If we set this threshold to something like 0.1, then the bar is going to be really low, so whatever comes through might get tagged as an anomaly. This is just for demonstration purposes; of course we wouldn't do this on a production system, but it gets the idea across. Like I said, no safety net; that's how we roll. It's training. I don't know what that's for. It's pulling some data down; hopefully it's going to work. Yeah, it's training: you can see the epochs and the time the training took. It found some anomalies, and it has the scores there. We do track a lot of this in Prometheus, so you can go into Prometheus, and then it'll stop and wait until the next sliding window, when it pulls more data down and does the prediction again.

So I think we've covered a lot of ground here. Before we go and summarize, does anybody have any questions? Everybody's putting up a hand. OK: one, two, three.

Yeah, you mentioned that in the case of a false positive, when something comes up as an anomaly and it's not an anomaly, how do you make sure the model corrects for it?
Are you creating more entries that are similar to the false positive, or how are you correcting it?

Yeah, that's exactly right. Basically, on the next retraining iteration, we make sure to include many copies of that false positive, so that the model learns to associate it with normal behavior.

Mine is a good follow-up to that one: so you retrain with a huge duplication factor on certain logs, but a SOM has a kind of finite memory, right? If it's 30 by 30, it can store roughly 900 archetypes. Have you ever noticed it starting to forget other normal things when you reweight like that and force it to learn something?

Yeah, I think there's a trade-off there. That's definitely an issue: if you pack the training set with a bunch of copies of the same thing, the map will start to become homogeneous. But when you're trying to solve the false positive case, this is one solution we came up with. Definitely, keeping this number finite, and treating it as a hyperparameter to be tuned, is something to consider.

Back there, yeah. How do you handle false negatives?

We're still working on it; this is in development, and we're still making commits to it. It is a good question: how do you find the things that haven't been found? How do you quantify that number? It's a tough problem for unsupervised learning. But one thing we do know is that when there are no anomalies, it hasn't reported any; when there is an anomaly, that's where it's in question, and that's what we want to improve on.

Before it was running in production, before you had a feedback mechanism providing labels for the data, how did you decide it was a working anomaly detection model? How did you convince the group to put it into production?

Working alongside the subject matter experts is basically what we found to be the best way to do it, because they're ultimately going to be the end users. You provide them with a sample: you work with them, get a sample data set, run it through, and ask them, does this seem reasonable to you? Based on that is how we're currently doing it. Ideally, we'd like to have a golden labeled data set we could use to pressure-test all of this; that's still in progress.

Yeah, there's a question over here. Oh, a power cord; excuse us.
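Circling back to that retraining answer, here is a minimal sketch of the over-sampling trick: duplicate each confirmed false positive in the next training corpus so the map learns to treat it as normal. The duplication factor here is a hypothetical hyperparameter; as the follow-up question notes, keeping it finite matters because a 30x30 map only holds roughly 900 archetypes.

```python
# Over-sample confirmed false positives in the next training corpus so the
# map learns to associate them with normal behavior. Keep the factor finite
# and tunable, or the map starts to become homogeneous.
def augment_training_set(logs, false_positives, duplication_factor=10_000):
    """Return the next training corpus with false positives over-sampled."""
    augmented = list(logs)
    for fp in false_positives:
        augmented.extend([fp] * duplication_factor)
    return augmented

corpus = augment_training_set(
    logs=["connection ok", "request served"],
    false_positives=["cache warmed"],
    duplication_factor=3,
)
print(corpus)
# ['connection ok', 'request served', 'cache warmed', 'cache warmed', 'cache warmed']
```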
I assume that the application log, whatever it is, will have a finite number of words, even if there are ten million lines, so once the model is trained, it should be able to predict. Whereas in natural language, consider two statements: "the cat is on the mat," and then another set of words that says "let the cat come out of the bag." That may be an explosive situation. So how does it differentiate that kind of situation?

Yeah, so the underlying assumption is that the data set we train on represents a series of normal behavior. We're trying to model what normal behavior, in a normal series of logs, looks like for some period of time. If you start to get a high rate of logs that we consider anomalous, that the model hasn't seen before, or that are very far from the map, then you're going to want to say there's an issue. But if you just had those two cases and trained on those two cases, I don't think you'd be able to make a proper differentiation.

My question is that instances of the same log may change.

Yeah, if I understand your question correctly, it's the idea that logs are not natural language in the wider sense. They're not machine code, but there's only a fixed set of logs that can possibly come out of a system, so you have a smaller vocabulary to deal with. I think about that as, in a way, optimizing the Word2Vec encoding part: you'll see all the words you're ever going to see rather quickly. But what counts as normal behavior for a particular application can change over time, so you want to keep continually retraining this thing to manage the data drift, as I call it, and make sure you train it to understand what normal operating procedure looks like. Then, if behavior deviates from that too significantly, somebody gets an email. Thank you.

Are there any more questions before we close up?

Actually, I have a question that's kind of fundamental: normally when you construct a log, it's already labeled, whether it's an error log or a normal info log. So I was guessing the purpose is that your model tries to learn normal operating situations from the normal logs, not the error logs, right?

Yeah, exactly. A quick solution to this problem would just be to find all the error messages, but we're trying to do something that goes beyond that. It's basically looking for anomalies, things that don't normally happen, and making sure the DevOps folks, this customer, can get notified before it becomes a serious issue.

OK, another question, about Word2Vec: is that just for English? Did you ever try another language with this?

No, I haven't tried it with a different language, not yet. But we definitely love contributions, and this is open source, so if that's something you're interested in, we're definitely open to contributions.

And yeah, just to recap what we covered today: we went through NLP, worked through some examples to build an understanding of the LAD core, went through the architecture, and looked at running something in production and some of the considerations when running a machine learning system in production. And we learned about this project. It's open source, like I said, and we'd love contributions. Even if you just have feedback for us, questions, or something you want as a feature, we're definitely open to new ideas. Best ideas win; it doesn't matter how much experience you have, it's all about the best ideas. So we welcome them. Thank you, guys.