Can you guys hear me in the back? Yes. Awesome. So welcome to DevCon. This is the systems engineering track, and the next talk is going to be Ignoring Alerts by Sanjay. So I'll hand over to him. All right, so we'll just give maybe 30 seconds for everyone to settle down, and then we'll start. Okay, so let's start. So I'm Sanjay. This is joint work with Uli Draper, who's right here. And so as the title says, it's Ignoring Alerts. And to set the problem up, let's look at a simple example. So if someone there can't hear me, just raise your hand and I'll speak up. So this is a toy example. Let's say you have some system you're monitoring, and the yellow line is some metric you're measuring. So it's very well behaved. It's a sine curve. You can make things this clean when you're working with artificial data. And let's say you put a threshold. You said this should never exceed the value of three. That's the red line. And of course this is a toy example, but this happens all the time. We monitor systems, whether they're rockets, they're computers, finance; time series data is everywhere. We monitor this. And this is okay. Every time it exceeds, it sends you an email and says, hey, my threshold got exceeded. Then one day this happens. This doesn't have to mean something's wrong. And let's say the system changes, the behavior changes. That doesn't have to be bad. So I'll pick an example from finance. I'm sure many of you have money invested in the S&P 500. You want it to go up. If you built a model 10 years ago and you said, oh, if the value reaches, let's say, 1200, let me know.
I'll sell because you think it's a bubble. Well, today it's at 22,000 or something like that. So it keeps going up. It's not a stationary time series. And so you cannot put hard thresholds which don't look at new data. And so if you do that here, of course beyond a certain point, every single second you're getting an alert. And this happens all the time in monitoring systems. You get a static threshold or static rules. The data changes, the system changes. And then your alerts come in and they don't mean anything and you ignore them. What's the use of that? So what we would like to do instead is to learn from the data itself. So as data changes, as the system changes, can we come up with new rules which are automatically updated? And can we use these rules to basically monitor what's going on? Can we look for signatures of events that are interesting? They can be bad things. They can be good things. And machine learning is a buzzword nowadays. But many of these techniques, which are 50 years old, 20 years old, are easy to implement. There are libraries out there. They're fast to compute. And some of them work extremely well in finding hidden patterns, patterns that are not obvious to us at first glance. So these are some applications for our audience. For sysadmins, monitoring performance of systems, detecting errors, optimizing with respect to various metrics. You want to minimize downtime. You want to maybe save energy. Maybe it's some system on a rocket. You don't want to use all your battery power. For programmers, of course, performance analysis, detecting bugs. And in general, in any root cause analysis, we generally find what happened. We record the bug. But we never use it again. Can I record various aspects of the bugs I found, of the root causes I found, and feed that to a system that can then detect new ones? It can say, look, I've seen something like this in the past. It looks like the same thing is happening again. Maybe it is. Maybe it's not.
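The learned-threshold idea from a moment ago can be made concrete with a small sketch. This is not from the talk; it's a minimal illustration, assuming plain numpy, where the alert level is recomputed from a rolling window instead of being fixed. The window length and the 3-sigma factor are arbitrary choices.

```python
import numpy as np

def adaptive_alerts(values, window=50, n_sigmas=3.0):
    """Flag points that land more than n_sigmas standard deviations
    away from the mean of the preceding window. The threshold is
    re-learned at every step, so it follows trends in the data."""
    values = np.asarray(values, dtype=float)
    alerts = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mu, sigma = recent.mean(), recent.std()
        if abs(values[i] - mu) > n_sigmas * max(sigma, 1e-9):
            alerts.append(i)
    return alerts

# A sine wave riding on an upward trend: a static "never exceed 3"
# rule would eventually fire on every point, but the rolling rule
# treats the drift as normal behavior.
t = np.arange(500)
base = np.sin(t / 10.0) + 0.01 * t

spiked = base.copy()
spiked[400] += 5.0          # inject one genuine anomaly
print(adaptive_alerts(base), adaptive_alerts(spiked))
```

The point is not this particular rule, which is about the crudest adaptive scheme possible, but that the threshold is derived from recent data rather than hard-coded.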
But it's a helpful tool. So this is one of the broadest problems you can think of, anomaly detection. Anomaly doesn't have to be something bad. It is something improbable, something unusual, something that wasn't seen before. So in this case, you again have some nice yellow curve. It's periodic. But in the third peak, you see a slight jump. That's unusual. Of course, as before, there could be a static threshold. But like we said, if the data changes, you might not necessarily catch that. And the goal is to catch that yellow bump. Of course, real systems are more complex. It's never that easy. And there are various ways of doing this. There's something in machine learning called unsupervised learning, which is when you just say: here's data, can we look for unusual things? There's a lot more work that goes into it, you don't just feed in raw data, but that's the essence. There's something called supervised learning where you build a model. And the model says, having seen some of this data, what do I expect in the next one week? Let me make that prediction. Now let me see what actually happened today, tomorrow. Let me get this new data coming in. And let's compare those two. If they're different, and again, I'm being imprecise here in my language, when I say they're different, I mean statistically different. So not just that the delta is more than zero. But if they're different, it means one of three things. My model is wrong. Something unusual happened in the data. Or a combination of both. But that's a very useful signal where you say, I know how to model a system. I expect certain things to happen. And the thing that happened is really different from what I expected. So let's look at that. So this is just an example. Even here you can see, as the orange area increases, that's the delta between what you actually observe and your model's prediction. Initially you might say, oh, it's only different by 1%, half a percent, who cares?
And then it grows and it maybe becomes 50%. So at some stage you of course start caring. This is another example. I actually don't remember what this data set is. It's... Airline traffic. Airline traffic. So from 1920 to 1963, I think. And as you expect, it's growing, it looks linear. There's some periodicity there. And the yellow curve is what we used to build our model. Sorry, not the yellow, the blue one. And the yellow one is something that we don't touch at all. It sits there. Or in machine learning terms, if you have heard these terms, the blue data is called training data and the yellow data is called test data. And so this is just one example of a model. And so I'm not going to go into details of what SARIMA is. The point of this talk is not to give a comprehensive overview of machine learning techniques or how to actually build a model and test it, just to show some examples. So the SARIMA model basically looks at the blue curve and learns a few parameters and it can use that to make future predictions. So you make future predictions and you see the colors look a bit mixed up, but the green are the predictions. The yellow is the original data. It doesn't look that bad, at least visually. You can zoom into that data set and what you see is the stuff on the left. So yellow is what you actually observe. Green is what you predicted. And SARIMA is a technique that can learn periodicity. And of course here if you look at the first peak here, it looks pretty good. As you go further out in time, they have a shift. It clearly didn't learn the right periodicity. There's some delta that adds up over time. And so there are already a couple of interesting things here. Number one, most models decay in time. So initially they work well, but as time goes on, the system changes a bit and the models don't work that well. So you have to constantly keep retraining them.
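That fit, predict, compare loop can be sketched without committing to SARIMA at all. The snippet below is a made-up illustration using a seasonal-naive forecaster (repeat the last observed period) as a stand-in model, just to show how a frozen model's gap against trending data grows with the horizon. The series and all its numbers are invented.

```python
import numpy as np

def seasonal_naive_forecast(history, period, steps):
    """Toy stand-in for a seasonal model like SARIMA: predict that
    the next `steps` points simply repeat the last observed period."""
    last_period = np.asarray(history[-period:], dtype=float)
    reps = int(np.ceil(steps / period))
    return np.tile(last_period, reps)[:steps]

def relative_gap(actual, predicted):
    """Mean absolute prediction error, relative to the data's scale."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(actual - predicted).mean() / np.abs(actual).mean()

# Synthetic airline-like series: linear growth plus yearly seasonality.
period = 12
t = np.arange(120)
series = 10 + 0.1 * t + np.sin(2 * np.pi * t / period)

train, test = series[:96], series[96:]   # training data / held-out data
pred = seasonal_naive_forecast(train, period, len(test))

# The model is frozen at the end of training, so its error against the
# trending test data grows with the horizon: model decay in miniature.
gap_near = relative_gap(test[:period], pred[:period])
gap_far = relative_gap(test[period:], pred[period:])
print(gap_near, gap_far)
```

Watching `gap_near` versus `gap_far` grow is exactly the "model decay" point: retraining on recent data resets the gap.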
The second one, of course, is you can look at the difference. And as the difference grows, which is just expressing the fact that the model decayed, what you would do is look at, let's say, 1950 to 1952, and say, well, if the gap increases to something that's uncomfortable for me, let's say 10%, then I'm going to look at this. So that can be an alerting system in this case. So any questions till now? So for this talk, we wrote two simulators. So Uli, with all his knowledge, wrote two artificial simulators. One simulates a network. So you have a bunch of hosts. Some belong to your intranet. Some belong to the internet, external hosts. And they have some hard-coded patterns, some distributions that we put in for the rates at which they connect to your machines, to produce the network topology. And we want to detect intrusions. So we can, you know, it's a simulation. By hand, you can go and put in some hostile hosts and let them do something bad and see if you can detect them in the data. The second one is process scheduling. So the idea is, let's say you have two processes. It's a simple computer. There are N CPUs, M IO units. You have some process, and I'll go into details of how we define the process. You run this process in isolation and you see how many CPU cycles it spends, how many IO cycles. You run another one, which is completely different. What you would like to do is run both of them on some given hardware where you are packing them maximally. So by packing, I mean you want to utilize both CPU and IO units as much as possible without penalizing any single process a lot. So you don't want their runtimes to double. So your clients are happy, but you want to use all your resources. So a trivial example is, let's say you have two processes, each of which does 50 CPU cycles, 50 IO cycles, but at alternate times, one CPU, one IO, and I can perfectly pack them.
When my sister process is doing a CPU operation, I do an IO one, and then we flip. So that would be perfect packing, but of course in general, that's not possible. So, first, the network problem. And again, as I said before, this is not comprehensive at all. So I'll just show a bunch of examples. Some are interesting techniques. Some are very simple techniques. The simplest thing you can do with a network is represent it as a graph. Every host is a node. If two hosts connect, there's an edge. So you see all the hosts and the edges. Generally, these are directed graphs. I might connect to you. I might send packets to you. You might not. You can, of course, think of it as an undirected graph where there are no directions to the arrows. You just have an edge. And of course, you can plot graphs in various ways. This is something that's called a spring layout. Let's not worry about that. The whole point is to take each node, map it to x, y coordinates, and drop it on a 2D plot. And at least in this case, you can naturally see five clusters, five groupings. So you see the four fans, and you see something in the center. All the, I don't even know what to call that color, pink, salmon nodes are external hosts. So they are on the internet. All the blue ones are on the intranet, connecting to these internet hosts. Internet-facing hosts. And all the green and yellow ones are back-end and control machines. So they're your infrastructure. And so now, what do you do with the graph? I can look at it and see something, but I don't want to look at 10,000 graphs a day. I want to do something in an automated way. And there's a very nice correspondence between graphs and matrices. So you can represent those graphs by matrices. By graphs, I always mean nodes and edges. And the simplest matrix is the adjacency matrix. And the idea is if you have n nodes, you give a number to each node, from 1 to n.
And then you create an n by n matrix. And if node i and node j, or host i and host j, talk to each other, you put a 1. If they don't, you put a 0. So now you get an n by n matrix with 0's and 1's, which is a very flexible tool. Because maybe you don't want to put 1's. Maybe each edge has a weight. It's the number of connections per day. So I can put that number into the matrix. Maybe it's an undirected graph, in which case the matrix is symmetric. So this in general is a very powerful correspondence. And then there are derived matrices. So there's something called the Laplacian. We won't go into that. But what it does is it lets you explore complex graphs through the language of linear algebra. So you can look at eigenvectors, eigenvalues, if you have worked with those kinds of things. I'll state a result that sounds mysterious. There's something called the Laplacian of a graph. It comes from the adjacency matrix. Think of it as another matrix. You can compute something called the eigenvalues and eigenvectors of this matrix. And if you look at the number of zero eigenvalues, that tells you how many connected components the graph has. You can similarly compute all kinds of interesting graph properties by looking at the eigenvalues of this matrix. You can look at its eigenvectors and visualize the graph in different ways. And so again, I won't go into details here. If you are interested, please look for Uli or me after the talk and we are happy to talk about it. But the key takeaway here is: given a graph, there are interesting matrices, and you can use linear algebra to compute useful things that let you monitor your graph. So now to some nice pictures. This is an adjacency matrix heat map. So what you see is that some of the rows and the columns light up. Each row and column corresponds to a certain node. And so by looking at this, you can already see something, like there's a group here which doesn't really talk to each other.
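The mysterious-sounding Laplacian result is easy to check in a few lines. This toy example uses a five-host graph invented for illustration, with two obvious components, and counts them from the zero eigenvalues:

```python
import numpy as np

# Undirected toy graph with 5 hosts: {0, 1, 2} form a triangle,
# {3, 4} are a separate pair, so two connected components.
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
n = 5
A = np.zeros((n, n))          # adjacency matrix: A[i, j] = 1 iff i and j talk
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # symmetric, since the graph is undirected

D = np.diag(A.sum(axis=1))    # degree matrix
L = D - A                     # graph Laplacian

# For an undirected graph, the number of (numerically) zero eigenvalues
# of the Laplacian equals the number of connected components.
eigenvalues = np.linalg.eigvalsh(L)
components = int(np.sum(np.abs(eigenvalues) < 1e-9))
print(components)             # -> 2
```

The same matrix supports spectral clustering and the eigenvector-based visualizations mentioned in the talk; counting components is just the simplest payoff.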
And if you see something that lights up, you can say, OK, that host is talking to a lot of different hosts. As a picture, it's interesting, but not that useful. The linear algebra really makes it useful. So going back to the original graph visualization, we had five clusters. As humans, we see it instantly. But imagine a very large graph, and it's complex, and you don't want to do this manually. So what you would like to do is have an algorithm that says: these are the five groupings, or 10 groupings, I see in my graph. And for each one, can I find something unusual? And there's a technique for that called clustering. And clustering by itself has many variants. So what I'll show here is called k-means. And rather than talk about it, it's easier to show pictures. So again, we have a 2D plot. We have three clusters. And what you do is you say... so there is a downside: you have to initially tell it how many clusters you think there are. So here we'll say three. Of course, there are automated ways of scanning through a bunch of numbers of clusters and finding the best one. But here, let's say we pick three. So I don't know if it's easy to see, but now there are three centers. So you see diamonds: blue, purple, and green. And what the algorithm does is it says, I don't know where these clusters are. So I'm going to drop three random points. So you drop them. Then every single point in your data votes; it says which of these three points is closest to me. And if it's the blue one, then I'm going to color myself blue. So all the ones that are closest to the blue one say they're blue, similarly for purple and green. And then the second step is you take all the blue points and you find their mean. So take their x positions, find the mean. Take the y positions, find the mean or the average. And that corrects the center's position. So there are two steps, right? You drop three random points. Every data point votes; it says, I'm closest to that guy.
So all of us who are blue will take our average and move our point. And then we repeat. Then again, we all vote. We say who's closest to which diamond. And so now you get something slightly different. And then you compute the mean again. And I'll just go through these a couple more times. So as you see, it converges. So what happens is you keep moving the centers till each center is basically within its cluster. Now, I'll throw a couple of buzzwords out here too. There's a whole field, or family of techniques, called expectation maximization. This is one of them. In practice, you have to run this 20 times, n times, and find the best solution. But again, that's super easy to do. You just run it a few extra times. Another thing, if you're interested: k-means is good for spherical clusters. There are other techniques called spectral clustering, for example, or DBSCAN, which are good for other things. But this in general will automatically scan over your data set and tell you what the clusters are. So just to make sure, this is an illustrative data set in two dimensions. And you might say, it's obvious. What we are working on normally are data sets with hundreds or thousands of dimensions. And anyone who's here who can imagine more than three dimensions, please see me. I have a Nobel Prize waiting for you. And then we do what we were talking about earlier, which is... and it's hard to show this. We actually did this on the simulator. And you can put in your hostile hosts and they just pop out, just because they have a different connectivity matrix. They connect to more hosts, or they connect more often. But what you do is you look at each cluster and you say, who's in the minority? So in this case, you have the cluster ID. You have some host address. And you take some property of the host. And this is network specific. You can do this for other things. And you say, maybe the property I'm looking at is: who is internal? Who's external?
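Those two alternating steps, vote for the nearest center and then move each center to the mean of its voters, are the whole of k-means, including the run-it-several-times advice. A from-scratch sketch on made-up 2D blobs follows; any real implementation, e.g. scikit-learn's, is far more careful about initialization.

```python
import numpy as np

def kmeans(points, k, iters=30, seed=0):
    """Plain k-means: drop k random centers, then alternate
    (1) every point votes for its nearest center, and
    (2) each center moves to the mean of its voters."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # step 1: each point picks the closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: each center moves to the mean of its points
        for c in range(k):
            if np.any(labels == c):          # leave empty clusters in place
                centers[c] = points[labels == c].mean(axis=0)
    inertia = (np.linalg.norm(points - centers[labels], axis=1) ** 2).sum()
    return centers, labels, inertia

# As the talk says: random starts can be unlucky, so run it several
# times and keep the best (lowest-inertia) solution.
def kmeans_best(points, k, restarts=20):
    return min((kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda r: r[2])

# Three synthetic 2D blobs, invented for illustration.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
                   for loc in [(0, 0), (5, 5), (0, 5)]])
centers, labels, inertia = kmeans_best(blobs, k=3)
print(inertia)   # small when each blob got its own center
```

The minority-in-cluster anomaly check from the talk then becomes a one-liner on `labels`: group hosts by cluster and flag whichever property value is rare within each group.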
And you look for the minority. So in other words, we were talking about anomaly detection, which is looking for something improbable. You can use k-means for anomaly detection. You have a cloud of data in three, four, 10 dimensions, or in our case, two dimensions. Do clustering, look at each cluster and say: who's in the minority in my cluster? And those are the anomalies. So, same point: those are the anomalies. And in our case, at least, the hostiles just pop out. This next one, again, involves no machine learning. This is just visualization. What this thing is, is a graph where the y-axis is time. So time goes up. So you have your graph. You have all these hosts talking to each other. At every point in time, you make one of these horizontal lines. And what the horizontal lines are is the traffic from an external host to one of my internal hosts. So there are four internal hosts that are internet facing. And those are the chunks you see. So this purple patch is one internal host. Same thing for the three orangish patches. And what you see is traffic as a function of time as you go up. And the reason the first patch is purple is this one host connects to twice as many clients but gets half the data from them. So it gets less traffic. Those three get the same amount of traffic statistically. And you clearly see some bands that stand out. So here, here, here, and the last one. And again, there's no machine learning here at all. It's just visualizing traffic data as a function of every single incoming connection for every internal host. And it already pops out. So sometimes just visualizing your data in a simple way, or in multiple ways, is super helpful. This is another one, not a very pretty plot, but the x-axis is again time. The y-axis is traffic. And here you see, this is from the analysis we were doing, two hostile machines. The green one is a scanner. So it's scanning ports.
The red one then looks at a potential target and attacks it. So it has less traffic. And again, this is a simulation. It's cleaner than real life is. But you can do another interesting thing with this, which is to project it into two dimensions. So the next technique we'll talk about is taking data in high dimensions. And by high dimensions, it simply means instead of having two numbers, you have 20 numbers. That's 20 dimensions. How do I visualize this in two or three dimensions? That's, again, a whole subfield, but we look at one technique, the simplest one. And it has a fancy name. It's called principal component analysis. So we look at an example. You have some three-dimensional data. You have all these dots. Each one is one data point with three coordinates. And they have different colors. And you can look at this from different angles. So you have three different angles where you just rotate this thing and you look at it. And you see if you can find some way to separate them. But you don't, right? Like, in the last one, the green and the red overlap. In the other one, something else overlaps. So what you would like to do is basically find the right angle to look at it from, so that you can separate this data as much as possible. Or in other words, at least in three dimensions, you want to find a two-dimensional plane so that if I project everything on this plane, and by projecting, I mean just this operation, just drop everything perpendicularly onto the plane, I would get a picture like this. I want this picture to have maximum variance or variation. I want the data to spread out. So the question is, what plane do I pick? And this doesn't work just in three dimensions. You can have hundred-dimensional data and you want to find a 99-dimensional plane where you can project this data. And one of the techniques is principal component analysis. It finds that plane for you. I don't want to make this more mysterious than it sounds.
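Stripped of the mystery, PCA is only a few lines of numpy. This sketch uses synthetic 3D data invented for the example, a cloud that is long in one direction and thin in another, and keeps the two directions of largest spread:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project data onto the plane of maximum variance: center it,
    take the covariance matrix, and keep the top eigenvectors."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh sorts ascending
    top = eigvecs[:, ::-1][:, :n_components]    # largest-variance directions
    return Xc @ top

# Synthetic 3D data: wide along one axis, narrow along another,
# nearly flat along the third, like a squashed cloud.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([10.0, 2.0, 0.1])

X2 = pca_project(X, 2)
print(X2.var(axis=0))   # the 2D projection keeps almost all the spread
```

Dropping the thin third axis loses almost nothing here, which is the whole idea: summarize each high-dimensional point by its position on the few directions that matter.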
Finding the plane is simply finding the eigenvectors of a certain matrix that is computed from the data. It's the covariance matrix. You find the eigenvectors. That's the plane. But don't worry about that. Basically, when you then take that plane and you rotate it, you drop it onto a two-dimensional plot, you see something like this. Now, again, sometimes it works. Sometimes it doesn't work. But it's a very powerful technique. So in the case of our traffic analysis, when you plot it in two dimensions, all the external hosts that are not hostile form those two blue blobs. And the scanner and the attacker, and there is only one of each, fall in completely separate spaces. And this is very powerful. It's very hard to visualize time series data. I mean, you might have 2,000 data points per time series. It's hard to look at that. Doing something like this summarizes each series in one point and instantly tells you what's unusual. And again, a few more buzzwords. There are other techniques like this. There's something called a self-organizing map, there's t-SNE, and many others. But PCA is a good go-to technique. So the second simulator we tried was the process simulation. And the idea here is, let's write a simulator where you have N CPU units, M IO units, and you have processes running. How do you model processes? Well, there are four states. They have compute and IO. And if they can't find a free CPU or IO unit, they're waiting, or they're idle. So you have these four states. And what you do is you pick a state. If I'm a process, I pick a state, compute, and I pick a random number from a distribution. It can be normal. It can be something you pick. And I stay in that state for that amount of time. So I might stay in CPU for 50 cycles. And then there's a probability for me to jump to compute, to IO, to idle, to something else. And I repeat this. So this is technically what's called a Markov chain.
You start in some state. You stay there for some time. That time is randomly generated. And then when that time is done, you jump to some other state randomly. It's a very simple model of a computer, very basic. And like I said before, what we want to model is, let's pick a simple example, right? Two machines, one CPU, one IO unit each. I run a process in isolation on, let's say, each of them. I measure the number of IO cycles, CPU cycles, total time. And I want to find out what happens: can I predict what would happen if I run them together on the same hardware? Can I pack them in a tighter fashion? And so this is the raw data that we generated. So the x-axis is: you run each process. So if you pick two processes, run each 100 times. There are random elements. You have to average them. But the x-axis is the mean compute time and mean IO time. Time spent in the compute state, time spent in the IO state, or in other words, time spent doing useful work, for both processes combined. This is on the x-axis. And on the y-axis is: if you put both of these processes on the same machine, how many wait cycles are introduced? So the worst case scenario is if I do 50 CPU cycles and 50 IO cycles, Uli is the same, we go on the same machine, and we collide the whole time because we just can't multitask. So that's what you see on the y-axis. So again, x-axis is total work in isolation for each process. Y-axis is: when I plop them together, how many wait cycles do I introduce? Ideally that's zero, but it's not. And again, with the caveat that this is a simulation, there are also some interesting features. You see a big blob on the right side where they take a long time to run, but they don't have that many wait cycles. So they seem to pack pretty well. While for all of these, it's almost linear. So there's some interesting pattern here, but we would like to predict the y.
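The Markov-chain process model described above fits in a few lines. The four states come from the talk; the dwell-time distribution and the transition probabilities below are invented numbers, just to show the mechanics:

```python
import numpy as np

STATES = ["compute", "io", "waiting", "idle"]

# Transition probabilities between states. These values are made up
# for illustration; each row must sum to 1.
TRANSITIONS = {
    "compute": {"compute": 0.1, "io": 0.6, "waiting": 0.1, "idle": 0.2},
    "io":      {"compute": 0.6, "io": 0.1, "waiting": 0.1, "idle": 0.2},
    "waiting": {"compute": 0.4, "io": 0.4, "waiting": 0.1, "idle": 0.1},
    "idle":    {"compute": 0.5, "io": 0.4, "waiting": 0.0, "idle": 0.1},
}

def simulate_process(n_jumps, seed=0):
    """Markov-chain process: pick a state, stay for a random number of
    cycles (roughly normal around 50), then jump to the next state
    according to TRANSITIONS. Returns total cycles per state."""
    rng = np.random.default_rng(seed)
    state = "compute"
    time_in = {s: 0 for s in STATES}
    for _ in range(n_jumps):
        dwell = max(1, int(rng.normal(50, 15)))   # cycles spent this visit
        time_in[state] += dwell
        names = list(TRANSITIONS[state])
        probs = list(TRANSITIONS[state].values())
        state = str(rng.choice(names, p=probs))
    return time_in

profile = simulate_process(1000)
total = sum(profile.values())
print({s: round(t / total, 2) for s, t in profile.items()})
```

Running two such simulated processes against a fixed pool of CPU and IO units is then what produces the wait cycles on the y-axis of the plot being discussed.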
And so one of the techniques for doing that... and as you can see, I'm using simple examples to motivate some machine learning techniques. So the examples are not completely rigorous. We didn't do all the checks that you should do. These are just examples. So a very useful and very fast and powerful technique in machine learning is called random forests. And it's basically a combination of something called decision trees. What is a decision tree? It's a sequence of if-else statements. So you give me some data. You give me x, y, z, which are so-called features. And you tell me... so let me talk about this example. You give me two coordinates, x and y. Every point, depending on where it lands, is red or green. If it's in the center, it's red. If it's outside, it's green. I want to build a sequence of rules so that if you give me a new point, x and y, I can tell you whether it's red or green. Now, of course, this is visually a trivial example. You can say, oh, you can maybe draw a circle and see where it lies. But again, like Uli said, data can be more complex. It can be high dimensional. This is just an example. And what the decision tree does is it will construct a tree like that: just if-else sequences where the thresholds and the statements are learned from data. And so what the plot on the right shows is you just go across the grid, and everything that would be predicted as green is in the yellow; everything that would be predicted as red is in the blue. And so, of course, because there are if-else statements on x and y, you get vertical and horizontal lines. And we're not claiming this is the best technique to use for time series or for regression, but it's one of the first go-to techniques. Just to give you a flavor of how messy this gets, we actually took the raw process data and constructed 123 different properties: time spent in compute, number of transitions from compute to IO. Then you take the time spent in compute and you bucket it into bins.
So you try all these things. You feed it to your algorithm. And this is what you get. So on the training set (you take some data and your algorithm only sees that) you say, all right, I'm 9% off on average. So the error between the prediction and the actual values is roughly 9%. But when you actually test it on data you haven't seen, you get 50% errors. So, number one, this is something called overfitting. This is severe overfitting. Your model is memorizing the training data that it saw. And then it sees something different and it conks out. But more than that, it's pretty bad. Then you reduce the data you use. You don't use 123 properties of each process. You use two for each, so four in all. And then, as you see, on the training set your accuracy goes down. It is still overfitting, because there's a big difference between train and test. But your test accuracy is much better. A 20% error isn't that bad. So I could predict, on average with a 20% error, that if you run n processes together, this would be the total runtime. I mean, that's a very rough, crude model to help you schedule a distributed system in general. And so the key takeaway point here is, one, of course, that there are many techniques like decision trees and random forests. But the second one is: more data is not always better. Sometimes having the right data, which is where domain experts come in, is far more powerful than just throwing data at something. And so lastly, I'll just show you the slightly messy plots. On the left is the model that uses all the data, so 123 properties of each process. The blue points are the original data. The red ones are the predictions. And you can see you're really off here. So everything should actually be on a vertical line: this point was predicted to be here. Even here, you're completely off. All this stuff should be here. What you see on the right side is the model with just four properties in all.
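A decision tree really is nothing more than if-else statements whose thresholds are learned from data. This from-scratch sketch grows such a rule sequence on synthetic red-in-the-middle data; real libraries such as scikit-learn do this far more efficiently, and a random forest is many such trees averaged.

```python
import numpy as np

def gini(y):
    """Impurity of a set of 0/1 labels: 0 when pure, 0.5 at worst."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2 * p * (1 - p)

def build_tree(X, y, depth=0, max_depth=4, min_size=5):
    """Recursively pick the (feature, threshold) if-else split that
    best separates the labels, then recurse on each side."""
    if depth == max_depth or len(y) < min_size or len(set(y)) == 1:
        return ("leaf", int(round(y.mean())))        # majority label
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            score = (gini(y[left]) * left.sum()
                     + gini(y[~left]) * (~left).sum())
            if best is None or score < best[0]:
                best = (score, f, t)
    _, f, t = best
    left = X[:, f] <= t
    if left.all() or (~left).all():                  # degenerate split
        return ("leaf", int(round(y.mean())))
    return ("split", f, t,
            build_tree(X[left], y[left], depth + 1, max_depth, min_size),
            build_tree(X[~left], y[~left], depth + 1, max_depth, min_size))

def predict_one(tree, x):
    while tree[0] == "split":
        _, f, t, lo, hi = tree
        tree = lo if x[f] <= t else hi
    return tree[1]

# Synthetic data: points in the central square are "red" (1),
# points outside it are "green" (0).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = ((np.abs(X[:, 0]) < 1) & (np.abs(X[:, 1]) < 1)).astype(int)

tree = build_tree(X, y)
preds = np.array([predict_one(tree, x) for x in X])
print((preds == y).mean())   # training accuracy
```

Because every split is an axis-aligned threshold, the decision boundary comes out as the vertical and horizontal lines seen in the plot; and cranking `max_depth` up is exactly how a tree starts memorizing its training set, the overfitting just discussed.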
And of course, it's hard to know which one's better, except you can look at specific corners and say, oh, it's doing better in the right corner. There are still some bad things happening here. But this is the one that gave you much better performance on average. So that's it. Again, the key takeaway from this is not that this is the most accurate model or the most accurate simulator, but that it is useful to use some of these machine learning techniques, and there are many choices, to help you automatically monitor systems. So the thing about this is that the techniques we have deployed here are deliberately not specific to this purpose. These are general-purpose techniques, and you collect data anyway. So what we are suggesting is that instead of trying to guess what is going on, you try out some algorithms like this and let the algorithm do some of the choosing. It's not a dumb process. You don't just throw all the data you have at it; that's what this example was supposed to show. Instead, you try out a couple of things, and then the model itself, once you figure it out, you can just retrain over and over again. If the complexity of the model is not too high, so the learning doesn't take too long, you can say: every single day I relearn it, I relearn the parameters. So that way you're catching up on trends. Remember some of the first slides, where you have a trend in the data itself and you want to catch up on that. So these kinds of things work well as long as you're not putting your own bias into the game. So, as long as you drop the idea that you know everything about it. You should know something about what is important, but you should not claim to know that, yeah, this will never exceed this much internet traffic. So just imagine: back in the day we said, well, if you ever have more than five megabits of network traffic in a second, something is wrong. Well, guess what?
Nowadays we have 100-gigabit systems. So these kinds of things, if you put them in a system like this, will automatically be learned. And there are many techniques. So if you ask your friendly neighborhood machine learning expert about these kinds of things, they will probably be happy to help you along with this. Any questions? So just for reference, Sanjay and I used to give classes in machine learning, and the very, very basic overview class on all the techniques alone took something like 40 hours. And that's not even going into details. So that's not something which you'll learn overnight. That's, well, a two- or three-year process for you to catch up. So don't expect that you'll be able to do these kinds of things right away, but you can find someone who actually can help you with that.