In this set of slides, we will talk about the performance and tuning of an IDS. We mentioned that the goal of an IDS is to identify and discriminate between malicious and normal activities, for example between attacks and normal traffic. We also said that the key assumption behind the functioning of an IDS is that attack traffic differs from normal traffic. Let's now see why this is a key assumption and what it means for the detection process. Let's assume we have defined a certain metric that we can compute from both malicious and benign traffic. A metric can be as simple as a payload length or as complex as the likelihood that a sample is malicious, as computed by a statistical model. Over a large set of samples, we are then able to observe how the possible values of this metric are distributed. For example, let's assume they are distributed in this way. This means that the metric is able to fully separate the benign and malicious traffic. It also means, most likely, that we are dealing with an easy problem. But let's go on with our example. In this case, we are able to build a perfect IDS by properly choosing a threshold for our metric. Every sample with a metric value larger than the threshold is an attack, and every sample with a metric value smaller than the threshold is normal traffic. Unfortunately for us, the real world is rarely that easy, and most likely the traffic samples will be distributed in this way. The normal traffic distribution and the malicious traffic distribution will partially overlap, which means that there is a subset of the traffic for which we will not be able to say, based on our metric, whether it is malicious or benign. It will simply look the same. This example tells us that the intrusion detection process will make errors, and that we might not be able to avoid them. Since we have established that an IDS will most likely make some detection errors, it is then important that we are able to quantify how well an IDS performs.
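To make this concrete, here is a minimal sketch in Python. The metric values are purely hypothetical, drawn from two overlapping Gaussian distributions; the distribution parameters and the threshold are illustrative assumptions, not taken from any real trace:

```python
import random

random.seed(0)

# Hypothetical metric values: benign traffic centred around 3, malicious
# around 7. Because the two distributions overlap, no threshold can
# separate them perfectly.
benign = [random.gauss(3.0, 1.5) for _ in range(1000)]
malicious = [random.gauss(7.0, 1.5) for _ in range(1000)]

THRESHOLD = 5.0  # samples above the threshold are flagged as attacks


def classify(metric_value, threshold=THRESHOLD):
    """Return True (attack) if the metric exceeds the threshold."""
    return metric_value > threshold


# Some benign samples exceed the threshold and some malicious samples
# fall below it: these are the unavoidable detection errors.
errors = sum(classify(m) for m in benign) + sum(not classify(m) for m in malicious)
print(f"misclassified {errors} of {len(benign) + len(malicious)} samples")
```

However you move the threshold in this sketch, some samples from the overlap zone end up on the wrong side, which is exactly the point made above.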
To evaluate the performance of an IDS, we need two ingredients. First, we need a ground truth. A ground truth is a data set for which we know whether each sample, like a packet or a log entry or any other data your IDS works on, is an attack or not. A ground truth data set is therefore a labelled data set. Labelled data sets are quite rare in intrusion detection, since they mostly need to be labelled manually. They therefore require domain knowledge and time to be created, and one is often faced with a trade-off between how representative a data set is and how complex it is to label it accurately. We will refer to the ground truth samples as malicious and benign samples. The second ingredient is the IDS output. We assume here that the ground truth data set will be processed by the IDS we are analysing, which will give us a data set labelled with the classification decisions of the IDS. We refer to the IDS output samples as positive samples, in case the IDS labels the sample as an attack, and negative samples, in case the IDS decides the sample carries normal traffic. Let's look again at the overlapping normal traffic and malicious traffic distributions. Now that we know about ground truth, we can say that those distributions are built on the ground truth data set. Let's assume the IDS uses a threshold to distinguish between positive and negative samples. We are now in a situation in which each sample has acquired two labels, one given by the ground truth and one given by the IDS. If the ground truth and the IDS output are in agreement on the nature of a sample, we have a correct classification. If not, we have an error. This gives four possible outcomes. If malicious traffic is correctly classified as positive, we say this is a true positive sample. Similarly, if benign traffic is correctly classified as negative, this is a true negative sample. Let's now look at the errors. We have two types of errors.
A malicious sample that is labeled as negative is a false negative sample. This is essentially a miss. It can also happen that a benign sample is labeled as positive. This is a false positive sample. Think about it as an extra, unnecessary alert. By the way, we call this a confusion matrix. These classes are also visible in our original traffic distribution plot. The area under the benign traffic curve to the left of the threshold gives us the percentage of true negative samples. Similarly, the area under the malicious traffic curve to the right of the threshold gives us the percentage of true positives. The overlap zones give us the percentages of false negatives and false positives. To summarize, the confusion matrix allows us to quantify the performance of the IDS. Once this is done, we are able to say how well an IDS performs given a certain ground truth, but we are also able to compare, in a verifiable manner, different IDSes. Now that we know how we can measure the classification performance of an IDS, let's think about how we can use this to improve the security of a network. Clearly, the underlying goal is to keep both false positives and false negatives as low as possible. If you take the point of view of a network administrator, false positives are costly, since alerts need to be manually checked. On the other hand, false negatives constitute a security risk, which you may or may not be willing to take depending on which attack will go undetected. This brings us to the last topic of this lecture, namely the tuning of an IDS. Although we might like to think that an IDS is plug-and-play and will work as-is in any network, reality shows that this is most likely not the case. The intuition behind this is that although certain characteristics of an attack will be similar in all networks, they might not be exactly the same. This holds for normal traffic as well, after all.
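Once each sample carries both labels, the four outcomes can be counted directly. Here is a small sketch with made-up ground truth and IDS decisions, just to show the bookkeeping:

```python
# Hypothetical ground-truth labels and IDS decisions for six samples.
# True = malicious / positive, False = benign / negative.
ground_truth = [True, True, True, False, False, False]
ids_output   = [True, True, False, False, False, True]

# The four cells of the confusion matrix.
tp = sum(g and p for g, p in zip(ground_truth, ids_output))          # correctly flagged attacks
tn = sum(not g and not p for g, p in zip(ground_truth, ids_output))  # correctly passed benign traffic
fn = sum(g and not p for g, p in zip(ground_truth, ids_output))      # missed attacks
fp = sum(not g and p for g, p in zip(ground_truth, ids_output))      # unnecessary alerts

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # prints TP=2 TN=2 FP=1 FN=1
```

Reporting these four counts for the same ground truth data set is what makes the comparison between different IDSes verifiable.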
For example, everybody knows that there is most likely a diurnal pattern in traffic, but whether the peak load is 500 megabits per second or 2 gigabits per second will depend on the specific network. In the case of an IDS, this means that a set of internal parameters will most likely need to be tuned to work in a specific network. The good news is that the error rates, false positives and false negatives, can also be controlled by tuning the parameters. Let's go back to the benign and malicious traffic distributions. By choosing a proper threshold, one might be able to reduce the number of false positives to zero. However, this will most likely raise the number of false negatives. Conversely, you can find a threshold such that the number of false negatives will go to zero, but most likely at the cost of a larger number of false positives. This example tells us the following. First, error rates can rarely be treated separately, and they are in most cases intertwined measures. Second, you can look for parameter values that minimize both error rates. However, this is not the only option. For example, if following up on an alert is a very costly operation and you can afford the risk of missing some attacks, you can tune your IDS so as to have a higher number of false negatives but fewer false positives. Conversely, if you have, for example, an automated way of handling alerts, you can accept a higher number of false positives but lower the risk of missing attacks. The choice of which strategy to apply depends on the security policies you wish to implement in a network. So far, we have reasoned in an abstract manner about the performance and tuning of an IDS. To conclude, I would like to give you two examples of what these concepts might look like when we get closer to real traffic traces. Let's first look at this graph.
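The trade-off between the two error rates can be seen by sweeping the threshold over the same kind of hypothetical overlapping distributions as before; again, the distribution parameters are illustrative assumptions, not measurements:

```python
import random

random.seed(1)

# Hypothetical metric values for benign and malicious samples (overlapping).
benign = [random.gauss(3.0, 1.5) for _ in range(1000)]
malicious = [random.gauss(7.0, 1.5) for _ in range(1000)]


def error_rates(threshold):
    """False-positive and false-negative rates for a given threshold."""
    fp = sum(m > threshold for m in benign) / len(benign)
    fn = sum(m <= threshold for m in malicious) / len(malicious)
    return fp, fn


# Raising the threshold lowers the false-positive rate but raises the
# false-negative rate, and vice versa: the two rates are intertwined.
for t in (2.0, 5.0, 8.0):
    fp, fn = error_rates(t)
    print(f"threshold={t}: FP rate={fp:.2f}, FN rate={fn:.2f}")
```

A low threshold corresponds to the "automated alert handling" policy (more false positives, fewer misses), a high threshold to the "costly manual follow-up" policy.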
Here we have analyzed the distribution of the likelihood that a short-time sequence of flow measurements represents normal traffic or that it carries instead an SSH brute force attack. Likelihood is a probabilistic metric that tells us how likely it is that what we observe is an attack. Likelihood is not an absolute metric, but it refers to a model of reality, in this case, the anomaly-based detection engine used to analyze the data. This graph was created using well-behaving artificial traces, that is, traces that are fairly close to reality but were generated by a model. As you can see, the graph is already quite a bit more complicated than what we have seen so far, but the error rates are clearly visible and well-defined. Now, let's look at the same picture for a real SSH trace. You can notice immediately that there is a lot more variability in the likelihood values. Also, the malicious and benign curves have a larger overlap, which we have learned means that a larger portion of the traffic is indistinguishable. Does this mean that we have seen a nice theory but it will not be applicable? Well, no, it is not that simple. It means that when designing an IDS, domain knowledge about the problem you are tackling, for example, which attack you want to detect, is fundamental for getting good results. And it also means that you should not assume your IDS is a perfect classifier; you need instead to be prepared to handle detection errors.
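As a rough illustration of what a likelihood-based metric looks like in code, here is a sketch that models normal traffic with a single Gaussian and flags observations whose log-likelihood under that model is too low. The model, its parameters, and the three-sigma cut-off are all assumptions made for illustration; the actual detection engine discussed here is more complex:

```python
import math


def gaussian_log_likelihood(x, mean, std):
    """Log-likelihood of observation x under a Gaussian model of normal traffic."""
    return -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)


# Hypothetical model of normal traffic, e.g. flows per second, summarised
# by a mean and standard deviation estimated during a training phase.
NORMAL_MEAN, NORMAL_STD = 50.0, 10.0

# Cut-off: the log-likelihood of an observation three standard deviations
# away from the mean. This threshold is itself a tunable parameter.
THRESHOLD = gaussian_log_likelihood(NORMAL_MEAN + 3 * NORMAL_STD, NORMAL_MEAN, NORMAL_STD)


def is_anomalous(observation):
    """Flag observations whose likelihood under the normal-traffic model is too low."""
    return gaussian_log_likelihood(observation, NORMAL_MEAN, NORMAL_STD) < THRESHOLD


print(is_anomalous(55.0))   # close to the model: not flagged
print(is_anomalous(400.0))  # e.g. a brute-force burst: flagged
```

Note that the metric is only meaningful relative to the model: change the model of normal traffic, and the same observation can get a very different likelihood.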