Hello, everyone. In my presentation, I will share data science methods that proved successful in our corporate setup, specifically for detecting advanced cyber threats and for performing prompt triage of security events. For the latter, I will share a new type of visualization that I haven't seen in other security solutions. I also want to highlight that most of what you will see was built using free open source software, and thus it can be reused in your environments.

The agenda of my talk is as follows. In the introduction, I will suggest a perspective that fundamental science, and specifically physics, can bring into the cybersecurity domain. The main part of my talk will consist of two subparts, a theoretical one and a practical one. In the theoretical one, we will discuss the rule-based approach to detecting cyber threats and what advantages machine learning can bring to it. I will share the architecture of our solution, which uses supervised and unsupervised machine learning and allows us to detect advanced cyber threats and reduce the number of false positives. Then we will proceed to the practical, or demo, part. Here, we will assume that we are cyber analysts who need to perform prompt triage of security events. I will also share one of the practical results of our work, a detected advanced cyber attack. Then I will conclude.

OK, let's get going. I have a physics background. And when somebody mentions physics, what probably comes to mind is Albert Einstein, Stephen Hawking, The Big Bang Theory. What I'd like to add to this list is multi-dimensionality, not only because physics studies systems in multiple dimensions, but also because physicists themselves are quite multidimensional. For example, Albert Einstein is one of the most renowned physicists. However, he used to say that he got the most joy in his life from music. Isn't that funny? He was a professional physicist, not a musician. Stephen Hawking: what is more inspiring?
His research or his personal example? He showed that one can live a fulfilling life despite severe physical constraints. The Big Bang Theory, finally. Probably most of us have watched the show. I like it. It brings attention to people passionate about science and technology. However, at the same time, I believe it somewhat shrinks the dimensionality of their characters. I mention it here because I stand with the organizers of NorthSec 2021, who started a campaign to raise awareness about biases towards cybersecurity specialists. I think it would be great if all of us humans could always recall and recognize that we are all quite complex, or, if you prefer, multidimensional.

With that, let's move on to the theoretical part of my talk. I will be using a credential attack, or, if you prefer, TTP 1078 of the MITRE ATT&CK framework, for illustrative purposes throughout my presentation. I will also mention how to extend my analysis to other TTPs, for example, 1210, Lateral Movement. So let me quickly remind you that a credential attack is when a malicious actor gets access to user credentials and uses them to authenticate into protected websites. Why? For example, to steal private clients' data.

Let's look at a specific example. Let's take source IP number one. We know it's controlled by a malicious actor who got a lot of user credentials and is trying to validate them against our login URL. Because there are a lot of credentials and the malicious actor wants to be done fast, the malicious actor generates a lot of connections. Let's say three per minute at some point. I would like to clarify that all the numbers throughout my presentation are provided only for illustration purposes.

OK, how can we be notified about such malicious activity? We can create a rule and say that if some source IP generates three or more connections in a given minute, then we want to see an alert. The problem with this approach is that all source IPs are different.
For example, it will work for source IP number two, which historically never generated more than one connection a minute. Let's assume these are the ideal customers who never forget their passwords or user credentials. However, source IP three is a public IP that can be shared by multiple customers and historically shows higher numbers, for example, 12 connections per minute. IPs of this kind will be a constant source of false positives for this rule with a threshold of three. What can we do? We can, of course, mitigate this problem by raising the threshold, let's say, to 15. And this will solve the problem of false positives with source IPs like number three. However, what is bad here is that a rule threshold of 15 will leave under the radar malicious actors such as source IP number one and the like. So what should we do?

This is when machine learning comes in handy, because it performs historical profiling of sources. Here you can see how the unsupervised machine learning component of Elastic detected the anomaly, this red dot. The color of the anomaly indicates the severity of the deviation from the norm, with red standing for a high deviation and orange, yellow, and then green describing milder deviations. Our solution, which we internally call Lightning AI, works by monitoring the environment for anomalies and triaging those anomalies. Of course, if the triage didn't give a definite answer and we need to fetch more logs, we can always turn to Kibana, which has all the logs. I will show you how to do it.

Okay, so let's see what this new unsupervised machine learning dimension brought into our analysis. Now all anomalies have two coordinates. The first one, on the X axis, is the absolute value of the anomaly, or the number of connections per minute. So for source IP one in our example, it's three at this given minute. On the Y axis, the value of the anomaly is 75, which stands for a high unsupervised machine learning risk.
This Y axis value indicates the anomaly's relative value, meaning how big the deviation from the norm is in relative terms. Here we can see that unsupervised machine learning said it's 75. That is pretty big, because the value on the Y axis can be between zero and 100.

Okay, to be a bit more clear about the analysis of this two-dimensional space, let's break it into four quadrants. And let's put an anomaly into each one of them. We see that the anomalies in quadrants one and two are red. That is because their Y coordinate is 75. The anomalies in quadrants four and three are green because they correspond to a Y axis value of 25. So let's say we were to stick with our rule-based approach and set the threshold to 10 connections per minute. Anomalies in which quadrants would be generating alerts? Correct, those in quadrants one and four, because all of them have an X coordinate above 10. As we have already seen in our example, it's a problem for everything in quadrant number four, for example, if this is our source IP three, which is shared by multiple customers. To mitigate the problem of false positives, we can modify our detection to only alert on anomalies in quadrants one and two. Now it's good: we don't have problems with false positives here, and we also alert on malicious attackers who have a low absolute value in terms of the anomaly but a high relative value, like source IP one.

Okay, the combination of rules and unsupervised machine learning nicely covers quadrants one, two and four. What about quadrant number three? Can we say something about it? Is it even interesting? Well, actually it is very interesting, because most of our clients and also the advanced attackers are all there. For example, this orange dot can be our client who forgot their credentials and is trying to recover them. We see that in our environment quite a lot. And this one can be an advanced attacker who is trying to stay under the radar using automation tools.
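The quadrant reasoning above can be written down as a minimal Python sketch. Everything in it is illustrative: the `Anomaly` class, the function names, and the low score of 25 assigned to the shared IP are assumptions made for this example, and the cut-offs of 10 connections per minute and a score of 50 are simply the illustrative numbers used in this talk, not values from the actual solution.

```python
from dataclasses import dataclass

# Illustrative cut-offs from the talk's example, not production settings.
ABS_THRESHOLD = 10  # X axis: connections per minute (rule threshold)
REL_THRESHOLD = 50  # Y axis: unsupervised ML score on a 0-100 scale


@dataclass
class Anomaly:
    source_ip: str               # hypothetical label for the source
    connections_per_minute: int  # absolute value (X coordinate)
    ml_score: float              # relative value (Y coordinate)


def quadrant(a: Anomaly) -> int:
    """Return the quadrant (1-4) an anomaly falls into."""
    high_abs = a.connections_per_minute >= ABS_THRESHOLD
    high_rel = a.ml_score >= REL_THRESHOLD
    if high_rel:
        return 1 if high_abs else 2
    return 4 if high_abs else 3


def rule_alert(a: Anomaly) -> bool:
    """Pure rule-based detection: fires in quadrants one and four."""
    return quadrant(a) in (1, 4)


def ml_alert(a: Anomaly) -> bool:
    """Detection on the relative score: fires in quadrants one and two."""
    return quadrant(a) in (1, 2)


# Source IP one: low absolute value, high relative deviation.
attacker = Anomaly("source-ip-1", connections_per_minute=3, ml_score=75)
# Source IP three: shared public IP, high absolute value, normal for it.
shared_ip = Anomaly("source-ip-3", connections_per_minute=12, ml_score=25)

for a in (attacker, shared_ip):
    print(a.source_ip, quadrant(a), rule_alert(a), ml_alert(a))
```

In this sketch the rule alone misses source IP one and fires on the shared IP, while alerting on the relative score does the opposite. Quadrant three is covered by neither, which is the gap discussed next.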
So does it mean that we cannot differentiate between real clients and attackers in quadrant number three? Fortunately, we can. But for that, we need to use our security expertise and the security context that we can bring in by using supervised machine learning. Why is it even possible? What gives us an edge? It is that we defenders know the real data. We know how real clients really behave, while advanced attackers trying to mimic their behavior have to guess. If they guess incorrectly and go too high, they get noticed. If they go too low, first of all, it's also a deviation from the norm, so they also get noticed. But even better than that, they have to reduce the intensity of the attack, or, if they want to maintain it at the same level, they have to reduce the intensity per IP and thus increase the number of IPs, right? Every IP has to perform fewer connections a minute. Thus, if you want to keep the same total number of connections per minute, you need to increase the number of IPs, which means you drive the costs up because you have to maintain more IPs, and so on and so forth. Thus, attacking us doesn't look like a very profitable thing. And as we know, a lot of cyber attackers are after money. Okay, perfect. So this way, we reduce the incentive to attack us. And if some of them leave, our cybersecurity team only has to deal with those who didn't leave, so it has much less work and can focus its attention. So it's a win for us.

Let's look into our specific solution architecture. I know it's a loaded slide, so let's take it piece by piece. Here we can see different TTPs: credential attack, lateral movement. Let's start with the credential one. Here there are a lot of unsupervised machine learning jobs. They all look into different types of logs, for example, web logs and WAF logs. Remember, multi-dimensionality is our power. They also look into different types of anomalies, volume anomalies and cardinality anomalies.
For example, a volume anomaly could be the number of connections to the login page per minute. And a cardinality anomaly could be the number of distinct users connecting to our login pages per minute. Okay, of course we can create many more jobs, but I think you got the idea.

So let's try to extend it to lateral movement. Of course, here we will be analyzing different types of logs, let's say Sysmon logs and firewall logs. We can still use volume and cardinality anomalies, but here the volume anomaly will stand for the number of connections from a given source IP, and the cardinality anomaly will tell us how many different destinations this source IP is trying to connect to. This is how we can extend our analysis to other TTPs.

Okay, let's get back to the credential attack. So we have a lot of unsupervised machine learning jobs. They generate anomalies. What next? Next is the second part of our solution, supervised machine learning. What is its goal? Its goal is to aggregate the risk assigned to every source IP. So why is it important? It's because if we start opening tickets for every unsupervised machine learning alert, we will be flooded. This is why supervised machine learning is so important. It uses the security context and our knowledge of the environment. And it opens tickets only when the behavior of a specific source IP is malicious, or at least suspicious.

Okay, good. So what else can we say before going to the demo? Well, that as the weaponization race goes on, at some point, quite possibly, we will have to move to deep learning and complex architectures. But in this day and age, classical machine learning, for example, the kind contained in the free open source library scikit-learn, does a great job.

Okay, let's see it all in practice now. As promised, we will follow a scenario in which we are analysts and we have just received a ticket about a credential attack. Here is the ticket.
A credential attack is probably happening from the source IP 109.xxx. This is the timestamp. Let's get to see our solution. Here it is. On the left, you see a heat map of all the anomalies in our environment. On the X axis, we see a different dimension here, the timeline. On the Y axis, we also see a different dimension: in this case, all the different source IPs that our supervised machine learning engine decided to show to us, because it considers them either malicious or suspicious.

So what else can we see here? We can see this slider, and it allows us to filter for only the output from a given unsupervised machine learning job, for example, from the volume IIS CA job. So what does this IIS CA stand for? Volume we know: it's the number of login URL attempts from a source IP in a given minute. IIS is actually Microsoft web logs, Microsoft server web logs. It's beneficial to split the traffic into several different baselines, because, for example, IIS logs and Apache logs, representing Windows and Linux servers, will probably be hosting different types of applications, and for that reason they will also be showing different patterns of customer behavior. Then it's also beneficial to split sources by country. For example, the simplest split would be Canada and not Canada, but this is just so you get the idea of what is happening. Okay, I think you got it.

So let's get back to our ticket. We said that 109.xxx is showing some strange activity. So let's zoom in and see what is going on. Here we see a couple of anomalies. If we hover over one of them, we will see minimal information about it. The time it was generated, 5:33. Source IP, 109.xxx. Actual value: in our case, because this is a credential attack and a volume type of machine learning job, this is the number of connections from a given IP. Typical value, one. Again, all the data here is provided for illustrative purposes. Record score: this corresponds to 71, which is orange. This is the unsupervised machine learning risk.
And that's why the anomaly is orange. And finally, the machine learning job ID: credential attack, volume, IIS, not Canada. Okay, great. So let's click on one of them and see more details.

On the right, what you see is this new type of visualization I was talking about, weighted chains. I will discuss briefly how it was technically constructed, but I think it has a nice, friendly interface, so it will be clear what is happening here without further explanation. So let's try it out. First we see that there was access to the login page. Then we can see that these requests were receiving status code 200, success. These were POST requests. The user agent was Linux, and the device was not detected. The name of the user agent is Firefox. In total there were 120 chains executed, and they represent 100% of the traffic from this IP. Okay, very interesting. So basically some IP is interested only in our login page, and quite a lot.

However, you remember that our power is multi-dimensionality. So let's look into the same anomaly in the WAF logs. Here the picture is more complicated, so let's filter for just one of the chains. This one. This yellow chain was executed only once. And we can see that the WAF considered it to be a brute force attack: it alerted on it, but didn't block. And the picture would be the same for all the other user names. So it looks like a spraying attack.

How can we confirm our hypothesis? We should add more dimensions to our analysis. Let's look into the time dimension. Here you can see the volume of connections from this IP per minute. And here is some aggregated data. In the last 30 minutes, there were 301 connections. In the last 30 days, there were 301 connections as well. What does it mean? It means that there is a source IP that was never coming to us and suddenly unloads a huge amount of login page requests. Yeah, this doesn't look good. Here on the left, we can see more information about the source IP participating in the connection.
109.xxx is public, from a far away company, a far away city, a far away country. And here is our server, XXX4. It belongs to our company, in our city, and this is the hosting. So what else can we see here? This is quite useful information. It says that in the 30 minutes preceding this anomaly, no accounts were accessed from this IP. What does it mean? It means that so far the attack is not successful, which is good news. If I were an analyst, I would have followed the incident response playbook, but most probably I would have just blocked this attacker if it was not done automatically. We will not proceed with this ticket. I think everything is clear here.

So let's take a step back and look into the heat map. What else can we see here? This region is quite interesting. Let's zoom in and see what is going on. Here we see quite a few different source IPs. By themselves, they don't show anything red or orange, or even yellow, which means that all the anomalies are quite small in terms of unsupervised machine learning. However, it's a bit surprising that all of them are correlated in time. They all come at the same time. It looks a bit like a distributed attack. So let's click on one of the anomalies and see what is going on.

Okay, so we again see the pattern of a spraying attack. This chain was executed once. It was a brute force attack, alerted, not blocked, and the same goes for the other users. We can also look into the same anomaly in the web logs. Here comes the power of the weighted chains: it's really easy to analyze this multi-dimensional data. For example, if I want to see traffic to the login page, I can see the POST and GET requests. I filter on the POST requests. Okay, so there were POST requests to the login page with the 200 success status code, three of them in total. Now I want to unfilter those and just move to another URL. Let's take this one, for example. It's very interesting: oursite.com, then account, and then slash some other URL. What does it mean?
It means that, quite possibly, the malicious actor got inside the account, if we look at this link. And what was it doing? It was sending a GET request, one of them. Okay, very interesting. With a GET request, it's possible that some data is being extracted. Let's confirm our hypothesis. Oh wow, here we see that 12 different accounts were compromised in the last 30 minutes from this source IP. So this attack doesn't look that intense, because it's not red or orange in terms of unsupervised machine learning, but what is happening is actually really bad. It's good that it's training data. But here we can really see that advanced attackers trying to stay under the radar can fool unsupervised machine learning, but not supervised machine learning working together with unsupervised.

So let's illustrate this in even more detail. You remember I was talking about quadrant number three. So let's try it out by setting the sliders to 10 in absolute value and 50 in relative value. What we will see is only these two alerts, specifically from the 109.xxx that we discussed. What does it mean? It means that unsupervised machine learning and rules can both detect the 109.xxx attack. However, they're blind to the distributed advanced attacks. This is, I think, one of the key takeaways from my presentation.

Okay, let's get back to the default settings and look into one more thing. I don't want to sweep things under the carpet, and I absolutely admit that no solution is perfect. So we also have false positives. We can, however, triage them promptly without wasting the effort of our analysts. Let's look into this one specifically. I know it's a false positive, and let me convince you. Let's click on this one. Okay, let's see what is happening here. Again, we see a kind of spraying attack. However, let's filter for this chain. What do we see here? The user is joe.doe at joe.doe.com, and there are two connections. Then let's filter for this chain. Joe.doe, and there was one connection. And then this one.
Joe.doe, and there were three connections. So in total, there are six connections to the login page. All the user names are different. However, not so different. It looks like a client who forgot the user name and is trying to guess it with combinations of user name and password. So, for example, this joe.doe thinks that this user name is much more likely than this one, because this person tried three connections, three passwords, for this one and only one password for this one.

Okay, so we can also look into the same anomaly again in the IIS logs to confirm our findings. Again, the power of weighted chains shines here, because we can immediately see all six connections to the login page: three for one account, two for another and one for the third one. And as experienced analysts, we can also see that what is happening here is oursite, login, slash login, and then some URL one, some URL two. What is likely happening, as you would see if I were showing you non-anonymized data, is that the person is just trying to recover their password. Okay, with that, it's clear.

However, I also want to mention that, if you prefer, you can always switch to a different view, let's say, you can use Kibana. It's simple: you just copy it here and you paste it here. Here you will see exactly the same traffic, but represented in the native Kibana way. So feel free to do it if you prefer. However, our analysts confirm that they save time by using Lightning AI and specifically the weighted chain view.

I also want to very quickly clarify, as promised: what is so unique about the weighted chains? How was it technically constructed? It was constructed using three ideas. First, the parallel plot. However, in a parallel plot, we never show aggregated data, we show raw data. Second, we like aggregations, specifically group by. And so what we're doing here is an aggregation for every column. Third idea: in heat maps, we use color to represent the importance of the weight.
And here we do it exactly the same way. There is quite a debate among designers, some of whom say that color should not represent magnitude. However, in physics, I can tell you, heat maps are quite popular, and there is justification for this type of notation. So here we can immediately see that we are using a combination of all three of them: parallel plots, aggregations and the color of heat maps. And we also use the last column to sort everything in terms of weight. So let me unfilter this. If I were to ask what the most common chain that was executed was, I would just highlight it here and I would immediately see it was URL 0.51. It was a POST request, successful, macOS, and so on and so on. So this is why we believe that the weighted chain is something interesting, and we recommend it to you as well.

With that, we can come to some of our conclusions. We monitor the environment using multiple dimensions. For that reason, we use multiple log sources, multiple types of anomalies, and multiple baselines. We use supervised machine learning to aggregate the anomaly risk from the unsupervised machine learning jobs in order to open tickets. We also start by analyzing individual anomalies, and we use weighted chains to provide a deep dive into anomaly triage, because it saves us time and it's easy to look at everything visually.

Okay, finally I will share with you an example of a real attack that our solution detected. In total, it would have hit us with 10,000 hits from 200 IPs coming from 50 different countries. And here you can see a summary of the IPVoid reputation, in a histogram view, for these 200 IPs. Here I would like to give three takeaways. First, we didn't use IPVoid or any other enrichment tool to detect this attack. Second, even three days after the attack, when this histogram was made, some of the IPs still had a good reputation in IPVoid.
And three, no solution other than Lightning AI was able to detect this attack.

With this, I conclude. First, machine learning allows us to detect advanced threats that stay under the radar of the classical approaches. Our solution uses unsupervised learning by means of Elastic. If you don't have an Elastic license, you can use unsupervised machine learning in, for example, Spark ML. It's free and open source. It uses ensemble learning by means of scikit-learn. Again, free and open source. And it uses a Python front end and back end that provide advanced yet intuitive environment monitoring and alert triage. If you want to learn more about our experience, or if you have any questions, please feel free to reach out.

With that, I want to conclude my talk. I would like to once again thank the organizers for the opportunity. And I would like to thank every one of you for your attention. Enjoy the rest of your day and the conference. Thank you.