Our next speaker — they say you can't cheat time, but he can find foes, and you, with latency. Try, try trilateration. Let's give him a round of applause.

Thank you. Hello, mates. Thank you for being here and joining me for this talk — my first talk — and it's about geolocation. "You can't cheat time" is a quote from Doctor Who, as you can see from the TARDIS next to it. I'm Lorenzo, call me Lopok if you want; you can find me on Twitter, and the code I'm releasing today is on my GitHub. So let's talk about geolocation. Geolocation has always been important for humans, and it has always been a driver of technical progress: the sextant, mechanical clocks, GPS. It matters in our realm too, the cyber realm, the internet. But now we face some limitations. Almost everyone knows the large IP-based geolocation services — I won't name them all, but you can type in an IP and find, with a certain precision, where it is. Due to the massive use of cloud providers and CDNs, we can no longer rely on them: you just get one or more anycast IPs, and you do not know where they actually are. At best you find the location of the company that owns the subnet, nothing more. So this is what has driven my work: looking through the cloud curtain — answering the geolocation question in the cloud-provider world. Some theory before we start. It's mathematical physics, but don't be scared, it's very simple. The first piece: distance is speed times time. This is a first-order estimate of distance starting from a time. Very easy, but keep it in mind — it's important. The second thing is obvious, but keep it in mind too: nothing can move faster than the speed of light. Even if we are hackers, computer scientists, enthusiasts — we cannot break this law.
This is a law of this universe — maybe in another one the speed is different — but we have to remember that nothing can travel faster than the speed of light. Not even information, not even things on the internet. Some terminology now: latency versus speed. We sometimes confuse them. Maybe the people who know the difference best are video gamers, because latency is the lag: it is a time measure, in milliseconds or more. Speed, instead, is an amount of data — an amount of whatever — over a given amount of time. So keep this in mind, and remember the speed-of-light limit: latency cannot be arbitrarily low, because it is the time information takes to travel from one point to another. You cannot make information travel from Sydney to San Diego in two milliseconds. It's impossible; this universe doesn't allow it. So obey the rules. Now, trilateration. Trilateration is a well-known method for estimating an unknown position starting from known points. It's commonly used for GPS, for tracking smartphones (okay, not a very good use), for locating rogue radio stations, and so on. It's a very well-known method. For this recipe we need: points — fixed points from which we perform our measurements, whose exact positions we know (radio towers and so on; we will discuss what we use for our problem); distances — a method for estimating the distance between the unknown target and these points; and obviously the target, which is what we are looking for. That's the simple recipe for trilateration. Then we have another important and powerful mathematical tool, a minimizer: we feed it the data we have collected and inferred — points and distances — and it outputs the best candidate positions for the target. I won't dive into the minimizer here, but you can. Trilateration has been used for tracking people, even by dating apps.
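The recipe above — known points, measured distances, and a minimizer over the residuals — can be sketched in a few lines. This is a minimal, hypothetical example on a flat plane (the real method works on geographic coordinates); the coordinates and distances here are made up purely to show the mechanics.

```python
# Trilateration as a minimization problem: find the point whose distances
# to the known vantage points best match the measured distances.
# Toy flat-plane coordinates; not the talk's real dataset.
import numpy as np
from scipy.optimize import minimize

# Known measurement points (x, y).
points = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
# Measured distances to the unknown target (here the target is at (3, 4)).
dists = np.linalg.norm(points - np.array([3.0, 4.0]), axis=1)

def residual(p):
    # Sum of squared differences between hypothesized and measured distances.
    return np.sum((np.linalg.norm(points - p, axis=1) - dists) ** 2)

# Start the minimizer from an arbitrary guess; it converges to the target.
est = minimize(residual, x0=np.array([5.0, 5.0])).x
print(est)  # ≈ [3. 4.]
```

With noisy, real-world distances the residual never reaches zero, and the minimizer instead returns the best-fitting point — which is exactly why it works with the imprecise, latency-derived distances discussed later.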
It's not actually a knack because sharing data is something that is up to the user but has been exploited for tracking people. So the system is the same. Fixed points so where you have your measure points so your client and the target, the unknown target and then do some trilateration. Well, let's do some practice now. This is a ping. Almost anyone here have used ping at least one time in his life and it's the first approach to latency. Ping is a round-trip time measure from a client and a server and it gives us as output many parameters like the average time and so on and by using it's a very naive measure but by using it we can infer some important information. For instance, in this example, we have the minimum round-trip about 55 milliseconds and combining it with the upper bound of speed in this universe so the speed of light we can assure, we can literally sign with our blood that the target cannot be farther than 8.300 kilometers. So these are very important things but okay, surprise, surprise, it's useless. It's useless because two things. Services behind clouds are not IP services and so you cannot use ICMP protocol for reaching them. You will only reach the ingress point of the CDN so it's useless and will be the nearest to you. Second thing, internet is not linear like speed per time. Speed per time is a linear function so it's almost useless. Internet is a fucking large amount of nodes interconnected in every way is possible and governed by way protocol like BGP. I love BGP but it's another story and so it's not linear at all. And all of this is projected over a spherical body in the universe so it's not linear at all. We need to find a better model for estimating a distance starting from a latency. So I've used machine learning so anyone use machine learning today but I used it for estimating starting from a latency the actual distance more or less. Machine learning requires many data. 
I collected around 40k measurements — source/destination pairs — each tagged with a country code (source and destination country codes, actually), and trained on the distance between source and destination computed via the haversine function. So as features we have the country codes, used as a bag of words — for those who know how these machine-learning, data-science things work — plus the latency times; and as output, the real distance. I used an SVR, a support vector regression model; it's very simple to use, and I'll skip over the details because we are running out of time. Now recall the recipe. We need points. The best way to have many points around the world is to use a cloud provider: choose your favorite ones and deploy a few lines of code on each of them. We are talking about cloud-fronted services, like command and control, so we cannot use ICMP anymore; we have to measure latency at the application layer. So I used curl, for the HTTP and HTTPS protocols. We deploy these lines of code on each point we choose; we compute the latency — it's a simple instruction, and the manual explains it perfectly — and then we compute the distance using our machine-learning model. At this point we have an array: an array of points that are data centers, whose coordinates you can find via OSINT, or by googling them, or whatever; the distances inferred via the machine-learning model; and the unknown point we are looking for. Now we do the magic with some math: we have to minimize all of this. I'll go fast over these two slides, but it is a minimization over the residuals between the hypothetical position and the distances we have collected — the ones inferred earlier with the machine-learning model. I don't know if that's clear — yes? And you will get many outputs.
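The feature setup described above — country codes treated as categorical "bag of words" features alongside the raw latency, regressed onto a great-circle distance with an SVR — can be sketched with scikit-learn. Everything below is a toy: the country codes, latencies, and distances are invented stand-ins for the talk's real 40k-measurement dataset, and the hyperparameters are guesses, not the author's.

```python
# Toy sketch of the regression setup: (src country, dst country, latency)
# -> distance in km, with one-hot country codes plus latency as features
# and a support vector regressor, as in the talk. Synthetic data only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

train = pd.DataFrame({
    "src": ["IT", "IT", "DE", "IT", "IT", "DE"],
    "dst": ["DE", "US", "US", "DE", "US", "US"],
    "latency_ms": [15.0, 95.0, 90.0, 17.0, 99.0, 88.0],
})
# Ground-truth great-circle distances (made up for the sketch).
dist_km = [1000.0, 7500.0, 7000.0, 1050.0, 7600.0, 6900.0]

model = Pipeline([
    ("prep", ColumnTransformer(
        [("cc", OneHotEncoder(handle_unknown="ignore"), ["src", "dst"])],
        remainder="passthrough")),              # latency passes through as-is
    ("scale", StandardScaler(with_mean=False)),  # works on sparse one-hot output
    ("svr", SVR(kernel="rbf", C=10_000.0)),
])
model.fit(train, dist_km)

near = model.predict(pd.DataFrame(
    {"src": ["IT"], "dst": ["DE"], "latency_ms": [16.0]}))[0]
far = model.predict(pd.DataFrame(
    {"src": ["IT"], "dst": ["US"], "latency_ms": [97.0]}))[0]
print(near, far)  # the IT->US prediction should be far larger
```

The point of the country-code features is that they let the model learn per-region quirks of the network (routing detours, last-mile delays) instead of assuming one global latency-to-distance curve.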
I'll use a small set just to make it easier to understand. This is a cloud of possible points, with its centroid just west of the UK — and the server was actually in London. So it works. The yellow circle is the speed-of-light upper-bound radius: we know the point cannot be outside that circle. So this is already very good for a purely latency-based method of inferring a position. But let's do something cyber. The main thing I was working on is command and control behind cloud fronting. At the moment, anyone working in threat intel who comes across a fronted command-and-control server, or a malicious site serving live malware, raises their hands and says: we can do nothing, we cannot say anything about which part of the world the service is or isn't in. Unless you call the cloud provider — and good luck with that; if you've tried, let me know. But surprise: until now it was impossible, and with this method I got a good result. And it is not precise at all — I could have shown a more precise example, but I wanted to make the point that this is not a kilometer-precision method. Here we have a real service behind CloudFront, located in Rome; using only latency measurements, and no more than a dozen measurement points, we get this fancy heat map and an error of about 600 kilometers. In Europe that's a lot, but imagine the rest of the world — I think it's a good result. Please appreciate it. Something more we can use this method for: sandbox detection. Sandboxes often have a fake net behind them — a system that mimics the real behavior of a network, of the internet. But it is very difficult to cheat time, to mimic the behavior of latency toward a set of points. So we can embed our model inside the malware, along with a number of fixed points — embedded in it, with their target countries and everything I described before.
And in the presence of a fake net, latencies will be only a few milliseconds, so the result will be, graphically, a set of non-overlapping circles: a mathematically impossible solution. It's impossible — there is no point that minimizes the function we saw before. One could even add some random latency to the fake net; I don't know if anyone has worked on that. We could try, but I don't think it would work better than a real network. The other thing, similar to sandbox detection, is malware self-geolocation: you can ship a malware with this model and some fixed points inside, plus an automatic process for determining which area of the world it is running in, and then turn some features of the malware on or off accordingly. I think that's a nice thing to do. Is it perfect at the moment? No. Is it bulletproof? Absolutely not — but it's promising. I want to be very clear: I don't want a wow effect. This is early-stage research, so I'm being very frank about it. Then I have a demo, and here it is. Is it full screen? No. Okay. Sorry. Okay. This is the notebook — a Jupyter notebook that you will find on my GitHub. It's a PoC for this, and I ship it with my pretrained model, though I will have to retrain it. It starts with the measurements, which is a long process, and then it estimates an offset. That's important because of the inherent latency of the computation — the computational time of the server we are reaching. So we first determine an offset for all the latency measurements and subtract it from the measured latencies; then we estimate distances with our machine-learning model, and these are the distances. This is the main process. Trilateration is done with a method — we can discuss it off the record — for avoiding some outliers and so on. At this point we already have the heat map, but it's not plotted yet. We find the largest cluster of these points, and we even compute an estimate of the error. Then it's plotting time, and here it is.
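The "non-overlapping circles" sandbox check can be made concrete: if the vantage points are thousands of kilometers apart but every measured latency implies only a tiny radius, no single position on Earth can satisfy all the constraints. A minimal sketch, on toy flat-plane coordinates (the real check would use geographic points and the learned distance model):

```python
# Fake-net detection sketch: the measured distances define a circle around
# each vantage point; if some pair of circles is too far apart to intersect,
# no real position explains the latencies -> likely a sandboxed network.
import math

def circles_feasible(centers, radii):
    """False if any pair of distance circles cannot possibly overlap."""
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = math.dist(centers[i], centers[j])
            if d > radii[i] + radii[j]:
                return False  # disjoint circles: geometrically impossible
    return True

# Vantage points thousands of km apart (toy coordinates in km)...
centers = [(0.0, 0.0), (8000.0, 0.0), (0.0, 9000.0)]
# ...but a fake net answers every probe in ~2 ms, so each radius is tiny.
fake_radii = [300.0, 300.0, 300.0]
real_radii = [5000.0, 6000.0, 7000.0]

print(circles_feasible(centers, fake_radii))  # False -> likely sandbox
print(circles_feasible(centers, real_radii))  # True  -> plausible geometry
```

This is the same infeasibility the minimizer runs into: when the circles are disjoint, the residual stays large no matter which candidate point is tried.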
No popping calc like my exploit friends, but it's nice. Wait. Okay. So this is one of the examples from the slides — a real one — and thank you for this. I'm done two minutes early, so if some of you want to ask questions...

Okay, BGP is a big mystery for me and for almost everyone, but I came across the fact that BGP is more stable than we are used to thinking. And the machine-learning model — thanks to the large (not enormous, but large) amount of data collected — responds very well even over a few weeks or months. Maybe combining it with looking glasses, BGP status and so on would be even better, but I assure you it's more stable than I expected.

Sorry, I can't hear you. Okay, artificial latency. Yes, you can introduce artificial latencies into your services, and it would help; but if you collect a large number of measurements, not just one, you will find the true value. You will have a distribution — like a Gaussian, with the random latency artificially added on top — and then you can use the smallest value, or the mean value, and use the system in the same way, just with slightly modified preconditioning of the data.

I've tried — well, I'm not good at machine learning, I'm a physicist actually — I went with SVR. It's stable, I know it well, and you can control it well. But why not try some other models? Okay. Thank you, guys, girls, mates.
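The artificial-latency countermeasure discussed in that last answer rests on a simple statistical fact: random added delay only ever increases the measured time, so across many probes the minimum converges back toward the true latency. A quick sketch with simulated probes (the numbers are illustrative, not from the talk's dataset):

```python
# Defeating random artificial delay by sampling many times: each probe is
# true latency + a random added delay, so the minimum over many probes
# approaches the true latency from above.
import random

random.seed(7)  # fixed seed so the sketch is reproducible
true_latency_ms = 40.0

# A defended service adds 0-50 ms of random delay to every probe.
samples = [true_latency_ms + random.uniform(0, 50) for _ in range(200)]

best_guess = min(samples)  # converges toward true_latency_ms as samples grow
print(round(best_guess, 1))  # just above 40.0
```

A low percentile works too when individual probes are noisy; either way, the rest of the pipeline — the distance model and the minimizer — runs unchanged on the cleaned-up value.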