Its topic is how to detect phishing URLs using PySpark, by Hitesh. He is an independent security researcher. His interests lie in network security, data science, and big data. So, over to you, Hitesh.

Hello, everyone. Thanks for showing up. I mainly do work around internet threats: security, malicious code, privacy, and malware engineering kinds of efforts. If you want access to these slides, you can go to the web page now and download them yourself, in case you can't see them from the back.

So, what is this talk about? Essentially, this talk is about my attempt to solve a problem. I wouldn't yet call it a successful attempt, because it's still ongoing. The problem is detecting phishing URLs: what has been done until now to detect these malicious entities on the web, how to protect ordinary people against them, and why I made the choices I made, to use PySpark and MLlib and various other things. I am by no means a machine learning expert, so you will have to take what I say about machine learning with a grain of salt. And this is not a talk about a success story or a finished solution.

So, what in the world is phishing? Typically, I take it to be any form of credential theft: any theft where intellectual property or personal details like usernames, passwords, and credit card numbers are taken from you when you thought you were giving that data to a legitimate party and in fact you were not.

And why solve this problem? It stems from a very local issue. In my city, the police department gets about 50 complaints a day regarding phishing. Whether that comes in the form of malicious emails, or in the form of social engineering where someone calls you claiming to be from a bank and takes your credit card number, it all ties into the phishing problem. And this problem has not been solved as well as many others. For viruses and malware, you have antiviruses; for APT-style threats, there are plenty of products out there you can buy. But there is not much that stops phishing scams, for the common person at least. There is also a moral side to it: phishing takes advantage of the gullible rather than the tech savvy. I can pretty much assume that nobody here is going to be affected by phishing, because you know what you're doing. But the majority of the populace does not, and for them this problem has yet to be solved. And where it gets even harder: two news articles came out relatively close to each other reporting that the Director General of Police of Karnataka actually lost money because he fell victim to a phishing scam.

So what do we do in today's world to beat these sorts of malicious entities? We have the ever-prevalent blacklists. You can download a list of URLs every single hour that tells you these URLs are now malicious, or that they are hosting content detrimental to people in a broad sense. And then there are people who write Yara rules, if you're familiar with them, over email bodies, and say: if these words appear in an email, I can assume it is a phishing email, or something of that sort.
But there's no solution that people are really happy with yet. Google did a pretty good job with the Safe Browsing initiative, but the problem is that it only applies if you're using Chrome or some browser bundled with Google's services that can offer that sort of protection. So, for example, if I get a link on my corporate account and I have to click on it, there is no filter, no Safe Browsing, that I can leverage at that point to really make a difference. People also say, okay, let's be strict about it: we will download this beautiful list of the Alexa top 5 million domains and only allow clicks on those domains, and if anyone wants to go to any other website, we show a small warning before letting things go ahead. But that only makes it a tiny bit harder to get in; it doesn't stop the problem at all. DMARC does a great job too, and you can stop spam to a great degree, but not phishing URLs. Typically, if you know about the job hiring spree that happens every time graduates come out of school, you'll notice that people create identities like mycompanyJobInterviews2015 at gmail.com, and you can't block gmail.com from sending you email at that point. So there is nothing you can do unless you actually look at the content being clicked on and decide for yourself whether it is good or bad.

So, like any other approach, I had to start from ground zero. My ground zero was existing research on detecting phishing URLs, which led me in the machine learning direction for some time, because I had dabbled with these things, though I am, again, by no means an expert at machine learning. The most success I had, taking features from these datasets of phishing URLs and emails and running them through various classifiers, was with decision trees. That intuitively makes sense to anyone in the security industry, because we as a clan tend to think of maliciousness and benign activity in terms of rules. We say: if this happened, and this happened, and this happened, then it is bad, or it does not meet my level of comfort; if this happens, I am relatively okay with someone doing whatever they want with it. It was also expressive enough to state what I wanted to come out of such a model: hey, if we find a URL and the content of the web page has such and such entity, then we don't like it. So testing it and inspecting it by eye became very simple and easy to do.

Which brings me to my second choice, PySpark and MLlib. I tend to be a little biased towards Spark for my own reasons, but it lets me comb through a lot more web pages than I normally could. There is a very good resource for phishing URLs called PhishTank; you can get brand-new phishing URLs from there every hour, and you just run a crawler that fetches the HTML for you time and again. That left me with a pretty good dataset at this point. On a parallel note, MLlib's API resembles scikit-learn, which you all know about, which made it much easier to find documentation and cross-reference to see whether I was on the right track.
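As an illustration of the crawling step described above, here is a minimal sketch. The PhishTank feed URL, the CSV column name, the output directory, and the timeouts are assumptions for illustration, not details the speaker gave.

    # Minimal sketch of an hourly PhishTank crawler (assumed feed URL, column name, paths).
    import csv
    import io
    import os
    import hashlib
    import requests

    FEED_URL = "http://data.phishtank.com/data/online-valid.csv"  # assumed public feed
    OUT_DIR = "phish_html"                                         # assumed output directory

    def fetch_feed():
        """Download the current list of verified phishing URLs."""
        resp = requests.get(FEED_URL, timeout=30)
        resp.raise_for_status()
        return [row["url"] for row in csv.DictReader(io.StringIO(resp.text))]

    def crawl(urls):
        """Fetch the HTML of each phishing URL and store it on disk."""
        os.makedirs(OUT_DIR, exist_ok=True)
        for url in urls:
            name = hashlib.sha1(url.encode()).hexdigest() + ".html"
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # phishing pages disappear quickly; skip dead ones
            with open(os.path.join(OUT_DIR, name), "w", encoding="utf-8") as f:
                f.write(html)

    if __name__ == "__main__":
        crawl(fetch_feed())

Run hourly (for example from cron), this keeps adding fresh phishing HTML to the corpus that the later feature extraction works on.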
So, for this tiny experiment of mine, I gathered about 12 gigabytes of web pages, which doesn't seem like a lot, but then I realized that parsing HTML and extracting features from every single web page is a pain point. And when you have a large pile of web pages with about 5,000 to 10,000 being added to it every day, you realize why you need a cluster computing engine like Spark, and can't just write a for loop and go to bed. Which brings me to the point that I also did not want to roll my own multiprocessing framework, where I say this process consumes this section of the web pages and that process consumes that section, and I also wanted to bring the features together at the end; it is not a wheel I wanted to reinvent. Also, you can save a model in Spark and load it somewhere else, which makes deployment easier now. I don't know why they had to wait until 1.4 to do that, but whatever.

So what are the features that actually work? Typically, dynamic DNS domains are used mostly for malicious activity, so if your machine is talking to a dynamic DNS domain you have not explicitly gone to, I can assure you it is not something good; it is most likely some bot trying to contact its operator. You also never tend to visit a raw IP address in normal situations, because you would typically go to Google or the search engine of your choice, type in a search query, and click a link, so you are never really interacting with IP addresses directly. These are the things I thought would be indicators of phishing, because these phishing pages are very short-lived, lasting only until the hosting provider realizes that the content is causing more harm than anything else and takes it down.

But the crux of everything is in the dynamic part, because the only way humans detect phishing pages is that we look at the web page, we look at the URL, and we say: you know what, this Yahoo! logo stopped shipping in 2002, how can they still have a web page with that logo on it? Or you may see that the page doesn't load properly, or that it shows certain errors. These are the mistakes phishing attackers make, and they are easy for you and me to catch, but it is very difficult to convince an algorithm to look at a logo and tell you which logo it is, short of opening the page in a separate browser to figure out whether it is genuine. A small counterpoint to this is SSL and certificate pinning, but most people don't pay attention to whether they are on a pinned HTTPS site when they go to Google.com; if they see Google's logo and a prompt saying "enter your email address", they will happily type it out. But the closest thing to a bulletproof signal came from exactly that: the moment you know that a form asks for an email address and a password, and the POST request of that form is not actually going to Google.com or whichever genuine service, then you definitely know this is not good for anyone. We leverage those kinds of features in MLlib: we take about 10 or 12 features, put them into a vector of true/false values, a feature vector, and let the model train. It gives you a very beautiful tree that says if this, then malicious, if not, then benign, and so on and so forth.
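Here is a minimal sketch of what that training step can look like with MLlib's RDD API, following the standard decision tree example from the Spark documentation. The input path and the idea that features were already extracted into (label, [0/1, ...]) records are assumptions; the extraction itself is not shown here.

    # Minimal sketch of training a decision tree on binary phishing features with MLlib
    # (Spark 1.4-era RDD API). "features/" is an assumed path to pre-extracted records
    # of the form (label, [0/1, 0/1, ...]), where label 1.0 = phishing, 0.0 = benign.
    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    sc = SparkContext(appName="phish-detect")

    records = sc.pickleFile("features/")                      # assumed pre-built feature RDD
    data = records.map(lambda r: LabeledPoint(r[0], r[1]))
    train, test = data.randomSplit([0.7, 0.3])

    model = DecisionTree.trainClassifier(
        train, numClasses=2, categoricalFeaturesInfo={},
        impurity="gini", maxDepth=5)

    # Evaluate: fraction of held-out pages classified correctly
    predictions = model.predict(test.map(lambda p: p.features))
    labels_and_preds = test.map(lambda p: p.label).zip(predictions)
    accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
    print("accuracy:", accuracy)

    # The human-readable if/else tree the talk describes
    print(model.toDebugString())

The toDebugString output is the "if this, then malicious" tree mentioned above, which is what makes the result easy to inspect by hand.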
For someone who doesn't know a lot about decision trees: the training also tells me which feature is the most useless. If I claim that having a form with a password field is a useful feature for detecting phishing, but that feature is present in benign pages and phishing pages with roughly equal probability, then it is useless, because it gives me no ability to distinguish between the two sets, in which case I might as well throw it out. The algorithm does this over and over, discarding the bad splits, until it finds the combination of features that gives the best fit, so to speak.

What came out surprised me at the beginning, because once the model was trained and I ran it on test data, it classified about 99% of those web pages correctly, which seemed too good to be true. The catch is the data itself: if you want a benign example of the Gmail login page, you will only ever get one of it, from Google.com, but if you go looking for phishing pages of the same thing, you will find hundreds. So the dataset of benign pages is far smaller than the dataset of malicious pages, and you cannot really get more benign pages, because there are only so many services in the world you have to defend. And there were a lot of false positives, about 35 to 37%, when you gave it benign URLs and they came back misclassified as malicious.

That got quickly offset with some whitelisting. Typically, barring some edge cases, you will be okay if you are going to Google.com, or let's say docs.google.com; there are cases where things might go wrong, but you are relatively safe. So offsetting these kinds of things with a whitelist, some mechanism of trusted sites, gives far better results, and if we can incorporate something like Safe Browsing, hopefully we get to do that if Google releases it as open source, we can do much, much better.

We also quickly realized that having Alexa as part of the feature set, meaning a feature that says this URL is in the Alexa list, does not help at all, because dyndns.org itself appears in the Alexa list, in which case all the dynamic DNS domains go out the window right there. You might also say: hey, these phishing URLs come up on very, very new domains; someone might register a domain last week and use it for something malicious this week, so give the domain name a reputation score and decide whether it is good or bad. That again does not work, because people register and keep domain names for a long time, or use dynamic DNS services, or the attackers may have moved from malware-like activity to phishing-like activity, and you cannot really judge what is going on just from the domain's reputation. And there is still a small problem: every time you need to classify a web page, you have to actually fetch its HTML, figure out what features it has, send them to the model, and get the answer back.
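The whitelist offset mentioned above can sit in front of the model as a simple pre-filter. A minimal sketch follows; the trusted-domains file, the suffix-matching rule, and the classify() wrapper are assumptions made for illustration, not the speaker's exact mechanism.

    # Minimal sketch of a whitelist pre-filter in front of the classifier.
    # trusted_domains.txt is an assumed file of known-good domains (e.g. google.com).
    from urllib.parse import urlparse

    with open("trusted_domains.txt") as f:
        TRUSTED = {line.strip().lower() for line in f if line.strip()}

    def is_whitelisted(url):
        """True if the URL's host is a trusted domain or a subdomain of one."""
        host = (urlparse(url).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in TRUSTED)

    def classify(url, model_predict):
        """Skip the model entirely for whitelisted hosts; everything else goes to the model."""
        if is_whitelisted(url):
            return "benign"
        return "phishing" if model_predict(url) == 1.0 else "benign"

Only URLs that fall outside the trusted set ever reach the model, which is where the false positives on well-known benign sites were coming from.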
Which is a good and a bad thing, because if you have a browser extension that can compute these features and send them up to some service, the lookup takes almost no time at all; once the model is loaded and listening on some port, it is very quick to make that decision. So the idea is to extend this approach so that the features are computed locally by whoever has the anti-phishing extension, only the feature vector comes to us, and we just reply with "don't click on this" or "click on this". That would help a lot of ordinary people in the world stay protected against phishing attacks.

There are a couple of other problems that arise out of this approach. Once you put this out in the wild, as with any problem in security, attackers will find ways to subvert your feature-gathering capability: if you are looking for a password field, instead of putting up a form they might put two text inputs and do crazy things to evade your feature scanning, which brings you back to square one, because, as they rightfully say, the defender has to be right 100% of the time and the attacker only needs to get lucky once. So that is going to be an ongoing problem. Also, getting new pages every now and again requires dedicated infrastructure to crawl these web pages time and again and to update your model every single time. It is still not as bulletproof as a human looking at it, but it is better than most approaches out there. Another problem is that Spark does not have a serving API, so you can't just take a model and say: accept any input that comes in on port 80, tunnel it to this model, and write the response to a database. You still can't do that, so you have to find hacks, writing tiny pipes between things and saying, okay, you talk to this one and you talk to that one, and it gets messy really fast. But again, we will deal with such problems as they come up.

That is a rough sketch of detecting phishing URLs, and the technique doesn't apply just to URLs but also to emails, because the same way you extract features from HTML, you can extract features from email communications as well, whether that is an Exchange email or whatever else. That is basically the approach we have taken until now. It is a mix of static features and dynamic features: when I say static features, I mean features that rely only on the URL, just the text of the URL, nothing to do with the content of the web page; and then there are features that rely on the HTML of the page, which means we are actually looking at the content and deciding for ourselves whether it is a legitimate page or not.

So this is where we are now, and that is all I have to share. If you have approaches you have tried and tested, or if you would like to contribute, you are welcome to take the dataset or the code. Once it meets some measure of quality it will be open source, since the point is to have it accessible and usable by everyone in the world who can fall victim to phishing. I can take any questions you have about security or phishing. Any questions?
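One workaround for the missing serving API, along the lines described above, is to save the trained model and wrap a tiny server around it yourself. A minimal sketch follows; the model path, port number, and the comma-separated wire format are assumptions, and a real deployment would need proper framing and error handling.

    # Minimal sketch: load a saved MLlib decision tree and answer feature-vector
    # lookups on a port. Path, port, and wire format are assumed for illustration.
    import socket
    from pyspark import SparkContext
    from pyspark.mllib.tree import DecisionTreeModel

    sc = SparkContext(appName="phish-serve")
    model = DecisionTreeModel.load(sc, "models/phish_tree")   # saved earlier via model.save(sc, path)

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("0.0.0.0", 9999))
    server.listen(5)

    while True:
        conn, _ = server.accept()
        line = conn.recv(4096).decode().strip()        # e.g. "1,0,0,1,0,1,0,0,1,0"
        features = [float(x) for x in line.split(",")]
        verdict = "malicious" if model.predict(features) == 1.0 else "benign"
        conn.sendall(verdict.encode())
        conn.close()

This is exactly the "tiny pipes between things" style of hack the talk mentions: the extension computes the features, ships the vector over the socket, and gets a one-word verdict back.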
So, as you said, most of this requires you to parse the content of the page and then decide, right? Have you tried the approach of playing with plugins on reverse proxies or that kind of thing? That helps, because as a reverse proxy you see all the requests going through and the responses coming back, and it is very easy, like with a honeypot, to identify the patterns you want to learn and then block them or do other things. There are certain things we have done in Paa, so I'd be happy to interact after your talk and explain which tools we have used for marking ham or spam in email and other stuff, which can be used for phishing as well with a reverse proxy.

Yeah, so there are two things. The reverse proxy is useful only when you know where the traffic is going to pass. For example, here at PyCon I know all my web requests are going to be routed through some server, and I can put defensive measures there. The second alternative is to do this inside a browser extension or something like that, because browser extensions are really lightweight, they run per computer regardless of where that computer is, and they already have access to the HTML of the page, which means you don't have to re-examine the page in transit. Extracting features will also be simpler per computer in the browser than at a gateway or something of that sort. So we've tried those two approaches, yes.

I meant only for the learning part: like a honeypot, you put up the reverse proxy, learn those patterns, and then put those features in your browser extension or somewhere else.

True, I agree. That is definitely something that can be done.

What are the features that you used to classify? Is it mostly the textual content, or features of the DNS, or which features?

All the features are binary except for one. One feature is whether the domain name in the URL is the same domain the POST request of the form is going to. Another is this: if someone is phishing for eBay credentials, you will traditionally see a URL that looks something like signin.ebay.com.hello.xyz.somebaddomain.ru, and in that case you know the actual registered domain is not ebay.com, it is just dressing itself up to look like ebay.com. So we try to find these brands in the URL and say, okay, if we find a legitimate brand name inside the URL, then true; that is another feature. I'd be glad to share the full list with you, because there are 15 and I can't remember them all off the top of my head right now. There are a bunch of these, and we can talk about all the features offline.

Hi, sorry, this is a machine learning question. Just out of curiosity, which techniques other than decision trees have you tried out, and what made you zero in on decision trees?

I have tried everything that comes in the MLlib package, everything from logistic regression to the SVM classifiers, all of them. I found that decision trees are good.
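For the two features just described, the form-POST target versus the page's own domain, and a known brand appearing in the hostname, a minimal sketch might look like the following. The brand list, the helper names, the crude registered-domain guess, and the use of BeautifulSoup are all assumptions for illustration, not the speaker's actual code.

    # Minimal sketch of two of the binary features described above.
    from urllib.parse import urlparse
    from bs4 import BeautifulSoup

    BRANDS = {"ebay", "paypal", "google", "yahoo"}   # assumed list of protected brands

    def post_target_matches_page(page_url, html):
        """True if every form on the page posts back to the domain the page lives on."""
        page_host = urlparse(page_url).hostname or ""
        for form in BeautifulSoup(html, "html.parser").find_all("form"):
            action_host = urlparse(form.get("action") or "").hostname
            if action_host and action_host != page_host:
                return False                          # credentials leave for another domain
        return True

    def brand_in_hostname(page_url):
        """True if a known brand appears in the hostname but is not the registered domain,
        e.g. signin.ebay.com.hello.xyz.somebaddomain.ru."""
        host = (urlparse(page_url).hostname or "").lower()
        registered = ".".join(host.split(".")[-2:])   # crude registered-domain guess
        return any(b in host and b not in registered for b in BRANDS)

Each function yields one true/false entry in the feature vector that the decision tree is trained on.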
I have since discovered that even random forests give somewhat similar performance, but I haven't had a chance to dig into random forests and see what they do. Spark doesn't require you to change much when switching classifiers, so I could basically run all of them and just check at the end which gave the best detection rate, and decision trees turned out to be the one.

Right, and classification accuracy was the only metric you were looking at when deciding between them?

Yes, that is what I am aiming for from a detection efficacy point of view, because my problem is to classify something as malicious when it is malicious. To me, at least, classification is the primary goal. I realize there is going to be a false positive rate and a false negative rate, but...

But in this case it doesn't really hurt anyone.

Yeah, not to me. I am not doing that right now, and I have found out why I am not doing it. Typically, if you go to, say, login.gmail.com, that page does not show up consistently from one visit to the next. It may be one thing today and something else tomorrow; not the viewing experience, I am talking about the underlying JavaScript and HTML that comes with it. It also means I could only protect people against web pages I have previously seen benign versions of. For example, if I have the page for ICICI Bank's web login interface and I don't have it for IDBI Bank, then when I get a phishing page for IDBI Bank I have nothing to compare it against. So that seemed like a bottleneck that would not scale, but I am happy to be proven wrong on this.

I have one question. Apache Spark supports Scala, Python, and Java.

And R.

And R, okay. So which language are you using, and which is the most fascinating for you?

Spark cheats a little bit, because when you use PySpark it is not implementing everything natively in Python. There is a gateway of sorts, like slipping a cheque to someone through a gap under the door: it ships tasks down underneath, and everything actually runs on the JVM rather than on the Python interpreter. But yes, by far, the experience of using Python is better; I don't know even the first line of R. So the experience of using Spark as a cluster computing framework, and as a means of processing large datasets very quickly, is much easier in Python than anything else.

Thanks, Hitesh. Thank you.