Hi, everyone. I have a lot of stuff to get through and not much time, so I'm just going to start now, even though everyone is still making their way to their seats. I'm Ryan Mitchell. I know, the name is really weird for me, too. Surprise! This is Separating the Bots from the Humans.

So who am I, other than my name? I'm a software engineer. I work at a startup in Boston, where I write their API, database architecture, backend: Java stuff. I'm also the author of two books that have nothing to do with my day job: Web Scraping with Python, and Instant Web Scraping with Java, which came out a couple of years ago. I'm an engineering grad from Olin College of Engineering (go Phoenixes), and I've been taking night classes for a few years now; I'm graduating with my master's in software engineering in 2016.

A brief history of this talk: around June of last year, I submitted a book proposal to O'Reilly for a hacking book. But O'Reilly doesn't really publish hacking books, so I couldn't call it a hacking book. Instead I had to call it web scraping. And Python is a really popular language with the kids these days, and it's a great web scraping language, so I called it Web Scraping with Python. I did all the market research, did the chapter outlines, put it together in a proposal, and sent it to them. And they accepted it. They accepted my hacking book. And there's a lot of hacking in this book. Thank you.

So there's a lot of cool hacking in this book, and I don't have a whole lot of time, so I'm just going to focus on the first step: separating the bots from the humans. Because if you're a web administrator, what do you want to do? You want to figure out who's the human and who's the bot, so you can stop the bots and send them CAPTCHAs and cease-and-desist letters and block their IP addresses and all of that stuff. And obviously, if you're a hacker (I mean, web scraper), you want to keep the administrators from stopping you. So it's this constant back and forth. This talk could be called separating the bots from the humans, but it could also be called how to look like a human when you're a bot. And as we've seen from the movies, this is very important.

So, we're at DEF CON; hopefully you've heard of web scrapers before. You just curl, wget, urlretrieve, whatever, some remote file, usually an HTML file or an image, from a remote web server, and bring it back to your local machine. You can download it, you can use it in memory, you can parse it, you can do whatever you want with it, and then you can make more HTTP requests: GET, POST, PUT, DELETE, anything you want. Web scrapers can be really dumb, or they can be surprisingly smart. They can use actual web browsers. They can take their sweet time.
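To make that concrete, here's a minimal sketch of that fetch-and-parse loop in Python. The URL is just a placeholder, and the BeautifulSoup (bs4) package is my assumption for the parsing step; you could just as easily regex the raw HTML or save it to disk:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Fetch a remote HTML file and bring it back to the local machine
    html = urlopen("http://example.com").read()

    # Parse it in memory; here we just pull out every link on the page,
    # each of which could become another HTTP request
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a"):
        print(link.get("href"))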
But we're going to start off with the defense side of things, so I'm going to go through all of the things that people try to do to stop bots. Some of them are actually not as stupid as others, but let's start with some of the really stupid ones.

So, robots.txt. It's obviously completely legally unenforceable, and not anyone's standard of anything. It's called the robots exclusion standard, but the IETF doesn't recognize it, the government doesn't recognize it. It's like you and your friends got together and decided to have this cool little robots.txt file that you'll all follow, and one of your friends happens to be Google. So it's a little bit nice for opting out of their indexing, but other than that, it's completely useless for actually blocking bots, other than the good ones who want to follow it.

Terms of service. This can actually be slightly more useful, but only in specific circumstances, and those circumstances are usually when you and the person who wrote the bot end up in court. If you have to click "agree" to the terms of service, then it's sort of a contract, so you should probably be really careful about scraping the site, and follow them. But if they're just linked at the bottom of the page, go for it, because that's not enforceable at all. In fact, they have to actually prevent you from accessing the site (send you cease-and-desist letters, block your IP address, that kind of thing); they can't just sue you over terms of service that happened to be sitting on the site.

So, headers. "Totally not a bot, I promise." You can change your headers in like two lines of code. You can be someone on a Mac using Chrome, or you can be the Googlebot; totally up to you. It might actually be kind of interesting to surf the web as the Googlebot. And although headers are completely useless as a defense, most major websites do check them. So if you go to Amazon or Google and you're not a regular Mac user using Chrome, but you're actually "Python 3.4 bot," they will send you a 403 Forbidden. It only takes a couple of lines of code to get around, really easy.
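Here's roughly what those two lines look like, as a sketch using the requests library; the User-Agent string is just one plausible Mac-plus-Chrome identity, nothing magic about it:

    import requests

    # Without this, requests announces itself as something like
    # "python-requests/2.x", which many sites will 403 on sight
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/45.0.2454.101 Safari/537.36"
    }
    response = requests.get("http://example.com", headers=headers)
    print(response.status_code)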
JavaScript. I know we all have our opinions about JavaScript; I'm not going to sway anyone here. You're now taking code execution away from your server, where you have a controlled environment, and giving it to the client. Sometimes problems happen there: you don't quite know what's going on, and it makes your site slightly less usable for people. But you're also marginally protected against bots. Of course, most bots now can execute JavaScript; I'm going to show that off in a minute.

Embedding text in images: don't do that. Your site isn't usable by anyone anymore. Screen readers for the blind can't use it, it's not resizable, just don't try it. And if the text is actually readable by most humans, then it's very easy to OCR too, so you're not stopping anyone at all.

CAPTCHAs. I actually kind of like CAPTCHAs a little bit; we're getting better here. Most CAPTCHAs are breakable. If you're going to use a CAPTCHA, make sure it's either really, really good, like Google's reCAPTCHA, or so obscure that no one uses it and no one's going to bother breaking it. But they can be pretty effective in most cases.

Behavioral patterns are really where the future of this is at. The Google reCAPTCHA, if you haven't seen it, here's a demo. And I've been doing this so often that it's probably going to mark me as a bot; if you do it too many times, it starts giving you challenges. Oh, I did it. But if I refresh it and click again right away... see, now I haven't moved my mouse around enough for it to be convinced that I'm a human: I just held the mouse over the button, refreshed the page, bam, click. That's the kind of thing a robot would do. So now it wants me to select all the kayak images. Did I get them? Hey, okay. So, behavioral patterns: if someone isn't moving over the page appropriately, if they're loading things too fast, typing too fast, not scrolling down the page, that could be an interesting behavioral clue that they might be a bot.

Honeypots. "Honeypot" is a word that's used a lot in security, but with web scraping, I tend to think of a honeypot as a DOM element that follows three rules. Humans can't see it; otherwise they would get honeypotted. Bots can see it; otherwise it wouldn't work. And bots think that humans can see it; otherwise they'd go, "oh, obvious honeypot, I'm going to avoid this." It's also very important to add the honeypot to robots.txt. This serves a couple of purposes. One, you have the bad robots saying, "ooh, robots.txt tells me not to go there, I'm going to go there": honeypotted, right? And the other is that the Googlebot won't go to it, so you won't accidentally block them.

Now is the fun part. We're going to go over optical character recognition for breaking CAPTCHAs, JavaScript execution for doing all the other fun things, and honeypot avoidance, which uses JavaScript execution.

So, optical character recognition. I have this nice page at pythonscraping.com, and you can leave comments on it if you feel like it. There's a CAPTCHA here that we're going to look at. You can see a couple of things about this CAPTCHA. There's this prefix thing here, and if I change it, it gives me a new CAPTCHA. Most of the backgrounds are gray (I've spent a lot of time with this CAPTCHA); it occasionally has the red brick one, which I haven't even bothered trying to handle. The thing is, you don't have to solve all of them, you just have to solve most of them. But notice it has this gray background, if it loaded all the way, and it also has these sort of blue lines going through it. And we can build an image getter that uses the Python Imaging Library to go and clean this image. So you have functions like: is this pixel basically gray? If so, take it out. You can go around the perimeter of the image and see where the first non-gray pixels are, and that's generally these lines here. By following these rules and cleaning the image, you can get really nice-looking results like this, and this. And then you can start to do things with it.
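As a rough sketch of what that "basically gray" rule looks like with the Python Imaging Library (the Pillow fork): the tolerance and brightness thresholds here are made-up numbers you'd tune against your own CAPTCHA, and the perimeter-tracing trick for removing the blue lines is left out:

    from PIL import Image

    def is_background(pixel, tolerance=10, brightness=100):
        # The gray background has R, G, and B close together and fairly
        # light; dark text pixels survive even though black is "gray" too
        r, g, b = pixel[:3]
        spread = max(r, g, b) - min(r, g, b)
        return spread <= tolerance and (r + g + b) / 3 > brightness

    def clean_image(in_path, out_path):
        image = Image.open(in_path).convert("RGB")
        pixels = image.load()
        width, height = image.size
        for x in range(width):
            for y in range(height):
                if is_background(pixels[x, y]):
                    pixels[x, y] = (255, 255, 255)  # wipe to plain white
        image.save(out_path)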
So, the first step in actually using this tool that I made, and here's my GitHub link, the Tesseract trainer: you go to the folder, you download a whole bunch of CAPTCHA files, you clean them, you do your thing, and then you label them. So this one reading "4DAU" gets labeled 4DAU.png. I suppose you could ship that step off to Amazon Mechanical Turk, but it's actually not that hard. Watch a few movies; it's kind of relaxing.

The next step is to bring them into this software. Notice it's just being run out of my documents folder; it's just JavaScript, really. This can also be found on the GitHub page for the Tesseract trainer. So you add a collection of files; let's go to ones that I've already labeled. I've already done these, but let's do them again. You open as many files as you want, and then you get these boxes. When you start moving the boxes around, it creates these things called box files. (If my screen were a little wider, this would all be on one line, but it's a bit cut off here.) These box files give the coordinates of each letter. This is a painful yet very necessary step for training optical character recognizers, and I've tried to make it as painless as possible. Let's just make a really terrible box file: you download it, all right, new one, and this just goes through all the images until you run out.

And then you get a nice folder full of box files that looks like this. Oh, whoops, I need to copy them over; they're in the box files folder. This is actually really important: you need to keep a backup of all of your box files before you run the trainer, because it will overwrite and destroy them. So take your box files (you should have PNG, box, PNG, box, PNG, box; that's what it should look like) and put them into the images folder here.

Now we get to the fun part: training. This is what actually uses Tesseract. Notice that there are a lot of steps. You have to clean the images, rename the files, extract the Unicode character set, create the font file, run shape clustering, run MF training... I don't even remember what half these things are, and I spent hours with the documentation. This is a library that was sort of built by academics; it's very difficult to use, and not really designed to be used on CAPTCHAs. But all you have to do is get all your box files and your CAPTCHA files with the appropriate names, named with their solutions, and say python3 (I have both Python 2 and 3 installed) trainer.py. All right, go. And it just does everything for you, which is really nice, because back in the olden days we had to write commands for each box file we had, and then take the results of all of those and put them together in shape files and training files and more files, files, files.

Now for the next step. The language that we're using for this is called "cap". If we wanted to solve any regular text, we'd use the built-in "eng" trained-data file, but we're making our own here, and we need to move it to the Tesseract data (tessdata) directory. So we're just going to do that. Sorry. Oh, no, no, no. Sorry, I'm just trying to copy the file. So we created this file, cap.traineddata, in the directory with all our box files and image files, and now we just move it over. And now Tesseract knows what the language "cap" is and how to run it. So we can do something like this, where we want to get the text from, let's just use a random image from the images-to-be-labeled folder: image228.png. Y4EE. All right, check that one out. Whoops. Hey, it's a Y4EE! Okay, so I did plan that one ahead of time, but it actually does work pretty well.
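Doing that same command-line test from Python is just a short subprocess call. A sketch, assuming the tesseract binary is on your PATH and cap.traineddata has been installed as above; the output base name is arbitrary:

    import subprocess

    def read_captcha(image_path):
        # Run Tesseract with our custom "cap" language; it writes its
        # guess to <outputbase>.txt, here captcha_out.txt
        subprocess.check_call(["tesseract", image_path, "captcha_out",
                               "-l", "cap"])
        with open("captcha_out.txt") as f:
            return f.read().strip()

    print(read_captcha("image228.png"))  # hopefully "Y4EE"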
But remember, our original goal was actually to go to this page and post a comment. So let's check out the code that does that, now that we have everything built. Notice that it uses basically the same code: basically-gray, color above and below, clean image. Remember, we cleaned our images before we trained on them, so when we get the image from the CAPTCHA, we have to clean it again the same way before we can do that comparison. The main function is post_comment. The first thing it does is say, get me a solved CAPTCHA. So it goes to the page where we're posting our comment and gets this code, the code at the end of the image that acts as a seed for the random CAPTCHA. It retrieves the CAPTCHA and downloads it into temp.png. You don't have to; you can work with it in memory, but I prefer to download it because it's easier to debug if something goes wrong. Then it uses Tesseract, the same thing we did from the command line, and gets the solution. It checks whether the solution has four characters, because if it has more or fewer than four, something's probably wrong with it right off the bat. And then, if all these tests pass, it hands the solution to post_comment, which fills in the parameters and just makes a simple POST request for the comment. You might get an error page back, in which case your CAPTCHA solution was probably wrong, and you should recurse a little bit and try again. If you recurse a thousand times, Python will explode, because that's its default recursion limit, but hopefully that won't happen.

So let's try this out. python3... what did I call this? comment_poster. Let's watch it. Is it going to work? It's loading the page. It got four characters, so it's going to see if it can work. The internet might be really slow in here. Oh, success! Okay. So I did that in like ten minutes, and you can too. The training part is a little bit harder, but I will post all the box files and everything, so you can get to this page and do the same thing at home.

So, I mentioned Ajax. That's an important part of the internet today, so I hear. And fortunately, I have a bot that can handle it. There's this page: "This is some content that will appear while the page is loading." Oh, great; now the important text pops up. Of course, if we were just to view this page's source, or if a regular bot were to fetch this page, that placeholder would be the content we'd scrape. We're not really interested in that, right? So what we need is Selenium, and in my slides I have the download link for it. So: pip install selenium, and download PhantomJS, which is a browser that basically runs in the background, so you don't have Firefox popping up and doing things and annoying you. You can just run it from the command line and let it do its thing, and you can execute JavaScript from the command line. So remember that website: this is the text it displays first, and then there's the content we want to retrieve. If this script is doing its job, it should wait for a little bit while the page loads, and then the text we want should pop up. Oh, hey, great. So it didn't grab that first placeholder text. What's going on? I get a new webdriver with Selenium and PhantomJS, the driver gets the page, sleeps for three seconds, and then displays the text. You can also do things like wait for specific elements to load, or just wait for anything on the page to change. You don't have to sleep for three seconds, but that's usually pretty safe.
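A sketch of both variants, the flat sleep and the explicit wait for an element. This uses the Selenium API of the era, where webdriver.PhantomJS() was still a thing; the URL and the element ids are my assumptions about how a demo page like that might be put together:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.PhantomJS()  # headless; no Firefox window popping up
    driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")

    # Option 1: take a nap and hope the Ajax call has finished
    time.sleep(3)

    # Option 2: wait (up to 10 seconds) for a specific element to appear
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "loadedButton")))

    print(driver.find_element_by_id("content").text)
    driver.close()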
So, honeypots. This is what it's all been leading up to. Here's a bot-proof form. Why is it bot-proof? Because I have a couple of honeypots in it. Most people, when they make honeypots, do things like input type="hidden", or even a hidden input field. This one is called "intentionally blank" here, and you can see in the CSS that I moved it 50,000 pixels to the right. These are things that bots would try to fill out or follow. Let's see, I think there's one at the top of the page too. Yeah: display: none. So the style is display: none. But there are a couple of other ones too, that I invented, and it's basically an input field with a couple of divs overlaying it, and those divs have certain special CSS properties that make it so you can't select it and don't notice anything going on. There's one down here: you can't select it, it's invisible to humans but visible to bots. And we can actually check and see whether bots think it's visible to humans. We're using the same Selenium and PhantomJS. You can select elements; find_elements_by_tag_name means exactly what you think it means. And then there's this cool function, is_displayed, and this is actually really good: it can tell whether or not something is visible in the browser. (Though if you have a couple of divs overlaying an element so you can't select it, and the text is the same color as the background, it can sometimes be tricked.)

So let's run this. It's getting the page, and, yeah, there we go: every time a field is not displayed, it tells you. So if you're a bot filling out a form, or a bot navigating from link to link, crawling a site, it's really easy to just check whether an element is displayed before you go to it, especially if you think a form might be full of traps. So it found that the type="hidden" input is a trap; the thing that's 50,000 pixels off the page, that's a trap; the display: none one is definitely a trap. We can't see any of those. The rest are not traps. So that worked out really well.

In the last five minutes I wanted to take a couple of questions, but first, let me run a really cool script that I like. Amazon is pretty protective of its book previews. They're all loaded with JavaScript, they check headers, and the pages are all embedded as images; each image is Ajax-loaded as you click through. But we can use a bot to go grab this text and print it out. This worked a little better on a larger screen, but I'm going to pop up Firefox so you can see it work; you could also do this with PhantomJS. Actually, while this is doing its thing, does anyone have any questions? Yeah?

[Audience question: do you have an idea of what the attacks might be against this new generation of defenses, the behavioral defenses?]

I think a big one is going to be things like this, where you use JavaScript and actually work with a browser. Selenium is really powerful: you can make the mouse move, you can interact with drag-and-drop interfaces, you can send characters one key at a time. You can make the mouse move according to different functions with some jitter added to it. So I think the next generation of attacks is really just going to be emulating the behavioral patterns. (Oh, it broke. I'm sorry about that. It just takes a long time to load; I'll try it again.)

[Audience question: how do you defend against something like this, JavaScript driving a real browser?]

I mean, it's literally using a browser, so the only difference between a bot and a human in this case is behavior: can they do things that a human would be able to do, like solve CAPTCHAs? So, CAPTCHAs and behavioral analysis. And obviously, as you saw, CAPTCHAs might not be that great.

And I just wanted to let you know that everything is available at pythonscraping.com; I'm going to zip it all up and put it there after the talk. And I will be in the DEF CON 23 lounge after this, drinking heavily. Yeah. All right, thank you.
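To recap the honeypot check from that demo: a minimal sketch using the same Selenium-plus-PhantomJS setup, with a placeholder URL standing in for the bot-proof form. is_displayed() is the visibility check discussed above; anything a real browser would not render is treated as a trap:

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get("http://example.com/bot-proof-form")  # placeholder URL

    # Links a real browser would not display: don't follow them
    for link in driver.find_elements_by_tag_name("a"):
        if not link.is_displayed():
            print("Trap link, do not follow:", link.get_attribute("href"))

    # Fields that are hidden or not displayed: don't fill them in
    for field in driver.find_elements_by_tag_name("input"):
        if not field.is_displayed() or field.get_attribute("type") == "hidden":
            print("Trap field, do not fill:", field.get_attribute("name"))

    driver.close()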
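And for the behavioral-emulation answer from the Q&A, a rough sketch of what jittered mouse movement and one-key-at-a-time typing look like with Selenium's ActionChains. The URL and the field name are placeholders, a real browser is assumed (mouse events are more meaningful there than in PhantomJS), and a serious attempt would tune the randomness far more carefully:

    import random
    import time
    from selenium import webdriver
    from selenium.webdriver import ActionChains

    driver = webdriver.Firefox()  # a real browser, for realistic events
    driver.get("http://example.com/form")  # placeholder URL

    # Wander the mouse with a little jitter instead of teleporting
    # straight to the target like a bot would
    body = driver.find_element_by_tag_name("body")
    actions = ActionChains(driver)
    actions.move_to_element_with_offset(body, 100, 100)
    for _ in range(5):
        actions.move_by_offset(random.randint(-10, 10),
                               random.randint(-10, 10))
    actions.perform()

    # Type one character at a time with human-ish pauses, rather than
    # setting the field's value in one shot
    field = driver.find_element_by_name("comment")  # hypothetical field name
    for char in "Totally not a bot. I promise.":
        field.send_keys(char)
        time.sleep(random.uniform(0.05, 0.3))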