So, today's agenda: we're going to review some basic screen scraper theory, not too much, but a little bit. We're going to define what I consider a difficult case, a difficult site to screen scrape. I'm going to demo some screen scraper tricks, and when you look at what I'm doing, you're going to say, "Man, how do you manage something like that?" So I'm actually going to show you some ideas for large-scale deployment of this technique. And we're going to wrap things up with a heartwarming moment involving CAPTCHAs and puppies.

Okay, so what I'm going to show you are some rather unusual, but I think pretty effective, ways to do screen scraping. You're not going to walk out of here with a set of scripts you can use, but I'm hoping you walk out of here with some ideas. What I want to do is expose you to some stuff and get you thinking along some new lines, hopefully.

I'm going to talk about a couple of tools, and I had some criteria for selecting what I was going to talk about today. They had to meet three criteria: number one, they had to be completely customizable, completely hackable; number two, they should be free or open source; and number three, they should be platform independent. Generally, what I do is develop in Windows and then deploy either in Windows or on Ubuntu or FreeBSD.

A little bit about me: I'm Minneapolis-based, and I've been writing bots for about 15 years. There's a picture of my city. Soon to be Las Vegas bound; hopefully I'll be living here in a couple of months. Most of my clients are outside of the US, either European or Asian. I'm wearing my DC612 shirt; anybody here from DC612? I'm a big supporter of local DEF CON groups. Seek them out; they're all over the place, and if you don't have one in your area, there's nothing keeping you from creating one.

Real quick: DEF CON 5 was my first DEF CON. I covered it for Computerworld magazine.
I think I've only missed three DEF CONs since DEF CON 5. At DEF CON 10 I did an introduction to writing spiders and agents. At DEF CON 11 I did one on online corporate intelligence. Two years ago I did one on executable images, and not only do I keep getting email about that, I actually still get phone calls regarding that particular talk. And today we're going to talk about screen scrapers and scraping difficult websites.

Two years ago I had a book come out, and that's my connection with No Starch. It's available in English and Italian; supposedly a Russian copy is coming out, though I haven't seen it yet, and the Chinese version just came out a couple of months ago. I talked about a lot of traditional techniques and strategies in my book, and none of that is obsolete. Basically, the things I'm showing you today are things you supplement those concepts and tools with.

Okay, why are screen scrapers important? It always amazes me: with the internet, we've got the largest collection of information and services ever compiled, it's all digitized, you can access it through a common format, and yet everybody uses the exact same tool, the browser, to access it. Think about it; it doesn't make any sense. It's like every problem's a nail and the solution is always a hammer, right?
Browsers, in order to do the wide variety of things they do, have to be very general purpose, and they make an awful lot of compromises. First off, they're very manual tools, and if you have people involved you're going to get errors, and it's time-consuming, which means it's also expensive to use browsers in a corporate setting. A browser can't make any decisions for you the way a web bot can. Browsers are not proactive; a browser is never going to go off and tell you when something is happening online as it's happening. What I tell people is that you will never, ever be able to excel by using the same tools everybody else is using. If you're doing something online, and the internet is important to you, and you use the exact same tools your competitors are using, you'll never get a competitive advantage. If you're lucky, you'll do as well as your competition.

Okay, a quick review of traditional screen scraping. By that I mean downloading webpages; managing cookies; handling encryption and server redirection; hiding your identity and staying stealthy by using proxies and random timing, trying to look like a person; and emulating form submission.
That's also a very important thing, as is parsing information from the web, and then taking action on the things that you find.

Now, when my book came out I also open-sourced some libraries for writing spiders and bots and that kind of stuff. If you paid to get in and you have a CD, there's the URL where you can get those libraries.

Okay, what constitutes a difficult case? I've noticed that within probably the last two years, either by design or by accident, a lot of webpages have become much, much harder to scrape, and we'll look at some of the reasons.

If you ever go to a travel website and make a query, say you're looking for seats, you get one of these little pages, kind of like that little Northwest thing there, where it says it's waiting and processing your search. Basically, if you made a query that takes a long time to fetch from the database, they give you something to look at, and the page just keeps refreshing until your data is ready, and then it presents your data. This isn't really a difficult case; it's more of an annoyance, and there are ways of handling it, but it's an annoyance for sure.

JavaScript can also be an annoyance, especially when it's used to automatically modify forms after you hit the submit button, which happens very frequently. And again, there's a URL with a form analyzer.
That's kind of handy for this kind of stuff. Basically, if there's a form you want to analyze, you download that page, save it to a local file, change the action to that URL, and fill out the form. Assuming you also have all the JavaScript and everything else you need, it'll redirect you to this web page and do an analysis of the form. It's pretty handy when you're trying to analyze forms.

The thing that's become really prevalent is Ajax, and Ajax isn't just being used for little things anymore. It used to be used for things like changing a record in a database. Ajax is now being used a lot for flowing content onto pages, which is kind of bad web design, because it basically defeats your back button, which is something you never really want to do. But if you go to Expedia or Twitter or a lot of modern web pages, go to the second page of search results, and do a "show source," it won't show anything that's on the second page, because that stuff is all dynamically flowed. So that is a very difficult problem when you're using traditional techniques like curl to download web pages.

Flash is a problem primarily when it's used in navigation. There are techniques for extracting text from Flash, so that's not such a big deal, but when it's used for navigation, that is a difficult thing. Same with dynamic HTML.

The other thing that's really gotten bad is some just bizarre cookie behavior. And again, I'm assuming that the only reason for doing this is to keep people from scraping the pages. I've seen things like JavaScript that assembles itself after the page flows and then writes cookies in certain sequences. I've seen a lot of images that write cookies, so maybe they saw my talk two years ago.
I don't know. Just really strange cookie behavior. I've tried sitting there analyzing network traffic and trying to recreate this stuff, and it's really, really difficult to do.

This is one of my favorites; I've seen it from time to time. Instead of having a name for a form element, you'll get something like a 200-character hash that seems random, which makes it very difficult if you want to emulate a form by looking for the form name. (Yes? Okay. The comment from the audience was that this is done to prevent cross-site scripting techniques.) For me, it's just a pain; it makes it hard to write scrapers. But I'm going to show you how to defeat that.

The problem is that, with all of the browser's limitations, to a large extent we're still tied to the browser, because the browser does handle all this stuff automatically. Sometimes you can fool a server into giving you simpler content. For example, you can spoof your bot's identity and tell the server it's an iPhone or some other mobile device, so you'll possibly get simpler content. But what I found is that it became very important to find a way to emulate a browser while maintaining full capability to do anything I wanted to do.

So here's what I came up with: I started using iMacros. How many people here use iMacros? Show of hands. Okay, good. iMacros is a browser plugin. It's readily available, it meets all my criteria as far as being free and platform independent, and it solves every single difficult case I've mentioned. When I discovered some of these techniques with iMacros, I swear, it was like the gods handing me fire. It was like, oh, thank you, Jesus.
This is wonderful, because it is an incredibly great tool, and you can hack it very successfully to do things well beyond what iMacros was ever designed to do. We'll be demoing some of that stuff here today.

iMacros solves all the difficult cases because it uses an actual browser. Okay? That's important. And I'm going to show you a few additional hacks that turn it into a serious screen scraping tool.

Anyway, like I said, it's just an add-on for Firefox; it's also available for Internet Explorer. Do a quick search on it and download it; it's a real quick and easy install. Once it's installed, you click the little iMacros icon up there, and that brings up the side panel where you can do things like recording. So you hit the record button, enter a URL, and, for example, fill in a form. (This is not my real address, by the way.) Hit the form's save button, then stop the macro. Then you can replay it: you pick the current macro and hit play. I'm going to do a quick demo here to show you how this works.

Let me explain a little of what I'm doing. After doing a lot of talks, I know it's best to keep your demos as simple as possible, so basically what I have here is a copy of Apache running on my laptop. I'm not making any network connections outside of this machine; in fact, I'm not even on a network right now.

So here's what I do. I've got my URL here; I'm already at the page I want to go to. Here's the iMacros icon, so you can turn the panel on and off. I'm just going to come in here, click record, and start filling stuff out. Where's my mouse? There we go. Okay, I have filled in the form; I'm going to hit save. You'll notice stuff happening in the left column. So I've got a result now, and I've got a whole bunch of stuff over here. This is the actual macro code over here.
So if I hit stop, I can choose to save this macro if I want to, or I can just come in here, select the current macro, and say play. And it should redo what I just did. So it loaded the page. I've got excessively long delays in here just for dramatic effect; this would actually happen much quicker.

...What happened? I have absolutely no idea how that happened. Good thing I've got a canned version. There; I'll run the canned version while I'm talking.

So this in itself is very useful. Basically, it's almost like having active bookmarking: not only can you bookmark a page, you can actually go through a series of links. For example, if you have a page on some newspaper that you like, you could set this thing up to click through whatever links it needs to get to your page, and you could save a copy of it to read later. You could run it on a cron, and you can do all kinds of stuff like that. This in itself is somewhat limited, but it opens up a whole set of opportunities, as we will see.

Okay, let's look a little at what's going on here. The really interesting part is line five, where it's basically saying: okay, go to this form, here's the name of the element, and I want you to put this content in there. It does the same thing on lines six, seven, eight, and nine, and then on line ten it essentially hits the save button. And that's what the macro looks like.
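For reference, a recorded macro of that shape looks roughly like the following. The URL, form name, and field names here are invented stand-ins for the demo form; the command spellings come from the iMacros command reference, so verify them against your version:

```
URL GOTO=http://localhost/form.html
TAG POS=1 TYPE=INPUT:TEXT FORM=NAME:contact ATTR=NAME:fname CONTENT=John
TAG POS=1 TYPE=INPUT:TEXT FORM=NAME:contact ATTR=NAME:lname CONTENT=Public
TAG POS=1 TYPE=INPUT:TEXT FORM=NAME:contact ATTR=NAME:email CONTENT=john@example.com
TAG POS=1 TYPE=INPUT:TEXT FORM=NAME:contact ATTR=NAME:city CONTENT=Minneapolis
TAG POS=1 TYPE=INPUT:SUBMIT FORM=NAME:contact ATTR=NAME:save
```

Each TAG line names a form element and supplies its content; the final TAG line presses the submit button.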
It's very, very simple stuff. In cases where you don't have tags to work with, like that difficult case I showed you with the big hash for a form name, you can hit X-Y coordinates instead. And what I've done, if you have varying lengths of content, is basically what you'd do with a sled in a buffer overflow, except instead of a bunch of NOPs, you start clicking lower than where the button is going to be and you keep clicking upward, one button-width at a time, until you hit it. It sounds stupid, but it works really well.

Okay, so here's the way I like to do these. I don't run macros the way I just showed you. What I do is create a template file (I'll show you more of what I mean by that), and then I run a PHP program that converts that template into the actual macro, and then I run the macro.

This is what my template file looks like. Basically, all I've done is replace all of the content with variables, and I like to make each variable stand out, so it's pound, underscore, the name of the variable, underscore, pound. That makes it very easy to do the next step, which is just substitution. So I'll have my PHP program running, I get some data from someplace (we'll talk about where data comes from later), I pull in the template file, which I call a proto file, and I do a whole bunch of string substitutions. I substitute the variables I put into the template with actual data, and then I write out the macro. That's a very typical way for me to do things.

Now, you can use this technique to substitute form field values and URLs; you can change delay times; you can change destination files. Basically, anything you can do with a program, you can change, so it's actually very powerful. Plus, you can create loops. You can change data
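To make the substitution step concrete: the talk does this with a PHP program, but the idea is language-agnostic, so here is a minimal Python sketch. The placeholder style matches the pound-underscore convention above; the field names, file names, and the launch command in the final comment are assumptions.

```python
# Minimal sketch of the template-to-macro step. The talk does this in a
# PHP program; Python is used here only for brevity. Placeholder names,
# field names, and the output file name are hypothetical.

def build_macro(template: str, data: dict) -> str:
    """Replace #_NAME_# placeholders in the proto file with real data."""
    for name, value in data.items():
        # iMacros encodes spaces inside CONTENT= values as <SP>
        template = template.replace("#_%s_#" % name, value.replace(" ", "<SP>"))
    return template

proto = (
    "URL GOTO=http://localhost/form.html\n"
    "TAG POS=1 TYPE=INPUT:TEXT ATTR=NAME:fname CONTENT=#_FIRSTNAME_#\n"
    "TAG POS=1 TYPE=INPUT:TEXT ATTR=NAME:lname CONTENT=#_LASTNAME_#\n"
)

macro = build_macro(proto, {"FIRSTNAME": "John", "LASTNAME": "Q Public"})
print(macro)
# A harvester would now write `macro` out as the .iim file and launch it,
# e.g. from a cron-driven wrapper:  firefox "imacros://run/?m=current.iim"
# (the imacros:// URL scheme is an assumption to check against your docs;
# on Unix, remember to export a DISPLAY first, as noted below).
```

The same substitution trick also covers delay times, destination files, and URLs, since they are all just strings in the proto file.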
sources. You can send status messages to a central server. There's all kinds of stuff you can do.

Then I launch the program from PHP. I guess you can't see it, but basically you pull up Firefox and point it at a particular URL that runs your macro. Generally, I launch these from a cron file or some kind of scheduled task, and I've always had better luck doing this from a bash script that calls the PHP, or, in the case of Windows, from a batch file that calls the PHP program. The other thing that's important to remember: if you're doing this on a flavor of Unix, you want to make sure you designate a display for this thing to show on, or else it's not going to work, and you're going to have a very long afternoon trying to figure out why your stuff isn't running.

Okay, a couple of hints for using iMacros. I always dedicate a browser to iMacros on a particular machine. The reason is that if you're doing this correctly, one of the first things you do in every session is clear out all your cookies, and that makes normal browsing difficult if you're sharing the browser. If you're going to do this seriously, I recommend buying the commercial version of iMacros; they actually have their own browser. If for no other reason, it gives you better version control, so you don't have iMacros constantly telling you you've got an update, or Firefox telling you you have an update. It's nice to test things before they go into production, and that churn makes it a little more difficult. The other thing: iMacros is also available for Internet Explorer.
I don't use that version, because I found it to be a little flaky. Also, if you're going to be launching a macro from a cron, make sure you've got iMacros activated in Firefox before it comes up, or else your macro's not going to run, and again, you're going to have a very long afternoon.

Okay, a couple of things I like to do in my iMacros headers. I set a very long timeout: if you set a timeout of 240, that's 240 seconds, which is four minutes. Especially if you're going through Tor or something where you've got a really slow network connection, time is usually not of the essence, so I just let things run. I generally tell it to ignore all errors, I clear all the cookies, and I close all my tabs; I'll show you the importance of that in just a bit. You also have the option of turning images on and off, and if you can get away with not having images displayed, it's probably to your advantage to turn them off. This is one of those things you can set with the string substitution I showed you when you're writing your macro, and especially if you're going through Tor or some kind of anonymizer, it's nice to turn off your images.

The other thing I do is add this little line. If you're extracting data, iMacros has a way of extracting it so you can put it into spreadsheets and so on, but if you don't put this little line in, you get annoying little pop-ups every time it extracts data, and in an automated system, that just doesn't work.
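Put together, a header in that spirit might look like the following. The command names are taken from the iMacros command reference; verify the exact spellings against your version:

```
' long timeout: slow proxies and Tor connections need the patience
SET !TIMEOUT 240
' don't let one failed command kill an unattended run
SET !ERRORIGNORE YES
' suppress the data-extraction pop-ups that stall automation
SET !EXTRACT_TEST_POPUP NO
' start each session with clean cookies
CLEAR
' work from one known tab
TAB CLOSEALLOTHERS
```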
So I always throw that line in there as well. The commands for iMacros are really simple, and there's a nice command reference online, as well as user forums and all that kind of cool stuff.

Okay, let's look a little at where this data might come from. This is kind of a typical configuration for me. I have my targets off to the left there, and then I have the machine that's running iMacros, which I refer to as my harvester. And then I'll have some kind of central server, which is usually just a website I've customized for doing this kind of stuff. My harvester goes out periodically and polls the central server to ask if it has anything for it to do, and if it does, the server tells it what to do: which websites to use as targets and what data to apply to them. Then my harvester goes out, downloads the page, saves screens, and parses results, and then it sends data back up to the central server. That's very typical of the way I do this kind of stuff.

In a large-scale deployment, you can have lots and lots of harvesters. All of these harvesters run asynchronously; they periodically poll the central server asking for things to do, they go off and hit the targets, download the pages, parse stuff out, and send the data back to the central server. You can manage a whole bunch of harvesters this way; the central server can even do things like software updates for the harvesters. Basically, what this is is a botnet, right?
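The poll, work, report cycle each harvester runs can be sketched like this. The talk's harvesters are PHP scripts on a cron; the central-server URL and the JSON job format here are invented for illustration:

```python
# The harvester's poll -> work -> report cycle, sketched in Python (the
# talk's harvesters are PHP on a cron). The server URL and job format
# are invented; any cheap shared host that can serve JSON would do.
import json
from urllib.request import Request, urlopen

CENTRAL = "http://central.example.com"   # hypothetical $10-a-month website

def parse_job(raw: str) -> dict:
    """Decode the central server's answer to 'anything for me to do?'"""
    job = json.loads(raw)
    # A job names the target site and the data to apply to its forms.
    return {"target": job["target"], "data": job.get("data", {})}

def fetch_job() -> dict:
    with urlopen(CENTRAL + "/job") as resp:
        return parse_job(resp.read().decode())

def report(results: dict) -> None:
    """Push parsed results back up; nothing stays on the harvester."""
    body = json.dumps(results).encode()
    urlopen(Request(CENTRAL + "/report", data=body,
                    headers={"Content-Type": "application/json"}))

# Offline example of the assumed job format:
print(parse_job('{"target": "http://target.example.com/form", '
                '"data": {"FIRSTNAME": "John"}}'))
```

Because the harvester keeps no state of its own, any box that can run this loop and a browser is interchangeable with any other.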
That's essentially what it is: a botnet. And this configuration has really changed my thinking about what the serving requirements are for screen scrapers, because you really don't need a data center anymore. The central server can be some cheap $10-a-month website, and all the harvesters can be $50, ten-year-old PCs that you bought on Craigslist. If you do this well, and they're on separate power networks and using separate connections to the internet (I mean, they can be in separate countries), you've got all kinds of redundancy built in. None of the stuff you're looking at, generally, is anything that isn't already available online, so you don't have security problems. No data is stored locally, so you don't have any backup issues. Hosting becomes really simple.

In fact, one of the things I've been tempted to do: they make these little Linux boxes that look like wall warts; they just plug into the wall, and they've got built-in Wi-Fi. I've thought about taking those to places that have public Wi-Fi, like libraries, putting a little sticker on them that says "Security: please do not remove," and just plugging one in. You've got another harvester. It could happen.

Okay, I'm going to show you some hacks to iMacros now that make it really effective. The first thing I showed you is just a really straightforward iMacros kind of thing, but we need to take it beyond that, which means we need to add a little more programming power. Now, iMacros does have kind of a JavaScript-like programming language of its own, but I don't want to learn anything new.
I'm just going to use the tools I've got and do something where I have complete, total control; and I'm not a big fan of JavaScript to begin with.

iMacros has some limited parsing and data extraction capability, but I haven't found any pre-packaged screen scrapers that do as good a job as a parsing script, because your scripts can be written to be fault tolerant and a lot smarter, and can actually make parsing decisions, as opposed to looking for things at static places or with static names. Without what I'm going to show you right here, iMacros leaves you with a lot of the limitations that browsers have inherently.

So here's my hack. Suppose you could execute iMacros in one browser tab, and then open another browser tab and analyze the screen that you downloaded in the first tab. You could parse the data, you could read and write stuff to a database, you could pass data back to iMacros, you could go open other web pages, you could aggregate information; you could do basically anything you want to do. That's what we're going to do here. I've got another demo.

What we're going to do is, when we get to this point in the first demo, we tack on some extra code that creates a second browser tab. It launches a PHP program that's running locally on this machine in Apache. We parse the web page that we saved in the first tab, return the access code back to the first tab, and then complete our form submission. You can see at the end here, there was this big access code that it gave us, which we needed to plug into this other form field to complete the ordeal. Which looks a lot like a CAPTCHA, right? Okay.
This is kind of like a simplified CAPTCHA application. All right, let's go to demo two.

Okay, so it starts like the first one, and it fills out the form, again with dramatic delays. Okay, it did that. Now it's going to save a copy of this screen. It opens a second tab, goes to that tab, and goes to a local program that parses the data. (It doesn't actually take this long to parse, but it's giving me some time to speak.) Then it goes back to the first tab, pulls in the data source, throws that number into that field, and hits save; that's, again, a dramatic pause. And it takes us to our final screen.

Okay, so you saw what happened there, right? The first tab grabbed a bunch of information, and even though that was a really simple screen, it could have been entirely Ajax-created. It saves that screen and opens a second tab. The second tab runs some custom code that acts on the saved data and returns it back to the first tab, and the first tab completes the process. I'll often have three or four tabs going when I'm doing this kind of thing, and it's really, really powerful.

Let's look a little at what's really going on here. Basically, I tacked just a few more lines of code onto the old macro. The first one saves the screen using the iMacros SAVEAS command. Then I open up a second tab, move over to that tab, and go to a URL; you can use any URL, including local URLs, so I go to a local program that reads the parsed results. Basically, that program reads the results and sends them back in a format that iMacros can read. Then I come back.
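What that local second-tab program does can be sketched like this. The talk's version is a PHP page under Apache; the file names and the regular expression below are assumptions about the demo page:

```python
# Sketch of the "second tab" program: read the screen the macro saved
# with SAVEAS, pull out the access code, and write it as the one-column
# CSV datasource the macro reads back. The talk's version is a local PHP
# page; the file names and the regex are assumptions about the demo page.
import re

def extract_access_code(html: str) -> str:
    match = re.search(r"Access code:\s*([A-Z0-9]+)", html)
    if match is None:
        raise ValueError("no access code found on the saved page")
    return match.group(1)

# Stand-in for the page saved in tab one (a saved HTML file in real use):
saved = "<html><body>Thanks! Access code: X7Q9Z2</body></html>"

code = extract_access_code(saved)
with open("datasource.csv", "w") as f:   # the file the macro's datasource reads
    f.write(code + "\n")
print(code)
```

Because it is a real script rather than a static extraction rule, it can be as fault tolerant and as smart about edge cases as you care to make it.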
I open up tab one again, and I open up what they call a data source, which is basically just a standard comma-separated-value file. It reads that in; I tell it I've only got one column of data, and I show where the loop exists. Since I just have one piece of data, there's really just one iteration in the loop. The content gets replaced with the value in column one, and then I submit the form, and I'm done. That's all there is to it. It's really, really simple to greatly expand the capability of iMacros, and again, you can throw any kind of page at it, with any kind of JavaScript or Flash or whatever, and you can handle it. It's a thing of beauty.

Using additional tabs to run local programs enables advanced features that are not possible with stock iMacros configurations. For example, you can interrupt a macro to do other things, like parse data from the pages that were downloaded. You can act on the results. You can interface with local peripherals; in other words, if you want to send a fax in the middle of an iMacros session, you can do that. You can change your proxy settings if you want. You can aggregate data from multiple web sources or websites. You can also aggregate services.

For example (and I'm making this up on the spot, okay), suppose you have a little business where you buy and sell books. You've got a bot out there looking at books on eBay, it finds something you think might be a good buy, and your web bot has to make a decision about whether or not to buy it. So maybe it goes off to Amazon and looks at the sales ranking, and if it's a good sales ranking at a low price, maybe you've got a little formula you run to decide whether you want to buy the book. That's the kind of thing you can do with this kind of power. You can also upload data to central servers in mid-macro, which is very powerful.
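That made-up book-flipping decision, as a bot might score it, could be as small as this. The rank cutoff and margin are invented numbers; the point is only that a bot, unlike a person at a browser, can run the formula in the middle of a session:

```python
# The made-up eBay/Amazon decision from the example above. The rank
# cutoff and the 30% fee-and-shipping margin are invented numbers.
def should_buy(ebay_price: float, amazon_price: float, sales_rank: int) -> bool:
    # Only consider titles that actually sell on Amazon...
    if sales_rank > 100_000:
        return False
    # ...and that leave a margin after roughly 30% in fees and shipping.
    return ebay_price < amazon_price * 0.70

print(should_buy(4.00, 12.00, 5_000))    # cheap copy of a book that sells
print(should_buy(9.00, 12.00, 5_000))    # margin is too thin
```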
Basically, anything you can do with a computer program, you can now do with iMacros.

Okay, I promised you a heartwarming moment, so here it comes. Since we're talking about CAPTCHAs: Carnegie Mellon is the university that came up with the whole CAPTCHA idea, and maybe you've seen this reCAPTCHA. They're being used everywhere now: on Twitter, on Craigslist, all over the place. Carnegie Mellon figured out that as a population we do a quarter billion CAPTCHAs every day. So what they did is come up with this free service, and it's a decent CAPTCHA service, it really is. Right now they're getting 30 million uses a day from their reCAPTCHA service.

The cool thing is that the words they use, the ones you need to type in, are actually scanned from old documents; right now they're doing old copies of The New York Times. So as people complete the CAPTCHAs, they're actually digitizing these old documents for later use, which is a clever thing to do. Right now, when you do your CAPTCHA on Twitter, you're digitizing old copies of The New York Times. It's a pretty clever idea, it really is.

They have an algorithm where they look at how many times an answer comes up; basically, the answers vote. I don't know how many votes are required to accept a word as legitimate, but if, say, "fishermen" comes up 90% of the time for the first word, it's probably "fishermen." I've heard of hacks where people get a big group together and all type in the same wrong word until the system finally starts to accept it. I don't know if that's true, but I've heard rumors that it works. So that's my guess at how they're doing it; I don't know it for a fact. And they're getting really great results.
So The original documents, especially with these really old ones They're they're printed on with you know lead type and whatnot and the type isn't really great And when they do an OCR translation There are a lot of errors as you can see in the second example But the recapture of transcriptions look really clean so in addition to this the other thing that's happening right now is there are Services that will solve captures for you. In fact, they write actual APIs that you can use which is pretty cool and Unlike the optical character reader solutions that you may or may not have seen at DEF CON over the last couple years these are actually solved by real people and The way it works is not that I would know how this works, but the way I think this works is You take your your image and you send it off to these people and Most of them are based like in either India Bangladesh Pakistan Vietnam and Three individual people are sitting either in an office or in their home with a computer set up to do this And they'll see the image and they'll type in what they think it is You get three results back and what you do then is you write some software that basically votes if two match You can pretty much assume that that's what the real answer is and The accuracy is actually quite high It's it's it's a good deal and it costs about a half a cent to get a capture solved So it's it's a good thing. 
If you want to find out who's doing this, I'd suggest a quick Google search; you'll come up with a number of organizations that do CAPTCHA solving. So here's how it works: the CAPTCHA is displayed on a web page. You take the CAPTCHA image, package it up with their API, and send it off to the service. It's solved, not by a human but by several human beings, and they send you the text back. You enter that into the CAPTCHA box, and the CAPTCHA's solved.

Along with some unintentional consequences, it's basically a feel-good, win-win situation for everybody, because you end up with spammers paying to digitize old documents and promote literacy worldwide. It's true! And people in developing nations have jobs; my understanding is that this kind of work puts a lot of college kids through school in places like Pakistan and Vietnam. It's hard to get numbers on how many people are doing this, but the figure I heard is that CAPTCHA solving is a $20-million-a-year business, at half a cent apiece. But do responsible things with this; I don't want to get spam email from all you guys.

So, in conclusion: we reviewed a little traditional screen scraper theory. We described why it's becoming more difficult to write scrapers. And we saw that iMacros can solve really all the difficult cases: if you do these second- and third-tab hacks on iMacros, it gives you absolute browser emulation capability, and it gives you complete control through these little scripts that execute in other tabs, locally on the harvester machine. We also looked at managing large-scale deployments of this kind of stuff. So I guess I'm done. I think we have a little time for questions, and whatever doesn't get asked here, I'm going to be back in the vendor area, hanging out at the No Starch booth.

[Q&A] The question was: how fast do you get a reply from these CAPTCHA solvers?
My experience is that it typically takes between 45 seconds and a minute. And if you think about what's going on, there's an awful lot happening during those 45 seconds, so that's really not bad time. For most of the bots I run, it doesn't really matter how long they take; not these particular bots, anyway. Other ones are different. Yes?

The question was about websites that don't finish loading, essentially, right? I have seen that, and the way I've gotten around it is to turn off images. Sometimes you've got Flash that just never stops loading; if you turn off images, the Flash goes away. A lot of times you end up waiting for banner ads from slow ad servers, too; that's what really kills me. So if you can get away with turning off images, turn off images, and that problem should go away. Yes?

The question is: have I done any multi-threading with this? No, and I can't think of too many times when I'd really need to. The way I do "multi-threading" is either by adding more hardware or, and I've done some strange things, by having bots running in frames within web pages; that's another way to do multi-threading, where you basically create a whole new instance. But most things online are very procedural, and that's one of the reasons I don't use OOP when I program: the web is very procedural. And I'm an old-school Java programmer converted to a procedural PHP programmer. Yes?

The question is: can you install iMacros through an ActiveX control? I'm sure you can, and this does run on Internet Explorer too, so yeah, I'm sure that's possible. Anybody else? Great; you've been great. I'll catch you in the vendor area. Thank you very much.