So we're talking about dirt-simple data mining. Hopefully it's a good time. A couple of formalities before we get into it. My name's Matthew Thorley. I'm padwasabimasala on Twitter and Gmail, but I'm mthorley on GitHub; still trying to sort that out. If you read my bio, it says that I love Jesus, my wife and kids, and the woods, and that I also write software. On the side I model for Wipe Trash Magazine, so if you'd like a mascot or anything for your site, I'll give you my card later.

I work at Globalbase Technologies with a great team, and I owe them a ton of credit. Most of the stuff I'm going to go over in this talk, a lot of it anyway, I learned from working there. If you code in a hole in a cave somewhere, I feel sorry for you. I've learned so much working with other sharp developers. Here are some of them. Here's another pic of them; that was during Mustache May. We had a little bit of a party, so we're all sporting our 'staches there.

A couple of legal things to go over. This is all my own code. I work for a company that has NDAs and things like that, so I wrote all this stuff from scratch. Also, it's for educational purposes only. You need to read robots.txt files and EULAs and terms of use and all that stuff. If you work for a company and you're doing this for real, you probably have lawyers, teams of lawyers, and every time the EULA changes they go and read it, and all that kind of thing. I don't know anything about that. Get a lawyer. This guy, Pete Warden, you might have heard of. He's a really neat data mining researcher who gets into visualization and things like that. He crawled 210 million public Facebook profiles, made all these great graphs and visualizations, and then said, hey, I'm going to give this public data to the world for people to use and explore and visualize. Facebook sent him a nastygram and said, if you do that, we're going to take your life and kick your puppy. So that didn't go over too well for him. Just some things you want to be thinking about.

Anyway, we're talking about data mining, dirt-simple data mining. So what is data mining anyway? According to this guy here, it's the extraction of hidden predictive information from large databases. That's what data mining really is. I used to work at the Center for High Performance Computing, where they have very large clusters and terabytes of data that they're processing, and they try to make use of that information by finding things that are hidden, patterns in the data, so that they can research medical technologies or whatever. To put it simply, we're not going to do that. What we're doing is a little more like this: we're going to talk about building a data set, and we're going to talk about crawling and parsing. Are there any math geeks here? Okay, math geeks? No math, okay. This is not real data mining. This is really about how you build a data set, or how you consume open services. I said in the talk description that I was going to talk about storage and retrieval, but as I started writing code and putting slides together, that got way outside the scope of what I could cover. So we're going to talk about crawling sites. If you're disappointed, here's a picture of a car in a demolition derby. It's wrecked. It's pretty exciting, maybe not for you. I liked it. So let's talk about why. Why are we going to mine data? It could be for research.
It could be that you want to find out about your customers for your business, or you want to do some research on the people that are following you on Twitter because you want to know a little more about your demographics. You might want to write your own search engine. Or, as somebody mentioned earlier, they work for a medical company, so there's all this information out there about their patients and their staff and things like that, and you want to pull that all together.

A real simple example: my kids like Playmobil, so let's find Playmobil for sale locally. Here in Utah there's a place called KSL Classifieds, and you can find lots of rad stuff for sale online there. Here's an example of a very simple crawler that goes to KSL. It takes all the search terms that you pass in on the command line, it passes them to the URL, it uses two basic regular expressions to get the title and the price, and it just prints them out on standard output. When you run it, it looks like this. You can see it works, you know, kind of okay. We missed the title on the first element there, but we did find a match for Playmobil: a train with railroad tracks that costs $30.
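To give you a feel for what that kind of script looks like, here's a minimal sketch; the KSL URL, the query parameter, and the markup regexes here are assumptions on my part, not the exact ones from the slide.

    #!/usr/bin/env ruby
    # Dirt-simple crawler sketch: fetch a classifieds search page and pull
    # titles and prices out with two regular expressions.
    # The URL format and the markup patterns below are assumptions.
    require 'open-uri'
    require 'cgi'

    terms = ARGV.join(' ')
    url   = "https://www.ksl.com/classifieds/search?keyword=#{CGI.escape(terms)}" # assumed URL
    html  = URI.open(url).read

    titles = html.scan(/class="title"[^>]*>\s*([^<]+)/).flatten    # assumed markup
    prices = html.scan(/\$\s*([\d,]+(?:\.\d{2})?)/).flatten

    titles.zip(prices).each do |title, price|
      puts "#{title.strip} -- $#{price}"
    end

The whole thing is just: build a URL from the command-line terms, fetch it, scan it with two patterns, print.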
For this talk, though, we want to do something a little more advanced, a little more interesting, and we're going to crawl Bebo. Bebo is a site that my employer does not crawl, so I was free to take a stab at it. We want to find people on Bebo. As I mentioned earlier, when David Richards gave his talk he was talking about open systems, and I think this is really what he meant: going to a site, he called it scraping, just pulling pages off and getting information out of them. Maybe you've all done that a little, more or less, just for fun. Here we want to do it at a larger scale. To do that, we need to find out how the site works, so we're just going to throw an email address in the search box and click and see what happens. Here's an email address that I set up. It's just a fake email; I created a fake account, Mike Dragon, he's male, he's 29, and he put his email in there. And this is the page you get. If we examine the URL a little more closely, we can see up here that the search term is mikedragon@globalbase.com, and then there's some other stuff at the end that it turns out isn't necessary. So if you want to search Bebo for an email, you just hit this URL, drop the email in there, and you get a page that looks like this.

Now on this page there's the user's name, Mike Dragon, and we click the profile link so we can see his profile and what he's interested in. And we get a glorious page that says you must have an account. So in order to crawl Bebo, we're going to need something a little more advanced than just a curl script or a simple open-uri call. We need something that's going to log in, pass credentials, save cookies, persist state, and all that kind of stuff. To do that, we're going to use a library called Mechanize. It's a great tool for crawling sites, and it lets you handle cookies and forms and all kinds of neat stuff. So let's take a look at the source. This is the source of the login page that we just saw a minute ago. In the highlighted portion you can see there's a login form. The action points to secure.bebo.com/signin.jsp, and then there are some fields: a password field, a username field, and then this extra input field which is labeled kind of funny. We'll find out more about that in a minute.

Here's how we do this with Mechanize. We create an agent, which is just an instance of the Mechanize class. We set the user agent to Mac Safari, and we follow meta refreshes; that turns out to be pretty important. Yeah? Can you make that lighter? Lighter, I don't think so. How do you invert that? Control-Option-Command-8? Control, Option, Command... why not. If you can do this really fast, we can have a disco. Doot, doot, doot, doot, doot, doot.

So here's a real simple script to log into Bebo. You see, we go and get the Bebo page, and then we look for the form with page.form_with, where the action is equal to, well, there's a typo there; it's supposed to be pointing at the sign-in URL at the top. Then we take that form, we set the email/username field, we set the password, we submit the form, and then we just write out the result locally so that we can look at the file and see what we got. And what we end up with is this: you're still not logged in.
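As a rough sketch, that first attempt looks something like this; the form field names are assumptions on my part, and the slide's actual script will differ.

    require 'mechanize'

    SIGNIN_URL = 'https://secure.bebo.com/signin.jsp'  # the action from the page source

    agent = Mechanize.new
    agent.user_agent_alias    = 'Mac Safari'  # look like a normal browser
    agent.follow_meta_refresh = true          # Bebo bounces you through meta refreshes

    page = agent.get('http://www.bebo.com/')
    form = page.form_with(action: /signin\.jsp/)  # find the login form by its action

    form['email']    = 'mikedragon@globalbase.com'  # field names are assumptions
    form['password'] = 'secret'
    result = form.submit

    # Dump the result locally so we can see what actually came back;
    # in this case it's still the not-logged-in page.
    File.write('login_result.html', result.body)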
This is the kind of thing that happens a lot. When you're consuming these open resources, when you're going and getting at websites, there are little tricks you have to start to pick up on in order to use their service. I'd like to think that they don't do this intentionally, but it could be that they do. Especially in the Ajax world now, there's so much going on with JavaScript and the way sites pass information back and forth, and Mechanize doesn't support all of that. So you need to start reading source and debugging the site, so to speak, to figure out how it really works.

There's another great tool we use for discovering websites, called Tamper Data. Tamper Data is a little Firefox plugin. I don't know if they have it for Chrome or not; I've been using it in Firefox for so long I don't have any reason to switch. What it allows you to do is see what the browser is really doing. It's kind of like a Wireshark for your browser. Anything that your browser does, your script can do, if you can figure out what it's doing. So here it is in Firefox. We just click Tools, click Tamper Data, and we open up a tamper session. This is an example session. What I've done here is I opened Tamper Data, clicked Start Tamper, went to Bebo.com, entered some credentials on the homepage, and clicked submit. While it was running, you can see on the left a listing of all the requests that were passed to the site. When you dig through that stack, you can find the actual POST request that was submitted to the login page, and you can look at all of the parameters that were passed. If you look here, you see that in addition to the username and the password, there's also this extra variable called sign-in being passed, and it doesn't exist like that in the page source. That's something they've done with JavaScript. There is that extra form field in the page source, but somewhere in the sign-in interaction they take that field, change its name and its properties, insert a value into it, and then submit it to their site. Until you open up something like Tamper Data, you're not going to see that happening. So to get around this, what we do is go hit the sign-in URL directly and make a POST to it. We pass in the username and the password, and now we also pass in this extra parameter called sign-in. It's cut off down at the bottom again, you can't see it. Oh, thank you. There you go.

So there it is. We say agent.post to signin.jsp, we pass our email, pass our password, and then pass in this extra bonus parameter that they've given us to discover. And when we do that, this is the page we get. We're logged in now; Mike Dragon has officially made it to his page.

So let's go back and look at this search results page again. Now that we're logged in, when we go to the search results page we ought to be able to click a profile and, if it's public, see its contents. Keep in mind that everything we're going over only covers public information. We're not talking about hacking their site or trying to find secret back doors or some way to jimmy-rig things. Any data that this process turns up is something that people have marked public, either because that's the default setting or because they've chosen to publish that information about themselves. So now, on the search results page, we need to find the profile link. We open up the page source and there's the profile link right there. It's got a member ID and some other junk tacked onto it.

Let's take a look at some code. What we want to do now is write a crawler that logs in, passes an email to the search, gets the results page, extracts the member ID, and then takes us to that person's profile. So let's look at what that looks like. First up, get_member_id takes an email. It goes to the search URL there, passes the email in, and then we use a regular expression to find that person's member ID. You'll notice here that the portion of the source we're matching with the regex in order to get the member ID is different from the one we looked at before. Initially I just searched for Profile.jsp?MemberId= whatever, and it turns out that earlier in the page there's a link to our own profile, the one we're logged in as. So it kept returning this false positive where it always gives you back your own login ID. Those are the kinds of things you've got to work through. You figure that out and it looks really pretty on the screen, but that's like thirty minutes or an hour of debugging, going back and forth: why am I getting the same ID? Something's broken. What did I change? Anyway. So then we get the ID. We've got a method here called public_profile. It takes the ID, goes to the profile URL, and tacks that ID on the end. And then... oh, that's great, blue screen of death. I don't think Macs had those. Bummer. We also check whether the profile is private. If you go to a profile that's private, it's going to say, as you see at the top of the screen, you must be friends with this person to view their profile. So we just put that whole string in the regex, test against it, and return nil if it matches.

Those are the two driver methods. Let's take a look at the whole class. This is the crawler class. We've got our own crawler here that subclasses another crawler. Here's our login method. We've got a little crawl method there, kind of a template method: we log in if we're not logged in, we get the ID and return nil if we don't have it, and then we return a hash as the result that includes the ID and the public profile body. And then there are the two other methods that you've already seen.
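Here's a rough sketch of that crawler class as described; the URLs, the sign-in parameter name, and the regexes are my reconstructions, not the actual ones from the slides.

    require 'mechanize'
    require 'cgi'

    # Sketch of the crawler described above. URLs, the extra sign-in
    # parameter name, and the regexes are assumptions.
    class BeboCrawler
      SIGNIN_URL  = 'https://secure.bebo.com/signin.jsp'
      SEARCH_URL  = 'http://www.bebo.com/Search.jsp?Query='      # assumed
      PROFILE_URL = 'http://www.bebo.com/Profile.jsp?MemberId='  # assumed

      def initialize(email, password)
        @email, @password = email, password
        @agent = Mechanize.new
        @agent.user_agent_alias    = 'Mac Safari'
        @agent.follow_meta_refresh = true
        @logged_in = false
      end

      # Post directly to the sign-in URL, including the extra parameter that
      # the site's JavaScript would normally add (found with Tamper Data).
      def login
        @agent.post(SIGNIN_URL, 'email' => @email, 'password' => @password,
                                'signin' => 'true')   # parameter name assumed
        @logged_in = true
      end

      def crawl(email)
        login unless @logged_in
        id = get_member_id(email) or return nil
        { id: id, public_profile: public_profile(id) }
      end

      def get_member_id(email)
        page = @agent.get(SEARCH_URL + CGI.escape(email))
        # Anchor the match to the search-result markup rather than any old
        # Profile.jsp link, or you keep matching your own profile in the header.
        page.body[/search-result.*?Profile\.jsp\?MemberId=(\d+)/mi, 1]  # assumed markup
      end

      def public_profile(id)
        page = @agent.get(PROFILE_URL + id)
        return nil if page.body =~ /must be friends with this (?:person|member)/i
        page.body
      end
    end

You'd call it with something like BeboCrawler.new(your_login, your_password).crawl('someone@example.com').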
One thing to note at this point is that we need a better agent. Mechanize is great, but it doesn't do everything you might need for debugging. We want something that will also, say, write pages out to a directory so that we can go look at them and see what's actually being downloaded, because a lot of times when you do this kind of thing you get to a place where you're like, man, all I keep getting is the sign-in page, or all I keep getting is this nil page. What's going on? You want to be able to go back and look at each URL that you hit and the result. So we've got a slightly fancier agent here. Down at the bottom of the page, the Agent class just subclasses Mechanize, and then I add behavior with modules. I chose to do it this way because it lets me plug and play what I want the agent to do: in production I would just not include these two modules, but when I'm debugging I want them there. If we wanted a more advanced agent, we could pass in a hash and say turn this option on, turn that option off, do this thing, do that other thing, or set up the agent based on the environment. But I did it the simple way. We've got one module at the top called write pages. It looks a little complicated, but all it really does is take the page that was downloaded and write it out to a directory. If the page had content, it writes it as .html; if the page was empty or we got a nil response, it writes it as .nil. And then we've got another one called log gets that just puts the URL, so when we're running this thing on the console we can see every URL that's being hit, and so on. And so that's the way this works.

So let's take a look at an actual profile page. When we run the crawler, we get a page that actually looks like this. This is an email I had on hand, and this is a public profile. If you go to that URL right now with a Bebo account and you're logged in, you'll be able to see their profile. They've got a band set up here. So this is what a profile looks like. There's lots of neat information on there: links to their photos and their friends, their gender, their hometown, their comments. A lot of times it lists people's interests. You've all been to Facebook or MySpace or whatever; there's just tons of stuff available. In addition to this page, we also want to get their friends page, because we may want to build a network of all the people that they know. So that's an example of the friends page, and here is the code we need to get it. We just update the crawl method here, extend our hash to now also include the friends page, and then we write a little method called get_friends_list that takes the URL, gets the page, checks for a private profile, and returns the page if we got one. I like to write my crawlers this way, where you basically have one method per page. You can sometimes be tricked into thinking it might be more efficient, while you're on this page, to also grab this other thing and wrap it all into one method. You're only shooting yourself in the foot. Sometimes to get one page you have to get other pages first; if you need that kind of thing, what you want to do is cache the result in something like an instance variable, hold onto it, and then use it in another method. But anyway, one method per page. That's the way I like to roll.
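Back on that debugging agent for a second, here's a minimal sketch of the subclass-plus-modules idea; the module names, the pages directory, and the file-naming details are my own.

    require 'mechanize'
    require 'fileutils'

    # Dump every downloaded page to a directory so you can inspect what the
    # crawler actually got. Directory and naming scheme are assumptions.
    module WritePages
      def get(*args)
        page = super
        FileUtils.mkdir_p('pages')
        name = File.join('pages', Time.now.strftime('%H%M%S%L'))
        if page && page.body && !page.body.empty?
          File.write("#{name}.html", page.body)
        else
          File.write("#{name}.nil", '')   # record that we got nothing back
        end
        page
      end
    end

    # Echo every URL we hit to the console.
    module LogGets
      def get(*args)
        puts "GET #{args.first}"
        super
      end
    end

    class Agent < Mechanize
      include WritePages   # leave these in while debugging,
      include LogGets      # comment them out in production
    end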
Let's go back to the profile here and take a look at it. Once we have the profile page, we want to parse it. And when it comes to parsing HTML, I really think that in most cases regex is your friend. You can use things like Hpricot or Nokogiri, and I do have one example of that; they're neat for certain things, but hands down I've found that just regexing a page is the quickest, most efficient way to find what you're looking for. Your parser also needs to be flexible, because, as some of the earlier speakers mentioned, pages change. Facebook comes out with a new layout for their pages and they're not going to tell you about it. They don't send you an email saying, hey, by the way, in three weeks we're going to roll out a new format and it looks like this, so all the developers scraping us can get ready. Nah, it doesn't work that way. You wake up in the morning and all of your crawlers don't run anymore. Somebody mentioned it being really important to have tests for this stuff. I failed in this example, there are no tests, but it's really true. We have a whole suite of tests at work that we run so we can see whether we're actually able to crawl and extract information from a page, and also see what information we're getting now, what we were getting before, what we're not getting anymore, and so on.

So let's write a flexible parser parent class. These are the two driver methods. The way we're going to write our parser is to have one parse method per item we want to extract. We have a method here called parsing_methods that finds every method that begins with parse underscore. It's like convention over configuration. In our actual parser class we write methods that look like this, where we just say parse_age and it returns a result. Our class will be full of these things, pages and pages of them; I've only done four in this example. Going back, we have a parse method which, when it's called, gets all the parsing methods. For each one it takes the key, which is everything after parse underscore, like age for example, throws it in a hash, and uses send to call that method. So it gives us a way to write a class where we say parse_age, parse_name, parse_birthday, parse_interests, parse whatever we want, use a regex or Nokogiri or whatever we need to extract the information, and work through parsing a page really simply.

So here's an example of the whole parser class. We've included Nokogiri here. We pass it some page content, and we also create a Nokogiri document from that content, so we can use either regexes on the raw content or Nokogiri to walk through the document. And then you see extract; that's just a little helper method, because we're going to do content.scan(regex).flatten.first a whole lot, and that makes it a little simpler. So here's an example of the profile parser. We've got parse_last_active, which just takes this regex and returns the date they were last active. Parse gender, again, another regex. Parse age, same thing. And then parse hometown, which was actually nicely tucked away in an li for us with the class hometown, so it was pretty easy to pick that up using Nokogiri and CSS. If you're familiar with CSS, this is just a selector that says: get the li element that has the class hometown, get the text of that element, and strip off all the whitespace, because there's a ton of whitespace.
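Here's a sketch of that parser setup, including the little command-line helper at the bottom; the parse_ convention is the one described above, but the specific regexes and the li.hometown selector are assumptions on my part.

    require 'nokogiri'

    # Convention-over-configuration parser sketch: every parse_<thing> method
    # contributes one key to the result hash. Regexes and selectors are assumed.
    class Parser
      def initialize(content)
        @content = content
        @doc     = Nokogiri::HTML(content)   # for CSS selectors when regex is awkward
      end

      def parsing_methods
        methods.map(&:to_s).grep(/\Aparse_/)
      end

      def parse
        parsing_methods.each_with_object({}) do |meth, result|
          key = meth.sub('parse_', '')       # "parse_age" => "age"
          result[key] = send(meth)
        end
      end

      private

      # We do content.scan(regex).flatten.first constantly, so wrap it up.
      def extract(regex)
        @content.scan(regex).flatten.first
      end
    end

    class ProfileParser < Parser
      def parse_age
        extract(/Age:\s*<[^>]*>\s*(\d+)/)              # assumed markup
      end

      def parse_gender
        extract(/Gender:\s*<[^>]*>\s*(Male|Female)/i)  # assumed markup
      end

      def parse_hometown
        node = @doc.at_css('li.hometown')
        node && node.text.strip
      end
    end

    # Little command-line helper for testing against saved pages:
    #   ruby profile_parser.rb pages/some_saved_page.html
    if __FILE__ == $0
      puts ProfileParser.new(File.read(ARGV.first)).parse.inspect
    end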
And then at the bottom there's just a little helper. What it does is create an instance of the parser, pass in the content of whatever the first argument on the command line was, and parse it. Little things like that are very helpful when you're testing, because you want to be able to run this thing against a whole set of profiles that you've downloaded to see what you're getting and what you're not getting, whether things are working, and so on. So here's an example. When we run that profile parser, we just run it with Ruby, with lib on the include path and the parser file, pass it one of the saved pages, and here's the result we get. This person didn't have an age, their hometown is the United States, there was no last-active date, and their gender is male.

All right. This is the parser for the friends page. It's a lot simpler. All it does is get an array of every member ID that matches this pattern. The way I figured this out is I went and looked at the source of the friends page, opened it up in Chrome or whatever, and just started searching through, looking for member IDs: trying to find my member ID, trying to find their member ID, and then trying to find the pattern in the page where it's just their friends' member IDs. It's a lot of cat and mouse. It's really kind of fun if you like puzzle games and things like that. If you don't like puzzle games and things like that, you probably don't want to write crawlers and parsers.

Last but not least, we need to put this all together. What we have now is an agent that our crawler uses; our crawler logs in, goes to the search results, goes to the profile page, goes to the friends page, and then we want to pass those results to our two parsers and get a result at the end. That's what we're doing here. It's a very simple little script. We just instantiate our crawler and get the result. With the result, we check it: we get the ID if we had it, and then we say if there's a public profile, parse that; if there's a friends list, parse that; print the output if we had some, or just print no result. When we run this on the command line, we just say run, pass it an email address, which is blurred out here, and it goes to signin.jsp, goes to the search results, gets the profile, gets the friends list, runs the parsers, and then returns this hash. Again, their age is nil, they're from the United States, here's the friends list, here's their ID, and so on. I had hoped at this point to have some slides where we put this in MongoDB and do some different things with it, but I just didn't get that far. So that's basically how the whole thing works. All of this code is online; I'll show you the link in a minute so you can go browse it. You know, I got way behind the screen down here for a second. A couple of caveats you'll want to consider. Sometimes you get pages like this: you start crawling a site, you hit them a thousand times in less than a minute, and they say, what are you doing? There are lots of little ways you can get around it. Some people hinted at it earlier: you can use proxies, you can throttle your crawling, you can use different accounts. A lot of sites have different ways to detect who you are and why you keep hitting their pages and what the heck you're doing. Google is by far the best that we've seen. Most of the time, though, there's a way to work around it. Sometimes it just comes down to slowing your crawl.
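A minimal sketch of that kind of throttle, done in the same mix-in style as the debugging modules; the one-second delay is an arbitrary number I picked, not a recommendation from the talk.

    require 'mechanize'

    # Politeness throttle: make sure at least DELAY seconds pass between
    # requests. The delay value is arbitrary; tune it to the site.
    module Throttle
      DELAY = 1.0

      def get(*args)
        @last_get ||= Time.at(0)
        wait = DELAY - (Time.now - @last_get)
        sleep(wait) if wait > 0
        @last_get = Time.now
        super
      end
    end

    class PoliteAgent < Mechanize
      include Throttle
    end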
If you want to get a lot of information from any site, for whatever reason, kind of be considerate. You don't want to run a denial-of-service attack on them. You can judge by the size of the site and the traffic they get roughly how frequently you can crawl them. Some sites, LinkedIn for example, will actually throttle you. Amazon's another good example. If you start pushing really hard and trying to get links too fast, they'll just put in a huge delay, so pretty soon you're getting one link every half a second, and then it's every second, and then every two, and then every four, and then every eight, and then you're waiting twelve seconds a request. If you'd minded your own business and stayed at half a second per request, they wouldn't have throttled you. So that's something to consider. Another example is that they'll just block your IP. They'll say, hey, you can't use the search anymore, we don't like what you're doing, and then you've got to find ways to work around that. You've got to be creative. Again, legality is a concern. If you're going to a page that's blocked in their robots.txt, or if you're using your account in a way that violates the terms of service, you've got to work that out yourself. If you have rights to the information, if it's your information, you've got to figure that out. I can't tell you what to do. But anyway, the code is on GitHub under mthorley. You can get it, and if you get a Bebo account for yourself, you can plug it into the login information and go check your own email, or your friend's email, or the email of everybody that follows you on Twitter, or whatever you want to do. That's it. Any questions? Ready, go. Did I go too fast? Go ahead.

The logins. The ones I've struggled with are logging in when they have some other app, and it's got the security, those web forms. Microsoft does a lot of it, and I've been unable to get past that. I know you did the post; I'm just curious if you could talk about that a little bit more. Yeah, logins are tricky. Every site's different. Some sites do it plain and simple, where you just get the form, fill it out, post it, and you're logged in. Other sites like Bebo make it a little trickier, where there are hidden variables. There was one site in particular, I didn't write the crawler for it, but one of the sharp fellows I worked with did, and he ended up having to open up Wireshark, sniff the packets as they were coming through, and then in his crawler he opened a socket and passed a stream to their server to log in. It's that complicated sometimes. But again, anything the browser is doing, you can do. You might just have to bring it down to the packet level and send the right stream of data to tell them, hey, I'm legit, you can give me credentials. Question in the back? For those ones you often have to find the view state, they call it; it gets set in the JavaScript, and you have to pattern-match, find that view state, and pass it along with the post. It's usually the old Microsoft ASP stuff; for those specific ones, that's one of the main points. Okay, so you're saying on Microsoft ASP sites, you look for the view state and that'll give you clues to how you can get logged in. Yeah, right there. You can find that, I guess, with Tamper Data. I didn't know about Tamper Data, but that'd be a good use of it, finding the view state. Yeah, Tamper Data is a great tool.
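On that view-state point, here's a rough sketch of what the dance usually looks like; the hidden-field names are the conventional ASP.NET ones, and the URL and credential field names are placeholders, not anything from the talk.

    require 'mechanize'

    # Typical ASP.NET-style login: pull the hidden __VIEWSTATE (and friends)
    # out of the login page, then post them back along with your credentials.
    # The URL and the credential field names below are placeholders.
    agent = Mechanize.new
    page  = agent.get('https://example.com/Login.aspx')

    viewstate  = page.at('input[name="__VIEWSTATE"]')['value']
    validation = page.at('input[name="__EVENTVALIDATION"]')   # not always present

    params = {
      '__VIEWSTATE'       => viewstate,
      '__EVENTVALIDATION' => (validation['value'] if validation),
      'UserName'          => 'me@example.com',
      'Password'          => 'secret'
    }.compact

    agent.post('https://example.com/Login.aspx', params)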
I mean, usually we start with the easiest thing first: you just go to the site, fill out the form, and see if it works. If that doesn't work, then you crack open Tamper Data and see what's going on. I did get fooled when I was working on this. Firefox, and I think Chrome also, has a site verification system, and all of that shows up in the tamper log, so it's going off and getting this secure cert and doing this encryption thing, and I was duped; I thought Bebo was doing that. So I spent a lot of time futzing around, wasting time, trying to deal with all of this encryption and authorization junk, and when I just disabled that in Firefox, it all went away. I was able to see what was going on and use Tamper Data to get through the whole login flow.

Okay. Is there any reason you use Mechanize instead of something a little higher-level like WebRat or Capybara? You know, when we first started this back in the day, WebRat and Capybara weren't around, and so that's why we use Mechanize. We used Mechanize before 1.0. It worked great for us. One of the guys has played around with WebRat, and we did use it in some tests for something. The way it is right now is it's what we know. We know Mechanize, it works really well for us, it's really solid and stable, and it does everything we need it to do. But your mileage may vary. Something like WebRat or Capybara might be better for what you're doing, or if you're just getting into this, definitely look at those and try them out and see if they're going to do what you want.

Yeah. I think we need good resources for the data mining and analysis portion of this too, because I don't know anything about that; is there any introductory data mining and analysis stuff out there? Yeah, there is. I meant to include a link for that and didn't; I'll put it in a readme in this git repo for you guys. But if you Google for data mining, it's like the fifth result. This guy's got a great page, and basically he goes through all of these different kinds of data mining techniques you can use, including Bayesian filtering and other things of that sort. It's really math intensive. I suck at math, so you're not going to see me doing very much of that, but if you're good at math or you really want to learn, you can go get those, and there are a lot of great tutorials there on how to get started and make this stuff work for you. Do you know the guy's name by any chance? I don't. Is it called Statistical Data Mining Tutorials? I think that's it, yeah. By Andrew Moore. Yeah, that's right, Andrew Moore, very good. Thanks, somebody's already Googling.

A couple more questions; you had your hand up? Did I answer your question? Right there. Has anybody automated the login to ADP timekeeping? Say what? Automated the login to where? ADP. Yes. Good, I'm glad somebody had. I don't know how widely we've released the gem, but there's a gem called Mortland that fills in ADP at regular time intervals. Right, right. Any more questions? All right, great, thanks very much.