Are we good to go? Well, welcome to this session on the introduction to writing spiders and web agents. My challenge today is that I've got, what, 45, 50 minutes to introduce the idea of writing agents and spiders, so I've got three main goals I want to accomplish in this session.

One of them is that I want to convince you that writing these things is a worthwhile activity. Even if you don't do it professionally (and there are a lot of opportunities to do this kind of stuff professionally), it's just a really fun thing to do. I know a lot of people who maybe don't code as much as they'd like at work, and they get home and they've got this need to code, so they do silly things like build a database that catalogs their CDs or something. You can have a lot more fun writing spiders and web agents, because once these things start moving they become kind of organic, especially when they interface with very non-web kinds of things like cell phones and printers and fax machines. It can get very fun.

The second goal is that I want to get you thinking about ideas for spiders and web agents that you want to go off and design on your own, so I'm going to give some suggestions for ways of coming up with ideas. It's easy to write a spider or an agent; the hard part is having a really great one that does something really cool. So hopefully I can give you some ideas on how to come up with ideas.

The third thing I'm going to do is provide you with some basic technical skills, and basic technical skills are pretty much all you need to do these things. There's so much great software available now that you can download to do wonderful things, and there are some great programming languages, so a lot of the mystery has been taken out of this; a lot of it is just being introduced to how this stuff is done. I'm going to make some suggestions for programming languages I've used in the past, some downloading techniques, some parsing methods (things I've discovered over the last three or four years), and ways of making fault-tolerant designs. Keep in mind that unless you're spidering your own site, your own server, or a known server, these things are going to change at the whim of whoever the webmaster is, so I'll give you some ideas for writing fault-tolerant spiders and web agents.

Real quick bio: my name is Mike Schrenk. I'm a freelance internet strategist and developer. I'm based in Minneapolis, but I work all over the place. The nice thing about being based in Minneapolis is that you're only two, two and a half hours from anywhere in the country. The bad thing about being in Minneapolis is that you're two, two and a half hours from anything in the country. Spiders and agents are a large part of my business. They're not the only thing I do, but I'm doing more and more of them as time goes on. People are seeing that you can do some really cool things with them. For example, I have one client that's basing their entire business on what I'm doing, and I've got others who see that they can get a real competitive advantage over their competitors by knowing things before anybody else knows them and then acting on them really, really quickly. So there are some really powerful things you can do here that make a lot of business sense, in addition to being a lot of fun to write.
Okay, what's the connection between spiders and agents and hacking? The way I do it, it's the classic hacking activity, if you belong to the MIT school of hacking: you break down technology, you grab something, you look at it, you figure out what it does and how it works, and then you recombine technology to do something that was never intended by the original designers. It's kind of the MacGyver approach to web development: you're taking things from all different places and doing really bizarre things with them. You can also do some useful security things: you can monitor logs, you can look for unauthorized servers on a network, you can test system integrity, all kinds of cool stuff like that.

Okay, what is an agent? Well, you can think of a browser as a very general-purpose agent, general-purpose in the sense that you just point it someplace and it goes there. If that's all you do on the internet, you're really missing out on a lot of stuff, and I'll show you some examples here in a bit. All agents work on a client-server relationship. So in this case, a browser sends out a request to some web server and gets a response back. Generally the initial response is some HTTP headers (we'll talk more about those later) and also some HTML. Now, a client must ask for everything. In other words, the old DEF CON page had nine separate images on it in addition to the HTML, so after the browser receives the HTML page, it's going to send out requests for those images, and all those images come back.

What do spiders and agents do? Like I said earlier, this is not difficult stuff. They download data, they parse and filter data, and then they take some kind of action. Pretty much all the examples I'm going to show you do exactly the same thing; what they download and what kind of action they take is entirely up to you. (I'll show a bare-bones sketch of this pattern below.)

When you're thinking about agents and spiders, don't limit yourself to spiders. Spiders are a special-purpose agent that acts specifically on links: they grab a web page, look for information on that page, gather up all the links, and then do the same thing on each of those links. So don't limit yourself to spidering; that's only a tiny, tiny portion of what you can do with these things. Try instead to think about providing web services. You can still employ spiders in your services if you want to, but I think web services give you a better framework to stand on when you're thinking about what kinds of things you might want to do.

Why write a spider or an agent? Like I said earlier, browsers are kind of dumb: you point them in one direction, they go off and do it, and then they stop and wait for the next instruction. With an agent or a spider, you can automatically surf the web, which is very nice. You can also filter out information you're not interested in. You can download and repackage information for later use on non-web devices; in other words, you could write a little agent that goes off to your favorite web page, downloads it, does clipping, and stores everything in a format that would be good for a PDA, or you can send stuff off to printers or fax machines. The other reason to write agents and spiders is that, especially if you're not interested in images, you can download stuff really, really quickly. So speed is a big, big advantage.
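Here's a bare-bones sketch of that download, parse, take-action pattern in PHP. The URL, the keyword, and the email address are all placeholders, not a real service, and it assumes a PHP build that allows URLs in the file functions:

    <?php
    // 1. Download: pull the whole page into a string.
    $page = file_get_contents("http://www.example.com/news.html");

    // 2. Parse and filter: a simple case-insensitive string comparison.
    if (stripos($page, "keyword of interest") !== false) {
        // 3. Take action: in this case, send yourself an email.
        mail("you@example.com", "Agent alert",
             "The keyword showed up at http://www.example.com/news.html");
    }
    ?>

Every example that follows is some variation on those three steps; only the data sources and the actions change.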
Okay, where do project ideas come from? Think about areas where you can bridge things. I've got a bunch of stuff up there, everything from HTTP servers to phones to webcams to printers, instant messaging, fax machines. Think of instances where you would want to combine those types of things.

Another thing to think about when you're looking for ideas for spiders and agents is things you can automate. For example, there's a website called www.whois.net, run by Network Solutions, and the really cool thing about this page comes up when you want to take out a domain name. Instead of going to register.com or something and just typing in a name (oh, taken; how about this one? no, taken; not very effective), with whois.net you can type in a word or a series of characters and it'll return all the domain names with that series of characters as a subset of the full name. So if you want to do a website on food, for example, you type in "food" and it returns all the domain names that have the word food in them. It's pretty useful. The thing that makes it not quite so useful is that it only returns, I think, 10 or 15 names at a time, and as you can imagine, you're clicking next, next, next, and there are thousands of these things. So one little spider I wrote goes out and spiders this page: I just type in the word I'm looking for as a subset of a domain name, and it returns all of them much, much faster. (There's a rough sketch of that kind of paging loop below.)

Another thing to think about when you want to automate stuff is things you can perform periodically. I talked a little earlier about doing things with PDAs. Let's say, for example, you work for a corporation that has people in the field all over the country. You could set it up so that at night the field representatives' computers go out and download some information from your corporate website, and an agent transforms that stuff and puts it where it needs to go, so that when the salesperson, or whoever it is, does a hot sync in the morning, they've got all that information on their PDA. That's an example of something you'd want to perform periodically.

Okay, let's look at some ideas here. Automatically send an email to your mother every time there's a story about you on CNN. That may or may not be a good idea; I guess it depends on your mom. So what would you do? You'd download pages from CNN, you'd parse them looking for stories about you (and that's really just a matter of a simple string comparison; it's not hard), and then the action is you'd email your mom, either a link to the story or the actual story.

Okay, create a service that provides real-time weather data. In fact, I know several people who are doing this; it's actually a useful thing. Download pages from the National Weather Service, parse out weather information and forecasts, and then make the stuff available via SOAP or some other web protocol, so it's basically a service sitting there that people can access.
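That whois.net spider is mostly a loop over result pages. Here's a rough sketch of the idea in PHP; the query URL and its parameters are hypothetical (you'd have to read the real form variables out of the site's HTML), and the domain-matching pattern is deliberately crude:

    <?php
    // Hypothetical paged query: keep fetching result pages until one
    // comes back with no matching domain names in it.
    $all = array();
    for ($page = 1; $page <= 500; $page++) {
        $html = file_get_contents(
            "http://www.whois.net/results.html?key=food&page=" . $page);

        // Grab anything that looks like a domain name containing "food".
        preg_match_all('/[a-z0-9.-]*food[a-z0-9.-]*\.(?:com|net|org)/i',
                       $html, $matches);
        if (count($matches[0]) == 0) {
            break;  // no more results, stop paging
        }
        $all = array_merge($all, $matches[0]);
    }
    print_r(array_unique($all));
    ?>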
Track the web technology used by each of the Fortune 500 companies. Okay, how would you do this? You'd download the HTTP headers from a set of web pages you have stashed away in a database, you'd parse out the server information, and then you'd store the stuff in a database, run comparisons, and do whatever you want with it.

You guys remember the cartoon? There were two dogs, and one of them said to the other, "The nice thing about the internet is no one knows you're a dog." That's not always the case: I actually met my dog on the internet. There's Jackson right now; he's showing the proper technique for standing on a 17. If he knew I was talking about him, he'd kill me. But no, I really did meet my dog on the internet. A lot of animal shelters and rescue groups have websites where they post dogs that are available for adoption. It's kind of like a canine dating service, and that's where I met my dog. That made me think: if you wanted to, you could periodically run a spider to check for a Jack Russell, or any other kind of dog, that's available at your local shelter. And again, it's a simple download, parse, filter, take action: you download pages from your local animal shelters and dog rescue organizations, you filter out information about Jack Russell terriers, and then you send yourself an email when a dog is found.

Number five: scout all the top employment websites for jobs that match your skills, and automatically send the contact person a resume. Good idea? I don't know; I don't know if I'd do this one, but it's the kind of thing you can do. Again: you download pages from employment websites, you parse and filter for jobs that fit your skills and gather employer email addresses, and the action you take is to automatically send your resume to appropriate employers.

Number six: write a web agent that returns the best price on a dozen roses delivered in Chicago. This is actually something you could sell if you could do a service like this; in fact, there are some like this out there. MySimon comes to mind, and I think CNET does something like this for computer hardware. So what you'd do, again, is download flower delivery pages, check for the best prices in Chicago or wherever you're looking, and then print the names of suppliers, offerings, and prices. Okay, you're starting to see some similarity here: these are vastly different ideas, but the process is very, very similar.

Number seven: put a camera in your medicine cabinet, and when the door opens, take a digital picture and post it on a website. www.medicinecabinet.com, I don't know... this might also be an idea for a business, I'm not sure. Here you'd basically have some kind of trigger mechanism, and you can interface any kind of hardware you want to an agent, anything, even a medicine cabinet. So you'd have some kind of switch; when the door opens, bam, a picture's taken. You really don't have to parse anything, and it's pretty easy to post the stuff on a website.

Okay, automatically analyze server logs for anomalies and send them to a printer when found. Here, instead of going out on the internet, you could load a local file, a file on your own network, or I suppose even a log file you find on somebody else's network. You'd parse out anomalies and then print out the things that don't look right so you can look at them later.
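A minimal sketch of that log-scanning idea in PHP, assuming an Apache-style access log; the path and the "anomaly" patterns here are just examples, and printing to the screen stands in for sending the hits to a printer:

    <?php
    // Scan a local web server log and collect lines that look suspicious.
    $suspects = array();
    $log = fopen("/var/log/apache/access.log", "r");  // example path
    while (!feof($log)) {
        $line = fgets($log, 4096);
        // Treat 404s and anything poking at cgi-bin as "anomalies".
        if (strpos($line, " 404 ") !== false ||
            stripos($line, "cgi-bin") !== false) {
            $suspects[] = $line;
        }
    }
    fclose($log);
    foreach ($suspects as $s) {
        echo $s;  // stand-in for the "send to printer" action
    }
    ?>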
So, an email-to-fax service. This is something I actually did. I started writing spiders and agents about four years ago, when I was a consultant over at Medtronic and was involved with a lot of telemedicine kinds of stuff, and this is what I built: if you sent an email to a specific email address, it would take the contents of the body and fax it to the number you put in the subject line. How do you do that? You go out and get the emails sent to a specific address, you parse out the message and the phone number, and then you fax the message to that phone number. It's embarrassingly easy.

This next idea I love: write a new front end to somebody else's online catalog. The idea is not as strange as it sounds. For example, if you have a client that receives a royalty on things sold on another website, they might want to make the whole thing look like it's theirs, right? And if they have a relationship with this other company, they have every right to do that. So this is not such a far-fetched thing to do. Here you'd gather the contents of a form that would normally appear on this other website, and you'd gather the variables, which might come from databases or cookies or whatever. You make sure your variables and methods match the form on your target site, and then you basically submit the stuff to the target site.

This last idea I like, but I'm hesitant to do it: start a digital clipping service that databases information and images about both your client and their competition by spidering web, FTP, and news servers, along with DNS records and new domain name applications. I'll tell you why I'm not doing this a little bit later. But basically, you download company data from various sources, you parse out everything that's applicable to your client or their competitor (I suppose if you were really smart about this, both the client and the competitor would be your clients, right? save some work), and then the action you take is to post all this stuff to a central database with a centralized web server so they can go and access their data.

Okay, those are some ideas. You see they all have a very similar process; what they do, though, can be very, very different.

I've written agents in a number of languages. I first started doing things in Java, back when I was doing the telemedicine things, and I was working primarily in Java because I wanted to have a graphical user interface; that was a nice way of doing it. I did some things in Perl, which is very powerful for this kind of stuff. I did some stuff in Tcl/Tk; in fact, I wrote an article about writing intelligent web agents in Tcl/Tk in the March 2000 edition of Web Techniques, which I think has now changed its name to New Architect. If you're interested, I'll have a reference to it, along with all the URLs I'm using, at the end of the presentation. But for about the last two and a half years, I've been writing almost everything in PHP. I love the language. I've never had to debug the interpreter, which is a very nice thing, and it just runs so reliably. It's very similar to Perl, but I believe the syntax is much simpler.

So if you're going to download files with PHP, how do you do it? Well, if you're doing a local file, you use the method I'm showing in number one there: basically, you just create a file handle to whatever it is you want to access. If you want to do a web page, that's number two, and it's the same thing: you just open a link to a URL, and then you've got a handle that you use to reference that URL. PHP also allows you to do FTP if you need it.
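A minimal sketch of those file-handle approaches; the paths and URLs are examples, and reading from URLs assumes PHP's URL file-access support (allow_url_fopen) is enabled:

    <?php
    // (1) A local file: fopen() returns a handle you read from.
    $fh = fopen("/tmp/data.txt", "r");
    while (!feof($fh)) {
        echo fgets($fh, 1024);
    }
    fclose($fh);

    // (2) A web page: exactly the same calls, just hand fopen() a URL.
    $fh = fopen("http://www.example.com/", "r");
    $page = "";
    while (!feof($fh)) {
        $page .= fgets($fh, 1024);
    }
    fclose($fh);

    // (3) FTP works the same way:
    //     $fh = fopen("ftp://user:password@ftp.example.com/file.txt", "r");
    ?>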
Now, I think I told you I wasn't going to show you any code. My intent was not to show you any code, but here's some code anyway; actually, here's a partial listing. If you're interested in looking at it, I'm not going to walk through it here, but it's on the CD-ROM, I believe. Basically, it's an example of doing socket programming in PHP, and this particular example goes out and performs a whois lookup. (A rough sketch of this kind of socket code appears below.) I had a client I was talking with about doing some stuff like this, and he told me that the people who run the whois servers watch how many hits they get from particular IP addresses, and they only allow so many in a 24-hour period. Anybody else hear about this? Do you know what the number is? Thousands? Good.

Okay, launching your agent. I tend to do everything from a command line, even though I write this stuff in PHP. A lot of people don't write PHP programs this way; they'll write Perl programs that they launch from a command line, but you can also do it with PHP, and doing it this way gives you some advantages. For one, you don't need to run a web server. If you're doing stuff in PHP as a module with Apache or something... I tend not to like doing that. I might be fooling myself, but I feel safer not having a server on the system. So there's a standalone version of PHP, which you can see in the first line here. PHP also has a separate command-line version, which you should really use if you're going to do this, because in addition to dropping a lot of the webby kind of stuff you'd normally have, you can have separate configuration files and all kinds of cool stuff. It's also a smaller footprint, I believe.

You can run these things as a scheduled service, so they run every hour, or every six hours, or at midnight. You can have them run on startup when people log in. You can also have them loop continuously, which is a fine thing to do; in fact, you can have lots of these things running continuously on the same machine and get some incredible efficiencies that way, because most of these spiders and web agents just send out a request someplace and then spend almost all their time waiting for a reply. So if you have a need to run them simultaneously, it's a worthwhile thing.

Okay, I showed you how to download stuff using PHP, but I tend not to do that. What I use instead for downloading stuff is a really great program called curl. Can I see how many people here are familiar with curl? Excellent, excellent. Curl is a command-line tool for getting and sending files using URL syntax. If you're interested in getting a copy, there are a couple of places where you can get it. I was thinking about putting it on the CD-ROM, but they've got binaries you can download for something like 50 different platforms, and I didn't see the need to do that. Again, these URLs will appear at the end of the presentation too. Curl supports a ton of internet protocols, and it's open source, but it's not GPL, which basically means there are almost no license restrictions whatsoever. My understanding, and if I'm misspeaking here somebody please raise a hand, is that this stuff is completely free software, so you can do anything you want with it.
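The partial listing itself isn't reproduced here, but socket programming for a whois lookup in PHP typically looks something like this minimal sketch. The whois server name and the domain being queried are examples only:

    <?php
    // Open a TCP connection to a whois server (the whois protocol
    // runs on port 43: send the query plus CRLF, then read until EOF).
    $fp = fsockopen("whois.networksolutions.com", 43, $errno, $errstr, 30);
    if (!$fp) {
        die("Connection failed: $errstr ($errno)\n");
    }
    fputs($fp, "example.com\r\n");       // the query
    $response = "";
    while (!feof($fp)) {
        $response .= fgets($fp, 128);    // read the reply
    }
    fclose($fp);
    echo $response;
    ?>

And for the scheduled-service part, on a Unix box that usually just means a crontab entry; the paths here are hypothetical:

    # run the agent at the top of every hour
    0 * * * * /usr/local/bin/php /home/me/agent.php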
Don't confuse curl, the command-line tool, with Curl, the client-side programming language, which is a product that the people at www.curl.com are selling. There's also a copy of the curl manual on the CD-ROM, so if you're interested in looking at all the features, it's got tons of them; I'm just going to show you a few here.

To use curl, you just type curl and then your target, and it'll take everything it receives and direct it to your standard output device, which in most cases is going to be your screen. You can also redirect the contents to a file, or, as in number three there, use the -o switch to set the output to a specific file. The way I tend to use curl, though, is inside of PHP, where I'll run a little shell to execute curl from within PHP. The nice thing here, if you look, is that it takes the entire contents of the default HTML page from this fictitious website and loads it all into this variable, so I don't even have to create a file or anything; it's just all in that variable, and once it's there, it's very convenient and very easy to parse. Now, if you're just writing PHP for a web application or something, never do this: you don't run command-line stuff inside your web applications. Again, maybe I'm fooling myself, but I feel safe doing it here because there's no server attached to it.

Here's an example of one of the options you get with curl. In this example, I'm downloading the headers for the New York Times website, and there's all kinds of cool stuff in there. I can see what kind of server they're using; if you think back to the idea about monitoring the internet technology used by the top companies, this is basically how you'd do it. You can also see there's a cookie being set, and curl also allows you to send cookies back to servers, which is very nice. This was an HTML page, but you can download anything with curl; it doesn't have to be HTML. It can be images or Flash or whatever.

You've also got some nice ways of dealing with password-protected sites: you basically just set up your URL, set up your user and password, and then use the -u switch to pass the username and password to whichever site you're using that does standard authentication. The other thing curl has that's really slick is it lets you download sites that are encrypted. All you have to do is add an s to your protocol right here, and it handles the encryption automatically and throws everything into this variable called page data, which is pretty slick.

Something you need to remember when you're writing agents: when you go up to a server, you're going to leave an entry in their log file that shows somebody's been there. And if you don't do anything with curl, I think it basically just says curl and the version number. But you can make it look like anything you want. For example,
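To pull the curl options from this section together, here's a hedged sketch of the kinds of invocations described above, driven from PHP with shell_exec(). The URLs, the credentials, and the user-agent string are all placeholders; the flags (-s, -I, -u, -A) are standard curl options, with -A being the one that rewrites what shows up in the target's logs:

    <?php
    // Whole page straight into a variable (no temp file needed);
    // -s keeps curl's progress meter out of the output.
    $page = shell_exec('curl -s "http://www.example.com/"');

    // Just the HTTP headers: server type, cookies being set, etc.
    $headers = shell_exec('curl -s -I "http://www.nytimes.com/"');

    // A password-protected site that uses standard (basic) authentication.
    $members = shell_exec('curl -s -u myuser:mypassword "http://www.example.com/members/"');

    // An encrypted site: just add the s to the protocol.
    $secure = shell_exec('curl -s "https://www.example.com/"');

    // Make the entry in their logs say anything you want instead of curl/x.x.
    $page = shell_exec('curl -s -A "Mozilla/4.0 (compatible)" "http://www.example.com/"');
    ?>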