When you're looking for data, another great way of getting it is through scraping, which means pulling information from web pages. I like to think of it as data hiding in the open: it's there, you can see it, but there's no easy, immediate way to get it. When you're scraping, you can get data in several different formats: you can get HTML text from web pages, you can get HTML tables, the rows and columns that appear on web pages, you can scrape data from PDFs, and you can scrape data from all sorts of media, like images, video, and audio.

Now, let's make one very important qualification before we say anything else: pay attention to copyright and privacy. Just because something is on the web doesn't mean you're allowed to pull it out. Information gets copyrighted. So when I use examples here, I make sure it's material that's publicly available, and you should do the same when you're doing your own analyses.

Now, if you want to scrape data, there are a couple of ways to do it. Number one is to use apps developed for this. For instance, import.io is one of my favorites; it's both a web service and a downloadable app. There's also ScraperWiki, there's an application called Tabula, and you can even do scraping in Google Sheets, which I'll demonstrate in a moment, and in Excel. Or, if you don't want to use an app, or if you want to do something that apps don't really let you do, you can code your own scraper, directly in R or Python, or bash, or even Java or PHP.

Now, what you're going to do is look for information on the web page. If you're after HTML text, you'll pull structured text from web pages, similar to how a reader view works in a browser: it uses the HTML tags on the page to identify the important information.
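To make the idea of pulling structured text concrete, here's a minimal sketch in Python using only the standard library's `html.parser`. The sample page below is made up for illustration; a real scraper would fetch live HTML with `urllib.request` (or the `requests` library) instead of a hard-coded string.

```python
from html.parser import HTMLParser

# Made-up sample page standing in for a real web page.
SAMPLE_PAGE = """
<html><body>
  <h1>Annual Report</h1>
  <p>Revenue grew 12 percent this year.</p>
  <script>ignoreMe();</script>
  <p>Costs stayed flat.</p>
</body></html>
"""

class TextScraper(HTMLParser):
    """Collect the text that appears inside <h1> and <p> tags."""
    KEEP = {"h1", "p"}

    def __init__(self):
        super().__init__()
        self.current = None   # tag we're currently inside, if any
        self.chunks = []      # extracted (tag, text) pieces

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Keep text only when we're inside a tag we care about,
        # which is how the <script> contents get skipped.
        if self.current and data.strip():
            self.chunks.append((self.current, data.strip()))

scraper = TextScraper()
scraper.feed(SAMPLE_PAGE)
print(scraper.chunks)
# prints [('h1', 'Annual Report'), ('p', 'Revenue grew 12 percent this year.'), ('p', 'Costs stayed flat.')]
```

Notice that the scraper keys entirely off the tags, just like a browser's reader view: the headline and paragraphs come through, and the script block is ignored.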
So that's tags like <body>, <h1> for header one, and <p> for paragraph, in the angle brackets. You can also get information from HTML tables. Although what I'm showing you is a physical table of rows and columns, it also relies on HTML table tags: <table>, <tr> for table row, and <td> for table data, that's a cell. The trick is that you need the table number, and sometimes you just have to find that through trial and error.

Let me give you an example of how this works. Let's take a look at the Wikipedia page on the Iron Chef America competition. I'm going to go to the web right now and show you. So here we are in Wikipedia, on Iron Chef America. If you scroll down a little bit, you see we've got a whole bunch of text here, we've got our table of contents, and then we come down here to a table that lists the winners and their statistics. Let's say we want to pull that from this web page into another program to analyze.

Well, there's an extremely easy way to do this with Google Sheets. All we need to do is open up a Google Sheet, and in cell A1 of that sheet, paste in the IMPORTHTML formula: you give it the web page, you say that you're importing a "table" (both of those go in quotes), and you give the index number for the table. I had to poke around a little bit to figure out that this one was table number two. So let me go to Google Sheets and show you how this works. Here I have a Google Sheet, and right now it's got nothing in it. But watch this: if I come here to this cell and simply paste in that formula, all the data sort of magically propagates into the sheet, which makes it extremely easy to deal with. And now I can, for instance, save this as a CSV file and put it in another program; lots of options.
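If you'd rather not go through Google Sheets, the same table-scraping idea works in code. Here's a sketch using only Python's standard library; the table below is a made-up stand-in (the chef names and win counts are invented for illustration) for the kind of winners table you'd find on the Wikipedia page.

```python
from html.parser import HTMLParser

# Made-up miniature table standing in for a real page's <table> markup.
SAMPLE_TABLE = """
<table>
  <tr><td>Chef</td><td>Wins</td></tr>
  <tr><td>Flay</td><td>58</td></tr>
  <tr><td>Morimoto</td><td>45</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Turn <tr>/<td> tags into a list of rows of cell text."""
    def __init__(self):
        super().__init__()
        self.rows = []        # one list per <tr>
        self.in_cell = False  # are we inside a <td> or <th>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.rows[-1].append(data.strip())

scraper = TableScraper()
scraper.feed(SAMPLE_TABLE)
print(scraper.rows)
# prints [['Chef', 'Wins'], ['Flay', '58'], ['Morimoto', '45']]
```

From here the rows are ordinary Python lists, so writing them out with the `csv` module gets you the same CSV-for-another-program workflow the Sheets demo ends with.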
And so that's one way of scraping data from a web page: I didn't use an API, I just used a very simple one-line formula in Google Sheets to get the information.

Now, that was an HTML table. You can also scrape data from PDFs. You have to be aware of whether it's a native PDF, what I call a text PDF, or a scanned or image PDF. With native PDFs, the scraper looks for text elements, again, code that indicates "this is text." You may also be dealing with raster images, that's pixel images, or vector images, which are drawn as lines, and that's what makes them infinitely scalable in many situations. And PDFs can contain tabular data, but you'll probably need a specialized program like ScraperWiki or Tabula to get at it.

Then finally, there's media like images, video, and audio. Getting images is easy; you can download them in a lot of different ways. But if you want to read data out of them, say, for instance, you have a heat map of a country, you'll probably have to write a program that loops through the image pixel by pixel to read the data and then encode it numerically in your statistical program.

Now, that's my very brief summary, so let's wrap it up. First off, if the data you're trying to get at doesn't have an existing API, you can try scraping. You can use specialized apps for scraping, or you can write code in a language like R or Python. But no matter what you do, be sensitive to issues of copyright and privacy, so you don't get yourself in hot water, and instead make an analysis that can be of great use to you or to your client.
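As a footnote to the pixel-by-pixel idea mentioned above, here's what that loop can look like. Real code would load the pixels with an imaging library such as Pillow; in this sketch a tiny 3x3 grayscale "image" is hard-coded, and the mapping from darkness to a 0-1 intensity score is an assumption chosen for illustration, not a standard encoding.

```python
# Tiny made-up grayscale image: 0 = black, 255 = white.
PIXELS = [
    [255, 128,   0],
    [128, 128, 128],
    [  0, 128, 255],
]

# Assumed encoding for this sketch: darker pixels mean higher values,
# so map each 0-255 grayscale level onto a 0.0-1.0 intensity score.
values = []
for y, row in enumerate(PIXELS):
    for x, level in enumerate(row):
        intensity = 1.0 - level / 255   # 0.0 = white/low, 1.0 = black/high
        values.append((x, y, round(intensity, 2)))

print(values[:3])
# prints [(0, 0, 0.0), (1, 0, 0.5), (2, 0, 1.0)]
```

The output is a list of (x, y, value) triples, which is exactly the kind of numeric encoding you can hand to a statistical program.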