Welcome to Case Studies in R. This is intended to be a bit of a whiz-bang, check-out-all-the-cool-things-you-can-do-with-R kind of demonstration, and it is cool, but it's really intended to show the utility of R. It's also intended to build on the Introduction to R series I've been doing this past year; I flipped my intro workshop, and we'll talk more about that. So I'm going to try to start at a basic level and assume a certain fluency with the tidyverse. We won't really review the tidyverse; we'll use it, and I'll show you how you can catch up after the fact if you want to. I'm glad to have you all here.

If you will please give me your attention, I like to start with this land acknowledgement. I'd like to take a moment to honor the land in Durham, North Carolina. Duke University sits on the ancestral lands of the Shakori, the Eno, and the Catawba peoples. This institution of higher education is built on land stolen from those peoples, who were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African and African American people and their descendants. Recognizing this history is an honest attempt to break beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places, and I hope we can glimpse an understanding of these histories by recognizing the origins of our collective journeys. That's a very serious thing, and I thank you for giving me the opportunity to read it. We're going to move on to things that are more technical in nature and maybe not related, but with any luck one of you budding scholars can use what you learn today to help resolve or improve some of these injustices.

In any case, today's demonstration, as I said, is about web scraping. It builds on earlier workshops and assumes a certain fluency with the tidyverse, but if you're a brand-new beginner you're welcome anyway; please stick around and listen. The workshop is being recorded, so you can go back and watch it, and in a second I'll give you links to some other materials. It's important to acknowledge at this point that web scraping is fundamentally a deconstruction process: we need to know just enough about the web to break a website, or some part of a website, into its constituent parts. In that spirit, we're going to learn just enough HTML and Cascading Style Sheets (CSS) to understand how those foundational web technologies help us deconstruct a website. We're going to use a library called rvest, so if you're comfortable installing packages and want to code along with me, please go ahead and install rvest. We'll also use the purrr library to iterate, specifically the map() function. For people who are more comfortable with base R, purrr's map is akin to the old-school base R functions lapply(), sapply(), and apply(); you could use those as well, but I'm going to use purrr. Iteration is a whole separate workshop, but we'll learn just enough to see how it works. Along the way I'll try to point out some useful documentation.
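(For anyone catching up on that purrr point: here is a minimal sketch, not part of the workshop code, showing that purrr::map() and base R's lapply() do essentially the same job of applying a function to each element and returning a list. The example vector is made up.)

```r
library(purrr)

words <- c("scrape", "crawl", "parse")

# purrr style
lengths_purrr <- map(words, nchar)

# base R style, same result
lengths_base <- lapply(words, nchar)
```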
All right, taking a half step back, I want to point out again that this builds on earlier workshops. Now that everybody's joined, I'm going to put the GitHub code URL in the chat again; you're welcome to download that if you want. Rfun is a sub-branded site, something I came up with, which really just pulls together all the resources I use in teaching workshops. It's a sub-brand of my center, the Center for Data and Visualization Sciences. You can scroll through it, and each of these little squares represents a different aspect of learning R. When I say we're going to build on that, we're really building on the Quickstart part. Click into it just so you can see: this is part of my flipped workshop, so I created smaller videos, ten to twenty minutes each (some a little longer, but not much), that cover different aspects of R. If I cover things you're not familiar with, feel free to go back and watch some of those videos, and feel free to check out these links to other videos. If you don't mind, I'm going to mute your microphone because it's ringing a little bit. Anyway, I don't know that we're going to talk about joining and merging, or assignment and pipes, but if you see things you don't recognize, you can go back and check those, and of course I'm always available for consultations.

As I said, Rfun is a sub-brand of my center. I'm going to link to the center right there and do a really quick, hopefully under-two-minute, commercial for the Center for Data and Visualization Sciences. We generally help answer questions in these support areas, or themes; this is the full-time staff. We do consulting, and often the easiest way to get started is just to send an email to askdata@duke.edu. There are a lot of workshops, generally front-loaded at the beginning of the semester. For most of the workshops there's a recording available after the fact, along with resources; you can usually find those recordings under this Online Learning link. And, as I mentioned, if you're just interested in R, you can go straight to Rfun and find videos, links to code, and all that kind of stuff. Onward and upward.

Oh, one other thing I always like to mention: you can chat with us as well. We've been working on making our lab available to people virtually. We have twelve workstations that are really powerful, so if you find you're doing some kind of processing where your own workstation is thrashing because it needs a lot of RAM or whatever, we've created a connection to those twelve machines in the lab, even though the lab isn't open right now. It uses a technology called Splashtop, which is a free download, so if you email us at askdata, we can give you a connection to a more powerful machine with a curated list of data-focused software, including R, Python, MATLAB, and Tableau. A lot of that you can get anyway, but the advantage is that these machines, I think, all have 32 GB of RAM, and a lot of workloads are memory-bound, so it can be handy, especially if your personal computer is being taxed.
The other thing I want to mention is that we also have two graduate students, in a program called Computational Economics, which is kind of a dual CS/econ master's. They help most often with statistical modeling questions or advanced economics questions. The full-time staff generally try to fit in around those themes, but if you have a statistics question in particular, our econ students don't take reservations for data consults, so you can just click that green button and connect with them when they're available; they're really helpful.

So, moving back to the slide deck: a couple of caveats on what we're going to do today. And if I haven't said it before, I'll say it again: please feel free to unmute and ask questions. I don't think we have a full hour and a half of material, so I will definitely hang around as we're winding down if anybody has specific questions, and if you don't want to ask them today, remember that we do take consultations and I'm happy to meet with you.

First caveat about web scraping, and this is an important one: there's no real set of rules about how websites get developed. There are a lot of protocols and there are some standards, but each website is its own island in that sense, and basically the older a website, the more likely you're going to run into weird idiosyncrasies or inconsistencies. So I always like to point out that we're going to learn how to web scrape, but web scraping is as much an art as it is a science, and sometimes the inconsistencies you run across become barriers to how you can automate the scraping. Keep that in mind: you can do a lot, but you can also run into some really thorny issues which, if you think of them as puzzles, are interesting puzzles to solve.

Another couple of suggestions, or caveats, to keep in mind. Many websites have a terms-of-use clause or link, particularly the larger websites that are really all about content; think of anything from the New York Times and the Washington Post, which is copyrighted content, to some of the larger bibliographic databases you might be familiar with, like Nexis Uni or ProQuest. A lot of them have rules that say you are not allowed to scrape the site, and you need to keep that in mind. The library has an Office of Scholarly Communication and copyright staff who can help you understand how the terms of use may impact your research questions. Outside of copyright, there are sometimes reasons, investigative journalism for example, to do scraping that goes outside the lines. I'm not going to advise you on that; I'm going to advise you only on the technical aspects of web scraping. It's important to understand what is officially allowed, and it's important to understand that the larger the website, the more likely they're looking for you: they have little robots set up to make sure you don't break the rules. One of the downsides, and I'll say this quickly: if you decide you want to scrape all of ProQuest, and they discover that (I don't mean to single out ProQuest, because I don't remember which companies, but some companies will do this),
the next thing you know they'll shut down not only your ability to scrape their website, but the whole campus's ability to access it, until the powers that be talk to the perpetrators, if you want to call them that. All of that is to say copyright is complex, and there are financial reasons why companies that own data want you to follow their rules.

It's also important to note that there are things called APIs, application programming interfaces. Interacting with APIs and web scraping have a lot of similarities, but one of the reasons APIs exist is so that companies with big web properties can segregate traffic: one area, the web portal, is for point-and-click manual data gathering, while the API is designed for machine-to-machine interaction. By pushing those into different types of traffic, they ensure a robust experience for everybody using the site. So it's about more than copyright. Anytime an API exists, you're probably going to interact with it more efficiently than you would by writing a web scraper, so it's always a good first step, once you've identified a site whose data you want, to Google whether the organization has an API, whether you can get access to it, and whether this open site offers an API in addition to its website.

A lot of sites also have what's called a robots.txt file. That's a standard developed in the early days of the web for search engines such as Google and other people who write robots that crawl websites, and it's a good idea to look for those files and honor them. There's nothing forcing you to honor a robots.txt file; it's really a notice, sort of like a no-trespassing sign. If you've ever listened to Woody Guthrie, he says the best part of seeing a no-trespassing sign is being on the backside of it; I'm probably not paraphrasing that properly, but my point is, it's just a notice. It doesn't enforce that you can't go into that area of the website, but you should look for it and understand what the rules are.

The other caveat, and I won't go on too much longer, is that when you're writing one of these (I'm going to call it a robot, but it's a scraper), remember that computers don't need to breathe, they don't need to pause, they don't need to take a break. So if you write your web scraper poorly, you can accidentally mount what looks like a denial-of-service attack, which is yet another reason a web host might shut you down. An easy way to avoid that, and what's generally considered good citizenship, is to put a little pause between each iteration, each web page you gather; I'll show you how to do that. Just a one- or two-second pause, something to break up the amount of traffic being requested.
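(A minimal sketch of that good-citizenship pause, not the workshop's code: the two URLs are just public pages standing in for whatever you have permission to gather, and the two-second Sys.sleep() is the pause being described.)

```r
library(rvest)
library(purrr)

urls <- c("https://example.com", "https://www.r-project.org")

pages <- map(urls, function(url) {
  Sys.sleep(2)       # pause two seconds between requests, out of courtesy
  read_html(url)     # then fetch and parse the page
})
```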
All right, just two or three more slides. Let's introduce the topic: web scraping is, generally speaking, gathering or ingesting web pages for some kind of analysis. People who come to me often have never really approached the topic before and just want to know how to get started, so to them web scraping is almost like there's a magic button.

There are in fact many magic buttons, but the easiest tool to use is the rvest library's read_html() function, which basically does the same thing as clicking a link or putting a URL into a web browser: it pulls down all of the HTML. A browser renders that HTML so you can read it; outside a browser you just get the raw HTML, and we'll talk about that in a minute.

What I also want to suggest is that web scraping really consists of at least these two parts: crawling and parsing. Crawling is not part of the rvest library; crawling is what I would call, in an R context, the iteration. You have to develop a plan: what data do you want, and how are you going to work your way through the site's navigation so you can reach every page or every element you want? That's what I'm going to call crawling, and we can use purrr's map() for that. You could also use sapply(), lapply(), or apply(); I don't use those functions as much, so I couldn't advise you on them, but map() will work just fine. The other part, and maybe to me the most exciting part, is the parsing. We're going to look at an HTML page in a second, but the primary functions that will be useful to us are html_nodes(), html_text(), and html_attr(), which refers to attributes. As we go forward you'll see how those are useful; all of them basically help you winnow down to just the information you want.

All right, here is the world's simplest, most easy-to-read HTML document you've ever seen, just the part inside the gray box (we are, of course, looking at it rendered on a web page). All HTML documents generally have an opening and closing html tag, and they usually have an opening and closing body tag. Most also have a head section above the body, and sometimes there's more. Then there are things like this: a heading-size-one tag and its closing tag, a paragraph opening tag and its closing tag. This one is called an anchor tag, and that's its closing tag. The anchor tag has an attribute called href, which stands for hypertext reference, and what should look familiar is that the hypertext reference is a URL, a destination pointing to some other web page. If you were reading this in a web browser, after rendering all you would see is a bold header that said "My First Heading" and a paragraph that said my paragraph contains a link, where clicking the link would take you to w3schools.com.

So that's the trick: this is what you get back with read_html(), and then we want to identify which parts of that HTML structure lead us to the information we want. Maybe what we want are all the links on the page, in which case we want the href attributes of the anchor (a) tags. Maybe we want something else; we'll work through some examples.
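(A minimal sketch of that idea, assuming a tiny HTML string like the slide's example rather than the slide's exact file: read the document, then pull out the heading text, the paragraph text, and the href attribute of the anchor.)

```r
library(rvest)

page <- read_html(
  "<html><body>
     <h1>My First Heading</h1>
     <p>My paragraph contains a <a href='https://www.w3schools.com/'>link</a>.</p>
   </body></html>")

html_nodes(page, "h1") %>% html_text()        # heading text
html_nodes(page, "p")  %>% html_text()        # paragraph text
html_nodes(page, "a")  %>% html_attr("href")  # the link URL
```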
Some additional technical elements of a web page, and it's not exactly markup, I'm not sure quite how to refer to it, are Cascading Style Sheets. HTML is a markup language that provides structure to a document: what's a header, what's a bulleted list, what's a link, what's italics. But how the web browser renders those things is really up to the browser, and the web author can further specify the style in which things get rendered. In the previous example we had a heading size one; without any special cascading style sheet (this here is actually an example of a heading two), that would be rendered in some kind of Times Roman or serif font, probably black, probably about, I don't know, 20-point. A style sheet lets you change that, so I can say I want it to show up in a Duke blue with a sans-serif font. That's what CSS does for you. The elements you would see in the HTML that point to CSS are things like div tags with class attributes and span tags; both of them can have IDs or classes and take arguments. That's all you would see in the HTML. If you really want to see a style sheet, let's click on one right here; I'll open this one, the style sheet for the site we're going to scrape. It looks like this. It's not all that useful at this point; I just wanted you to see one. For example, at some point we're going to try to pull back some navigation, and here's the style sheet rule that defines how the navigation bar appears. The reason this matters is that the existence of style sheets indirectly adds more structure to the document, so you can use a combination of CSS and HTML to narrow down to just the parts of the site you're actually interested in gathering.
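(A small sketch of why that extra structure helps, using a made-up class name, .nav, rather than anything from the real site: the class attribute in the HTML becomes a CSS selector that html_nodes() can target directly.)

```r
library(rvest)

page <- read_html(
  "<body>
     <div class='nav'><a href='/page2'>Next</a></div>
     <p>Some body text we do not want.</p>
   </body>")

html_nodes(page, ".nav a") %>% html_attr("href")  # just the navigation link
```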
All right, I have this little slide here, and again, if you look at the GitHub link I sent, you can download the whole repository; there's a folder in there called slides, and this slide deck is in it. This is a general workflow of what happens when you're web scraping. Most of the work is in the development phase; you want to put off the production phase as long as possible, because of that issue of unleashing an unsuspecting robot onto somebody else's site, and because production runs take time, so do plenty of development work first. The important phases are: look at the raw HTML to figure out what you want back, look at the navigation of the site to figure out how you'll work your way through it, develop a plan so you can parse and iterate, and finally put the whole thing together.

This, I'm pretty certain, is my last slide. It's just a visual for a different way to think about what a whole website looks like if you picture it as a hierarchical tree, where the homepage, your starting navigation point, is the trunk. A series of branches leads off to every record, and you can think of every final-destination web page as a leaf on a branch from the root of the tree. If you were iterating, you would go to the pagination and choose page one; page one might have a URL that you would then put into a different part of your script, and you would go to that URL to parse out the title, the date, the press release, and the subject. Then you would go to page two, separately pull down the URL for page two, and get the data you want. It's just a different way to visualize what you're about to accomplish.

That said, this is a good time to ask whether there are any questions; if there are, I'll try to answer them now, and then we'll transition to RStudio and I'll do a demonstration.

John, I have a question about the tree model. Say I want to search articles from a website that has a lot of pages, and I know a keyword, let's say "election", and I want all the articles with that keyword. Do I need to go to each page, or can I just point it at, for example, the New York Times, and it will go to every tab and page and pull the articles for me? Or do we have to specify each page?

A lot of web scraping is context-specific, so as I understand your question, it probably is the case that you need to gather the URL for each of those results. Or it might be... sorry, go ahead.

Yeah, so say I want the dates the articles were published, the articles that mention elections, for the last two years.

It may very well be that the results page has all the information you want, in which case you don't have to go beyond the results page. But if the results page is only a summary of an article, and there's more information you want that isn't in the summary, then of course you do; I think that's the answer to your question. In any case, the example I'm going to use may help, because we're going to look at a results page and figure out how to get the data we want.

I have another question as well. When we're looking at a website and trying to differentiate between HTML and CSS so we can select things, is there a way to know that offhand, some heuristic?

What I'm going to do is show you a couple of tools that help you identify which HTML tags matter for the data you want to get back. Specifically, we're going to look at something called SelectorGadget. And since you're asking now, let me also point this out: let's go to Wikipedia. If I do View Page Source (I'm in a Chrome browser), I can see the whole mass of HTML that makes that Wikipedia page render into something I can read, right here. So, aside from SelectorGadget, sometimes you do have to look at the raw HTML. Another really handy thing: say I wanted this text right here... actually, let's look up Dolly Parton, because she was just in the news yesterday for getting her COVID vaccine. If I wanted this text right here, this is a good example: I could view the source and try to ferret through the page, or, in a Chrome browser, I could right-click to get a context menu and choose Inspect (this only works in a Chrome browser). Inspect gives me the ability to selectively get a sense of what the different elements of the page are and drill down into them.
So sometimes you'll use the Chrome browser to inspect a particular element, like: what makes "Acting career", "Personal life", and "Discography" show up differently from all the other words on the page? But the good news is there's an even easier way, and that's what my example will show: we're going to use this thing called SelectorGadget. I just want you to realize there are these other techniques, because sometimes you have to go beyond SelectorGadget, depending on the complexity of the page.

So let me move some of my Zoom boxes around and change which screen you can see. Oh, before I do that, let's go back to here. This is the repository for the code I'm using today. I'm going to start in the 01 scrape case-study folder (I don't have any other exercises there), with the 01 .Rmd file. If you've not downloaded from GitHub before, you can click the green button, Download ZIP, and then expand it. Now let me go to my RStudio.

So we have this open, and, like I said, I'm going to start here in the 01 scrape case-study document. The first code chunk I'm going to run is the one that begins at line 18. The only tools that really matter for what we're going to see are the tidyverse and rvest; the other two packages just let me display some Creative Commons attribution at the bottom of the document. rvest is for harvesting websites; the tidyverse is just generally useful. This particular document has a ton of words in it; I tried to narrate exactly what's going on, so it may be a useful reference for you later, but I'm not going to read much of it today, we're just going to read the code.

The very first thing we do, let me close that down, is use the read_html() function. Before we do that, let me open this URL, copy it, and go back to my browser so I can show the site we're going to navigate. The site we're going to scrape is called Ecartico. It came to my attention because last semester somebody asked me how to scrape it, and I thought it was an excellent case study for a workshop: it's a fairly austere site, it doesn't have a lot of extraneous HTML, it appears to be open access, and I didn't see a terms-of-use document. Independently, I went to the root domain name and typed in robots.txt, and when I got there it said "forbidden", which is not what a robots.txt file actually tells you. What that tells me is that I don't believe they actually have a robots.txt file; if they have one, they're not allowing me to view it, which doesn't make any sense, which is further evidence there's no robots.txt file. All of this tells me, yeah, okay, I think it's a good site for me to scrape, particularly in a workshop setting.
The information you'll find here comes from a university in Amsterdam, the Centre for the Study of the Golden Age, and it's information about artists and artisans, mostly biographical; we can click the About page for details. So if I go to Maria van Alst, I can see general biographical information about her: that she was married once, had one child, and some other details. The scenario I want to set up for us today is this: my goal is to walk through this site and gather the names of all of the children, and maybe their URLs.

Typically, and I'm a little old school on this, my first step would be a quick View Page Source, just to see whether this information looks usable; I might do a free-text search on "children". This step is not absolutely necessary, it's just something that comforts me, so I recommend it in case it comforts you; if you find it confusing, skip it. So that's my goal.

Can I back up and ask you about the robots.txt? When you were talking about it being forbidden, was the basic idea that if they were thoughtful about wanting to restrict or moderate access, they would have made a robots.txt file?

No, that's probably not a good example; I was looking for a specific robots.txt response. Let me see if I can find one. Here's a good example of a robots.txt response. If you see something like this (it will always be at the root of the site), it tells you, in this case: there's an MSN bot, and there are some things I don't want the MSN bot to look at; and for everybody else, represented by "User-agent: *", these are the things I don't want your robot to look at. The fact that the site said "forbidden" tells me there's no file there. Ideally I would want the web server to have responded with a 404 Not Found error, but it said forbidden; in either case I can't read a robots.txt file, so I'm making the assumption that they don't have any restrictions for my robot. Does that answer your question? Yes, yes it does.

Following up on that: say I see that something is not allowed for me. If I write code and start scraping, why is there a problem? I'm not allowed, so I won't be able to get that information, but I can scrape whatever else is available, right?

I think we might want to come back to that question, but as I understand it, the short answer is that the robots.txt file is really intended to tell big automated scraping operations, such as Google and the other big search engines (and I'm going to mute you, if that's okay), "please don't come and search these particular sections." If I'm a web host, I want to alert any random robot I don't know about. But this is not an enforcement mechanism; it's a notice. So I say it's good to look at it, just in case, because it gives you a sense of how the web host feels about their site, but it won't prevent you from actually writing a robot to scrape the site.
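(A quick, hedged sketch of checking a robots.txt from R before scraping; the URL is a placeholder, and example.com happens not to serve one, so the tryCatch() branch reports that.)

```r
robots_url <- "https://example.com/robots.txt"

robots <- tryCatch(readLines(robots_url), error = function(e) NULL)

if (is.null(robots)) {
  message("No readable robots.txt found; check the site's terms of use instead.")
} else {
  cat(robots, sep = "\n")   # look for User-agent and Disallow rules
}
```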
What would prevent you, if anything, is lawyers, because most of what I'm going to show you can't be shut down easily on the technical side; the web development team at that site would have to write code to detect robots and then block IP numbers and that kind of thing, and that's outside the scope of what we're talking about.

So let's just assume openness in our scenario: we're going to try to gather the names of all the children from this site. The first thing I want to do is get a sense of how to navigate through the site. I can look at this first page and see that there are 50 names on it, and that there's also some navigation, so I can page through 50 names at a time. If I click on any one of those names... this person has almost no information, and there are no children listed, but that's okay; I can still see the structure of the site. This goes on for about 20 pages, I think 22 pages total, and on any given summary page there isn't a link to every 50-name page; there are links to only about seven or eight of them. All of that is useful to know.

So the first thing I'll do is read in page one. Let me skip back to RStudio and move some Zoom windows around. That's what I did right here: I read in page one of the first browse page, and you'll notice in my Environment pane I got back a list of two items. I can look at that list, and it doesn't look much like the HTML I'd see with View Source, but, as I mentioned earlier, every HTML page has a body section, and that's the part I'm really interested in; the body is what I'll end up parsing.

This is just a review of what a body looks like, and what we're really going to do is get the links to each person on that page. The way we'll do that is by putting in this code, and the magic question is, how do I get that code? So let me go back to Ecartico one more time; you should see the Ecartico page now. What I want are these 50 names, and this is where I'm going to use SelectorGadget, which I downloaded earlier from selectorgadget.com; you can get it too, it's free. When I engage it, I can hover over different parts of the page, and every time it finds something it's trying to select, it shows a little bit of information under the orange box. (I'll share the GitHub URL again in just a second in case anybody hasn't had a chance to grab it.) That information under the box tells me the elements I'm about to catch.

All right, going back to the site: SelectorGadget is engaged, I click on a name, and a whole bunch of stuff gets highlighted in yellow. I've clicked on something that has 71 elements, but I only want 50 names. So my next step, and SelectorGadget is very handy here, is to eliminate things: I just click on things I don't want and it automatically starts eliminating them; it's not a control-click or a shift-click, just a click.
When I clicked on that, a whole class of things disappeared. I still have these four elements, and I don't want those either. Every time I click on something, this code changes; I click again, and now I have this magic code, and I can see right here that I have 50 elements, 50 names. That's a pretty good thing, so I'm going to copy that selector into my buffer.

Before I move on, let me close that down and do View Source again so you can see what's going on. It's these 50 names that I want, and they all share a very common structure: there's an li, which stands for list item (the bullet that appears in the web browser), an a anchor tag with an href holding the relative URL of the page, the text of the link, a closing anchor tag, and a closing li.

Going back to RStudio: I copied that selector into my buffer, and I'm going to paste it right here. Remember, earlier I gathered results by reading in the URL of the first page; now I paste that selector as the first argument of html_nodes(). If I run just those two lines, you can see that's the HTML I wanted to single in on. Then I can pipe that to html_text(), which gives me just the text within the anchor tag; there's the href, and then there's the text of the tag. If I run that, I now have a vector of 50 items, which I put into an object called names, and I can build a data frame out of that in a minute. The other thing I want to do is the same kind of step, but rather than getting the HTML text, I want the value of the anchor's href attribute, that is, the value of this argument. I do that right there, and when I run it, it brings back a 50-item vector of relative URLs, which is really useful. They don't include the full URL, so I'm going to have to build that up; this is the base URL for the site. Oops, why did it do that? Okay.

So that's what I'll do, but first let's build a data frame of my results, using standard tidyverse functions. I've already built the names vector and the URL vector, so I now have a table of the first 50 names from the first page, with the name and the relative URL. Then I transform that a little by prepending the base URL in front of the relative URL, and I get back links to the first 50 records.
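(A hedged sketch of that parse-one-results-page step. The URL and the "li a" selector are placeholders, not the site's real ones; SelectorGadget gives you the actual selector for the page you're scraping, and the base URL would be that site's root.)

```r
library(rvest)
library(dplyr)

results <- read_html("https://example.com")   # stands in for the first browse page

person_names <- results %>%
  html_nodes("li a") %>%      # placeholder selector for the 50 name links
  html_text()

person_urls <- results %>%
  html_nodes("li a") %>%
  html_attr("href")           # relative URLs

base_url <- "https://example.com"

people <- tibble(name = person_names, url = person_urls) %>%
  mutate(url = paste0(base_url, url))   # prepend the base URL to each relative path
```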
Ultimately, I want to generate a long to-do list of all 2,200 names, but I have to gather them 50 at a time; that's the crawling part. So far, I've crawled one page and parsed the results for that one page. What do I do next? Now I create a plan for crawling the rest of the navigation.

Okay, moving back to the Ecartico site and turning on SelectorGadget again: I want all of these pagination links. Let me click Clear first. One mistake I made is that I clicked on just one link; that's not what I want, I want this whole line. And I have the same issue as before, where I've got more results than I want: it tells me there are three items. One is this yellow one at the top, which I don't want; one is a green one; and one is another yellow one that's a repeat of the one at the top. If I had to gather both, that would be fine, I could parse it further in R, but using SelectorGadget: notice that what I want so far is anything matching .subnav, which is a class from the cascading style sheet. If I click to refine it, the selector changes to form + .subnav, and now it's one element. So again, through the magic of SelectorGadget, I've got exactly what I want: just this one line, the pagination links on this one line. I copy this into my buffer and shift back to RStudio.

That's what I paste, this time, into html_nodes(). Just to make sure I've got the right thing, I pipe it to html_text(); in this case the text isn't all that useful to me, I'm just doing a verification. What I really want, remember (unparsed, it looks like this), is the value of the href attributes. So, just like before: html_nodes() with that selector, then html_attr("href"), and that brings back a five-element vector with a URL to each of the navigation summary pages of 50 links. It doesn't include page one, because I started on page one and the page isn't self-referential; it starts with page two. If you look at the site, it goes page 2, 3, 4, 5, then an ellipsis, and then a link to the last page, page 22. So there are 50 links on each page (maybe not on page 22, but on every other page), and I want to build up that list of roughly 1,100 URLs, 50 names per page.

I'll do that much the same way as before: I put it into a table, because tibbles in the tidyverse are the easiest thing to navigate row by row. What I really want now is to build up a longer list by leveraging the information I already have: I know there's a page one, and I know it runs through page 22, so I just want to use the tidyverse to make 22 rows where only the last value differs. And that's what happens.

Oh, wait a minute, I did something slightly different here; let's look at it, because I'm going very elementary: I added a page_number variable by using a regular expression to get the final digits out of this free text (this is a character string). Just to make clear what I did there, and I want to talk about this for a minute: I wanted to get just the numbers 2, 3, 4, 5, 22, because I want to manipulate them further. To do that, I leveraged another part of the tidyverse, the stringr library, and specifically the str_extract() function. The stringr library is what extends R to use regular expressions to find patterns in full text, or in any character vector. I bring this up because anytime you're working with text in R, sooner or later you'll want to find sub-patterns within that text, and the way to do that is with regular expressions. Regular expressions have been around a long time, since the 1960s at least; they're the basis for find-and-replace in every application you've ever used, including Microsoft Word.
But regular expressions also work a little differently in every programming language; there are dialects of regular expressions. In R, the regular expression I'm writing here says: take a digit (the digit symbol in an R regular expression is \\d), then a multiplier, +, which means one or more digits, and then an anchor; in this case the dollar sign anchors the end of the line. So that pattern is really saying: find me any single- or double-digit number at the end of a line, the 2, 3, 4, 5, 22. There isn't time to go into all the specifics of regular expressions, but I wanted to interpret that one for you and tell you: if you're going to be successful at web parsing in R, you'll probably want to learn more about regular expressions. They are incredibly extensible, but you can learn the basics in half an hour to an hour without any trouble; maybe in five minutes you just learned everything you needed to know.

You can see I'm using a regular expression right here again in str_extract(), and another in str_replace(), so let's break those down a little. What I have right now is that two-column table with 2, 3, 4, 5, 22, and I'm going to do that extraction again (I guess I'm doing something repetitive here), this time into a column I'll call navigation, and I'll interpret the regex for you in a second, but you can see the result. What that regular expression did is take everything before the "page=" part of the URL. Let's look at this again: this is what the URL looks like; it ends in "page=" followed by a number. This str_extract() says: find an equals sign followed by one or more digits running to the end of the line (the digit pattern is escaped with the double backslash), and then take everything that precedes that pattern, which is this part here.

I'm also building up a page_number column and turning it into an integer, and, for just one row, I'm saying: where the page number is a 2, turn it into a 1; you'll see why in a second. I did that because I want to use the tidyverse to expand the whole range from 1 to 22, which you can do a number of different ways; you don't have to do it this way, but it's an easy way. When I run that whole bit, I now have 22 rows, with a page number set for each of the 22, and the last thing I do is rebuild my URL into a full URL from the navigation and page_number columns. I'm not sure why I did it in exactly this sequence; I started this code several months ago, and it works, so maybe a few of these steps are redundant. But here's the URL, and this is exactly what I want; actually, I can't even see the whole thing, let me expand it: the full URL, rebuilt, for all 22 pages. That's my to-do list: I'm going to crawl each of those pages to get links to each name.
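(A hedged, condensed sketch of that navigation-building step, not the workshop's exact code: pull the trailing page number out of a pagination URL with str_extract(), strip it off with str_replace(), and expand one template row out to all 22 pages. The URLs and column names are illustrative.)

```r
library(dplyr)
library(stringr)
library(tidyr)

nav <- tibble(url = c("https://example.com/browse?page=2",
                      "https://example.com/browse?page=22"))

nav_expanded <- nav %>%
  mutate(page_number = as.integer(str_extract(url, "\\d+$")),  # trailing digits
         navigation  = str_replace(url, "\\d+$", "")) %>%      # URL minus the digits
  slice(1) %>%                                                 # keep one row as a template
  mutate(page_number = list(1:22)) %>%                         # all 22 page numbers
  unnest(page_number) %>%
  mutate(full_url = paste0(navigation, page_number))           # rebuild the full URLs
```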
And this is where iteration comes in: I'm going to use the map() function, and this is also the point to offer a cautionary note. I'm going to map over the URLs in the table I just built, but I don't want to unleash my robot just yet; I still want to be in the development phase. So I'm putting a subsetting designation on the URL vector of that table, so I work with only three rows. Once I'm sure this works, I'll remove that subsetting later in the code and unleash it on the whole set, but for now I want to be conservative.

So I'm going to build a table with a variable called html_results. For the first URL, it pauses for two seconds (the pause sits inside the function being mapped), then reads in the HTML for that URL, and then it records where it came from, because I'm trying to keep a record of everything I gathered and where I gathered it from. Then it iterates: it does the same thing for the second row, waits two seconds, and then the third row. If we run this right now it should take at least six seconds, because I'm only running over three rows.

I get back something I think is a little surprising, which is why I wanted to show it to you: I got only three rows back. I just gathered the HTML body for the first three navigation pages, and that's all sitting in this result set; it's the same kind of result you saw earlier in the document. Anytime you use read_html(), you get a list with a head section and a body section, and we'll parse through that in a second. The other thing I have is, again, a guide or record of where the information came from: the information in row one came from page one, row two from page two, and so on. It becomes redundant after a while, but it's useful for verification.

So this is what I want to parse through next. I could have done it all in one step, but I want to break it down so you can see what's happening. Let's look again at what we're starting with: nope, not that one, this one. I have that three-row set, and now I'm going to build a new table that keeps the summary URL (the record of where I got the information) and then maps through the HTML, using html_nodes() and html_attr() to pull out the relative URL for each name, and html_text() to pull out the name itself. When I run that, that's the iteration. It's still a little confusing, because I only have three rows: what it did is put, for row one, a list of the 50 names; that's right here (sorry, come on, computer). That's the list of 50 names for row one, the first summary page, and that's the list of 50 URLs for the first summary page, and so on down the line. But having those in list columns is easy to unnest: you just use the unnest() command, and when I run that, I now have 150 rows, 50 for each page. I have a designation of which page each row came from, and as I scroll to the right, I have the relative URL for the person I want to gather and that person's name.

Everything I've done so far is really about working the navigation of the site.
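(A hedged sketch of that crawl-then-parse pattern. The URLs and the "li a" selector are placeholders rather than the real site's, so this shows the shape of the code, not something to run as-is; slice(1:3) keeps the robot on a short leash during development, and Sys.sleep(2) is the courtesy pause.)

```r
library(rvest)
library(dplyr)
library(purrr)
library(tidyr)

nav_expanded <- tibble(full_url = paste0("https://example.com/browse?page=", 1:22))

crawl <- nav_expanded %>%
  slice(1:3) %>%                                   # development subset: first three pages only
  mutate(html_results = map(full_url, ~ {
    Sys.sleep(2)                                   # polite pause between requests
    read_html(.x)
  }))

people <- crawl %>%
  mutate(name = map(html_results, ~ html_text(html_nodes(.x, "li a"))),
         url  = map(html_results, ~ html_attr(html_nodes(.x, "li a"), "href"))) %>%
  select(-html_results) %>%
  unnest(c(name, url))                             # one row per person, tagged with its source page
```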
I now have 150 rows, but I could just as easily have the full 2,200: one link for every leaf of the site, every person as an endpoint in this web database, if you will. What I want to do next is go to each of those people's pages and select the information I'm actually interested in. Remember, from the beginning I said what I really want to know is: what are the names of the children recorded for anyone in this whole dataset? I'm going to copy this URL just so you can see what step two is. I'm sure at this point you're getting a pretty clear sense, as I said at the beginning, that web scraping is really a process of deconstruction, and that's what I'm doing: looking at the information source I have and deconstructing it into its elemental parts.

So here's an example of the final bit. Emanuel Adriaenssen, if I'm saying that right, is male, born in Antwerp in the 16th century, and Emanuel had three children. As you saw in the earlier examples, some people have children listed and some don't; that isn't necessarily a definitive account of whether they had kids, but it is a definitive account of whether this database knows about any children. I just want to pull back a list of all of those known children, so I'll have to go to every person's page and parse it.

Turning on my SelectorGadget: if I click, it tells me I have seven elements if I just select the ul, which stands for unordered list. But I don't want seven elements, I want three, so I click to eliminate some, and now I've got four; I don't want that, so I click again, and now I've got one, which surprises me. Let me try something else; clear. Sometimes you have to fiddle with it a little. Yeah, I guess that's what I want; let me check my code just to be sure, because, interestingly, I came up with something different yesterday. I'm going to do it this way, because it seems better: I click on a name and start unclicking the things I don't want. I don't want that. Yeah. And now I have three elements. The other selector might have worked as well, it might just have required more parsing, but this is specifically what I want: three elements I can iterate over, and that's the little magic code. I'll copy that into my buffer and go back to RStudio.

You can see in my code right here that that's what I put right there, in html_nodes(), just like all the others. I read that page into an object called emanuel, but in the future I might just call it my_page so I can iterate over it. Then from every page I pull back that chunk of HTML nodes, which (once we've read the page in) looks like this. There are the children's names, and then it's just a matter of doing what I said: getting the html_text() of the children's names. If I run the whole thing, there's my vector of three children. For this exercise I did not go through the process of further iterating over every person's page, because I've already shown you the technique for iterating and the technique for reading pages in.
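(A hedged sketch of that last parsing step: read one person's page and pull the children's names as a character vector. The URL stands in for a real person page, and the "ul li a" selector is a placeholder; the real one is whatever SelectorGadget reports for the children list.)

```r
library(rvest)

person_page <- read_html("https://example.com")   # stands in for one person's page

children <- person_page %>%
  html_nodes("ul li a") %>%   # placeholder selector for the children list items
  html_text()

children   # the known children's names; empty for people with none listed
```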
I also don't want to unleash all 20 of us on this site at once; quite honestly, it's a university site, and they probably didn't design it to be part of a model classroom, so I don't want to overwhelm them. I'm sure if you do this work on your own they'll be fine with it. But those are the kinds of concerns I want you to have when you approach web scraping. I don't want to say it's not okay to scrape, but you should have some concern about the idea that you can just scrape whatever you want, because lots of things can happen: you can accidentally scrape copyrighted material you have no right to, or you can accidentally mount what looks like a denial-of-service attack and get shut down. And that's the worst part; it's not that I'm overly litigious about the rules, it's that I want you to be able to accomplish what you want, and if you get shut down by their web server, then what? Now you have to find another machine their server doesn't know about and move your code over; the whole thing is not worth the hassle, and it's better to operate above board. I hear somebody asking a question, and now is a great time, because I'm at the bottom of my code, so please ask away.

So, John, as I understand it, if we want to pull the data, we first need to create a data frame with the links, all the links to the pages we want the data from, like the individual people's pages, right?

Right, that's essentially the answer to your question. Go ahead.

Yeah, but my follow-up question is, for example, I want to pull data from the New York Times, say the Dakota pipeline articles and when they were published. Now I have to go and search, because if you just open the New York Times and click on U.S., only the latest articles are there, so I have to go to search, type "Dakota pipeline", and all the articles published in the last seven or eight years come up. Then I have to go to each article, open the link, and get the data.

So let me first tell you that I'm pretty certain, when you look at the terms of use, that the New York Times does not want you to scrape their site. But let's put in "Dakota pipeline"; I'm a subscriber, so I should be allowed to do this manually. There may also be other sources for this information, and that's something I might be able to help you with. Okay, if you look up at the URL, that's the base URL you want to start with: a query that pulls back, in this case, 1,030 results. Now, the problem you're going to have is that in order to get more than the first 15 or 20 results, you have to click on "Show more". So one of the things you'll have to figure out, let me do that again, let me hit F5, is that if I hover over "Show more", I get no indication that it's a link. This brings in a whole other level of complexity to web scraping. I'm pretty certain that what makes that "Show more" button work is some JavaScript technology, and so now you need to be able to respond to results that come back dynamically. I'm going to do a View Source on this; actually, before I do, let's use SelectorGadget and see what it tells us "Show more" is doing.
Ah, it's a thing called a div button. Let's have a quick look at the View Source and type in "button". Yeah, there are 37 of them, but maybe it's the last one; there's "show more". There we go. And then the question is, how do I react to that? I'll tell you up front, I don't actually know offhand. I might search for "show more", I might view the source, I might use my SelectorGadget, and I'm probably going to try that other technique I mentioned, Inspect, to see if I can get any clues about how to get the page to respond.

Having seen all this, I can tell you two more things. One is that you don't have to use R to do web scraping; there are lots of other tools, and there's a tool I really like called, as a matter of fact I'm pretty certain I have it up here, yes: webscraper.io, which, the last time I looked, was free. Oh, that didn't work, did it; let's just search for webscraper.io and see what happens. Okay, there's pricing information; yes, they have a free tier, good. And I'm fairly certain they have a mechanism to get past that Show More button, so that's one option. Another option is to look in the rvest documentation to see if there's a technique for iterating and interacting with a JavaScript button like that. Another option is to run something called a headless Chrome browser, which is basically a command-line tool without a visual rendering engine; it uses Chrome to gather the data and pull it back, and there are other things you'd have to learn about headless Chrome browsing, but you could do that.

And there's still another option, bringing it back into RStudio. So there are two RStudio options and two outside-of-RStudio options. The outside options are headless Chrome and webscraper.io; there are literally thousands of web scraping applications out there, a lot of them cost money, but in my experience I've never found one that costs money that's any better than webscraper.io, which happens to be free. The R options are: learn more about how rvest could interact with that Show More button, or use another package called RSelenium, and I'm certain RSelenium could do this. I have never used RSelenium, but it's more advanced than rvest. The nice thing about rvest is that it's easy to learn and works for a lot of cases, but if it has a technical limitation on interacting with JavaScript buttons, I feel pretty certain RSelenium does not have that limitation.

I've actually used RSelenium, and you can definitely click on things with it, even JavaScript.

Good, good.

It's a bit of a pain to download, that's the annoying part; it took me, I think, two hours, because I just didn't understand what it needed, but once I did download it, it was not that difficult to build things.

That's great information. The thing I've found about web scraping is that you always end up with a problem you're not quite sure how to solve. rvest is a great way to start, because, like I said, it's very consistent and a good way to begin, but it's maybe not as advanced as RSelenium. And as you might imagine from the name, there is actually a package out there called Selenium that may have been developed first for Python.
Getting started with RSelenium might be a little harder, but based on what we just talked about today, hopefully you'd have an easier time than approaching it from scratch. So the day or two you had to spend getting comfortable with RSelenium is hopefully something others can now avoid, based on your input. And I'm happy to look at some of this with you, but in many cases with web scraping, like I said, you just run across errors you've never seen before, and you have to break them down into little elemental parts and see what can be done. The attendee adds: with RSelenium, the biggest difficulty I had installing it was that you get the choice of running basically your own browser through another R package, but that package doesn't actually work on its own, so you then have to install Selenium itself or Docker — another download — to be able to use RSelenium as a kind of virtual machine on your computer. Right. But it works once you do all that.

So, let's see, I've got some questions in chat that I maybe haven't been answering, so let me go back and look at those. Seth Morgan had to go to another seminar — okay, well, I'm glad you could join us, Seth. Marcos says: could you walk us through scraping data tables from the web using R for Excel? I've never used "R for Excel," but as it turns out, scraping data tables is probably one of the simpler things. Let's see if we can find an example of a table, and in the meantime, let's look up "R for Excel"; if I can shed any advice while we're meeting, I will. Unless I'm misinterpreting your question, Marcos, I assume there's some package called "R for Excel" — is that what you're saying? Marcos may be gone, but feel free to type it in. Ah, here we go: a homepage for "R for Excel users." Well, this is different. All right, I'm not sure how to interpret the question — scraping HTML tables? Or maybe the question is how to ingest Excel tables. Let's wait for Marcos to clarify. Meanwhile, Rebecca says: could you elaborate a bit on the process of creating a web scraper to download and store files, for example PDFs, and any other packages or scripts involved? All right. If you have a collection of PDFs that are available through a website — and this is something I hadn't discussed fully, but let me say this, from years of working in a library, and believe it or not, 20 years ago I was working as a web developer — one thing I discovered is that while you can spend time writing scripts to download chunks of information, particularly if it's open data from a governmental or quasi-governmental organization, the easiest way to get that data is often to pick up the phone or send an email to the web host and ask: hey, can I have all of your information? Honestly, if they say yes to that, and it's free and legal, and it takes you three minutes, that will save you so much time. That aside, that's the social engineering side of gathering information.
If there's a website that has links to a bunch of PDF documents, it's going to be very similar to what we just discovered. I'm not sure which screen I'm sharing; let me see here. Okay, I'm sharing the web browser. If, rather than a link to an HTML page, the link points to a PDF document, then we should be able to download it. I'm wondering if there's an alternative to read_html for that — and actually, a plain download might just work, because there is a download.file function. Yes, download.file takes a URL as one of its arguments. So rather than using read_html, you would use download.file. I'm going to type it up here in the URL bar because I can type there, but this would go in your script: download.file with url = , for example, https://food.com/my_document.pdf — not that this particular address is going to work — plus a destination file, and then of course you could assign that to some object, my_pdf, for iteration. So that's the pseudo-code; it should work as a means of downloading PDFs.

Now, there are a whole bunch of other R packages for working through PDF documents. Let me see if I can find some of them: PDF tables; Tabula is one. And here we go, I think this might be it. Yes, the same people who made RSelenium made a package called pdftools. So if it's PDF documents you're after, you might actually start with pdftools, which should help you — not with the ingesting, and again it's not specific to rvest — but with parsing the PDF markup into data. That has a real similarity to what we just did: we parsed HTML to get raw data, and in this case you'd parse PDF into raw data. So you might want to use something like that. And again, I'm happy to set up consultations with people on the details. I'm going to put this URL in the chat; some of you may know it, but if you want to set up a consultation with me, that URL will let you schedule it based on your own availability.
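To make that pseudo-code a little more concrete, here's a minimal sketch of downloading one PDF with download.file and then pulling its text out with pdftools. The URL and file names are hypothetical placeholders, and the purrr line at the end is just one possible way to iterate over many links.

# Minimal sketch: download a PDF, then extract its text with pdftools
# (the URL and file names below are hypothetical placeholders)
library(pdftools)

pdf_url <- "https://example.com/reports/annual_report.pdf"
dest    <- "annual_report.pdf"

download.file(pdf_url, destfile = dest, mode = "wb")  # "wb" keeps the binary intact on Windows

pages <- pdf_text(dest)          # returns one character string per page
cat(substr(pages[1], 1, 300))    # peek at the start of page one

# With a whole vector of links, you could iterate, for example with purrr:
# purrr::walk2(pdf_urls, dest_files, download.file, mode = "wb")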
Marcos just wrote back and said: what I meant was to scrape data tables from the web, like tables of public financial reporting, using R, and then export them to Excel. Right — thank you, Marcos. The scraping part would be similar, and I'd say this about financial data in particular: if it's public financial data, there are often lots of sources for it, so you may be able to get a download quite simply without writing a web scraper at all. But let's find an example HTML table and inspect the element. Yes, that's a table right there: you can see there's a table tag, a table body, and a bunch of tr rows. So, real quickly, what you would do is pull that table into a data frame and then write it out as Excel — write_excel, I'm pretty certain there's a function like that. Let me switch back to RStudio and get a notebook going. How are we doing on time? We're going until three o'clock and we're almost done, so let me keep going, but if you have to go, I understand, and I appreciate everybody who came.

So, library(tidyverse) and the Excel library — what's the name of the package for importing data from Excel? It's readxl, with read_excel. So library(readxl), and then looking for a matching function for writing out to Excel... I don't see one there, but I'm sure there's a way to do that; maybe it's just not part of readxl and I'd have to find a different package. I'm 100% certain there's a way. But let's do this: I'll copy my table's URL and call the result my_html_doc. We already loaded rvest, but that's library(rvest), if I spell it right, then read_html. And just for the fun of it, have a look at that. The table is going to be in the body of that document, so we pipe my_html_doc into html_nodes, and remember that we can use SelectorGadget to figure out exactly what we want. Let's put that into practice. Here's my SelectorGadget, and I'm going to click on — yes, those are all tr elements. Let's see what happens if I click on that. Oh, that's interesting; there's a bit of a challenge here, but let's see what we get when I put that in. Nope, not really what I want. Hold on, let me see if I can get something different. Oh, there we go: td would do it, and this one is a th. Let's see what we get with td. But it says 46, and I don't want 46 — where are the others? Oh, here we go; I don't want this. Now it says 18, and by the way, I'm looking at the match count SelectorGadget shows right here. That's pretty good: it's suggesting a customers td selector, and I like that; maybe customers alone would work. Copy that and put it in. Does that look right for my source? I got Alfreds and Maria — that's row two — so this isn't quite how I want it, because what I really want is to gather almost by column. So give me a chance to refine what I'm doing; I don't want that. I know you can't see exactly what I'm clicking, but trust that I'm just using SelectorGadget to highlight what I want. There we go — now we're cooking. Now I've got Alfreds through the rest; that's the first column. So then I can pipe to html_text, and I'll call the result col_one. Then I'll build a little data frame where the first column — named "first" just so it's not ambiguous — is col_one, and call that my_table. If I run both of those, there I have it; that hadn't occurred to me before, but anyway, I've gathered column one, so I would have to iterate, and there are a number of different ways to iterate over the remaining columns. And even though I was looking for a write-to-Excel function, I can always write_csv my downloaded table to my.csv, and now that I have a CSV file — it should show up down here; let's run it and see — I can easily import it into Excel. So I hope that got close to answering Marcos' question. Oh yes, Rebecca points out in chat that there's a write_xlsx function, which could be used directly. Thank you, Rebecca.
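For reference, here's a compact sketch of that whole table-to-Excel path in one place. Instead of gathering column by column, it leans on rvest's html_table(), which turns a table node straight into a data frame, and on the writexl package Rebecca mentioned; treat the URL as a stand-in for whatever public table you're actually after.

# Sketch: scrape an HTML table and export it to Excel in one pass
# (the URL is a placeholder demo table; swap in your own source)
library(rvest)
library(writexl)   # provides write_xlsx(), as mentioned in chat

page <- read_html("https://www.w3schools.com/html/html_tables.asp")

my_table <- page %>%
  html_node("table") %>%   # first table on the page; use SelectorGadget for a tighter selector
  html_table()             # convert the table node into a data frame

write_xlsx(my_table, "my_table.xlsx")   # opens directly in Excel
# Or, as in the demo: readr::write_csv(my_table, "my.csv") and import the CSV into Excel.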
Oh, somebody asked when the next webinar is. It's coming up soon; let me look at my schedule. I'll go to Rfun first because it might be down here at the bottom — let's see, spring workshops. Today is March 4, so on March 16 I'm doing another intro to R for the DataFest contest, which is sponsored by the stats department. My colleague Angela is doing a visualization in R workshop, and if you haven't had a chance to attend one of Angela's workshops, she is extraordinary and knows a great deal about visualization and R. I'm doing a Twitter data-gathering workshop on the 25th. On the 6th of April I'll show how to make slides like the ones you saw today, which are built with an R package called xaringan. And then, picking up on web scraping and Twitter gathering, we'll do a little bit of sentiment analysis on the 15th of April. So, that's that. Have a great day.