Good morning and welcome to this week's edition of Encompass Live. I am your host, Krista Porter, here at the Nebraska Library Commission. Encompass Live is the Commission's weekly webinar series where we cover a variety of topics that may be of interest to libraries. We broadcast the show live every Wednesday morning at 10 a.m. Central Time, but if you're unable to join us on Wednesdays, that's fine. We do record the show, as we are doing today, and it will be available for you to watch at your convenience later. And I'll show you at the end of today's show where you can access all of our show archives. Both the live show and the recordings are free and open to anyone to watch, so please share far and wide with anyone who you think may be interested in any of our shows. For those of you not from Nebraska, the Nebraska Library Commission is a state agency for libraries, similar to your state library, and so we provide services to all types of libraries. So you will find shows on Encompass Live for all types of libraries: public, academic, K-12, corrections, museums, archives, etc., etc., anything and everything. We do bring in speakers from across Nebraska and across the country, but we also have Nebraska Library Commission staff that do presentations, and that's what we have with us today. Today is the last Wednesday of the month, so that means it's Pretty Sweet Tech day. Yay. With Amanda Sweet, our Technology Innovation Librarian here at the Nebraska Library Commission. Good morning, Amanda. Good morning. On the last Wednesday of the month, she always comes on to do a show with more of a techy kind of focus. There may be other shows with techy stuff throughout the month, of course, but you can always count on Amanda's Pretty Sweet Tech session. If you're into tech, she's the one to follow. And today, she's going to talk to us about how to scrape the web without going crazy.
If that's driven you crazy before, or you just have no clue how it works, Amanda will tell us. So I'll hand it over to you, Amanda. You might just be curious about how web scrapers work, because they're kind of everywhere now. You might have heard about spiders and crawlers that go around the web to map out what a website looks like and extract data for various purposes. And that is basically what a web scraper does: it extracts data from the web and then turns it into a format that can be better used by humans for different purposes. Now in the case of Web Scraper.io, which is the platform that I'm going to be demonstrating today, it will help us convert the HTML, kind of what you see on the page, into a spreadsheet format. So you can download it as a CSV or in the standard Excel format. That'll help you use it in databases, it'll help you build something like a mailing list, it'll help you transfer the data so that you can use it in different ways. After I go through a web scraper demonstration to give you an idea of what it looks like, I'll talk about some use cases of how this has been used specifically in libraries, and then a few that I just find kind of fun. And then I'll talk about the benefits, the limitations, and whether a web scraper is actually legal, because there have been a few legal cases that have come out recently about web scrapers: how you can grab information from a website, how copyright and website content come into play, and how you can legally use the information that you are pulling from the web. Because you'll see it is extremely easy to grab information from nearly any website, but should you? So that's what we'll talk about. I hadn't even thought about that part. Right? But the law offices definitely have. Yes, of course. So I talked about this one already, but how do you actually do it?
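The HTML-to-spreadsheet conversion Amanda describes can be sketched in a few lines of Python. This is purely illustrative: the HTML snippet, the `TitleScraper` class, and the field names are all invented for the example, and Web Scraper.io does this same conversion for you without any code.

```python
# A minimal sketch of what a web scraper does: pull text out of HTML
# and rewrite it as rows in a CSV. Everything here is made up for
# illustration -- Web Scraper.io performs this HTML-to-spreadsheet
# step for you, no code required.
import csv
import io
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collects the text inside every <h2> tag it sees."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data.strip())

page = "<h2>Night Light</h2><p>Beginner</p><h2>Step Counter</h2><p>Intermediate</p>"
scraper = TitleScraper()
scraper.feed(page)

# Turn the extracted titles into CSV text -- the format you would
# download from Web Scraper.io and open in Excel.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["activity"])
for title in scraper.titles:
    writer.writerow([title])
print(out.getvalue())
```

The real tool records which page elements you clicked and does this extraction for every matching element on every page it visits.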
You're grabbing information from the web, and we're going to be using a tool. It's a Chrome extension called Web Scraper.io, and this is a free extension that's going to let us do this very thing. It's going to ask us to go through these various different steps, and I put them into the slide so that you can reference it and start walking through these steps on your own if you decide to use a web scraper yourself. But first and foremost, we just need to grab the extension and pop it into Chrome, which I've already done. Web Scraper.io does have a tutorial, just in case you've never installed an extension before or don't know how to do it in Chrome. The trickiest part is actually opening it once it's installed into Chrome. And then we're going to create a site map, because a web crawler actually needs to be told which information you want it to grab. So we're going to start by dissecting the page and wrapping around different elements that we want to point the web crawler to, and we'll select the different pieces of information. I'm going to go through one scenario where this will work really well, and then I'll go through another scenario where you'll see the limitations of where the web scraper isn't going to work so well and why that is. And then I'll run you through how to actually grab this data once it's been pulled and how to start cleaning it, because the extraction part is automatic. What actually takes longer is cleaning the data so you can use it for your own various purposes. So once it's in Excel, there's still a lot of stuff you might have to do, like remove duplicates or find missing bits of information, and figure out whether you can retrieve that information from other places and how you can merge it all together so that it's all organized and clean. And I've got some different guidebooks and some tutorials that you can use to start doing that data-cleaning part of it.
And the "have fun" part is just, I don't know how you're going to use this stuff, but use it wisely and have fun with it. You never know what you might find. For people who are data nerds, this is like heaven, I think. You get a little giddy. And these little reference slides are going to be available just in case you need to go back and find the tutorials or find out, how did she do that again? These are linked in here so that you can refer back to them and grab them when you need them. But I am going to pull Chrome over here. So you should now see the micro:bit website in front of you. And I've already loaded the extension; again, there's a link to the tutorial just in case you haven't done that before. I now have Web Scraper loaded into my extensions. To open it, I'm going to go into the three-dot menu, and I'll go into More Tools and Developer Tools. When people first see this, they tend to freak out, because most people have never actually looked at this unless they're a developer. But again, you don't actually need to code to be able to use any of this stuff. I'm going to click on the Web Scraper section in here. These are the web scrapers I've already put together; this is the list of the different site maps that already exist. But we are going to create a new one based on this micro:bit section, and it's going to look like this. When you start putting together these site maps, you're going to see different selectors and elements on here, and I'll show you what that'll actually look like. I'm going to go into Find Out More. So this scraper is going to be mapped off of this website. We first have to grab the link to the website itself to tell the web scraper what we're actually dealing with, and then we'll start breaking it down into the various different sections so the web scraper knows what to look for. And I'll show one little trick in here.
Because this is going to give you the perfect visual that you need to understand where the scraper is going. So I went into the site map, and I went into the selector graph. This is actually graphing out where exactly you are sending your web scraper. The root is going to be the website where you're starting out; this is microbit.org/projects, the Do Your Bit section. The wrapper is going to be, and I'll scroll down a little bit, these little cards here. The wrapper is telling the scraper to grab information from this card. Open link is telling it to click on this Read More button. And then once it opens the link, it'll automatically open up this page, and we're building another wrapper that tells it to start looking right here. Because for my purposes, I needed the name of the activity. So this little selector, the activity name selector, is telling it to grab the title of the activity (here, the counter). And then the experience section is grabbing this block of text right here. So this is what we're going to be telling our scraper to grab; we're telling the web scraper how to navigate the site and which information to grab. From scratch, it looks like this. I'm going into the main page, I'll click on Find Out More, and now we have our starting URL. I'm going to go to Create New Site Map, then Create Site Map, grab our URL, copy it, and paste it into the start URL. The site map name is how you're going to refer to this map again, so that you can find it in the list later on and either reuse it or have your merry way with it. I'm going to call it microBitTest, and I have no spaces in here, just all one long run of text. This is called camel case: it starts with a lowercase letter, and then each new word starts with an uppercase letter, so you can tell the different words apart. It's easier to read that way. So go to Create Site Map.
And now we're going to start creating those selectors so that the web scraper knows where you want it to look. So this is our main homepage. The first thing we want to do is give it a sift-through: get an understanding of what the page looks like and zero in on the specific sections of the site you actually need. For our purposes, I don't really need any of this stuff here. I want to zero in on the projects themselves, because I'm making a reference list of different makerspace projects that libraries can do. I want to link back to these specific projects, and I want to be able to tell the experience level, the language that they need to do it, and the different sustainable development goals that each one of the projects is associated with. And then I'm going to use this to make kind of a mini recommendation chart for different activities. So my first selector is going to be telling the web scraper that I want to deal with these little boxes right here: look here for information. I'll click on Add New Selector, and I'm going to call it card wrapper. You can call it whatever you want, as long as you know what it is when you go back into that little graph to find out what you did. Now I'm going to choose Element, and then I'll click on Select, and then I'll click here to enable hotkeys. What that does is that when I click up here to select what I want the web scraper to deal with, it won't activate any of the links or any of the information. Now, when I click on the first one, the scraper is only going to look in this one single box. I want it to look in all of these boxes, so I'm going to click on all of them, and I want these down here too. So now the web scraper is going to run through everything that is highlighted here, and it's selecting it based on a CSS selector. CSS is kind of the code that makes the text on the screen look the way that it does.
It prettifies everything, and it acts like a little marker that tells the computer which little piece of information is which. You don't need to know it; the computer does it for you. Then you can do an element preview, and it'll highlight everything to make sure that you grabbed the right stuff. I'm going to click on Multiple, because there are multiple little things on the page that we want it to look for, not just one. Then I'll save the selector. Then I'm going to click on the selector, because I now want it to interact within this wrapper that I just created, and I want it to open up this link, because the information I need is actually on the next page over. So if I want it to activate this link, I'm going to add a new selector. I'm going to call it open link, and set the type to link. Then I'll select again, and I'm going to click here to enable hotkeys so that when I click this, it doesn't actually open it, it just selects it. Then I'll click it, and done selecting. It grabbed that one for me. I'll save the selector, and I'm going to do an element preview. And we grabbed the right one. Now I'm going to actually click on this one, because now we need to build the new little wrapper that says: once you've opened up this next page, you're going to be dealing with the information that I'm going to indicate on this page. So I'll click on the open link, and now we're going to add a new selector. I'm going to call this wrapper two, and I'll make it a new element selector. And I want to go up here. Here. I moved my mouse around until I had this entire little mini section highlighted, so that it knows it can search anywhere in this section. Then I'll say done selecting. And I don't need Multiple on this one, because there's only one on each page. But I mean, the world's not gonna end if you leave it there.
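When you click those boxes, the extension records a CSS selector, which in essence means "every element that carries this class." As a rough Python sketch (the class name `card` and the sample markup are hypothetical, not the micro:bit site's real HTML), matching elements by class looks like this:

```python
# Sketch of what "select by CSS class" means: collect the text of
# every element whose class attribute includes "card", and skip
# everything else. The markup below is invented for illustration.
from html.parser import HTMLParser

class CardFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside a matching element
        self.cards = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "card" in classes:
            if not self.depth:
                self.cards.append("")   # start a new card's text
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.cards[-1] += data.strip()

page = ('<div class="card">Night Light</div>'
        '<div class="hero">Banner</div>'
        '<div class="card">Step Counter</div>')
finder = CardFinder()
finder.feed(page)
print(finder.cards)   # only the two "card" elements match
```

The extension builds an equivalent selector automatically from your clicks; you never have to write one by hand.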
And then inside wrapper two, I now want to grab this title and this content right here. So add a new selector called activity name. I'm going to grab the text from it: select it, highlight this little counter section, and now it's going to be grabbing that top-level heading. Then I'll add a new one for the activity; I'll just call it information. And I'm going to select and highlight this little section, because we want it to pull that. And save selector. So now this is all mapped out. Now we're going to go into the selector graph, and I'm going to make sure that I built it right. Our root is this URL that was up here. Now it goes out to the card wrapper, and that is these here. Then I open the link, so that's telling it to click on the Read More. And then once the link is open, it built a new wrapper telling the computer to look right here for any information it's going to grab. And then I wanted it to grab the activity name and the information that's associated with that activity. And it'll automatically grab the URL that is associated with each thing, so we don't have to worry about that. And now I want to scrape. So now this is going to be one of the more important parts when you're scraping data, especially large amounts of data. The request interval says, I am going to click on one of these items every two seconds, and it's going to have a page load delay of two seconds. If you start making requests from a website too fast, it might block you out, because it might be saying: we know that you're a web crawler and you are trying to hit our site too many times in a row, you're giving us too much web scraper traffic, so we're blocking you out. If you find that a website is starting to block you out and it's not letting you grab anything, it might throw a little forbidden error; there'll be a little number, 403, and it'll say forbidden.
And that means that the website has blocked you out, because you're giving it too much traffic or they just don't allow web scrapers. So if you are just grabbing a little bit of information, I usually leave it at this two seconds, because that's about normal. Computers know how long it would take a regular human to do this, and they've also tested how long it takes an automated computer to do it. So when you have multiple windows open at the same time, all doing scraping, I usually change this and vary the number, so that one will be going at 3,000 milliseconds, one will be going at 2,800, one will be going at 3,200. That way you don't have regular-interval traffic that will overload the server. Otherwise they get mad and forbidden-y. We don't want to do that. We don't want to make them mad. Right. So I changed that to about two and a half seconds, and we'll start scraping. This will actually be pretty quick, just because there's not a whole ton of information on here that we need to get. And right now, you'll actually see it opening up the pages on the screen. When you're running this with a lot of information all at once, you can just minimize this, and as long as you keep this window open, it'll just be running in the background. I've run this overnight and it's been perfectly fine, with a few caveats. And because I know that this process can take a second to do, I have one ready to go. So this is what it looks like when it's all said and done. Once I have downloaded and extracted the data, it has converted that text information into a CSV. So now we've got our data. The only thing that I really did to clean this is that initially, when this was first loaded, all of these different columns were smushed into one cell. I just grabbed that cell, went into Data, and hit Text to Columns. And then you can choose a character that says: every time you see this character, I want you to create a new column with information.
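The vary-the-interval trick can be sketched as a small helper: pick a random wait near the base interval so requests don't arrive on a perfectly regular beat. The function name and exact numbers are just an illustration of the 2,800 to 3,200 millisecond range Amanda describes; Web Scraper.io has this as a settings field rather than code.

```python
# Randomized politeness delay: wait roughly 2.8-3.2 seconds between
# requests, with jitter so the traffic has no regular interval that a
# server could flag as bot-like. The helper name is made up.
import random

def polite_delay(base_ms=3000, jitter_ms=200):
    """Return a randomized wait time in seconds around base_ms."""
    wait_ms = random.uniform(base_ms - jitter_ms, base_ms + jitter_ms)
    return wait_ms / 1000.0

# Between each scraped page you would sleep for this long, e.g.:
#   time.sleep(polite_delay())
delay = polite_delay()
print(f"waiting {delay:.2f}s before the next request")
```

The same idea applies whatever scraper you use: too-fast, too-regular requests are what earn you the 403.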
So I chose Delimited, went to Next, and the character that was in there was this one. I honestly don't know what this is called; I've been calling it a vertical bar. I'm sure it actually is called something. But that's the way it looks, yeah. So I untick Tab and just choose Other. And then it's Shift plus the key that sits between Backspace and Enter, if you're using a full-size keyboard. I've already done it, so it's not actually going to do anything now. But you just go to Next and then set it as text, unless it's a number, in which case don't change anything. Then you finish it and it'll automatically split everything. So that was the cleaning-the-data part. And then you can also relabel these different columns so that they make more sense for what you actually want to use them for. And it will automatically grab the link that references this specific project, over in this area here. So when you're referring people back to it, to say, if you're interested in this activity, click on this link to access it, this is the one that you would want to grab. This has actually sped up my life quite a bit, so that I'm able to grab this stuff, format it, and have my merry way with it. And I'm going to show what I've also used this for; my password is saved in Mozilla, so I'm going to open up Mozilla again, and I will open up this. So this is a collection of different businesses and different industries. What I did, in order to help libraries tackle the workforce development issue and start to pinpoint different companies that are actually using different technology tools and resources, was use a web scraper to grab a bunch of different business information. And then I categorized it according to this SDG chart, or rather, let me grab it, according to this industry chart. And then I added in the sustainable development goals that are related to each one of those businesses.
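For reference, the character in question is usually called a pipe (or vertical bar), `|`, and the Text to Columns step amounts to splitting each smushed cell on it. A quick sketch with made-up sample rows:

```python
# Excel's Text to Columns, done in code: split a cell whose fields
# were smushed together with the pipe character "|". The sample data
# is invented for illustration.
import csv
import io

raw_cell = "Night Light|Beginner|Python|Goal 7"
columns = raw_cell.split("|")
print(columns)

# The csv module can do the same for a whole exported file at once:
smushed = io.StringIO("Night Light|Beginner|Python|Goal 7\n"
                      "Step Counter|Intermediate|MakeCode|Goal 3\n")
rows = list(csv.reader(smushed, delimiter="|"))
print(rows)
```

Either way you end up with one clean column per field, ready to relabel.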
And in order to do that. Oh, let me grab my little chart here. This one has a better one. So this is a collection of all the different nonprofits. I did it for Nebraska, and you'll see I did it for Nevada too, just because they asked me to. I used the web scraper to grab all this information, and then I categorized it. So this is helping people better choose a career that's relevant to the problems that they actually care about. And I used the web scraper to grab it. Scrape it, if you will. Then I cleaned it all up, removed the duplicates, and had my merry way with it. So are there any web scraping questions at this point? Did the way that I described that more or less make sense? Let's see here. Open up the window. All right. Yeah, if you have any questions, I don't see anything right now, but if anybody does have any questions or comments or thoughts, type into the questions section of your GoToWebinar interface. I can see that and read off what you have to Amanda. Nothing has come in yet, but I'll keep an eye on things. It looks to me like it's the kind of thing that, like many things, takes some setup to get going, but once it's doing its thing, it's going to throw a lot of data and information at you and be very useful. And once you start doing it regularly, just like anything you learn, it becomes a habit, and you figure out what you need to grab and how to do it. And yeah, practice. It's interesting. Oh, what was that? So I told you that I was going to show you one process where this system works out perfectly, and then I was going to show you one process where, not so much. Yeah, so it can't always go the way you planned, and that's okay. Any questions, or anything you want Amanda to show, or a specific something you were thinking about using a web scraper for that you want to know more about? Go ahead and type it in the question section, and I'll keep an eye on that for you.
And if you do have a specific thing, like, I wanted to collect information from WorldCat and pull it into a spreadsheet so I can use it in a different catalog system, or, I wanted to grab information from this website and I want to know how to pull information from this specific site, you can also send me an email and I can help you work through that. Sure. So I opened up the web scraper again, just went into this little menu, More Tools, Developer Tools, and it opens up this. Now I'll show you an instance where this doesn't always work out so well. One of the questions that I had gotten asked was: I might be interested in getting either some tech kits or some book club kits. I don't know if I want them right now, but I want to be able to make a list, and it's easier to use a web scraper than it is to copy and paste all the individual information into a spreadsheet. What they wanted was to be able to do a search for an author name, find out the titles that were available as a book club kit, and find out the number of copies that were available; that's basically it. They just wanted a massive list of it. So we'll go into Collections at NLC, the NLC book club kits. I'm pretty sure the name they asked about was just a random example, but they used James Patterson. So I just searched for Patterson. Whenever you do a search in the way that this is set up, it will actually give you a unique URL, so it'll tell you that it's searching for author and Patterson. You can actually use this as your starting URL. And then you would want to wrap this into an element, tell it to search for information in here, and then ask it to grab this title and this information here. So I built this down here, just for the sake of time, and you can see that I built in this little wrapper.
It's going to be searching for this, but here's where it starts to get tricky. With the card system, you were able to select a wrapper that only had this text. But this is actually formatted as a table, so it's grabbing every single one of the different table cells. It's even including the pictures in here. The book covers that we've got. Yeah. So then, when I clicked over to the next level, saying that within that wrapper I want you to grab the title, and we do an element preview, it'll grab the title. But when it gets over here, it's not going to be able to find anything. It'll search and search and search, but it will give a blank result, because there's nothing in it. There isn't one there. Yeah. And when you go to copies, when I tried to do that, you'll see that it's trying to grab everything in this parent element, and in a second I'll tell you why that is. So I'll go to Scrape. As far as I know, I don't think they put in any anti-web-scraper blockers; I didn't have a problem when I did it. If you're scraping a lot of information, you can also hit this refresh button on the screen and it'll give you a preview of the information that has just been grabbed. So now I will export the data. I'm going to grab it as a CSV. It really doesn't matter which format you grab; you can convert it once you're done. I don't care. So now I open this, and I'm going to reformat it so it's going to wrap, so it'll be easier to read. Now you'll see a bunch of blank lines on here, and I've already explained kind of why that is: it's because it searched in the cells that had just the pictures, and it didn't find anything. But if I hadn't just told you that, when you pulled this you'd be like, well, aren't there supposed to be 12 different results in here? Where are my other titles? And so that is where things start to break down.
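Blank rows like these can be flagged automatically before any hand-cleaning, by counting how many records came back with an empty title field. A small sketch, with invented sample rows that imitate the picture-only table cells:

```python
# Flag scraped rows whose fields came back blank -- the first step of
# cleaning the export. The sample records below are made up to mimic
# the blank results the image-only table cells produced.
rows = [
    {"title": "The Christmas Wedding", "copies": ""},
    {"title": "", "copies": ""},   # a cell that held only a book cover
    {"title": "Cross Fire", "copies": ""},
    {"title": "", "copies": ""},
]

complete = [r for r in rows if r["title"]]
missing = len(rows) - len(complete)
print(f"{len(complete)} usable rows, {missing} blank rows to investigate")
```

Counting the blanks up front tells you whether the scraper missed real data or, as here, was pointed at cells that had nothing to give.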
That's kind of one of the breaking points, because when you clean data, you have to find the missing values and why those values are missing. They are missing because it's trying to start in an empty box. But if you didn't know that, you would think that you just had faulty data. And you have nothing in the copies column, because it couldn't access the right information. In the background of all of this, it wants you to point to a specific area of the code that says: you are looking for this paragraph, or you're looking for a paragraph that is styled in this way, or you're looking for a heading that's styled in this way. But because the background code didn't have any of that specified, it wasn't able to find it. And the way that I found that out is by going into the code. Again, you do not need to know code to be able to use this, but if you want to know why it broke, you might need to. So I'm going to right-click and go to Inspect. Actually, I prefer to do this in Mozilla, so instead of my initial way, I went into View Page Source. If you don't know anything about code, if you have never seen code before, the best way to navigate code for the first time is to just find a single little keyword in here. The Christmas Wedding is text that you know appears in this website. So if you do a Control-F find for The Christmas Wedding, you know that it is right here, and you know that you are looking at any of the text that appears right here. So The Christmas Wedding: this is what actually displays on your screen. You know that it displays on your screen because the A tag and the link over here show that this is the clickable link that is available. What I would expect is that right underneath that, there is a paragraph or something showing that there's other content.
But right here, if you search for the eight copies that appeared on the screen, there is nothing that says this is a paragraph. There is no HTML tag, and there's no additional information that would let the web scraper differentiate this piece of information from any other information on the screen. The reason that something like this happens is that, so, when I code, I just go into a website, I open up a text editor, and I manually code everything, because I'm anal-retentive about my tags. But when you use something like Expression Web or another tool that assists you in building code, sometimes that code editor only cares that it looks right on the page. It doesn't care what the back end looks like. So when you open up that little editor and type in eight copies, then eight copies appears on the screen. But it doesn't have any tags, because the editor doesn't know that you want it to be a paragraph, or that you want it to be a heading. It just knows that you want eight copies to appear on that screen, and you don't care what the back end looks like; it just looks the way it's supposed to. Not all editors look like this; not all editors work this way. But. When they first built the Internet, they didn't build it for web scrapers. No, we didn't have them then. Yeah. And some sites haven't changed since then. Right. So that's why, depending on the site you go to, you might actually run into something like this: when you try to tell a web scraper that you want to grab this information, there's nothing running in the background that says this is a paragraph. You can select it, you can grab it, and, as you may or may not have noticed in this code, it just says eight copies. There is no tag, and there is nothing that says what this is. And the Library of Congress actually automatically scrapes a lot of the state library sites and a lot of different major websites.
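The problem Amanda found in the page source can be reproduced in miniature. The snippet below is an invented, simplified version of such markup, not the catalog's real HTML: the title sits inside its own `<a>` tag, but the copy count is loose text whose only enclosure is the surrounding table cell, so no title-style or paragraph-style selector can single it out.

```python
# Record which tag, if any, wraps each run of text. Loose text like
# "8 copies" has no tag of its own -- only the surrounding <td> --
# which is exactly why a CSS selector can't target it.
from html.parser import HTMLParser

class TagTracker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.found = []     # (enclosing_tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            enclosing = self.stack[-1] if self.stack else None
            self.found.append((enclosing, text))

page = '<td><a href="/title/123">The Christmas Wedding</a>8 copies</td>'
tracker = TagTracker()
tracker.feed(page)
print(tracker.found)
```

The title's enclosing tag is the link (`a`), something a scraper can select; the copy count's is just the generic table cell (`td`), shared with everything else in that cell.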
So depending on what it's looking for, it might not find that. This is also something that's changing in the wider world of websites: people are starting to care more about what that looks like in the background. But that also takes time, money, and a quarter past forever. So there is that. And that is why we get this. This is why it looks like you're going to have a lot of missing information: a scraper is not going to be able to piece together what is actually available on your website. It's only going to be able to piece together what you gave it to find. So I'm going to close this. That is a case of why it wouldn't work, and kind of the background of why it didn't work. But again, to use it, you don't really need to know any of that; it's just kind of a fun fact to have. So let me close these. I'm not going to save this; I don't need it. And I'll close this out. And I'm going to go back into Mozilla, which has my slides up here. So I did say that I would give you some different examples about how you could actually use this specifically in the library. My use case of pulling business information and helping libraries figure out the organizations that are in their area; I have a sneaking suspicion that most libraries are probably not going to do that. Um, who knows? But you can also use it to generate marketing and mailing lists. Like, if you wanted to target businesses in a specific demographic, if you had just opened up a makerspace and you wanted to find different marketing and graphic design companies that you might be able to reach out to, that might be able to come in and help with a session.
Or if you wanted to reach out to different coffee shops in the area to find out if they want to do customized laser-cut coasters or customized laser-cut whatever, you can pull the information from the Better Business Bureau or the Yellow Pages and just generate yourself a little marketing mailing list so that you can reach out to these companies. That's basically the equivalent of doing a massive copy and paste; it's just a lot faster and won't make you go crazy. And the example that I used in here was a quicker way to pull different makerspace activities from a website. That was what I demonstrated with the micro:bit thing. You can also merge data from multiple sources; that's actually what I did to get that business information, which came from about four different spots. Some of the columns in there actually had to come from outside resources, and it was a whole thing. But you can do it; the web scraper just makes it easier. One way that libraries have been using this is to grab information from WorldCat so that it's easier and cheaper to build their own catalog. I don't actually know that that's a recommended practice, but yeah, there have been some libraries that have been doing it. And, if a patron comes in with a reference question and they want to know different events or activities that are coming up around the area, and you have different websites that each have different little bits of information, instead of sending people over to 12 different websites, you can use a web scraper to grab the relevant information from each site and compile a list. Then you can give them that list and your source information, and they will love you forever. And you can also use Web Scraper.io as a makerspace activity to introduce the basics of data science.
Once you get a feel for it, or you bring in someone from the community who's more comfortable with it, you can lead a maker activity that introduces seventh to eighth grade students, high school students, adults, people of pretty much all ages to how to use the thing, and then send them over to different data science resources. You can also use it to collect information about different demographics, what they're interested in, what they like to do, based on maybe stuff from a Facebook profile or an interest group website or various other things. So again, I did say that I would give you some of the limitations of the site. Web scraper.io has a free option and a paid option, like most web scrapers of this kind do. This is not the only web scraper in the world that works this way; there's also Octoparse and others, and a lot of them have the word "parse" in the name. That's because that is what they're doing: parsing information from the web and translating it into an Excel format or something that can be used in another way. It's transforming formats, so parsing is a thing. The caveat is that the auto-parsing option, the one where I wouldn't have to manually go through and say, hey, could you split these columns, could you remove the white space so that my computer or my catalog system can read it, only comes with the paid version of web scraper.io, which will do all of that for you automatically. And when you're trying to work with really large amounts of data manually through Excel or Google Sheets, it can crash your computer. Yes, I learned that the hard way. In my case it happened in Google Sheets because, I mean, it's web based and requires a connection to the web to work.
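For smaller jobs, the kind of cleanup the paid auto-parser handles can be done by hand before the data ever touches Sheets or Excel. This is a hedged sketch in plain Python, not web scraper.io's actual parser, and the column names and values are invented:

```python
import csv
import io

# Hypothetical raw scrape: stray whitespace and a combined city/state column.
raw_csv = """name,city_state
  Acme Printing ,Lincoln NE
 Bean There Coffee,Omaha NE
"""

cleaned = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    city, state = row["city_state"].rsplit(" ", 1)  # split one column into two
    cleaned.append({
        "name": row["name"].strip(),  # trim whitespace so a catalog can read it
        "city": city,
        "state": state,
    })

print(cleaned)
```

Doing this in a script rather than in a spreadsheet also sidesteps the crash problem with very large files, since nothing has to render thirty-nine thousand rows on screen.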
It was drawing too much power to that one site, and it started to slow down and throttle, and it took a quarter past forever to load about thirty-nine thousand records, which is understandable. I'm not saying it's unreasonable; it just took a minute. I let it sit there and grabbed a new cup of coffee, and it was done by the time I got back. It only crashed for a minute; it was OK. In Excel it is probably a lot easier; I just happened to be using Sheets at the time, and then I converted it over into Excel. So, is web scraping legal? I put this together as kind of a quick reference guide. Some websites actually now have anti-scraping policies, which I did not know before I started some of the projects I've been working on. I initially wanted to grab information from the Chamber of Commerce website so that I could start pulling different business information. But then I looked at Lincoln's website and found out that they have an anti-scraping policy. The reason they have the anti-scraping policy is that they also sell their lists as marketing lists. Because they want to sell it, they don't want you to scrape it; they want to be the ones to benefit from it. And a lot of the lawsuits that have come out against web scraping recently have been about people who scrape data, massage it around a little bit, and resell it to different organizations, or who collect semi-private information and sell it. In the wider world, those would be called data brokers. Data brokers actually do that for a living. They pull public information off the web, they have partnerships with different organizations, they smash it all together into a giant Excel sheet or whatever format they find friendly, and then they sell it off to other people.
They say, if you want a mailing list, you want a marketing list, you want a marketing analysis, blah, blah, blah, have I got the data for you. I don't know. They don't want the everyday web scraper to be able to grab private information and resell it, so commercial use has been the major thing in a lot of the lawsuits. We do have a question, actually, possibly going back to what you were previously talking about, before the part about it being legal or not. Someone wants to know: where do you see the anti-scraping policies? Where did you find that? So I'm going to open the site; like, how did you come across it? Let me grab the Lincoln Chamber of Commerce. And while you're doing that, I will mention to everyone that it is a little after 11 o'clock. Officially, Encompass Live goes 10 to 11 a.m. Central Time, but we did start about 15 minutes late today due to some technical issues. So we will keep going until we're done with the show, and we will answer any questions you have, so please do stick around with us. If you need to leave because you only allotted time up to 11 a.m. Central for this, that's fine; we are recording, and you can always watch the rest of the show at your convenience when the recording is available. Everyone who attended and registered for today's show will get an email from me, probably sometime tomorrow, letting you know that the recording is ready for you to watch. And I am going to do a search to make sure, because it actually is a really good question; I'm trying to backtrack to find where I found their anti-scraping policy. It was either in their terms of service or their privacy policy. That makes sense, one of those two places potentially. Yeah, not the kind of thing you'd see broadcast on their main page, because it's not something they necessarily want to bring to your attention. Right. Yeah.
So an anti-scraping policy, I imagine, is their way for pages to block you from being able to scrape them? Yeah. So there is a robots.txt file that has come into existence for that, and I'll pull this over here. A robots.txt file is a human-readable file that will show developers and users how they can scrape and interact with a site. If a website has made special rules that say you can only make a request every three seconds, or you can only pull x amount of data, then they'll put it into a robots.txt file. And I'll show you the article that I linked over to that explains how to access this. The "Web Scraping: Introduction, Best Practices and Caveats" article will show you how to access that robots.txt file. And this "Is Web Scraping Now Legal" one goes over some of the major legal cases that have come up surrounding the web scraping process and gives recommendations for how to do it the right way: once you've pulled information, whether you can repost it on your own site, and how you have to credit that information. Because regardless of whether you used a web scraper or just copied and pasted the information from the site on your own, there are still copyright laws, and some of the content on these websites is protected. It would basically be like scanning a page from a book, uploading it, and saying it's your own thing. Yeah. It's the same exact thing with a lot of these different websites. So yes, I grabbed a bunch of the activity resources from the micro:bit website from Microsoft, but I also linked back to Microsoft's website. That's kind of the librarian way: citing where you got your information. Yeah.
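Those robots.txt rules can also be checked programmatically before you scrape. Here is a small sketch using Python's standard-library `urllib.robotparser`; the rules and the example.com URLs are made up, but the parser itself is real:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, the kind you'd find at
# https://example.com/robots.txt on a real site.
robots_txt = """\
User-agent: *
Disallow: /members/
Crawl-delay: 3
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Public pages are fair game, the members area is off limits,
# and a polite scraper waits 3 seconds between requests.
print(parser.can_fetch("*", "https://example.com/events/"))   # True
print(parser.can_fetch("*", "https://example.com/members/"))  # False
print(parser.crawl_delay("*"))                                # 3
```

On a live site you would point `set_url()` at the real robots.txt and call `read()` instead of pasting the rules in; checking `can_fetch` first is one of the "do it the right way" practices those articles recommend.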
The way that it gets people into trouble is when they wholesale scrape this information, repost it on their site, and try to pass it off as their own. That's another way people have been getting in trouble, but people have been getting in trouble for that since the dawn of time. I mean, that's not new; it's just faster to do it now. And if you do have any questions about this process, a lot of you probably already have my contact info, but the last slide on here will give you a way to reach out to me. So I will actually share this and make sure that you can actually open it, and I'll put this into the chat just in case you want these as a reference point. And I think, Krista, you usually put this into the Encompass Live archives too, right? Yes, those are your slides. So if you do want the instructions, or you want a reference point for "is web scraping legal," or you want to reference any of these resources, you can open that up. This is just in Google Slides, and you can click it open and have your merry way with it. Yep. Let's see here. I did try to make it somewhat shorter; I think we're still right around on time. But that's about all I've got for you, and if you have any questions, you can type them in. OK, so I sent the link to the slides out to everyone, but yes, it will also be included in the archive when that goes up, so you'll have a link directly to it with everything that Amanda had in her slides. Let's see, does anybody have any questions? We're about 10 after, not bad, pretty close to our hour-long time. If anybody has any last-minute desperate questions you would like to ask Amanda, or examples you want her to show, anything you were wondering about that she didn't show yet, she's getting back over to my screen, there we go, type them into the question section.
Of course, if you want, as she mentioned earlier, to have her help you out with some specific question, you can always email her, and she can help you work out how you might want to use this and what you could do with the particular information or data that you might be looking for. If you are searching for tutorials on the web scraper.io site, I recommend going to the how-to section first and then the documentation. If you're an absolute beginner, the how-to is more friendly than the documentation. I mean, if you're a developer, you'd probably make a beeline for anything that says documentation, but it's helpful that the how-tos are more user friendly, and the video tutorials are super user friendly. Nice. We do have a question, and some thank-yous coming in from people; thank you for all the great info. What about scraping behind a login? If you are a member of an organization and have permission to log in, is scraping allowed? So, if you are scraping information that is only supposed to be available behind the login, and then you were turning around and making that information public, that would be an issue. Yeah. Scraping the information for personal use, or only for use by other people who also have access to that information, is different. Right. And she just clarified: for the benefit of the organization, not for public use. Then yes. If it is not for public use, then yes, the only thing you might run into an issue with is that if there is a login timeout and you're gathering a lot of information, the website might automatically log you out if it thinks that you're not interacting with it in the right way, and you may find you're not getting everything you think you're getting if it cuts you off in the middle. Or the web scraper might keep the session live, because the website thinks you're interacting with it correctly. Right.
And you might actually have to re-log into the site within the window that the web scraper opens to be able to give it access, depending on how your passwords are saved. So it depends. Yeah. She says, yes, thanks, makes sense. Yeah. And I guess we'll give it a second to see if anyone is typing; we can't tell if you're typing, it's not like you see a little "someone's typing" indicator, the whole thing just comes up. But while we're waiting, I will say yes, we are recording the show, and it will be available at the latest by the end of the day tomorrow, as long as GoToWebinar and YouTube cooperate with me. Everyone who attended and registered for today's show will get an email from me letting you know when the recording is ready. I don't see any new questions coming up, so I am going to pull presenter control back to my screen just quickly to show, is it there? There we go. So here's what I was looking at. I've got the link to Amanda's slides for you on our Encompass Live website. These are our upcoming shows, and the link to our archives is right here underneath all of our upcoming shows. Click there, and today's show will be at the top of the list, with a link to the YouTube video and a link to the slides. You can search our archives for any other topics you may find of interest; you can search the full show archives or just the most recent 12 months. This is our full show archives, and I'm not going to scroll all the way down, because Encompass Live premiered in January 2009, so we're talking 12-something years' worth of shows. But everything has a broadcast date, so pay attention to the date when it was originally broadcast, because things may change from the time we first broadcast a show to today. Some of our shows stand the test of time, but some of them become outdated: old information, incorrect information, links may break, etc., etc.
We do have a Facebook page for Encompass Live. If you like to use Facebook, give us a like over there. We post reminders when we have things coming up, like reminders about today's show, little introductions of presenters, and we let people know when recordings are ready. If you use Twitter or Instagram, we also post on there using the hashtag NCompLive, a little abbreviation for our show. So that will wrap it up for today's show. I don't see any new questions. Thank you, everybody, for being here today. Thank you, Amanda, for telling us how to scrape the web without going nuts. I think we'll have a lot of people trying this out in the future. And as I said, Amanda's here the last Wednesday of every month, so she'll be back again on February 23rd. We'll see what our topic will be then. Do you have any thoughts on that yet, or are you still deciding? It's a month away; you don't have to decide now. I think maybe an update on the tech kits, because I don't think we did a session with the new ones yet. Sure. Yeah, there are lots of new tech kits that people can borrow from us. That'd be great. Awesome. So look forward to that. Next week our show will be on internships: "Intentional Design: Crafting a Mutually Beneficial Internship Program in a University Archives." This is from people at the University of Nebraska Omaha; Wendy Claire and Laurie will be with us to talk about how they have handled internships in their archives, but it could be applied to any sort of internship program, of course. So if you have interns, or are thinking of having them, register for that show. I also have other shows coming up, and some of these dates in February and March are still empty, so keep your eyes on the calendar here. I will add shows as I get things finalized and get confirmations and descriptions; we're always adding new ones. So I think that will wrap it up for today, almost exactly an hour in the end, even starting a little late. Not a problem.
Thank you, everybody, for being with us today. Thank you, Amanda. We'll see you in a month, and we'll see everyone else on a future episode of Encompass Live, hopefully. Bye bye.