Welcome to web scraping with Google Chrome. Let's start with the most fundamental question: what is web scraping? In simple words, web scraping is the process of gathering information from the internet. There are a variety of ways to scrape a website to extract information for reuse; even copy-pasting the lyrics of your favourite song is a form of web scraping. But this becomes impractical if there is a large amount of data or if it is spread over a large number of pages. Instead, there are specialized tools and techniques that can be used to automate the process. Some websites don't like it when automated scrapers gather their data, while others don't mind. If you are scraping a page for educational purposes, you're unlikely to have any problems. Still, it's a good idea to do some research of your own and make sure you're not violating any terms of service before you start a large-scale project. So that's what web scraping is about: using tools to fetch information from the internet.

Now, why would you want to scrape the web? The primary objective is obvious: to save time and effort, and to make tasks possible that would be impossible to do manually. Think about the lyrics example I mentioned before. Maybe you want all the song lyrics from a specific album and you don't want to keep clicking around and copy-pasting. A way to do that would be to automate it and pull all the information with a tool. In this workshop we are going to collect news information from the Manchester City Council website. Like many other news platforms, it consists of many little cards that contain the title, the date and a description of the news item, where the title is a hyperlink to a detailed page. In this case there are only ten or twenty entries, but once you have learned how to do it, you can apply the same approach to large-scale projects with millions of entries, which is where it becomes really helpful.

The web has grown organically out of many sources. It combines a ton of different technologies, styles and personalities, and it continues to grow to this day. In other words, the web is kind of a hot mess. This leads to a few challenges that you will see when you try web scraping. The first challenge is variety. Every website is different. While you will encounter general structures that tend to repeat themselves, each website is unique and needs its own personal treatment if you want to extract the information that's relevant to you. Think of the web as containing a bunch of websites, and think of each of them as a little snowflake: each page has its own structure that is unique to that specific page, and it might differ in size and layout from any other. You can't study one page and assume you can carry the same structure over to a different page. If you write a scraper for one page, it's not going to work for a different page, because every page is special. You have to get to know each page individually and write a scraper for that specific page in order to extract information from it. Practically, for us, this means that if you look at the Manchester City Council news board and come up with a function to scrape information from it, you will not be able to apply that same function to the BBC News website, because every page is unique and has its own structure.
This is one of the challenges of web scraping, which we sum up under the term variety. Let's look at another challenge: the durability of your scraper. Say you build a really nice scraper that does exactly what you want for the news board, then some time passes, and suddenly it doesn't work anymore. That's a very common scenario when you build a scraper for any site out there. The reason is that websites change: their structure doesn't stay the same, because people keep working on them, evolving them, changing something here and there. If the website your scraper is customized to changes, your scraper is going to break. So be aware that with web scraping the work never really ends, because the work on the website never really ends. If there is a change on the website, you have to adapt your scraper accordingly. Usually, if you already have a working scraper, the website doesn't change completely and it will be a small fix, but you have to stay vigilant to make sure your scraper keeps working.

To sum up, the two main challenges of web scraping are variety, which means every website is different, so you have to customize your scraper to the individual structure of each website; and durability, which means that websites change over time, so your scraper will break and you will have to keep it up to date.

That being said, there are already loads of resources and tools that help people do web scraping. I classify them into three categories. The first, perhaps the most common method for computational scientists who are familiar with programming, is to use ready-made web scraping packages; Scrapy and Beautiful Soup are perhaps the two most popular. This is the most flexible and direct way, and perhaps the most powerful, as you can adjust the code and the output to meet your needs. The downside is also very clear: you need to know basic programming as well as the formats websites are built with or deliver their data in, usually HTML or JSON, so that you can locate the thing you need in the website's source code. That is not easy for beginners; in fact, many books and tutorials are available to teach people how to do it.

The second is to use APIs, which is also usually programming in nature. API stands for application programming interface. The difference between this method and pure programming is that the API can save you some of the extraction time and effort. It's not available for all websites, but usually for large platforms like Facebook and Twitter, and you often need to state your purposes when applying for API access. It's like the waiter in a restaurant: the customer asks the waiter for some food, the waiter brings it from the kitchen, and the customer doesn't know anything about what happens in the kitchen, like how the chef prepares and presents the dishes.

The third kind is to use dedicated software or apps. Thanks to the huge demand from the market, there are many web scraping apps available now that are friendly to non-programmers. Here I list three: ParseHub, Dexi and Octoparse. The tool that we use today, Web Scraper, can be regarded as part of this category. It's not a standalone app, though; it's a Chrome extension, so it's essentially an embedded part of the Chrome browser.
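To give a concrete flavour of that first, code-based category, here is a minimal sketch using requests and Beautiful Soup. It is not part of the workshop itself, and both the URL and the CSS selectors in it are made-up placeholders: a real page would need its own selectors, which is exactly the variety problem described above.

```python
# A minimal sketch of the "ready-made packages" approach using requests + Beautiful Soup.
# The URL and the CSS selectors ("div.news-card", "h3 a", "span.date") are hypothetical
# placeholders; you would inspect the real page and adjust them.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.org/news"  # stand-in for the news stories page used in the workshop

response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Every site needs its own selectors: this is the "variety" challenge in code form.
for card in soup.select("div.news-card"):      # hypothetical card container
    title = card.select_one("h3 a")            # hypothetical title link
    date = card.select_one("span.date")        # hypothetical date element
    print(
        title.get_text(strip=True) if title else "",
        date.get_text(strip=True) if date else "",
        sep=" | ",
    )
```

If the site changed its markup, those selector strings would be the first thing to break, which is the durability problem in miniature.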
Before jumping to the practice part, I'd like to introduce some resources that I found useful when using Web Scraper. The first link is the video tutorials on how to perform basic tasks with Web Scraper; some of that content overlaps with what I will talk about today, so if you want to review this session in the future you can check that website. Of course, you are also welcome to check the KDS YouTube channel, where the recordings of all webinars and workshops are available. The second link is the Web Scraper documentation, where you can find definitions and examples of the functions and parameters. The third link is the public discussion forum: if you have a problem using Web Scraper, which is quite common, you can post your questions there, or find similar posts and solutions.

All right, so without further ado, let's start our practice session. In this tutorial you will build a web scraper that fetches the news catalogue from the news stories page on the Manchester City Council site. I will send the link later, so don't worry. Your scraper will pick out the relevant pieces of information and store them in CSV files. You can scrape any site on the internet that you can look at, but the difficulty of doing so depends on the site. This tutorial offers you an introduction to web scraping to help you understand the overall process; you can then apply the same process to almost any website you want to scrape. As you may already know from the email we sent before, all these tasks are going to take place in the Chrome browser, so hopefully you already have that on your laptop or your PC. Let's start. I will share my screen. Can you see my Chrome browser? Yes, okay, cool. Thank you very much.

First of all, we need to have Web Scraper installed in Chrome. You should be able to see the link in our chat box. If you open the link, you will see the official page of Web Scraper. If you click Learn, you can see the resources I just shared. What we are going to do is click Install, which will lead you to a new page in the Chrome Web Store. If you haven't installed Web Scraper in your Chrome yet, you can just click Add to Chrome here. If you have already done that, ignore this step, because you will see exactly the same as mine: the button says Remove from Chrome, and you shouldn't click that.

I assume you have all installed it. Before we really start, let's check that it's turned on. If you go to More Tools and then Extensions, you can see all the extensions you have; please check that Web Scraper is on. If not, just click the toggle and it will be on. This is essential for the following tasks. After that, come to the Manchester City Council website and their news stories page: this is the web page we are going to scrape, and I will send the link in the chat. Once you have this page open, go again to Settings, More Tools and Developer Tools, and you will see this very magical panel. If you go to the last tab, Web Scraper, you will see the Web Scraper interface. I have some projects here already, but for you, if it's a new installation, it should be empty.
Alternatively, you can just press F12 on your keyboard and it should also open this panel. All right, I assume everything's fine. The first task is the simplest one: scraping one feature from this website. Let's say the titles. To do that, go to Create New Sitemap and click Create Sitemap, and name it anything you want; let's call it a-task-one. We put the URL of this website here and then click Create Sitemap. It's very intuitive. That brings us to this empty board.

If you're following along, click Add new selector. This tells Web Scraper that we want to scrape something. Because it's a simple task, let's just do it for the title. For the ID we can put "title" or anything you want, and we choose the type Text, because it's a piece of text. The important part is the selector itself: click Select and click on one of the titles. Because there are multiple titles, click a second title as well, and once you do, you can see that all the titles on the page are selected automatically, which is beautiful. Don't forget to click Done selecting, and that's it. We can look at the Element preview, where we can see everything we selected on the website, and also the Data preview. It's not working yet because we need to tick Multiple, since we want to select multiple titles on this page. Tick that, check the Data preview again, and yes, it's working: all the titles on this web page are picked up. You can ignore the regex field and the parent selectors for now. Then we save the selector, and that's it. You can still check the Element preview or Data preview here, but we've already done that, so it's fine.

To wrap up, choose Sitemap and then Scrape. For this one, because it's not loading a new page, we can just click Start scraping directly, and you can see it running. Then let's click Refresh, and here we get everything we want: a web-scraper-order column and the URL, which we can ignore, and the most important part, the titles we wanted to collect. If you're satisfied with this, choose Sitemap, Export data as CSV, and click Download now. If you open the file, you can see it's all saved in a CSV table. That's it; that's how to scrape one feature.

The next task is to scrape two features. To do that, we can continue with the sitemap we just created, so I will delete the selectors we had before. This time I'll start with Add new selector and give it the ID "wrapper", because eventually I want to have the title and the date together in one row, so I need a kind of wrapper to hold them together and show them in one row. For this wrapper I will use the type Element, because it's not a piece of text, an image or a link; it's an element. Then we go to Select, and what we actually want to select is this whole card, but that's not so easy to click directly. To do that, we can use these P and C buttons: P means parent and C means child, and they let you widen or narrow the current selection to exactly the part you want. So let's click P.
Okay, we got the card. We also want all the other cards, so we click on a second one and then we have all of them selected. Like last time, you need to be careful to click on the right thing so that they all get selected. Again, don't forget to click Done selecting. This is also a multiple selection, so we tick Multiple, and we can check the Element preview; everything looks good. We leave it under the root and save the selector. That's it for the wrapper: we haven't defined any actual scraping in it, we've only defined the element.

Next we click into this wrapper and add a new selector. This time we do exactly the same as we did with the title before: we put "title" as the ID, choose Text, and select the title. But this time, because within one card there is only one title, we just click on it and click Done selecting. We don't tick Multiple, because within this card, within this element, there is only a single title. It is automatically placed under the wrapper, because we created it inside the wrapper, so we leave it there and save the selector. Then we add another new selector for the date: the ID is "date", it's also a piece of text, so the type is Text, we select the date, and again it's a single item within this wrapper, so we leave Multiple unticked and save the selector.

That's basically it for scraping multiple features: the essential part is that we build a wrapper so that we can keep everything together. If we go back to the root and click the Data preview on the wrapper, we can see we have what we want: the title and the date are in the same row, and we get all the pairs on this website. You can also go to Sitemap and click Selector graph, which gives you a visualization, a flow chart, of what's happening inside this scraper; you can see that the title and date are wrapped together. Once you've done that, go to Scrape again and start scraping. When it's finished, refresh, and we see what we want.

An important tip for this task: do not create the two selectors directly on the root. You can try it yourself later, and you will see that the title and the date are then not put in the same row but in different rows, with a lot of blanks in the CSV table. That's because you would essentially be telling the scraper that the first task is to select all the titles and the second task is to select all the dates, and it performs them independently, so they are not presented in the same row. So that's how to scrape two features together; if you want to scrape more features, you just add more selectors inside the wrapper.

All right, let's jump to the next task, which is the complete one. It's still this page, but this time we want to scrape all the useful information: the title, the date, the description and, if possible, the image. So it's the complete version and also the more difficult one. Again we come back to the sitemap we just created, and again we use this wrapper, because we want to collect the information from those cards. We already have the title and date in there.
So all we need to add is the description, the image and also the link, because you can see that each title leads to a detailed page, and we want to collect the URL of that detailed page. Let's start with the description. We create a new selector and call it "description"; again, the type is Text. We select this piece of text, choose Done selecting, leave Multiple unticked because each card only has one piece of description, and save the selector. Just like title and date, it's very simple.

Next we want to save the link behind each title. We create a new selector and call it "link". This time, mind that because it's a link, we choose the type Link. The reason I don't choose Popup link is that if you click this link, the page loads in the same tab rather than popping up a new window; if it opened a new window, you would choose Popup link instead. It really depends on your scenario. We click the link, which is also the title, and choose Done selecting. Again, it's a single item, so we don't need to tick Multiple. If you're not sure, you can check the Data preview. And voilà, we have the link text, which is exactly the same as the title, so later you could drop the title selector if you want to avoid repetition. And then the important part: the href, which is the URL of each piece of news. We keep it in the wrapper and save the selector.

Last but not least, we can see that some news cards have an image attached, and we want that as well. We click Add new selector, call it "image" and select the type Image, of course. Now we have an issue: this particular sample card we've been working with doesn't have an image we can click on, so it looks impossible to select one, even though other cards do have images. Dealing with this requires a little web page design knowledge and is totally optional. What I do in this situation is right-click on one of the images and choose Inspect, which locates the image in the source code of the page. I can see the source code says img, with a src that is the link to the file, plus some other attributes. If I inspect another image, I see something similar. That tells me all these images are under the same tag, img, just with different sources. So I go back to Web Scraper and, for this selector, I type img directly, because that is the path to the element I want this selector to match. If we check the Data preview, we can see that most entries get an image while some are empty, and later we'll find out whether it works. Again we don't tick Multiple, because there is at most one image per card, and we save the selector. And that's it: we have the title, date, description and image, and also the link on the title.
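As an aside, for anyone who later wants to reproduce this wrapper pattern in code rather than in the extension, here is roughly what the sitemap now expresses, written as a Python and Beautiful Soup sketch. The URL and the selectors (article.news-card and so on) are invented for illustration; the real class names would come from inspecting the page, just as we inspected the img tag above.

```python
# A sketch of the wrapper idea in code: iterate over each news card (the wrapper) and pull
# several fields from inside it, so title, date, description, link and image stay in one row.
# The URL and all selectors are assumptions for illustration only.
import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.example.org/news"  # stand-in for the news stories page
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

rows = []
for card in soup.select("article.news-card"):        # hypothetical wrapper element
    title_link = card.select_one("h3 a")              # title text and href live in one element
    rows.append({
        "title": title_link.get_text(strip=True) if title_link else "",
        "link": urljoin(url, title_link.get("href", "")) if title_link else "",
        "date": card.select_one(".date").get_text(strip=True) if card.select_one(".date") else "",
        "description": card.select_one("p").get_text(strip=True) if card.select_one("p") else "",
        # like the extension, we only capture the image URL (the src attribute), not the file itself
        "image": urljoin(url, card.img.get("src", "")) if card.img else "",
    })

# Write the rows out, mirroring the extension's "Export data as CSV".
with open("news.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link", "date", "description", "image"])
    writer.writeheader()
    writer.writerows(rows)
```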
We can go back to the wrapper, or just click on the root, and select the Data preview, and we can see we have everything we want. Let me see whether I can drag it to the right. Yes, the image column is at the end, though it's still not showing anything there. Let's just scrape it and start scraping. I have to say the Data preview option is not always that reliable, and it doesn't hurt at all to just run the scrape as many times as you want. Here is the result of our scrape, and we can see we have the title, the date, the description, the link and the link URL, and also the image source. For the entries that have an image, we get a link to the image. That is a downside of Web Scraper: you cannot get the image file directly, you only get a link to it. If you paste that link into your browser, it will load the image for you. On the Web Scraper website they point to some Python packages that show you how to download the images from those source links; you're welcome to try that, but personally I don't find them that helpful, because it requires you to have Python installed and to do it within Python.

If you are satisfied with that, we can just export the data as CSV: click Export data as CSV and then Download now, and again it will download the CSV table for you, with all the information stored in a nicely structured way. So that is how to scrape all the information we want; in essence it's the same as scraping multiple features. If we go to the Selector graph, you can see that this time there are simply more selectors inside the wrapper, one for each thing we want.

Now let's move on to pagination. There are a variety of ways websites load more data, depending on their design, and it's very common for them to have some form of it. These include simple numbered pagination (pages one, two, three), scrolling down to load more, or clicking a More or Next button to load more. You should also notice that sometimes, even though more data is loading, the URL doesn't change at all, while other times the URL does change and it's really loading another page with its own URL. What I'm going to show in this case is Next-button pagination. To do that, let me go back to Chrome. There is no pagination on the page we've been using, so let's try another one: this is the news archive of Manchester City Council, and here I've opened the August 2020 archive. We can see there's a Next page button, and it will lead us through three pages in total. I will send this link to the chat so that you can use it as well.
Yes, so if you've received it, please copy this link and open it in your Chrome, and let's start with a brief inspection of this page. If we click Next page, it leads us to the second page, and we can see that the URL changes: before it ended in 1, and now it ends in 2. That means it's actually loading another page, the second page of this August archive. There's another Next page button; just one second, it's loading a bit slowly, maybe because I'm running Zoom, and I hope you don't have that problem. Okay, great, we've got it: this is the third page, and apparently the last one, because there's no Next page button and the URL ends in 3. So this archive has three pages in total, and each of them has its own URL.

What we want to do is the same as last time: we want to scrape all the information within this archive, so everything across these three pages. To do this, I will introduce the first method. Again we open Web Scraper, either by pressing F12 or from More Tools and Developer Tools, it's exactly the same, and click the last tab, Web Scraper. Because it's a new project, let's create a new sitemap; I'll start the name with "a" again and call it a-pagination, and we put the URL here. Because we already know the pages are 1, 2 and 3, you could click the plus to add a second start URL and just change the last digit to 2, and then add another one. But alternatively, say you had a hundred of them, it's not practical to input them manually; instead you can put the page numbers in square brackets as a range, 1 to 3, in place of the last digit. That basically tells the scraper that we want to perform the same tasks on three URLs. Then you just create selectors like we did for the single-page task, which is quite straightforward, I think.

But sometimes you will find that loading more content doesn't change the URL at all, even though you click the Next page button, and to handle that we need to do some extra work. I'm going to show the method for handling that kind of situation using this website: I will demonstrate how to scrape the information from multiple pages even when starting from only one URL, this URL. So again we create a sitemap, we set the start URL as just the single URL, and we create the sitemap. Again we click Add new selector, the type is Element, it's Multiple, and then we select the wrapper we want to define, selecting all of the cards (you should be very familiar with this by now, I hope), and we choose Done selecting. If you want, you can check the Element preview and see that everything is selected, and we save it.

Then we go inside this wrapper, click on it, and add new selectors. Let's start with the title: type Text, select the title, and because there's only one title in this element, we leave Multiple unticked and save the selector. We do the same for the date, and then for the description. It seems this whole block is the description, even though it covers the image; if you're not sure, you can check the Data preview, which is very handy. We leave Multiple unticked and save the selector. Then the last ones would be the image and the link.
The link I just showed you before, so let's skip that one. Let's do the image: fortunately this card does have an image, so we can just select it directly without inspecting the source code, and we save the selector. I'll skip the link; you can do it like we did last time. If we look at the Data preview, we get what we want, but only for this one page, and that's not what we're after.

So let's create a new selector and call it "page". This time we choose the type Link, because essentially we want to use the link behind this Next page button. We select it, and we set it as Multiple, because there is actually more than one Next page link: there's one on this page, and if we follow it, the next page has another one. We click Done selecting and save the selector.

So now we have the wrapper for each news card, and we have the page link. The question is how to structure them so they work together. Think of it as a flow chart, the selector graph: the root can be regarded as the starting page. From the starting page we want the wrapper to scrape all the information on that page, and we want the page link to take us to the next page. Then, within that next page, we want the wrapper again to scrape the news cards, plus another Next page link to the following page, and so on and so forth. If you're familiar with computer science, this is what we'd call recursion: calling a function from within the function itself. We call the page selector within the page selector itself, so that the next-page step keeps repeating. With this flow chart in mind it becomes easier: we want to put the wrapper under both the root and the page, and we want to put the page under both the root and the page as well. I hope that makes sense.

With that in mind, we go back to the wrapper and click Edit, which lets you change everything you set up. As I said, we need to put it under two parent selectors, so hold Ctrl and select page as the second one, and save the selector. You can see that the parent selectors have changed. We do exactly the same for the page selector: hold Ctrl, click page, and save. If we go to the Selector graph, it shows exactly what I showed you on the slides: the page is called within the page, and the wrapper is performed each time, so it will just keep going as long as there is a Next page button. That looks reasonable and nice.

Let's scrape. Note that because we are now actually loading new pages, we should use the page load delay: I just noticed my browser is running a bit slowly, so I'll change it to 5000, which allows the scraper to wait longer, and let's put the request interval at 5000 as well. This really depends on your device. Let's start scraping. It's scraping the first page, then the second page, which looks good, and then the third page. Cool, let's refresh, and we see what we want: we got the title, date, description and image, plus two extra columns that may be unnecessary, which you can drop if you want. It's hard to count how many entries we got here, so let's export the data as CSV and download it.
Because it's a relatively small file, it downloads quickly, and we can see we got 21 rows in total, where the first one is the column names. If you count the news cards on each of the three pages, the numbers add up, so we got all of them. I hope all of this makes sense to you. That's it for how to perform web scraping across multiple pages with a Next page button, starting from a single URL.
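Finally, if you ever want to reproduce this pagination logic outside the extension, here is a rough Python sketch of both approaches from this section: generating the numbered URLs up front, which mirrors the square-bracket range, and following the Next page link until it disappears, which mirrors the self-referencing page selector. The base URL pattern and the selectors are placeholders, not the real archive markup.

```python
# Two pagination strategies sketched in Python. The base URL pattern and all selectors
# are placeholders for illustration, not the real archive markup.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.example.org/news/archive/202008"  # stand-in for the August 2020 archive

def parse_cards(soup):
    """Collect (title, date) pairs from one page, using hypothetical selectors."""
    for card in soup.select("article.news-card"):
        title = card.select_one("h3 a")
        date = card.select_one(".date")
        yield (
            title.get_text(strip=True) if title else "",
            date.get_text(strip=True) if date else "",
        )

# Method 1: the page numbers are known in advance, so build the URLs up front
# (the code equivalent of writing a [1-3] range into the start URL).
rows = []
for page in range(1, 4):
    soup = BeautifulSoup(requests.get(f"{BASE}/{page}", timeout=30).text, "html.parser")
    rows.extend(parse_cards(soup))

# Method 2: only the first URL is known, so keep following the Next page link until
# there isn't one (the same idea as putting the "page" selector under itself).
url = f"{BASE}/1"
rows_followed = []
while url:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    rows_followed.extend(parse_cards(soup))
    next_link = soup.select_one("a.next-page")        # hypothetical Next page link
    url = urljoin(url, next_link["href"]) if next_link else None

print(len(rows), len(rows_followed))
```

Either way, be polite to the server: pausing between requests with time.sleep plays the same role as the page load delay and request interval we set in the extension.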